SimHash

This script is a simple implementation of the SimHash algorithm.

Quick Start

usage: text_dedup.simhash [-h] --path PATH [--name NAME] [--data_dir DATA_DIR] [--data_files DATA_FILES]
                       [--split SPLIT] [--cache_dir CACHE_DIR] [--revision REVISION]
                       [--use_auth_token | --no-use_auth_token] [--local | --no-local] --output OUTPUT
                       [--debug | --no-debug] --column COLUMN [--batch_size BATCH_SIZE] [--ngram NGRAM]
                       [--f F] [--bit_diff BIT_DIFF] [--num_bucket NUM_BUCKET]

Deduplicate text using simhash

options:
-h, --help

show this help message and exit

--path PATH

path in load_dataset

--name NAME

name in load_dataset

--data_dir DATA_DIR

data_dir in load_dataset

--data_files DATA_FILES

data_files in load_dataset

--split SPLIT

split in load_dataset

--cache_dir CACHE_DIR

cache_dir in load_dataset

--revision REVISION

revision in load_dataset

--use_auth_token, --no-use_auth_token

use_auth_token in load_dataset

--local, --no-local

Use local dataset (default: False)

--output OUTPUT

Path to deduplicated dataset output

--debug, --no-debug

Whether to run in debug mode (default: False)

--column COLUMN

Text column to use for deduplication. Concatenate desired columns beforehand if needed.

--batch_size BATCH_SIZE

Batch size to use for dataset iteration. Mainly for memory efficiency.

--ngram NGRAM

Ngram size to use in SimHash.

--f F

Simhash bit size

--bit_diff BIT_DIFF

Bit difference to use in SimHash

--num_bucket NUM_BUCKET

Number of buckets to use in SimHash, must be larger than bit_diff
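
The requirement that num_bucket be larger than bit_diff can be read as a pigeonhole argument: if two fingerprints differ in at most bit_diff bits and each is split into num_bucket blocks, at least num_bucket - bit_diff blocks must be identical, so near-duplicates always share at least one block. The sketch below illustrates this with plain integers. The helper names and the values f=64, bit_diff=3, num_bucket=4 are assumptions chosen for illustration; they are not taken from text_dedup.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def split_blocks(x: int, f: int, num_bucket: int) -> list[int]:
    """Split an f-bit fingerprint into num_bucket contiguous blocks (hypothetical helper)."""
    # Spread the f bits as evenly as possible; the first f % num_bucket blocks get one extra bit.
    widths = [f // num_bucket + (1 if i < f % num_bucket else 0) for i in range(num_bucket)]
    blocks = []
    shift = f
    for width in widths:
        shift -= width
        blocks.append((x >> shift) & ((1 << width) - 1))
    return blocks


f, bit_diff, num_bucket = 64, 3, 4  # illustrative values only
a = 0x0123456789ABCDEF
b = a ^ (1 << 2) ^ (1 << 20) ^ (1 << 40)  # flip three bits, each in a different 16-bit block

# Count the blocks that are still identical after the three flips.
matching = sum(
    x == y for x, y in zip(split_blocks(a, f, num_bucket), split_blocks(b, f, num_bucket))
)
print(hamming_distance(a, b), matching)
```

Here the three flipped bits touch three of the four blocks, which is the worst case, and one block still matches. With num_bucket equal to or smaller than bit_diff, every block could differ and the pair would not share any block.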

Example

python -m text_dedup.simhash \
--path "oscar-corpus/OSCAR-2201" \
--name "gl" \
--split "train" \
--cache_dir "./cache" \
--output "output/simhash/oscar_gl_dedup" \
--column "text" \
--batch_size 10000 \
--use_auth_token

API Reference

class text_dedup.simhash.Permutation(f: int, k: int, b: int, masks: list[tuple[bitarray, int, int, int]])
permute(x: bitarray) bitarray

Permute the fingerprint.

Parameters

x: bitarray

The fingerprint to be permuted

Returns

bitarray

The permuted fingerprint

reverse(x: int) int

Reverse the permutation.

Parameters

x: int

The fingerprint to be reversed

Returns

int

The reversed fingerprint
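
To make the permute/reverse pair concrete, the following is a minimal sketch of one invertible bit permutation: a rotation that brings a chosen block to the most significant position. It works on plain integers rather than bitarray, the function names are hypothetical, and it is not the implementation used by Permutation, which is configured through the masks argument shown in the signature above.

```python
def rotate_left(x: int, shift: int, f: int) -> int:
    """Rotate an f-bit integer left by `shift` bits."""
    shift %= f
    mask = (1 << f) - 1
    return ((x << shift) | (x >> (f - shift))) & mask


def rotate_right(x: int, shift: int, f: int) -> int:
    """Inverse of rotate_left for the same shift and width."""
    return rotate_left(x, f - (shift % f), f)


f = 16                 # illustrative fingerprint width
x = 0xABCD             # four 4-bit blocks: A, B, C, D
start = 8              # the block C starts 8 bits from the top

permuted = rotate_left(x, start, f)        # block C is now the leading block
restored = rotate_right(permuted, start, f)
print(hex(permuted), hex(restored))
```

After such a permutation, fingerprints that share the chosen block share a common prefix, so they can be grouped by their leading bits; the reverse step recovers the original fingerprint.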

text_dedup.simhash.compute(hashes: list[bitarray]) bitarray

Compute the Simhash of a list of hashes.

Note: porting this function to Cython did not improve performance, and experiments by others with numpy types and operators did not help either.

Parameters

hashes: list[bitarray]

The list of hashes.

Returns

bitarray

The Simhash of the list of hashes.

Examples

>>> from bitarray.util import int2ba, ba2int
>>> res = compute([int2ba(13352372148217134600, length=64), int2ba(5020219685658847592, length=64)])
>>> ba2int(res)
74633958390507528
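
As an illustration of the idea behind compute, the sketch below combines hashes by a per-bit majority vote using plain integers instead of bitarray. The function name is hypothetical, and the tie-breaking rule (a bit is set only when strictly more than half of the hashes have it set) is an assumption rather than a documented property of the library.

```python
def majority_vote(hashes: list[int], f: int) -> int:
    """Combine f-bit hashes into one fingerprint by voting on each bit (hypothetical helper)."""
    result = 0
    for bit in range(f):
        ones = sum((h >> bit) & 1 for h in hashes)
        # Assumption: ties resolve to 0, so a strict majority is required.
        if 2 * ones > len(hashes):
            result |= 1 << bit
    return result


# Three 4-bit hashes; each of the top three bits is set in two of them.
fingerprint = majority_vote([0b1100, 0b1010, 0b0110], f=4)
print(bin(fingerprint))
```

Because each output bit depends only on the majority, changing a small fraction of the input hashes changes only a few bits of the fingerprint, which is what makes the Hamming distance between fingerprints a useful similarity signal.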

text_dedup.simhash.embed_func(content: str, idx: int, *, ngram: int, permutations: list[Permutation], hash_func: Callable) dict[str, Any]

Calculate the simhash signature of a text.

Parameters

content: str

The text to be hashed.

idx: int

The index of the text.

ngram: int

The ngram size.

hash_func: Callable

The hash function to use.

Returns

dict[str, Any]

The simhash signature and the index of the text as a dictionary.

Examples

>>> res = embed_func("hello world", 0, ngram=3, permutations=None, hash_func=xxh3_64_digest)
>>> res[INDEX_COLUMN]
0
>>> len(res[SIGNATURE_COLUMN])
8
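
The overall flow of turning a text into a signature can be sketched end to end as follows: split the text into n-grams, hash each n-gram, and combine the hashes by a per-bit majority vote. This is a self-contained approximation, not the library's code. It assumes whitespace tokenization with word n-grams, uses hashlib.blake2b as a stand-in for xxh3_64_digest, skips the permutation step entirely, and all function names are hypothetical.

```python
import hashlib


def word_ngrams(text: str, n: int) -> list[str]:
    """Word n-grams over whitespace tokens (assumed tokenization)."""
    tokens = text.split()
    if len(tokens) <= n:
        # Too few tokens for a full n-gram: fall back to the whole text as one unit.
        return [" ".join(tokens)]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sketch_signature(text: str, ngram: int, f: int = 64) -> bytes:
    """Return an f-bit fingerprint of the text as f // 8 bytes."""
    hashes = [
        int.from_bytes(
            hashlib.blake2b(gram.encode("utf-8"), digest_size=f // 8).digest(), "big"
        )
        for gram in word_ngrams(text, ngram)
    ]
    fingerprint = 0
    for bit in range(f):
        ones = sum((h >> bit) & 1 for h in hashes)
        if 2 * ones > len(hashes):  # assumption: ties resolve to 0
            fingerprint |= 1 << bit
    return fingerprint.to_bytes(f // 8, "big")


signature = sketch_signature("hello world", ngram=3)
print(len(signature))
```

With f=64 the sketch yields an 8-byte value; any other hash function that produces f // 8 bytes could be substituted for the blake2b call.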