SimHash¶

This script is a simple implementation of the SimHash algorithm.

Quick Start¶

usage: text_dedup.simhash [-h] --path PATH [--name NAME] [--data_dir DATA_DIR] [--data_files DATA_FILES]
                       [--split SPLIT] [--cache_dir CACHE_DIR] [--revision REVISION]
                       [--use_auth_token | --no-use_auth_token] [--local | --no-local] --output OUTPUT
                       [--debug | --no-debug] --column COLUMN [--batch_size BATCH_SIZE] [--ngram NGRAM]
                       [--f F] [--bit_diff BIT_DIFF] [--num_bucket NUM_BUCKET]

Deduplicate text using simhash

options:

-h, --help: show this help message and exit
--path PATH: path in load_dataset
--name NAME: name in load_dataset
--data_dir DATA_DIR: data_dir in load_dataset
--data_files DATA_FILES: data_files in load_dataset
--split SPLIT: split in load_dataset
--cache_dir CACHE_DIR: cache_dir in load_dataset
--revision REVISION: revision in load_dataset
--use_auth_token, --no-use_auth_token: use_auth_token in load_dataset
--local, --no-local: Use local dataset (default: False)
--output OUTPUT: Path to deduplicated dataset output
--debug, --no-debug: Whether to run in debug mode (default: False)
--column COLUMN: Text column to use for deduplication. Concatenate desired columns beforehand if needed.
--batch_size BATCH_SIZE: Batch size to use for dataset iteration. Mainly for memory efficiency.
--ngram NGRAM: Ngram size to use in SimHash.
--f F: Simhash bit size
--bit_diff BIT_DIFF: Maximum number of differing bits (Hamming distance) between two fingerprints for them to be treated as duplicates
--num_bucket NUM_BUCKET: Number of buckets to use in SimHash, must be larger than bit_diff
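
The constraint that num_bucket must be larger than bit_diff follows from a pigeonhole argument: if two fingerprints differ in at most bit_diff bits and each is split into num_bucket blocks, at least num_bucket - bit_diff blocks must be identical, so near-duplicates always share at least one block. The following is a minimal sketch of that argument using plain Python integers. The helper names `hamming` and `split_blocks` are hypothetical and are not part of text_dedup.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def split_blocks(x: int, f: int, num_bucket: int) -> list[int]:
    """Split an f-bit fingerprint into num_bucket contiguous blocks.

    Blocks are taken from the most significant bits down, and their
    widths differ by at most one bit (an assumption of this sketch).
    """
    base, extra = divmod(f, num_bucket)
    blocks = []
    shift = f
    for i in range(num_bucket):
        width = base + (1 if i < extra else 0)
        shift -= width
        blocks.append((x >> shift) & ((1 << width) - 1))
    return blocks


# Two 16-bit fingerprints that differ in exactly 2 bits.
a = 0xB2C7
b = a ^ 0x0401

blocks_a = split_blocks(a, f=16, num_bucket=4)
blocks_b = split_blocks(b, f=16, num_bucket=4)
shared = sum(x == y for x, y in zip(blocks_a, blocks_b))
```

Here `shared` is at least `4 - hamming(a, b)`, which is why a candidate search only needs to compare fingerprints that agree on some block.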

Example¶

python -m text_dedup.simhash \
--path "oscar-corpus/OSCAR-2201" \
--name "gl" \
--split "train" \
--cache_dir "./cache" \
--output "output/simhash/oscar_gl_dedup" \
--column "text" \
--batch_size 10000 \
--use_auth_token

API Reference¶

class text_dedup.simhash.Permutation(f: int, k: int, b: int, masks: list[tuple[bitarray, int, int, int]])

permute(x: bitarray) → bitarray

Permute the fingerprint.

Parameters¶

x: bitarray: The fingerprint to be permuted

Returns¶

bitarray: The permuted fingerprint

reverse(x: int) → int

Reverse the permutation.

Parameters¶

x: int: The fingerprint to be reversed

Returns¶

int: The reversed fingerprint
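
The important property of permute and reverse is that they round-trip: reversing a permuted fingerprint recovers the original. The sketch below illustrates this with a simple bit rotation as a stand-in permutation. It is an assumption made for illustration and does not reproduce the mask-based logic of the Permutation class; the function names `rotate_left` and `rotate_right` are hypothetical.

```python
def rotate_left(x: int, r: int, f: int) -> int:
    """Rotate an f-bit value left by r bits (0 <= r < f)."""
    mask = (1 << f) - 1
    return ((x << r) | (x >> (f - r))) & mask


def rotate_right(x: int, r: int, f: int) -> int:
    """Inverse of rotate_left: rotate an f-bit value right by r bits."""
    mask = (1 << f) - 1
    return ((x >> r) | (x << (f - r))) & mask


fingerprint = 0b10010011
permuted = rotate_left(fingerprint, r=3, f=8)
restored = rotate_right(permuted, r=3, f=8)
```

Rotating by a block width brings a different block into the most significant bits, which is the general reason permutations are useful when grouping fingerprints by a shared prefix.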

text_dedup.simhash.compute(hashes: list[bitarray]) → bitarray

Compute the Simhash of a list of hashes.

Note: Porting this function to Cython did not improve performance, and neither did experiments with numpy types and operators.

Parameters¶

hashes: list[bitarray]: The list of hashes.

Returns¶

bitarray: The Simhash of the list of hashes.

Examples¶

>>> from bitarray.util import int2ba, ba2int
>>> res = compute([int2ba(13352372148217134600, length=64), int2ba(5020219685658847592, length=64)])
>>> ba2int(res)
74633958390507528
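
For intuition, a Simhash is commonly computed as a per-bit majority vote over the input hashes. The following is a minimal sketch of that idea using plain integers instead of bitarray objects. It is not the library's implementation; the function name `majority_vote` and the choice to resolve ties to 0 are assumptions of this sketch.

```python
def majority_vote(hashes: list[int], f: int = 64) -> int:
    """Combine f-bit hashes into one fingerprint by per-bit majority vote."""
    counts = [0] * f
    for h in hashes:
        for i in range(f):
            # Each hash votes +1 for a set bit and -1 for a cleared bit.
            counts[i] += 1 if (h >> i) & 1 else -1

    result = 0
    for i, c in enumerate(counts):
        # Assumption: a bit is set only on a strict majority; ties give 0.
        if c > 0:
            result |= 1 << i
    return result


three = majority_vote([0b1100, 0b1010, 0b1001], f=4)
two = majority_vote([0b1100, 0b1010], f=4)
```

With an even number of inputs and ties resolved to 0, only bits set in a strict majority survive, so for two inputs the result reduces to their bitwise AND.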

text_dedup.simhash.embed_func(content: str, idx: int, *, ngram: int, permutations: list[Permutation], hash_func: Callable) → dict[str, Any]

Calculate the simhash signature of a text.

Parameters¶

content: str: The text to be hashed.
idx: int: The index of the text.
ngram: int: The ngram size.
permutations: list[Permutation]: The permutations to use.
hash_func: Callable: The hash function to use.

Returns¶

Dict[str, Any]: The simhash signature and the index of the text as a dictionary.

Examples¶

>>> res = embed_func("hello world", 0, ngram=3, permutations=None, hash_func=xxh3_64_digest)
>>> res[INDEX_COLUMN]
0
>>> len(res[SIGNATURE_COLUMN])
8
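
The overall flow from text to signature can be sketched end to end: split the text into n-grams, hash each n-gram, and combine the hashes bit by bit. The sketch below makes several assumptions that are not taken from text_dedup: it uses lowercase character n-grams, it substitutes `hashlib.blake2b` for the hash function, it ignores permutations, and the names `char_ngrams` and `sketch_signature` are hypothetical.

```python
import hashlib


def char_ngrams(text: str, n: int) -> set[str]:
    """Return the set of lowercase character n-grams of text."""
    text = text.lower()
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def sketch_signature(text: str, ngram: int = 3, f: int = 64) -> bytes:
    """Return an f-bit fingerprint of text as f // 8 bytes."""
    counts = [0] * f
    for gram in char_ngrams(text, ngram):
        # Stand-in hash: blake2b truncated to f bits (an assumption).
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=f // 8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(f):
            counts[i] += 1 if (h >> i) & 1 else -1

    # Set a bit only where set bits form a strict majority.
    fingerprint = sum(1 << i for i, c in enumerate(counts) if c > 0)
    return fingerprint.to_bytes(f // 8, "big")


signature = sketch_signature("hello world", ngram=3)
```

A 64-bit fingerprint serializes to 8 bytes, which matches the signature length shown in the example above.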
