SimHash¶
This script is a simple implementation of the SimHash algorithm.
Quick Start¶
usage: text_dedup.simhash [-h] --path PATH [--name NAME] [--data_dir DATA_DIR] [--data_files DATA_FILES] [--split SPLIT] [--cache_dir CACHE_DIR] [--revision REVISION] [--use_auth_token | --no-use_auth_token] [--local | --no-local] --output OUTPUT [--debug | --no-debug] --column COLUMN [--batch_size BATCH_SIZE] [--ngram NGRAM] [--f F] [--bit_diff BIT_DIFF] [--num_bucket NUM_BUCKET]
Deduplicate text using simhash
- options:
- -h, --help
show this help message and exit
- --path PATH
path in load_dataset
- --name NAME
name in load_dataset
- --data_dir DATA_DIR
data_dir in load_dataset
- --data_files DATA_FILES
data_files in load_dataset
- --split SPLIT
split in load_dataset
- --cache_dir CACHE_DIR
cache_dir in load_dataset
- --revision REVISION
revision in load_dataset
- --use_auth_token, --no-use_auth_token
use_auth_token in load_dataset
- --local, --no-local
Use local dataset (default: False)
- --output OUTPUT
Path to deduplicated dataset output
- --debug, --no-debug
Whether to run in debug mode (default: False)
- --column COLUMN
Text column to use for deduplication. Concatenate desired columns beforehand if needed.
- --batch_size BATCH_SIZE
Batch size to use for dataset iteration. Mainly for memory efficiency.
- --ngram NGRAM
Ngram size to use in SimHash.
- --f F
Simhash bit size
- --bit_diff BIT_DIFF
Bit difference to use in SimHash
- --num_bucket NUM_BUCKET
Number of buckets to use in SimHash, must be larger than bit_diff
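The requirement that num_bucket be larger than bit_diff is consistent with the usual pigeonhole argument for block-based SimHash lookup. If an f-bit signature is cut into num_bucket blocks and two signatures differ in at most bit_diff bits, the differing bits can touch at most bit_diff blocks, so at least num_bucket - bit_diff blocks are identical. The sketch below illustrates only that argument. The helper names (hamming, split_blocks, shared_blocks) and the contiguous block layout are assumptions made for illustration, not the layout used by text_dedup.simhash.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def split_blocks(sig: int, f: int, num_bucket: int) -> list[int]:
    """Cut an f-bit signature into num_bucket contiguous blocks, lowest bits first.

    Block widths differ by at most one bit when f is not divisible by num_bucket.
    """
    base, extra = divmod(f, num_bucket)
    blocks, shift = [], 0
    for i in range(num_bucket):
        width = base + (1 if i < extra else 0)
        blocks.append((sig >> shift) & ((1 << width) - 1))
        shift += width
    return blocks


def shared_blocks(a: int, b: int, f: int, num_bucket: int) -> int:
    """Count the block positions at which two signatures agree exactly."""
    blocks_a = split_blocks(a, f, num_bucket)
    blocks_b = split_blocks(b, f, num_bucket)
    return sum(x == y for x, y in zip(blocks_a, blocks_b))


# Hypothetical toy values: a 16-bit signature, 4 blocks, 3 differing bits.
f, bit_diff, num_bucket = 16, 3, 4
a = 0b1111_0000_1010_0101
b = a ^ 0b0001_0000_0100_0010  # flip one bit in each of three different blocks

assert hamming(a, b) == bit_diff
assert shared_blocks(a, b, f, num_bucket) >= num_bucket - bit_diff
```

With num_bucket equal to or smaller than bit_diff, the same bound drops to zero, so two near-duplicates would not be guaranteed to share any block.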
Example¶
python -m text_dedup.simhash \
--path "oscar-corpus/OSCAR-2201" \
--name "gl" \
--split "train" \
--cache_dir "./cache" \
--output "output/simhash/oscar_gl_dedup" \
--column "text" \
--batch_size 10000 \
--use_auth_token
API Reference¶
- class text_dedup.simhash.Permutation(f: int, k: int, b: int, masks: list[tuple[bitarray, int, int, int]])
- text_dedup.simhash.compute(hashes: list[bitarray]) -> bitarray
Compute the Simhash of a list of hashes.
Note: porting this function to Cython did not improve performance, and experiments by others with numpy types and operators did not help either.
Parameters¶
- hashes : list[bitarray]
The list of hashes.
Returns¶
- bitarray
The Simhash of the list of hashes.
Examples¶
>>> from bitarray.util import int2ba, ba2int
>>> res = compute([int2ba(13352372148217134600, length=64), int2ba(5020219685658847592, length=64)])
>>> ba2int(res)
74633958390507528
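As an illustration of what "the Simhash of a list of hashes" means, the sketch below applies a per-bit majority vote over plain Python integers instead of bitarray objects. The function name simhash_vote is hypothetical, and the tie-breaking rule (a bit is set only on a strict majority) is an assumption. It is consistent with the documented example above, where two inputs reduce to a bitwise AND, but it is not taken from the library source.

```python
def simhash_vote(hashes: list[int], f: int = 64) -> int:
    """Combine f-bit hashes into one f-bit value by per-bit majority vote."""
    counts = [0] * f
    for h in hashes:
        for i in range(f):
            # A set bit votes +1 for position i, a cleared bit votes -1.
            counts[i] += 1 if (h >> i) & 1 else -1

    result = 0
    for i, count in enumerate(counts):
        if count > 0:  # assumed rule: ties leave the bit cleared
            result |= 1 << i
    return result


# Same inputs as the documented example for compute().
combined = simhash_vote([13352372148217134600, 5020219685658847592])
assert combined == 74633958390507528
```

Because each bit is decided independently, a few changed n-grams shift only a few vote counts, which is why similar texts tend to end up with signatures that differ in few bits.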
- text_dedup.simhash.embed_func(content: str, idx: int, *, ngram: int, permutations: list[Permutation], hash_func: Callable) -> dict[str, Any]
Calculate the simhash signature of a text.
Parameters¶
- content : str
The text to be hashed.
- idx : int
The index of the text.
- ngram : int
The ngram size.
- hash_func : Callable
The hash function to use.
Returns¶
- Dict[str, Any]
The simhash signature and the index of the text as a dictionary.
Examples¶
>>> res = embed_func("hello world", 0, ngram=3, permutations=None, hash_func=xxh3_64_digest)
>>> res[INDEX_COLUMN]
0
>>> len(res[SIGNATURE_COLUMN])
8
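To show how a text could be turned into an 8-byte signature of the kind the example above measures, here is a minimal standard-library sketch. It is not the implementation of embed_func. The name sketch_signature, the tokenizer (lowercasing and splitting on non-word characters), the handling of texts shorter than ngram tokens, and the use of hashlib.blake2b in place of xxh3_64_digest are all assumptions made so that the example runs without third-party packages.

```python
import hashlib
import re

NON_WORD = re.compile(r"\W+")  # assumed tokenizer: split on non-word characters


def sketch_signature(content: str, ngram: int = 3, f: int = 64) -> bytes:
    """Return an f-bit SimHash-style signature of `content` as f // 8 bytes."""
    tokens = [t for t in NON_WORD.split(content.lower()) if t]

    # Word n-grams; a text with at most `ngram` tokens yields a single gram.
    if len(tokens) <= ngram:
        grams = [" ".join(tokens)] if tokens else []
    else:
        grams = [" ".join(tokens[i:i + ngram]) for i in range(len(tokens) - ngram + 1)]

    # Hash every gram to f bits and accumulate per-bit votes.
    counts = [0] * f
    for gram in grams:
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=f // 8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(f):
            counts[i] += 1 if (h >> i) & 1 else -1

    # Keep the bits with a strict majority of +1 votes (assumed tie rule).
    value = sum(1 << i for i, count in enumerate(counts) if count > 0)
    return value.to_bytes(f // 8, "big")


signature = sketch_signature("hello world", ngram=3)
assert len(signature) == 8
```

With a single gram the vote is trivial, so the signature equals that gram's hash. Longer texts blend many gram hashes into one value.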