MinHash + LSH¶
This script implements a simple MinHash + LSH algorithm for finding near duplicates. This is largely based on the implementation from datasketch (MIT).
Quick Start¶
usage: text_dedup.minhash [-h] --path PATH [--name NAME] [--data_dir DATA_DIR] [--data_files DATA_FILES]
[--split SPLIT] [--cache_dir CACHE_DIR] [--revision REVISION]
[--use_auth_token | --no-use_auth_token] [--local | --no-local] --output OUTPUT
[--debug | --no-debug] --column COLUMN [--batch_size BATCH_SIZE] [--ngram NGRAM]
[--min_length MIN_LENGTH] [--seed SEED] [--num_perm NUM_PERM] [--threshold THRESHOLD]
[--b B] [--r R]
Deduplicate text using minhash
- options:
- -h, --help
show this help message and exit
- --path PATH
path in load_dataset
- --name NAME
name in load_dataset
- --data_dir DATA_DIR
data_dir in load_dataset
- --data_files DATA_FILES
data_files in load_dataset
- --split SPLIT
split in load_dataset
- --cache_dir CACHE_DIR
cache_dir in load_dataset
- --revision REVISION
revision in load_dataset
- --use_auth_token, --no-use_auth_token
use_auth_token in load_dataset
- --local, --no-local
Use local dataset (default: False)
- --output OUTPUT
Path to deduplicated dataset output
- --debug, --no-debug
Whether to run in debug mode (default: False)
- --column COLUMN
Text column to use for deduplication. Concatenate desired columns beforehand if needed.
- --batch_size BATCH_SIZE
Batch size to use for dataset iteration. Mainly for memory efficiency.
- --ngram NGRAM
Ngram size to use in MinHash.
- --min_length MIN_LENGTH
Minimum number of tokens to use in MinHash. Shorter documents will be filtered out.
- --seed SEED
Seed to use in MinHash
- --num_perm NUM_PERM
Number of permutations to use in MinHash
- --threshold THRESHOLD
Jaccard similarity threshold to use in MinHashLSH
- --b B
Number of bands
- --r R
Number of rows per band
Example¶
python -m text_dedup.minhash \
--path "oscar-corpus/OSCAR-2201" \
--name "gl" \
--split "train" \
--cache_dir "./cache" \
--output "output/minhash/oscar_gl_dedup" \
--column "text" \
--batch_size 10000 \
--use_auth_token true
Intuitions on Parameters¶
In this section, I’ll cover the following parameters based on my own experience:
ngram
num_perm
threshold
b and r
An interactive demo can be found here.
ngram, or tokenization in general, is a way of describing the document. After all, MinHash approximates the Jaccard similarity between two sets, in this case two sets of ngrams. The quality of the ngrams will affect your final results. For example, if you use character uni-grams, you will almost certainly get a lot of false positives for English text or DNA sequences, but that won't necessarily be a problem for CJK data. I usually start with word-level tri-grams and adjust from there. Strangely enough, people almost always choose odd numbers for ngram (e.g. 3, 5, 7). I don't know why, but I do it too.
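To make this concrete, here is a minimal sketch (not the script's actual tokenizer) of word-level tri-gram shingling and the exact Jaccard similarity that MinHash approximates:

def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    # Word-level n-gram shingles; the real script may tokenize differently.
    tokens = text.lower().split()
    return {tuple(tokens[i : i + n]) for i in range(max(len(tokens) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"
print(jaccard(ngrams(doc_a), ngrams(doc_b)))  # 0.4: 4 shared tri-grams out of 10 distinct ones

Swapping a single word already drops word-level tri-gram similarity to 0.4, while character uni-grams would barely notice the change; this is why the choice of ngram matters.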
num_perm is the number of permutations to use in MinHash. It is tightly related to the band and row numbers. The higher the number of permutations, the more accurate the Jaccard similarity estimate will be, but it will also be slower and use more space (the space complexity is O(num_perm)). I usually start with something like 128 or 256 and adjust from there. One thing you could try is to play with the interactive demo to see how many false positives and false negatives you are willing to tolerate, and then adjust those settings accordingly.
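To see how the estimate tightens as num_perm grows, here is a toy NumPy MinHash (a simplified sketch using the same affine-permutation idea, not this library's implementation; it reuses the ngrams() helper from the sketch above):

import zlib

import numpy as np

MODULO_PRIME = (1 << 32) - 5  # prime modulus; keeps a*h + b below 2**64, so no uint64 overflow

def minhash_signature(shingles: set[str], num_perm: int, seed: int = 42) -> np.ndarray:
    # Toy MinHash: num_perm affine "permutations" (a*h + b) % p applied to
    # 32-bit shingle hashes, keeping the minimum value per permutation.
    rng = np.random.RandomState(seed)
    a = rng.randint(1, MODULO_PRIME, size=num_perm, dtype=np.uint64)
    b = rng.randint(0, MODULO_PRIME, size=num_perm, dtype=np.uint64)
    hashes = np.array([zlib.crc32(s.encode()) for s in shingles], dtype=np.uint64)
    # (num_shingles, num_perm) matrix of permuted hashes, reduced to column-wise minima
    return ((hashes[:, None] * a + b) % MODULO_PRIME).min(axis=0)

def estimate_jaccard(sig_a: np.ndarray, sig_b: np.ndarray) -> float:
    # The fraction of matching signature slots estimates the Jaccard similarity
    return float((sig_a == sig_b).mean())

sa = {" ".join(g) for g in ngrams("the quick brown fox jumps over the lazy dog")}
sb = {" ".join(g) for g in ngrams("the quick brown fox leaps over the lazy dog")}
for num_perm in (32, 128, 512):
    est = estimate_jaccard(minhash_signature(sa, num_perm), minhash_signature(sb, num_perm))
    print(num_perm, round(est, 3))  # estimates scatter around the exact value of 0.4

The standard error of the estimate shrinks roughly as 1 / sqrt(num_perm), which is why doubling the permutations buys progressively less accuracy per unit of extra space.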
You might see something like 9000 permutations used in research papers [1], along with additional second-stage filtering (e.g. edit similarity) to reduce the false positives. If you modify the interactive demo code to select 9000 permutations, 450 bands, and 20 rows per band, you will see that you are more likely to get false positives, which necessitates the second-stage filtering. Of course, you can choose different settings when you have different priorities.
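The underlying trade-off is easy to inspect directly: a pair with true Jaccard similarity s becomes an LSH candidate with probability 1 - (1 - s^r)^b, and the midpoint of that S-curve sits roughly at (1/b)^(1/r). A small sketch (independent of the script, with two illustrative settings):

def candidate_probability(s: float, b: int, r: int) -> float:
    # Probability that at least one of the b bands (r rows each) matches
    # for a pair whose true Jaccard similarity is s.
    return 1.0 - (1.0 - s**r) ** b

for b, r in ((25, 10), (450, 20)):        # 250 and 9000 permutations respectively
    threshold = (1.0 / b) ** (1.0 / r)    # approximate midpoint of the S-curve
    probs = ", ".join(f"s={s:.1f}: {candidate_probability(s, b, r):.3f}" for s in (0.5, 0.6, 0.7, 0.8, 0.9))
    print(f"b={b}, r={r} (threshold ~{threshold:.2f}): {probs}")

Sweeping s across both settings shows how b and r control where the curve sits and how sharply it rises, which in turn determines how much work is left for any second-stage filter.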
Time and Space Complexity¶
The time complexity of MinHash is O(num_docs * doc_length * num_perm / num_proc). The space complexity is O(num_docs * num_perm).
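As a back-of-envelope check on the space side (a rough sketch; the corpus size, permutation count, and 8 bytes per hash value are assumptions, and the script's actual storage layout may differ):

num_docs = 10_000_000        # hypothetical corpus size
num_perm = 256               # permutations per document
bytes_per_hash = 8           # assumption: 64-bit hash values
signature_bytes = num_docs * num_perm * bytes_per_hash
print(f"~{signature_bytes / 1024**3:.1f} GiB of MinHash signatures")  # roughly 19 GiB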
The time complexity of LSH is O(num_docs * b). The space complexity depends on how many duplicates there are: if all documents are unique, you would expect O(num_docs * b); if all documents are duplicates, you would expect O(b).
If you add secondary filtering in the process, the time complexity will be higher.
[1] Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. 2022. arXiv:2107.06499.
API Reference¶
- text_dedup.minhash.embed_func(content: str, idx: int, *, num_perm: int, ngram_size: int, min_length: int, hashranges: list[tuple[int, int]], permutations: ndarray, hash_func: Callable, dtype: type, max_hash: uint64, modulo_prime: uint64) → dict[str, Any]
Calculate hash values for the content.
Parameters¶
- content : str
The content to be embedded.
- idx : int
The index of the content.
- num_perm : int
The number of permutations.
- ngram_size : int
The size of n-grams.
- min_length : int
The minimum length of the document in terms of tokens.
- hashranges : List[Tuple[int, int]]
The ranges of hash values.
- permutations : np.ndarray
The permutations for the minhash.
- hash_func : Callable
The hash function to use.
Returns¶
- Dict[str, Any]
The hash values in each range and the index.
Examples¶
>>> content = "hello world"
>>> idx = 0
>>> num_perm = 250
>>> ngram_size = 1
>>> hashranges = [(i, i + 25) for i in range(0, 250, 25)]
>>> max_hash = np.uint32((1 << 32) - 1)
>>> modulo_prime = np.uint32((1 << 32) - 5)
>>> PERMUTATIONS = (RNG.randint(1, modulo_prime, size=num_perm), RNG.randint(0, modulo_prime, size=num_perm))
>>> res = embed_func(
...     content,
...     idx,
...     num_perm=num_perm,
...     ngram_size=ngram_size,
...     min_length=0,
...     hashranges=hashranges,
...     permutations=PERMUTATIONS,
...     hash_func=xxh3_32hash,
...     dtype=np.uint32,
...     max_hash=max_hash,
...     modulo_prime=modulo_prime,
... )
>>> len(res[SIGNATURE_COLUMN])
10
>>> res[INDEX_COLUMN]
0