MinHash + LSH

This script implements a simple MinHash + LSH algorithm for finding near-duplicates. It is largely based on the implementation from datasketch (MIT licensed).

Quick Start

usage: text_dedup.minhash [-h] --path PATH [--name NAME] [--data_dir DATA_DIR] [--data_files DATA_FILES]
                       [--split SPLIT] [--cache_dir CACHE_DIR] [--revision REVISION]
                       [--use_auth_token | --no-use_auth_token] [--local | --no-local] --output OUTPUT
                       [--debug | --no-debug] --column COLUMN [--batch_size BATCH_SIZE] [--ngram NGRAM]
                       [--min_length MIN_LENGTH] [--seed SEED] [--num_perm NUM_PERM] [--threshold THRESHOLD]
                       [--b B] [--r R]

Deduplicate text using minhash

options:
-h, --help

show this help message and exit

--path PATH

path in load_dataset

--name NAME

name in load_dataset

--data_dir DATA_DIR

data_dir in load_dataset

--data_files DATA_FILES

data_files in load_dataset

--split SPLIT

split in load_dataset

--cache_dir CACHE_DIR

cache_dir in load_dataset

--revision REVISION

revision in load_dataset

--use_auth_token, --no-use_auth_token

use_auth_token in load_dataset

--local, --no-local

Use local dataset (default: False)

--output OUTPUT

Path to deduplicated dataset output

--debug, --no-debug

Whether to run in debug mode (default: False)

--column COLUMN

Text column to use for deduplication. Concatenate desired columns beforehand if needed (see the sketch after this option list).

--batch_size BATCH_SIZE

Batch size to use for dataset iteration. Mainly for memory efficiency.

--ngram NGRAM

Ngram size to use in MinHash.

--min_length MIN_LENGTH

Minimum number of tokens to use in MinHash. Shorter documents will be filtered out.

--seed SEED

Seed to use in MinHash

--num_perm NUM_PERM

Number of permutations to use in MinHash

--threshold THRESHOLD

Jaccard similarity threshold to use in MinHashLSH

--b B

Number of bands

--r R

Number of rows per band
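
If the text you want to deduplicate is spread across several columns, --column still expects a single field, so one option is a quick preprocessing pass with datasets before running the script. The snippet below is a minimal sketch and not part of text_dedup itself; the file names and the title/body columns are placeholders for your own schema.

from datasets import load_dataset

# Hypothetical preprocessing step: merge two columns into a single "text" column
# that can then be passed to --column. "title" and "body" are placeholder names.
ds = load_dataset("json", data_files="data.jsonl", split="train")
ds = ds.map(lambda row: {"text": row["title"] + "\n" + row["body"]})
ds.to_json("concatenated.jsonl")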

Example

python -m text_dedup.minhash \
   --path "oscar-corpus/OSCAR-2201" \
   --name "gl" \
   --split "train" \
   --cache_dir "./cache" \
   --output "output/minhash/oscar_gl_dedup" \
   --column "text" \
   --batch_size 10000 \
   --use_auth_token

Intuitions on Parameters

In this section, I’ll cover the following parameters based on my own experience:

  • ngram

  • num_perm

  • threshold

  • b and r

An interactive demo can be found here.

ngram, or tokenization in general, is a way of describing the document. After all, MinHash is an approximation of the Jaccard similarity between two sets — two sets of ngrams in this case. The quality of the ngrams will impact your final results. For example, if you use character uni-grams, you will almost certainly get a lot of false positives for English text or DNA sequences. But it won’t necessarily be a problem for CJK data. I usually start with word-level tri-grams and adjust from there. Strangely enough, people almost always choose odd numbers for ngram (e.g. 3, 5, 7, etc.). I don’t know why, but I do it too.
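
To make this concrete, here is a minimal sketch of word-level n-gram shingling and the exact Jaccard similarity that MinHash approximates. It is illustrative only and is not the tokenization code the script uses.

def word_ngrams(text: str, n: int = 3) -> set[str]:
    # Split on whitespace and return the set of word-level n-grams (shingles).
    tokens = text.split()
    return {" ".join(tokens[i : i + n]) for i in range(max(len(tokens) - n + 1, 1))}


def jaccard(a: set[str], b: set[str]) -> float:
    # Exact Jaccard similarity; MinHash estimates this without comparing the full sets.
    return len(a & b) / len(a | b) if a | b else 0.0


doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"
print(jaccard(word_ngrams(doc1), word_ngrams(doc2)))  # 0.4: 4 shared tri-grams out of 10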

num_perm is the number of permutations to use in MinHash. It is tightly related to the band and row numbers. The higher the permutation number, the more accurate the Jaccard similarity estimation will be. But it will also be slower and use more space (the space complexity is O(num_perm)). I usually start with something like 128 or 256 and adjust from there. One thing you could try is to play with the interactive demo to see how many false positives and false negatives you are willing to tolerate, and then adjust those settings accordingly.
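
As a rough illustration of this trade-off, the sketch below builds toy MinHash signatures with a shared set of random permutations and estimates the Jaccard similarity as the fraction of matching signature slots; the estimation error shrinks roughly like 1/sqrt(num_perm). It reuses word_ngrams from the sketch above and is not the script’s actual implementation (which uses vectorized NumPy and xxhash, as in the API example further down).

import hashlib
import random

random.seed(42)
MOD = (1 << 61) - 1  # a large Mersenne prime for the universal hash (a * x + b) % MOD
NUM_PERM = 128
# One shared set of permutation parameters, used for every document.
PERMS = [(random.randrange(1, MOD), random.randrange(0, MOD)) for _ in range(NUM_PERM)]


def hash_shingle(shingle: str) -> int:
    # Stable 64-bit hash of a shingle (the script uses xxhash instead).
    return int.from_bytes(hashlib.blake2b(shingle.encode(), digest_size=8).digest(), "big")


def signature(shingles: set[str]) -> list[int]:
    # Keep the minimum permuted hash value per permutation.
    hashes = [hash_shingle(s) for s in shingles]
    return [min((a * x + b) % MOD for x in hashes) for a, b in PERMS]


def estimated_jaccard(sig1: list[int], sig2: list[int]) -> float:
    # The fraction of matching slots estimates the Jaccard similarity.
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


sig1 = signature(word_ngrams("the quick brown fox jumps over the lazy dog"))
sig2 = signature(word_ngrams("the quick brown fox leaps over the lazy dog"))
print(estimated_jaccard(sig1, sig2))  # close to the exact 0.4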

You might see something like 9000 permutations used in research papers [1], along with additional second-stage filtering (e.g. edit similarity) to reduce the false positives. If you modify the interactive demo code to select 9k permutations, 450 bands, and 20 rows per band, you will see that you are more likely to get false positives, which necessitates the second-stage filtering. Of course, you can choose something different when you have different priorities.
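
You can also sanity-check a band/row choice without the demo: the probability that two documents with Jaccard similarity s share at least one band is 1 - (1 - s^r)^b, and integrating that curve on either side of your threshold gives a feel for the false-positive and false-negative mass (similar in spirit to what the interactive demo and datasketch’s parameter search compute). The snippet below is a standalone sketch; the 0.85 threshold is just an example value.

def candidate_probability(s: float, b: int, r: int) -> float:
    # Probability that two documents with Jaccard similarity s collide in >= 1 band.
    return 1.0 - (1.0 - s**r) ** b


def false_positive_area(threshold: float, b: int, r: int, steps: int = 1000) -> float:
    # Area under the curve below the threshold: dissimilar pairs that still become candidates.
    return sum(
        candidate_probability(threshold * i / steps, b, r) for i in range(steps)
    ) * threshold / steps


def false_negative_area(threshold: float, b: int, r: int, steps: int = 1000) -> float:
    # Area of 1 - P(s) above the threshold: similar pairs that are missed.
    width = 1.0 - threshold
    return sum(
        1.0 - candidate_probability(threshold + width * i / steps, b, r) for i in range(steps)
    ) * width / steps


# With 450 bands of 20 rows (9000 permutations) and a 0.85 threshold, the
# false-positive area is much larger than the false-negative area, which is
# consistent with the need for second-stage filtering described above.
print(false_positive_area(0.85, b=450, r=20), false_negative_area(0.85, b=450, r=20))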

Time and Space Complexity

The time complexity of MinHash is O(num_docs * doc_length * num_perm / num_proc). The space complexity is O(num_docs * num_perm).

The time complexity of LSH is O(num_docs * b). The space complexity varies depending on how many duplicates there are. If all documents are unique, you would expect O(num_docs * b). If all documents are duplicates, you would expect O(b).

If you add secondary filtering in the process, the time complexity will be higher.
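
For a back-of-the-envelope sense of the O(num_docs * num_perm) term, here is a quick calculation assuming 8-byte hash values; the actual dtype used by the script may differ.

# ~1.9 GiB of signatures for 1M documents at 256 permutations and 8 bytes per hash.
num_docs, num_perm, bytes_per_hash = 1_000_000, 256, 8
print(f"{num_docs * num_perm * bytes_per_hash / 2**30:.1f} GiB")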

[1]

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. 2022. arXiv:2107.06499.

API Reference

text_dedup.minhash.embed_func(content: str, idx: int, *, num_perm: int, ngram_size: int, min_length: int, hashranges: list[tuple[int, int]], permutations: ndarray, hash_func: Callable, dtype: type, max_hash: uint64, modulo_prime: uint64) → dict[str, Any]

Calculate hash values for the content.

Parameters

content : str

The content to be embedded.

idx : int

The index of the content.

num_perm : int

The number of permutations.

ngram_size : int

The size of n-grams.

min_length : int

The minimum length of the document in terms of tokens.

hashranges : List[Tuple[int, int]]

The ranges of hash values.

permutations : np.ndarray

The permutations for the minhash.

hash_func : Callable

The hash function to use.

Returns

Dict[str, Any]

The hash values in each range and the index.

Examples

>>> content = "hello world"
>>> idx = 0
>>> num_perm = 250
>>> ngram_size = 1
>>> hashranges = [(i, i + 25) for i in range(0, 250, 25)]
>>> max_hash = np.uint32((1 << 32) - 1)
>>> modulo_prime = np.uint32((1 << 32) - 5)
>>> PERMUTATIONS = (RNG.randint(1, modulo_prime, size=num_perm), RNG.randint(0, modulo_prime, size=num_perm))
>>> res = embed_func(
...     content,
...     idx,
...     num_perm=num_perm,
...     ngram_size=ngram_size,
...     min_length=0,
...     hashranges=hashranges,
...     permutations=PERMUTATIONS,
...     hash_func=xxh3_32hash,
...     dtype=np.uint32,
...     max_hash=max_hash,
...     modulo_prime=modulo_prime,
... )
>>> len(res[SIGNATURE_COLUMN])
10
>>> res[INDEX_COLUMN]
0
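
To connect this back to LSH: each entry of res[SIGNATURE_COLUMN] corresponds to one band (one hash range), and documents that agree on any band become candidate duplicates. The helper below is a conceptual sketch of that grouping step, not the script’s actual clustering code (which may also merge overlapping groups, e.g. with a union-find structure); records stands in for a hypothetical list of embed_func outputs.

from collections import defaultdict


def candidate_groups(records, signature_column, index_column):
    # Group document indices whose signatures collide in at least one band.
    # `records` is assumed to be a list of embed_func outputs; pass the same
    # SIGNATURE_COLUMN / INDEX_COLUMN keys used in the doctest above.
    buckets = defaultdict(list)
    for record in records:
        for band_idx, band_value in enumerate(record[signature_column]):
            buckets[(band_idx, band_value)].append(record[index_column])
    # Any bucket holding more than one index is a candidate duplicate group.
    return [ids for ids in buckets.values() if len(ids) > 1]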