Utility Functions and Classes¶
- class text_dedup.utils.BloomFilterArgs(error_rate: float = 1e-06, hash_func: str = 'md5', initial_capacity: int = 100)
- error_rate: float = 1e-06
- hash_func: str = 'md5'
- initial_capacity: int = 100
- static option_group(func)
- class text_dedup.utils.DisableReferenceCount
A context manager to disable reference counting during the execution of a block.
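A minimal usage sketch (the loop and data below are illustrative only; the context manager is simply wrapped around an allocation-heavy block):
>>> from text_dedup.utils import DisableReferenceCount
>>> signatures = {}
>>> with DisableReferenceCount():
...     for doc_id in range(10_000):
...         signatures[doc_id] = doc_id % 9973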
- class text_dedup.utils.ExactHashArgs(hash_func: str = 'md5')
- hash_func: str = 'md5'
- static option_group(func)
- class text_dedup.utils.IOArgs(path: str, output: str, name: str | None = None, data_dir: str | None = None, data_files: str | None = None, split: str | None = None, cache_dir: str = '.cache', revision: str | None = None, use_auth_token: bool = False, local: bool = False, debug: bool = False, clean_cache: bool = False, num_proc: int = 4)
- cache_dir: str = '.cache'
- clean_cache: bool = False
- data_dir: str | None = None
- data_files: str | None = None
- debug: bool = False
- local: bool = False
- name: str | None = None
- num_proc: int = 4
- static option_group(func)
- output: str
- path: str
- revision: str | None = None
- split: str | None = None
- use_auth_token: bool = False
- class text_dedup.utils.MetaArgs(column: str, batch_size: int = 10000, idx_column: str | None = None)
- batch_size: int = 10000
- column: str
- idx_column: str | None = None
- static option_group(func)
- class text_dedup.utils.MinHashArgs(ngram: int = 5, min_length: int = 5, seed: int = 42, num_perm: int = 250, threshold: float = 0.7, b: int | None = None, r: int | None = None, hash_func: str = 'sha1', hash_bits: int = 64)
- b: int | None = None
- hash_bits: int = 64
- hash_func: str = 'sha1'
- min_length: int = 5
- ngram: int = 5
- num_perm: int = 250
- static option_group(func)
- r: int | None = None
- seed: int = 42
- threshold: float = 0.7
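A hedged sketch of constructing MinHashArgs and deriving the banding parameters when b and r are left as None, using optimal_param documented further below (that the dedup scripts derive them exactly this way is an assumption):
>>> from text_dedup.utils import MinHashArgs, optimal_param
>>> args = MinHashArgs(num_perm=256, threshold=0.75)
>>> b, r = (args.b, args.r) if args.b and args.r else optimal_param(args.threshold, args.num_perm)
>>> (b, r)
(21, 12)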
- class text_dedup.utils.SAArgs(google_repo_path: str, k: int = 100, strategy: str = 'overlapping')
- google_repo_path: str
- k: int = 100
- static option_group(func)
- strategy: str = 'overlapping'
- class text_dedup.utils.SimHashArgs(ngram: int = 3, f: int = 64, bit_diff: int = 3, num_bucket: int = 4)
- bit_diff: int = 3
- f: int = 64
- ngram: int = 3
- num_bucket: int = 4
- static option_group(func)
- class text_dedup.utils.Timer
A simple timer that tracks the elapsed time of each context.
Examples¶
>>> t = Timer()
>>> with t("test"):
...     time.sleep(1)
>>> assert int(t.elapsed_times.get("test", 0)) >= 1, "The elapsed time should be 1 second."
- report(logger, pad: int)
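A hedged sketch combining the Timer context with report (the logger setup is illustrative, and pad is assumed to control the padding of context names in the report output):
>>> import logging, time
>>> from text_dedup.utils import Timer
>>> logger = logging.getLogger("dedup")
>>> t = Timer()
>>> with t("load"):
...     time.sleep(0.1)
>>> with t("dedup"):
...     time.sleep(0.1)
>>> t.report(logger, pad=16)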
- class text_dedup.utils.UniSimArgs(store_data: bool = False, index_type: str = 'approx', return_embeddings: bool = False, batch_size: int = 24, use_accelerator: bool = False, model_id: str = 'text/retsim/v1', index_params: dict[str, Any] | None = None, similarity_threshold: float = 0.9, verbose: int = 0)
- batch_size: int = 24
- index_params: dict[str, Any] | None = None
- index_type: str = 'approx'
- model_id: str = 'text/retsim/v1'
- static option_group(func)
- return_embeddings: bool = False
- similarity_threshold: float = 0.9
- store_data: bool = False
- use_accelerator: bool = False
- verbose: int = 0
- class text_dedup.utils.UnionFind
A data structure for maintaining disjoint sets. This helps build connected components from given duplicate pairs. This version uses both a rank structure (union by rank) and path compression. Applying either union by rank or path compression alone gives a time complexity of O(log n) per operation; applying both further reduces this to O(α(n)), where α is the inverse Ackermann function, a very slowly growing function.
Examples¶
>>> uf = UnionFind()
>>> uf.union(1, 2)
>>> uf.union(2, 3)
>>> uf.union(4, 5)
>>> uf.find(1)
1
>>> uf.find(2)
1
>>> uf.find(3)
1
>>> uf.find(4)
4
>>> uf.find(5)
4
>>> uf.rank[1]
1
>>> uf.rank[2]
0
>>> uf.union(3, 4)
>>> uf.find(1) == uf.find(5) == 1
True
>>> uf.find(7)
7
>>> uf.rank[7]
0
- dump(path: str | Path, id2id=None)
- find(x)
- reset()
- union(x, y)
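A hedged sketch of turning duplicate pairs into clusters of connected components, per the description above (the pairs are illustrative):
>>> from collections import defaultdict
>>> uf = UnionFind()
>>> duplicate_pairs = [(0, 3), (3, 7), (5, 6)]
>>> for x, y in duplicate_pairs:
...     uf.union(x, y)
>>> clusters = defaultdict(set)
>>> for doc_id in {i for pair in duplicate_pairs for i in pair}:
...     clusters[uf.find(doc_id)].add(doc_id)
>>> sorted(sorted(c) for c in clusters.values())
[[0, 3, 7], [5, 6]]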
- text_dedup.utils.load_hf_dataset(io_args: IOArgs, meta_args: MetaArgs) → Dataset
A simple wrapper to load a Hugging Face dataset.
Parameters¶
- io_args : IOArgs
The arguments for the dataset to load.
- meta_args : MetaArgs
The arguments for the meta parameters of the dataset to load.
Returns¶
- Dataset
The loaded dataset.
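A hedged usage sketch; the dataset path, config name, split, output directory, and column name below are placeholders, not values prescribed by the library:
>>> from text_dedup.utils import IOArgs, MetaArgs, load_hf_dataset
>>> io_args = IOArgs(path="allenai/c4", name="en", split="train", output="./dedup_output")
>>> meta_args = MetaArgs(column="text", batch_size=10000)
>>> ds = load_hf_dataset(io_args=io_args, meta_args=meta_args)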
- text_dedup.utils.md5(string=b'', *, usedforsecurity=True)
Returns a md5 hash object; optionally initialized with a string
- text_dedup.utils.md5_digest(data: bytes) → bytes
Generate an md5 hash in bytestring form from the given data.
Parameters¶
- data : bytes
The data to be hashed.
Returns¶
- bytes
The hash value in raw byte strings.
Examples¶
# raw byte strings cause problems on doctests
>>> int.from_bytes(md5_digest(b"hello world"), "little")
260265716838465564751810390803223393886
>>> len(md5_digest(b"hello world"))
16
- text_dedup.utils.md5_hexdigest(data: bytes) → str
Generate an md5 hex hash from the given data.
Parameters¶
- data : bytes
The data to be hashed.
Returns¶
- str
The hex hash value.
Examples¶
>>> md5_hexdigest(b"hello world")
'5eb63bbbe01eeed093cb22bb8f5acdc3'
>>> len(md5_hexdigest(b"hello world"))
32
- text_dedup.utils.news_copy_preprocessing(text: str) → str
This is the same preprocessing code used in the NEWS-COPY benchmark.
Parameters¶
- text : str
The input text to be processed.
Returns¶
- str
The processed text.
- text_dedup.utils.ngrams(sequence: List[str], n: int, min_length: int = 5)
Return the ngrams generated from a sequence of items, as an iterator.
This is a modified version of nltk.util.ngrams.
Parameters¶
- sequence : List[str]
The sequence of items.
- n : int
The length of each ngram.
- min_length : int, optional
The minimum number of items the sequence must contain to produce any ngrams; shorter sequences yield nothing. By default 5.
Returns¶
- iterator
The ngrams.
Examples¶
>>> list(ngrams(["a", "b", "c", "d"], 2, min_length=1))
[('a', 'b'), ('b', 'c'), ('c', 'd')]
>>> list(ngrams(["a", "b", "c", "d"], 2, min_length=5))
[]
>>> list(ngrams(["a", "b"], 3, min_length=1))
[('a', 'b')]
- text_dedup.utils.normalize(line: str) → str
Normalize a line of text. Source: https://github.com/facebookresearch/cc_net/blob/bda555bd1cf1ee2e0b925363e62a61cd46c8b60d/cc_net/text_normalizer.py#L180
Parameters¶
- line : str
The line of text to normalize.
Returns¶
- str
The normalized line of text.
Examples¶
>>> normalize("Hello, world!")
'hello world'
>>> normalize("Hello, 123!\n\t\b")
'hello 000'
- text_dedup.utils.optimal_param(threshold: float, num_perm: int, false_positive_weight: float = 0.5, false_negative_weight: float = 0.5)
Compute the optimal MinHashLSH parameters that minimize the weighted sum of the probabilities of false positives and false negatives, taken from datasketch.
You can also refer to the interactive demo at https://huggingface.co/spaces/bigcode/near-deduplication.
Parameters¶
- threshold : float
The threshold for similarity.
- num_perm : int
The number of permutations.
- false_positive_weight : float
The weight of false positives.
- false_negative_weight : float
The weight of false negatives.
Returns¶
- Tuple[int, int]
The optimal b (bands) and r (rows) parameters.
Examples¶
>>> optimal_param(0.75, 256)
(21, 12)
>>> optimal_param(0.75, 256, 0.1, 0.9)
(28, 9)
- text_dedup.utils.random_samples(ds: Dataset, cluster_column: str, text_column: str, num_clusters: int = 10, num_examples_per_cluster: int = 5)
- text_dedup.utils.sha1_hash(data: bytes, d: int = 32) → int
Generate a d-bit hash value from the given data.
Parameters¶
- data : bytes
The data to be hashed.
- d : int
The number of bits of the hash value.
Returns¶
- int
The hash value.
Examples¶
>>> sha1_hash(b"hello world", 32)
896314922
>>> sha1_hash(b"hello world", 64)
13028719972609469994
>>> sha1_hash(b"hello world", 128)
310522945683037930239412421226792791594
- text_dedup.utils.sha256(string=b'', *, usedforsecurity=True)
Returns a sha256 hash object; optionally initialized with a string
- text_dedup.utils.sha256_digest(data: bytes) → bytes
Generate a sha256 hash in bytestring form from the given data.
Parameters¶
- data : bytes
The data to be hashed.
Returns¶
- bytes
The hash value in raw byte strings.
Examples¶
# raw byte strings cause problems on doctests
>>> int.from_bytes(sha256_digest(b"hello world"), "little")
105752752996721010526070019734402373604975086831773275823333741804099920678329
>>> len(sha256_digest(b"hello world"))
32
- text_dedup.utils.sha256_hexdigest(data: bytes) → str
Generate a sha256 hex hash from the given data.
Parameters¶
- data : bytes
The data to be hashed.
Returns¶
- str
The hex hash value.
Examples¶
>>> sha256_hexdigest(b"hello world")
'b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9'
>>> len(sha256_hexdigest(b"hello world"))
64
- class text_dedup.utils.xxh3_128
An xxh3_128 represents the object used to calculate the XXH3_128 hash of a string of information.
Methods:
- update(input) – updates the current digest with an additional string
- digest() – return the current digest value
- hexdigest() – return the current digest as a string of hexadecimal digits
- intdigest() – return the current digest as an integer
- copy() – return a copy of the current xxh3_128 object
- block_size
Block size.
- copy() → xxh3_128 object
Return a copy ("clone") of the xxh3_128 object.
- digest() → string
Return the digest of the strings passed to the update() method so far. This is a 16-byte string which may contain non-ASCII characters, including null bytes.
- digest_size
Digest size.
- digestsize
Digest size.
- hexdigest() → string
Like digest(), but returns the digest as a string of hexadecimal digits.
- intdigest() → int
Like digest(), but returns the digest as an integer, i.e. the integer returned by the xxhash C API.
- name
Name. Always XXH3_128.
- reset()
Reset state.
- seed
Seed.
- update(input)
Update the xxh3_128 object with the string input. Repeated calls are equivalent to a single call with the concatenation of all the arguments.
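A hedged sketch of the incremental interface described above, relying only on the documented behavior that repeated update() calls are equivalent to hashing the concatenated input:
>>> from text_dedup.utils import xxh3_128
>>> h1 = xxh3_128()
>>> h1.update(b"hello ")
>>> h1.update(b"world")
>>> h2 = xxh3_128()
>>> h2.update(b"hello world")
>>> h1.digest() == h2.digest()
True
>>> len(h1.digest())
16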
- text_dedup.utils.xxh3_128_digest()
- text_dedup.utils.xxh3_16hash(data: bytes, seed: int = 0) → int
Generate a 16-bit xxhash-based hash value from the given data. As of python xxhash 3.3.0 (and since 0.3.0), the output is big-endian. This is useful as a special-purpose xxhash when you only want 16 bits; bit-masked xxh3_64 hashes are faster than xxh32 on modern systems.
Parameters¶
- data : bytes
The data to be hashed.
- seed : int
The seed value. All xxhash functions can be seeded; the default is 0.
Returns¶
- int
The hash value.
Examples¶
>>> xxh3_16hash(b"hello world")
39051
>>> xxh3_16hash(b"hello world", seed=42)
13198
>>> xxh3_16hash(b"hello world", seed=-42)
34281
- text_dedup.utils.xxh3_32hash(data: bytes, seed: int = 0) → int
Generate a 32-bit xxhash-based hash value from the given data. As of python xxhash 3.3.0 (and since 0.3.0), the output is big-endian. This is useful as a special-purpose xxhash when you only want 32 bits; bit-masked xxh3_64 hashes are faster than xxh32 on modern systems.
Parameters¶
- data : bytes
The data to be hashed.
- seed : int
The seed value. All xxhash functions can be seeded; the default is 0.
Returns¶
- int
The hash value.
Examples¶
>>> xxh3_32hash(b"hello world")
1088854155
>>> xxh3_32hash(b"hello world", seed=42)
3913102222
>>> xxh3_32hash(b"hello world", seed=-42)
3721037289
- class text_dedup.utils.xxh3_64
An xxh3_64 represents the object used to calculate the XXH3_64 hash of a string of information.
Methods:
- update(input) – updates the current digest with an additional string
- digest() – return the current digest value
- hexdigest() – return the current digest as a string of hexadecimal digits
- intdigest() – return the current digest as an integer
- copy() – return a copy of the current xxh3_64 object
- block_size
Block size.
- copy() → xxh3_64 object
Return a copy ("clone") of the xxh3_64 object.
- digest() → string
Return the digest of the strings passed to the update() method so far. This is an 8-byte string which may contain non-ASCII characters, including null bytes.
- digest_size
Digest size.
- digestsize
Digest size.
- hexdigest() → string
Like digest(), but returns the digest as a string of hexadecimal digits.
- intdigest() → int
Like digest(), but returns the digest as an integer, i.e. the integer returned by the xxhash C API.
- name
Name. Always XXH3_64.
- reset()
Reset state.
- seed
Seed.
- update(input)
Update the xxh3_64 object with the string input. Repeated calls are equivalent to a single call with the concatenation of all the arguments.
- text_dedup.utils.xxh3_64_digest()
- text_dedup.utils.xxh3_hash(data: bytes, d: int = 32) → int
Generate a d-bit xxhash-based hash value from the given data. As of python xxhash 3.3.0 (and since 0.3.0), the output is big-endian. This is useful as a general-purpose xxhash that supports multiple values of d.
Parameters¶
- data : bytes
The data to be hashed.
- d : int
The number of bits of the hash value. Based on this value, the empirically fastest xxh3 hasher is chosen.
Returns¶
- int
The hash value.
Examples¶
>>> xxh3_hash(b"hello world", 32)
1088854155
>>> xxh3_hash(b"hello world", 64)
15296390279056496779
>>> xxh3_hash(b"hello world", 128)
297150157938599054391163723952090887879