Utility Functions and Classes

class text_dedup.utils.BloomFilterArgs(error_rate: float = 1e-06, hash_func: str = 'md5', initial_capacity: int = 100)
error_rate: float = 1e-06
hash_func: str = 'md5'
initial_capacity: int = 100
static option_group(func)
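The *Args dataclasses on this page each expose a static option_group(func) decorator. A hedged sketch of how such a decorator is typically applied to a CLI entry point follows, assuming a Click-style command; the command name, the injected parameter names, and the wiring are illustrative rather than taken from this page.

# A hedged sketch, assuming a Click-style CLI: each option_group is expected to
# register its dataclass fields as command-line options and hand a populated
# instance to the command. Command and parameter names here are illustrative.
import click

from text_dedup.utils import BloomFilterArgs, IOArgs, MetaArgs


@click.command()
@IOArgs.option_group
@MetaArgs.option_group
@BloomFilterArgs.option_group
def main(io_args: IOArgs, meta_args: MetaArgs, bloom_filter_args: BloomFilterArgs):
    ...  # deduplication logic would go here


if __name__ == "__main__":
    main()
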
class text_dedup.utils.DisableReferenceCount

A context manager to disable reference counting during the execution of a block.
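A minimal usage sketch: per the description above, the context manager disables reference counting only for the duration of the block. The allocation-heavy work inside the block is illustrative.

# Minimal usage sketch; the work inside the block is illustrative.
from text_dedup.utils import DisableReferenceCount

with DisableReferenceCount():
    records = [{"id": i, "text": "example"} for i in range(1_000_000)]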

class text_dedup.utils.ExactHashArgs(hash_func: str = 'md5')
hash_func: str = 'md5'
static option_group(func)
class text_dedup.utils.IOArgs(path: str, output: str, name: str | None = None, data_dir: str | None = None, data_files: str | None = None, split: str | None = None, cache_dir: str = '.cache', revision: str | None = None, use_auth_token: bool = False, local: bool = False, debug: bool = False, clean_cache: bool = False, num_proc: int = 4)
cache_dir: str = '.cache'
clean_cache: bool = False
data_dir: str | None = None
data_files: str | None = None
debug: bool = False
local: bool = False
name: str | None = None
num_proc: int = 4
static option_group(func)
output: str
path: str
revision: str | None = None
split: str | None = None
use_auth_token: bool = False
class text_dedup.utils.MetaArgs(column: str, batch_size: int = 10000, idx_column: str | None = None)
batch_size: int = 10000
column: str
idx_column: str | None = None
static option_group(func)
class text_dedup.utils.MinHashArgs(ngram: int = 5, min_length: int = 5, seed: int = 42, num_perm: int = 250, threshold: float = 0.7, b: int | None = None, r: int | None = None, hash_func: str = 'sha1', hash_bits: int = 64)
b: int | None = None
hash_bits: int = 64
hash_func: str = 'sha1'
min_length: int = 5
ngram: int = 5
num_perm: int = 250
static option_group(func)
r: int | None = None
seed: int = 42
threshold: float = 0.7
class text_dedup.utils.SAArgs(google_repo_path: str, k: int = 100, strategy: str = 'overlapping')
google_repo_path: str
k: int = 100
static option_group(func)
strategy: str = 'overlapping'
class text_dedup.utils.SimHashArgs(ngram: int = 3, f: int = 64, bit_diff: int = 3, num_bucket: int = 4)
bit_diff: int = 3
f: int = 64
ngram: int = 3
num_bucket: int = 4
static option_group(func)
class text_dedup.utils.Timer

A simple timer that tracks the elapsed time of each context.

Examples

>>> t = Timer()
>>> with t("test"):
...     time.sleep(1)
>>> assert int(t.elapsed_times.get("test", 0)) >= 1, "The elapsed time should be at least 1 second."
report(logger, pad: int)
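A hedged usage sketch combining the context-manager interface shown above with report(); the logger name, the timed stages, and the pad width are illustrative, and report()'s exact output format is not documented here.

# A hedged usage sketch; the logger, stage names, and pad width are illustrative.
import logging
import time

from text_dedup.utils import Timer

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("text_dedup")

t = Timer()
with t("fingerprinting"):
    time.sleep(0.1)
with t("clustering"):
    time.sleep(0.2)

t.report(logger, pad=len("fingerprinting"))  # logs the elapsed time of each context
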
class text_dedup.utils.UniSimArgs(store_data: bool = False, index_type: str = 'approx', return_embeddings: bool = False, batch_size: int = 24, use_accelerator: bool = False, model_id: str = 'text/retsim/v1', index_params: dict[str, Any] | None = None, similarity_threshold: float = 0.9, verbose: int = 0)
batch_size: int = 24
index_params: dict[str, Any] | None = None
index_type: str = 'approx'
model_id: str = 'text/retsim/v1'
static option_group(func)
return_embeddings: bool = False
similarity_threshold: float = 0.9
store_data: bool = False
use_accelerator: bool = False
verbose: int = 0
class text_dedup.utils.UnionFind

A data structure for maintaining disjoint sets. This helps build connected components for given duplicate pairs. This version uses both union by rank and path compression. Applying either union by rank or path compression alone gives a time complexity of O(log n) per operation; applying both reduces this to O(α(n)), where α is the inverse Ackermann function (a very slowly growing function).

Examples

>>> uf = UnionFind()
>>> uf.union(1, 2)
>>> uf.union(2, 3)
>>> uf.union(4, 5)
>>> uf.find(1)
1
>>> uf.find(2)
1
>>> uf.find(3)
1
>>> uf.find(4)
4
>>> uf.find(5)
4
>>> uf.rank[1]
1
>>> uf.rank[2]
0
>>> uf.union(3, 4)
>>> uf.find(1) == uf.find(5) == 1
True
>>> uf.find(7)
7
>>> uf.rank[7]
0
dump(path: str | Path, id2id=None)
find(x)
reset()
union(x, y)
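A hedged sketch of the typical workflow: union every duplicate pair, then group document ids by their root to obtain clusters. The pairs and the dump path are illustrative, and dump()'s on-disk format is not documented on this page.

# A hedged sketch: union every duplicate pair, then group ids by root to get
# clusters. The pairs and the dump path are illustrative.
from collections import defaultdict

from text_dedup.utils import UnionFind

duplicate_pairs = [(0, 3), (3, 7), (4, 5)]

uf = UnionFind()
for x, y in duplicate_pairs:
    uf.union(x, y)

clusters = defaultdict(set)
for doc_id in {i for pair in duplicate_pairs for i in pair}:
    clusters[uf.find(doc_id)].add(doc_id)
# With the union-by-rank behavior shown in the Examples above, this yields
# {0: {0, 3, 7}, 4: {4, 5}}.

uf.dump("union_find.pkl")  # illustrative path; the on-disk format is not documented here
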
text_dedup.utils.load_hf_dataset(io_args: IOArgs, meta_args: MetaArgs) → Dataset

A simple wrapper to load a Hugging Face dataset.

Parameters

io_args : IOArgs

The arguments for the dataset to load.

meta_args : MetaArgs

The arguments for the meta parameters of the dataset to load.

Returns

Dataset

The loaded dataset.
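A hedged usage sketch; the dataset path, configuration name, split, output directory, and column are illustrative.

# A hedged usage sketch; the dataset path, name, split, output directory, and
# column names are illustrative.
from text_dedup.utils import IOArgs, MetaArgs, load_hf_dataset

io_args = IOArgs(
    path="allenai/c4",        # anything accepted by datasets.load_dataset
    name="en",
    split="train",
    output="./output/dedup",
    num_proc=4,
)
meta_args = MetaArgs(column="text", batch_size=10_000)

ds = load_hf_dataset(io_args=io_args, meta_args=meta_args)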

text_dedup.utils.md5(string=b'', *, usedforsecurity=True)

Returns a md5 hash object; optionally initialized with a string

text_dedup.utils.md5_digest(data: bytes) → bytes

Generate a md5 hash in bytestring form from the given data.

Parameters

data : bytes

The data to be hashed.

Returns

bytes

The hash value in raw byte strings.

Examples

# raw byte strings cause problems on doctests
>>> int.from_bytes(md5_digest(b"hello world"), "little")
260265716838465564751810390803223393886
>>> len(md5_digest(b"hello world"))
16

text_dedup.utils.md5_hexdigest(data: bytes) → str

Generate a md5 hex hash from the given data.

Parameters

data : bytes

The data to be hashed.

Returns

str

The hex hash value.

Examples

>>> md5_hexdigest(b"hello world")
'5eb63bbbe01eeed093cb22bb8f5acdc3'
>>> len(md5_hexdigest(b"hello world"))
32
text_dedup.utils.news_copy_preprocessing(text: str) → str

This is the same preprocessing code used in the NEWS-COPY benchmark.

Parameters

text : str

The input text to be processed.

Returns

str

The processed text.

text_dedup.utils.ngrams(sequence: List[str], n: int, min_length: int = 5)

Return the ngrams generated from a sequence of items, as an iterator.

This is a modified version of nltk.util.ngrams.

Parameters

sequence : List[str]

The sequence of items.

n : int

The length of each ngram.

min_length : int, optional

The minimum number of items the sequence must contain to produce any ngrams; shorter sequences yield no ngrams. By default 5.

Returns

iterator

The ngrams.

Examples

>>> list(ngrams(["a", "b", "c", "d"], 2, min_length=1))
[('a', 'b'), ('b', 'c'), ('c', 'd')]
>>> list(ngrams(["a", "b", "c", "d"], 2, min_length=5))
[]
>>> list(ngrams(["a", "b"], 3, min_length=1))
[('a', 'b')]
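The Examples above pin down the min_length behavior; a minimal sketch consistent with those outputs follows (an illustration, not necessarily the library's exact implementation).

# A minimal sketch consistent with the Examples above, not necessarily the
# library's exact implementation: sequences shorter than min_length yield
# nothing, and sequences shorter than n yield a single truncated tuple.
from typing import Iterable, List, Tuple


def ngrams_sketch(sequence: List[str], n: int, min_length: int = 5) -> Iterable[Tuple[str, ...]]:
    if len(sequence) < min_length:
        return []
    if len(sequence) < n:
        return [tuple(sequence)]
    return zip(*(sequence[i:] for i in range(n)))
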
text_dedup.utils.normalize(line: str) → str

Normalize a line of text. Source: https://github.com/facebookresearch/cc_net/blob/bda555bd1cf1ee2e0b925363e62a61cd46c8b60d/cc_net/text_normalizer.py#L180

Parameters

line : str

The line of text to normalize.

Returns

str

The normalized line of text.

Examples

>>> normalize("Hello, world!")
'hello world'
>>> normalize("Hello, 123!\n\t\b")
'hello 000'
text_dedup.utils.optimal_param(threshold: float, num_perm: int, false_positive_weight: float = 0.5, false_negative_weight: float = 0.5)

Compute the optimal MinHashLSH parameters (b bands and r rows) that minimize the weighted sum of the false positive and false negative probabilities, taken from datasketch.

You can also refer to the interactive demo at https://huggingface.co/spaces/bigcode/near-deduplication.

Parameters

threshold : float

The threshold for similarity.

num_perm : int

The number of permutations.

false_positive_weight : float

The weight of false positives.

false_negative_weight : float

The weight of false negatives.

Returns

Tuple[int, int]

The optimal b (bands) and r (rows) parameters.

Examples

>>> optimal_param(0.75, 256)
(21, 12)
>>> optimal_param(0.75, 256, 0.1, 0.9)
(28, 9)
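A hedged illustration of how the returned pair is typically used in MinHash LSH: the num_perm hash values are split into b bands of r rows, and two documents with Jaccard similarity s become candidates with probability 1 - (1 - s^r)^b. This is the standard banding formula, shown here for context rather than as library internals.

# A hedged illustration of the standard LSH banding interpretation of (b, r);
# this is the textbook candidate-probability curve, not library internals.
from text_dedup.utils import optimal_param

b, r = optimal_param(0.75, 256)   # (21, 12), per the Examples above
assert b * r <= 256               # bands * rows must fit within num_perm


def candidate_probability(s: float, b: int, r: int) -> float:
    return 1.0 - (1.0 - s**r) ** b


print(candidate_probability(0.75, b, r))  # probability of being compared at s = 0.75
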
text_dedup.utils.random_samples(ds: Dataset, cluster_column: str, text_column: str, num_clusters: int = 10, num_examples_per_cluster: int = 5)
text_dedup.utils.sha1_hash(data: bytes, d: int = 32) → int

Generate a d-bit hash value from the given data.

Parameters

data : bytes

The data to be hashed.

d : int

The number of bits of the hash value.

Returns

int

The hash value.

Examples

>>> sha1_hash(b"hello world", 32)
896314922
>>> sha1_hash(b"hello world", 64)
13028719972609469994
>>> sha1_hash(b"hello world", 128)
310522945683037930239412421226792791594
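The example values above are consistent with taking the first d // 8 bytes of the SHA-1 digest and interpreting them as a little-endian integer; a sketch on that reading follows, inferred from the Examples rather than quoted from the library source.

# A sketch that reproduces the documented example values: take the first d // 8
# bytes of the SHA-1 digest and read them as a little-endian integer. Inferred
# from the Examples above rather than quoted from the library source.
import hashlib


def sha1_hash_sketch(data: bytes, d: int = 32) -> int:
    return int.from_bytes(hashlib.sha1(data).digest()[: d // 8], byteorder="little")


assert sha1_hash_sketch(b"hello world", 32) == 896314922
assert sha1_hash_sketch(b"hello world", 64) == 13028719972609469994
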
text_dedup.utils.sha256(string=b'', *, usedforsecurity=True)

Returns a sha256 hash object; optionally initialized with a string

text_dedup.utils.sha256_digest(data: bytes) → bytes

Generate a sha256 hash in bytestring form from the given data.

Parameters

data : bytes

The data to be hashed.

Returns

bytes

The hash value in raw byte strings.

Examples

# raw byte strings cause problems on doctests
>>> int.from_bytes(sha256_digest(b"hello world"), "little")
105752752996721010526070019734402373604975086831773275823333741804099920678329
>>> len(sha256_digest(b"hello world"))
32

text_dedup.utils.sha256_hexdigest(data: bytes) → str

Generate a sha256 hex hash from the given data.

Parameters

data : bytes

The data to be hashed.

Returns

str

The hex hash value.

Examples

>>> sha256_hexdigest(b"hello world")
'b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9'
>>> len(sha256_hexdigest(b"hello world"))
64
class text_dedup.utils.xxh3_128

An xxh3_128 represents the object used to calculate the XXH3_128 hash of a string of information.

Methods:

update(input) – updates the current digest with an additional string
digest() – return the current digest value
hexdigest() – return the current digest as a string of hexadecimal digits
intdigest() – return the current digest as an integer
copy() – return a copy of the current xxh3_128 object

block_size

Block size.

copy() → xxh3_128 object

Return a copy ("clone") of the xxh3_128 object.

digest() → string

Return the digest of the strings passed to the update() method so far. This is a 16-byte string which may contain non-ASCII characters, including null bytes.

digest_size

Digest size.

digestsize

Digest size.

hexdigest() → string

Like digest(), but returns the digest as a string of hexadecimal digits.

intdigest() → int

Like digest(), but returns the digest as an integer, which is the integer returned by the xxhash C API.

name

Name. Always XXH3_128.

reset()

Reset state.

seed

Seed.

update(input)

Update the xxh3_128 object with the string input. Repeated calls are equivalent to a single call with the concatenation of all the arguments.
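The streaming interface mirrors hashlib; a short usage sketch using only the methods listed above.

# A short usage sketch of the streaming interface listed above: repeated
# update() calls hash the same content as a single call on the concatenation.
from text_dedup.utils import xxh3_128

h = xxh3_128()
h.update(b"hello ")
h.update(b"world")

one_shot = xxh3_128(b"hello world")
assert h.hexdigest() == one_shot.hexdigest()
assert len(h.digest()) == 16  # 128-bit digest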

text_dedup.utils.xxh3_128_digest()
text_dedup.utils.xxh3_16hash(data: bytes, seed: int = 0) → int

Generate a 16-bit xxhash-based hash value from the given data. As of python xxhash 3.3.0 (and since 0.3.0), output is big-endian. This is useful as a special-purpose xxhash when you only want 16 bits; bit-masked xxh3_64 hashes are faster than xxh32 on modern systems.

Parameters

data : bytes

The data to be hashed.

seed : int

The seed value. All xxhash functions can be seeded; by default 0.

Returns

int

The hash value.

Examples

>>> xxh3_16hash(b"hello world")
39051
>>> xxh3_16hash(b"hello world", seed=42)
13198
>>> xxh3_16hash(b"hello world", seed=-42)
34281
text_dedup.utils.xxh3_32hash(data: bytes, seed: int = 0) → int

Generate a 32-bit xxhash-based hash value from the given data. As of python xxhash 3.3.0 (and since 0.3.0), output is big-endian. This is useful as a special-purpose xxhash when you only want 32 bits; bit-masked xxh3_64 hashes are faster than xxh32 on modern systems.

Parameters

data : bytes

The data to be hashed.

seed : int

The seed value. All xxhash functions can be seeded; by default 0.

Returns

int

The hash value.

Examples

>>> xxh3_32hash(b"hello world")
1088854155
>>> xxh3_32hash(b"hello world", seed=42)
3913102222
>>> xxh3_32hash(b"hello world", seed=-42)
3721037289
class text_dedup.utils.xxh3_64

An xxh3_64 represents the object used to calculate the XXH3_64 hash of a string of information.

Methods:

update(input) – updates the current digest with an additional string
digest() – return the current digest value
hexdigest() – return the current digest as a string of hexadecimal digits
intdigest() – return the current digest as an integer
copy() – return a copy of the current xxh3_64 object

block_size

Block size.

copy() → xxh3_64 object

Return a copy ("clone") of the xxh3_64 object.

digest() → string

Return the digest of the strings passed to the update() method so far. This is an 8-byte string which may contain non-ASCII characters, including null bytes.

digest_size

Digest size.

digestsize

Digest size.

hexdigest() → string

Like digest(), but returns the digest as a string of hexadecimal digits.

intdigest() → int

Like digest(), but returns the digest as an integer, which is the integer returned by the xxhash C API.

name

Name. Always XXH3_64.

reset()

Reset state.

seed

Seed.

update(input)

Update the xxh3_64 object with the string input. Repeated calls are equivalent to a single call with the concatenation of all the arguments.

text_dedup.utils.xxh3_64_digest()
text_dedup.utils.xxh3_hash(data: bytes, d: int = 32) → int

Generate a d-bit xxhash-based hash value from the given data. As of python xxhash 3.3.0 (and since 0.3.0), output is big-endian. This is useful as a general-purpose xxhash that accepts multiple values of d.

Parameters

data : bytes

The data to be hashed.

d : int

The number of bits of the hash value. Based on this value, the empirically best xxh3 variant is chosen.

Returns

int

The hash value.

Examples

>>> xxh3_hash(b"hello world", 32)
1088854155
>>> xxh3_hash(b"hello world", 64)
15296390279056496779
>>> xxh3_hash(b"hello world", 128)
297150157938599054391163723952090887879
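A hedged sketch of selecting among the helpers on this page by bit width, for example to honor a hash_bits setting; the dispatcher itself is illustrative, and only the helper functions and the expected values come from the Examples above.

# A hedged sketch: pick one of the documented helpers by bit width. The
# dispatcher is illustrative; the expected values come from the Examples above.
from text_dedup.utils import xxh3_16hash, xxh3_32hash, xxh3_hash


def pick_xxh3(hash_bits: int):
    if hash_bits == 16:
        return xxh3_16hash
    if hash_bits == 32:
        return xxh3_32hash
    return lambda data: xxh3_hash(data, d=hash_bits)


assert pick_xxh3(32)(b"hello world") == 1088854155
assert pick_xxh3(64)(b"hello world") == 15296390279056496779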