Utility Functions and Classes¶
- class text_dedup.utils.BloomFilterArgs(error_rate: float = 1e-06, hash_func: str = 'md5', initial_capacity: int = 100)
- error_rate: float = 1e-06
- hash_func: str = 'md5'
- initial_capacity: int = 100
- static option_group(func)
- class text_dedup.utils.DisableReferenceCount
A context manager to disable reference counting during the execution of a block.
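A minimal usage sketch (the loop and data below are illustrative only; the context manager is simply wrapped around an allocation-heavy block):
>>> from text_dedup.utils import DisableReferenceCount
>>> signatures = {}
>>> with DisableReferenceCount():
...     for doc_id in range(10_000):
...         signatures[doc_id] = doc_id % 9973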
- class text_dedup.utils.ExactHashArgs(hash_func: str = 'md5')
- hash_func: str = 'md5'
- static option_group(func)
- class text_dedup.utils.IOArgs(path: str, output: str, name: str | None = None, data_dir: str | None = None, data_files: str | None = None, split: str | None = None, cache_dir: str = '.cache', revision: str | None = None, use_auth_token: bool = False, local: bool = False, debug: bool = False, clean_cache: bool = False, num_proc: int = 4)
- cache_dir: str = '.cache'
- clean_cache: bool = False
- data_dir: str | None = None
- data_files: str | None = None
- debug: bool = False
- local: bool = False
- name: str | None = None
- num_proc: int = 4
- static option_group(func)
- output: str
- path: str
- revision: str | None = None
- split: str | None = None
- use_auth_token: bool = False
- class text_dedup.utils.MetaArgs(column: str, batch_size: int = 10000, idx_column: str | None = None)
- batch_size: int = 10000
- column: str
- idx_column: str | None = None
- static option_group(func)
- class text_dedup.utils.MinHashArgs(ngram: int = 5, min_length: int = 5, seed: int = 42, num_perm: int = 250, threshold: float = 0.7, b: int | None = None, r: int | None = None, hash_func: str = 'sha1', hash_bits: int = 64)
- b: int | None = None
- hash_bits: int = 64
- hash_func: str = 'sha1'
- min_length: int = 5
- ngram: int = 5
- num_perm: int = 250
- static option_group(func)
- r: int | None = None
- seed: int = 42
- threshold: float = 0.7
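A hedged sketch of constructing MinHashArgs and deriving the banding parameters when b and r are left as None, using optimal_param documented further below (that the dedup scripts derive them exactly this way is an assumption):
>>> from text_dedup.utils import MinHashArgs, optimal_param
>>> args = MinHashArgs(num_perm=256, threshold=0.75)
>>> b, r = (args.b, args.r) if args.b and args.r else optimal_param(args.threshold, args.num_perm)
>>> (b, r)
(21, 12)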
- class text_dedup.utils.SAArgs(google_repo_path: str, k: int = 100, strategy: str = 'overlapping')
- google_repo_path: str
- k: int = 100
- static option_group(func)
- strategy: str = 'overlapping'
- class text_dedup.utils.SimHashArgs(ngram: int = 3, f: int = 64, bit_diff: int = 3, num_bucket: int = 4)
- bit_diff: int = 3
- f: int = 64
- ngram: int = 3
- num_bucket: int = 4
- static option_group(func)
- class text_dedup.utils.Timer
A simple timer that tracks the elapsed time of each context.
Examples¶
>>> t = Timer()
>>> with t("test"):
...     time.sleep(1)
>>> assert int(t.elapsed_times.get("test", 0)) >= 1, "The elapsed time should be 1 second."
- report(logger, pad: int)
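A hedged sketch combining the Timer context with report (the logger setup is illustrative, and pad is assumed to control the padding of context names in the report output):
>>> import logging, time
>>> from text_dedup.utils import Timer
>>> logger = logging.getLogger("dedup")
>>> t = Timer()
>>> with t("load"):
...     time.sleep(0.1)
>>> with t("dedup"):
...     time.sleep(0.1)
>>> t.report(logger, pad=16)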
- class text_dedup.utils.UniSimArgs(store_data: bool = False, index_type: str = 'approx', return_embeddings: bool = False, batch_size: int = 24, use_accelerator: bool = False, model_id: str = 'text/retsim/v1', index_params: dict[str, Any] | None = None, similarity_threshold: float = 0.9, verbose: int = 0)
- batch_size: int = 24
- index_params: dict[str, Any] | None = None
- index_type: str = 'approx'
- model_id: str = 'text/retsim/v1'
- static option_group(func)
- return_embeddings: bool = False
- similarity_threshold: float = 0.9
- store_data: bool = False
- use_accelerator: bool = False
- verbose: int = 0
- class text_dedup.utils.UnionFind
A data structure for maintaining disjoint sets. This helps build connected components from given duplicate pairs. This version uses both a rank structure (union by rank) and path compression. Applying either union by rank or path compression alone gives a time complexity of O(log n) per operation; applying both further reduces this to O(α(n)), where α is the inverse Ackermann function, a very slowly growing function.
Examples¶
>>> uf = UnionFind()
>>> uf.union(1, 2)
>>> uf.union(2, 3)
>>> uf.union(4, 5)
>>> uf.find(1)
1
>>> uf.find(2)
1
>>> uf.find(3)
1
>>> uf.find(4)
4
>>> uf.find(5)
4
>>> uf.rank[1]
1
>>> uf.rank[2]
0
>>> uf.union(3, 4)
>>> uf.find(1) == uf.find(5) == 1
True
>>> uf.find(7)
7
>>> uf.rank[7]
0
- dump(path: str | Path, id2id=None)
- find(x)
- reset()
- union(x, y)
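A hedged sketch of turning duplicate pairs into clusters of connected components, per the description above (the pairs are illustrative):
>>> from collections import defaultdict
>>> uf = UnionFind()
>>> duplicate_pairs = [(0, 3), (3, 7), (5, 6)]
>>> for x, y in duplicate_pairs:
...     uf.union(x, y)
>>> clusters = defaultdict(set)
>>> for doc_id in {i for pair in duplicate_pairs for i in pair}:
...     clusters[uf.find(doc_id)].add(doc_id)
>>> sorted(sorted(c) for c in clusters.values())
[[0, 3, 7], [5, 6]]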
- text_dedup.utils.load_hf_dataset(io_args: IOArgs, meta_args: MetaArgs) → Dataset
A simple wrapper to load a Hugging Face dataset.
Parameters¶
- io_args : IOArgs
The arguments for the dataset to load.
- meta_args : MetaArgs
The arguments for the meta parameters of the dataset to load.
Returns¶
- Dataset
The loaded dataset.
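A hedged usage sketch; the dataset path, config name, split, output directory, and column name below are placeholders, not values prescribed by the library:
>>> from text_dedup.utils import IOArgs, MetaArgs, load_hf_dataset
>>> io_args = IOArgs(path="allenai/c4", name="en", split="train", output="./dedup_output")
>>> meta_args = MetaArgs(column="text", batch_size=10000)
>>> ds = load_hf_dataset(io_args=io_args, meta_args=meta_args)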
- text_dedup.utils.md5(string=b'', *, usedforsecurity=True)
Returns a md5 hash object; optionally initialized with a string
- text_dedup.utils.md5_digest(data: bytes) → bytes
Generate an md5 hash in bytestring form from the given data.
Parameters¶
- data : bytes
The data to be hashed.
Returns¶
- bytes
The hash value in raw byte strings.
Examples¶
# raw byte strings cause problems on doctests
>>> int.from_bytes(md5_digest(b"hello world"), "little")
260265716838465564751810390803223393886
>>> len(md5_digest(b"hello world"))
16
- text_dedup.utils.md5_hexdigest(data: bytes) → str
Generate an md5 hex hash from the given data.
Parameters¶
- data : bytes
The data to be hashed.
Returns¶
- str
The hex hash value.
Examples¶
>>> md5_hexdigest(b"hello world")
'5eb63bbbe01eeed093cb22bb8f5acdc3'
>>> len(md5_hexdigest(b"hello world"))
32
- text_dedup.utils.news_copy_preprocessing(text: str) → str
This is the same preprocessing code used in the NEWS-COPY benchmark.
Parameters¶
- text : str
The input text to be processed.
Returns¶
- str
The processed text.
- text_dedup.utils.ngrams(sequence: List[str], n: int, min_length: int = 5)
Return the ngrams generated from a sequence of items, as an iterator.
This is a modified version of nltk.util.ngrams.
Parameters¶
- sequence : List[str]
The sequence of items.
- n : int
The length of each ngram.
- min_length : int, optional
The minimum number of items the sequence must contain to produce any ngrams; shorter sequences yield nothing. By default 5.
Returns¶
- iterator
The ngrams.
Examples¶
>>> list(ngrams(["a", "b", "c", "d"], 2, min_length=1))
[('a', 'b'), ('b', 'c'), ('c', 'd')]
>>> list(ngrams(["a", "b", "c", "d"], 2, min_length=5))
[]
>>> list(ngrams(["a", "b"], 3, min_length=1))
[('a', 'b')]
- text_dedup.utils.normalize(line: str) → str
Normalize a line of text. Source: https://github.com/facebookresearch/cc_net/blob/bda555bd1cf1ee2e0b925363e62a61cd46c8b60d/cc_net/text_normalizer.py#L180
Parameters¶
- line : str
The line of text to normalize.
Returns¶
- str
The normalized line of text.
Examples¶
>>> normalize("Hello, world!")
'hello world'
>>> normalize("Hello, 123!\n\t\b")
'hello 000'
- text_dedup.utils.optimal_param(threshold: float, num_perm: int, false_positive_weight: float = 0.5, false_negative_weight: float = 0.5)
Compute the optimal MinHashLSH parameters that minimize the weighted sum of the probabilities of false positives and false negatives, taken from datasketch.
You can also refer to the interactive demo at https://huggingface.co/spaces/bigcode/near-deduplication.
Parameters¶
- threshold : float
The threshold for similarity.
- num_perm : int
The number of permutations.
- false_positive_weight : float
The weight of false positives.
- false_negative_weight : float
The weight of false negatives.
Returns¶
- Tuple[int, int]
The optimal b (bands) and r (rows) parameters.
Examples¶
>>> optimal_param(0.75, 256)
(21, 12)
>>> optimal_param(0.75, 256, 0.1, 0.9)
(28, 9)
- text_dedup.utils.random_samples(ds: Dataset, cluster_column: str, text_column: str, num_clusters: int = 10, num_examples_per_cluster: int = 5)
- text_dedup.utils.sha1_hash(data: bytes, d: int = 32) → int
Generate a d-bit hash value from the given data.
Parameters¶
- data : bytes
The data to be hashed.
- d : int
The number of bits of the hash value.
Returns¶
- int
The hash value.
Examples¶
>>> sha1_hash(b"hello world", 32)
896314922
>>> sha1_hash(b"hello world", 64)
13028719972609469994
>>> sha1_hash(b"hello world", 128)
310522945683037930239412421226792791594
- text_dedup.utils.sha256(string=b'', *, usedforsecurity=True)
Returns a sha256 hash object; optionally initialized with a string
- text_dedup.utils.sha256_digest(data: bytes) → bytes
Generate a sha256 hash in bytestring form from the given data.
Parameters¶
- data : bytes
The data to be hashed.
Returns¶
- bytes
The hash value in raw byte strings.
Examples¶
# raw byte strings cause problems on doctests
>>> int.from_bytes(sha256_digest(b"hello world"), "little")
105752752996721010526070019734402373604975086831773275823333741804099920678329
>>> len(sha256_digest(b"hello world"))
32
- text_dedup.utils.sha256_hexdigest(data: bytes) → str
Generate a sha256 hex hash from the given data.
Parameters¶
- data : bytes
The data to be hashed.
Returns¶
- str
The hex hash value.
Examples¶
>>> sha256_hexdigest(b"hello world")
'b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9'
>>> len(sha256_hexdigest(b"hello world"))
64
- class text_dedup.utils.xxh3_128
An xxh3_128 represents the object used to calculate the XXH3_128 hash of a string of information.
Methods:
- update(input) – updates the current digest with an additional string
- digest() – return the current digest value
- hexdigest() – return the current digest as a string of hexadecimal digits
- intdigest() – return the current digest as an integer
- copy() – return a copy of the current xxh3_128 object
- block_size
Block size.
- copy() → xxh3_128 object
Return a copy ("clone") of the xxh3_128 object.
- digest() → string
Return the digest of the strings passed to the update() method so far. This is a 16-byte string which may contain non-ASCII characters, including null bytes.
- digest_size
Digest size.
- digestsize
Digest size.
- hexdigest() → string
Like digest(), but returns the digest as a string of hexadecimal digits.
- intdigest() → int
Like digest(), but returns the digest as an integer, i.e. the integer returned by the xxhash C API.
- name
Name. Always XXH3_128.
- reset()
Reset state.
- seed
Seed.
- update(input)
Update the xxh3_128 object with the string input. Repeated calls are equivalent to a single call with the concatenation of all the arguments.
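A hedged sketch of the incremental interface described above, relying only on the documented behavior that repeated update() calls are equivalent to hashing the concatenated input:
>>> from text_dedup.utils import xxh3_128
>>> h1 = xxh3_128()
>>> h1.update(b"hello ")
>>> h1.update(b"world")
>>> h2 = xxh3_128()
>>> h2.update(b"hello world")
>>> h1.digest() == h2.digest()
True
>>> len(h1.digest())
16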
- text_dedup.utils.xxh3_128_digest()
- text_dedup.utils.xxh3_16hash(data: bytes, seed: int = 0) → int
Generate a 16-bit xxhash-based hash value from the given data. As of python xxhash 3.3.0 (and since 0.3.0), the output is big-endian. This is useful as a special-purpose xxhash when you only want 16 bits; bit-masked xxh3_64 hashes are faster than xxh32 on modern systems.
Parameters¶
- data : bytes
The data to be hashed.
- seed : int
The seed value. All xxhash functions can be seeded; the default is 0.
Returns¶
- int
The hash value.
Examples¶
>>> xxh3_16hash(b"hello world")
39051
>>> xxh3_16hash(b"hello world", seed=42)
13198
>>> xxh3_16hash(b"hello world", seed=-42)
34281
- text_dedup.utils.xxh3_32hash(data: bytes, seed: int = 0) → int
Generate a 32-bit xxhash-based hash value from the given data. As of python xxhash 3.3.0 (and since 0.3.0), the output is big-endian. This is useful as a special-purpose xxhash when you only want 32 bits; bit-masked xxh3_64 hashes are faster than xxh32 on modern systems.
Parameters¶
- data : bytes
The data to be hashed.
- seed : int
The seed value. All xxhash functions can be seeded; the default is 0.
Returns¶
- int
The hash value.
Examples¶
>>> xxh3_32hash(b"hello world")
1088854155
>>> xxh3_32hash(b"hello world", seed=42)
3913102222
>>> xxh3_32hash(b"hello world", seed=-42)
3721037289
- class text_dedup.utils.xxh3_64
An xxh3_64 represents the object used to calculate the XXH3_64 hash of a string of information.
Methods:
- update(input) – updates the current digest with an additional string
- digest() – return the current digest value
- hexdigest() – return the current digest as a string of hexadecimal digits
- intdigest() – return the current digest as an integer
- copy() – return a copy of the current xxh3_64 object
- block_size
Block size.
- copy() → xxh3_64 object
Return a copy ("clone") of the xxh3_64 object.
- digest() → string
Return the digest of the strings passed to the update() method so far. This is an 8-byte string which may contain non-ASCII characters, including null bytes.
- digest_size
Digest size.
- digestsize
Digest size.
- hexdigest() → string
Like digest(), but returns the digest as a string of hexadecimal digits.
- intdigest() → int
Like digest(), but returns the digest as an integer, i.e. the integer returned by the xxhash C API.
- name
Name. Always XXH3_64.
- reset()
Reset state.
- seed
Seed.
- update(input)
Update the xxh3_64 object with the string input. Repeated calls are equivalent to a single call with the concatenation of all the arguments.
- text_dedup.utils.xxh3_64_digest()
- text_dedup.utils.xxh3_hash(data: bytes, d: int = 32) → int
Generate a d-bit xxhash-based hash value from the given data. As of python xxhash 3.3.0 (and since 0.3.0), the output is big-endian. This is useful as a general-purpose xxhash that supports multiple values of d.
Parameters¶
- data : bytes
The data to be hashed.
- d : int
The number of bits of the hash value. Based on this value, the empirically fastest xxh3 hasher is chosen.
Returns¶
- int
The hash value.
Examples¶
>>> xxh3_hash(b"hello world", 32)
1088854155
>>> xxh3_hash(b"hello world", 64)
15296390279056496779
>>> xxh3_hash(b"hello world", 128)
297150157938599054391163723952090887879