Line-level Exact Hash

text_dedup.ccnet.compute_hashes(batch: Dict[str, Any], idx: List[int] | None, column: str, hash_func: Callable, idx_column: str | None = None) → Dict[str, Any]

Compute a hash for each line in the document.

Parameters

batch : Dict[str, Any]

A batch of one example.

idx : List[int] | None

The index of the example in the dataset.

column : str

The column name of the text.

hash_func : Callable

The hash function to use.

idx_column : str | None

The column name of the index.

Returns

Dict[str, Any]

A dictionary containing the hashes, the index of the example, and the index of the lines.
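
The following is a minimal sketch of line-level hashing, not the library's implementation. The function name `sketch_line_hashes`, the output keys, the whitespace/lowercase normalization, and the choice of SHA-1 are all assumptions made for illustration.

```python
import hashlib
from typing import Any, Dict, List


def sketch_line_hashes(text: str, example_idx: int) -> Dict[str, List[Any]]:
    """Hash every line of one document (hypothetical helper, for illustration)."""
    lines = text.split("\n")
    # Assumed normalization: strip surrounding whitespace and lowercase,
    # so lines that differ only in case or padding share a hash.
    hashes = [
        hashlib.sha1(line.strip().lower().encode("utf-8")).hexdigest()
        for line in lines
    ]
    return {
        "hash": hashes,                             # one hash per line
        "example_idx": [example_idx] * len(lines),  # owning example, repeated per line
        "line_idx": list(range(len(lines))),        # position of each line in the document
    }


out = sketch_line_hashes("Hello\nworld\nhello", example_idx=7)
```

Each line yields one row holding its hash, the example index, and its line index, which is the shape described under Returns. In this sketch the first and third lines hash identically because of the assumed normalization.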

text_dedup.ccnet.dedup(record: Dict[str, Any], idx: int | None, column: str, lookup: Dict, idx_column: str | None = None) → Dict[str, Any]

Remove duplicated lines from the document.

Parameters

record : Dict[str, Any]

A record of one example.

idx : int | None

The index of the example in the dataset.

column : str

The column name of the text.

lookup : Dict

A dictionary containing duplicated (example index, line index) pairs.

idx_column : str | None

The column name of the index.

Returns

Dict[str, Any]

A dictionary containing the deduplicated record.
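
As a rough sketch of how such a lookup could drive line removal (again, not the library's code): the helper name `sketch_dedup`, the tuple-keyed lookup, and the `True` values are assumptions chosen for illustration.

```python
from typing import Any, Dict, Tuple


def sketch_dedup(
    record: Dict[str, Any],
    example_idx: int,
    column: str,
    lookup: Dict[Tuple[int, int], bool],
) -> Dict[str, Any]:
    """Drop lines whose (example index, line index) pair is in the lookup."""
    lines = record[column].split("\n")
    kept = [
        line
        for line_idx, line in enumerate(lines)
        if (example_idx, line_idx) not in lookup  # keep only lines not flagged as duplicates
    ]
    # Return a copy of the record with the text column rewritten.
    return {**record, column: "\n".join(kept)}


record = {"text": "header\nunique body\nheader"}
lookup = {(0, 2): True}  # assumed shape: line 2 of example 0 is a duplicate
result = sketch_dedup(record, example_idx=0, column="text", lookup=lookup)
```

Here the second occurrence of `header` is flagged in the lookup, so only the first two lines survive in `result["text"]`.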