Line-level Exact Hash
- text_dedup.ccnet.compute_hashes(batch: Dict[str, Any], idx: List[int] | None, column: str, hash_func: Callable, idx_column: str | None = None) → Dict[str, Any]
Compute a hash for each line in the document.
Parameters
- batch: Dict[str, Any]
A batch of one example.
- idx: List[int] | None
The index of the example in the dataset.
- column: str
The column name of the text.
- hash_func: Callable
The hash function to use.
- idx_column: str | None
The column name of the index.
Returns
- Dict[str, Any]
A dictionary containing the hash of each line, the index of the example, and the index of each line within the example.
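The description above can be illustrated with a minimal, standalone sketch of line-level hashing. It does not call text_dedup; the function name `sketch_compute_hashes`, the helper `sha1_digest`, and the output keys `hashes`, `example_idx`, and `line_idx` are hypothetical names chosen for illustration, and splitting on `"\n"` is an assumption about how lines are delimited.

```python
import hashlib
from typing import Any, Callable, Dict, List


def sha1_digest(data: bytes) -> bytes:
    """Hypothetical hash function: SHA-1 digest of the raw bytes."""
    return hashlib.sha1(data).digest()


def sketch_compute_hashes(
    batch: Dict[str, Any],
    idx: List[int],
    column: str,
    hash_func: Callable[[bytes], bytes],
) -> Dict[str, Any]:
    """Hash every line of the single example held in the batch."""
    # Assumption: the batch holds exactly one example and lines are "\n"-separated.
    lines = batch[column][0].split("\n")
    example_idx = idx[0]
    return {
        # One hash per line.
        "hashes": [hash_func(line.encode("utf-8")) for line in lines],
        # The example index, repeated once per line.
        "example_idx": [example_idx] * len(lines),
        # The position of each line inside the example.
        "line_idx": list(range(len(lines))),
    }


batch = {"text": ["hello\nworld\nhello"]}
out = sketch_compute_hashes(batch, idx=[7], column="text", hash_func=sha1_digest)
```

In this sketch the first and third lines are identical, so they receive the same hash, which is what lets a later pass identify them as duplicates.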
- text_dedup.ccnet.dedup(record: Dict[str, Any], idx: int | None, column: str, lookup: Dict, idx_column: str | None = None) → Dict[str, Any]
Remove duplicated lines from the document.
Parameters
- record: Dict[str, Any]
A record of one example.
- idx: int | None
The index of the example in the dataset.
- column: str
The column name of the text.
- lookup: Dict
A dictionary containing duplicated (example index, line index) pairs.
- idx_column: str | None
The column name of the index.
Returns
- Dict[str, Any]
A dictionary containing the deduplicated record.
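As a rough illustration of the removal step, the sketch below drops every line whose (example index, line index) pair appears in a lookup. It is a standalone approximation rather than the library's code: the name `sketch_dedup` is hypothetical, the lookup is assumed to be a dict keyed by tuples (the values are ignored here), and lines are assumed to be `"\n"`-separated.

```python
from typing import Any, Dict, Tuple


def sketch_dedup(
    record: Dict[str, Any],
    idx: int,
    column: str,
    lookup: Dict[Tuple[int, int], Any],
) -> Dict[str, Any]:
    """Return a copy of the record with flagged lines removed."""
    lines = record[column].split("\n")
    # Keep a line only if its (example index, line index) pair is not flagged.
    kept = [line for j, line in enumerate(lines) if (idx, j) not in lookup]
    # Build a new dict so the input record is left untouched.
    return {**record, column: "\n".join(kept)}


record = {"text": "header\nbody\nheader"}
# Assumption: line 2 of example 3 was flagged as a duplicate by an earlier pass.
lookup = {(3, 2): True}
result = sketch_dedup(record, idx=3, column="text", lookup=lookup)
```

Here the repeated `header` line at position 2 is removed while the first occurrence at position 0 is kept, because only the flagged pair is present in the lookup.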