Suffix Array Substring¶
Warning
Currently, there is an issue with the merge command in the original repo that might cause the processing to run single-threaded. You can apply this fix to the original repo to resolve the issue.
This is a wrapper around deduplicate-text-datasets that deduplicates text datasets using suffix array substring matching. Following the recommendation from the original research, duplicated substrings are removed from the dataset.
“In our paper we suggest just taking all of these duplicate sequences that have been identified and completely striking them from the dataset. This somewhat breaks the flow of text, for example if previously had an example “Alice wanted to go to the store” and we deduplicated at the level of 10 characters, we might completely strike “ to go to the “ and be left with “Alice wantedstore”. In practice we have found this doesn’t break the language model because we remove relatively little text, and so these breaks don’t cause harm.”
This wrapper adds one more step to the original script: it restores the text back into its document boundaries. This way, you still get the original documents, with the duplicated substrings removed, instead of one long string of text. However, this boundary-respecting step is not perfect and might not strike every duplicated byte sequence exactly, since the original script reports byte offsets while the text is made of Unicode characters. In such cases, erroneous bytes around the offsets or boundaries are ignored.
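For illustration, here is a minimal sketch of that restoration step using the clean_up and restore_and_merge helpers documented under API Reference below. The documents, boundaries, and duplicate offsets are made up for the example (ASCII-only, so byte and character offsets coincide); in the real pipeline the duplicate offsets come from the suffix array output of the google-research codebase.

from text_dedup.suffix_array import clean_up, restore_and_merge

# Hypothetical documents and their boundaries in the concatenated corpus.
docs = ["Alice wanted to go to the store", "Bob wanted to go to the store"]
boundaries = [slice(0, 31, None), slice(31, 60, None)]

# Hypothetical duplicate offsets, reported against the concatenated text:
# both cover the shared substring " to go to the ".
duplicates = [slice(12, 26, None), slice(41, 55, None)]

# Map the global offsets back into per-document slices and merge them.
per_doc_slices, duplicate_size = restore_and_merge(
    boundaries, duplicates, k=10, merge_strategy="overlapping"
)

# Strike the duplicated substrings from each document.
deduped = [clean_up(doc, slices) for doc, slices in zip(docs, per_doc_slices)]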
Quick Start¶
python -m text_dedup.suffix_array --help
usage: text-dedup.suffixarray [-h] --path PATH [--name NAME] [--data_dir DATA_DIR] [--data_files DATA_FILES]
[--split SPLIT] [--cache_dir CACHE_DIR] [--revision REVISION]
[--use_auth_token | --no-use_auth_token] [--local | --no-local] --output OUTPUT
[--debug | --no-debug] --column COLUMN [--batch_size BATCH_SIZE] [--k K]
[--strategy {overlapping,longest}] --google_repo_path GOOGLE_REPO_PATH
Deduplicate text using Suffix Array Deduplication
- options:
- -h, --help
show this help message and exit
- --path PATH
path in load_dataset (default: None)
- --name NAME
name in load_dataset (default: None)
- --data_dir DATA_DIR
data_dir in load_dataset (default: None)
- --data_files DATA_FILES
data_files in load_dataset (default: None)
- --split SPLIT
split in load_dataset (default: None)
- --cache_dir CACHE_DIR
cache_dir in load_dataset (default: .cache)
- --revision REVISION
revision in load_dataset (default: None)
- --use_auth_token, --no-use_auth_token
use_auth_token in load_dataset (default: None)
- --local, --no-local
Use local dataset (default: False)
- --output OUTPUT
Path to deduplicated dataset output (default: None)
- --debug, --no-debug
Whether to run in debug mode (default: False)
- --column COLUMN
Text column to use for deduplication. Concatenate desired columns beforehand if needed. (default: None)
- --batch_size BATCH_SIZE
Batch size to use for dataset iteration. Mainly for memory efficiency. (default: 10000)
- --k K
Minimum byte length of a duplicate substring in Suffix Array Deduplication (default: 100)
- --strategy {overlapping,longest}
Strategy when there are overlapping duplicate substrings (default: overlapping)
- --google_repo_path GOOGLE_REPO_PATH
Path to google-research-deduplication codebase (default: None)
Example¶
python -m text_dedup.suffix_array \
--path "oscar-corpus/OSCAR-2201" \
--name "gl" \
--split "train" \
--use_auth_token true --cache_dir "./cache" \
--output "output" \
--column "text" \
--google_repo_path "deduplicate-text-datasets"
API Reference¶
- text_dedup.suffix_array.clean_up(text: str, slices: list[slice]) → str
Remove duplicate substrings from the text.
Parameters¶
- text : str
Text to remove duplicate substrings from.
- slices : List[slice]
List of slices to remove.
Returns¶
- str
Text with duplicate substrings removed.
Examples¶
>>> clean_up("This is a test.", [slice(0, 4, None), slice(5, 7, None)])
' a test.'
- text_dedup.suffix_array.merge_intervals(intervals: list[slice], merge_strategy: Literal['longest', 'overlapping'] = 'longest') → list[slice]
Merge overlapping intervals.
Parameters¶
- intervals : List[slice]
List of intervals
- merge_strategy : Literal["longest", "overlapping"]
Strategy to merge intervals, by default "longest". "overlapping": merge any overlapping intervals. "longest": only drop substrings that are fully contained in another duplicate; this is useful because when [2, 4] and [3, 5] are duplicates, [2, 5] might not be.
Returns¶
- List[slice]
List of merged intervals
Examples¶
>>> merge_intervals(
...     [
...         slice(0, 10, None),
...         slice(1, 11, None),
...         slice(2, 12, None),
...         slice(3, 13, None),
...         slice(4, 14, None),
...         slice(5, 15, None),
...         slice(6, 16, None),
...         slice(7, 21, None),
...     ],
...     merge_strategy="overlapping",
... )
[slice(0, 21, None)]
>>> merge_intervals(
...     [
...         slice(0, 10, None),
...         slice(1, 11, None),
...         slice(2, 12, None),
...         slice(3, 13, None),
...         slice(4, 14, None),
...         slice(5, 15, None),
...         slice(6, 16, None),
...         slice(7, 21, None),
...     ],
...     merge_strategy="longest",
... )
[slice(0, 10, None), slice(1, 11, None), slice(2, 12, None), ... slice(7, 21, None)]
>>> merge_intervals([slice(0, 2), slice(2, 4), slice(4, 5)], "overlapping")
[slice(0, 5, None)]
>>> merge_intervals([slice(0, 4), slice(2, 4), slice(4, 5)], "longest")
[slice(0, 4, None), slice(4, 5, None)]
>>> merge_intervals(
...     [slice(0, 10, None), slice(0, 10, None), slice(0, 10, None), slice(0, 10, None), slice(0, 10, None)]
... )
[slice(0, 10, None)]
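A small illustration of the distinction described under merge_strategy, assuming the behaviour shown in the examples above: [2, 4) and [3, 5) overlap but neither contains the other, so "overlapping" fuses them while "longest" keeps both.

from text_dedup.suffix_array import merge_intervals

# Two overlapping duplicate spans, neither contained in the other.
spans = [slice(2, 4, None), slice(3, 5, None)]

# "overlapping" fuses them, even though the merged span [2, 5) may never
# have been reported as a duplicate itself.
merge_intervals(spans, "overlapping")  # expected: [slice(2, 5, None)]

# "longest" only drops spans fully contained in another, so both survive.
merge_intervals(spans, "longest")      # expected: [slice(2, 4, None), slice(3, 5, None)]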
- text_dedup.suffix_array.restore(boundaries: Sequence[slice], segments: str | Path | Sequence[slice]) → Generator
Restore the duplicate slices from segments to their original document boundaries.
Parameters¶
- boundaries : List[slice]
List of document boundary offsets as slices.
- segments : Union[str, List[slice]]
Path to the segmented file with duplicate offsets or a list of duplicate slices.
Yields¶
- int, slice
Document index and the duplicate slice restored to that document's local offsets.
Examples¶
>>> list(
...     restore(
...         [slice(0, 10, None), slice(10, 20, None)],
...         [slice(0, 5, None), slice(5, 10, None), slice(5, 15, None), slice(5, 19, None)],
...     )
... )
[(0, slice(0, 5, None)), (0, slice(5, 10, None)), (1, slice(0, 5, None)), (1, slice(0, 9, None))]
- text_dedup.suffix_array.restore_and_merge(boundaries: Sequence[slice], segments: str | Path | Sequence[slice], k: int, merge_strategy: Literal['longest', 'overlapping'] = 'longest') → tuple[list[list[slice]], int]
Restore the duplicate slices from segments to their original document boundaries and merge them.
Parameters¶
- boundaries : List[slice]
List of document boundary offsets as slices.
- segments : Union[str, List[slice]]
Path to the segmented file with duplicate offsets or a list of duplicate slices.
- k : int
The minimum duplicate substring byte length.
- merge_strategy : Literal["longest", "overlapping"], optional
The merge strategy to use, by default “longest”
Returns¶
- Tuple[List[List[slice]], int]
List of merged slices for each document and the duplicate size.
Examples¶
>>> restore_and_merge(
...     [slice(0, 10, None), slice(10, 20, None)],
...     [slice(0, 5, None), slice(5, 10, None), slice(12, 19, None)],
...     5,
...     "longest",
... )
([[slice(0, 5, None), slice(5, 10, None)], [slice(2, 9, None)]], 17)
>>> restore_and_merge(
...     [slice(0, 10, None), slice(10, 20, None)],
...     [slice(0, 5, None), slice(5, 10, None), slice(12, 19, None)],
...     5,
...     "overlapping",
... )
([[slice(0, 10, None)], [slice(2, 9, None)]], 17)