Suffix Array Substring¶
Warning
Currently, there is an issue with the merge command in the original repo that might cause the processing to run single-threaded. You can apply this fix to the original repo to resolve the issue.
This is a wrapper around deduplicate-text-datasets that deduplicates text datasets using suffix array substring matching. Following the recommendation from the original research, duplicated substrings are removed from the dataset.
“In our paper we suggest just taking all of these duplicate sequences that have been identified and completely striking them from the dataset. This somewhat breaks the flow of text, for example if previously had an example “Alice wanted to go to the store” and we deduplicated at the level of 10 characters, we might completely strike “ to go to the “ and be left with “Alice wantedstore”. In practice we have found this doesn’t break the language model because we remove relatively little text, and so these breaks don’t cause harm.”
This wrapper adds one more step to the original script: it restores the text back into its document boundaries. This way, you still get the original documents, with the duplicated substrings removed, instead of one long string of text. However, this boundary-respecting step is not perfect and might not strike every duplicated byte sequence exactly, since the original script reports byte offsets while the text is made of Unicode characters. In such cases, erroneous bytes around the offsets or boundaries are ignored.
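For illustration, here is a minimal sketch of that restoration step using the clean_up and restore_and_merge helpers documented under API Reference below. The documents, boundaries, and duplicate offsets are made up for the example (ASCII-only, so byte and character offsets coincide); in the real pipeline the duplicate offsets come from the suffix array output of the google-research codebase.

from text_dedup.suffix_array import clean_up, restore_and_merge

# Hypothetical documents and their boundaries in the concatenated corpus.
docs = ["Alice wanted to go to the store", "Bob wanted to go to the store"]
boundaries = [slice(0, 31, None), slice(31, 60, None)]

# Hypothetical duplicate offsets, reported against the concatenated text:
# both cover the shared substring " to go to the ".
duplicates = [slice(12, 26, None), slice(41, 55, None)]

# Map the global offsets back into per-document slices and merge them.
per_doc_slices, duplicate_size = restore_and_merge(
    boundaries, duplicates, k=10, merge_strategy="overlapping"
)

# Strike the duplicated substrings from each document.
deduped = [clean_up(doc, slices) for doc, slices in zip(docs, per_doc_slices)]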
Quick Start¶
python -m text_dedup.suffix_array --help
usage: text-dedup.suffixarray [-h] --path PATH [--name NAME] [--data_dir DATA_DIR] [--data_files DATA_FILES]
[--split SPLIT] [--cache_dir CACHE_DIR] [--revision REVISION]
[--use_auth_token | --no-use_auth_token] [--local | --no-local] --output OUTPUT
[--debug | --no-debug] --column COLUMN [--batch_size BATCH_SIZE] [--k K]
[--strategy {overlapping,longest}] --google_repo_path GOOGLE_REPO_PATH
Deduplicate text using Suffix Array Deduplication
- options:
- -h, --help
show this help message and exit
- --path PATH
path in load_dataset (default: None)
- --name NAME
name in load_dataset (default: None)
- --data_dir DATA_DIR
data_dir in load_dataset (default: None)
- --data_files DATA_FILES
data_files in load_dataset (default: None)
- --split SPLIT
split in load_dataset (default: None)
- --cache_dir CACHE_DIR
cache_dir in load_dataset (default: .cache)
- --revision REVISION
revision in load_dataset (default: None)
- --use_auth_token, --no-use_auth_token
use_auth_token in load_dataset (default: None)
- --local, --no-local
Use local dataset (default: False)
- --output OUTPUT
Path to deduplicated dataset output (default: None)
- --debug, --no-debug
Whether to run in debug mode (default: False)
- --column COLUMN
Text column to use for deduplication. Concatenate desired columns beforehand if needed. (default: None)
- --batch_size BATCH_SIZE
Batch size to use for dataset iteration. Mainly for memory efficiency. (default: 10000)
- --k K
Minimum byte length of a duplicate substring in Suffix Array Deduplication (default: 100)
- --strategy {overlapping,longest}
Strategy when there are overlapping duplicate substrings (default: overlapping)
- --google_repo_path GOOGLE_REPO_PATH
Path to google-research-deduplication codebase (default: None)
Example¶
python -m text_dedup.suffix_array \
--path "oscar-corpus/OSCAR-2201" \
--name "gl" \
--split "train" \
--use_auth_token true --cache_dir "./cache" \
--output "output" \
--column "text" \
--google_repo_path "deduplicate-text-datasets"
API Reference¶
- text_dedup.suffix_array.clean_up(text: str, slices: list[slice]) → str
Remove duplicate substrings from the text.
Parameters¶
- text : str
Text to remove duplicate substrings from.
- slices : List[slice]
List of slices to remove.
Returns¶
- str
Text with duplicate substrings removed.
Examples¶
>>> clean_up("This is a test.", [slice(0, 4, None), slice(5, 7, None)])
' a test.'
- text_dedup.suffix_array.merge_intervals(intervals: list[slice], merge_strategy: Literal['longest', 'overlapping'] = 'longest') → list[slice]
Merge overlapping intervals.
Parameters¶
- intervals : List[slice]
List of intervals
- merge_strategy : Literal["longest", "overlapping"]
Strategy to merge intervals, by default "longest". "overlapping": merge any overlapping intervals. "longest": only drop substrings that are fully contained in another duplicate; this is useful because when [2, 4] and [3, 5] are duplicates, [2, 5] might not be.
Returns¶
- List[slice]
List of merged intervals
Examples¶
>>> merge_intervals(
...     [
...         slice(0, 10, None),
...         slice(1, 11, None),
...         slice(2, 12, None),
...         slice(3, 13, None),
...         slice(4, 14, None),
...         slice(5, 15, None),
...         slice(6, 16, None),
...         slice(7, 21, None),
...     ],
...     merge_strategy="overlapping",
... )
[slice(0, 21, None)]
>>> merge_intervals(
...     [
...         slice(0, 10, None),
...         slice(1, 11, None),
...         slice(2, 12, None),
...         slice(3, 13, None),
...         slice(4, 14, None),
...         slice(5, 15, None),
...         slice(6, 16, None),
...         slice(7, 21, None),
...     ],
...     merge_strategy="longest",
... )
[slice(0, 10, None), slice(1, 11, None), slice(2, 12, None), ... slice(7, 21, None)]
>>> merge_intervals([slice(0, 2), slice(2, 4), slice(4, 5)], "overlapping")
[slice(0, 5, None)]
>>> merge_intervals([slice(0, 4), slice(2, 4), slice(4, 5)], "longest")
[slice(0, 4, None), slice(4, 5, None)]
>>> merge_intervals(
...     [slice(0, 10, None), slice(0, 10, None), slice(0, 10, None), slice(0, 10, None), slice(0, 10, None)]
... )
[slice(0, 10, None)]
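A small illustration of the distinction described under merge_strategy, assuming the behaviour shown in the examples above: [2, 4) and [3, 5) overlap but neither contains the other, so "overlapping" fuses them while "longest" keeps both.

from text_dedup.suffix_array import merge_intervals

# Two overlapping duplicate spans, neither contained in the other.
spans = [slice(2, 4, None), slice(3, 5, None)]

# "overlapping" fuses them, even though the merged span [2, 5) may never
# have been reported as a duplicate itself.
merge_intervals(spans, "overlapping")  # expected: [slice(2, 5, None)]

# "longest" only drops spans fully contained in another, so both survive.
merge_intervals(spans, "longest")      # expected: [slice(2, 4, None), slice(3, 5, None)]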
- text_dedup.suffix_array.restore(boundaries: Sequence[slice], segments: str | Path | Sequence[slice]) → Generator
Restore the duplicate slices from segments to their original document boundaries.
Parameters¶
- boundaries : List[slice]
List of document boundary offsets as slices.
- segments : Union[str, List[slice]]
Path to the segmented file with duplicate offsets or a list of duplicate slices.
Yields¶
- int, slice
Document index and the duplicate slice restored to that document's local offsets.
Examples¶
>>> list(
...     restore(
...         [slice(0, 10, None), slice(10, 20, None)],
...         [slice(0, 5, None), slice(5, 10, None), slice(5, 15, None), slice(5, 19, None)],
...     )
... )
[(0, slice(0, 5, None)), (0, slice(5, 10, None)), (1, slice(0, 5, None)), (1, slice(0, 9, None))]
- text_dedup.suffix_array.restore_and_merge(boundaries: Sequence[slice], segments: str | Path | Sequence[slice], k: int, merge_strategy: Literal['longest', 'overlapping'] = 'longest') → tuple[list[list[slice]], int]
Restore the duplicate slices from segments to their original document boundaries and merge them.
Parameters¶
- boundaries : List[slice]
List of document boundary offsets as slices.
- segments : Union[str, List[slice]]
Path to the segmented file with duplicate offsets or a list of duplicate slices.
- k : int
The minimum duplicate substring byte length.
- merge_strategy : Literal["longest", "overlapping"], optional
The merge strategy to use, by default “longest”
Returns¶
- Tuple[List[List[slice]], int]
List of merged slices for each document and the duplicate size.
Examples¶
>>> restore_and_merge(
...     [slice(0, 10, None), slice(10, 20, None)],
...     [slice(0, 5, None), slice(5, 10, None), slice(12, 19, None)],
...     5,
...     "longest",
... )
([[slice(0, 5, None), slice(5, 10, None)], [slice(2, 9, None)]], 17)
>>> restore_and_merge(
...     [slice(0, 10, None), slice(10, 20, None)],
...     [slice(0, 5, None), slice(5, 10, None), slice(12, 19, None)],
...     5,
...     "overlapping",
... )
([[slice(0, 10, None)], [slice(2, 9, None)]], 17)