Suffix Array Substring

Warning

Currently, there is an issue with the merge command in the original repo, which might cause the processing to be single-threaded. You can apply this fix to the original repo to resolve the issue.

This is a wrapper around deduplicate-text-datasets to deduplicate text datasets using suffix array substring matching. Following the recommendation from the original research, duplicate substrings are removed from the dataset.

“In our paper we suggest just taking all of these duplicate sequences that have been identified and completely striking them from the dataset. This somewhat breaks the flow of text, for example if previously had an example “Alice wanted to go to the store” and we deduplicated at the level of 10 characters, we might completely strike “ to go to the “ and be left with “Alice wantedstore”. In practice we have found this doesn’t break the language model because we remove relatively little text, and so these breaks don’t cause harm.”

This wrapper adds one more step to the original script by restoring the text to its original document boundaries. This way, you still get the original documents, but with the duplicate substrings removed, instead of one long string of text. However, this boundary-respecting step is not perfect and might not remove every duplicated byte sequence, since the original script yields byte offsets while normal text uses Unicode characters. In such cases, erroneous bytes around the offsets or boundaries are ignored.
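
To illustrate the idea, here is a minimal sketch (not the wrapper's actual code; strike_duplicates is a hypothetical helper) of removing duplicate byte ranges from a single document while ignoring any partial UTF-8 sequences left at the cut points:

def strike_duplicates(doc: bytes, dup_slices: list[slice]) -> str:
    # Remove duplicate byte ranges from one document, then decode leniently
    # so that broken bytes around the offsets or boundaries are ignored.
    kept = bytearray()
    cursor = 0
    for s in sorted(dup_slices, key=lambda s: s.start):
        kept.extend(doc[cursor:s.start])  # keep the bytes before the duplicate
        cursor = max(cursor, s.stop)      # skip the duplicated bytes
    kept.extend(doc[cursor:])
    return kept.decode("utf-8", errors="ignore")

text = "Alice wanted to go to the store".encode("utf-8")
print(strike_duplicates(text, [slice(12, 26)]))  # Alice wantedstore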

Quick Start

python -m text_dedup.suffix_array --help

usage: text-dedup.suffixarray [-h] --path PATH [--name NAME] [--data_dir DATA_DIR] [--data_files DATA_FILES]
                           [--split SPLIT] [--cache_dir CACHE_DIR] [--revision REVISION]
                           [--use_auth_token | --no-use_auth_token] [--local | --no-local] --output OUTPUT
                           [--debug | --no-debug] --column COLUMN [--batch_size BATCH_SIZE] [--k K]
                           [--strategy {overlapping,longest}] --google_repo_path GOOGLE_REPO_PATH

Deduplicate text using Suffix Array Deduplication

options:
-h, --help

show this help message and exit

--path PATH

path in load_dataset (default: None)

--name NAME

name in load_dataset (default: None)

--data_dir DATA_DIR

data_dir in load_dataset (default: None)

--data_files DATA_FILES

data_files in load_dataset (default: None)

--split SPLIT

split in load_dataset (default: None)

--cache_dir CACHE_DIR

cache_dir in load_dataset (default: .cache)

--revision REVISION

revision in load_dataset (default: None)

--use_auth_token, --no-use_auth_token

use_auth_token in load_dataset (default: None)

--local, --no-local

Use local dataset (default: False)

--output OUTPUT

Path to deduplicated dataset output (default: None)

--debug, --no-debug

Whether to run in debug mode (default: False)

--column COLUMN

Text column to use for deduplication. Concatenate desired columns beforehand if needed. (default: None)

--batch_size BATCH_SIZE

Batch size to use for dataset iteration. Mainly for memory efficiency. (default: 10000)

--k K

Minimum byte length of a duplicate substring in Suffix Array Deduplication (default: 100)

--strategy {overlapping,longest}

Strategy when there are overlapping duplicate substrings (default: overlapping)

--google_repo_path GOOGLE_REPO_PATH

Path to google-research-deduplication codebase (default: None)

Example

python -m text_dedup.suffix_array \
 --path "oscar-corpus/OSCAR-2201" \
 --name "gl" \
 --split "train" \
 --use_auth_token \
 --cache_dir "./cache" \
 --output "output" \
 --column "text" \
 --google_repo_path "deduplicate-text-datasets"

API Reference

text_dedup.suffix_array.clean_up(text: str, slices: list[slice]) -> str

Remove duplicate substrings from the text.

Parameters

text : str

Text to remove duplicate substrings from.

slices : List[slice]

List of slices to remove.

Returns

str

Text with duplicate substrings removed.

Examples

>>> clean_up("This is a test.", [slice(0, 4, None), slice(5, 7, None)])
'  a test.'
text_dedup.suffix_array.merge_intervals(intervals: list[slice], merge_strategy: Literal['longest', 'overlapping'] = 'longest') -> list[slice]

Merge overlapping intervals.

Parameters

intervals : List[slice]

List of intervals

merge_strategy : Literal["longest", "overlapping"]

Strategy to merge intervals, by default "longest".
"overlapping": merge overlapping intervals.
"longest": only drop exact duplicate intervals and keep overlapping ones as-is; this is useful because when [2, 4] and [3, 5] are duplicates, [2, 5] might not be.

Returns

List[slice]

List of merged intervals

Examples

>>> merge_intervals(
...     [
...         slice(0, 10, None),
...         slice(1, 11, None),
...         slice(2, 12, None),
...         slice(3, 13, None),
...         slice(4, 14, None),
...         slice(5, 15, None),
...         slice(6, 16, None),
...         slice(7, 21, None),
...     ],
...     merge_strategy="overlapping",
... )
[slice(0, 21, None)]
>>> merge_intervals(
...     [
...         slice(0, 10, None),
...         slice(1, 11, None),
...         slice(2, 12, None),
...         slice(3, 13, None),
...         slice(4, 14, None),
...         slice(5, 15, None),
...         slice(6, 16, None),
...         slice(7, 21, None),
...     ],
...     merge_strategy="longest",
... )  
[slice(0, 10, None), slice(1, 11, None), slice(2, 12, None), ... slice(7, 21, None)]
>>> merge_intervals([slice(0, 2), slice(2, 4), slice(4, 5)], "overlapping")
[slice(0, 5, None)]
>>> merge_intervals([slice(0, 4), slice(2, 4), slice(4, 5)], "longest")
[slice(0, 4, None), slice(4, 5, None)]
>>> merge_intervals(
...     [slice(0, 10, None), slice(0, 10, None), slice(0, 10, None), slice(0, 10, None), slice(0, 10, None)]
... )
[slice(0, 10, None)]
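
A small sketch combining the two helpers above: overlapping duplicate spans are merged first and then struck from the text (illustrative values; only the documented behaviour is assumed).

from text_dedup.suffix_array import clean_up, merge_intervals

dups = [slice(5, 14), slice(8, 18)]                 # overlapping duplicate spans
merged = merge_intervals(dups, "overlapping")       # [slice(5, 18, None)]
print(clean_up("Alice wanted to go home", merged))  # Alice home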
text_dedup.suffix_array.restore(boundaries: Sequence[slice], segments: str | Path | Sequence[slice]) -> Generator

Restore the duplicate slices from segments to their original document boundaries.

Parameters

boundaries : List[slice]

List of document boundary offsets as slices.

segments : Union[str, List[slice]]

Path to the segmented file with duplicate offsets or a list of duplicate slices.

Yields

int, slice

Document index and the duplicate slice offset relative to that document.

Examples

>>> list(
...     restore(
...         [slice(0, 10, None), slice(10, 20, None)],
...         [slice(0, 5, None), slice(5, 10, None), slice(5, 15, None), slice(5, 19, None)],
...     )
... )
[(0, slice(0, 5, None)), (0, slice(5, 10, None)), (1, slice(0, 5, None)), (1, slice(0, 9, None))]
text_dedup.suffix_array.restore_and_merge(boundaries: Sequence[slice], segments: str | Path | Sequence[slice], k: int, merge_strategy: Literal['longest', 'overlapping'] = 'longest') -> tuple[list[list[slice]], int]

Restore the duplicate slices from segments to their original document boundaries and merge them.

Parameters

boundaries : List[slice]

List of document boundary offsets as slices.

segments : Union[str, List[slice]]

Path to the segmented file with duplicate offsets or a list of duplicate slices.

k : int

The minimum duplicate substring byte length.

merge_strategy : Literal["longest", "overlapping"], optional

The merge strategy to use, by default “longest”

Returns

Tuple[List[List[slice]], int]

List of merged duplicate slices for each document, and the total size of the duplicates in bytes.

Examples

>>> restore_and_merge(
...     [slice(0, 10, None), slice(10, 20, None)],
...     [slice(0, 5, None), slice(5, 10, None), slice(12, 19, None)],
...     5,
...     "longest",
... )
([[slice(0, 5, None), slice(5, 10, None)], [slice(2, 9, None)]], 17)
>>> restore_and_merge(
...     [slice(0, 10, None), slice(10, 20, None)],
...     [slice(0, 5, None), slice(5, 10, None), slice(12, 19, None)],
...     5,
...     "overlapping",
... )
([[slice(0, 10, None)], [slice(2, 9, None)]], 17)
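
Putting the helpers together, an end-to-end sketch with made-up document contents, boundaries, and duplicate offsets: duplicate byte ranges reported against the concatenated corpus are restored to per-document slices and then removed from each document.

from text_dedup.suffix_array import clean_up, restore_and_merge

docs = ["Alice went home", "Alice went out"]
# Byte boundaries of each document in the concatenated corpus (illustrative).
boundaries = [slice(0, 15), slice(15, 29)]
# Duplicate byte ranges against the concatenated corpus (illustrative).
segments = [slice(0, 10), slice(15, 25)]

slices_per_doc, dup_size = restore_and_merge(boundaries, segments, k=5, merge_strategy="longest")
deduped = [clean_up(doc, s) for doc, s in zip(docs, slices_per_doc)]
print(deduped)   # [' home', ' out'] (byte and character offsets coincide for ASCII text)
print(dup_size)  # 20 duplicated bytes in total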