SimHash ======= This script is a simple implementation of the SimHash algorithm. Quick Start ----------- .. code-block:: bash usage: text_dedup.simhash [-h] --path PATH [--name NAME] [--data_dir DATA_DIR] [--data_files DATA_FILES] [--split SPLIT] [--cache_dir CACHE_DIR] [--revision REVISION] [--use_auth_token | --no-use_auth_token] [--local | --no-local] --output OUTPUT [--debug | --no-debug] --column COLUMN [--batch_size BATCH_SIZE] [--ngram NGRAM] [--f F] [--bit_diff BIT_DIFF] [--num_bucket NUM_BUCKET] Deduplicate text using simhash options: -h, --help show this help message and exit --path PATH `path` in load_dataset --name NAME `name` in load_dataset --data_dir DATA_DIR `data_dir` in load_dataset --data_files DATA_FILES `data_files` in load_dataset --split SPLIT `split` in load_dataset --cache_dir CACHE_DIR `cache_dir` in load_dataset --revision REVISION `revision` in load_dataset --use_auth_token, --no-use_auth_token `use_auth_token` in load_dataset --local, --no-local Use local dataset (default: False) --output OUTPUT Path to deduplicated dataset output --debug, --no-debug Whether to run in debug mode (default: False) --column COLUMN Text column to use for deduplication. Concatenate desired columns beforehand if needed. --batch_size BATCH_SIZE Batch size to use for dataset iteration. Mainly for memory efficiency. --ngram NGRAM Ngram size to use in SimHash. --f F Simhash bit size --bit_diff BIT_DIFF Bit difference to use in SimHash --num_bucket NUM_BUCKET Number of buckets to use in SimHash, must be larger than bit_diff Example ------- .. code-block:: bash python -m text_dedup.simhash \ --path "oscar-corpus/OSCAR-2201" \ --name "gl" \ --split "train" \ --cache_dir "./cache" \ --output "output/simhash/oscar_gl_dedup" \ --column "text" \ --batch_size 10000 \ --use_auth_token true API Reference ------------- .. automodule:: text_dedup.simhash :members: :undoc-members: :noindex: