The dedup.py script executes the following stages:
- MinHash generation
- Duplicate pairs generation
- Duplicate graph construction
- Generation of the final list of duplicates
Before You Begin
Ensure the Cerebras Model Zoo and its dependencies are installed if you're running on a Cerebras Wafer-Scale cluster (specifically, the prerequisites listed here). If you'd like to run this locally, install the pipeline's dependencies first.
Run the Script
To run the deduplication pipeline, execute the following command:
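The invocation below is a sketch; the flag names (--input_dir, --format, --jsonl_key, --output_dir) and paths are assumptions, so consult the script's argument parser for the actual interface.

```bash
# Hypothetical flags -- verify against the argument parser in dedup.py.
python dedup.py \
    --input_dir ./raw_dataset \
    --format jsonl \
    --jsonl_key text \
    --output_dir ./deduplicated_dataset
```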
The deduplicated output is written as compressed .jsonl.zst files. By default, each compressed file is 16 MB in size, and the jsonl_key is text.
(Optional) Manual Instructions
While the dedup.py script runs the deduplication pipeline end-to-end, you can also run each stage individually if needed.
MinHash Generation
MinHash generation can be a very slow process. We recommend running it separately before creating a MinHashLSH index. To compute a MinHash object for each document, we strip and lowercase the text and remove punctuation, consecutive spaces, newlines, and tabs. We then construct a list of 13-grams that are later used as features to create a document signature, which is added to the MinHashLSH index. (More details about MinHash can be found at Identifying and Filtering Near-Duplicate Documents.) We also apply NFC normalization and filter out short documents before yielding them.

For custom datasets, you also need to specify the jsonl_key as well as the format of the dataset. By default, the jsonl_key is set to text and the format to jsonl. This assumes that the dataset is located in a common place that is accessible by all the machines.
Duplicate Pairs Generation
In this step, we build a MinHashLSH index and query it to locate near duplicates. (More reading: Chapter 3, Mining of Massive Datasets.) We use a Jaccard similarity threshold of 0.8 by default to determine whether a pair of documents should be considered duplicates, but you can adjust it according to your own needs.
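A minimal sketch of this step, again assuming datasketch; the 0.8 threshold matches the default described above, while num_perm is an assumption.

```python
# Query a MinHashLSH index to collect near-duplicate document pairs.
from datasketch import MinHashLSH


def find_duplicate_pairs(minhashes, threshold=0.8, num_perm=128):
    # minhashes: mapping of document id -> datasketch.MinHash signature.
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    pairs = set()
    for doc_id, signature in minhashes.items():
        # Query before inserting so each pair is reported exactly once.
        for match in lsh.query(signature):
            pairs.add((match, doc_id))
        lsh.insert(doc_id, signature)
    return pairs
```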
Duplicate Graph Construction & Search for Connected Components
After locating duplicate pairs, we need to find connected components containing documents that are duplicates of each other. To make this more illustrative, consider these pairs: (A, B), (A, C), (A, E). We form a cluster of (A, B, C, E) and keep only one document from the component. We evaluated the performance and memory consumption of networkx, graphtool, and networkit; networkit offered the most efficient implementation, as it is designed to work with large graphs and features great parallelism.
Below you can find an example of how to construct a graph from document pairs and extract its connected components:
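This is a minimal networkit sketch, assuming duplicate pairs already mapped to integer document ids; the repository's actual graph-construction script is not reproduced here.

```python
import networkit as nk

# Duplicate pairs from the previous stage, e.g. (A, B), (A, C), (A, E).
pairs = [(0, 1), (0, 2), (0, 4)]
num_docs = 5

graph = nk.Graph(num_docs)
for u, v in pairs:
    graph.addEdge(u, v)

components = nk.components.ConnectedComponents(graph)
components.run()

# Each component is a cluster of mutual duplicates; keep one document
# per cluster and mark the rest for removal.
for cluster in components.getComponents():
    keep, drop = cluster[0], cluster[1:]
```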
Generate Final List of Duplicates
In this step, we generate the final deduplicated dataset. We dump the original dataset into fixed-size files in the jsonl.zst format, whose size is configurable in the deduplicate_dataset.py file. By default, we create files of 16 MB.
Below you can find an example of how to generate the final deduplicated dataset:
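The sketch below shows the shard-writing idea using the zstandard library; it chunks by uncompressed size as an approximation of the 16 MB default described above, and all names are assumptions rather than the actual deduplicate_dataset.py logic.

```python
import json

import zstandard as zstd

MAX_SHARD_BYTES = 16 * 1024 * 1024  # configurable shard size


def write_shards(documents, prefix, jsonl_key="text"):
    # Write surviving documents into fixed-size .jsonl.zst shards.
    compressor = zstd.ZstdCompressor()
    buffered, size, shard_id = [], 0, 0

    def flush():
        nonlocal buffered, size, shard_id
        if not buffered:
            return
        with open(f"{prefix}_{shard_id}.jsonl.zst", "wb") as fh:
            fh.write(compressor.compress(b"".join(buffered)))
        buffered, size, shard_id = [], 0, shard_id + 1

    for doc in documents:
        line = (json.dumps({jsonl_key: doc}) + "\n").encode("utf-8")
        buffered.append(line)
        size += len(line)
        if size >= MAX_SHARD_BYTES:
            flush()
    flush()
```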