Model Description

Dense Passage Retriever (DPR) is a technique introduced by Facebook Research that marked one of the biggest early successes in applying neural networks to retrieval. The goal of retrieval is to find passages that help answer a question. To accomplish this, DPR is composed of two sub-models: a question encoder and a passage encoder. The idea is that questions and passages have different properties, so each encoder (usually a BERT-based model) can be optimized for its own domain.
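As an illustrative sketch of the dual-encoder idea (not the Model Zoo implementation), the two encoders can be instantiated from off-the-shelf BERT checkpoints; the bert-base-uncased checkpoint and the use of the [CLS] representation below are assumptions for illustration only:

```python
import torch
from transformers import BertModel, BertTokenizerFast

# Two independent encoders: one for questions, one for passages.
# Using bert-base-uncased for both is an illustrative assumption.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
question_encoder = BertModel.from_pretrained("bert-base-uncased")
passage_encoder = BertModel.from_pretrained("bert-base-uncased")

def embed(encoder, texts):
    """Encode a batch of texts into fixed-size embeddings ([CLS] token)."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0]  # (batch, hidden) [CLS] representation

q_emb = embed(question_encoder, ["who wrote the origin of species?"])
p_emb = embed(passage_encoder, ["On the Origin of Species was written by Charles Darwin."])
similarity = q_emb @ p_emb.T  # dot-product similarity; higher means more relevant
```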

During training, DPR incentivizes the two encoders to create embeddings of questions and passages such that useful passages lie close to their question in embedding space, while less useful passages lie farther away. The model uses contrastive loss to maximize the similarity of a question with its corresponding passage and minimize its similarity with non-matching passages.
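Concretely, the DPR paper formulates this as a negative log-likelihood over one positive and n negative passages, with similarity defined as the dot product of the two encoders' outputs, sim(q, p) = E_Q(q)ᵀE_P(p):

$$
L\left(q_i, p_i^{+}, p_{i,1}^{-}, \ldots, p_{i,n}^{-}\right) = -\log \frac{\exp\left(\mathrm{sim}(q_i, p_i^{+})\right)}{\exp\left(\mathrm{sim}(q_i, p_i^{+})\right) + \sum_{j=1}^{n} \exp\left(\mathrm{sim}(q_i, p_{i,j}^{-})\right)}
$$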

We currently support DPR training on CS-X, and inference can be run on GPUs. After training, you can create embeddings for all the passages in your data using the passage encoder and load them into a vector database such as FAISS. At retrieval time, you encode a new question with the question encoder to get a question embedding and retrieve the most similar passage embeddings from the vector database.
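A minimal sketch of this offline-index / online-query flow with FAISS, assuming the encoders produce 768-dimensional float32 embeddings (the random arrays below stand in for real encoder outputs):

```python
import numpy as np
import faiss

hidden_dim = 768  # embedding size of a BERT-base encoder (assumed)

# Offline: embed every passage with the trained passage encoder and index them.
passage_embeddings = np.random.rand(10_000, hidden_dim).astype("float32")  # placeholder
index = faiss.IndexFlatIP(hidden_dim)  # inner-product (dot-product) similarity
index.add(passage_embeddings)

# Online: embed an incoming question with the question encoder, then retrieve
# the most similar passages from the index.
question_embedding = np.random.rand(1, hidden_dim).astype("float32")  # placeholder
scores, passage_ids = index.search(question_embedding, 5)
print(passage_ids[0])  # indices of the top-5 passages for this question
```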

Contrastive loss

Contrastive loss: Contrastive loss has existed for decades, but gained popularity with OpenAI's landmark paper on Contrastive Language-Image Pre-training (CLIP). DPR uses the same technique as CLIP, but between questions and passages instead of images and captions. Since the introduction of DPR, models trained with contrastive loss have become standard for retrieval. The current state-of-the-art retrievers have remained remarkably similar to the original recipe outlined by DPR.

Hard negatives: Recall that the contrastive loss paradigm tries to do two things simultaneously: (1) maximize similarity between matching question-passage pairs (positive pairs), and (2) minimize similarity between non-matching question-passage pairs (negative pairs). Neural networks use batches of data for computational efficiency, so it is common practice to exploit this when creating non-matching pairs by comparing a question with the passage of a different question from the same batch (in-batch negatives).
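A minimal PyTorch sketch of in-batch negatives (standalone, not the Model Zoo loss code): every question is scored against every passage in the batch, and the diagonal entries are the positive pairs.

```python
import torch
import torch.nn.functional as F

batch_size, hidden_dim = 8, 768
q_emb = torch.randn(batch_size, hidden_dim)  # question embeddings (placeholder)
p_emb = torch.randn(batch_size, hidden_dim)  # matching passage embeddings, same order

# (batch, batch) similarity matrix: entry [i, j] scores question i against passage j.
scores = q_emb @ p_emb.T

# For question i, the positive passage is column i; every other column in that
# row acts as an in-batch negative. Cross-entropy over the rows implements the
# contrastive (negative log-likelihood) objective.
labels = torch.arange(batch_size)
loss = F.cross_entropy(scores, labels)
```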

Some datasets additionally include hard negatives for each question. Creating negatives within a batch is easy and efficient; however, the best performance comes from finding passages that are similar to the positive passage but do not contain the required information. These passages are called hard negatives, because it is much more difficult to distinguish them from the true positive passage.
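When each question also carries one hard negative, as in the input specification below where passages have shape (batch_size, 2, max_seq_len), the score matrix simply gains extra columns. A sketch under that assumption:

```python
import torch
import torch.nn.functional as F

batch_size, hidden_dim = 8, 768
q_emb = torch.randn(batch_size, hidden_dim)       # question embeddings (placeholder)
ctx_emb = torch.randn(batch_size, 2, hidden_dim)  # [positive, hard negative] per question

# Flatten passages so columns alternate: positive_0, hard_neg_0, positive_1, ...
flat_ctx = ctx_emb.reshape(batch_size * 2, hidden_dim)
scores = q_emb @ flat_ctx.T  # (batch, 2 * batch)

# The positive for question i sits at column 2 * i; all other columns
# (in-batch negatives plus every hard negative) are treated as negatives.
labels = torch.arange(batch_size) * 2
loss = F.cross_entropy(scores, labels)
```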

Code Structure

The code for DPR is structured similarly to other dual-encoder architectures and reuses BERT-style components.

  • configs/: YAML configuration files for training DPR models.
  • model.py: Defines the DPR model architecture including question and passage encoders.
  • run.py: Script for training DPR on Cerebras systems or GPU.
  • utils.py: Utilities for config parsing and distributed training support.

Available Configurations

| Configuration | Description |
| --- | --- |
| params_dpr_base_nq.yaml | Base DPR configuration trained on the Natural Questions dataset. |

Model Input Tensor Specifications

| Input Name | Shape | Data Type | Description |
| --- | --- | --- | --- |
| questions_input_ids | (batch_size, max_seq_len) | torch.int32 | Token IDs for the input questions. |
| questions_attention_mask | (batch_size, max_seq_len) | torch.int32 | Attention mask for question tokens. |
| questions_token_type_ids | (batch_size, max_seq_len) | torch.int32 | Token type IDs for questions (typically all zeros). |
| ctx_input_ids | (batch_size, 2, max_seq_len) | torch.int32 | Token IDs for one positive and one hard-negative passage per question. |
| ctx_attention_mask | (batch_size, 2, max_seq_len) | torch.int32 | Attention mask for the passage tokens. |
| ctx_token_type_ids | (batch_size, 2, max_seq_len) | torch.int32 | Token type IDs for passages (typically all zeros). |
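For reference, a dummy batch matching these shapes and data types could be constructed as follows (the sizes and random values are placeholders; only the tensor names, shapes, and dtypes mirror the table above):

```python
import torch

batch_size, max_seq_len, vocab_size = 4, 256, 30522  # illustrative sizes

batch = {
    "questions_input_ids": torch.randint(0, vocab_size, (batch_size, max_seq_len), dtype=torch.int32),
    "questions_attention_mask": torch.ones(batch_size, max_seq_len, dtype=torch.int32),
    "questions_token_type_ids": torch.zeros(batch_size, max_seq_len, dtype=torch.int32),
    # The second dimension of 2 holds one positive and one hard-negative passage per question.
    "ctx_input_ids": torch.randint(0, vocab_size, (batch_size, 2, max_seq_len), dtype=torch.int32),
    "ctx_attention_mask": torch.ones(batch_size, 2, max_seq_len, dtype=torch.int32),
    "ctx_token_type_ids": torch.zeros(batch_size, 2, max_seq_len, dtype=torch.int32),
}
```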

Workflow

For example workflows using language models from the Cerebras Model Zoo, see our tutorials on pretraining and fine-tuning.

For a complete list of Cerebras ModelZoo CLI commands, see the command reference.

References