Text Embedding with DPR

author: Kyle He

Description

This operator uses Dense Passage Retrieval (DPR) to convert long text to embeddings.

Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research [1]. It was introduced in the paper "Dense Passage Retrieval for Open-Domain Question Answering" by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih [2].

The paper's abstract summarizes the approach: "In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework" [2].
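The dual-encoder idea can be illustrated with a minimal sketch: one encoder maps the question to a vector, another maps each passage to a vector, and passages are ranked by their inner product with the question vector. The helper names `dot` and `rank_passages` and the toy 3-dimensional vectors below are hypothetical and stand in for real encoder outputs; this is not the operator's implementation.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two vectors of equal dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def rank_passages(
    question_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (passage_index, score) pairs, most similar passage first."""
    scores = [(i, dot(question_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumed values) in place of real question/passage embeddings.
question = [1.0, 0.0, 1.0]
passages = [
    [0.0, 1.0, 0.0],  # points in an unrelated direction
    [1.0, 0.0, 0.5],  # close to the question vector
    [0.5, 0.5, 0.5],  # partially related
]

ranking = rank_passages(question, passages)
print(ranking)  # [(1, 1.5), (2, 1.0), (0, 0.0)]
```

In a real setup the passage vectors would typically be computed once and stored in a vector index, so that only the question needs to be encoded at query time.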

Reference

[1]. https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/longformer#transformers.LongformerConfig

[2]. https://arxiv.org/abs/2004.04906

from towhee import ops

# Create the operator, then call it on a string to get the embedding.
text_encoder = ops.text_embedding.dpr(model_name="allenai/longformer-base-4096")
text_embedding = text_encoder("Hello, world.")

Factory Constructor

Create the operator via the following factory method:

ops.text_embedding.dpr(model_name)

Interface

A text embedding operator takes a sentence, paragraph, or document as a string and outputs an embedding vector as a numpy.ndarray that captures the input's core semantic elements.

Parameters:

text: str

The input text as a string.

Returns: numpy.ndarray

The text embedding extracted by the model.
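If the returned embedding is later compared by cosine similarity, a common post-processing step is to scale it to unit length first. The sketch below is an assumption about downstream use, not part of the operator: `l2_normalize` is a hypothetical helper, and the two-element list stands in for the operator's output. It accepts any one-dimensional sequence of numbers, so a 1-D numpy.ndarray should work as well.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm; reject the all-zero vector."""
    norm = math.sqrt(sum(float(x) * float(x) for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [float(x) / norm for x in vec]


# Placeholder values standing in for a real embedding.
embedding = [3.0, 4.0]
unit = l2_normalize(embedding)
print(unit)
```

After normalization, the inner product of two embeddings equals their cosine similarity, which keeps scores comparable across texts of different lengths.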

Code Example

Use the pretrained model ('allenai/longformer-base-4096') to generate a text embedding for the sentence "Hello, world.".

Write the pipeline in a simplified style:

from towhee import DataCollection as dc

# Parentheses let the method chain span several lines.
(
    dc.glob("Hello, world.")
      .text_embedding.dpr('allenai/longformer-base-4096')
      .show()
)

Write the same pipeline with explicit input/output name specifications:

from towhee import DataCollection as dc

# 'text' names the input column and 'vec' names the embedding column.
(
    dc.glob['text']('Hello, world.')
      .text_embedding.dpr['text', 'vec']('allenai/longformer-base-4096')
      .select('vec')
      .show()
)
