

text-embedding

Text Embedding with DPR

author: Kyle He

Description

This operator uses Dense Passage Retrieval (DPR) to convert long text to embeddings.

Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. It was introduced in the paper "Dense Passage Retrieval for Open-Domain Question Answering" [2] by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih [1].

The paper's abstract summarizes the approach: "In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework" [2].
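To make the dual-encoder idea concrete, the sketch below ranks passages against a question by the inner product of their embeddings, which is the similarity function the DPR paper uses. It is only an illustration: the vectors are made-up 3-dimensional toy values rather than real encoder outputs, and the helper names `inner_product` and `rank_passages` are hypothetical, not part of this operator or of Towhee.

```python
from typing import List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def rank_passages(
    question_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs for the top_k highest scores."""
    scored = [
        (i, inner_product(question_vec, p)) for i, p in enumerate(passage_vecs)
    ]
    # Higher inner product means the passage is a better match.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings, invented for illustration only.
question = [1.0, 0.0, 1.0]
passages = [
    [0.5, 0.5, 0.0],  # score 0.5
    [1.0, 0.0, 2.0],  # score 3.0
    [0.0, 1.0, 1.0],  # score 1.0
]

top = rank_passages(question, passages, top_k=2)
print(top)  # [(1, 3.0), (2, 1.0)]
```

In a real system the passage vectors would be computed once and stored in an index, so that only the question has to be encoded at query time.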

References

[1] https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/longformer#transformers.LongformerConfig

[2] https://arxiv.org/abs/2004.04906

from towhee import ops

text_encoder = ops.text_embedding.dpr(model_name="allenai/longformer-base-4096")
text_embedding = text_encoder("Hello, world.")

Factory Constructor

Create the operator via the following factory method:

ops.text_embedding.dpr(model_name)

Interface

A text embedding operator takes a sentence, paragraph, or document as a string and outputs an embedding vector, as a numpy.ndarray, that captures the input's core semantic elements.

Parameters:

text: str

The input text as a string.

Returns: numpy.ndarray

The text embedding extracted by the model.

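Once two texts have been embedded, a common way to compare them is cosine similarity. The sketch below is a minimal, dependency-free version that assumes each embedding is a 1-D vector; it accepts any sequence of floats, which includes a 1-D numpy.ndarray. The function name `cosine_similarity` and the toy vectors are illustrative assumptions, not output from this operator.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (invented for illustration).
vec_a = [1.0, 2.0, 2.0]
vec_b = [2.0, 4.0, 4.0]   # same direction as vec_a
vec_c = [2.0, -1.0, 0.0]  # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(sim_ab, sim_ac)
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score close to 0, regardless of their lengths.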
Code Example

Use the pretrained model ('allenai/longformer-base-4096') to generate a text embedding for the sentence "Hello, world.".

Write the pipeline in simplified style:

import towhee

# Build the collection from a list of strings; glob() is meant for file path patterns.
towhee.dc(['Hello, world.']) \
    .text_embedding.dpr(model_name='allenai/longformer-base-4096') \
    .show()

Write the same pipeline with explicit input/output name specifications:

import towhee

# 'text' names the input column and 'vec' names the column holding the embedding.
towhee.dc['text'](['Hello, world.']) \
    .text_embedding.dpr['text', 'vec'](model_name='allenai/longformer-base-4096') \
    .select('vec') \
    .show()