Text Embedding with Realm

author: Jael Gu


Description

A text embedding operator takes a sentence, paragraph, or document as a string and outputs an embedding vector as an ndarray that captures the input's core semantic elements. This operator uses the REALM model, a retrieval-augmented language model that first retrieves documents from a textual knowledge corpus and then uses the retrieved documents to perform question answering tasks. [1] The original model was proposed in "REALM: Retrieval-Augmented Language Model Pre-Training" by Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. [2]

References

[1] https://huggingface.co/docs/transformers/model_doc/realm

[2] https://arxiv.org/abs/2002.08909


Code Example

Use the pre-trained model "google/realm-cc-news-pretrained-embedder" to generate a text embedding for the sentence "Hello, world.".

Write the pipeline:

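from towhee.dc2 import pipe, ops, DataCollection

# Build a pipeline that maps the input text to its REALM embedding.
p = (
    pipe.input('text')
        .map('text', 'vec', ops.text_embedding.realm(model_name="google/realm-cc-news-pretrained-embedder"))
        .output('text', 'vec')
)

# Run the pipeline on one sentence and display the text with its vector.
DataCollection(p('Hello, world.')).show()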
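
Embeddings are typically compared with each other rather than inspected directly. The following is a minimal sketch of one common comparison, cosine similarity, written in plain Python. The function name cosine_similarity and the toy vectors are illustrative assumptions, not part of this operator; the toy vectors stand in for two 'vec' outputs of the pipeline.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for two 'vec' outputs; real embeddings have more dimensions.
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [3.0, 0.0, -1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # close to 0: orthogonal
```

An ndarray returned by the operator could be passed to this function after conversion with its tolist() method.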


Factory Constructor

Create the operator via the following factory method:

text_embedding.realm(model_name="google/realm-cc-news-pretrained-embedder")

Parameters:

model_name: str

The name of the model as a string. The default value is "google/realm-cc-news-pretrained-embedder".

Supported model names:

  • google/realm-cc-news-pretrained-embedder


Interface

The operator takes a piece of text as a string. It loads the tokenizer and the pre-trained model by model name and then returns the text embedding as an ndarray.
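
The load-then-embed flow described above can be sketched with the Hugging Face transformers library. This is not the operator's actual source: the helper name embed_text is hypothetical, and the sketch assumes PyTorch and a transformers release that still ships the REALM classes are installed. It also assumes the embedding of interest is the projected_score output of RealmEmbedder.

```python
def embed_text(text: str,
               model_name: str = "google/realm-cc-news-pretrained-embedder"):
    """Hypothetical helper: return a REALM embedding of `text` as a numpy.ndarray."""
    # Imported lazily so the heavy dependencies load only when an
    # embedding is actually requested.
    import torch
    from transformers import AutoTokenizer, RealmEmbedder

    # Load the tokenizer and the pre-trained embedder by model name.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = RealmEmbedder.from_pretrained(model_name)
    model.eval()

    # Tokenize the text and run a forward pass without tracking gradients.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # projected_score has shape (batch_size, projection_size);
    # drop the batch axis and convert to a numpy.ndarray.
    return outputs.projected_score.squeeze(0).numpy()
```

Calling embed_text("Hello, world.") downloads the model weights on first use, so it needs network access. Reloading the model on every call keeps the sketch short; a real operator would more likely load it once and reuse it.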

Parameters:

text: str

The input text as a string.

Returns:

numpy.ndarray

The text embedding extracted by the model.
