Text Embedding with Longformer

author: Kyle He

Description

This operator uses Longformer to convert long text to embeddings.

The Longformer model was presented in "Longformer: The Long-Document Transformer" by Iz Beltagy, Matthew E. Peters, and Arman Cohan [1][2].

The paper's abstract states the motivation: "Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer" [2].
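To make the quadratic-versus-linear claim concrete, here is a minimal sketch that counts how many query-key pairs are scored under full self-attention and under a sliding-window pattern. The function names and the window size of 512 are illustrative assumptions, not part of this operator, and the sketch ignores the global attention tokens that Longformer adds on top of the local window.

```python
def full_attention_pairs(seq_len: int) -> int:
    """Pairs scored by full self-attention: every token attends to every token."""
    return seq_len * seq_len


def sliding_window_pairs(seq_len: int, window: int) -> int:
    """Pairs scored when each token attends only to tokens within
    window // 2 positions on either side, clipped at the sequence edges."""
    half = window // 2
    total = 0
    for i in range(seq_len):
        lo = max(0, i - half)            # leftmost position token i can see
        hi = min(seq_len - 1, i + half)  # rightmost position token i can see
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    # Illustrative window size (an assumption for this sketch).
    for n in (512, 1024, 2048, 4096):
        print(n, full_attention_pairs(n), sliding_window_pairs(n, window=512))
```

Doubling the sequence length quadruples the first count, while the second grows roughly in proportion to the length once the sequence is much longer than the window.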

References

[1] https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/longformer#transformers.LongformerConfig

[2] https://arxiv.org/pdf/2004.05150.pdf

Code Example

Use the pretrained model "allenai/longformer-base-4096" to generate a text embedding for the sentence "Hello, world.".

Write the pipeline:

import towhee

towhee.dc(["Hello, world."]) \
      .text_embedding.longformer(model_name="allenai/longformer-base-4096")

Factory Constructor

Create the operator via the following factory method:

text_embedding.longformer(model_name="allenai/longformer-base-4096")

Parameters:

model_name: str

The name of the pretrained model, as a string. The default value is "allenai/longformer-base-4096".

Supported model names:

allenai/longformer-base-4096
allenai/longformer-large-4096
allenai/longformer-large-4096-finetuned-triviaqa
allenai/longformer-base-4096-extra.pos.embd.only
allenai/longformer-large-4096-extra.pos.embd.only
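Because only the names listed above are documented as supported, it can help to validate a name before constructing the operator. The helper below is a hypothetical sketch, not part of the operator's API: `SUPPORTED_MODEL_NAMES` and `check_model_name` are names introduced here for illustration only.

```python
# Hypothetical helper: the tuple simply mirrors the list of supported names above.
SUPPORTED_MODEL_NAMES = (
    "allenai/longformer-base-4096",
    "allenai/longformer-large-4096",
    "allenai/longformer-large-4096-finetuned-triviaqa",
    "allenai/longformer-base-4096-extra.pos.embd.only",
    "allenai/longformer-large-4096-extra.pos.embd.only",
)


def check_model_name(model_name: str = "allenai/longformer-base-4096") -> str:
    """Return model_name unchanged if it is in the supported list, else raise ValueError."""
    if model_name not in SUPPORTED_MODEL_NAMES:
        raise ValueError(
            f"Unsupported model name {model_name!r}; "
            f"expected one of: {', '.join(SUPPORTED_MODEL_NAMES)}"
        )
    return model_name
```

A check like this would catch a name copied from a different operator, such as a DPR model name, before any model weights are fetched.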

Interface

The operator takes a text string as input. It loads the tokenizer and the pretrained model specified by the model name, and returns the text embedding as an ndarray.

Parameters:

text: str

The input text, as a string.

Returns:

numpy.ndarray

The text embedding extracted by the model.
