
Text Embedding with Longformer

author: Kyle He


Description

This operator uses Longformer to convert long text to embeddings.

The Longformer model was presented in [Longformer: The Long-Document Transformer][2] by Iz Beltagy, Matthew E. Peters, and Arman Cohan; the Hugging Face Transformers implementation is documented in [1].

As the paper's abstract puts it: "Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer."[2]
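To make the scaling difference concrete, here is a small back-of-the-envelope sketch (not part of the operator, and not the Longformer implementation) comparing how many attention scores full self-attention computes versus a sliding-window pattern of half-width `w`:

```python
# Illustrative cost model only: count attention scores per pattern.
def full_attention_ops(n):
    # Every token attends to every token: n * n scores (quadratic in n).
    return n * n

def window_attention_ops(n, w):
    # Each token attends only to its w neighbors on each side (clipped
    # at the sequence boundaries): roughly n * (2w + 1) scores (linear in n).
    return sum(min(n, i + w + 1) - max(0, i - w) for i in range(n))

for n in (1024, 2048, 4096):
    print(n, full_attention_ops(n), window_attention_ops(n, w=256))
```

Doubling the sequence length quadruples the full-attention count but only roughly doubles the windowed count, which is why Longformer can handle inputs of 4096 tokens.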

References

[1] https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/longformer#transformers.LongformerConfig

[2] https://arxiv.org/pdf/2004.05150.pdf


Code Example

Use the pre-trained model "allenai/longformer-base-4096" to generate a text embedding for the sentence "Hello, world.".

Write the pipeline:

```python
from towhee import pipe, ops, DataCollection

p = (
    pipe.input('text')
        .map('text', 'vec', ops.text_embedding.longformer(model_name="allenai/longformer-base-4096"))
        .output('text', 'vec')
)

DataCollection(p('Hello, world.')).show()
```


Factory Constructor

Create the operator via the following factory method:

```python
text_embedding.longformer(model_name="allenai/longformer-base-4096")
```

Parameters:

model_name: str

The model name as a string. The default value is "allenai/longformer-base-4096".

Supported model names:

  • allenai/longformer-base-4096
  • allenai/longformer-large-4096
  • allenai/longformer-large-4096-finetuned-triviaqa
  • allenai/longformer-base-4096-extra.pos.embd.only
  • allenai/longformer-large-4096-extra.pos.embd.only

global_attention_mask: torch.Tensor

Defaults to None. Positions marked with 1 use global attention; positions marked with 0 use local sliding-window attention.
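A common pattern from the Longformer paper is to give only the first ([CLS]) token global attention. A minimal sketch of building such a mask with NumPy (the helper name and the use of NumPy are illustrative; the operator expects a `torch.Tensor` of shape `(batch, seq_len)`, e.g. via `torch.from_numpy`):

```python
import numpy as np

def make_global_attention_mask(batch_size, seq_len, global_positions=(0,)):
    # 0 = local sliding-window attention, 1 = global attention.
    mask = np.zeros((batch_size, seq_len), dtype=np.int64)
    for pos in global_positions:
        mask[:, pos] = 1
    return mask

mask = make_global_attention_mask(1, 8)
print(mask)  # [[1 0 0 0 0 0 0 0]]
```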

pooler_output: bool

The flag controlling whether to return pooled features; defaults to False. By default, the output for each input text has shape (num_tokens, dim). If True, the output is a single vector of shape (dim,) for each input text.
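The shape change can be illustrated with NumPy. Note this sketch uses mean pooling purely to show the `(num_tokens, dim)` → `(dim,)` reduction; the operator's actual pooled output comes from the model's pooler head, not from this formula:

```python
import numpy as np

# Hypothetical token-level output for one input text:
# 6 tokens from a 768-dimensional model (pooler_output=False shape).
token_embeddings = np.random.rand(6, 768)

# Reducing over the token axis yields one vector per text (pooler_output=True shape).
pooled = token_embeddings.mean(axis=0)

print(token_embeddings.shape, pooled.shape)  # (6, 768) (768,)
```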


Interface

The operator takes a text string as input. It loads the tokenizer and pre-trained model by model name, and returns the text embedding as a numpy.ndarray.

Parameters:

text: str

The input text as a string.

Returns:

numpy.ndarray

The text embedding extracted by the model.
