longformer/README.md

# Text Embedding with Longformer

*author: Kyle He*

<br />

## Desription

This operator uses Longformer to convert long text to embeddings.

The Longformer model was presented in Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan[1].

**Longformer** models were proposed in “[Longformer: The Long-Document Transformer][2].

> Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer[2].

### References

[1].https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/longformer#transformers.LongformerConfig

[2].https://arxiv.org/pdf/2004.05150.pdf

<br />

## Code Example

Use the pre-trained model "allenai/longformer-base-4096"
to generate a text embedding for the sentence "Hello, world.". 

*Write the pipeline*:

```python
from towhee import pipe, ops, DataCollection

p = (
    pipe.input('text')
        .map('text', 'vec', ops.text_embedding.longformer(model_name="allenai/longformer-base-4096"))
        .output('text', 'vec')
)

DataCollection(p('Hello, world.')).show()
```

<img src="./result.png" width="800px"/>

<br />

## Factory Constructor

Create the operator via the following factory method:

***text_embedding.longformer(model_name="allenai/longformer-base-4096")***

**Parameters:**

***model_name***: *str*

The model name in string.
The default value is "allenai/longformer-base-4096".

Supported model names:
- allenai/longformer-base-4096
- allenai/longformer-large-4096
- allenai/longformer-large-4096-finetuned-triviaqa
- allenai/longformer-base-4096-extra.pos.embd.only
- allenai/longformer-large-4096-extra.pos.embd.only

***global_attention_mask***: *torch.Tensor*

defaults to None.


***pooler_output***: *bool*

The flag controllling whether to return outputs with pooled features, defaults to False.
The default output is in shape of (num_tokens, dim) for each input text.
If True, then the output will be a vector in (dim,) for each input text.

<br />

## Interface

The operator takes a text in string as input.
It loads tokenizer and pre-trained model using model name
and then return text embedding in ndarray.

**Parameters:**

***text***: *str*

The text in string.


**Returns**:

    *numpy.ndarray*

	The text embedding extracted by model.


# More Resources

- [Sentence Transformers for Long-Form Text - Zilliz blog](https://zilliz.com/learn/Sentence-Transformers-for-Long-Form-Text): Deep diving into modern transformer-based embeddings for long-form text.
- [OpenAI text-embedding-3-large | Zilliz](https://zilliz.com/ai-models/text-embedding-3-large): Building GenAI applications with text-embedding-3-large model and Zilliz Cloud / Milvus
- [The guide to jina-embeddings-v2-base-en | Jina AI](https://zilliz.com/ai-models/jina-embeddings-v2-base-en): jina-embeddings-v2-base-en: specialized embedding model for English text and long documents; support sequences of up to 8192 tokens
- [The guide to text-embedding-3-small | OpenAI](https://zilliz.com/ai-models/text-embedding-3-small): text-embedding-3-small: OpenAIâs small text embedding model optimized for accuracy and efficiency with a lower cost.
- [The guide to jina-embeddings-v2-small-en | Jina AI](https://zilliz.com/ai-models/jina-embeddings-v2-small-en): jina-embeddings-v2-small-en: specialized text embedding model for long English documents; up to 8192 tokens.
Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`# Text Embedding with Longformer`
Initial commit 3 years ago
refactor longformer operator 3 years ago			`author: Kyle He`

Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`<br />`
refactor longformer operator 3 years ago
			`## Desription`

			`This operator uses Longformer to convert long text to embeddings.`

			`The Longformer model was presented in Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan[1].`

			`Longformer models were proposed in “[Longformer: The Long-Document Transformer][2].`

[DOC] Refine Readme Signed-off-by: LocoRichard <lichen.wang@zilliz.com> 3 years ago			`> Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer[2].`
refactor longformer operator 3 years ago
Update Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`### References`
refactor longformer operator 3 years ago
			`[1].https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/longformer#transformers.LongformerConfig`

			`[2].https://arxiv.org/pdf/2004.05150.pdf`

Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`<br />`

Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`## Code Example`

Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`Use the pre-trained model "allenai/longformer-base-4096"`
Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`to generate a text embedding for the sentence "Hello, world.".`

Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`Write the pipeline:`
Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago
refactor longformer operator 3 years ago			```python
remove dc2 Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 2 years ago			`from towhee import pipe, ops, DataCollection`
refactor longformer operator 3 years ago
Update Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 2 years ago			`p = (`
			`pipe.input('text')`
			`.map('text', 'vec', ops.text_embedding.longformer(model_name="allenai/longformer-base-4096"))`
			`.output('text', 'vec')`
			`)`

			`DataCollection(p('Hello, world.')).show()`
refactor longformer operator 3 years ago			```

Update Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 2 years ago			`<img src="./result.png" width="800px"/>`

Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`<br />`

refactor longformer operator 3 years ago			`## Factory Constructor`

[DOC] Refine Readme Signed-off-by: LocoRichard <lichen.wang@zilliz.com> 3 years ago			`Create the operator via the following factory method:`
refactor longformer operator 3 years ago
Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`*text_embedding.longformer(model_name="allenai/longformer-base-4096")*`
refactor longformer operator 3 years ago
Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`Parameters:`
refactor longformer operator 3 years ago
Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`*model_name: str*`
refactor longformer operator 3 years ago
Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`The model name in string.`
Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`The default value is "allenai/longformer-base-4096".`
Update readme Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago
			`Supported model names:`
			`- allenai/longformer-base-4096`
			`- allenai/longformer-large-4096`
			`- allenai/longformer-large-4096-finetuned-triviaqa`
			`- allenai/longformer-base-4096-extra.pos.embd.only`
			`- allenai/longformer-large-4096-extra.pos.embd.only`
refactor longformer operator 3 years ago
Debug Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`*global_attention_mask: torch.Tensor*`

			`defaults to None.`


			`*pooler_output: bool*`

			`The flag controllling whether to return outputs with pooled features, defaults to False.`
			`The default output is in shape of (num_tokens, dim) for each input text.`
			`If True, then the output will be a vector in (dim,) for each input text.`

Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`<br />`
refactor longformer operator 3 years ago
Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`## Interface`

			`The operator takes a text in string as input.`
[DOC] Refine Readme Signed-off-by: LocoRichard <lichen.wang@zilliz.com> 3 years ago			`It loads tokenizer and pre-trained model using model name`
Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`and then return text embedding in ndarray.`
refactor longformer operator 3 years ago
			`Parameters:`

Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`*text: str*`
refactor longformer operator 3 years ago
Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`The text in string.`
refactor longformer operator 3 years ago

Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`Returns:`
refactor longformer operator 3 years ago
Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`numpy.ndarray`
refactor longformer operator 3 years ago
Update README Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 3 years ago			`The text embedding extracted by model.`
Add more resources Signed-off-by: Jael Gu <mengjia.gu@zilliz.com> 8 months ago

			`# More Resources`

			`- [Sentence Transformers for Long-Form Text - Zilliz blog](https://zilliz.com/learn/Sentence-Transformers-for-Long-Form-Text): Deep diving into modern transformer-based embeddings for long-form text.`
			`- [OpenAI text-embedding-3-large \| Zilliz](https://zilliz.com/ai-models/text-embedding-3-large): Building GenAI applications with text-embedding-3-large model and Zilliz Cloud / Milvus`
			`- [The guide to jina-embeddings-v2-base-en \| Jina AI](https://zilliz.com/ai-models/jina-embeddings-v2-base-en): jina-embeddings-v2-base-en: specialized embedding model for English text and long documents; support sequences of up to 8192 tokens`
			`- [The guide to text-embedding-3-small \| OpenAI](https://zilliz.com/ai-models/text-embedding-3-small): text-embedding-3-small: OpenAIâs small text embedding model optimized for accuracy and efficiency with a lower cost.`
			`- [The guide to jina-embeddings-v2-small-en \| Jina AI](https://zilliz.com/ai-models/jina-embeddings-v2-small-en): jina-embeddings-v2-small-en: specialized text embedding model for long English documents; up to 8192 tokens.`