
# Text Embedding with longformer

*author: Kyle He*


## Description

This operator uses Longformer to convert long text to embeddings.

The **Longformer** model was proposed in "Longformer: The Long-Document Transformer" by Iz Beltagy, Matthew E. Peters, and Arman Cohan[2]. Its Hugging Face Transformers implementation is documented in [1].

Transformer-based models struggle to process long sequences because their self-attention
operation scales quadratically with the sequence length. To address this limitation,
the authors introduce Longformer, whose attention mechanism scales linearly with sequence
length, making it easy to process documents of thousands of tokens or longer[2].
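
To make the quadratic versus linear claim concrete, the sketch below counts how many attention scores each scheme computes per layer and head. The 512-token sliding window and the sequence lengths are assumptions chosen for illustration, and the helper names are hypothetical; global-attention tokens and edge effects are ignored.

```python
# Rough count of attention scores computed per layer and head.
# Assumptions (not from this README): a 512-token sliding window,
# no global-attention tokens, and no correction for sequence edges.

def full_attention_scores(seq_len: int) -> int:
    """Every token attends to every token: grows quadratically."""
    return seq_len * seq_len


def windowed_attention_scores(seq_len: int, window: int = 512) -> int:
    """Every token attends to at most `window` neighbours: grows linearly."""
    return seq_len * min(window, seq_len)


for n in (512, 4096, 16384):
    full = full_attention_scores(n)
    windowed = windowed_attention_scores(n)
    print(f"{n:>6} tokens: full={full:>12,}  windowed={windowed:>12,}")
```

Once the sequence is longer than the window, doubling the sequence length doubles the windowed count but quadruples the full-attention count.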

## Reference

[1]. https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/longformer#transformers.LongformerConfig

[2]. https://arxiv.org/pdf/2004.05150.pdf

```python
from towhee import ops

text_encoder = ops.text_embedding.longformer(model_name="allenai/longformer-base-4096")
text_embedding = text_encoder("Hello, world.")
```

## Factory Constructor

Create the operator via the following factory method:

***ops.text_embedding.longformer(model_name)***


## Interface

A text embedding operator takes a sentence, paragraph, or document as a string
and outputs an embedding vector as an ndarray that captures the input's core semantic elements.


**Parameters:**

	***text***: *str*

	The input text as a string.


**Returns**: *numpy.ndarray*

	The text embedding extracted by the model.
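
A common way to use the returned embeddings is to compare two texts by cosine similarity. The sketch below is only an illustration: `cosine_similarity` is a hypothetical helper, the vectors are made-up stand-ins for operator outputs, and it assumes each embedding is a one-dimensional vector, which this README does not specify.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for operator outputs; real embeddings are much longer.
vec_a = [0.1, 0.3, -0.2, 0.7]
vec_b = [0.1, 0.25, -0.1, 0.8]
vec_c = [-0.6, 0.1, 0.5, -0.2]

print(cosine_similarity(vec_a, vec_b))  # similar direction, closer to 1
print(cosine_similarity(vec_a, vec_c))  # different direction, lower score
```

Because the helper only relies on `len` and iteration, it should also accept a one-dimensional `numpy.ndarray`.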


## Code Example

Use the pretrained model 'allenai/longformer-base-4096'
to generate a text embedding for the sentence "Hello, world.".

*Write the pipeline in simplified style:*

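```python
import towhee

# Assumes a towhee release that exposes the DataCollection API as towhee.dc.
# towhee.dc wraps in-memory data; glob is meant for file path patterns.
(
    towhee.dc(["Hello, world."])
    .text_embedding.longformer(model_name="allenai/longformer-base-4096")
    .show()
)
```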

*Write the same pipeline with explicit input/output name specifications:*

```python
import towhee

# Assumes a towhee release that exposes the DataCollection API as towhee.dc.
(
    towhee.dc['text'](["Hello, world."])
    .text_embedding.longformer['text', 'vec'](model_name="allenai/longformer-base-4096")
    .select('vec')
    .show()
)
```