
# Text Embedding with Longformer

*author: Kyle He*


## Description

This operator uses Longformer to convert long text to embeddings.

**Longformer** was proposed in "Longformer: The Long-Document Transformer" by Iz Beltagy, Matthew E. Peters, and Arman Cohan[1][2].

Transformer-based models are unable to process long sequences due to their self-attention
operation, which scales quadratically with the sequence length. To address this limitation,
the authors introduce Longformer, whose attention mechanism scales linearly with sequence
length, making it easy to process documents of thousands of tokens or longer[2].
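To make the scaling claim concrete, the sketch below counts how many query-key pairs are scored under full self-attention versus attention restricted to a fixed local window. It is an illustration only: the function names are hypothetical, the window size of 512 is an assumed example value, and the count ignores the global-attention tokens that Longformer can also use.

```python
def full_attention_pairs(seq_len: int) -> int:
    """Pairs scored by full self-attention: every token attends to every token."""
    return seq_len * seq_len


def windowed_attention_pairs(seq_len: int, window: int) -> int:
    """Pairs scored when each token attends only to nearby tokens.

    Each token sees up to `window // 2` neighbours on each side plus itself,
    so the total grows linearly with `seq_len` for a fixed `window`.
    """
    half = window // 2
    total = 0
    for i in range(seq_len):
        lo = max(0, i - half)            # clip the window at the start
        hi = min(seq_len - 1, i + half)  # clip the window at the end
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    window = 512  # assumed example value
    for seq_len in (1024, 2048, 4096):
        full = full_attention_pairs(seq_len)
        local = windowed_attention_pairs(seq_len, window)
        print(f"seq_len={seq_len}: full={full}, windowed={local}")
```

Doubling the sequence length quadruples the full-attention count, while the windowed count roughly doubles.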

### References

[1]. https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/longformer#transformers.LongformerConfig

[2]. https://arxiv.org/pdf/2004.05150.pdf

## Code Example

Use the pretrained model "allenai/longformer-base-4096"
to generate a text embedding for the sentence "Hello, world."

*Write the pipeline*:

```python
from towhee import dc


dc.stream(["Hello, world."]) \
  .text_embedding.longformer(model_name="allenai/longformer-base-4096") \
  .show()
```

## Factory Constructor

Create the operator via the following factory method:

***text_embedding.longformer(model_name="allenai/longformer-base-4096")***

**Parameters:**

	***model_name***: *str*

	The name of the pretrained model, as a string.
The default value is "allenai/longformer-base-4096".
You can get the list of supported model names by calling `get_model_list` from [longformer.py](https://towhee.io/text-embedding/longformer/src/branch/main/longformer.py).


## Interface

The operator takes a text string as input.
It loads the tokenizer and pretrained model specified by the model name,
and then returns the text embedding as an ndarray.

**Parameters:**

	***text***: *str*

	The input text, as a string.


**Returns:**

	*numpy.ndarray*

	The text embedding extracted by the model.
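
For orientation, here is a rough sketch of what an interface like this could look like when written directly against Hugging Face `transformers`. It is not the operator's actual code: the function name `embed_text`, the truncation setting, and the mean pooling over token states are assumptions chosen for illustration, and the operator may pool differently.

```python
def embed_text(text, model_name="allenai/longformer-base-4096"):
    """Return one embedding vector for `text` as a numpy.ndarray (hypothetical helper)."""
    # Imported lazily so that defining the function does not load the heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    # Load the tokenizer and pretrained model by name.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Truncate inputs that exceed the tokenizer's maximum length.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # Assumption: average the token states into a single vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0).numpy()
```

Calling `embed_text("Hello, world.")` downloads the model weights on first use, so it needs network access and both `torch` and `transformers` installed.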