# NLP embedding: Longformer Operator

Authors: Kyle He, Jael Gu

## Overview
This operator uses Longformer to convert long text to embeddings.

The Longformer model was presented in "Longformer: The Long-Document Transformer" by Iz Beltagy, Matthew E. Peters, and Arman Cohan [1][2].

Transformer-based models struggle to process long sequences because their self-attention
operation scales quadratically with the sequence length. To address this limitation,
Longformer uses an attention mechanism that scales linearly with sequence length,
making it practical to process documents of thousands of tokens or longer [2].
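
To make the quadratic versus linear scaling concrete, the following minimal sketch counts how many attention scores are computed for full self-attention and for a sliding-window pattern. The function names and the window size of 512 are assumptions chosen for illustration, and the sketch ignores the global attention tokens that Longformer can add on top of the window.

```python
def full_attention_scores(seq_len: int) -> int:
    """Full self-attention: every token attends to every token."""
    return seq_len * seq_len


def sliding_window_scores(seq_len: int, window: int) -> int:
    """Sliding-window attention: each token attends to itself and to at most
    window // 2 tokens on each side (fewer near the sequence boundaries)."""
    half = window // 2
    return sum(
        min(seq_len - 1, i + half) - max(0, i - half) + 1
        for i in range(seq_len)
    )


if __name__ == "__main__":
    window = 512  # illustrative assumption, not read from the operator
    for seq_len in (512, 4096, 16384):
        full = full_attention_scores(seq_len)
        local = sliding_window_scores(seq_len, window)
        print(f"seq_len={seq_len:>6}  full={full:>12}  windowed={local:>10}")
```

Doubling the sequence length quadruples the full-attention count but only roughly doubles the windowed count, which is the property that lets the model handle much longer inputs.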

## Interface

```python
__init__(self, model_name: str, framework: str = 'pytorch')
```

**Args:**

- model_name:
  - the model name for embedding
  - supported types: `str`, for example 'allenai/longformer-base-4096' or 'allenai/longformer-large-4096'
- framework:
  - the framework of the model
  - supported types: `str`, default is 'pytorch'

```python
__call__(self, txt: str)
```

**Args:**

- txt:
  - the input text content
  - supported types: `str`


**Returns:**

The Operator returns a tuple `Tuple[('feature_vector', numpy.ndarray)]` containing the following field:

- feature_vector:
  - the embedding of the text
  - data type: `numpy.ndarray`
  - shape: (dim,)
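
The interface above could be sketched roughly as follows, assuming the Hugging Face `transformers` and `torch` packages are installed. This is not the operator's actual source: the class name `LongformerEmbeddingSketch`, the lazy model loading, the restriction to `'pytorch'`, and the mean pooling over token states are all assumptions made for illustration.

```python
from typing import Any, NamedTuple

# 'feature_vector' holds a numpy.ndarray at runtime; it is typed as Any so the
# sketch can be imported without numpy being installed.
Outputs = NamedTuple('Outputs', [('feature_vector', Any)])


class LongformerEmbeddingSketch:
    """Hypothetical stand-in that mirrors the documented signatures."""

    def __init__(self, model_name: str, framework: str = 'pytorch'):
        if framework != 'pytorch':
            # Limitation of this sketch only, not a statement about the operator.
            raise ValueError(f'this sketch only handles pytorch, got {framework!r}')
        self.model_name = model_name
        self.framework = framework
        self._tokenizer = None
        self._model = None

    def _load(self) -> None:
        # Imported lazily so that constructing the object stays cheap.
        from transformers import AutoTokenizer, LongformerModel
        self._tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self._model = LongformerModel.from_pretrained(self.model_name)
        self._model.eval()

    def __call__(self, txt: str) -> Outputs:
        import torch
        if self._model is None:
            self._load()
        # Truncate to the tokenizer's maximum length for the chosen checkpoint.
        inputs = self._tokenizer(txt, return_tensors='pt', truncation=True)
        with torch.no_grad():
            hidden = self._model(**inputs).last_hidden_state  # (1, tokens, dim)
        # Assumed pooling: average the token states into a single (dim,) vector.
        vector = hidden[0].mean(dim=0)
        return Outputs(feature_vector=vector.numpy())
```

Calling `LongformerEmbeddingSketch('allenai/longformer-base-4096')('some long text').feature_vector` would download the checkpoint on first use and return a one-dimensional array whose length equals the model's hidden size.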

## Requirements

The required Python packages are listed in [requirements.txt](./requirements.txt).

## How it works

The `towhee/nlp-longformer` Operator converts text to an embedding and can be added to a pipeline.

## Reference

[1]. https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/longformer#transformers.LongformerConfig

[2]. https://arxiv.org/pdf/2004.05150.pdf


# More Resources

- [What is a Transformer Model? An Engineer's Guide](https://zilliz.com/glossary/transformer-models): A transformer model is a neural network architecture. It's proficient in converting a particular type of input into a distinct output. Its core strength lies in its ability to handle inputs and outputs of different sequence length. It does this through encoding the input into a matrix with predefined dimensions and then combining that with another attention matrix to decode. This transformation unfolds through a sequence of collaborative layers, which deconstruct words into their corresponding numerical representations. At its heart, a transformer model is a bridge between disparate linguistic structures, employing sophisticated neural network configurations to decode and manipulate human language input. An example of a transformer model is GPT-3, which ingests human language and generates text output.
- [Sentence Transformers for Long-Form Text - Zilliz blog](https://zilliz.com/learn/Sentence-Transformers-for-Long-Form-Text): Deep diving into modern transformer-based embeddings for long-form text.
- [OpenAI text-embedding-3-large | Zilliz](https://zilliz.com/ai-models/text-embedding-3-large): Building GenAI applications with text-embedding-3-large model and Zilliz Cloud / Milvus
- [The guide to jina-embeddings-v2-base-en | Jina AI](https://zilliz.com/ai-models/jina-embeddings-v2-base-en): jina-embeddings-v2-base-en: specialized embedding model for English text and long documents; supports sequences of up to 8192 tokens
- [Neural Networks and Embeddings for Language Models - Zilliz blog](https://zilliz.com/learn/Neural-Networks-and-Embeddings-for-Language-Models): Exploring neural network language models, specifically recurrent neural networks, and taking a sneak peek at how embeddings are generated.
- [The guide to jina-embeddings-v2-small-en | Jina AI](https://zilliz.com/ai-models/jina-embeddings-v2-small-en): jina-embeddings-v2-small-en: specialized text embedding model for long English documents; up to 8192 tokens.