# NLP embedding: Longformer Operator

Authors: Kyle He, Jael Gu

## Overview

This operator uses Longformer to convert long text to embeddings.

The Longformer model was proposed in "Longformer: The Long-Document Transformer" by Iz Beltagy, Matthew E. Peters, and Arman Cohan[1][2].

Transformer-based models cannot process long sequences efficiently because their self-attention operation scales quadratically with the sequence length. To address this limitation, the authors introduce Longformer, whose attention mechanism scales linearly with sequence length, making it practical to process documents of thousands of tokens or longer[2].
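
To make the scaling argument concrete, the following sketch counts attention scores for full self-attention versus a sliding-window pattern. It is an illustration only: the function names are hypothetical, the window size of 512 is an assumed value, and the count ignores the global attention that Longformer adds for a few selected tokens.

```python
def full_attention_pairs(seq_len: int) -> int:
    """Full self-attention: every token attends to every token."""
    return seq_len * seq_len


def sliding_window_pairs(seq_len: int, window: int) -> int:
    """Sliding-window attention: each token attends to itself and to at most
    `window // 2` tokens on each side, clipped at the sequence boundaries."""
    half = window // 2
    return sum(
        min(seq_len - 1, i + half) - max(0, i - half) + 1
        for i in range(seq_len)
    )


# Assumed window size, chosen for illustration only.
WINDOW = 512

for n in (1024, 4096, 16384):
    full = full_attention_pairs(n)
    local = sliding_window_pairs(n, WINDOW)
    print(f"seq_len={n}: full={full}, sliding_window={local}")
```

With a fixed window, the sliding-window count grows roughly in proportion to the sequence length, while the full count grows with its square.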

## Interface

```python
__init__(self, model_name: str, framework: str = 'pytorch')
```

**Args:**

- model_name:
  - the model name for embedding
  - supported types: `str`, for example `'allenai/longformer-base-4096'` or `'allenai/longformer-large-4096'`
- framework:
  - the framework of the model
  - supported types: `str`, default is `'pytorch'`

```python
__call__(self, txt: str)
```

**Args:**

- txt:
  - the input text content
  - supported types: `str`

**Returns:**

The operator returns a tuple `Tuple[('feature_vector', numpy.ndarray)]` containing the following field:

- feature_vector:
  - the embedding of the text
  - data type: `numpy.ndarray`
  - shape: (dim,)
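
As a rough illustration of how the returned value might be consumed, the sketch below reads the `feature_vector` field and compares two embeddings with cosine similarity. The `Outputs` class is a hypothetical stand-in for the operator's return type, and the vectors are made-up values rather than real Longformer embeddings.

```python
from typing import NamedTuple

import numpy as np


class Outputs(NamedTuple):
    """Hypothetical stand-in for the operator's return tuple."""
    feature_vector: np.ndarray


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two 1-D vectors; returns 0.0 if either is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0
    return float(np.dot(a, b) / denom)


# Made-up vectors of shape (dim,); a real run would take them from the operator.
out_a = Outputs(feature_vector=np.array([1.0, 0.0, 1.0]))
out_b = Outputs(feature_vector=np.array([1.0, 0.0, 0.0]))

similarity = cosine_similarity(out_a.feature_vector, out_b.feature_vector)
print(out_a.feature_vector.shape, similarity)
```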

## Requirements

Install the required Python packages listed in [requirements.txt](./requirements.txt), for example with `pip install -r requirements.txt`.

## How it works

The `towhee/nlp-longformer` operator converts text to an embedding and can be added to a pipeline.
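
The conversion from text to embedding could look roughly like the sketch below, which mirrors the interface documented above on top of the Hugging Face `transformers` library. This is not the operator's actual source: the class name `LongformerEmbeddingSketch`, the mean pooling over token states, and the rejection of frameworks other than `'pytorch'` are all assumptions made for illustration.

```python
from typing import NamedTuple

import numpy as np


class Outputs(NamedTuple):
    """Hypothetical stand-in for the operator's return tuple."""
    feature_vector: np.ndarray


class LongformerEmbeddingSketch:
    """Illustrative stand-in, not the towhee/nlp-longformer implementation."""

    def __init__(self, model_name: str, framework: str = 'pytorch') -> None:
        # Assumption: this sketch only handles the PyTorch framework.
        if framework != 'pytorch':
            raise ValueError(f"unsupported framework: {framework!r}")
        # Imported lazily so the argument check runs without the heavy dependencies.
        import torch
        from transformers import AutoModel, AutoTokenizer

        self._torch = torch
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.model.eval()

    def __call__(self, txt: str) -> Outputs:
        # Truncate to the model's maximum input length.
        inputs = self.tokenizer(txt, return_tensors='pt', truncation=True)
        with self._torch.no_grad():
            hidden = self.model(**inputs).last_hidden_state  # (1, seq_len, dim)
        # Assumption: mean-pool the token states into a single (dim,) vector.
        vector = hidden.mean(dim=1).squeeze(0).numpy()
        return Outputs(feature_vector=vector)
```

Under these assumptions, `LongformerEmbeddingSketch('allenai/longformer-base-4096')('some long text').feature_vector` would yield a one-dimensional `numpy.ndarray`. The first call downloads the model weights, so it needs network access and the `torch` and `transformers` packages.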

## Reference

[1]. https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/longformer#transformers.LongformerConfig

[2]. https://arxiv.org/pdf/2004.05150.pdf