# NLP embedding: Longformer Operator

Authors: Kyle He, Jael Gu

## Overview

This operator uses Longformer to convert long text to embeddings.

The **Longformer** model was proposed in "Longformer: The Long-Document Transformer" by Iz Beltagy, Matthew E. Peters, and Arman Cohan[2].

> Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer[2].
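
As a back-of-envelope illustration of that scaling difference (the numbers below are illustrative, not taken from the paper), compare how many token pairs full self-attention touches versus a sliding-window attention with a fixed window, for a 4096-token document:

```python
# Illustrative cost comparison: full self-attention vs. sliding-window
# attention. The window size 512 matches the local-attention window used
# in the Longformer paper, but the pair counts are a simplification.
n = 4096  # sequence length in tokens
w = 512   # sliding-window size

full_pairs = n * n      # quadratic in n: 16,777,216 comparisons
window_pairs = n * w    # linear in n for a fixed window: 2,097,152

print(full_pairs // window_pairs)  # → 8
```

Doubling the document length doubles `window_pairs` but quadruples `full_pairs`, which is why the windowed pattern stays tractable for long documents.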

## Interface

```python
__init__(self, model_name: str, framework: str = 'pytorch')
```

**Args:**

- model_name:
  - the model name for embedding
  - supported types: `str`, for example 'allenai/longformer-base-4096' or 'allenai/longformer-large-4096'
- framework:
  - the framework of the model
  - supported types: `str`, default is 'pytorch'
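
A minimal sketch of how a constructor with this signature might validate its arguments — the class name `LongformerOp` and the checks are assumptions for illustration, not the operator's actual implementation:

```python
# Hypothetical sketch of the constructor's argument handling; the class
# name and validation logic are illustrative, not the operator's real code.
SUPPORTED_MODELS = {
    "allenai/longformer-base-4096",
    "allenai/longformer-large-4096",
}

class LongformerOp:
    def __init__(self, model_name: str, framework: str = "pytorch"):
        if framework != "pytorch":
            raise ValueError(f"unsupported framework: {framework!r}")
        if model_name not in SUPPORTED_MODELS:
            raise ValueError(f"unknown model: {model_name!r}")
        self.model_name = model_name
        self.framework = framework

op = LongformerOp("allenai/longformer-base-4096")
```

The real operator would additionally load the tokenizer and model weights for `model_name`.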

```python
__call__(self, txt: str)
```

**Args:**

- txt:
  - the input text content, in words, sentences, or paragraphs
  - supported types: `str`

**Returns:**

The Operator returns a tuple `Tuple[('feature_vector', numpy.ndarray)]` containing the following fields:

- feature_vector:
  - the embedding of the text
  - data type: `numpy.ndarray`
  - shape: (dim,), where dim is the dimension of the vector and depends on `model_name`
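
The returned value can be unpacked by field name. Here is a hedged illustration of that schema with a stand-in vector — only the field name `feature_vector` comes from this README; the 768-dimensional zero vector is a placeholder (768 is the hidden size of the base model):

```python
import numpy as np
from collections import namedtuple

# Stand-in for the operator's output schema; the field name comes from
# the README, the values are placeholders, not a real embedding.
Outputs = namedtuple("Outputs", ["feature_vector"])
out = Outputs(feature_vector=np.zeros(768, dtype=np.float32))

vec = out.feature_vector  # access by field name
print(vec.shape)  # → (768,)
```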

## Requirements

You can install the required Python packages with [requirements.txt](./requirements.txt).

## How it works

The `towhee/nlp-longformer` Operator implements the conversion from text to embedding and can be added to a towhee pipeline.
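
To illustrate what "added to a pipeline" means, here is a toy pipeline built from plain functions. This is not towhee's API — just a sketch of operators chained over an input, with a stand-in in place of the real embedding operator:

```python
# Toy pipeline composition; not towhee's actual API, only an
# illustration of chaining operators over an input value.
def make_pipeline(*stages):
    def run(x):
        for stage in stages:
            x = stage(x)
        return x
    return run

def fake_embed(text):
    # Stand-in for the Longformer operator, which would return a
    # (dim,)-shaped numpy array for the input text.
    return [float(len(text))]

p = make_pipeline(str.strip, fake_embed)
print(p("  long document text  "))  # → [18.0]
```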

## Reference

[1] https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/longformer#transformers.LongformerConfig

[2] https://arxiv.org/pdf/2004.05150.pdf