# Text Embedding with Longformer
*author: Kyle He*
<br />
## Description
This operator uses Longformer to convert long text to embeddings.
The **Longformer** model was proposed in "Longformer: The Long-Document Transformer" by Iz Beltagy, Matthew E. Peters, and Arman Cohan[1][2]. The paper describes the motivation as follows:
> Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer[2].
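
To make the scaling claim concrete, here is a minimal sketch that counts how many query-key pairs each attention pattern scores. The function names are hypothetical, the window size of 512 is an illustrative assumption, and the sliding-window count is an upper bound that ignores edge effects and any global-attention tokens.

```python
def full_attention_pairs(seq_len: int) -> int:
    """Pairs scored by full self-attention: every token attends to every token."""
    return seq_len * seq_len


def sliding_window_pairs(seq_len: int, window: int) -> int:
    """Upper bound for sliding-window attention: each token attends to at most `window` tokens."""
    return seq_len * min(window, seq_len)


if __name__ == "__main__":
    window = 512  # assumed window size, for illustration only
    for n in (512, 1024, 2048, 4096):
        print(n, full_attention_pairs(n), sliding_window_pairs(n, window))
```

Doubling the sequence length quadruples the full-attention count but only doubles the sliding-window count, which is the quadratic versus linear behavior the quote refers to.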
### References
[1]. https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/longformer#transformers.LongformerConfig
[2]. https://arxiv.org/pdf/2004.05150.pdf
<br />
## Code Example
Use the pre-trained model "allenai/longformer-base-4096"
to generate a text embedding for the sentence "Hello, world.".
*Write the pipeline*:
```python
import towhee

# Embed one sentence with the default Longformer checkpoint.
towhee.dc(["Hello, world."]) \
      .text_embedding.longformer(model_name="allenai/longformer-base-4096")
```
<br />
## Factory Constructor
Create the operator via the following factory method:
***text_embedding.longformer(model_name="allenai/longformer-base-4096")***
**Parameters:**
***model_name***: *str*
The name of the pre-trained model to load.
The default value is "allenai/longformer-base-4096".
Supported model names:
- allenai/longformer-base-4096
- allenai/longformer-large-4096
- allenai/longformer-large-4096-finetuned-triviaqa
- allenai/longformer-base-4096-extra.pos.embd.only
- allenai/longformer-large-4096-extra.pos.embd.only
<br />
## Interface
The operator takes a text string as input.
It loads the tokenizer and pre-trained model by model name
and then returns the text embedding as an ndarray.
**Parameters:**
***text***: *str*
The input text.
**Returns**:
*numpy.ndarray*
The text embedding extracted by the model.
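
The flow described in this section (load the tokenizer and model by name, then return an ndarray) could be sketched with the Hugging Face `transformers` library as below. This is not the operator's actual source: `check_model_name` and `embed_text` are hypothetical helpers, and returning the final hidden states without pooling is an assumption about what "text embedding" means here.

```python
SUPPORTED_MODELS = (
    "allenai/longformer-base-4096",
    "allenai/longformer-large-4096",
    "allenai/longformer-large-4096-finetuned-triviaqa",
    "allenai/longformer-base-4096-extra.pos.embd.only",
    "allenai/longformer-large-4096-extra.pos.embd.only",
)


def check_model_name(model_name: str) -> str:
    """Return model_name unchanged if it is in the supported list, else raise ValueError."""
    if model_name not in SUPPORTED_MODELS:
        raise ValueError(f"Unsupported model name: {model_name!r}")
    return model_name


def embed_text(text: str, model_name: str = "allenai/longformer-base-4096"):
    """Hypothetical helper: tokenize `text`, run Longformer, return a numpy.ndarray."""
    # Heavy dependencies are imported lazily so the name check works without them.
    import torch
    from transformers import LongformerModel, LongformerTokenizer

    tokenizer = LongformerTokenizer.from_pretrained(check_model_name(model_name))
    model = LongformerModel.from_pretrained(model_name)
    model.eval()

    # Truncate to the tokenizer's maximum length so very long inputs do not fail.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # Assumption: use the final hidden states, shape (num_tokens, hidden_size).
    return outputs.last_hidden_state.squeeze(0).numpy()
```

Calling `embed_text("Hello, world.")` downloads the checkpoint on first use, so it needs network access as well as `torch` and `transformers` installed.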