
Update README

Signed-off-by: Jael Gu <mengjia.gu@zilliz.com>
Branch: main
Jael Gu, 3 years ago
commit 4e323b03ef

Changed files:
  1. README.md (65 lines changed)
  2. __init__.py (4 lines changed)
  3. longformer.py (24 lines changed)

README.md (65 lines changed)

@@ -1,9 +1,8 @@
-# Text Embedding with longformer
+# Text Embedding with dpr
 *author: Kyle He*
 ## Desription
 This operator uses Longformer to convert long text to embeddings.
@@ -23,26 +22,42 @@ length, making it easy to process documents of thousands of tokens or longer[2].
 [2].https://arxiv.org/pdf/2004.05150.pdf
+## Code Example
+Use the pretrained model "facebook/dpr-ctx_encoder-single-nq-base"
+to generate a text embedding for the sentence "Hello, world.".
+*Write the pipeline*:
 ```python
-from towhee import ops
-text_encoder = ops.text_embedding.longformer(model_name="allenai/longformer-base-4096")
-text_embedding = text_encoder("Hello, world.")
+from towhee import dc
+dc.stream(["Hello, world."])
+    .text_embedding.longformer("allenai/longformer-base-4096")
+    .show()
 ```
 ## Factory Constructor
 Create the operator via the following factory method
-***ops.text_embedding.longformer(model_name)***
+***text_embedding.dpr(model_name="allenai/longformer-base-4096")***
+**Parameters:**
+***model_name***: *str*
+The model name in string.
+The default value is "allenai/longformer-base-4096".
+You can get the list of supported model names by calling `get_model_list` from [longformer.py](https://towhee.io/text-embedding/longformer/src/branch/main/longformer.py).
-## Interface
-A text embedding operator takes a sentence, paragraph, or document in string as an input
-and output an embedding vector in ndarray which captures the input's core semantic elements.
+## Interface
+The operator takes a text in string as input.
+It loads tokenizer and pre-trained model using model name.
+and then return text embedding in ndarray.
 **Parameters:**
@@ -52,34 +67,8 @@ and output an embedding vector in ndarray which captures the input's core semant
-**Returns**:
-*numpy.ndarray*
-The text embedding extracted by model.
+**Returns**: *numpy.ndarray*
+The text embedding extracted by model.
-## Code Example
-Use the pretrained model ('allenai/longformer-base-4096')
-to generate a text embedding for the sentence "Hello, world.".
-*Write the pipeline in simplified style*:
-```python
-import towhee.DataCollection as dc
-dc.glob("Hello, world.")
-    .text_embedding.longformer('longformer-base-4096')
-    .show()
-```
-*Write a same pipeline with explicit inputs/outputs name specifications:*
-```python
-from towhee import DataCollection as dc
-dc.glob['text']('Hello, world.')
-    .text_embedding.longformer['text', 'vec']('longformer-base-4096')
-    .select('vec')
-    .show()
-```
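The README changes above describe the operator's output as a `numpy.ndarray` embedding that "captures the input's core semantic elements". As a minimal sketch of what that return value looks like, here is mean-pooling of token-level hidden states into a single fixed-size vector with numpy; the shapes (6 tokens, Longformer-base's hidden size of 768) and the random data are assumptions for illustration, not the model's real output:

```python
import numpy as np

# Hypothetical token-level hidden states for a 6-token input;
# 768 is Longformer-base's hidden size (data is random, for shape only).
rng = np.random.default_rng(0)
token_states = rng.standard_normal((6, 768)).astype(np.float32)

# Mean-pool over the token axis to get one fixed-size sentence embedding,
# the kind of numpy.ndarray the operator's __call__ returns.
embedding = token_states.mean(axis=0)

print(embedding.shape)  # (768,)
```

The pooling strategy itself is a common choice, not necessarily the one this operator uses internally.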

__init__.py (4 lines changed)

@@ -15,5 +15,5 @@
 from .longformer import Longformer
-def longformer(model_name: str):
-    return Longformer(model_name)
+def longformer(**kwargs):
+    return Longformer(**kwargs)
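The switch from a positional `model_name` to `**kwargs` keeps the factory's signature stable as the operator grows constructor arguments. A toy stand-in (not towhee's actual class; the `device` parameter is a hypothetical future addition) shows the pattern:

```python
# Toy stand-in for the operator class, for illustration only.
class Longformer:
    def __init__(self, model_name: str = 'allenai/longformer-base-4096',
                 device: str = 'cpu'):  # 'device' is a hypothetical extra arg
        self.model_name = model_name
        self.device = device

# The factory forwards **kwargs unchanged, so adding a new constructor
# argument later requires no change to the factory itself.
def longformer(**kwargs):
    return Longformer(**kwargs)

op = longformer(model_name='allenai/longformer-large-4096')
print(op.model_name, op.device)  # allenai/longformer-large-4096 cpu
```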

longformer.py (24 lines changed)

@@ -1,14 +1,16 @@
 import numpy
 import torch
 from transformers import LongformerTokenizer, LongformerModel
+import logging
 from towhee.operator import NNOperator
 from towhee import register
 import warnings
 import logging
 warnings.filterwarnings('ignore')
+logging.getLogger("transformers").setLevel(logging.ERROR)
 log = logging.getLogger()
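The added `setLevel` line raises the threshold of the `transformers` library's named logger so its INFO/WARNING messages (such as weight-loading notices) are suppressed while errors still pass through. A standalone sketch of the effect, outside the operator:

```python
import logging

# As in the diff: raise the transformers logger's threshold to ERROR.
logging.getLogger("transformers").setLevel(logging.ERROR)

# Records below ERROR are now filtered out for that logger.
print(logging.getLogger("transformers").isEnabledFor(logging.WARNING))  # False
print(logging.getLogger("transformers").isEnabledFor(logging.ERROR))   # True
```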
@@ -24,7 +26,7 @@ class Longformer(NNOperator):
     model_name (`str`):
         Which model to use for the embeddings.
     """
-    def __init__(self, model_name: str) -> None:
+    def __init__(self, model_name: str = 'allenai/longformer-base-4096') -> None:
         super().__init__()
         self.model_name = model_name
         try:
@@ -55,5 +57,17 @@ class Longformer(NNOperator):
         except Exception as e:
             log.error(f'Fail to extract features by model: {self.model_name}')
             raise e
-        feature_vector = feature_vector.detach().numpy()
-        return feature_vector
+        vec = feature_vector.detach().numpy()
+        return vec
+
+def get_model_list():
+    full_list = [
+        "allenai/longformer-base-4096",
+        "allenai/longformer-large-4096",
+        "allenai/longformer-large-4096-finetuned-triviaqa",
+        "allenai/longformer-base-4096-extra.pos.embd.only",
+        "allenai/longformer-large-4096-extra.pos.embd.only",
+    ]
+    full_list.sort()
+    return full_list
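The new `get_model_list` returns the supported names in sorted order, which the updated README points users to. A small sketch of how a caller might use it to fail fast on a mistyped name before loading any weights; the list is copied from the diff, while the `check_model_name` guard is an illustration, not part of the operator:

```python
# Supported-model list as added in this commit (copied from the diff).
def get_model_list():
    full_list = [
        "allenai/longformer-base-4096",
        "allenai/longformer-large-4096",
        "allenai/longformer-large-4096-finetuned-triviaqa",
        "allenai/longformer-base-4096-extra.pos.embd.only",
        "allenai/longformer-large-4096-extra.pos.embd.only",
    ]
    full_list.sort()
    return full_list

# Illustrative guard (hypothetical helper): validate a name before
# constructing the operator and downloading model weights.
def check_model_name(name: str) -> str:
    if name not in get_model_list():
        raise ValueError(f"Unsupported model: {name}")
    return name

print(check_model_name("allenai/longformer-base-4096"))
```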
