# Text Embedding with Transformers

*author: [Jael Gu](https://github.com/jaelgu)*

<br />

## Description

A text embedding operator takes a sentence, paragraph, or document as a string
and outputs an embedding vector as an ndarray that captures the input's core semantic elements.
This operator is implemented with pre-trained models from [Huggingface Transformers](https://huggingface.co/docs/transformers).
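
One common way to use vectors that capture semantic content is to compare them by cosine similarity. The sketch below is only an illustration: the `cosine_similarity` helper and the toy three-dimensional vectors are hypothetical and are not produced by this operator, and it assumes each text has already been reduced to a single one-dimensional vector.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for embeddings (hypothetical values, not model output).
vec_hello = [0.9, 0.1, 0.3]
vec_hi = [0.8, 0.2, 0.4]
vec_invoice = [-0.2, 0.9, -0.5]

# Vectors pointing in similar directions score closer to 1.
print(cosine_similarity(vec_hello, vec_hi))
print(cosine_similarity(vec_hello, vec_invoice))
```

With real embeddings, texts with related meaning would be expected to score higher than unrelated ones, although the exact values depend on the chosen model.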

<br />

## Code Example

Use the pre-trained model 'distilbert-base-cased'
to generate a text embedding for the sentence "Hello, world.".

*Write the pipeline*:

```python
import towhee

# Wrap the input sentence in a DataCollection and apply the embedding operator
(
    towhee.dc(["Hello, world."])
          .text_embedding.transformers(model_name="distilbert-base-cased")
)
```

*Write the same pipeline with explicit input/output name specifications:*

```python
import towhee

# 'text' names the input column, 'vec' names the output embedding column
(
    towhee.dc['text'](["Hello, world."])
          .text_embedding.transformers['text', 'vec'](model_name="distilbert-base-cased")
          .show()
)
```

<img src="./result.png" width="800px"/>

<br />

## Factory Constructor

Create the operator via the following factory method:

***text_embedding.transformers(model_name=None, checkpoint_path=None, tokenizer=None)***

**Parameters:**

***model_name***: *str*

The model name as a string, defaults to None.
If None, the operator will be initialized without a specified model.

Supported model names:

<details><summary>Albert</summary>

- albert-base-v1
- albert-large-v1
- albert-xlarge-v1
- albert-xxlarge-v1
- albert-base-v2
- albert-large-v2
- albert-xlarge-v2
- albert-xxlarge-v2

</details>

<details><summary>Bart</summary>

- facebook/bart-large

</details>

<details><summary>Bert</summary>

- bert-base-cased
- bert-base-uncased
- bert-large-cased
- bert-large-uncased
- bert-base-multilingual-uncased
- bert-base-multilingual-cased
- bert-base-chinese
- bert-base-german-cased
- bert-large-uncased-whole-word-masking
- bert-large-cased-whole-word-masking
- bert-large-uncased-whole-word-masking-finetuned-squad
- bert-large-cased-whole-word-masking-finetuned-squad
- bert-base-cased-finetuned-mrpc
- bert-base-german-dbmdz-cased
- bert-base-german-dbmdz-uncased
- cl-tohoku/bert-base-japanese-whole-word-masking
- cl-tohoku/bert-base-japanese-char
- cl-tohoku/bert-base-japanese-char-whole-word-masking
- TurkuNLP/bert-base-finnish-cased-v1
- TurkuNLP/bert-base-finnish-uncased-v1
- wietsedv/bert-base-dutch-cased

</details>

<details><summary>BertGeneration</summary>

- google/bert_for_seq_generation_L-24_bbc_encoder

</details>

<details><summary>BigBird</summary>

- google/bigbird-roberta-base
- google/bigbird-roberta-large
- google/bigbird-base-trivia-itc

</details>

<details><summary>BigBirdPegasus</summary>

- google/bigbird-pegasus-large-arxiv
- google/bigbird-pegasus-large-pubmed
- google/bigbird-pegasus-large-bigpatent

</details>

<details><summary>CamemBert</summary>

- camembert-base
- Musixmatch/umberto-commoncrawl-cased-v1
- Musixmatch/umberto-wikipedia-uncased-v1

</details>

<details><summary>Canine</summary>

- google/canine-s
- google/canine-c

</details>

<details><summary>Convbert</summary>

- YituTech/conv-bert-base
- YituTech/conv-bert-medium-small
- YituTech/conv-bert-small

</details>

<details><summary>CTRL</summary>

- ctrl

</details>

<details><summary>DeBERTa</summary>

- microsoft/deberta-base
- microsoft/deberta-large
- microsoft/deberta-xlarge
- microsoft/deberta-base-mnli
- microsoft/deberta-large-mnli
- microsoft/deberta-xlarge-mnli
- microsoft/deberta-v2-xlarge
- microsoft/deberta-v2-xxlarge
- microsoft/deberta-v2-xlarge-mnli
- microsoft/deberta-v2-xxlarge-mnli

</details>

<details><summary>DistilBert</summary>

- distilbert-base-uncased
- distilbert-base-uncased-distilled-squad
- distilbert-base-cased
- distilbert-base-cased-distilled-squad
- distilbert-base-german-cased
- distilbert-base-multilingual-cased
- distilbert-base-uncased-finetuned-sst-2-english

</details>

<details><summary>Electra</summary>

- google/electra-small-generator
- google/electra-base-generator
- google/electra-large-generator
- google/electra-small-discriminator
- google/electra-base-discriminator
- google/electra-large-discriminator

</details>

<details><summary>Flaubert</summary>

- flaubert/flaubert_small_cased
- flaubert/flaubert_base_uncased
- flaubert/flaubert_base_cased
- flaubert/flaubert_large_cased

</details>

<details><summary>FNet</summary>

- google/fnet-base
- google/fnet-large

</details>

<details><summary>FSMT</summary>

- facebook/wmt19-ru-en

</details>

<details><summary>Funnel</summary>

- funnel-transformer/small
- funnel-transformer/small-base
- funnel-transformer/medium
- funnel-transformer/medium-base
- funnel-transformer/intermediate
- funnel-transformer/intermediate-base
- funnel-transformer/large
- funnel-transformer/large-base
- funnel-transformer/xlarge-base
- funnel-transformer/xlarge

</details>

<details><summary>GPT</summary>

- openai-gpt
- gpt2
- gpt2-medium
- gpt2-large
- gpt2-xl
- distilgpt2
- EleutherAI/gpt-neo-1.3B
- EleutherAI/gpt-j-6B

</details>

<details><summary>I-Bert</summary>

- kssteven/ibert-roberta-base

</details>

<details><summary>LED</summary>

- allenai/led-base-16384

</details>

<details><summary>MobileBert</summary>

- google/mobilebert-uncased

</details>

<details><summary>MPNet</summary>

- microsoft/mpnet-base

</details>

<details><summary>Nystromformer</summary>

- uw-madison/nystromformer-512

</details>

<details><summary>Reformer</summary>

- google/reformer-crime-and-punishment

</details>

<details><summary>Splinter</summary>

- tau/splinter-base
- tau/splinter-base-qass
- tau/splinter-large
- tau/splinter-large-qass

</details>

<details><summary>SqueezeBert</summary>

- squeezebert/squeezebert-uncased
- squeezebert/squeezebert-mnli
- squeezebert/squeezebert-mnli-headless

</details>

<details><summary>TransfoXL</summary>

- transfo-xl-wt103

</details>

<details><summary>XLM</summary>

- xlm-mlm-en-2048
- xlm-mlm-ende-1024
- xlm-mlm-enfr-1024
- xlm-mlm-enro-1024
- xlm-mlm-tlm-xnli15-1024
- xlm-mlm-xnli15-1024
- xlm-clm-enfr-1024
- xlm-clm-ende-1024
- xlm-mlm-17-1280
- xlm-mlm-100-1280

</details>

<details><summary>XLMRoberta</summary>

- xlm-roberta-base
- xlm-roberta-large
- xlm-roberta-large-finetuned-conll02-dutch
- xlm-roberta-large-finetuned-conll02-spanish
- xlm-roberta-large-finetuned-conll03-english
- xlm-roberta-large-finetuned-conll03-german

</details>

<details><summary>XLNet</summary>

- xlnet-base-cased
- xlnet-large-cased

</details>

<details><summary>Yoso</summary>

- uw-madison/yoso-4096

</details>

<br />

***checkpoint_path***: *str*

The path to a local checkpoint, defaults to None.
If None, the operator will download and load the pretrained model by `model_name` from Huggingface transformers.

<br />

***tokenizer***: *object*

The method used to tokenize input text, defaults to None.
If None, the operator will use the default tokenizer for `model_name` from Huggingface transformers.
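
The fallback rules for these parameters can be restated as a small sketch. `describe_loading_plan` is a hypothetical helper written only for illustration: it does not call Towhee or Huggingface, and it assumes that nothing is loaded when `model_name` is None regardless of the other arguments, which the documentation does not state explicitly.

```python
def describe_loading_plan(model_name=None, checkpoint_path=None, tokenizer=None):
    """Restate the documented defaults as a plain dictionary (illustrative only)."""
    if model_name is None:
        # Documented: the operator is initialized without a specified model.
        # Assumption: the other arguments are not used in this case.
        return {"model": None, "weights": None, "tokenizer": None}

    if checkpoint_path is not None:
        weights = f"local checkpoint at {checkpoint_path}"
    else:
        # Documented: download and load pretrained weights by model_name.
        weights = f"pretrained weights downloaded for {model_name}"

    if tokenizer is not None:
        tok = "user-supplied tokenizer"
    else:
        # Documented: fall back to the default tokenizer for model_name.
        tok = f"default tokenizer for {model_name}"

    return {"model": model_name, "weights": weights, "tokenizer": tok}


plan = describe_loading_plan(model_name="distilbert-base-cased")
for key, value in plan.items():
    print(f"{key}: {value}")
```

Passing only `model_name`, as in the pipeline examples, relies on both fallbacks: pretrained weights and the default tokenizer.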

<br />

## Interface

The operator takes a piece of text as a string.
It loads the tokenizer and pre-trained model by model name,
and then returns the text embedding as an ndarray.

***\_\_call\_\_(txt)***

**Parameters:**

***txt***: *str*

The input text as a string.

**Returns**:

*numpy.ndarray*

The text embedding extracted by the model.

<br />

***save_model(format='pytorch', path='default')***

Save the model locally in the specified format.

**Parameters:**

***format***: *str*

The format of the saved model, defaults to 'pytorch'.

***path***: *str*

The path where the model is saved. By default, the model is saved to the operator directory.

<br />

***supported_model_names(format=None)***

Get a list of all supported model names, or only those supported for a specified model format.

**Parameters:**

***format***: *str*

The model format, such as 'pytorch', 'torchscript', or 'onnx'. Defaults to None, which returns all supported model names.

```python
from towhee import ops

# Create the operator without a model just to query the supported names
op = ops.text_embedding.transformers().get_op()
full_list = op.supported_model_names()
onnx_list = op.supported_model_names(format='onnx')
print(f'Onnx-support/Total Models: {len(onnx_list)}/{len(full_list)}')
```

    2022-12-13 16:25:15,916 - 140704500614336 - auto_transformers.py-auto_transformers:68 - WARNING: The operator is initialized without specified model.
    Onnx-support/Total Models: 111/126
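
Because the example above treats the results of `supported_model_names` as lists, they can be compared with ordinary Python to find which models lack a given format. The sketch below assumes the lists contain model-name strings and uses short hypothetical sample lists in place of the real return values; `models_without_format` is an illustrative helper, not part of the operator.

```python
def models_without_format(full_list, format_list):
    """Return the names in full_list that are missing from format_list, sorted."""
    supported = set(format_list)
    return sorted(name for name in full_list if name not in supported)


# Hypothetical sample data standing in for op.supported_model_names(...).
# It is not a statement about which models actually support ONNX.
full_list = ["bert-base-cased", "distilbert-base-cased", "gpt2", "xlnet-base-cased"]
onnx_list = ["bert-base-cased", "distilbert-base-cased", "gpt2"]

missing = models_without_format(full_list, onnx_list)
print(f"{len(missing)} model(s) without ONNX support: {missing}")
```

Replacing the sample lists with the values returned by the operator would list the models behind the gap between the two counts printed above.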