# Text Embedding with Transformers
*author: Jael Gu*
<br />
## Description
A text embedding operator takes a sentence, paragraph, or document as a string input
and outputs an embedding vector as a numpy ndarray that captures the input's core semantic elements.
This operator is implemented with pretrained models from [Huggingface Transformers](https://huggingface.co/docs/transformers).
<br />
## Code Example
Use the pretrained model 'distilbert-base-cased'
to generate a text embedding for the sentence "Hello, world.".
*Write the pipeline*:
```python
import towhee

towhee.dc(["Hello, world."]) \
    .text_embedding.transformers(model_name="distilbert-base-cased")
```
<br />
## Factory Constructor
Create the operator via the following factory method:
***text_embedding.transformers(model_name="bert-base-uncased")***
**Parameters:**
***model_name***: *str*
The name of the model to load.
Defaults to "bert-base-uncased".
Supported model names:
- [Bert](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/bert)
- bert-base-cased
- bert-large-cased
- bert-base-multilingual-uncased
- bert-base-multilingual-cased
- bert-base-chinese
- bert-base-german-cased
- bert-large-uncased-whole-word-masking
- bert-large-cased-whole-word-masking
- bert-large-uncased-whole-word-masking-finetuned-squad
- bert-large-cased-whole-word-masking-finetuned-squad
- bert-base-cased-finetuned-mrpc
- bert-base-german-dbmdz-cased
- bert-base-german-dbmdz-uncased
- cl-tohoku/bert-base-japanese-whole-word-masking
- cl-tohoku/bert-base-japanese-char
- cl-tohoku/bert-base-japanese-char-whole-word-masking
- TurkuNLP/bert-base-finnish-cased-v1
- TurkuNLP/bert-base-finnish-uncased-v1
- wietsedv/bert-base-dutch-cased
- [BigBird](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/big_bird)
- google/bigbird-roberta-base
- google/bigbird-roberta-large
- google/bigbird-base-trivia-itc
- [Albert](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/albert)
- albert-base-v1
- albert-large-v1
- albert-xlarge-v1
- albert-xxlarge-v1
- albert-base-v2
- albert-large-v2
- albert-xlarge-v2
- albert-xxlarge-v2
- [Bart](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/bart)
- facebook/bart-large
- [BertGeneration](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/bert-generation)
- google/bert_for_seq_generation_L-24_bbc_encoder
- [BigBirdPegasus](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/bigbird_pegasus)
- google/bigbird-pegasus-large-arxiv
- google/bigbird-pegasus-large-pubmed
- google/bigbird-pegasus-large-bigpatent
- [CamemBert](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/camembert)
- camembert-base
- Musixmatch/umberto-commoncrawl-cased-v1
- Musixmatch/umberto-wikipedia-uncased-v1
- [Canine](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/canine)
- google/canine-s
- google/canine-c
- [Convbert](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/convbert)
- YituTech/conv-bert-base
- YituTech/conv-bert-medium-small
- YituTech/conv-bert-small
- [CTRL](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/ctrl)
- ctrl
- [DeBERTa](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/deberta)
- microsoft/deberta-base
- microsoft/deberta-large
- microsoft/deberta-xlarge
- microsoft/deberta-base-mnli
- microsoft/deberta-large-mnli
- microsoft/deberta-xlarge-mnli
- microsoft/deberta-v2-xlarge
- microsoft/deberta-v2-xxlarge
- microsoft/deberta-v2-xlarge-mnli
- microsoft/deberta-v2-xxlarge-mnli
- [DistilBert](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/distilbert)
- distilbert-base-uncased
- distilbert-base-uncased-distilled-squad
- distilbert-base-cased
- distilbert-base-cased-distilled-squad
- distilbert-base-german-cased
- distilbert-base-multilingual-cased
- distilbert-base-uncased-finetuned-sst-2-english
- [Electra](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/electra)
- google/electra-small-generator
- google/electra-base-generator
- google/electra-large-generator
- google/electra-small-discriminator
- google/electra-base-discriminator
- google/electra-large-discriminator
- [Flaubert](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/flaubert)
- flaubert/flaubert_small_cased
- flaubert/flaubert_base_uncased
- flaubert/flaubert_base_cased
- flaubert/flaubert_large_cased
- [FNet](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/fnet)
- google/fnet-base
- google/fnet-large
- [FSMT](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/fsmt)
- facebook/wmt19-ru-en
- [Funnel](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/funnel)
- funnel-transformer/small
- funnel-transformer/small-base
- funnel-transformer/medium
- funnel-transformer/medium-base
- funnel-transformer/intermediate
- funnel-transformer/intermediate-base
- funnel-transformer/large
- funnel-transformer/large-base
- funnel-transformer/xlarge-base
- funnel-transformer/xlarge
- [GPT](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/openai-gpt)
- openai-gpt
- gpt2
- gpt2-medium
- gpt2-large
- gpt2-xl
- distilgpt2
- EleutherAI/gpt-neo-1.3B
- EleutherAI/gpt-j-6B
- [I-Bert](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/ibert)
- kssteven/ibert-roberta-base
- [LED](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/led)
- allenai/led-base-16384
- [MobileBert](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/mobilebert)
- google/mobilebert-uncased
- [MPNet](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/mpnet)
- microsoft/mpnet-base
- [Nystromformer](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/nystromformer)
- uw-madison/nystromformer-512
- [Reformer](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/reformer)
- google/reformer-crime-and-punishment
- [Splinter](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/splinter)
- tau/splinter-base
- tau/splinter-base-qass
- tau/splinter-large
- tau/splinter-large-qass
- [SqueezeBert](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/squeezebert)
- squeezebert/squeezebert-uncased
- squeezebert/squeezebert-mnli
- squeezebert/squeezebert-mnli-headless
- [TransfoXL](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/transfo-xl)
- transfo-xl-wt103
- [XLM](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/xlm)
- xlm-mlm-en-2048
- xlm-mlm-ende-1024
- xlm-mlm-enfr-1024
- xlm-mlm-enro-1024
- xlm-mlm-tlm-xnli15-1024
- xlm-mlm-xnli15-1024
- xlm-clm-enfr-1024
- xlm-clm-ende-1024
- xlm-mlm-17-1280
- xlm-mlm-100-1280
- [XLMRoberta](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/xlm-roberta)
- xlm-roberta-base
- xlm-roberta-large
- xlm-roberta-large-finetuned-conll02-dutch
- xlm-roberta-large-finetuned-conll02-spanish
- xlm-roberta-large-finetuned-conll03-english
- xlm-roberta-large-finetuned-conll03-german
- [XLNet](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/xlnet)
- xlnet-base-cased
- xlnet-large-cased
- [Yoso](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/yoso)
- uw-madison/yoso-4096
<br />
## Interface
The operator takes a string of text as input.
It loads the tokenizer and pretrained model by model name
and then returns the text embedding as a numpy ndarray.
**Parameters:**
***text***: *str*
The input text.
**Returns**:
*numpy.ndarray*
The text embedding extracted by the model.
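To illustrate what happens under the hood, here is a minimal sketch using Huggingface Transformers directly. Note this is not the operator's exact implementation: the pooling strategy (mean pooling over the last hidden state) is an assumption for illustration only, and the actual operator may pool differently.

```python
# Sketch of transformer-based text embedding with Huggingface Transformers.
# Assumption: mean pooling over the last hidden state; the Towhee operator's
# actual pooling strategy may differ.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

text = "Hello, world."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Pool per-token embeddings into a single fixed-size vector (assumed strategy).
embedding = outputs.last_hidden_state.mean(dim=1).squeeze(0).numpy()
print(embedding.shape)  # one vector of size hidden_size, e.g. (768,)
```

The resulting ndarray has length equal to the model's hidden size (768 for distilbert-base-cased), regardless of the input's length.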