# Text Embedding with Transformers
*author: Jael Gu*
<br />
## Description
A text embedding operator takes a sentence, paragraph, or document as a string input
and outputs an embedding vector as a numpy ndarray that captures the input's core semantic elements.
This operator is implemented with pretrained models from [Huggingface Transformers](https://huggingface.co/docs/transformers).
<br />
## Code Example
Use the pretrained model 'distilbert-base-cased'
to generate a text embedding for the sentence "Hello, world.".
*Write the pipeline*:
```python
import towhee

towhee.dc(["Hello, world."]) \
    .text_embedding.transformers(model_name="distilbert-base-cased")
```
<br />
## Factory Constructor
Create the operator via the following factory method:
***text_embedding.transformers(model_name="bert-base-uncased")***
**Parameters:**
***model_name***: *str*
The name of the model to load.
Defaults to "bert-base-uncased".
Supported model names:
- [Bert](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/bert)
- bert-base-cased
- bert-large-cased
- bert-base-multilingual-uncased
- bert-base-multilingual-cased
- bert-base-chinese
- bert-base-german-cased
- bert-large-uncased-whole-word-masking
- bert-large-cased-whole-word-masking
- bert-large-uncased-whole-word-masking-finetuned-squad
- bert-large-cased-whole-word-masking-finetuned-squad
- bert-base-cased-finetuned-mrpc
- bert-base-german-dbmdz-cased
- bert-base-german-dbmdz-uncased
- cl-tohoku/bert-base-japanese-whole-word-masking
- cl-tohoku/bert-base-japanese-char
- cl-tohoku/bert-base-japanese-char-whole-word-masking
- TurkuNLP/bert-base-finnish-cased-v1
- TurkuNLP/bert-base-finnish-uncased-v1
- wietsedv/bert-base-dutch-cased
- [BigBird](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/big_bird)
- google/bigbird-roberta-base
- google/bigbird-roberta-large
- google/bigbird-base-trivia-itc
- [Albert](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/albert)
- albert-base-v1
- albert-large-v1
- albert-xlarge-v1
- albert-xxlarge-v1
- albert-base-v2
- albert-large-v2
- albert-xlarge-v2
- albert-xxlarge-v2
- [Bart](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/bart)
- facebook/bart-large
- [BertGeneration](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/bert-generation)
- google/bert_for_seq_generation_L-24_bbc_encoder
- [BigBirdPegasus](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/bigbird_pegasus)
- google/bigbird-pegasus-large-arxiv
- google/bigbird-pegasus-large-pubmed
- google/bigbird-pegasus-large-bigpatent
- [CamemBert](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/camembert)
- camembert-base
- Musixmatch/umberto-commoncrawl-cased-v1
- Musixmatch/umberto-wikipedia-uncased-v1
- [Canine](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/canine)
- google/canine-s
- google/canine-c
- [Convbert](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/convbert)
- YituTech/conv-bert-base
- YituTech/conv-bert-medium-small
- YituTech/conv-bert-small
- [CTRL](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/ctrl)
- ctrl
- [DeBERTa](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/deberta)
- microsoft/deberta-base
- microsoft/deberta-large
- microsoft/deberta-xlarge
- microsoft/deberta-base-mnli
- microsoft/deberta-large-mnli
- microsoft/deberta-xlarge-mnli
- microsoft/deberta-v2-xlarge
- microsoft/deberta-v2-xxlarge
- microsoft/deberta-v2-xlarge-mnli
- microsoft/deberta-v2-xxlarge-mnli
- [DistilBert](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/distilbert)
- distilbert-base-uncased
- distilbert-base-uncased-distilled-squad
- distilbert-base-cased
- distilbert-base-cased-distilled-squad
- distilbert-base-german-cased
- distilbert-base-multilingual-cased
- distilbert-base-uncased-finetuned-sst-2-english
- [Electra](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/electra)
- google/electra-small-generator
- google/electra-base-generator
- google/electra-large-generator
- google/electra-small-discriminator
- google/electra-base-discriminator
- google/electra-large-discriminator
- [Flaubert](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/flaubert)
- flaubert/flaubert_small_cased
- flaubert/flaubert_base_uncased
- flaubert/flaubert_base_cased
- flaubert/flaubert_large_cased
- [FNet](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/fnet)
- google/fnet-base
- google/fnet-large
- [FSMT](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/fsmt)
- facebook/wmt19-ru-en
- [Funnel](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/funnel)
- funnel-transformer/small
- funnel-transformer/small-base
- funnel-transformer/medium
- funnel-transformer/medium-base
- funnel-transformer/intermediate
- funnel-transformer/intermediate-base
- funnel-transformer/large
- funnel-transformer/large-base
- funnel-transformer/xlarge-base
- funnel-transformer/xlarge
- [GPT](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/openai-gpt)
- openai-gpt
- gpt2
- gpt2-medium
- gpt2-large
- gpt2-xl
- distilgpt2
- EleutherAI/gpt-neo-1.3B
- EleutherAI/gpt-j-6B
- [I-Bert](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/ibert)
- kssteven/ibert-roberta-base
- [LED](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/led)
- allenai/led-base-16384
- [MobileBert](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/mobilebert)
- google/mobilebert-uncased
- [MPNet](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/mpnet)
- microsoft/mpnet-base
- [Nystromformer](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/nystromformer)
- uw-madison/nystromformer-512
- [Reformer](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/reformer)
- google/reformer-crime-and-punishment
- [Splinter](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/splinter)
- tau/splinter-base
- tau/splinter-base-qass
- tau/splinter-large
- tau/splinter-large-qass
- [SqueezeBert](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/squeezebert)
- squeezebert/squeezebert-uncased
- squeezebert/squeezebert-mnli
- squeezebert/squeezebert-mnli-headless
- [TransfoXL](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/transfo-xl)
- transfo-xl-wt103
- [XLM](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/xlm)
- xlm-mlm-en-2048
- xlm-mlm-ende-1024
- xlm-mlm-enfr-1024
- xlm-mlm-enro-1024
- xlm-mlm-tlm-xnli15-1024
- xlm-mlm-xnli15-1024
- xlm-clm-enfr-1024
- xlm-clm-ende-1024
- xlm-mlm-17-1280
- xlm-mlm-100-1280
- [XLMRoberta](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/xlm-roberta)
- xlm-roberta-base
- xlm-roberta-large
- xlm-roberta-large-finetuned-conll02-dutch
- xlm-roberta-large-finetuned-conll02-spanish
- xlm-roberta-large-finetuned-conll03-english
- xlm-roberta-large-finetuned-conll03-german
- [XLNet](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/xlnet)
- xlnet-base-cased
- xlnet-large-cased
- [Yoso](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/yoso)
- uw-madison/yoso-4096
<br />
## Interface
The operator takes a string of text as input.
It loads the tokenizer and pretrained model by model name
and then returns the text embedding as a numpy ndarray.
**Parameters:**
***text***: *str*
The input text.
**Returns**:
*numpy.ndarray*
The text embedding extracted by the model.
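To illustrate what happens under the hood, here is a minimal sketch using Huggingface Transformers directly. Note this is not the operator's exact implementation: the pooling strategy (mean pooling over the last hidden state) is an assumption for illustration only, and the actual operator may pool differently.

```python
# Sketch of transformer-based text embedding with Huggingface Transformers.
# Assumption: mean pooling over the last hidden state; the Towhee operator's
# actual pooling strategy may differ.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

text = "Hello, world."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Pool per-token embeddings into a single fixed-size vector (assumed strategy).
embedding = outputs.last_hidden_state.mean(dim=1).squeeze(0).numpy()
print(embedding.shape)  # one vector of size hidden_size, e.g. (768,)
```

The resulting ndarray has length equal to the model's hidden size (768 for distilbert-base-cased), regardless of the input's length.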