Text Embedding with Transformers
author: Jael Gu
Description
A text embedding operator takes a sentence, paragraph, or document as an input string and outputs an embedding vector (a numpy.ndarray) that captures the input's core semantic elements. This operator is implemented with pretrained models from Hugging Face Transformers.
Code Example
Use the pretrained model 'distilbert-base-cased' to generate a text embedding for the sentence "Hello, world.".
Write the pipeline:
import towhee
towhee.dc(["Hello, world."]) \
.text_embedding.transformers(model_name="distilbert-base-cased")
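Under the hood, the operator loads a tokenizer and model by name and pools the token-level hidden states into one vector. A rough sketch of the equivalent steps using Hugging Face Transformers directly (mean pooling over tokens is an assumption here; the operator's exact pooling strategy may differ):

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenize the input sentence and run it through the encoder.
inputs = tokenizer("Hello, world.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Average the token-level hidden states into one sentence vector.
embedding = outputs.last_hidden_state.squeeze(0).mean(dim=0).numpy()
```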
Factory Constructor
Create the operator via the following factory method:
text_embedding.transformers(model_name="bert-base-uncased")
Parameters:
model_name: str
The name of the model to load. The default is "bert-base-uncased".
Supported model names:
- Bert
- bert-base-cased
- bert-large-cased
- bert-base-multilingual-uncased
- bert-base-multilingual-cased
- bert-base-chinese
- bert-base-german-cased
- bert-large-uncased-whole-word-masking
- bert-large-cased-whole-word-masking
- bert-large-uncased-whole-word-masking-finetuned-squad
- bert-large-cased-whole-word-masking-finetuned-squad
- bert-base-cased-finetuned-mrpc
- bert-base-german-dbmdz-cased
- bert-base-german-dbmdz-uncased
- cl-tohoku/bert-base-japanese-whole-word-masking
- cl-tohoku/bert-base-japanese-char
- cl-tohoku/bert-base-japanese-char-whole-word-masking
- TurkuNLP/bert-base-finnish-cased-v1
- TurkuNLP/bert-base-finnish-uncased-v1
- wietsedv/bert-base-dutch-cased
- BigBird
- google/bigbird-roberta-base
- google/bigbird-roberta-large
- google/bigbird-base-trivia-itc
- Albert
- albert-base-v1
- albert-large-v1
- albert-xlarge-v1
- albert-xxlarge-v1
- albert-base-v2
- albert-large-v2
- albert-xlarge-v2
- albert-xxlarge-v2
- Bart
- facebook/bart-large
- BertGeneration
- google/bert_for_seq_generation_L-24_bbc_encoder
- BigBirdPegasus
- google/bigbird-pegasus-large-arxiv
- google/bigbird-pegasus-large-pubmed
- google/bigbird-pegasus-large-bigpatent
- CamemBert
- camembert-base
- Musixmatch/umberto-commoncrawl-cased-v1
- Musixmatch/umberto-wikipedia-uncased-v1
- Canine
- google/canine-s
- google/canine-c
- Convbert
- YituTech/conv-bert-base
- YituTech/conv-bert-medium-small
- YituTech/conv-bert-small
- CTRL
- ctrl
- DeBERTa
- microsoft/deberta-base
- microsoft/deberta-large
- microsoft/deberta-xlarge
- microsoft/deberta-base-mnli
- microsoft/deberta-large-mnli
- microsoft/deberta-xlarge-mnli
- microsoft/deberta-v2-xlarge
- microsoft/deberta-v2-xxlarge
- microsoft/deberta-v2-xlarge-mnli
- microsoft/deberta-v2-xxlarge-mnli
- DistilBert
- distilbert-base-uncased
- distilbert-base-uncased-distilled-squad
- distilbert-base-cased
- distilbert-base-cased-distilled-squad
- distilbert-base-german-cased
- distilbert-base-multilingual-cased
- distilbert-base-uncased-finetuned-sst-2-english
- Electra
- google/electra-small-generator
- google/electra-base-generator
- google/electra-large-generator
- google/electra-small-discriminator
- google/electra-base-discriminator
- google/electra-large-discriminator
- Flaubert
- flaubert/flaubert_small_cased
- flaubert/flaubert_base_uncased
- flaubert/flaubert_base_cased
- flaubert/flaubert_large_cased
- FNet
- google/fnet-base
- google/fnet-large
- FSMT
- facebook/wmt19-ru-en
- Funnel
- funnel-transformer/small
- funnel-transformer/small-base
- funnel-transformer/medium
- funnel-transformer/medium-base
- funnel-transformer/intermediate
- funnel-transformer/intermediate-base
- funnel-transformer/large
- funnel-transformer/large-base
- funnel-transformer/xlarge-base
- funnel-transformer/xlarge
- GPT
- openai-gpt
- gpt2
- gpt2-medium
- gpt2-large
- gpt2-xl
- distilgpt2
- EleutherAI/gpt-neo-1.3B
- EleutherAI/gpt-j-6B
- I-Bert
- kssteven/ibert-roberta-base
- LED
- allenai/led-base-16384
- MobileBert
- google/mobilebert-uncased
- MPNet
- microsoft/mpnet-base
- Nystromformer
- uw-madison/nystromformer-512
- Reformer
- google/reformer-crime-and-punishment
- Splinter
- tau/splinter-base
- tau/splinter-base-qass
- tau/splinter-large
- tau/splinter-large-qass
- SqueezeBert
- squeezebert/squeezebert-uncased
- squeezebert/squeezebert-mnli
- squeezebert/squeezebert-mnli-headless
- TransfoXL
- transfo-xl-wt103
- XLM
- xlm-mlm-en-2048
- xlm-mlm-ende-1024
- xlm-mlm-enfr-1024
- xlm-mlm-enro-1024
- xlm-mlm-tlm-xnli15-1024
- xlm-mlm-xnli15-1024
- xlm-clm-enfr-1024
- xlm-clm-ende-1024
- xlm-mlm-17-1280
- xlm-mlm-100-1280
- XLMRoberta
- xlm-roberta-base
- xlm-roberta-large
- xlm-roberta-large-finetuned-conll02-dutch
- xlm-roberta-large-finetuned-conll02-spanish
- xlm-roberta-large-finetuned-conll03-english
- xlm-roberta-large-finetuned-conll03-german
- XLNet
- xlnet-base-cased
- xlnet-large-cased
- Yoso
- uw-madison/yoso-4096
Interface
The operator takes a text string as input. It loads the tokenizer and pretrained model by model name, then returns the text embedding as a numpy.ndarray.
Parameters:
text: str
The input text.
Returns:
numpy.ndarray
The text embedding extracted by the model.
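Because the output is a plain numpy.ndarray, embeddings from this operator can be compared with standard vector math. A minimal cosine-similarity helper (the function name is illustrative, not part of the operator's API):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical vectors score 1.0; orthogonal vectors score 0.0.
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 1.0])
score = cosine_similarity(a, b)
```

Texts with similar meaning should produce embeddings with a higher cosine similarity than unrelated texts.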