
Updated 2 years ago

text-embedding

Text Embedding with Transformers

author: Jael Gu


Description

A text embedding operator takes a sentence, paragraph, or document as a string and outputs an embedding vector as an ndarray that captures the input's core semantic elements. This operator is implemented with pre-trained models from Hugging Face Transformers.
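Embedding vectors are usually compared by a similarity measure rather than inspected directly. The following is a minimal sketch of cosine similarity in plain Python. The three short vectors are made-up stand-ins for real embeddings, and `cosine_similarity` is an illustrative helper, not part of this operator.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 4-dimensional vectors standing in for real embeddings (assumption).
vec_greeting = [0.9, 0.1, 0.3, 0.0]
vec_hello = [0.8, 0.2, 0.4, 0.1]
vec_invoice = [0.0, 0.9, 0.1, 0.7]

sim_close = cosine_similarity(vec_greeting, vec_hello)
sim_far = cosine_similarity(vec_greeting, vec_invoice)
print(f"greeting vs hello:   {sim_close:.3f}")
print(f"greeting vs invoice: {sim_far:.3f}")
```

Texts with related meaning are expected to score closer to 1 than unrelated texts. Real embeddings have hundreds of dimensions, but the computation is the same.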


Code Example

Use the pre-trained model 'distilbert-base-cased' to generate a text embedding for the sentence "Hello, world.".

Write the pipeline:

import towhee

(
    towhee.dc(["Hello, world."])
          .text_embedding.transformers(model_name="distilbert-base-cased")
)

Write the same pipeline with explicit input and output names:

import towhee

(
    towhee.dc['text'](["Hello, world."])
          .text_embedding.transformers['text', 'vec'](model_name="distilbert-base-cased")
          .show()
)


Factory Constructor

Create the operator via the following factory method:

text_embedding.transformers(model_name=None)

Parameters:

model_name: str

The model name as a string, defaults to None. If None, the operator is initialized without a specified model.

Supported model names:

Albert
  • albert-base-v1
  • albert-large-v1
  • albert-xlarge-v1
  • albert-xxlarge-v1
  • albert-base-v2
  • albert-large-v2
  • albert-xlarge-v2
  • albert-xxlarge-v2
Bart
  • facebook/bart-large
Bert
  • bert-base-cased
  • bert-base-uncased
  • bert-large-cased
  • bert-large-uncased
  • bert-base-multilingual-uncased
  • bert-base-multilingual-cased
  • bert-base-chinese
  • bert-base-german-cased
  • bert-large-uncased-whole-word-masking
  • bert-large-cased-whole-word-masking
  • bert-large-uncased-whole-word-masking-finetuned-squad
  • bert-large-cased-whole-word-masking-finetuned-squad
  • bert-base-cased-finetuned-mrpc
  • bert-base-german-dbmdz-cased
  • bert-base-german-dbmdz-uncased
  • cl-tohoku/bert-base-japanese-whole-word-masking
  • cl-tohoku/bert-base-japanese-char
  • cl-tohoku/bert-base-japanese-char-whole-word-masking
  • TurkuNLP/bert-base-finnish-cased-v1
  • TurkuNLP/bert-base-finnish-uncased-v1
  • wietsedv/bert-base-dutch-cased
BertGeneration
  • google/bert_for_seq_generation_L-24_bbc_encoder
BigBird
  • google/bigbird-roberta-base
  • google/bigbird-roberta-large
  • google/bigbird-base-trivia-itc
BigBirdPegasus
  • google/bigbird-pegasus-large-arxiv
  • google/bigbird-pegasus-large-pubmed
  • google/bigbird-pegasus-large-bigpatent
CamemBert
  • camembert-base
  • Musixmatch/umberto-commoncrawl-cased-v1
  • Musixmatch/umberto-wikipedia-uncased-v1
Canine
  • google/canine-s
  • google/canine-c
Convbert
  • YituTech/conv-bert-base
  • YituTech/conv-bert-medium-small
  • YituTech/conv-bert-small
CTRL
  • ctrl
DeBERTa
  • microsoft/deberta-base
  • microsoft/deberta-large
  • microsoft/deberta-xlarge
  • microsoft/deberta-base-mnli
  • microsoft/deberta-large-mnli
  • microsoft/deberta-xlarge-mnli
  • microsoft/deberta-v2-xlarge
  • microsoft/deberta-v2-xxlarge
  • microsoft/deberta-v2-xlarge-mnli
  • microsoft/deberta-v2-xxlarge-mnli
DistilBert
  • distilbert-base-uncased
  • distilbert-base-uncased-distilled-squad
  • distilbert-base-cased
  • distilbert-base-cased-distilled-squad
  • distilbert-base-german-cased
  • distilbert-base-multilingual-cased
  • distilbert-base-uncased-finetuned-sst-2-english
Electra
  • google/electra-small-generator
  • google/electra-base-generator
  • google/electra-large-generator
  • google/electra-small-discriminator
  • google/electra-base-discriminator
  • google/electra-large-discriminator
Flaubert
  • flaubert/flaubert_small_cased
  • flaubert/flaubert_base_uncased
  • flaubert/flaubert_base_cased
  • flaubert/flaubert_large_cased
FNet
  • google/fnet-base
  • google/fnet-large
FSMT
  • facebook/wmt19-ru-en
Funnel
  • funnel-transformer/small
  • funnel-transformer/small-base
  • funnel-transformer/medium
  • funnel-transformer/medium-base
  • funnel-transformer/intermediate
  • funnel-transformer/intermediate-base
  • funnel-transformer/large
  • funnel-transformer/large-base
  • funnel-transformer/xlarge-base
  • funnel-transformer/xlarge
GPT
  • openai-gpt
  • gpt2
  • gpt2-medium
  • gpt2-large
  • gpt2-xl
  • distilgpt2
  • EleutherAI/gpt-neo-1.3B
  • EleutherAI/gpt-j-6B
I-Bert
  • kssteven/ibert-roberta-base
LED
  • allenai/led-base-16384
MobileBert
  • google/mobilebert-uncased
MPNet
  • microsoft/mpnet-base
Nystromformer
  • uw-madison/nystromformer-512
Reformer
  • google/reformer-crime-and-punishment
Splinter
  • tau/splinter-base
  • tau/splinter-base-qass
  • tau/splinter-large
  • tau/splinter-large-qass
SqueezeBert
  • squeezebert/squeezebert-uncased
  • squeezebert/squeezebert-mnli
  • squeezebert/squeezebert-mnli-headless
TransfoXL
  • transfo-xl-wt103
XLM
  • xlm-mlm-en-2048
  • xlm-mlm-ende-1024
  • xlm-mlm-enfr-1024
  • xlm-mlm-enro-1024
  • xlm-mlm-tlm-xnli15-1024
  • xlm-mlm-xnli15-1024
  • xlm-clm-enfr-1024
  • xlm-clm-ende-1024
  • xlm-mlm-17-1280
  • xlm-mlm-100-1280
XLMRoberta
  • xlm-roberta-base
  • xlm-roberta-large
  • xlm-roberta-large-finetuned-conll02-dutch
  • xlm-roberta-large-finetuned-conll02-spanish
  • xlm-roberta-large-finetuned-conll03-english
  • xlm-roberta-large-finetuned-conll03-german
XLNet
  • xlnet-base-cased
  • xlnet-large-cased
Yoso
  • uw-madison/yoso-4096


checkpoint_path: str

The path to a local checkpoint, defaults to None. If None, the operator downloads and loads the pre-trained model specified by model_name from Hugging Face Transformers.


tokenizer: object

The method used to tokenize the input text, defaults to None. If None, the operator uses the default tokenizer for model_name from Hugging Face Transformers.
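The defaults described for the three parameters can be summarized as a small decision function. This is only a sketch of the documented behavior: `LoadPlan` and `plan_loading` are hypothetical names introduced here, and the operator's actual loading code may differ.

```python
from typing import NamedTuple, Optional


class LoadPlan(NamedTuple):
    """Hypothetical summary of where the model and tokenizer come from."""
    model_source: str              # "checkpoint", "hub", or "none"
    model_location: Optional[str]  # local path, model name, or None
    tokenizer_source: str          # "custom" or "default"


def plan_loading(model_name=None, checkpoint_path=None, tokenizer=None):
    """Mirror the documented defaults; this is not the operator's own code."""
    if checkpoint_path is not None:
        # A local checkpoint was given, so weights come from that path.
        model_source, location = "checkpoint", checkpoint_path
    elif model_name is not None:
        # No checkpoint: download the pre-trained model by name.
        model_source, location = "hub", model_name
    else:
        # Neither given: the operator is initialized without a model.
        model_source, location = "none", None
    tokenizer_source = "custom" if tokenizer is not None else "default"
    return LoadPlan(model_source, location, tokenizer_source)


print(plan_loading(model_name="distilbert-base-cased"))
print(plan_loading(model_name="distilbert-base-cased",
                   checkpoint_path="./my_checkpoint"))  # hypothetical path
print(plan_loading())
```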


Interface

The operator takes a piece of text as a string. It loads the tokenizer and the pre-trained model by model name, and then returns the text embedding as an ndarray.
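Depending on the model, the returned array may hold one row per token rather than a single vector for the whole text. That is an assumption to verify by checking the array's shape. If it holds, one common way to get a single fixed-size vector is mean pooling over the token rows. The sketch below uses plain Python lists in place of an ndarray, and `mean_pool` is a hypothetical helper, not part of the operator.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must have the same length")
    count = len(token_vectors)
    # Average each dimension across all tokens.
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Made-up token vectors (3 tokens, 2 dimensions) standing in for model output.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sentence_vector = mean_pool(tokens)
print(sentence_vector)
```

With a real ndarray of shape (tokens, dimensions), `array.mean(axis=0)` performs the same reduction.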

__call__(txt)

Parameters:

txt: str

The input text as a string.

Returns:

numpy.ndarray

The text embedding extracted by the model.


save_model(format='pytorch', path='default')

Save the model to a local path in the specified format.

Parameters:

format: str

The format of the saved model, defaults to 'pytorch'.

path: str

The path where the model is saved. By default, the model is saved to the operator directory.


supported_model_names(format=None)

Get a list of all supported model names, or only the model names supported for a specified model format.

Parameters:

format: str

The model format, such as 'pytorch', 'torchscript', or 'onnx'. Defaults to None, which returns all supported model names.

from towhee import ops


op = ops.text_embedding.transformers().get_op()
full_list = op.supported_model_names()
onnx_list = op.supported_model_names(format='onnx')
print(f'Onnx-support/Total Models: {len(onnx_list)}/{len(full_list)}')
2022-12-13 16:25:15,916 - 140704500614336 - auto_transformers.py-auto_transformers:68 - WARNING: The operator is initialized without specified model.
Onnx-support/Total Models: 111/126
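One plausible way to produce a per-format list is to start from the full list and subtract the models known not to work in that format. The sketch below illustrates that idea only. The model names, the `UNSUPPORTED` table, and `filter_models_by_format` are all hypothetical placeholders, not the operator's real data or implementation.

```python
# Hypothetical data: placeholder names, not the operator's real lists.
ALL_MODELS = ["model-a", "model-b", "model-c", "model-d"]
UNSUPPORTED = {
    "pytorch": set(),
    "torchscript": {"model-b"},
    "onnx": {"model-c", "model-d"},
}


def filter_models_by_format(fmt=None):
    """Return all model names, or only those usable in the given format."""
    if fmt is None:
        return sorted(ALL_MODELS)
    if fmt not in UNSUPPORTED:
        raise ValueError(f"unknown format: {fmt!r}")
    # Remove the models excluded for this format from the full list.
    return sorted(set(ALL_MODELS) - UNSUPPORTED[fmt])


full_list = filter_models_by_format()
onnx_list = filter_models_by_format("onnx")
print(f"Onnx-support/Total Models: {len(onnx_list)}/{len(full_list)}")
```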