|  |  |  | # Sentence Embedding with Transformers
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | *author: [Jael Gu](https://github.com/jaelgu)* | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | <br /> | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ## Description
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | A sentence embedding operator generates one embedding vector in ndarray for each input text. | 
					
						
							|  |  |  | The embedding represents the semantic information of the whole input text as one vector. | 
					
						
							|  |  |  | This operator is implemented with pre-trained models from [Huggingface Transformers](https://huggingface.co/docs/transformers). | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | <br /> | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ## Code Example
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Use the pre-trained model 'sentence-transformers/paraphrase-albert-small-v2' | 
					
						
							|  |  |  | to generate an embedding for the sentence "Hello, world.". | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | *Write a same pipeline with explicit inputs/outputs name specifications:* | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | - **option 1 (towhee>=0.9.0):** | 
					
						
							|  |  |  | ```python | 
					
						
							|  |  |  | from towhee.dc2 import pipe, ops, DataCollection | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | p = ( | 
					
						
							|  |  |  |     pipe.input('text') | 
					
						
							|  |  |  |         .map('text', 'vec',  | 
					
						
							|  |  |  |              ops.sentence_embedding.transformers(model_name='sentence-transformers/paraphrase-albert-small-v2')) | 
					
						
							|  |  |  |         .output('text', 'vec') | 
					
						
							|  |  |  | ) | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | DataCollection(p('Hello, world.')).show() | 
					
						
							|  |  |  | ``` | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | <img src="./result.png" width="800px"/> | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | - **option 2:** | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ```python | 
					
						
							|  |  |  | import towhee | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ( | 
					
						
							|  |  |  |     towhee.dc['text'](['Hello, world.']) | 
					
						
							|  |  |  |           .sentence_embedding.transformers['text', 'vec']( | 
					
						
							|  |  |  |                 model_name='sentence-transformers/paraphrase-albert-small-v2') | 
					
						
							|  |  |  |           .show() | 
					
						
							|  |  |  | ) | 
					
						
							|  |  |  | ``` | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | <br /> | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ## Factory Constructor
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Create the operator via the following factory method: | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ***sentence_embedding.transformers(model_name=None)*** | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | **Parameters:** | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ***model_name***: *str* | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | The model name in string, defaults to None. | 
					
						
							|  |  |  | If None, the operator will be initialized without specified model. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Supported model names: NLP transformers models listed in [Huggingface Models](https://huggingface.co/models). | 
					
						
							|  |  |  | Please note that only models listed in `supported_model_names` are tested. | 
					
						
							|  |  |  | You can refer to [Towhee Pipeline]() for benchmark. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ***checkpoint_path***: *str* | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | The path to local checkpoint, defaults to None. | 
					
						
							|  |  |  | If None, the operator will download and load pretrained model by `model_name` from Huggingface transformers. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | <br />  | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ***tokenizer***: *object* | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | The method to tokenize input text, defaults to None. | 
					
						
							|  |  |  | If None, the operator will use default tokenizer by `model_name` from Huggingface transformers. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | <br /> | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ## Interface
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | The operator takes a piece of text in string as input. | 
					
						
							|  |  |  | It loads tokenizer and pre-trained model using model name, | 
					
						
							|  |  |  | and then return a text emabedding in numpy.ndarray. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ***\_\_call\_\_(txt)*** | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | **Parameters:** | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ***data***: *Union[str, list]* | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	The text in string or a list of texts. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | **Returns**: | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | *numpy.ndarray or list* | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	The text embedding (or token embeddings) extracted by model. | 
					
						
							|  |  |  | If `data` is string, the operator returns an embedding in numpy.ndarray with shape of (dim,). | 
					
						
							|  |  |  | If `data` is a list, the operator returns a list of embedding(s) with length of input list. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | <br /> | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ***save_model(format='pytorch', path='default')*** | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Save model to local with specified format. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | **Parameters:** | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ***format***: *str* | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	The format to export model as, such as 'pytorch', 'torchscript', 'onnx', | 
					
						
							|  |  |  | defaults to 'pytorch'. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ***path***: *str* | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	The path where exported model is saved to. | 
					
						
							|  |  |  | By default, it will save model to `saved` directory under the operator cache. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ```python | 
					
						
							|  |  |  | from towhee import ops | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | op = ops.sentence_embedding.transformers(model_name='sentence-transformers/paraphrase-albert-small-v2').get_op() | 
					
						
							|  |  |  | op.save_model('onnx', 'test.onnx') | 
					
						
							|  |  |  | ``` | 
					
						
							|  |  |  | PosixPath('/Home/.towhee/operators/sentence-embedding/transformers/main/test.onnx') | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | <br /> | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ***supported_model_names(format=None)*** | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Get a list of all supported model names or supported model names for specified model format. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | **Parameters:** | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ***format***: *str* | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	The model format such as 'pytorch', 'torchscript', 'onnx'. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ```python | 
					
						
							|  |  |  | from towhee import ops | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | op = ops.sentence_embedding.transformers().get_op() | 
					
						
							|  |  |  | full_list = op.supported_model_names() | 
					
						
							|  |  |  | onnx_list = op.supported_model_names(format='onnx') | 
					
						
							|  |  |  | ``` |