# Sentence Embedding with Sentence Transformers

*author: [Jael Gu](https://github.com/jaelgu)*

<br />

## Description

This operator takes a sentence or a list of sentences (as strings) as input.
It generates an embedding vector (a numpy.ndarray) for each sentence, which captures the sentence's core semantic content.
This operator is implemented with pre-trained models from [Sentence Transformers](https://www.sbert.net/).
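
Because each embedding captures the meaning of its sentence, a common way to use the vectors is to compare them with cosine similarity. The following is a minimal sketch that assumes only NumPy. The three short vectors are made-up stand-ins for operator output (real embeddings have many more dimensions), and `cosine_similarity` is a hypothetical helper, not part of this operator.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Return the cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up stand-ins for embeddings returned by the operator.
vec_query = np.array([0.9, 0.1, 0.3])
vec_close = np.array([0.8, 0.2, 0.4])   # points in a similar direction
vec_far = np.array([-0.7, 0.6, -0.2])   # points in a different direction

sim_close = cosine_similarity(vec_query, vec_close)
sim_far = cosine_similarity(vec_query, vec_far)

# Sentences with related meaning are expected to score higher.
print(sim_close > sim_far)
```

With real embeddings, the same helper could be applied to any two vectors produced by the operator, as long as both come from the same model.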

<br />

## Code Example

Use the pre-trained model "all-MiniLM-L12-v2"
to generate a text embedding for the sentence "This is a sentence.".

*Write the pipeline with explicit input and output names:*

- **option 1 (towhee>=0.9.0):**

```python
from towhee.dc2 import pipe, ops, DataCollection

# Build a pipeline that maps the 'sentence' column to a 'vec' column.
p = (
    pipe.input('sentence')
        .map('sentence', 'vec', ops.sentence_embedding.sbert(model_name='all-MiniLM-L12-v2'))
        .output('sentence', 'vec')
)

# Run the pipeline on one sentence and display the result.
DataCollection(p('This is a sentence.')).show()
```

<img src="./result.png" width="800px"/>

- **option 2:**

```python
import towhee

(
    towhee.dc['sentence'](['This is a sentence.'])
          .sentence_embedding.sbert['sentence', 'vec'](model_name='all-MiniLM-L12-v2')
          .show()
)
```

<br />

## Factory Constructor

Create the operator via the following factory method:

***sentence_embedding.sbert(model_name='all-MiniLM-L12-v2', device=None)***

**Parameters:**

***model_name***: *str*

The model name as a string. For supported model names,
refer to the [SBert Doc](https://www.sbert.net/docs/pretrained_models.html).
Please note that only the models listed by `supported_model_names` are tested.
You can refer to [Towhee Pipeline]() for model performance.

***device***: *str*

The device to run the model on, defaults to None.
If None, the operator uses 'cuda' automatically when CUDA is available.
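
The device rule above can be sketched as a small helper. This is an illustration rather than the operator's actual code: `resolve_device` is a hypothetical name, and the fallback to `'cpu'` when CUDA (or PyTorch itself) is unavailable is an assumption that this document does not state.

```python
from typing import Optional


def resolve_device(device: Optional[str] = None) -> str:
    """Pick a device string, preferring CUDA when none is given."""
    if device is not None:
        return device  # an explicit choice is used unchanged
    try:
        import torch  # treated as an optional dependency in this sketch
    except ImportError:
        return 'cpu'  # assumed fallback; not stated in this document
    return 'cuda' if torch.cuda.is_available() else 'cpu'


print(resolve_device('cpu'))  # explicit device is returned as given
print(resolve_device())       # 'cuda' or 'cpu', depending on the machine
```

Passing an explicit value such as `device='cpu'` is a simple way to get the same behavior on every machine.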

<br />

## Interface

The operator takes a sentence or a list of sentences (as strings) as input.
It loads the tokenizer and the pre-trained model by model name,
and then returns the text embedding(s) as numpy.ndarray.

***__call__(txt)***

**Parameters:**

***txt***: *Union[List[str], str]*

A sentence or a list of sentences, given as strings.

**Returns**:

*Union[List[numpy.ndarray], numpy.ndarray]*

If the input is a single sentence, it returns one embedding vector, a numpy.ndarray of shape (dim,).
If the input is a list of sentences, it returns a list of embedding vectors, each of which is a numpy.ndarray of shape (dim,).
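
The two return shapes can be illustrated without loading a model. The sketch below assumes only NumPy. `fake_encode` is a made-up stand-in for the real encoder, `embed` is a hypothetical wrapper that mirrors the documented return types, and `DIM = 4` is an arbitrary size (the real dimension depends on the chosen model).

```python
from typing import List, Union

import numpy as np

DIM = 4  # made-up size; the real dimension depends on the model


def fake_encode(sentences: List[str]) -> np.ndarray:
    """Stand-in encoder: returns one row of length DIM per sentence."""
    return np.array([[float(len(s))] * DIM for s in sentences])


def embed(txt: Union[List[str], str]) -> Union[List[np.ndarray], np.ndarray]:
    """Mirror the documented return types for str and list input."""
    if isinstance(txt, str):
        return fake_encode([txt])[0]  # single vector of shape (DIM,)
    return list(fake_encode(txt))     # list of vectors, each of shape (DIM,)


single = embed('This is a sentence.')
batch = embed(['This is a sentence.', 'Another one.'])
print(single.shape, len(batch), batch[0].shape)
```

Code that consumes the operator's output may want to check `isinstance(result, list)` before indexing, since the two cases differ in type.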

<br/>

***supported_model_names(format=None)***

Get a list of all supported model names, or only the names supported for a specified model format.

**Parameters:**

***format***: *str*

The model format such as 'pytorch', defaults to None.
If None, it returns the full list of supported model names.

```python
from towhee import ops

# Get the underlying operator, then list every supported model name.
op = ops.sentence_embedding.sbert().get_op()
full_list = op.supported_model_names()
```