# Sentence Embedding with Sentence Transformers
*author: [Jael Gu](https://github.com/jaelgu)*
<br />
## Description
This operator takes a sentence or a list of sentences (as strings) as input.
It generates an embedding vector in numpy.ndarray for each sentence, capturing the sentence's core semantic meaning.
The operator is implemented with pre-trained models from [Sentence Transformers](https://www.sbert.net/).
<br />
## Code Example
Use the pre-trained model "all-MiniLM-L12-v2"
to generate a text embedding for the sentence "This is a sentence.".
*Write a pipeline with explicit input/output name specifications:*
```python
from towhee.dc2 import pipe, ops, DataCollection

p = (
    pipe.input('sentence')
        .map('sentence', 'vec', ops.sentence_embedding.sbert(model_name='all-MiniLM-L12-v2'))
        .output('sentence', 'vec')
)

DataCollection(p('This is a sentence.')).show()
```
<img src="./result.png" width="800px"/>
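Embedding vectors like `vec` above are typically compared with cosine similarity, e.g. for semantic search or duplicate detection. A minimal numpy sketch (the vectors here are made-up stand-ins, not real model outputs):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors of shape (dim,)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for sentence embeddings.
vec_a = np.array([0.1, 0.3, 0.5])
vec_b = np.array([0.2, 0.6, 1.0])

score = cosine_similarity(vec_a, vec_b)  # close to 1.0 — vec_b is parallel to vec_a
```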
<br />
## Factory Constructor
Create the operator via the following factory method:
***sentence_embedding.sbert(model_name='all-MiniLM-L12-v2')***
**Parameters:**
***model_name***: *str*
The model name as a string. For supported model names,
refer to the [SBert Doc](https://www.sbert.net/docs/pretrained_models.html).
Please note that only models listed by `supported_model_names` are tested.
You can refer to [Towhee Pipeline]() for model performance.
***device***: *str*
The device on which to run the model; defaults to None.
If None, 'cuda' is used automatically when it is available.
<br />
## Interface
The operator takes a sentence or a list of sentences (as strings) as input.
It loads the tokenizer and pre-trained model by model name,
and then returns the text embedding(s) in numpy.ndarray.
***__call__(txt)***
**Parameters:**
***txt***: *Union[List[str], str]*
​ A sentence or a list of sentences as strings.
**Returns**:
*Union[List[numpy.ndarray], numpy.ndarray]*
​ If the input is a single sentence, it returns an embedding vector of shape (dim,) as a numpy.ndarray.
If the input is a list of sentences, it returns a list of embedding vectors, each a numpy.ndarray of shape (dim,).
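The single-sentence vs. list return convention can be illustrated with a toy stand-in encoder (`toy_encode` and `DIM` are hypothetical names for illustration, not part of the operator's API):

```python
from typing import List, Union

import numpy as np

DIM = 4  # toy embedding dimension

def toy_encode(txt: Union[str, List[str]]) -> Union[np.ndarray, List[np.ndarray]]:
    """Mimics the operator's return convention with fake embeddings."""
    if isinstance(txt, str):
        return np.zeros(DIM)                 # one sentence -> one (dim,) array
    return [np.zeros(DIM) for _ in txt]      # list -> list of (dim,) arrays

single = toy_encode('This is a sentence.')
batch = toy_encode(['Sentence one.', 'Sentence two.'])
```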
<br/>
***supported_model_names(format=None)***
Get a list of all supported model names, or of the supported model names for a specified model format.
**Parameters:**
***format***: *str*
​ The model format, such as 'pytorch'; defaults to None.
If None, it returns the full list of supported model names.
```python
from towhee import ops

# Instantiate the operator and query the full list of supported models.
op = ops.sentence_embedding.sentence_transformers().get_op()
full_list = op.supported_model_names()
```
## Fine-tune
### Get started
In this example, we fine-tune the operator on the Semantic Textual Similarity (STS) task, which assigns a score to the similarity of two texts.
We use the STSbenchmark dataset as training data for fine-tuning.
We only need to construct an op instance and pass in some configurations to train on the specified task.
```python
import os

import towhee
from sentence_transformers import util

op = towhee.ops.sentence_embedding.sentence_transformers(model_name='nli-distilroberta-base-v2').get_op()

# Download the STSbenchmark dataset if it is not present locally.
sts_dataset_path = 'datasets/stsbenchmark.tsv.gz'
if not os.path.exists(sts_dataset_path):
    util.http_get('https://sbert.net/datasets/stsbenchmark.tsv.gz', sts_dataset_path)

training_config = {
    'sts_dataset_path': sts_dataset_path,
    'train_batch_size': 16,
    'num_epochs': 4,
    'model_save_path': './output'
}
op.train(training_config)
```
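STSbenchmark labels each sentence pair with a gold similarity score on a 0–5 scale, and sbert-style STS training typically rescales it to [0, 1] before building training pairs. A hedged sketch of that preprocessing (`read_sts_rows` is a hypothetical helper; the column layout is assumed from the common STSbenchmark TSV format, where the score and the two sentences are the last three fields):

```python
import csv
import io

def read_sts_rows(tsv_text: str):
    """Yield (sentence1, sentence2, normalized_score) from STSbenchmark-style TSV text."""
    for row in csv.reader(io.StringIO(tsv_text), delimiter='\t', quoting=csv.QUOTE_NONE):
        score, s1, s2 = float(row[-3]), row[-2], row[-1]
        yield s1, s2, score / 5.0  # rescale gold score from [0, 5] to [0, 1]

# One illustrative row in the assumed column layout.
sample = "main-captions\tMSRvid\t2012test\t0001\t5.000\tA plane is taking off.\tAn air plane is taking off."
pairs = list(read_sts_rows(sample))
```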
### Dive deep and customize your training
You can modify the [training script](https://towhee.io/sentence-embedding/sentence_transformers/src/branch/main/train_sts_task.py) to suit your own needs.
You can also refer to the original [sbert training guide](https://www.sbert.net/docs/training/overview.html) and [code example](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training) for more information.