# Code & Text Embedding with UniXcoder

*author: [Jael Gu](https://github.com/jaelgu)*

<br />

## Description

This code search operator takes a text string, written in either a programming language or natural language, as input
and returns an embedding vector (a numpy ndarray) that captures the core semantics of the input.
The operator is implemented with pre-trained [UniXcoder](https://arxiv.org/pdf/2203.03850.pdf) models
loaded through [Huggingface Transformers](https://huggingface.co/docs/transformers).

<br />

## Code Example

Use the pre-trained model "microsoft/unixcoder-base"
to generate embeddings for the
text description "find max value" and the code "def max(a,b): if a>b: return a else return b".

*Write a pipeline with explicit input and output names:*

```python
from towhee import pipe, ops, DataCollection

p = (
    pipe.input('text')
        .map('text', 'embedding', ops.code_search.unixcoder())
        .output('text', 'embedding')
)

DataCollection(p('find max value')).show()
DataCollection(p('def max(a,b): if a>b: return a else return b')).show()
```

<img src="./result.png" width="800px"/>
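
Since the operator embeds both descriptions and code, a natural next step for code search is to compare the resulting vectors. The following is a minimal sketch and not part of the operator: `cosine_similarity` and `rank_by_similarity` are hypothetical helper names, and the short toy vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length 1-D vectors (lists or ndarrays)."""
    a = [float(x) for x in a]
    b = [float(x) for x in b]
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, candidates):
    """Return (name, score) pairs sorted from most to least similar.

    `candidates` maps a hypothetical snippet name to its embedding.
    """
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy vectors standing in for real embeddings (assumption for illustration).
query = [1.0, 0.0, 1.0]
snippets = {"max_fn": [2.0, 0.0, 2.0], "other_fn": [0.0, 1.0, 0.0]}
ranking = rank_by_similarity(query, snippets)
```

With real data, `query` would be the embedding of a description such as 'find max value' and each candidate would be the embedding of a code snippet.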

<br />

## Factory Constructor

Create the operator via the following factory method:

***code_search.unixcoder(model_name="microsoft/unixcoder-base", device=None)***

**Parameters:**

***model_name***: *str*

The name of the pre-trained model to load.
The default model name is "microsoft/unixcoder-base".

***device***: *str*

The device on which to run model inference.
The default value is None, which uses the GPU if CUDA is available.

Supported model names:
- microsoft/unixcoder-base
- microsoft/unixcoder-base-nine
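
As a hedged illustration of the parameters above, the sketch below checks a model name against the two names listed here and then builds the pipeline with an explicit `model_name` and `device`. `SUPPORTED_MODELS`, `check_model_name`, and `build_pipeline` are hypothetical names introduced for this example, and the `'cpu'` device string mentioned afterwards is an assumption based on common PyTorch usage.

```python
# Model names copied from the list above.
SUPPORTED_MODELS = (
    "microsoft/unixcoder-base",
    "microsoft/unixcoder-base-nine",
)


def check_model_name(model_name):
    """Return model_name unchanged, or raise if it is not in the list above."""
    if model_name not in SUPPORTED_MODELS:
        raise ValueError(
            f"unsupported model name {model_name!r}; "
            f"expected one of {SUPPORTED_MODELS}"
        )
    return model_name


def build_pipeline(model_name="microsoft/unixcoder-base", device=None):
    """Build the embedding pipeline with explicit constructor arguments."""
    # Imported lazily so the helpers above work without towhee installed.
    from towhee import pipe, ops

    op = ops.code_search.unixcoder(
        model_name=check_model_name(model_name),
        device=device,  # None keeps the documented default behavior
    )
    return (
        pipe.input('text')
            .map('text', 'embedding', op)
            .output('text', 'embedding')
    )
```

Under these assumptions, `build_pipeline(device='cpu')` would request CPU inference even when a GPU is present.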

<br />

## Interface

The operator takes a piece of text as a string.
It loads the tokenizer and pre-trained model by model name,
and then returns an embedding as an ndarray.

***__call__(txt)***

**Parameters:**

***txt***: *str*

	The input text, written in either a programming language or natural language.


**Returns**:

*numpy.ndarray*

	The text embedding generated by the model, with shape (dim,).


***save_model(format="pytorch", path="default")***

Save the model locally in the specified format.

**Parameters:**

***format***: *str*

	The format of saved model, defaults to "pytorch".

***path***: *str*

	The path where the model is saved. By default, the model is saved to the operator directory.


***supported_model_names(format=None)***

Get a list of all supported model names, or only those supported for a specified model format.

**Parameters:**

***format***: *str*

	The model format, such as "pytorch" or "torchscript".
	The default value is None, which returns all supported model names.