# Code & Text Embedding with UniXcoder

*author: [Jael Gu](https://github.com/jaelgu)*

<br />
## Description

A code search operator takes a text string of programming language or natural language as input
and returns an embedding vector in numpy.ndarray, which captures the input's core semantic elements.
This operator is implemented with pre-trained [UniXcoder](https://arxiv.org/pdf/2203.03850.pdf) models
from [Huggingface Transformers](https://huggingface.co/docs/transformers).

<br />
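For intuition, below is a minimal sketch of producing a comparable embedding directly with Huggingface Transformers. Loading the checkpoint through the generic `AutoModel`/`AutoTokenizer` classes and mean-pooling the token states are assumptions for illustration only; the operator's internal pooling may differ.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: the UniXcoder checkpoint loads with the generic Auto* classes.
tokenizer = AutoTokenizer.from_pretrained('microsoft/unixcoder-base')
model = AutoModel.from_pretrained('microsoft/unixcoder-base')
model.eval()

text = 'return max value'
tokens = tokenizer(text, return_tensors='pt')
with torch.no_grad():
    hidden = model(**tokens).last_hidden_state  # (1, seq_len, dim)

# Mean-pool token states into a single vector (assumed pooling strategy).
mask = tokens['attention_mask'].unsqueeze(-1)
embedding = ((hidden * mask).sum(1) / mask.sum(1)).squeeze(0).numpy()
print(embedding.shape)  # e.g. (768,)
```

<br />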
## Code Example

Use the pre-trained model "microsoft/unixcoder-base"
to generate text embeddings for the given
text description "return max value" and code "def max(a,b): if a>b: return a else return b".

*Write the pipeline*:

```python
import towhee

(
    towhee.dc(['return max value', 'def max(a,b): if a>b: return a else return b'])
          .code_search.unixcoder(model_name='microsoft/unixcoder-base')
)
```
*Write the same pipeline with explicit input/output name specifications:*

```python
import towhee

(
    towhee.dc['text'](['return max value', 'def max(a,b): if a>b: return a else return b'])
          .code_search.unixcoder['text', 'embedding']()
          .show()
)
```
<img src="./result.png" width="800px"/>
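To work with the raw vectors instead of rendering a table, the results can be collected from the DataCollection; a sketch assuming the same towhee DataCollection API used above (`to_list()` and attribute access on entities):

```python
import towhee

res = (
    towhee.dc['text'](['return max value', 'def max(a,b): if a>b: return a else return b'])
          .code_search.unixcoder['text', 'embedding']()
          .to_list()
)
print(res[0].embedding.shape)  # e.g. (768,)
```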
<br />
## Factory Constructor

Create the operator via the following factory method:

***code_search.unixcoder(model_name="microsoft/unixcoder-base")***
**Parameters:**

***model_name***: *str*

The model name in string.
The default model name is "microsoft/unixcoder-base".
***device***: *str*

The device to run model inference on.
The default value is None, which uses GPU if CUDA is available.
Supported model names:
- microsoft/unixcoder-base
- microsoft/unixcoder-base-nine
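As a brief sketch, the device can also be pinned explicitly when constructing the operator in a pipeline, using the parameters documented above:

```python
import towhee

(
    towhee.dc(['return max value'])
          .code_search.unixcoder(model_name='microsoft/unixcoder-base', device='cpu')
)
```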
<br />
## Interface

The operator takes a piece of text in string as input.
It loads the tokenizer and pre-trained model by model name,
and then returns a text embedding in numpy.ndarray.

***__call__(txt)***
**Parameters:**

***txt***: *str*

The text string in programming language or natural language.
**Returns**:

*numpy.ndarray*

The text embedding generated by the model, in shape of (dim,).
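The operator can also be called outside a pipeline; a minimal sketch, assuming it is fetched through `towhee.ops`:

```python
from towhee import ops

# A sketch: fetch the operator from the hub and call it directly.
op = ops.code_search.unixcoder(model_name='microsoft/unixcoder-base')
emb = op('return max value')
print(emb.shape)  # e.g. (768,)
```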
***save_model(format="pytorch", path="default")***

Save the model to a local path in the specified format.
**Parameters:**

***format***: *str*

The format of the saved model, defaults to "pytorch".
***path***: *str*

The path to which the model is saved. By default, the model is saved to the operator directory.
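A hedged usage sketch, assuming `op` is an instance of this operator (e.g. obtained as in the standalone call above):

```python
# Save the underlying model in TorchScript format to the default
# location, i.e. the operator directory.
op.save_model(format='torchscript', path='default')
```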
***supported_model_names(format=None)***

Get a list of all supported model names, or the supported model names for a specified model format.
**Parameters:**

***format***: *str*

The model format, such as "pytorch" or "torchscript".
The default value is None, which returns all supported model names.
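A brief sketch of querying the supported names, again assuming an operator instance `op`:

```python
# All supported model names.
print(op.supported_model_names())

# Only the names available for the TorchScript format.
print(op.supported_model_names(format='torchscript'))
```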