# Code & Text Embedding with CodeBERT

*author: [Jael Gu](https://github.com/jaelgu)*

<br />

## Description

A code search operator takes a text string of programming language or natural language as input
and returns an embedding vector in numpy.ndarray that captures the input's core semantic elements.
This operator is implemented with pre-trained models from [Huggingface Transformers](https://huggingface.co/docs/transformers).

<br />

## Code Example
Use the pre-trained model "huggingface/CodeBERTa-small-v1"
to generate text embeddings for the given
text description "return max value" and code snippet "def max(a,b): if a>b: return a else return b".

*Write the pipeline*:

```python
import towhee

(
    towhee.dc(['return max value', 'def max(a,b): if a>b: return a else return b'])
          .code_search.codebert(model_name='huggingface/CodeBERTa-small-v1')
)
```
*Write the same pipeline with explicit input/output name specifications:*

```python
import towhee

(
    towhee.dc['text'](['return max value', 'def max(a,b): if a>b: return a else return b'])
          .code_search.codebert['text', 'embedding']()
          .show()
)
```
<img src="./result.png" width="800px"/>

<br />

## Factory Constructor

Create the operator via the following factory method:

***code_search.codebert(model_name="huggingface/CodeBERTa-small-v1")***

**Parameters:**

***model_name***: *str*

The model name in string.
The default model name is "huggingface/CodeBERTa-small-v1".

***device***: *str*

The device to run model inference on.
The default value is None, which enables GPU if CUDA is available.
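
As a minimal usage sketch (assuming the operator is loaded through towhee.ops and that a GPU may not be available, so "cpu" is passed explicitly):

```python
import towhee

# Construct the operator with an explicit model name and device.
# Pass device='cuda' instead if a GPU is available; device=None lets the
# operator pick a GPU automatically when CUDA is present.
op = towhee.ops.code_search.codebert(
    model_name='huggingface/CodeBERTa-small-v1',
    device='cpu',
)
```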
Supported model names:

<br />

## Interface

The operator takes a piece of text in string as input.
It loads the tokenizer and pre-trained model using the model name,
and then returns a text embedding in numpy.ndarray.

***__call__(txt)***

**Parameters:**

***txt***: *str*

The text string in programming language or natural language.

**Returns**:

*numpy.ndarray*

The text embedding generated by the model, in shape (dim,).
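
As a rough sketch of calling the operator directly (assuming the instance returned by towhee.ops is callable on a single string):

```python
import towhee
import numpy

# Load the operator once, then embed a natural-language query and a code snippet.
op = towhee.ops.code_search.codebert(model_name='huggingface/CodeBERTa-small-v1')

text_emb = op('return max value')
code_emb = op('def max(a,b): if a>b: return a else return b')

# Each call returns a numpy.ndarray of shape (dim,).
assert isinstance(text_emb, numpy.ndarray)
print(text_emb.shape, code_emb.shape)
```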
***save_model(format="pytorch", path="default")***

Save the model to a local path in the specified format.

**Parameters:**

***format***: *str*

The format of the saved model, defaults to "pytorch".

***path***: *str*

The path the model is saved to. By default, the model is saved to the operator directory.
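
A hedged sketch of saving the model (this assumes save_model is exposed on the operator instance returned by towhee.ops):

```python
import towhee

op = towhee.ops.code_search.codebert(model_name='huggingface/CodeBERTa-small-v1')

# Save the underlying model in the default PyTorch format to the
# operator directory ("default" path, per the description above).
op.save_model(format='pytorch', path='default')
```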
***supported_model_names(format=None)***

Get a list of all supported model names, or only the model names supported for a specified model format.

**Parameters:**

***format***: *str*

The model format, such as "pytorch" or "torchscript".
The default value is None, which returns all supported model names.
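
A small usage sketch (again assuming the method is exposed on the operator instance in the same way as save_model):

```python
import towhee

op = towhee.ops.code_search.codebert(model_name='huggingface/CodeBERTa-small-v1')

# All supported model names, then only the ones available as TorchScript.
print(op.supported_model_names())
print(op.supported_model_names(format='torchscript'))
```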