towhee
Updated 2 years ago
OSSChat Insert
Description
OSSChat is an enhanced ChatGPT that uses documentation, issues, blog posts, and community Q&A as knowledge bases, built for every community and developer. The osschat-insert pipeline inserts data into such a knowledge base.
Code Example
Create Milvus collection
Before running the pipeline, please create the Milvus collection first.
The dim parameter is the dimensionality of the feature vector generated by the model configured in the osschat-insert pipeline.
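For reference, here are the output dimensionalities of a few commonly used embedding models. These values are properties of the models themselves (not taken from this pipeline's docs), so always confirm the dimensionality of the model you actually configure:

```python
# Illustrative lookup of output dimensionalities for common embedding models.
# Check your model's own documentation before creating the collection.
EMBEDDING_DIMS = {
    'all-MiniLM-L6-v2': 384,         # Sentence-Transformers MiniLM
    'all-mpnet-base-v2': 768,        # Sentence-Transformers MPNet
    'text-embedding-ada-002': 1536,  # OpenAI ada-002
}

dim = EMBEDDING_DIMS['all-MiniLM-L6-v2']
print(dim)  # 384
```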
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection

collection_name = 'osschat'
dim = 384  # must match the embedding model's output dimensionality

# Connect to the Milvus server
connections.connect(host='127.0.0.1', port='19530')

# Schema: auto-generated primary key, text id, raw text, and embedding vector
fields = [
    FieldSchema(name='id', dtype=DataType.INT64, description='ids', is_primary=True, auto_id=True),
    FieldSchema(name='text_id', dtype=DataType.VARCHAR, description='text', max_length=500),
    FieldSchema(name='text', dtype=DataType.VARCHAR, description='text', max_length=1000),
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='embedding vectors', dim=dim)
]
schema = CollectionSchema(fields=fields, description='enhanced qa')
collection = Collection(name=collection_name, schema=schema)

# Build an IVF_FLAT index with an inner-product metric on the embedding field
index_params = {
    'metric_type': 'IP',
    'index_type': 'IVF_FLAT',
    'params': {'nlist': 2048}
}
collection.create_index(field_name='embedding', index_params=index_params)
Create pipeline and set the configuration
You also need to start Elasticsearch.
For more parameters, refer to the Configuration section below.
from towhee import AutoPipes, AutoConfig

# Load the default configuration for the osschat-insert pipeline
config = AutoConfig.load_config('osschat-insert')
config.embedding_model = 'all-MiniLM-L6-v2'
config.milvus_host = '127.0.0.1'
config.milvus_port = '19530'
config.es_host = '127.0.0.1'
config.es_port = '9200'

# Build the pipeline and insert a document by URL
p = AutoPipes.pipeline('osschat-insert', config=config)
res = p('https://github.com/towhee-io/towhee/blob/main/README.md')
Then you can run collection.flush() and collection.num_entities to check the number of entities inserted into Milvus as a knowledge base.
Configuration
EnhancedQAInsertConfig
Configuration for Text Splitter:
type: str
The type of splitter, defaults to 'RecursiveCharacter'. You can set this parameter to one of ['RecursiveCharacter', 'Markdown', 'PythonCode', 'Character', 'NLTK', 'Spacy', 'Tiktoken', 'HuggingFace'].
chunk_size: int
The size of each chunk, defaults to 300.
splitter_kwargs: dict
The kwargs for the splitter, defaults to {}.
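To make chunk_size concrete, here is a minimal fixed-size character splitter. It is not the pipeline's actual splitter (the types above wrap richer strategies such as RecursiveCharacter), only an illustration of how a fixed character budget slices text:

```python
def split_by_characters(text: str, chunk_size: int = 300) -> list[str]:
    """Naive fixed-size character splitter (illustration only):
    slice the text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = split_by_characters('a' * 650, chunk_size=300)
print([len(c) for c in chunks])  # [300, 300, 50]
```

Real splitters additionally try to break on natural boundaries (paragraphs, sentences, code blocks) rather than cutting mid-word, which is what the `type` and `splitter_kwargs` options control.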
Configuration for Sentence Embedding:
embedding_model: str
The model name for sentence embedding, defaults to 'all-MiniLM-L6-v2'. You can refer to the above Model(s) list to set the model; some of these models are from HuggingFace (open source), and some are from OpenAI (not open source, API key required).
openai_api_key: str
The API key for OpenAI, defaults to None. This key is required if the model is from OpenAI; you can check the model provider in the above Model(s) list.
embedding_device: int
The device number, defaults to -1, which means using the CPU. If the setting is not -1, the specified GPU device will be used.
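The -1 convention for embedding_device can be read as a simple device selector. A hedged sketch (the helper name is illustrative, not part of the pipeline's API):

```python
def resolve_device(embedding_device: int) -> str:
    """Map the integer config value to a torch-style device string:
    -1 selects the CPU; any other value selects that GPU index."""
    return 'cpu' if embedding_device == -1 else f'cuda:{embedding_device}'

print(resolve_device(-1))  # cpu
print(resolve_device(0))   # cuda:0
```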
Configuration for Milvus:
milvus_host: str
Host of the Milvus vector database, defaults to '127.0.0.1'.
milvus_port: str
Port of the Milvus vector database, defaults to '19530'.
milvus_user: str
The user name for a Cloud user, defaults to None.
milvus_password: str
The user password for a Cloud user, defaults to None.
Configuration for Elasticsearch:
es_host: str
Host of Elasticsearch, defaults to '127.0.0.1'.
es_port: str
Port of Elasticsearch, defaults to '9200'.
es_user: str
The user name for Elasticsearch, defaults to None.
es_password: str
The user password for Elasticsearch, defaults to None.
Interface
Insert documentation into Milvus as a knowledge base.
Parameters:
doc: str
Path or URL of the document to be loaded.
milvus_collection: str
The collection name for the Milvus vector database; required when inserting data into Milvus.
es_index: str
The index name for Elasticsearch.
Returns: MutationResult
A MutationResult after inserting data into Milvus.