OSSChat Insert
Description
OSSChat is an enhanced ChatGPT that uses documentation, issues, blog posts, and community Q&A as knowledge bases. It is built for every community and developer. The osschat-insert pipeline is used to insert data into the knowledge base.
Code Example
Create Milvus collection
Before running the pipeline, please create a Milvus collection first. The dim is the dimensionality of the feature vector generated by the embedding model configured in the osschat-insert pipeline.
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility

collection_name = 'osschat'
dim = 384

# Connect to the Milvus server.
connections.connect(host='127.0.0.1', port='19530')

# Define the collection schema: an auto-generated primary key, the text id,
# the text chunk itself, and its embedding vector.
fields = [
    FieldSchema(name='id', dtype=DataType.INT64, description='ids', is_primary=True, auto_id=True),
    FieldSchema(name='text_id', dtype=DataType.VARCHAR, description='text', max_length=500),
    FieldSchema(name='text', dtype=DataType.VARCHAR, description='text', max_length=1000),
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='embedding vectors', dim=dim)
]
schema = CollectionSchema(fields=fields, description='osschat')
collection = Collection(name=collection_name, schema=schema)

# Build an IVF_FLAT index with inner-product similarity on the embedding field.
index_params = {
    'metric_type': "IP",
    'index_type': "IVF_FLAT",
    'params': {"nlist": 2048}
}
collection.create_index(field_name="embedding", index_params=index_params)
Create pipeline and set the configuration
If you set config.es_enable to True, you also need to start an Elasticsearch service. For more parameters, refer to the Configuration section below.
from towhee import AutoPipes, AutoConfig

# Load the default configuration for the osschat-insert pipeline and override it.
config = AutoConfig.load_config('osschat-insert')
config.embedding_model = 'all-MiniLM-L6-v2'
config.milvus_host = '127.0.0.1'
config.milvus_port = '19530'
config.es_enable = True
config.es_connection_kwargs = {'hosts': ['https://localhost:9200'], 'basic_auth': ('elastic', 'password')}

# Build the pipeline and insert a document into the 'osschat' collection/index.
p = AutoPipes.pipeline('osschat-insert', config=config)
res = p('https://github.com/towhee-io/towhee/blob/main/README.md', 'osschat')
Then you can run collection.flush() and collection.num_entities to check how many entities have been inserted into Milvus as a knowledge base, and run es_client.search(index='osschat', body={"query": {"match_all": {}}})['hits']['total']['value'] to get the document count in Elasticsearch, as shown in the sketch below.
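A minimal sketch of these checks, continuing with the collection object created above. The es_client is not created by the pipeline; building it with the official elasticsearch Python client and the same credentials as es_connection_kwargs is an assumption here.

# Make the inserted data visible and count entities in Milvus.
collection.flush()
print(collection.num_entities)

# Count documents in Elasticsearch (assumes the official elasticsearch Python
# client and the same credentials passed in es_connection_kwargs above).
from elasticsearch import Elasticsearch

es_client = Elasticsearch('https://localhost:9200', basic_auth=('elastic', 'password'))
print(es_client.search(index='osschat', body={"query": {"match_all": {}}})['hits']['total']['value'])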
Configuration
EnhancedQAInsertConfig
Configuration for Text Splitter:
type: str
The type of splitter, defaults to 'RecursiveCharacter'. You can set this parameter to one of ['RecursiveCharacter', 'Markdown', 'PythonCode', 'Character', 'NLTK', 'Spacy', 'Tiktoken', 'HuggingFace'].
chunk_size: int
The size of each chunk, defaults to 300.
splitter_kwargs: dict
The kwargs for the splitter, defaults to {}.
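As an illustration, a hedged sketch of overriding the splitter settings; the 'Markdown' type and the chunk size below are arbitrary choices from the supported list above, not defaults.

from towhee import AutoConfig

config = AutoConfig.load_config('osschat-insert')
config.type = 'Markdown'      # illustrative: split Markdown documents by structure
config.chunk_size = 200       # illustrative: smaller chunks than the default 300
config.splitter_kwargs = {}   # extra keyword arguments passed to the splitter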
Configuration for Sentence Embedding:
embedding_model: str
The model name for sentence embedding, defaults to 'all-MiniLM-L6-v2'. You can refer to the Model(s) list above to set the model; some of these models are from HuggingFace (open source), and some are from OpenAI (not open source, API key required).
openai_api_key: str
The OpenAI API key, defaults to None. This key is required if the model is from OpenAI; you can check the model provider in the Model(s) list above.
embedding_device: int
The device number, defaults to -1, which means using the CPU. If the value is not -1, the specified GPU device will be used.
embedding_normalize: bool
Whether to normalize the embedding vectors, defaults to True.
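A hedged sketch of the embedding settings, continuing with the config object above; the GPU index 0 is an illustrative value, and the commented OpenAI key is a placeholder that only applies if an OpenAI model is chosen from the Model(s) list.

config.embedding_model = 'all-MiniLM-L6-v2'
config.embedding_device = 0        # illustrative: use GPU 0 instead of the default CPU (-1)
config.embedding_normalize = True  # keep the default normalization
# config.openai_api_key = 'sk-...' # placeholder, only for OpenAI-provided models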
Configuration for Milvus:
milvus_uri: str
URI of the Milvus vector database, defaults to None, in which case the host and port are used instead.
milvus_host: str
Host of the Milvus vector database, defaults to '127.0.0.1'.
milvus_port: str
Port of the Milvus vector database, defaults to '19530'.
milvus_token: str
Token of a Zilliz Cloud user, defaults to None.
milvus_user: str
The user name for a Zilliz Cloud user, defaults to None.
milvus_password: str
The password for a Zilliz Cloud user, defaults to None.
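A hedged sketch of connecting to Zilliz Cloud instead of a local Milvus, continuing with the config object above; the URI, token, and credentials are placeholders, not real endpoints.

# When milvus_uri is set, it is used instead of milvus_host/milvus_port.
config.milvus_uri = 'https://<your-cluster-endpoint>'    # placeholder
config.milvus_token = '<your-zilliz-cloud-token>'        # placeholder
# Alternatively, user/password credentials can be supplied:
# config.milvus_user = '<user>'
# config.milvus_password = '<password>'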
Configuration for Elasticsearch:
es_enable: bool
Whether to use Elasticsearch, defaults to True.
es_connection_kwargs
The connection arguments used to connect to the Elasticsearch service.
token_model
The model used to count tokens, defaults to 'gpt-3.5-turbo'.
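If you only want vector storage in Milvus, a minimal sketch of disabling Elasticsearch, continuing with the config object above; the token_model value shown is simply the documented default.

config.es_enable = False               # skip Elasticsearch entirely
config.token_model = 'gpt-3.5-turbo'   # default model used to count tokens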
Interface
Insert documentation into Milvus (and Elasticsearch) as a knowledge base.
Parameters:
doc: str
Path or URL of the document to be loaded.
project_name: str
The collection name for the Milvus vector database, which is also the index name in Elasticsearch.
Returns: dict
A dictionary that includes 'milvus_res' and 'es_res'.
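A minimal usage sketch of the interface; the local file path is a hypothetical example, and the result keys follow the return description above.

res = p('./docs/quick_start.md', 'osschat')   # hypothetical local path
print(res)                                    # expected to include 'milvus_res' and 'es_res'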