# OSSChat Insert

## Description

**[OSSChat](https://osschat.io/)** is an enhanced ChatGPT that uses documentation, issues, blog posts, and community Q&A as knowledge bases. It is built for every community and developer. The `osschat-insert` pipeline inserts data into these knowledge bases.

<br />


## Code Example

### **Create Milvus collection**

Before running the pipeline, [create a Milvus collection](https://milvus.io/docs/v2.0.x/create_collection.md) first.

> `dim` is the dimensionality of the feature vectors generated by the `embedding_model` configured in the `osschat-insert` pipeline.

```python
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection

collection_name = 'osschat'
dim = 384  # must match the output dimension of the embedding model

# Connect to a running Milvus instance.
connections.connect(host='127.0.0.1', port='19530')

# Schema: auto-generated primary key, text id, text chunk, and its embedding vector.
fields = [
    FieldSchema(name='id', dtype=DataType.INT64, description='ids', is_primary=True, auto_id=True),
    FieldSchema(name='text_id', dtype=DataType.VARCHAR, description='text', max_length=500),
    FieldSchema(name='text', dtype=DataType.VARCHAR, description='text', max_length=1000),
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='embedding vectors', dim=dim)
]
schema = CollectionSchema(fields=fields, description='osschat')
collection = Collection(name=collection_name, schema=schema)

# Build an IVF_FLAT index on the embedding field with inner product (IP) as the metric.
index_params = {
    'metric_type': 'IP',
    'index_type': 'IVF_FLAT',
    'params': {'nlist': 2048}
}
collection.create_index(field_name='embedding', index_params=index_params)
```
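
As a rough guide for choosing `dim`, the sketch below maps a few commonly used sentence-embedding models to their output dimensionality. The `KNOWN_DIMS` table and the `dim_for_model` helper are illustrative names, not part of Towhee or pymilvus, and the table is not exhaustive; for other models, check the model card.

```python
# Output dimensionality of a few well-known sentence-embedding models.
# Illustrative only: extend this table for the model you actually configure.
KNOWN_DIMS = {
    'all-MiniLM-L6-v2': 384,
    'all-mpnet-base-v2': 768,
    'text-embedding-ada-002': 1536,
}


def dim_for_model(model_name: str) -> int:
    """Return the embedding dimensionality for a known model name."""
    try:
        return KNOWN_DIMS[model_name]
    except KeyError:
        raise ValueError(
            f'Unknown model {model_name!r}; check its model card for the dimension.'
        ) from None


dim = dim_for_model('all-MiniLM-L6-v2')
print(dim)  # 384
```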

### **Create pipeline and set the configuration**

> If you set `config.es_enable` to `True`, you also need to start [Elasticsearch](https://www.elastic.co/elasticsearch/).
>
> See the Configuration section below for more parameters.

```python
from towhee import AutoPipes, AutoConfig

config = AutoConfig.load_config('osschat-insert')
config.embedding_model = 'all-MiniLM-L6-v2'
config.milvus_host = '127.0.0.1'
config.milvus_port = '19530'
config.es_enable = True
config.es_connection_kwargs = {'hosts': ['https://localhost:9200'], 'basic_auth': ('elastic', 'password')}

p = AutoPipes.pipeline('osschat-insert', config=config)
res = p('https://github.com/towhee-io/towhee/blob/main/README.md', 'osschat')
```

Then you can run `collection.flush()` and `collection.num_entities` to check the number of entities inserted into Milvus as a knowledge base.

If Elasticsearch is enabled, run `es_client.search(index='osschat', body={"query":{"match_all":{}}})['hits']['total']['value']` to get the number of documents in Elasticsearch, where `es_client` is an Elasticsearch client connected to the same service.
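
The two checks above can be wrapped in small helpers. This is a minimal sketch: `count_milvus_entities` and `count_es_docs` are hypothetical names, `collection` is assumed to be the pymilvus `Collection` created earlier, and `es_client` is assumed to be an `elasticsearch.Elasticsearch` client that you construct yourself with the same arguments as `config.es_connection_kwargs`.

```python
def count_milvus_entities(collection) -> int:
    """Flush pending inserts, then return the entity count of a pymilvus Collection."""
    collection.flush()
    return collection.num_entities


def count_es_docs(es_client, index: str) -> int:
    """Return the hit count of a match_all query against an Elasticsearch index."""
    res = es_client.search(index=index, body={'query': {'match_all': {}}})
    return res['hits']['total']['value']
```

With the objects from the examples above, this would be called as `count_milvus_entities(collection)` and `count_es_docs(es_client, 'osschat')`. Note that Elasticsearch caps the reported total at 10,000 by default unless `track_total_hits` is set, so the second value is a lower bound for larger indexes.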

<br />


## Configuration 

### **EnhancedQAInsertConfig**

#### **Configuration for [Text Splitter](https://towhee.io/towhee/text-splitter):**

***type***: str

The type of splitter, defaults to `'RecursiveCharacter'`. You can set this parameter to one of ['[RecursiveCharacter](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html)', '[Markdown](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/markdown.html)', '[PythonCode](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/python.html)', '[Character](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/character_text_splitter.html#)', '[NLTK](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/nltk.html)', '[Spacy](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/spacy.html)', '[Tiktoken](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/tiktoken_splitter.html)', '[HuggingFace](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/huggingface_length_function.html)'].

***chunk_size***: int

The size of each chunk, defaults to 300.

***splitter_kwargs***: dict

The kwargs for the splitter, defaults to {}.
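
To give a feel for what `chunk_size` controls, here is a simplified fixed-size character splitter. It is only an illustration and assumes length is measured in characters; the actual splitters are the LangChain implementations linked above, which also take separators and other options into account. The `split_text` function is a hypothetical name.

```python
from typing import List


def split_text(text: str, chunk_size: int = 300) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError('chunk_size must be positive')
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


chunks = split_text('a' * 700, chunk_size=300)
print([len(c) for c in chunks])  # [300, 300, 100]
```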

#### **Configuration for Sentence Embedding:**

***embedding_model***: str

The model name for sentence embedding, defaults to `'all-MiniLM-L6-v2'`.
Refer to the [model list](https://towhee.io/tasks/detail/operator?field_name=Natural-Language-Processing&task_name=Sentence-Embedding) to choose a model. Some of these models are from [HuggingFace](https://huggingface.co/) (open source), and some are from [OpenAI](https://openai.com/) (not open source, API key required).

***openai_api_key***: str

The OpenAI API key, defaults to `None`.
This key is required if the model is from OpenAI. You can check the model provider in the [model list](https://towhee.io/sentence-embedding/openai).

***embedding_device***: int

The device number, defaults to `-1`, which means using the CPU.
If set to a value other than `-1`, the specified GPU device is used.

***embedding_normalize***: bool

Whether to normalize the embedding vectors, defaults to `True`.
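
Normalization matters for the index created earlier: with unit-length vectors, the inner product (`IP`) metric equals cosine similarity. The pure-Python sketch below illustrates this; `l2_normalize` and `inner_product` are illustrative helpers, not Towhee APIs.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0 else [x / norm for x in vec]


def inner_product(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


a, b = [3.0, 4.0], [4.0, 3.0]
cosine = inner_product(a, b) / (math.hypot(*a) * math.hypot(*b))
ip_normalized = inner_product(l2_normalize(a), l2_normalize(b))
print(math.isclose(cosine, ip_normalized))  # True
```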

#### **Configuration for [Milvus](https://towhee.io/ann-insert/osschat-milvus):**

***milvus_uri***: str

URI of Milvus vector database, default is `None`, which will use host and port instead.

***milvus_host***: str

Host of Milvus vector database, default is `'127.0.0.1'`.

***milvus_port***: str

Port of Milvus vector database, default is `'19530'`. 

***milvus_token***: str

Token of [Zilliz Cloud user](https://zilliz.com/cloud), default is `None`.

***milvus_user***: str

The user name for [Cloud user](https://zilliz.com/cloud), defaults to `None`.

***milvus_password***: str

The user password for [Cloud user](https://zilliz.com/cloud), defaults to `None`.
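
The precedence described above (use `milvus_uri` when set, otherwise fall back to host and port) can be sketched as follows. This is an assumption about how the options combine, not the pipeline's actual code; `milvus_connect_kwargs` is a hypothetical helper whose result could be passed to `pymilvus.connections.connect(**kwargs)`.

```python
from typing import Any, Dict, Optional


def milvus_connect_kwargs(
    uri: Optional[str] = None,
    host: str = '127.0.0.1',
    port: str = '19530',
    token: Optional[str] = None,
    user: Optional[str] = None,
    password: Optional[str] = None,
) -> Dict[str, Any]:
    """Build connection arguments, preferring the URI over host and port."""
    kwargs: Dict[str, Any] = {'uri': uri} if uri else {'host': host, 'port': port}
    # Only include credentials that were actually provided.
    for key, value in (('token', token), ('user', user), ('password', password)):
        if value is not None:
            kwargs[key] = value
    return kwargs


print(milvus_connect_kwargs())
# {'host': '127.0.0.1', 'port': '19530'}
```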

#### **Configuration for [Elasticsearch](https://towhee.io/elasticsearch/osschat-index):**

***es_enable***: bool

Whether to use Elasticsearch, default is `True`.

***es_connection_kwargs***: dict

The connection arguments for the Elasticsearch service.

***token_model***: str

The model used to count tokens, defaults to `'gpt-3.5-turbo'`.

<br />


## Interface

Insert documentation into Milvus (and Elasticsearch) as a knowledge base.

**Parameters:**

***doc***: str

Path or URL of the document to be loaded.

***project_name***: str

The collection name in the Milvus vector database, which is also the index name in Elasticsearch.

<br />

**Returns:** dict

A dictionary that includes `'milvus_res'` and `'es_res'`.


# More Resources

- [OpenAI's ChatGPT - Zilliz blog](https://zilliz.com/learn/ChatGPT-Vector-Database-Prompt-as-code): A guide to the new AI Stack - ChatGPT, your Vector Database, and Prompt as code
- [Enhancing ChatGPT with Milvus: Powering AI with Long-Term Memory - Zilliz blog](https://zilliz.com/learn/enhancing-chatgpt-with-milvus): By integrating GPTCache and Milvus with ChatGPT, businesses can create a more robust and efficient AI-powered support system. This approach leverages the advanced capabilities of generative AI and introduces a form of long-term memory, allowing the AI to recall and reuse information effectively.
- [ChatGPT+ Vector database + prompt-as-code - The CVP Stack - Zilliz blog](https://zilliz.com/blog/ChatGPT-VectorDB-Prompt-as-code): Extend the capability of ChatGPT with a Vector database and prompts-as-code