# Text Splitter
*author: shiyu22*
<br />
### Description
**Text splitter** splits text into a list of chunks.
> Refer to [Text Splitters](https://python.langchain.com/en/latest/modules/indexes/text_splitters.html) for the operation of splitting text.
<br />
### Code Example
```Python
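from towhee import pipe, ops, DataCollection

# Build a pipeline: load the text behind a URL, then split it into chunks.
p = (
    pipe.input('url')
        .map('url', 'text', ops.text_loader())
        # flat_map emits one output row per chunk returned by the splitter
        .flat_map('text', 'text', ops.text_splitter())
        .output('url', 'text')
)

res = p('https://github.com/towhee-io/towhee/blob/main/README.md')
DataCollection(res).show()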
```
<img src="./result.png" alt="result" height="200px"/>
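The pipeline above uses `flat_map` rather than `map` because the splitter returns a list of chunks for each input text. The following is a minimal plain-Python sketch of that flattening step, not Towhee's implementation; `flat_map_rows` and the blank-line splitter are hypothetical names used only for illustration.

```python
from typing import Callable, List, Tuple

Row = Tuple[str, str]  # (url, text)


def flat_map_rows(rows: List[Row], splitter: Callable[[str], List[str]]) -> List[Row]:
    """Emit one (url, chunk) row for every chunk the splitter returns."""
    out: List[Row] = []
    for url, text in rows:
        for chunk in splitter(text):
            out.append((url, chunk))
    return out


# Hypothetical stand-in for the text splitter: split on blank lines.
def split_on_blank_lines(text: str) -> List[str]:
    return text.split("\n\n")


rows = [("doc-1", "alpha\n\nbeta"), ("doc-2", "gamma")]
flattened = flat_map_rows(rows, split_on_blank_lines)
for url, chunk in flattened:
    print(url, repr(chunk))
```

Each input row fans out into as many output rows as there are chunks, and every chunk keeps the `url` it came from.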
<br />
## Factory Constructor
Create the operator via the following factory method:
***towhee.text_splitter(type='RecursiveCharacter', chunk_size=300)***
**Parameters:**
***type***: str
 The type of splitter, defaults to 'RecursiveCharacter'. Supported values are ['[RecursiveCharacter](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html)', '[Markdown](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/markdown.html)', '[PythonCode](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/python.html)', '[Character](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/character_text_splitter.html#)', '[NLTK](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/nltk.html)', '[Spacy](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/spacy.html)', '[Tiktoken](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/tiktoken_splitter.html)', '[HuggingFace](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/huggingface_length_function.html)'].
***chunk_size***: int
 The maximum size of each chunk, defaults to 300.
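To illustrate what an upper bound like `chunk_size` means in practice, here is a simplified plain-Python sketch of recursive character splitting. It is not the LangChain splitter the operator wraps: the `split_text` function, its separator order, and the use of character count as the size measure are assumptions made for this example.

```python
from typing import List, Sequence


def split_text(text: str, chunk_size: int = 300,
               separators: Sequence[str] = ("\n\n", "\n", " ")) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Coarse separators are tried first; a piece that is still too long
    is split again with the next, finer separator.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(split_text(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb cccc dddd"
for chunk in split_text(sample, chunk_size=9):
    print(len(chunk), repr(chunk))
```

A smaller `chunk_size` yields more, shorter chunks; a larger one keeps more context together in each chunk.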
<br />
### Interface
The operator splits the incoming text and returns the chunks.
**Parameters:**
***data***: str
 The text data.
**Return**: List[Document]
 A list of the chunked documents.