# Text Splitter

*author: shiyu22*

<br />


### Description

**Text splitter** splits text into a list of chunks.

> Refer to [Text Splitters](https://python.langchain.com/en/latest/modules/indexes/text_splitters.html) for details on how the text is split.

<br />


### Code Example

```Python
from towhee import pipe, ops, DataCollection

p = (
    pipe.input('url')
        .map('url', 'text', ops.text_loader())
        .flat_map('text', 'text', ops.text_splitter())
        .output('url', 'text')
    )

res = p('https://github.com/towhee-io/towhee/blob/main/README.md')
DataCollection(res).show()
```

<img src="./result.png" alt="result" height="200px"/>

<br />


### Factory Constructor

Create the operator via the following factory method:

***towhee.text_splitter(type='RecursiveCharacter', chunk_size=300, \*\*kwargs)***

**Parameters:**

​	***type***: str

​	The type of splitter, defaults to 'RecursiveCharacter'. Supported values are ['[RecursiveCharacter](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html)', '[Markdown](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/markdown.html)', '[PythonCode](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/python.html)', '[Character](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/character_text_splitter.html#)', '[NLTK](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/nltk.html)', '[Spacy](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/spacy.html)', '[Tiktoken](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/tiktoken_splitter.html)', '[HuggingFace](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/huggingface_length_function.html)'].

​	***chunk_size***: int

​	The maximum size of each chunk, defaults to 300.

<br />
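To give a feel for what `chunk_size` controls, here is a minimal plain-Python sketch of recursive character splitting. It is an illustration only, not the operator's or LangChain's actual implementation: the function name `recursive_split`, the separator order, and the choice to measure size in characters are all assumptions made for this example.

```python
from typing import List, Sequence


def recursive_split(text: str, chunk_size: int = 300,
                    separators: Sequence[str] = ("\n\n", "\n", " ", "")) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Hypothetical helper: tries the coarsest separator first and falls back
    to finer ones only for pieces that are still too long.
    `separators` is expected to end with "" (the hard-cut fallback).
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta\n\ngamma delta epsilon\n\nzeta"
for chunk in recursive_split(sample, chunk_size=12):
    print(repr(chunk))
```

A smaller `chunk_size` yields more, shorter chunks; paragraph breaks are preferred as split points and words are only separated when a paragraph alone exceeds the limit.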


### Interface

The operator splits the incoming text and returns the resulting chunks.

**Parameters:**

​	***data***: str

​	The text data.


**Returns**: List[Document]

A list of document chunks.
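
Because the operator returns a list, the example pipeline uses `flat_map` rather than `map`, so each chunk becomes its own output row. The plain-Python sketch below mimics that expansion. The helper `expand_chunks` and the sample values are hypothetical and only illustrate the row shape; this is not Towhee's implementation.

```python
from typing import Iterable, List, Tuple


def expand_chunks(url: str, chunks: Iterable[str]) -> List[Tuple[str, str]]:
    """Emit one (url, text) row per chunk, repeating the url for each row."""
    return [(url, chunk) for chunk in chunks]


# Hypothetical input: one source URL and the chunks produced from its text.
rows = expand_chunks("https://example.com/README.md", ["first chunk", "second chunk"])
for row in rows:
    print(row)
```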


### More Resources

- [Experiment with 5 Chunking Strategies via LangChain for LLM - Zilliz blog](https://zilliz.com/blog/experimenting-with-different-chunking-strategies-via-langchain): Explore the complexities of text chunking in retrieval augmented generation applications and learn how different chunking strategies impact the same piece of data.
- [A Guide to Chunking Strategies for Retrieval Augmented Generation (RAG) - Zilliz blog](https://zilliz.com/learn/guide-to-chunking-strategies-for-rag): A guide exploring various facets of chunking strategies within Retrieval-Augmented Generation (RAG) systems.
- [Sentence Transformers for Long-Form Text - Zilliz blog](https://zilliz.com/learn/Sentence-Transformers-for-Long-Form-Text): A deep dive into modern transformer-based embeddings for long-form text.
- [Key Strategies for Smart Retrieval Augmented Generation (RAG) - Zilliz blog](https://zilliz.com/blog/exploring-rag-chunking-llms-and-evaluations): Three key strategies to get the most out of RAG: smart text chunking, iterating on different embedding models, and experimenting with different LLMs.
- [The guide to jina-embeddings-v2-small-en | Jina AI](https://zilliz.com/ai-models/jina-embeddings-v2-small-en): jina-embeddings-v2-small-en: specialized text embedding model for long English documents; up to 8192 tokens.
- [Massive Text Embedding Benchmark (MTEB)](https://zilliz.com/glossary/massive-text-embedding-benchmark-(mteb)): A standardized way to evaluate text embedding models across a range of tasks and languages, leading to better text embedding models for your app.
- [OpenAI text-embedding-3-large | Zilliz](https://zilliz.com/ai-models/text-embedding-3-large): Building GenAI applications with the text-embedding-3-large model and Zilliz Cloud / Milvus.
- [The guide to jina-embeddings-v2-base-en | Jina AI](https://zilliz.com/ai-models/jina-embeddings-v2-base-en): jina-embeddings-v2-base-en: specialized embedding model for English text and long documents; supports sequences of up to 8192 tokens.
- [Text as Data, From Anywhere to Anywhere - Zilliz blog](https://zilliz.com/blog/text-as-data-from-anywhere-to-anywhere): Whether you prefer a no-code or minimal-code approach, Airbyte and PyAirbyte offer robust solutions for integrating both structured and unstructured data. AJ Steers painted a good picture of the potential of these tools in revolutionizing data workflows.