Text Spliter

author: shiyu22


Description

The text_spliter operator splits input text into a list of chunks.

See Recursive Characters for details on how the text is split.
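To make the idea of recursive character splitting concrete, here is a minimal pure-Python sketch. It is not the operator's implementation: the function name `recursive_split`, the separator order, and the lack of chunk overlap are all assumptions made for illustration. The text is split on the coarsest separator first, and any piece that is still too long is split again with the next, finer separator.

```python
from typing import List, Sequence


def recursive_split(text: str, chunk_size: int = 300,
                    separators: Sequence[str] = ("\n\n", "\n", " ", "")) -> List[str]:
    """Split text into chunks of at most chunk_size characters (illustrative only)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Keep merging neighbouring pieces while they fit in one chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in recursive_split("alpha beta gamma\n\ndelta epsilon", chunk_size=12):
        print(repr(chunk))
```

Paragraph breaks are tried first, then line breaks, then spaces, so chunks tend to end at natural boundaries whenever the size limit allows it.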


Code Example

from towhee import pipe, ops, DataCollection

p = (
    pipe.input('url')
        .map('url', 'text', ops.text_loader())
        .flat_map('text', 'text', ops.text_spliter())
        .output('url', 'text')
    )

res = p('https://github.com/towhee-io/towhee/blob/main/README.md')
DataCollection(res).show()


Factory Constructor

Create the operator via the following factory method:

towhee.text_spliter(type='RecursiveCharacter', chunk_size=300)

Parameters:

type: str

The splitter type; defaults to 'RecursiveCharacter'. Supported values are 'RecursiveCharacter', 'Markdown', 'PythonCode', 'Character', 'NLTK', 'Spacy', 'Tiktoken', and 'HuggingFace'.

chunk_size: int

The maximum size of each chunk; defaults to 300.
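As a hedged sketch of how these two parameters might be passed, the helper below builds the same pipeline as the code example with a configurable splitter. The helper name `build_split_pipeline`, the `SUPPORTED_TYPES` constant, and the `'chunk'` column name are hypothetical. The keyword arguments `type` and `chunk_size` are taken from the parameter list above. The up-front validation is an addition for illustration, not documented operator behavior.

```python
# Values copied from the parameter description above.
SUPPORTED_TYPES = [
    'RecursiveCharacter', 'Markdown', 'PythonCode', 'Character',
    'NLTK', 'Spacy', 'Tiktoken', 'HuggingFace',
]


def build_split_pipeline(splitter_type: str = 'RecursiveCharacter', chunk_size: int = 300):
    """Return a pipeline that loads a URL and splits its text (hypothetical helper)."""
    if splitter_type not in SUPPORTED_TYPES:
        raise ValueError(f'unsupported splitter type: {splitter_type!r}')
    if chunk_size <= 0:
        raise ValueError('chunk_size must be a positive integer')

    # Imported lazily so the argument checks above work without towhee installed.
    from towhee import pipe, ops

    return (
        pipe.input('url')
            .map('url', 'text', ops.text_loader())
            .flat_map('text', 'chunk',
                      ops.text_spliter(type=splitter_type, chunk_size=chunk_size))
            .output('url', 'chunk')
    )
```

For example, `build_split_pipeline('Markdown', chunk_size=500)` would request the Markdown splitter with larger chunks, assuming towhee and the chosen splitter's dependencies are installed.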


Interface

The operator splits the incoming text and returns the resulting chunks.

Parameters:

data: str

The text to split.

Returns: List[Document]

A list of chunked documents.
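Because the operator returns a list, the code example attaches it with flat_map, which emits one output row per chunk. The pure-Python sketch below imitates that behavior without towhee. Both `flat_map_rows` and `toy_spliter` are hypothetical stand-ins, and the chunks are plain strings for simplicity.

```python
from typing import Callable, Iterable, List, Tuple


def toy_spliter(text: str) -> List[str]:
    """Hypothetical stand-in for the operator: split on blank lines."""
    return [part for part in text.split('\n\n') if part]


def flat_map_rows(rows: Iterable[Tuple[str, str]],
                  op: Callable[[str], List[str]]) -> List[Tuple[str, str]]:
    """Apply op to each row's text and emit one (url, chunk) row per chunk."""
    out = []
    for url, text in rows:
        for chunk in op(text):
            out.append((url, chunk))
    return out


if __name__ == "__main__":
    rows = [('doc-a', 'first\n\nsecond'), ('doc-b', 'only')]
    for row in flat_map_rows(rows, toy_spliter):
        print(row)
```

Each chunk stays paired with the URL it came from, which is why the code example outputs both the 'url' and 'text' columns.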
