readthedocs

author: junjie.jiang


Description

This operator retrieves the list of documents for a single Read the Docs project.


Code Example



from towhee import DataLoader, pipe, ops

# Build a pipeline that loads each page, splits it into sentences,
# embeds each sentence, and normalizes the embeddings.
p = (
    pipe.input('url')
    .map('url', 'text', ops.text_loader())
    .flat_map('text', 'sentence', ops.text_splitter())
    .map('sentence', 'embedding', ops.sentence_embedding.transformers(model_name='all-MiniLM-L6-v2'))
    .map('embedding', 'embedding', ops.towhee.np_normalize())
    .output('embedding')
)



# Process the crawled documents one at a time
for data in DataLoader(ops.data_source.readthedocs('https://towhee.readthedocs.io/en/latest/', include='html', exclude='index.html')):
    print(p(data).to_list(kv_format=True))

# Process the crawled documents in batches of 10
for data in DataLoader(ops.data_source.readthedocs('https://towhee.readthedocs.io/en/latest/', include='html', exclude='index.html'), batch_size=10):
    p.batch(data)

Parameters:

page_prefix: str

The root path of the project pages. The crawled links are generally relative paths, so the complete URL is obtained by joining this root path with each relative path.

index_page: str

The main page that contains links to all other pages. If None, page_prefix is used.

Example: https://towhee.readthedocs.io/en/latest/

include: Union[List[str], str]

Only URLs that match this pattern are included.

exclude: Union[List[str], str]

URLs that match this pattern are excluded; a short usage sketch combining these parameters follows below.
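
For reference, the parameters above can be passed as keyword arguments when constructing the data source. The snippet below is a minimal sketch based on the parameter list: the keyword names mirror the documentation, and the list-valued exclude (with the extra, purely illustrative 'search.html' pattern) relies on the Union[List[str], str] typing shown above.

from towhee import DataLoader, ops

# A minimal sketch, assuming the keyword names follow the parameter list above;
# 'search.html' is just an illustrative exclude pattern.
source = ops.data_source.readthedocs(
    page_prefix='https://towhee.readthedocs.io/en/latest/',
    index_page=None,                         # fall back to page_prefix for the link index
    include='html',                          # keep only URLs containing 'html'
    exclude=['index.html', 'search.html'],   # drop URLs containing any of these patterns
)

for data in DataLoader(source):
    print(data)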
