readthedocs
Updated 6 months ago
data-source
readthedocs
author: junjie.jiang
Description
Gets the list of documents for a single Read the Docs project.
Code Example
Example
from towhee import DataLoader, pipe, ops

# Pipeline: load each page, split it into sentences, then embed and normalize.
p = (
    pipe.input('url')
        .map('url', 'text', ops.text_loader())
        .flat_map('text', 'sentence', ops.text_splitter())
        .map('sentence', 'embedding', ops.sentence_embedding.transformers(model_name='all-MiniLM-L6-v2'))
        .map('embedding', 'embedding', ops.towhee.np_normalize())
        .output('embedding')
)

# Run the pipeline on each item yielded by the data source.
for data in DataLoader(ops.data_source.readthedocs('https://towhee.readthedocs.io/en/latest/', include='html', exclude='index.html')):
    print(p(data).to_list(kv_format=True))

# batch
for data in DataLoader(ops.data_source.readthedocs('https://towhee.readthedocs.io/en/latest/', include='html', exclude='index.html'), batch_size=10):
    p.batch(data)
Parameters:
page_prefix: str
The root path of the pages. Crawled links are generally relative paths, so the complete URL is obtained by joining the root path and the relative path.
index_page: str
The main page, which contains links to all other pages. If None, page_prefix is used.
example: https://towhee.readthedocs.io/en/latest/
include: Union[List[str], str]
Keep only URLs that meet this condition.
exclude: Union[List[str], str]
Filter out URLs that meet this condition.
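The parameters above can be illustrated with a small standalone sketch. It is not the operator's implementation: the helper name `resolve_and_filter` and the sample relative links are hypothetical, and treating `include` / `exclude` as regular-expression searches is an assumption, since this page does not say whether matching is by substring or by pattern. It only shows the general idea of joining a root path with relative links and then filtering the resulting URLs.

```python
import re
from typing import List, Optional, Union
from urllib.parse import urljoin

Patterns = Optional[Union[List[str], str]]


def _as_list(patterns: Patterns) -> List[str]:
    """Normalize a single pattern or a list of patterns to a list."""
    if patterns is None:
        return []
    return [patterns] if isinstance(patterns, str) else list(patterns)


def resolve_and_filter(page_prefix: str,
                       relative_links: List[str],
                       include: Patterns = None,
                       exclude: Patterns = None) -> List[str]:
    """Hypothetical helper: join links to the root path, then filter them.

    Assumption: a URL is kept if it matches at least one include pattern
    (when any are given) and matches no exclude pattern.
    """
    inc = _as_list(include)
    exc = _as_list(exclude)
    kept = []
    for link in relative_links:
        url = urljoin(page_prefix, link)  # root path + relative path
        if inc and not any(re.search(p, url) for p in inc):
            continue
        if any(re.search(p, url) for p in exc):
            continue
        kept.append(url)
    return kept


# Made-up relative links, as a crawler might find on the index page.
urls = resolve_and_filter(
    'https://towhee.readthedocs.io/en/latest/',
    ['index.html', 'get_started/quick_start.html', '_static/logo.png'],
    include='html',
    exclude='index.html',
)
print(urls)
```

With these sample links, only the quick-start page survives: the image does not match `include`, and the index page matches `exclude`.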
More Resources
- RAG Without OpenAI: BentoML, OctoAI and Milvus - Zilliz blog: In this tutorial we will use BentoML to serve embeddings, OctoAI to get the LLM and Milvus as our vector database.
- Building RAG with Llama3, Ollama, DSPy, and Milvus - Zilliz blog: In this article, we aim to guide readers through constructing an RAG system using four key technologies: Llama3, Ollama, DSPy, and Milvus. First, let's understand what they are.
- An LLM Powered Text to Image Prompt Generation with Milvus - Zilliz blog: An interesting LLM project powered by the Milvus vector database for generating more efficient text-to-image prompts.
- Vectorizing and Querying EPUB Content with the Unstructured and Milvus - Zilliz blog: In this post, we explore the vectorization and retrieval of EPUB data using Milvus and the Unstructured framework, offering developers actionable insights for enhancing LLM performance.
- Vectorizing PDFs - Ingesting PDFs into Vector Databases with Milvus and Zilliz - Zilliz blog: You will learn how Zilliz Cloud Pipeline transforms PDF data into a format ready for LLMs to use in semantic search tasks. Finally, we will conduct data retrieval using vector search.
- Training Text Embeddings with Jina AI - Zilliz blog: In a recent talk by Bo Wang, he discussed the creation of Jina text embeddings for modern vector search and RAG systems. He also shared methodologies for training embedding models that effectively encode extensive information, along with guidance o