readthedocs
Updated 11 months ago
data-source
author: junjie.jiang
Description
Gets the list of documents for a single Read the Docs project.
Code Example
from towhee import DataLoader, pipe, ops

# Load each page, split it into sentences, embed every sentence,
# then normalize the embedding.
p = (
    pipe.input('url')
    .map('url', 'text', ops.text_loader())
    .flat_map('text', 'sentence', ops.text_splitter())
    .map('sentence', 'embedding', ops.sentence_embedding.transformers(model_name='all-MiniLM-L6-v2'))
    .map('embedding', 'embedding', ops.towhee.np_normalize())
    .output('embedding')
)

# one URL at a time
for data in DataLoader(ops.data_source.readthedocs('https://towhee.readthedocs.io/en/latest/', include='html', exclude='index.html')):
    print(p(data).to_list(kv_format=True))

# batch
for data in DataLoader(ops.data_source.readthedocs('https://towhee.readthedocs.io/en/latest/', include='html', exclude='index.html'), batch_size=10):
    p.batch(data)
Parameters:
page_prefix: str
The root URL of the pages. Crawled links are usually relative paths, so the complete URL is built by joining page_prefix with each relative path.
index_page: str
The main page, which contains links to all other pages. If None, page_prefix is used.
Example: https://towhee.readthedocs.io/en/latest/
include: Union[List[str], str]
Only URLs that meet this condition are kept.
exclude: Union[List[str], str]
URLs that meet this condition are filtered out.
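
The interplay of page_prefix, include, and exclude can be illustrated with a minimal standard-library sketch. This is not the operator's implementation: the helper names (`collect_urls`, `_LinkCollector`, `_as_list`) and the sample HTML are hypothetical, and it assumes include and exclude are plain substring tests on each link, which may differ from how the operator actually matches them (for example, it may use regular expressions).

```python
from html.parser import HTMLParser
from typing import List, Union
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


def _as_list(cond: Union[List[str], str, None]) -> List[str]:
    """Accepts a single string or a list of strings, as the parameters do."""
    if cond is None:
        return []
    return [cond] if isinstance(cond, str) else list(cond)


def collect_urls(index_html: str, page_prefix: str,
                 include=None, exclude=None) -> List[str]:
    """Hypothetical helper: filter links, then join them with page_prefix."""
    parser = _LinkCollector()
    parser.feed(index_html)
    includes, excludes = _as_list(include), _as_list(exclude)
    urls: List[str] = []
    for href in parser.hrefs:
        # Assumption: a link is kept if it contains any include string...
        if includes and not any(s in href for s in includes):
            continue
        # ...and dropped if it contains any exclude string.
        if any(s in href for s in excludes):
            continue
        # Relative path + root path -> complete URL.
        url = urljoin(page_prefix, href)
        if url not in urls:
            urls.append(url)
    return urls


# Made-up index page content, for illustration only.
sample = (
    '<a href="index.html">Home</a>'
    '<a href="get_started/install.html">Install</a>'
    '<a href="_static/logo.png">Logo</a>'
)
urls = collect_urls(sample, 'https://towhee.readthedocs.io/en/latest/',
                    include='html', exclude='index.html')
print(urls)
```

Under these assumptions only the install page survives: index.html is excluded explicitly, and the image link never matches include. Note that `urljoin` only appends the relative path when page_prefix ends with a slash, as it does in the example URL above.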
junjie.jiang | d1f1e05c11 | 4 Commits

| File | Size | Updated |
---|---|---|
| .gitattributes | 1.1 KiB | 11 months ago |
| README.md | 1.4 KiB | 11 months ago |
| __init__.py | 704 B | 11 months ago |
| docs_reader.py | 2.1 KiB | 11 months ago |
| requirements.txt | 13 B | 11 months ago |