

data-source

readthedocs

author: junjie.jiang


Description

Gets the list of documents for a single Read the Docs project.


Code Example

Example


from towhee import DataLoader, pipe, ops

# Pipeline: load each page, split it into chunks, embed each chunk, normalize the vectors.
p = (
    pipe.input('url')
    .map('url', 'text', ops.text_loader())
    .flat_map('text', 'sentence', ops.text_splitter())
    .map('sentence', 'embedding', ops.sentence_embedding.transformers(model_name='all-MiniLM-L6-v2'))
    .map('embedding', 'embedding', ops.towhee.np_normalize())
    .output('embedding')
)

# Process the document URLs one at a time.
for data in DataLoader(ops.data_source.readthedocs('https://towhee.readthedocs.io/en/latest/', include='html', exclude='index.html')):
    print(p(data).to_list(kv_format=True))

# Process the document URLs in batches of 10.
for data in DataLoader(ops.data_source.readthedocs('https://towhee.readthedocs.io/en/latest/', include='html', exclude='index.html'), batch_size=10):
    p.batch(data)

Parameters:

page_prefix: str

The root path of the pages. Crawled links are generally relative paths, so the complete URL is obtained by joining the root path with the relative path.
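
The joining of a root path and a relative link can be sketched with the standard library. This is only an illustration, not the operator's actual code: the helper name `to_absolute` and the relative link `guide/intro.html` are hypothetical.

```python
from urllib.parse import urljoin


def to_absolute(page_prefix: str, relative_link: str) -> str:
    """Join a crawled relative link onto the root path (hypothetical helper)."""
    # urljoin replaces the last path segment unless the base ends with '/',
    # so make sure the prefix is treated as a directory.
    if not page_prefix.endswith('/'):
        page_prefix += '/'
    return urljoin(page_prefix, relative_link)


# 'guide/intro.html' is a made-up relative link used only for illustration.
print(to_absolute('https://towhee.readthedocs.io/en/latest/', 'guide/intro.html'))
# https://towhee.readthedocs.io/en/latest/guide/intro.html
```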

index_page: str

The main page that contains links to all other pages. If None, page_prefix is used.

example: https://towhee.readthedocs.io/en/latest/

include: Union[List[str], str]

Only URLs that match this condition are kept.

exclude: Union[List[str], str]

URLs that match this condition are filtered out.
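
How include and exclude might combine can be sketched as follows. This is a minimal sketch under stated assumptions: it treats each condition as a regular expression searched within the URL, which may differ from how the operator actually matches. The helper names and the sample URLs are hypothetical.

```python
import re
from typing import List, Optional, Union

Patterns = Optional[Union[List[str], str]]


def _as_list(patterns: Patterns) -> List[str]:
    """Normalize a single pattern, a list of patterns, or None to a list."""
    if patterns is None:
        return []
    return [patterns] if isinstance(patterns, str) else list(patterns)


def keep_url(url: str, include: Patterns = None, exclude: Patterns = None) -> bool:
    """Return True if the URL passes both filters (assumed regex semantics)."""
    inc = _as_list(include)
    exc = _as_list(exclude)
    # With include patterns present, the URL must match at least one of them.
    if inc and not any(re.search(p, url) for p in inc):
        return False
    # The URL must not match any exclude pattern.
    return not any(re.search(p, url) for p in exc)


# Made-up links used only for illustration.
urls = ['index.html', 'operators.html', 'logo.png']
print([u for u in urls if keep_url(u, include='html', exclude='index.html')])
# ['operators.html']
```

With the arguments from the example above (include='html', exclude='index.html'), this sketch keeps HTML pages and drops the index page itself.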
