# readthedocs

*author: junjie.jiang*


<br />

## Description

Gets the list of documents for a single Read the Docs project.

<br />


## Code Example

### Example

```python
from towhee import DataLoader, pipe, ops


def text_split_op(text):
    # Placeholder splitter, not part of this operator: one sentence per non-empty line.
    # Swap in your own sentence-splitting logic.
    return [line.strip() for line in text.splitlines() if line.strip()]


# Load each page, split it into sentences, then embed and normalize every sentence.
p = (
    pipe.input('url')
    .map('url', 'text', ops.text_loader())
    .flat_map('text', 'sentence', text_split_op)
    .map('sentence', 'embedding', ops.sentence_embedding.transformers(model_name='all-MiniLM-L6-v2'))
    .map('embedding', 'embedding', ops.towhee.np_normalize())
    .output('embedding')
)

# Process one URL at a time
for data in DataLoader(ops.data_source.readthedocs('https://towhee.readthedocs.io/en/latest/', include='*html')):
    print(p(data).to_list(kv_format=True))

# Process URLs in batches of 10
for data in DataLoader(ops.data_source.readthedocs('https://towhee.readthedocs.io/en/latest/', include='*html'), batch_size=10):
    p.batch(data)
```

**Parameters:**


***page_prefix:*** *str*

The root path of the pages. Crawled links are generally relative paths, so the complete URL of each page is built by joining this root path with the relative path.

***index_page:*** *str*

The main page, which contains links to all other pages. If None, `page_prefix` is used.

Example: https://towhee.readthedocs.io/en/latest/
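As a rough illustration of how `page_prefix` and `index_page` relate, the sketch below pulls the links out of an index page and joins each relative path with the root path. It uses only the Python standard library and is not the operator's actual implementation; `LinkCollector`, `resolve_links`, and the sample HTML are hypothetical names and data made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def resolve_links(index_html, page_prefix):
    """Return absolute URLs for all links found in the index page HTML."""
    collector = LinkCollector()
    collector.feed(index_html)
    # Relative paths are joined with the root path to form complete URLs.
    return [urljoin(page_prefix, link) for link in collector.links]


# Stand-in for the HTML of an index page (made up for this example)
sample_index = '<a href="index.html">Home</a><a href="guide/install.html">Install</a>'
urls = resolve_links(sample_index, 'https://towhee.readthedocs.io/en/latest/')
print(urls)
```

Note that `urljoin` only appends the relative path when the root path ends with a trailing slash, which is why the example prefix ends in `/`.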

***include:*** *Union[List[str], str]*

Keep only the URLs that match this pattern or list of patterns.

***exclude:*** *Union[List[str], str]*

Filter out the URLs that match this pattern or list of patterns.
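To make the interplay of `include` and `exclude` concrete, here is a minimal sketch of this kind of URL filtering. It assumes shell-style wildcard matching (as the `'*html'` value in the example above suggests) and uses `fnmatch` from the standard library; the operator's real matching rules may differ. `filter_urls`, `as_list`, and the `example.org` URLs are hypothetical and exist only for this illustration.

```python
from fnmatch import fnmatchcase


def as_list(patterns):
    """Accept None, a single pattern string, or a list of patterns."""
    if patterns is None:
        return []
    return [patterns] if isinstance(patterns, str) else list(patterns)


def filter_urls(urls, include=None, exclude=None):
    """Keep URLs matching any include pattern and no exclude pattern.

    Assumption: patterns are shell-style wildcards; with no include
    patterns, every URL is a candidate.
    """
    include_patterns = as_list(include)
    exclude_patterns = as_list(exclude)
    kept = []
    for url in urls:
        if include_patterns and not any(fnmatchcase(url, p) for p in include_patterns):
            continue
        if any(fnmatchcase(url, p) for p in exclude_patterns):
            continue
        kept.append(url)
    return kept


# Made-up URLs for illustration
candidates = [
    'https://example.org/en/latest/index.html',
    'https://example.org/en/latest/genindex.html',
    'https://example.org/en/latest/_static/logo.png',
]
selected = filter_urls(candidates, include='*html', exclude=['*genindex*'])
print(selected)
```

In this sketch `exclude` takes precedence: a URL that matches both an include and an exclude pattern is dropped.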