# Video-Text Retrieval Embedding with DRL
*author: Chen Zhang*
<br />
## Description
This operator extracts features for video or text with [DRL (Disentangled Representation Learning for Text-Video Retrieval)](https://arxiv.org/pdf/2203.07111v1.pdf); the resulting token embeddings can then be scored for similarity with the Weighted Token-wise Interaction (WTI) module.
<br />
![](WTI.png)
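For intuition, here is a minimal numpy sketch of the token-wise interaction idea behind WTI. The real module uses per-token weights learned by the DRL model; this sketch substitutes uniform weights, and `wti_similarity` is an illustrative name, not part of this operator's API.
```python
import numpy as np

def wti_similarity(text_emb: np.ndarray, video_emb: np.ndarray) -> float:
    """Illustrative token-wise similarity with uniform token weights."""
    # Normalize token embeddings so dot products become cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=-1, keepdims=True)
    sim = t @ v.T  # (text_token_num, video_token_num) pairwise similarities
    # Each text token attends to its best-matching video token, and vice versa.
    # DRL learns per-token weights; uniform averaging stands in for them here.
    t2v = sim.max(axis=1).mean()
    v2t = sim.max(axis=0).mean()
    return float((t2v + v2t) / 2)
```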
## Code Example
Read the text 'kids feeding and playing with the horse' to generate a text embedding.
```python
from towhee import pipe, ops, DataCollection

# Map the input text to a DRL text embedding.
p = (
    pipe.input('text')
        .map('text', 'vec', ops.video_text_embedding.drl(base_encoder='clip_vit_b32', modality='text', device='cuda:0'))
        .output('text', 'vec')
)

DataCollection(p('kids feeding and playing with the horse')).show()
```
![](text_emb_result.png)
Load a video from the path './demo_video.mp4' to generate a video embedding.
```python
from towhee import pipe, ops, DataCollection

# Decode the video, uniformly subsample 12 frames, then embed them with DRL.
p = (
    pipe.input('video_path')
        .map('video_path', 'flame_gen', ops.video_decode.ffmpeg(sample_type='uniform_temporal_subsample', args={'num_samples': 12}))
        .map('flame_gen', 'flame_list', lambda x: [y for y in x])  # materialize the frame generator into a list
        .map('flame_list', 'vec', ops.video_text_embedding.drl(base_encoder='clip_vit_b32', modality='video', device='cuda:0'))
        .output('video_path', 'flame_list', 'vec')
)

DataCollection(p('./demo_video.mp4')).show()
```
![](video_emb_result.png)
<br />
Note: this model does not support CPU inference; you must specify a CUDA device, e.g. `device='cuda:0'`.
## Factory Constructor
Create the operator via the following factory method
***drl(base_encoder, modality)***
**Parameters:**
***base_encoder:*** *str*
​ The base CLIP encoder name used in the DRL model. Supported model names:
- clip_vit_b32
***modality:*** *str*
​ Which modality (*video* or *text*) is used to generate the embedding.
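For example, the operator can be constructed directly (a sketch mirroring the pipelines above; `device` is passed exactly as in the earlier examples):
```python
from towhee import ops

# One operator per modality; both share the same base CLIP encoder.
text_encoder = ops.video_text_embedding.drl(base_encoder='clip_vit_b32', modality='text', device='cuda:0')
video_encoder = ops.video_text_embedding.drl(base_encoder='clip_vit_b32', modality='video', device='cuda:0')
```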
<br />
## Interface
A video-text embedding operator takes a list of [towhee VideoFrame](link/to/towhee/image/api/doc) or a string as input and generates an embedding as a numpy ndarray.
**Parameters:**
***data:*** *List[towhee.types.VideoFrame]* or *str*
​ The data used to generate the embedding: a list of VideoFrame (uniformly subsampled from a video) or a text string, depending on the specified modality.
**Returns:** *numpy.ndarray*
​ The embedding extracted by the model. For text, the shape is (text_token_num, model_dim); for video, the shape is (video_token_num, model_dim).
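As a usage sketch, the two embeddings produced by the pipelines above can be compared directly. This assumes the text and video pipelines are bound to `text_p` and `video_p` (renamed here so both can be used together) and reuses the illustrative `wti_similarity` helper from the Description section:
```python
# Run both pipelines; get() retrieves one row of the declared output columns.
_, text_vec = text_p('kids feeding and playing with the horse').get()
_, _, video_vec = video_p('./demo_video.mp4').get()

print(text_vec.shape)   # (text_token_num, model_dim)
print(video_vec.shape)  # (video_token_num, model_dim)

# Score the pair with the uniform-weight sketch from the Description section.
print(wti_similarity(text_vec, video_vec))
```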
## More Resources
- [Vector Database Use Cases: Video Similarity Search - Zilliz](https://zilliz.com/vector-database-use-cases/video-similarity-search): Experience a 10x performance boost and unparalleled precision when your video similarity search system is powered by Zilliz Cloud.
- [ColBERT: A Token-Level Embedding and Ranking Model - Zilliz blog](https://zilliz.com/learn/explore-colbert-token-level-embedding-and-ranking-model-for-similarity-search): Unlike traditional embedding models like BERT, which focus on pooling embeddings into a single vector, ColBERT retains individual token representations. Through its innovative late interaction mechanism, it enables more precise and granular similarity calculations.
- [The guide to mistral-embed | Mistral AI](https://zilliz.com/ai-models/mistral-embed): mistral-embed: a specialized embedding model for text data with a context window of 8,000 tokens. Optimized for similarity retrieval and RAG applications.
- [Supercharged Semantic Similarity Search in Production - Zilliz blog](https://zilliz.com/learn/supercharged-semantic-similarity-search-in-production): Building a Blazing Fast, Highly Scalable Text-to-Image Search with CLIP embeddings and Milvus, the most advanced open-source vector database.
- [The guide to all-MiniLM-L12-v2 | Hugging Face](https://zilliz.com/ai-models/all-MiniLM-L12-v2): all-MiniLM-L12-v2: a text embedding model ideal for semantic search and RAG and fine-tuned based on Microsoft/MiniLM-L12-H384-uncased
- [Build a Multimodal Search System with Milvus - Zilliz blog](https://zilliz.com/blog/how-vector-dbs-are-revolutionizing-unstructured-data-search-ai-applications): Implementing a Multimodal Similarity Search System Using Milvus, Radient, ImageBind, and Meta-Chameleon-7b
- [Sparse and Dense Embeddings: A Guide for Effective Information Retrieval with Milvus | Zilliz Webinar](https://zilliz.com/event/sparse-and-dense-embeddings-webinar): Zilliz webinar covering what sparse and dense embeddings are and when you'd want to use one over the other.
- [Training Text Embeddings with Jina AI - Zilliz blog](https://zilliz.com/blog/training-text-embeddings-with-jina-ai): In a recent talk by Bo Wang, he discussed the creation of Jina text embeddings for modern vector search and RAG systems. He also shared methodologies for training embedding models that effectively encode extensive information.