# Video-Text Retrieval Embedding with DRL
*author: Chen Zhang*
<br />
## Description
This operator extracts features for video or text with [DRL (Disentangled Representation Learning for Text-Video Retrieval)](https://arxiv.org/pdf/2203.07111v1.pdf); the resulting token embeddings can then be scored for similarity with the Weighted Token-wise Interaction (WTI) module.
<br />
![](WTI.png)
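For intuition, here is a minimal numpy sketch of the token-wise interaction idea behind WTI. The real module uses per-token weights learned by the DRL model; this sketch substitutes uniform weights, and `wti_similarity` is an illustrative name, not part of this operator's API.
```python
import numpy as np

def wti_similarity(text_emb: np.ndarray, video_emb: np.ndarray) -> float:
    """Illustrative token-wise similarity with uniform token weights."""
    # Normalize token embeddings so dot products become cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=-1, keepdims=True)
    sim = t @ v.T  # (text_token_num, video_token_num) pairwise similarities
    # Each text token attends to its best-matching video token, and vice versa.
    # DRL learns per-token weights; uniform averaging stands in for them here.
    t2v = sim.max(axis=1).mean()
    v2t = sim.max(axis=0).mean()
    return float((t2v + v2t) / 2)
```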
## Code Example
Read the text 'kids feeding and playing with the horse' to generate a text embedding.
```python
from towhee import pipe, ops, DataCollection

# Map the input text to a DRL text embedding.
p = (
    pipe.input('text')
        .map('text', 'vec', ops.video_text_embedding.drl(base_encoder='clip_vit_b32', modality='text', device='cuda:0'))
        .output('text', 'vec')
)

DataCollection(p('kids feeding and playing with the horse')).show()
```
![](text_emb_result.png)
Load a video from the path './demo_video.mp4' to generate a video embedding.
```python
from towhee import pipe, ops, DataCollection

# Decode the video, uniformly subsample 12 frames, then embed them with DRL.
p = (
    pipe.input('video_path')
        .map('video_path', 'flame_gen', ops.video_decode.ffmpeg(sample_type='uniform_temporal_subsample', args={'num_samples': 12}))
        .map('flame_gen', 'flame_list', lambda x: [y for y in x])  # materialize the frame generator into a list
        .map('flame_list', 'vec', ops.video_text_embedding.drl(base_encoder='clip_vit_b32', modality='video', device='cuda:0'))
        .output('video_path', 'flame_list', 'vec')
)

DataCollection(p('./demo_video.mp4')).show()
```
![](video_emb_result.png)
<br />
Note: this model does not support CPU inference; you must specify a CUDA device, e.g. `device='cuda:0'`.
## Factory Constructor
Create the operator via the following factory method
***drl(base_encoder, modality)***
**Parameters:**
***base_encoder:*** *str*
​ The base CLIP encoder name used in the DRL model. Supported model names:
- clip_vit_b32
***modality:*** *str*
​ Which modality (*video* or *text*) is used to generate the embedding.
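For example, the operator can be constructed directly (a sketch mirroring the pipelines above; `device` is passed exactly as in the earlier examples):
```python
from towhee import ops

# One operator per modality; both share the same base CLIP encoder.
text_encoder = ops.video_text_embedding.drl(base_encoder='clip_vit_b32', modality='text', device='cuda:0')
video_encoder = ops.video_text_embedding.drl(base_encoder='clip_vit_b32', modality='video', device='cuda:0')
```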
<br />
## Interface
A video-text embedding operator takes a list of [towhee VideoFrame](link/to/towhee/image/api/doc) or a string as input and generates an embedding as a numpy ndarray.
**Parameters:**
***data:*** *List[towhee.types.VideoFrame]* or *str*
​ The data used to generate the embedding: a list of VideoFrame (uniformly subsampled from a video) or a text string, depending on the specified modality.
**Returns:** *numpy.ndarray*
​ The embedding extracted by the model. For text, the shape is (text_token_num, model_dim); for video, the shape is (video_token_num, model_dim).
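As a usage sketch, the two embeddings produced by the pipelines above can be compared directly. This assumes the text and video pipelines are bound to `text_p` and `video_p` (renamed here so both can be used together) and reuses the illustrative `wti_similarity` helper from the Description section:
```python
# Run both pipelines; get() retrieves one row of the declared output columns.
_, text_vec = text_p('kids feeding and playing with the horse').get()
_, _, video_vec = video_p('./demo_video.mp4').get()

print(text_vec.shape)   # (text_token_num, model_dim)
print(video_vec.shape)  # (video_token_num, model_dim)

# Score the pair with the uniform-weight sketch from the Description section.
print(wti_similarity(text_vec, video_vec))
```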
## More Resources
- [Vector Database Use Cases: Video Similarity Search - Zilliz](https://zilliz.com/vector-database-use-cases/video-similarity-search): Experience a 10x performance boost and unparalleled precision when your video similarity search system is powered by Zilliz Cloud.
- [ColBERT: A Token-Level Embedding and Ranking Model - Zilliz blog](https://zilliz.com/learn/explore-colbert-token-level-embedding-and-ranking-model-for-similarity-search): Unlike traditional embedding models like BERT, which focus on pooling embeddings into a single vector, ColBERT retains individual token representations. Through its innovative late interaction mechanism, it enables more precise and granular similarity calculations.
- [The guide to mistral-embed | Mistral AI](https://zilliz.com/ai-models/mistral-embed): mistral-embed: a specialized embedding model for text data with a context window of 8,000 tokens. Optimized for similarity retrieval and RAG applications.
- [Supercharged Semantic Similarity Search in Production - Zilliz blog](https://zilliz.com/learn/supercharged-semantic-similarity-search-in-production): Building a Blazing Fast, Highly Scalable Text-to-Image Search with CLIP embeddings and Milvus, the most advanced open-source vector database.
- [The guide to all-MiniLM-L12-v2 | Hugging Face](https://zilliz.com/ai-models/all-MiniLM-L12-v2): all-MiniLM-L12-v2: a text embedding model ideal for semantic search and RAG and fine-tuned based on Microsoft/MiniLM-L12-H384-uncased
- [Build a Multimodal Search System with Milvus - Zilliz blog](https://zilliz.com/blog/how-vector-dbs-are-revolutionizing-unstructured-data-search-ai-applications): Implementing a Multimodal Similarity Search System Using Milvus, Radient, ImageBind, and Meta-Chameleon-7b
- [Sparse and Dense Embeddings: A Guide for Effective Information Retrieval with Milvus | Zilliz Webinar](https://zilliz.com/event/sparse-and-dense-embeddings-webinar): Zilliz webinar covering what sparse and dense embeddings are and when you'd want to use one over the other.
- [Training Text Embeddings with Jina AI - Zilliz blog](https://zilliz.com/blog/training-text-embeddings-with-jina-ai): In a recent talk by Bo Wang, he discussed the creation of Jina text embeddings for modern vector search and RAG systems. He also shared methodologies for training embedding models that effectively encode extensive information.