Video-Text Retrieval Embedding with BridgeFormer

author: Jinling Xu

Description

This operator extracts features for video or text with BridgeFormer, which generates embeddings for both modalities by jointly training a video encoder and a text encoder to maximize the cosine similarity between matching video-text pairs.
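For reference, the cosine similarity mentioned here is the standard one. Writing $v$ for a video embedding and $t$ for a text embedding (symbols introduced only for this illustration), it reads:

```latex
\mathrm{sim}(v, t) = \frac{v \cdot t}{\lVert v \rVert \, \lVert t \rVert}
```

A value close to 1 means the two embeddings point in nearly the same direction.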

Code Example

Load a video from path './demo_video.mp4' to generate a video embedding.

Read the text 'kids feeding and playing with the horse' to generate a text embedding.

Encode video (default):

from towhee import pipe, ops, DataCollection

p = (
    pipe.input('video_path')
        .map('video_path', 'video_frames', ops.video_decode.ffmpeg())
        .map('video_frames', 'vec', ops.video_text_embedding.bridge_former(model_name='frozen_model', modality='video'))
        .output('video_path', 'video_frames', 'vec')
)

DataCollection(p('./demo_video.mp4')).show()

Encode text:

from towhee import pipe, ops, DataCollection

p = (
    pipe.input('text')
        .map('text', 'vec', ops.video_text_embedding.bridge_former(model_name='frozen_model', modality='text'))
        .output('text', 'vec')
)

DataCollection(p('kids feeding and playing with the horse')).show()
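Once both pipelines have produced vectors, retrieval amounts to ranking candidates by cosine similarity. The following is a minimal sketch of that step, not part of the operator itself. It assumes each embedding is a flat 1-D vector (a 1-D numpy.ndarray can be passed in the same way as a list). The function names, file names, and 3-dimensional vectors are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_videos(
    text_vec: Sequence[float],
    video_vecs: List[Tuple[str, Sequence[float]]],
) -> List[Tuple[str, float]]:
    """Score each (name, embedding) pair against the text embedding, best first."""
    scored = [(name, cosine_similarity(text_vec, vec)) for name, vec in video_vecs]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical stand-ins for real operator outputs (values are invented).
text_vec = [0.2, 0.9, 0.1]
video_vecs = [
    ("beach.mp4", [0.9, -0.1, 0.3]),
    ("horse.mp4", [0.25, 0.85, 0.05]),
]

ranking = rank_videos(text_vec, video_vecs)
print(ranking[0][0])  # prints "horse.mp4"
```

In practice the toy vectors would be replaced by the 'vec' outputs of the video and text pipelines above.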

Factory Constructor

Create the operator via the following factory method:

bridge_former(model_name, modality, weight_path)

Parameters:

model_name: str

The name of the BridgeFormer model. Supported model names:

frozen_model
clip_initialized_model

modality: str

The modality (video or text) for which the embedding is generated.

weight_path: str

Path to the pretrained model weights.

Interface

A video-text embedding operator takes a list of Towhee VideoFrame objects or a string as input and generates an embedding as an ndarray.

Parameters:

data: List[towhee.types.Image] or str

The input from which to generate an embedding, depending on the specified modality: a list of Towhee VideoFrame objects (uniformly subsampled from a video) or a text string.

Returns: numpy.ndarray

The embedding extracted by the model.
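The video input is described as frames uniformly subsampled from a video. As a rough illustration of what uniform subsampling means, the sketch below picks evenly spaced frame indices by taking the middle frame of each equal-length segment. The function name and the segment-midpoint rule are assumptions made for this example; the decoding operator's actual sampling strategy and frame count are not stated here.

```python
from typing import List


def uniform_sample_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick num_samples frame indices spread evenly across a video.

    The video is split into num_samples equal segments and the middle
    frame of each segment is chosen (an illustrative rule, not
    necessarily the one the operator uses).
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples >= total_frames:
        # Not enough frames to subsample: keep every frame.
        return list(range(total_frames))
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


print(uniform_sample_indices(100, 4))  # [12, 37, 62, 87]
```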

More Resources

Vector Database Use Cases: Video Similarity Search - Zilliz: Experience a 10x performance boost and unparalleled precision when your video similarity search system is powered by Zilliz Cloud.
Sentence Transformers for Long-Form Text - Zilliz blog: Deep diving into modern transformer-based embeddings for long-form text.
How to Get the Right Vector Embeddings - Zilliz blog: A comprehensive introduction to vector embeddings and how to generate them with popular open-source models.
What is BERT (Bidirectional Encoder Representations from Transformers)? - Zilliz blog: Learn what Bidirectional Encoder Representations from Transformers (BERT) is and how it uses pre-training and fine-tuning to achieve its remarkable performance.
Supercharged Semantic Similarity Search in Production - Zilliz blog: Building a Blazing Fast, Highly Scalable Text-to-Image Search with CLIP embeddings and Milvus, the most advanced open-source vector database.
Tutorial: Diving into Text Embedding Models | Zilliz Webinar: Register for a free webinar diving into text embedding models in a presentation and tutorial
Sparse and Dense Embeddings: A Guide for Effective Information Retrieval with Milvus | Zilliz Webinar: Zilliz webinar covering what sparse and dense embeddings are and when you'd want to use one over the other.
