Video-Text Retrieval Embedding with BridgeFormer

author: Jinling Xu

Description

This operator extracts features from video or text with BridgeFormer, which generates embeddings for both modalities by jointly training a video encoder and a text encoder to maximize the cosine similarity between matching video-text pairs.
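To make the similarity measure concrete, here is a minimal sketch of cosine similarity between two embeddings. The vectors are hypothetical toy values, and `cosine_similarity` is an illustrative helper, not part of Towhee or BridgeFormer; real embeddings come from the operator.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (assumption: real ones have many more dimensions).
video_vec = [1.0, 2.0, 3.0]
matching_text_vec = [2.0, 4.0, 6.0]    # points in the same direction as video_vec
unrelated_text_vec = [3.0, 0.0, -1.0]  # orthogonal to video_vec

print(cosine_similarity(video_vec, matching_text_vec))
print(cosine_similarity(video_vec, unrelated_text_vec))
```

A score near 1 means the two vectors point in nearly the same direction, while a score near 0 means they are unrelated; the magnitude of the vectors does not affect the result.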

Code Example

Load a video from path './demo_video.mp4' to generate a video embedding.

Read the text 'kids feeding and playing with the horse' to generate a text embedding.

Write the pipeline in simplified style:

Encode video (default):

import towhee
towhee.dc(['./demo_video.mp4']) \
      .video_decode.ffmpeg() \
      .video_text_embedding.bridge_former(model_name='frozen_model', modality='video') \
      .show()

Encode text:

import towhee

towhee.dc(['kids feeding and playing with the horse']) \
      .video_text_embedding.bridge_former(model_name='frozen_model', modality='text') \
      .show()

Write the same pipeline with explicit input/output name specifications:

import towhee

towhee.dc['path'](['./demo_video.mp4']) \
      .video_decode.ffmpeg['path', 'frames']() \
      .video_text_embedding.bridge_former['frames', 'vec'](model_name='frozen_model', modality='video') \
      .select['path', 'vec']() \
      .show(formatter={'path': 'video_path'})

towhee.dc['text'](["kids feeding and playing with the horse"]) \
      .video_text_embedding.bridge_former['text','vec'](model_name='frozen_model', modality='text') \
      .select['text', 'vec']() \
      .show()
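Once video and text embeddings are available, retrieval can be sketched as ranking candidate captions by their similarity to a video embedding. The snippet below is only an illustration under stated assumptions: the 3-dimensional vectors stand in for real 'vec' outputs, the extra captions are made up, and `l2_normalize` and `rank_captions` are hypothetical helpers that are not part of Towhee.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def rank_captions(video_vec, caption_vecs):
    """Return (caption, score) pairs sorted from best to worst match."""
    query = l2_normalize(video_vec)
    scored = []
    for caption, vec in caption_vecs.items():
        unit = l2_normalize(vec)
        score = sum(q * u for q, u in zip(query, unit))
        scored.append((caption, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical stand-ins for embeddings produced by the operator.
video_vec = [0.9, 0.1, 0.0]
caption_vecs = {
    "kids feeding and playing with the horse": [1.0, 0.0, 0.0],
    "a man cooking pasta": [0.0, 1.0, 0.0],
    "a car driving at night": [0.0, 0.0, 1.0],
}

ranking = rank_captions(video_vec, caption_vecs)
best_caption, best_score = ranking[0]
print(best_caption)
```

Normalizing each vector once and then taking dot products gives the same ordering as computing cosine similarity pair by pair, which is convenient when many captions are compared against one video.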

Factory Constructor

Create the operator via the following factory method:

bridge_former(model_name, modality, weight_path)

Parameters:

model_name: str

The name of the BridgeFormer model to load. Supported model names:

frozen_model
clip_initialized_model

modality: str

The modality (video or text) used to generate the embedding.

weight_path: str

Path to the pretrained model weights.

Interface

A video-text embedding operator takes a list of Towhee VideoFrame objects or a string as input and generates an embedding as an ndarray.

Parameters:

data: List[towhee.types.Image] or str

The input data: either a list of Towhee VideoFrame objects (uniformly subsampled from a video) or a text string, depending on the specified modality.

Returns: numpy.ndarray

The embedding extracted by the model.
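The uniform subsampling of frames mentioned above can be sketched as follows. How the decoder actually picks frames is not stated here, so taking the middle frame of each equal-length segment is an assumption, and `uniform_sample_indices` is a hypothetical helper rather than a Towhee function.

```python
def uniform_sample_indices(total_frames, num_samples):
    """Pick num_samples frame indices spread evenly across a video.

    The video is split into num_samples equal segments and the middle
    frame of each segment is chosen (an assumed strategy).
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples >= total_frames:
        # Fewer frames than requested samples: keep every frame.
        return list(range(total_frames))
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]

print(uniform_sample_indices(100, 4))  # [12, 37, 62, 87]
```

Sampling from segment midpoints avoids always picking the very first or very last frame, which are often fades or title cards.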
