
video-text-embedding

Video-Text Retrieval Embedding with CLIP4Clip

author: Chen Zhang


Description

This operator extracts features from video or text with CLIP4Clip. The model generates embeddings for both modalities by jointly training a video encoder and a text encoder to maximize the cosine similarity between the embeddings of matching video-text pairs.
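Because training targets cosine similarity, the embeddings can be compared with that same measure at retrieval time. The following is a minimal sketch, not part of the operator itself: `cosine_similarity` and `rank_videos` are hypothetical helper names, and the 3-dimensional vectors are made-up stand-ins for real embeddings.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_videos(text_vec: np.ndarray, video_vecs: list) -> list:
    """Return video indices sorted from most to least similar to the text."""
    scores = [cosine_similarity(text_vec, v) for v in video_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy vectors standing in for real embeddings (made-up values).
text_vec = np.array([1.0, 0.0, 0.0])
video_vecs = [
    np.array([0.0, 1.0, 0.0]),   # orthogonal to the text vector
    np.array([2.0, 0.1, 0.0]),   # nearly parallel to the text vector
    np.array([-1.0, 0.0, 0.0]),  # opposite direction
]

ranking = rank_videos(text_vec, video_vecs)
print(ranking)
```

With real data, `text_vec` would be the output of the operator in text mode and each entry of `video_vecs` its output in video mode.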


Code Example

Read the text 'kids feeding and playing with the horse' to generate a text embedding.

from towhee.dc2 import pipe, ops, DataCollection

# Build a pipeline that maps an input string to its CLIP4Clip text embedding.
p = (
    pipe.input('text')
        .map('text', 'vec', ops.video_text_embedding.clip4clip(model_name='clip_vit_b32', modality='text', device='cuda:1'))
        .output('text', 'vec')
)

# Run the pipeline on one sentence and display the result.
DataCollection(p('kids feeding and playing with the horse')).show()

Load a video from the path './demo_video.mp4' to generate a video embedding.

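from towhee.dc2 import pipe, ops, DataCollection

# Build a pipeline that decodes a video, samples 12 frames, and embeds them.
p = (
    pipe.input('video_path')
        # Decode the video and uniformly subsample 12 frames over its duration.
        .map('video_path', 'flame_gen', ops.video_decode.ffmpeg(sample_type='uniform_temporal_subsample', args={'num_samples': 12}))
        # Materialize the frame generator into a list, as the operator expects.
        .map('flame_gen', 'flame_list', lambda x: [y for y in x])
        .map('flame_list', 'vec', ops.video_text_embedding.clip4clip(model_name='clip_vit_b32', modality='video', device='cuda:2'))
        .output('video_path', 'flame_list', 'vec')
)

# Run the pipeline on the demo video and display the result.
DataCollection(p('./demo_video.mp4')).show()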
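The video example relies on uniform temporal subsampling to reduce a clip to 12 frames. As a rough illustration of the idea only, the sketch below picks evenly spaced frame indices with NumPy. `uniform_sample_indices` is a hypothetical helper, and the decode operator's actual rounding and edge-case handling may differ.

```python
import numpy as np

def uniform_sample_indices(num_frames: int, num_samples: int) -> list:
    """Pick num_samples frame indices spread evenly over [0, num_frames - 1]."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    # Evenly spaced positions including the first and last frame,
    # truncated to integer frame indices.
    positions = np.linspace(0, num_frames - 1, num_samples)
    return [int(p) for p in positions]

# A hypothetical 120-frame clip reduced to 12 frames, as in the example above.
indices = uniform_sample_indices(120, 12)
print(indices)
```

If the clip has fewer frames than requested samples, this sketch repeats indices rather than failing.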


Factory Constructor

Create the operator via the following factory method:

clip4clip(model_name, modality, weight_path)

Parameters:

model_name: str

The model name of CLIP. Supported model names:

  • clip_vit_b32

modality: str

The modality (video or text) used to generate the embedding.

weight_path: str

Path to the pretrained model weights.


Interface

A video-text embedding operator takes either a list of Towhee images or a string as input and generates an embedding as an ndarray.

Parameters:

data: List[towhee.types.Image] or str

The input data, depending on the specified modality: a list of images uniformly subsampled from a video, or a text string.

Returns: numpy.ndarray

The embedding extracted by the model.
