
Video-Text Retrieval Embedding with CLIP4Clip

author: Chen Zhang


Description

This operator extracts features for video or text with CLIP4Clip, which generates embeddings for text and video by jointly training a video encoder and a text encoder to maximize the cosine similarity between matching video-text pairs.
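
To make this concrete, the snippet below is a minimal NumPy sketch (not part of the operator) of the cosine similarity that this training objective maximizes for matching video-text pairs; text_vec and video_vec stand for the embeddings the operator produces:

import numpy as np

def cosine_similarity(text_vec, video_vec):
    # Both embeddings live in the same space, so a matching
    # text/video pair should score close to 1.0.
    text_vec = np.ravel(text_vec)
    video_vec = np.ravel(video_vec)
    return float(np.dot(text_vec, video_vec) /
                 (np.linalg.norm(text_vec) * np.linalg.norm(video_vec)))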


Code Example

Read the text 'kids feeding and playing with the horse' to generate a text embedding.

from towhee import pipe, ops, DataCollection

p = (
    pipe.input('text')
        # Encode the input text into an embedding with CLIP4Clip.
        .map('text', 'vec', ops.video_text_embedding.clip4clip(model_name='clip_vit_b32', modality='text', device='cuda:1'))
        .output('text', 'vec')
)

DataCollection(p('kids feeding and playing with the horse')).show()

Load a video from the path './demo_video.mp4' to generate a video embedding.

from towhee import pipe, ops, DataCollection

p = (
    pipe.input('video_path')
        # Uniformly sample 12 frames from the video.
        .map('video_path', 'frame_gen', ops.video_decode.ffmpeg(sample_type='uniform_temporal_subsample', args={'num_samples': 12}))
        # Materialize the frame generator into a list.
        .map('frame_gen', 'frame_list', lambda x: [y for y in x])
        # Encode the frame list into a video embedding with CLIP4Clip.
        .map('frame_list', 'vec', ops.video_text_embedding.clip4clip(model_name='clip_vit_b32', modality='video', device='cuda:2'))
        .output('video_path', 'frame_list', 'vec')
)

DataCollection(p('./demo_video.mp4')).show()
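
Putting the two pipelines together gives a rough text-to-video retrieval sketch. It assumes the pipelines above are bound to distinct names (text_pipe and video_pipe) rather than both being p, reuses the hypothetical cosine_similarity helper from the Description section, and relies on a pipeline call's .get() returning the output row as a list:

# Assumed: text_pipe and video_pipe are the two pipelines defined above.
query_vec = text_pipe('kids feeding and playing with the horse').get()[1]

candidates = ['./demo_video.mp4']  # hypothetical candidate videos
scores = []
for path in candidates:
    vec = video_pipe(path).get()[2]  # output row is (video_path, frame_list, vec)
    scores.append((path, cosine_similarity(query_vec, vec)))

# The best match is the candidate with the highest cosine similarity.
scores.sort(key=lambda s: s[1], reverse=True)
print(scores)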


Factory Constructor

Create the operator via the following factory method:

clip4clip(model_name, modality, weight_path)

Parameters:

model_name: str

    The name of the CLIP model to use. Supported model names:

  • clip_vit_b32

modality: str

    Which modality (video or text) is used to generate the embedding.

weight_path: str

    Path to the pretrained model weights.
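
For example, to load a custom checkpoint instead of the default pretrained weights, pass weight_path when constructing the operator (the checkpoint path below is a placeholder):

from towhee import pipe, ops

p = (
    pipe.input('text')
        .map('text', 'vec', ops.video_text_embedding.clip4clip(
            model_name='clip_vit_b32',
            modality='text',
            weight_path='./pytorch_model.bin'))  # placeholder path
        .output('text', 'vec')
)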


Interface

A video-text embedding operator takes a list of Towhee images or a string as input and generates an embedding as a numpy.ndarray.

Parameters:

data: List[towhee.types.Image] or str

    The data used to generate the embedding: a list of images uniformly subsampled from a video, or a text string, depending on the specified modality.

Returns: numpy.ndarray

    The embedding extracted by the model.
