Video-Text Retrieval Embedding with CLIP4Clip

author: Chen Zhang


Description

This operator extracts features from video or text with CLIP4Clip. The model jointly trains a video encoder and a text encoder to maximize the cosine similarity between matching video-text pairs, so it can generate comparable embeddings for both text and video.
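Because the encoders are trained on cosine similarity, that is the natural score for comparing a text embedding with video embeddings at retrieval time. The following is a minimal sketch of that comparison, not part of the operator: `cosine_similarity` and `rank_videos` are hypothetical helper names, and the short vectors are placeholders standing in for real embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_videos(
    text_vec: Sequence[float], video_vecs: Sequence[Sequence[float]]
) -> Tuple[List[int], List[float]]:
    """Return video indices ordered from most to least similar to the text."""
    scores = [cosine_similarity(text_vec, v) for v in video_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Placeholder vectors (assumption): real embeddings come from the operator.
text_vec = [1.0, 0.0, 1.0]
video_vecs = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
order, scores = rank_videos(text_vec, video_vecs)
print(order)
```

The helpers accept any sequence of floats, so a one-dimensional `numpy.ndarray` works as well as a plain list.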


Code Example

Read the text 'kids feeding and playing with the horse' to generate a text embedding.

from towhee import pipe, ops, DataCollection

# Embed the input text with the CLIP4Clip text encoder.
# device='cuda:1' selects the second GPU; adjust it to your hardware.
p = (
    pipe.input('text')
        .map('text', 'vec', ops.video_text_embedding.clip4clip(model_name='clip_vit_b32', modality='text', device='cuda:1'))
        .output('text', 'vec')
)

DataCollection(p('kids feeding and playing with the horse')).show()

Load a video from the path './demo_video.mp4' to generate a video embedding.

from towhee import pipe, ops, DataCollection

p = (
    pipe.input('video_path')
        # Decode the video and uniformly subsample 12 frames.
        .map('video_path', 'flame_gen', ops.video_decode.ffmpeg(sample_type='uniform_temporal_subsample', args={'num_samples': 12}))
        # Materialize the decoded frames as a list.
        .map('flame_gen', 'flame_list', lambda x: [y for y in x])
        # Embed the frame list with the video encoder (device='cuda:2' is the third GPU).
        .map('flame_list', 'vec', ops.video_text_embedding.clip4clip(model_name='clip_vit_b32', modality='video', device='cuda:2'))
        .output('video_path', 'flame_list', 'vec')
)

DataCollection(p('./demo_video.mp4')).show()
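The video example relies on uniform temporal subsampling to pick 12 frames. As a rough illustration of the idea only, and not the `video_decode.ffmpeg` operator's actual implementation, the sketch below spreads sample indices evenly from the first frame to the last. The function name `uniform_sample_indices` and the choice to round down are assumptions made for this example.

```python
from typing import List


def uniform_sample_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick num_samples frame indices spread evenly over [0, num_frames - 1]."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    # Integer arithmetic avoids floating-point drift; positions are rounded down.
    return [i * (num_frames - 1) // (num_samples - 1) for i in range(num_samples)]


# A hypothetical 120-frame clip sampled down to 12 frames.
indices = uniform_sample_indices(120, 12)
print(indices)
```

When the clip has fewer frames than `num_samples`, this sketch repeats some indices rather than failing.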


Factory Constructor

Create the operator via the following factory method:

clip4clip(model_name, modality, weight_path)

Parameters:

model_name: str

  The model name of CLIP. Supported model names:

  • clip_vit_b32

modality: str

  The modality (video or text) used to generate the embedding.

weight_path: str

  Path to the pretrained model weights.


Interface

A video-text embedding operator takes a list of Towhee images or a string as input and generates an embedding as an ndarray.

Parameters:

data: List[towhee.types.Image] or str

  The data to embed, depending on the specified modality: a list of images uniformly subsampled from a video, or a text string.

Returns: numpy.ndarray

  The embedding extracted by the model.
