CLIP

This folder illustrates how to use CLIP to build the text index and to conduct the cross-modal retrieval baseline.


Catalogue:

  • 1. Build Text Index
    • 1.1. Build Text Index for MSCOCO
    • 1.2. Build Text Index for Flickr30k
  • 2. CLIP Retrieval Baseline
    • 2.1. In-Domain CLIP Retrieval
    • 2.2. Cross-Domain CLIP Retrieval

1. Build Text Index:

We show how to use CLIP to build the text index from which captions are retrieved.

1.1. Build Text Index for MSCOCO:

First, we demonstrate how to build the text index for MSCOCO.

1.1.1. Download Our Post-processed Index:

We share our pre-built index for MSCOCO via this [link]. After downloading, unzip the downloaded file mscoco_index.zip into the current directory.

The resulting directory looks like:

.
├── ./mscoco_index/                    
    ├── index_matrix.txt # The file that stores the representations of captions from the training set of MSCOCO. Each row is a vector that corresponds to a specific caption from the training set.
    └── text_mapping.json # The file that stores the mappings between the representation and the corresponding caption.
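
To sanity-check the downloaded index, you can load the two files as follows. This is a minimal sketch, assuming index_matrix.txt stores one whitespace-separated caption vector per row and text_mapping.json stores the captions in the same row order (both assumptions follow the file descriptions above):

import json
import numpy as np

# Assumption: one whitespace-separated caption vector per row.
index_matrix = np.loadtxt('./mscoco_index/index_matrix.txt')

# Assumption: captions are stored as a list in the same order as the matrix rows.
with open('./mscoco_index/text_mapping.json', 'r') as f:
    text_mapping = json.load(f)

print(index_matrix.shape)  # (number of training captions, embedding dimension)
print(text_mapping[0])     # the caption that corresponds to the first row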

1.1.2. Construct the Index by Yourself:

You can also rebuild the index yourself. First, make sure you have downloaded the MSCOCO data following the instructions [here]. Then, run the following command to build the index.

chmod +x ./build_mscoco_index.sh
./build_mscoco_index.sh

The arguments are as follows:

  • --clip_name: The huggingface configuration name of the pre-trained CLIP model.
  • --text_file_path: The path where the training text corpus is stored.
  • --save_index_prefix: The directory in which your index files will be stored.
  • --save_index_name: The file name under which the caption representations are saved.
  • --save_mapping_dict_name: The file name under which the mapping dictionary between representations and captions is saved.
  • --batch_size: The inference batch size.
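
Conceptually, the script encodes every training caption with the CLIP text encoder and saves the stacked vectors together with the caption mapping. The snippet below is a rough sketch of that procedure, not the repository's script itself; the corpus file name and the openai/clip-vit-base-patch32 checkpoint are placeholder assumptions.

import json
import os
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

clip_name = 'openai/clip-vit-base-patch32'  # assumed checkpoint; set via --clip_name
model = CLIPModel.from_pretrained(clip_name).eval()
processor = CLIPProcessor.from_pretrained(clip_name)

# Hypothetical corpus file with one training caption per line (--text_file_path).
with open('train_captions.txt', 'r') as f:
    captions = [line.strip() for line in f if line.strip()]

vectors = []
batch_size = 128  # --batch_size
with torch.no_grad():
    for i in range(0, len(captions), batch_size):
        batch = captions[i:i + batch_size]
        inputs = processor(text=batch, return_tensors='pt', padding=True, truncation=True)
        feats = model.get_text_features(**inputs)
        feats = feats / feats.norm(dim=-1, keepdim=True)  # normalise for cosine similarity
        vectors.append(feats.cpu().numpy())

# Save under --save_index_prefix with --save_index_name / --save_mapping_dict_name.
os.makedirs('./mscoco_index', exist_ok=True)
np.savetxt('./mscoco_index/index_matrix.txt', np.vstack(vectors))
with open('./mscoco_index/text_mapping.json', 'w') as f:
    json.dump(captions, f)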

1.2. Build Text Index for Flickr30k:

Next, we demonstrate how to build the text index for Flickr30k.

1.2.1. Download Our Post-processed Index:

We share our pre-built index for Flickr30k via this [link]. After downloading, unzip the downloaded file flickr30k_index.zip into the current directory.

The resulting directory looks like:

.
├── ./flickr30k_index/                    
    ├── index_matrix.txt # The file that stores the representations of captions from the training set of Flickr30k. Each row is a vector that corresponds to a specific caption from the training set.
    └── text_mapping.json # The file that stores the mappings between the representation and the corresponding caption.
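
The file formats are identical to the MSCOCO index above, so the loading sketch from Section 1.1.1 applies here as well once the paths are replaced with ./flickr30k_index/.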

1.2.2. Construct the Index by Yourself:

You can also rebuild the index yourself. First, make sure you have downloaded the Flickr30k data following the instructions [here]. Then, run the following command to build the index.

chmod +x ./build_flickr30k_index.sh
./build_flickr30k_index.sh

The arguments are as follows:

  • --clip_name: The huggingface configuration name of the pre-trained CLIP model.
  • --text_file_path: The path where the training text corpus is stored.
  • --save_index_prefix: The directory in which your index files will be stored.
  • --save_index_name: The file name under which the caption representations are saved.
  • --save_mapping_dict_name: The file name under which the mapping dictionary between representations and captions is saved.
  • --batch_size: The inference batch size.

2. CLIP Retrieval Baseline:

Here, we show how to conduct the CLIP retrieval baseline.

2.1. In-Domain CLIP Retrieval:

To retrieve captions from the in-domain training set, run the following command:

chmod +x ./X_clip_retrieval.sh
./X_clip_retrieval.sh

Here, X is in ['mscoco', 'flickr30k'], corresponding to the MSCOCO and Flickr30k benchmarks respectively.

The arguments are as follows:

  • --clip_name: The configuration of the pre-trained CLIP model from huggingface.
  • --test_image_prefix_path: The directory where the test set images are stored.
  • --test_path: The path where the reference test captions file is stored.
  • --index_matrix_path: The path of the representation index file.
  • --mapping_dict_path: The path of the mapping dictionary between representations and captions.
  • --save_path_prefix: The directory in which to save the inference result.
  • --save_name: The file name under which the inference result is saved.

[Note] As we are conducting in-domain CLIP retrieval, the test images and the caption index should come from the same benchmark.
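
The retrieval itself amounts to encoding each test image with the CLIP image encoder and returning the caption whose index vector scores highest. The following is a minimal single-image sketch under the same format assumptions as in Section 1; the image path is hypothetical, and the repository's script additionally batches over the whole test set and writes the results under --save_path_prefix.

import json
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_name = 'openai/clip-vit-base-patch32'  # assumed checkpoint; set via --clip_name
model = CLIPModel.from_pretrained(clip_name).eval()
processor = CLIPProcessor.from_pretrained(clip_name)

# Load the caption index built in Section 1 (--index_matrix_path / --mapping_dict_path).
index_matrix = np.loadtxt('./mscoco_index/index_matrix.txt')
with open('./mscoco_index/text_mapping.json', 'r') as f:
    text_mapping = json.load(f)

# Hypothetical test image taken from --test_image_prefix_path.
image = Image.open('./test_images/example.jpg').convert('RGB')
with torch.no_grad():
    inputs = processor(images=image, return_tensors='pt')
    image_feat = model.get_image_features(**inputs)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)

# With unit-normalised vectors, cosine similarity reduces to a dot product.
scores = index_matrix @ image_feat.squeeze(0).cpu().numpy()
print(text_mapping[int(scores.argmax())])  # the retrieved caption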

2.2. Cross-Domain CLIP Retrieval:

To retrieve captions from the cross-domain training set, run the following command:

chmod +x ./source_X_target_Y_clip_retrieval.sh
./source_X_target_Y_clip_retrieval.sh

Here, X is the source domain from ['mscoco', 'flickr30k'] and Y is the target domain from ['flickr30k', 'mscoco'].

The arguments are as follows:

  • --clip_name: The configuration of the pre-trained CLIP model from huggingface.
  • --test_image_prefix_path: The directory where the test set images are stored.
  • --test_path: The path where the reference test captions file is stored.
  • --index_matrix_path: The path of the representation index file.
  • --mapping_dict_path: The path of the mapping dictionary between representations and captions.
  • --save_path_prefix: The directory in which to save the inference result.
  • --save_name: The file name under which the inference result is saved.

[Note] As we are conducting cross-domain CLIP retrieval, the test images and the caption index should come from different benchmarks.
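
The mechanics are the same as in the in-domain sketch above; only --index_matrix_path, --mapping_dict_path, and --test_image_prefix_path point at files from different benchmarks.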
