CLIP

This folder illustrates how to use CLIP to build the text index and to run the cross-modal retrieval baseline.


Catalogue:

  • 1. Build Text Index
  • 1.1. Build Text Index for MSCOCO
  • 1.2. Build Text Index for Flickr30k
  • 2. CLIP Retrieval Baseline
  • 2.1. In Domain CLIP Retrieval
  • 2.2. Cross Domain CLIP Retrieval

1. Build Text Index:

We show how to build the text index, from which captions are retrieved, using CLIP.

1.1. Build Text Index for MSCOCO:

First, we demonstrate how to build the text index for MSCOCO.

1.1.1. Download Our Post-processed Index:

We share our pre-built index for MSCOCO via this [link]. After downloading, unzip the file mscoco_index.zip under the current directory.

The resulting directory looks like:

.
├── ./mscoco_index/                    
    ├── index_matrix.txt # The file that stores the representations of captions from the training set of MSCOCO. Each row is a vector that corresponds to a specific caption from the training set.
    └── text_mapping.json # The file that stores the mappings between the representation and the corresponding caption.
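
For reference, the two index files can be read back as shown below. This is a minimal sketch assuming that index_matrix.txt stores one whitespace-separated vector per line and that text_mapping.json stores a list of caption strings aligned with the rows; the exact layout is defined by the build scripts, so check the actual files.

import json
import numpy as np

# Each row of index_matrix.txt is assumed to hold the CLIP representation of one training caption.
index_matrix = np.loadtxt('./mscoco_index/index_matrix.txt')

# text_mapping.json is assumed to map row positions to the original caption strings.
with open('./mscoco_index/text_mapping.json', 'r') as f:
    text_mapping = json.load(f)

print(index_matrix.shape)   # (number of training captions, CLIP embedding dimension)
print(text_mapping[0])      # the caption corresponding to the first row of the index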

1.1.2. Construct the Index by Yourself:

You can also build the index yourself. First, make sure you have downloaded the MSCOCO data following the instructions [here]. Then run the following command to build the index; a rough Python sketch of the underlying steps is given after the argument list below.

chmod +x ./build_mscoco_index.sh
./build_mscoco_index.sh

The arguments are as follows:

  • --clip_name: The name of the pre-trained CLIP model configuration from Hugging Face.
  • --text_file_path: Where the training text corpus is stored.
  • --save_index_prefix: The directory in which to store the index files.
  • --save_index_name: The file name under which the caption representations are saved.
  • --save_mapping_dict_name: The file name under which the mapping dictionary between representations and captions is saved.
  • --batch_size: The inference batch size.
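
To give an intuition of what the script does, here is a rough Python sketch of the index-building step: encode every training caption with the CLIP text encoder and write one vector per row, together with the row-to-caption mapping. It uses the Hugging Face transformers CLIP classes, and the model name, caption data, paths, and file formats are placeholder assumptions; the actual script may implement these details differently.

import json
import torch
from transformers import CLIPModel, CLIPTokenizer

clip_name = 'openai/clip-vit-base-patch32'   # assumed value for --clip_name
model = CLIPModel.from_pretrained(clip_name).eval()
tokenizer = CLIPTokenizer.from_pretrained(clip_name)

# Training captions loaded from --text_file_path (shown inline here; real code
# would encode them in chunks of --batch_size).
captions = ['a man riding a wave on top of a surfboard', 'a cat sitting on a couch']

with torch.no_grad():
    inputs = tokenizer(captions, padding=True, truncation=True, return_tensors='pt')
    text_features = model.get_text_features(**inputs)                  # (num_captions, dim)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# One vector per row (--save_index_name) plus the row-to-caption mapping
# (--save_mapping_dict_name), both written under --save_index_prefix.
with open('./mscoco_index/index_matrix.txt', 'w') as f:
    for row in text_features.tolist():
        f.write(' '.join(f'{v:.6f}' for v in row) + '\n')
with open('./mscoco_index/text_mapping.json', 'w') as f:
    json.dump(captions, f)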

1.2. Build Text Index for Flickr30k:

Next, we demonstrate how to build the text index for Flickr30k.

1.2.1. Download Our Post-processed Index:

We share our pre-built index for Flickr30k via this [link]. After downloading, unzip the file flickr30k_index.zip under the current directory.

The resulting directory looks like:

.
├── ./flickr30k_index/                    
    ├── index_matrix.txt # The file that stores the representations of captions from the training set of Flickr30k. Each row is a vector that corresponds to a specific caption from the training set.
    └── text_mapping.json # The file that stores the mappings between the representation and the corresponding caption.

1.2.2. Construct the Index by Yourself:

You can also build the index yourself. First, make sure you have downloaded the Flickr30k data following the instructions [here]. Then run the following command to build the index.

chmod +x ./build_flickr30k_index.sh
./build_flickr30k_index.sh

The arguments are as follows:

  • --clip_name: The name of the pre-trained CLIP model configuration from Hugging Face.
  • --text_file_path: Where the training text corpus is stored.
  • --save_index_prefix: The directory in which to store the index files.
  • --save_index_name: The file name under which the caption representations are saved.
  • --save_mapping_dict_name: The file name under which the mapping dictionary between representations and captions is saved.
  • --batch_size: The inference batch size.

2. CLIP Retrieval Baseline:

Here, we show how to run the CLIP retrieval baseline.

2.1. In Domain CLIP Retrieval:

To retrieve captions from the in-domain training set, run the following command (a minimal sketch of the retrieval step itself is given after the argument list below):

chmod +x ./X_clip_retrieval.sh
./X_clip_retrieval.sh

Here, X is in ['mscoco', 'flickr30k'], corresponding to the MSCOCO and Flickr30k benchmarks respectively.

The arguments are as follows:

  • --clip_name: The name of the pre-trained CLIP model configuration from Hugging Face.
  • --test_image_prefix_path: Where the test set images are stored.
  • --test_path: Where the reference test captions file is stored.
  • --index_matrix_path: The path of the representation index file.
  • --mapping_dict_path: The path of the mapping dictionary between representations and captions.
  • --save_path_prefix: The directory in which to save the inference results.
  • --save_name: The file name under which the inference results are saved.
[Note] As we are conducting in-domain CLIP retrieval, the test images and the caption index should come from the same benchmark.
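
To make the retrieval step concrete, here is a minimal sketch for a single test image: encode the image with the CLIP image encoder, score it against every caption vector in the index, and return the highest-scoring caption. It relies on the Hugging Face transformers CLIP classes and on the index layout assumed earlier (L2-normalized rows, a list-style mapping file); the file paths and image name are placeholders, and the actual script may handle these details differently.

import json
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_name = 'openai/clip-vit-base-patch32'        # assumed value for --clip_name
model = CLIPModel.from_pretrained(clip_name).eval()
processor = CLIPProcessor.from_pretrained(clip_name)

# Load the pre-built caption index (--index_matrix_path and --mapping_dict_path).
index_matrix = np.loadtxt('./mscoco_index/index_matrix.txt')
with open('./mscoco_index/text_mapping.json', 'r') as f:
    text_mapping = json.load(f)

# Encode one test image (its path would be built from --test_image_prefix_path).
image = Image.open('./test_images/example.jpg').convert('RGB')
with torch.no_grad():
    inputs = processor(images=image, return_tensors='pt')
    image_features = model.get_image_features(**inputs)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

# Cosine similarity between the image and every caption vector (both L2-normalized);
# the highest-scoring caption is the retrieval result for this image.
scores = index_matrix @ image_features.squeeze(0).numpy()
best_caption = text_mapping[int(scores.argmax())]
print(best_caption)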

2.2. Cross Domain CLIP Retrieval:

To retrieve captions from the cross-domain training set, run the following command:

chmod +x ./source_X_target_Y_clip_retrieval.sh
./source_X_target_Y_clip_retrieval.sh

Here, X is the source domain and Y is the target domain, each from ['mscoco', 'flickr30k'], with Y different from X.

The arguments are as follows:

  • --clip_name: The name of the pre-trained CLIP model configuration from Hugging Face.
  • --test_image_prefix_path: Where the test set images are stored.
  • --test_path: Where the reference test captions file is stored.
  • --index_matrix_path: The path of the representation index file.
  • --mapping_dict_path: The path of the mapping dictionary between representations and captions.
  • --save_path_prefix: The directory in which to save the inference results.
  • --save_name: The file name under which the inference results are saved.

[Note] As we are conducting cross-domain CLIP retrieval, the test images and the caption index should come from different benchmarks.