CLIP
This folder illustrates how to use CLIP to build a text index and to run a cross-modal retrieval baseline.
1. Build Text Index:
We show how to use CLIP to build the text index from which captions are retrieved.
1.1. Build Text Index for MSCOCO:
First, we demonstrate how to build the text index for MSCOCO.
1.1.1. Download Our Post-processed Index:
We share our built index for MSCOCO via this [link]. After downloading, unzip the downloaded file mscoco_index.zip under the current directory.
The resulting directory looks like:
.
└── ./mscoco_index/
    ├── index_matrix.txt # The file that stores the representations of captions from the training set of MSCOCO. Each row is a vector that corresponds to a specific caption from the training set.
    └── text_mapping.json # The file that stores the mappings between the representations and the corresponding captions.
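To make the layout above concrete, here is a minimal sketch of how the two files might be read back. It assumes that index_matrix.txt holds one whitespace-separated vector per line and that text_mapping.json is a JSON object keyed by row number; neither detail is stated in this README, and `load_text_index` is a hypothetical helper name.

```python
import json
import os

import numpy as np


def load_text_index(index_dir):
    """Load caption vectors and the row-to-caption mapping (hypothetical helper)."""
    # Assumption: one whitespace-separated vector per line.
    matrix = np.loadtxt(os.path.join(index_dir, "index_matrix.txt"), ndmin=2)
    with open(os.path.join(index_dir, "text_mapping.json"), encoding="utf-8") as f:
        # Assumption: a JSON object keyed by row number, e.g. {"0": "a caption"}.
        mapping = json.load(f)
    return matrix, mapping
```

If the real files use a different delimiter or key scheme, only the two marked lines would need to change.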
1.1.2. Construct the Index by Yourself:
You can also rebuild the index yourself. First, make sure you have downloaded the MSCOCO data following the instructions [here]. Then run the following commands to build the index.
chmod +x ./build_mscoco_index.sh
./build_mscoco_index.sh
The arguments are as follows:
--clip_name: The configuration of the pre-trained CLIP model from Hugging Face.
--text_file_path: The path to the training text corpus.
--save_index_prefix: The directory in which to store the index files.
--save_index_name: The file name under which the caption representations are saved.
--save_mapping_dict_name: The file name under which the mapping dictionary between representations and captions is saved.
--batch_size: The inference batch size.
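As an illustration of how these flags fit together, the sketch below declares them with `argparse`. It is not the repository's actual parser: the types, the defaults, and the function name `parse_build_args` are assumptions. Only the flag names come from the list above.

```python
import argparse


def parse_build_args(argv=None):
    """Parse the index-building flags (illustrative; defaults are assumptions)."""
    parser = argparse.ArgumentParser(description="Build a CLIP text index.")
    parser.add_argument("--clip_name", type=str, required=True,
                        help="Pre-trained CLIP configuration from Hugging Face.")
    parser.add_argument("--text_file_path", type=str, required=True,
                        help="Path to the training text corpus.")
    parser.add_argument("--save_index_prefix", type=str, required=True,
                        help="Directory in which to store the index files.")
    # Assumed defaults, chosen to match the file names in the shared index.
    parser.add_argument("--save_index_name", type=str, default="index_matrix.txt")
    parser.add_argument("--save_mapping_dict_name", type=str,
                        default="text_mapping.json")
    # Assumed default; the value used by the shell script is not shown here.
    parser.add_argument("--batch_size", type=int, default=128)
    return parser.parse_args(argv)
```

For example, `parse_build_args(["--clip_name", "openai/clip-vit-base-patch32", "--text_file_path", "train.json", "--save_index_prefix", "./mscoco_index/"])` returns a namespace with the remaining fields filled from the assumed defaults. The model name and the corpus path in that call are placeholders, not values taken from the scripts.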
1.2. Build Text Index for Flickr30k:
Next, we demonstrate how to build the text index for Flickr30k.
1.2.1. Download Our Post-processed Index:
We share our built index for Flickr30k via this [link]. After downloading, unzip the downloaded file flickr30k_index.zip under the current directory.
The resulting directory looks like:
.
└── ./flickr30k_index/
    ├── index_matrix.txt # The file that stores the representations of captions from the training set of Flickr30k. Each row is a vector that corresponds to a specific caption from the training set.
    └── text_mapping.json # The file that stores the mappings between the representations and the corresponding captions.
1.2.2. Construct the Index by Yourself:
You can also rebuild the index yourself. First, make sure you have downloaded the Flickr30k data following the instructions [here]. Then run the following commands to build the index.
chmod +x ./build_flickr30k_index.sh
./build_flickr30k_index.sh
The arguments are as follows:
--clip_name: The configuration of the pre-trained CLIP model from Hugging Face.
--text_file_path: The path to the training text corpus.
--save_index_prefix: The directory in which to store the index files.
--save_index_name: The file name under which the caption representations are saved.
--save_mapping_dict_name: The file name under which the mapping dictionary between representations and captions is saved.
--batch_size: The inference batch size.
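The build step itself can be pictured as: encode the captions in batches, stack the vectors into one matrix, and record which row belongs to which caption. The sketch below follows that outline under stated assumptions. It is not the repository's build_text_index.py, the on-disk format (plain-text rows, JSON keyed by row number) is assumed, and `toy_encoder` is a stand-in for a real CLIP text encoder so the example runs without downloading a model.

```python
import json
import os

import numpy as np


def build_text_index(captions, encode_batch, save_index_prefix,
                     save_index_name="index_matrix.txt",
                     save_mapping_dict_name="text_mapping.json",
                     batch_size=2):
    """Encode captions batch by batch and write the matrix and mapping files."""
    if not captions:
        raise ValueError("captions must not be empty")
    os.makedirs(save_index_prefix, exist_ok=True)
    rows, mapping = [], {}
    for start in range(0, len(captions), batch_size):
        batch = captions[start:start + batch_size]
        # encode_batch returns one vector per caption in the batch.
        rows.append(np.asarray(encode_batch(batch), dtype=np.float32))
        for offset, caption in enumerate(batch):
            # Assumption: the mapping is keyed by the row number as a string.
            mapping[str(start + offset)] = caption
    matrix = np.concatenate(rows, axis=0)
    np.savetxt(os.path.join(save_index_prefix, save_index_name), matrix)
    mapping_path = os.path.join(save_index_prefix, save_mapping_dict_name)
    with open(mapping_path, "w", encoding="utf-8") as f:
        json.dump(mapping, f, ensure_ascii=False, indent=2)
    return matrix, mapping


def toy_encoder(batch):
    """Stand-in encoder (NOT CLIP): character count and word count per caption."""
    return [[float(len(c)), float(len(c.split()))] for c in batch]
```

Passing the encoder in as a function keeps the batching and file-writing logic independent of the model, so a CLIP text encoder could be substituted for `toy_encoder` without touching the rest.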
2. CLIP Retrieval Baseline:
Here, we show how to run the CLIP retrieval baseline.
2.1. In Domain CLIP Retrieval:
To retrieve captions from the in-domain training set, run the following commands:
chmod +x ./X_clip_retrieval.sh
./X_clip_retrieval.sh
Here, X is one of ['mscoco', 'flickr30k'], corresponding to the MSCOCO and Flickr30k benchmarks respectively.
The arguments are as follows:
--clip_name: The configuration of the pre-trained CLIP model from Hugging Face.
--test_image_prefix_path: The directory where the test set images are stored.
--test_path: The path to the reference test captions file.
--index_matrix_path: The path to the representation index file.
--mapping_dict_path: The path to the mapping dictionary between representations and captions.
--save_path_prefix: The directory in which to save the inference results.
--save_name: The file name under which the inference results are saved.
[Note] As we are conducting in-domain CLIP retrieval, the test images and the caption index should come from the same benchmark.
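Conceptually, retrieval compares an image representation against every row of the caption index and returns the caption of the best-matching row. The sketch below assumes cosine similarity as the score and a mapping keyed by row number. Both are assumptions rather than details confirmed by this README, and `retrieve_caption` is a hypothetical name. The image vector would come from CLIP's image encoder; here it is just an array.

```python
import numpy as np


def retrieve_caption(image_vector, index_matrix, mapping):
    """Return the caption whose vector is most similar to the image vector."""
    image = np.asarray(image_vector, dtype=np.float64)
    index = np.asarray(index_matrix, dtype=np.float64)
    # Normalise both sides so the dot product equals cosine similarity.
    image = image / np.linalg.norm(image)
    index = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = index @ image  # one similarity score per caption row
    best = int(np.argmax(scores))
    # Assumption: the mapping is keyed by the row number as a string.
    return mapping[str(best)], float(scores[best])
```

With a real index, `index_matrix` and `mapping` would be the contents of the files passed via --index_matrix_path and --mapping_dict_path.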
2.2. Cross Domain CLIP Retrieval:
To retrieve captions from the cross-domain training set, run the following commands:
chmod +x ./source_X_target_Y_clip_retrieval.sh
./source_X_target_Y_clip_retrieval.sh
Here, X is the source domain and Y is the target domain. Each is one of ['mscoco', 'flickr30k'], and X and Y must differ.
The arguments are as follows:
--clip_name: The configuration of the pre-trained CLIP model from Hugging Face.
--test_image_prefix_path: The directory where the test set images are stored.
--test_path: The path to the reference test captions file.
--index_matrix_path: The path to the representation index file.
--mapping_dict_path: The path to the mapping dictionary between representations and captions.
--save_path_prefix: The directory in which to save the inference results.
--save_name: The file name under which the inference results are saved.
[Note] As we are conducting cross-domain CLIP retrieval, the test images and the caption index should come from different benchmarks.
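The script naming scheme for the in-domain and cross-domain settings can be summarised in a small helper. This is only a convenience sketch: `retrieval_script` is a hypothetical function that is not part of the repository, and it merely assembles the file names described above.

```python
DOMAINS = ("mscoco", "flickr30k")


def retrieval_script(source, target):
    """Return the retrieval script name for a source/target benchmark pair."""
    for name in (source, target):
        if name not in DOMAINS:
            raise ValueError(f"unknown benchmark: {name!r}")
    if source == target:
        # In-domain setting: X_clip_retrieval.sh
        return f"./{source}_clip_retrieval.sh"
    # Cross-domain setting: source_X_target_Y_clip_retrieval.sh
    return f"./source_{source}_target_{target}_clip_retrieval.sh"
```

For instance, `retrieval_script("mscoco", "flickr30k")` yields `./source_mscoco_target_flickr30k_clip_retrieval.sh`.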