## Unsupervised Domain Adaptation of Language Model
****
### Catalogue:
* 1. MSCOCO Benchmark
* 1.1. MSCOCO Data Preparation
* 1.2. Unsupervised Domain Adaptation on MSCOCO
* 2. Flickr30k Benchmark
* 2.1. Flickr30k Data Preparation
* 2.2. Unsupervised Domain Adaptation on Flickr30k
* 3. Unsupervised Baselines
* 3.1. Contrastive Search
* 3.2. Top-k Sampling
* 3.3. Nucleus Sampling
****
#### 1. MSCOCO Benchmark:
We first describe how to perform unsupervised domain adaptation of the language model on the text corpus of the MSCOCO benchmark.
##### 1.1. MSCOCO Data Preparation:
To prepare the MSCOCO benchmark, please follow the instructions [[here]](https://github.com/yxuansu/MAGIC/tree/main/image_captioning/data#1-mscoco-benchmark).
##### 1.2. Unsupervised Domain Adaptation on MSCOCO:
After preparing the MSCOCO data, run the following command to train the language model.
```sh
chmod +x ./train_mscoco.sh
./train_mscoco.sh
```
The arguments are as follows:
* `--model_name`: The name of the huggingface pre-trained GPT model (e.g. gpt2, gpt2-large).
* `--train_path`: The file path of training set.
* `--dev_path`: The file path of validation set.
* `--test_path`: The file path of test set.
* `--add_eos_token_to_data`: Whether to add an eos token at the end of each text sequence.
* `--margin`: The contrastive margin $\rho$.
* `--max_len`: The maximum length of training samples.
* `--number_of_gpu`: The number of available GPUs.
* `--batch_size_per_gpu`: The batch size for each GPU.
* `--gradient_accumulation_steps`: The number of forward computations between two gradient updates.
* `--effective_batch_size`: The overall batch size. It equals `batch_size_per_gpu` x `gradient_accumulation_steps` x `number_of_gpu` (e.g. 32 x 4 x 2 = 256).
* `--total_steps`: The number of total gradient update steps.
* `--print_every`: How many steps between printing intermediate results.
* `--save_every`: How many steps between saving two checkpoints.
* `--learning_rate`: The learning rate.
* `--save_path_prefix`: Where to save the checkpoints.
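
Once training completes, checkpoints are written under the directory specified by `--save_path_prefix` and can be loaded for inference with the same `SimCTG` class used in Section 3 below. The following is a minimal sketch, assuming the checkpoint is stored in the standard huggingface format accepted by `from_pretrained`; the checkpoint directory name is purely illustrative.
```python
from simctg import SimCTG

sos_token, pad_token = r'<-start_of_text->', r'<-pad->'
# hypothetical checkpoint directory; substitute the directory produced
# under your own --save_path_prefix
checkpoint_path = r'./mscoco_checkpoints/best_checkpoint'
generation_model = SimCTG(checkpoint_path, sos_token, pad_token)
generation_model.eval()
```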
****
#### 2. Flickr30k Benchmark:
We then describe how to perform unsupervised domain adaptation of the language model on the text corpus of the Flickr30k benchmark.
##### 2.1. Flickr30k Data Preparation:
To prepare the Flickr30k benchmark, please follow the instructions [[here]](https://github.com/yxuansu/MAGIC/tree/main/image_captioning/data#2-flickr30k-benchmark).
##### 2.2. Unsupervised Domain Adaptation on Flickr30k:
After preparing the Flickr30k data, run the following command to train the language model. The arguments are the same as those described in Section 1.2 above.
```sh
chmod +x ./train_flickr30k.sh
./train_flickr30k.sh
```
****
#### 3. Unsupervised Baselines:
Here, we illustrate how to use the language model to run the unsupervised baselines described in our paper. Note that all these methods are **unsupervised**: the language model is a text-only model and does not take images as input.
```python
# first, load the language model
import torch
from simctg import SimCTG
sos_token, pad_token = r'<-start_of_text->', r'<-pad->'
# we use the language model adapted on MSCOCO as an example.
language_model_name = r'cambridgeltl/magic_mscoco'
generation_model = SimCTG(language_model_name, sos_token, pad_token)
generation_model.eval()
# then, prepare the input ids. Note that the text is always generated from the same start-of-sentence token.
tokens = generation_model.tokenizer.tokenize(sos_token)
input_ids = generation_model.tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.LongTensor(input_ids).view(1,-1)
```
##### 3.1. Contrastive Search:
```python
'''
use contrastive search to generate the result.
note that contrastive search is a deterministic decoding method, thus the generated text is always the same.
'''
beam_width, alpha, decoding_len = 45, 0.1, 16
output_text = generation_model.fast_contrastive_search(input_ids, beam_width, alpha, decoding_len)
print(output_text)
'''
A man is riding a skateboard down a street.
'''
```
The arguments are as follows:
* `--input_ids`: The input token ids; here, the id of the start-of-sentence token.
* `--beam_width`: The k in contrastive search, i.e. the number of candidate tokens considered at each step.
* `--alpha`: The α in contrastive search that balances model confidence against the degeneration penalty.
* `--decoding_len`: The number of tokens to generate.
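
For reference, at each step contrastive search selects the next token as below (following the formulation in the SimCTG paper), where $V^{(k)}$ is the set of the $k$ most probable candidates predicted by the model and $s(h_v, h_{x_j})$ is the cosine similarity between the hidden representation of candidate $v$ and that of a previously generated token $x_j$:

$$x_t = \underset{v \in V^{(k)}}{\arg\max} \Big\{ (1-\alpha) \cdot p_\theta(v \mid x_{<t}) - \alpha \cdot \max_{1 \le j \le t-1} s(h_v, h_{x_j}) \Big\}$$

Setting $\alpha = 0$ recovers greedy search, while a larger $\alpha$ penalizes candidates whose representations are too similar to the existing context.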
##### 3.2. Top-k Sampling:
```python
'''
use top-k sampling to generate the result.
note that this method is stochastic, thus the generated text varies across runs.
'''
top_k, decoding_len = 40, 16
output_text = generation_model.top_k_sampling(input_ids, top_k, decoding_len)
print(output_text)
'''
some very different types of vases with flowers together
'''
```
The arguments are as follows:
* `--input_ids`: The input token ids; here, the id of the start-of-sentence token.
* `--top_k`: The k in top-k sampling.
* `--decoding_len`: The number of tokens to generate.
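
To make the mechanism concrete, below is a minimal, self-contained sketch of the core top-k filtering step (`top_k_filter_and_sample` is a hypothetical helper for illustration, not part of `simctg`): at each step, all but the k highest-probability tokens are discarded and the next token is sampled from the renormalized remainder.
```python
import torch

def top_k_filter_and_sample(logits, k):
    # keep only the k highest-scoring logits and renormalize them
    top_values, top_indices = torch.topk(logits, k)
    probs = torch.softmax(top_values, dim=-1)
    # sample one position from the renormalized distribution,
    # then map it back to the original vocabulary id
    sampled_position = torch.multinomial(probs, num_samples=1)
    return top_indices[sampled_position]

vocab_size = 50257  # GPT-2 vocabulary size, for illustration
logits = torch.randn(vocab_size)  # fake next-token logits
print(top_k_filter_and_sample(logits, k=40))
```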
##### 3.3. Nucleus Sampling:
```python
'''
use nucleus sampling to generate the result.
note that this method is stochastic, thus the generated text varies across runs.
'''
nucleus_p, decoding_len = 0.95, 16
output_text = generation_model.nucleus_sampling(input_ids, nucleus_p, decoding_len)
print(output_text)
'''
Two young girls enjoying a hot dog hot dog bun.
'''
```
The arguments are as follows:
* `--input_ids`: The input token ids; here, the id of the start-of-sentence token.
* `--nucleus_p`: The probability threshold p in nucleus sampling.
* `--decoding_len`: The number of tokens to generate.
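
Analogously, here is a minimal sketch of the nucleus (top-p) filtering step (again, `nucleus_filter_and_sample` is a hypothetical helper for illustration): tokens are sorted by probability, the smallest prefix whose cumulative mass reaches p is kept, and the next token is sampled from that renormalized set.
```python
import torch

def nucleus_filter_and_sample(logits, p):
    # sort tokens from most to least probable
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cumulative = torch.cumsum(probs, dim=-1)
    # keep the smallest prefix whose cumulative mass reaches p
    # (always at least one token)
    cutoff = int((cumulative < p).sum().item()) + 1
    kept_probs = probs[:cutoff] / probs[:cutoff].sum()
    sampled_position = torch.multinomial(kept_probs, num_samples=1)
    return sorted_indices[sampled_position]

vocab_size = 50257  # GPT-2 vocabulary size, for illustration
logits = torch.randn(vocab_size)  # fake next-token logits
print(nucleus_filter_and_sample(logits, p=0.95))
```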