This operator uses [ClipCap](https://arxiv.org/abs/2111.09734) to generate a caption that describes the content of a given image. ClipCap passes the image's CLIP encoding through a simple mapping network to produce a prefix for the caption, and then fine-tunes a language model to generate the caption from that prefix. The operator is adapted from [rmokady/CLIP_prefix_caption](https://github.com/rmokady/CLIP_prefix_caption).
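To make the prefix idea concrete, here is a minimal, dependency-free sketch of how a single CLIP embedding could be mapped to a short sequence of prefix vectors and prepended to the caption token embeddings. It is an illustration only, not this operator's code: the dimensions are toy values, the weights are random rather than trained, a single linear projection stands in for the mapping network, and the names `make_linear`, `map_to_prefix`, and `build_lm_input` are hypothetical.

```python
import random


def make_linear(in_dim, out_dim, seed=0):
    """Random (out_dim x in_dim) weights standing in for a trained mapping network."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def map_to_prefix(clip_embedding, weights, prefix_length, lm_dim):
    """Project one CLIP embedding to `prefix_length` vectors of size `lm_dim`."""
    # Matrix-vector product: one output value per weight row.
    flat = [sum(w * x for w, x in zip(row, clip_embedding)) for row in weights]
    # Split the flat output into prefix_length chunks of lm_dim values each.
    return [flat[i * lm_dim:(i + 1) * lm_dim] for i in range(prefix_length)]


def build_lm_input(prefix, caption_token_embeddings):
    """Prepend the prefix so the language model conditions on the image."""
    return prefix + caption_token_embeddings


# Toy sizes (assumptions for illustration; real models use far larger dimensions).
CLIP_DIM, LM_DIM, PREFIX_LENGTH = 8, 4, 3

weights = make_linear(CLIP_DIM, PREFIX_LENGTH * LM_DIM)
clip_embedding = [0.5] * CLIP_DIM                    # placeholder for a CLIP image encoding
prefix = map_to_prefix(clip_embedding, weights, PREFIX_LENGTH, LM_DIM)

caption_tokens = [[0.0] * LM_DIM for _ in range(5)]  # placeholder caption token embeddings
lm_input = build_lm_input(prefix, caption_tokens)

print(len(prefix), len(lm_input), len(lm_input[0]))  # prints: 3 8 4
```

During training, the language model would see this combined sequence and learn to predict the caption tokens that follow the prefix. At inference time only the prefix is supplied and the caption is generated token by token.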