This operator extracts features for image or text with [CLIP](https://arxiv.org/abs/2108.02927) which can generate embeddings for text and image by jointly training an image encoder and text encoder to maximize the cosine similarity.
This operator extracts features for image or text with [CLIP](https://arxiv.org/abs/2103.00020) which can generate embeddings for text and image by jointly training an image encoder and text encoder to maximize the cosine similarity.