This operator extracts embeddings for images or text with [ALBEF](https://arxiv.org/abs/2107.07651), which jointly trains an image encoder and a text encoder to maximize the cosine similarity of matched image-text pairs. The ALBEF paper introduces a contrastive loss to ALign the image and text representations BEfore Fusing them through cross-modal attention, which enables more grounded vision-and-language representation learning. This operator is adapted from [salesforce/ALBEF](https://github.com/salesforce/ALBEF).
<br/>
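The image-text contrastive objective described above can be sketched in plain NumPy. This is an illustrative, framework-free approximation of an InfoNCE-style loss over a batch of matched image and text embeddings, not the actual ALBEF implementation (which also uses momentum distillation and a fusion encoder); the function names and the temperature value are assumptions for the sketch:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize embeddings so dot products equal cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """InfoNCE-style image-text contrastive loss over a batch.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature           # pairwise cosine similarities
    labels = np.arange(len(logits))              # matched pairs lie on the diagonal

    def xent(lg):
        # Cross-entropy of each row against its diagonal (matched) entry.
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Symmetric loss: image->text and text->image directions.
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
loss_mismatched = contrastive_loss(img, rng.normal(size=(4, 8)))
loss_matched = contrastive_loss(img, img)  # identical embeddings: loss near zero
```

Lower loss for aligned pairs is the signal that drives the encoders: when image and text embeddings of matched pairs point in the same direction, the diagonal logits dominate and the loss approaches zero.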