# Image Captioning with CaMEL
*author: David Wang*
<br />
## Description
This operator generates a caption that describes the content of the given image using [CaMEL](https://arxiv.org/abs/2202.10492). CaMEL is a novel Transformer-based architecture for image captioning that leverages the interaction of two interconnected language models which learn from each other during training. The interplay between the two language models follows a mean teacher learning paradigm with knowledge distillation. This operator is adapted from [aimagelab/camel](https://github.com/aimagelab/camel).
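The mean-teacher interplay can be sketched as follows. This snippet is illustrative only (it is not the CaMEL training code): an online ("student") language model is trained by gradient descent, while a target ("teacher") language model is updated as an exponential moving average (EMA) of the student's weights and provides the knowledge-distillation targets.
```python
import copy
import torch

# Illustrative mean-teacher update (not the actual CaMEL implementation):
# the target (teacher) model tracks an exponential moving average of the
# online (student) model's parameters and supplies distillation targets.
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, momentum: float = 0.999):
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

student = torch.nn.Linear(512, 512)   # stand-in for the online language model
teacher = copy.deepcopy(student)      # the target model starts as a copy
# ... after each optimizer step on the student:
ema_update(teacher, student)
```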
<br />
## Code Example
Load an image from './image.jpg' and generate its caption.
*Write the pipeline in the simplified style:*
```python
import towhee

# read the image path, decode the image, and generate a caption with CaMEL
towhee.glob('./image.jpg') \
    .image_decode() \
    .image_captioning.camel(model_name='camel_mesh') \
    .show()
```
<img src="./cap.png" alt="result1" style="height:20px;"/>
*Write the same pipeline with explicit input/output name specifications:*
```python
import towhee

# the bracketed names declare each step's input/output columns;
# `select` picks the columns shown in the result table
towhee.glob['path']('./image.jpg') \
    .image_decode['path', 'img']() \
    .image_captioning.camel['img', 'text'](model_name='camel_mesh') \
    .select['img', 'text']() \
    .show()
```
<img src="./tabular.png" alt="result2" style="height:60px;"/>
<br />
## Factory Constructor
Create the operator via the following factory method:
***camel(model_name)***
**Parameters:**
***model_name:*** *str*
The model name of CaMEL. Supported model names:
- camel_mesh
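As a quick sketch (assuming the standard `towhee.ops` accessor for hub operators), the operator can also be constructed directly via the factory method:
```python
from towhee import ops

# construct the CaMEL captioning operator via the factory method
op = ops.image_captioning.camel(model_name='camel_mesh')
```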
<br />
## Interface
An image captioning operator takes a [towhee image](link/to/towhee/image/api/doc) as input and generates the corresponding caption.
**Parameters:**
***data:*** *towhee.types.Image (a sub-class of numpy.ndarray)*
The image to generate the caption from.
**Returns:** *str*
The caption generated by the model.
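Putting the interface together, here is a minimal sketch that assumes the operator instance from the factory method above is called directly on a decoded towhee image:
```python
from towhee import ops

# decode an image file into a towhee image, then caption it
decoder = ops.image_decode()
captioner = ops.image_captioning.camel(model_name='camel_mesh')

img = decoder('./image.jpg')   # towhee.types.Image (a sub-class of numpy.ndarray)
caption = captioner(img)       # str
print(caption)
```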
# More Resources
- [What is a Transformer Model? An Engineer's Guide](https://zilliz.com/glossary/transformer-models): A transformer model is a neural network architecture. It's proficient in converting a particular type of input into a distinct output. Its core strength lies in its ability to handle inputs and outputs of different sequence lengths. It does this by encoding the input into a matrix with predefined dimensions and then combining that with another attention matrix to decode. This transformation unfolds through a sequence of collaborative layers, which deconstruct words into their corresponding numerical representations.
At its heart, a transformer model is a bridge between disparate linguistic structures, employing sophisticated neural network configurations to decode and manipulate human language input. An example of a transformer model is GPT-3, which ingests human language and generates text output.
- [Multimodal RAG locally with CLIP and Llama3 - Zilliz blog](https://zilliz.com/blog/multimodal-RAG-with-CLIP-Llama3-and-milvus): A tutorial that walks you through building a multimodal RAG system with CLIP, Llama3, and Milvus.
- [Transforming Text: The Rise of Sentence Transformers in NLP - Zilliz blog](https://zilliz.com/learn/transforming-text-the-rise-of-sentence-transformers-in-nlp): Everything you need to know about the Transformers model, exploring its architecture, implementation, and limitations
- [Supercharged Semantic Similarity Search in Production - Zilliz blog](https://zilliz.com/learn/supercharged-semantic-similarity-search-in-production): Building a Blazing Fast, Highly Scalable Text-to-Image Search with CLIP embeddings and Milvus, the most advanced open-source vector database.
- [The guide to clip-vit-base-patch32 | OpenAI](https://zilliz.com/ai-models/clip-vit-base-patch32): clip-vit-base-patch32: a CLIP multimodal model variant by OpenAI for image and text embedding.
- [The guide to gte-base-en-v1.5 | Alibaba](https://zilliz.com/ai-models/gte-base-en-v1.5): gte-base-en-v1.5: specialized for English text; Built upon the transformer++ encoder backbone (BERT + RoPE + GLU)
- [An LLM Powered Text to Image Prompt Generation with Milvus - Zilliz blog](https://zilliz.com/blog/llm-powered-text-to-image-prompt-generation-with-milvus): An interesting LLM project powered by the Milvus vector database for generating more efficient text-to-image prompts.
- [From Text to Image: Fundamentals of CLIP - Zilliz blog](https://zilliz.com/blog/fundamentals-of-clip): Search algorithms rely on semantic similarity to retrieve the most relevant results. With the CLIP model, the semantics of texts and images can be connected in a high-dimensional vector space. Read this simple introduction to see how CLIP can help you build a powerful text-to-image service.
- [Zilliz partnership with PyTorch - View image search solution tutorial](https://zilliz.com/partners/pytorch): Zilliz partnership with PyTorch