Intuitive deep dive into the im2recipe-related paper “Transformer Decoders with Multimodal Regularization for Cross-Modal Food Retrieval”
Welcome to the fourth and last part of the Food AI series of papers!
Part 1: “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”
Part 2: “Dividing and Conquering Cross-Modal Recipe Retrieval: from Nearest Neighbours Baselines to SoTA”
Part 3: “Cross-Modal Retrieval and Synthesis (X-MRS): Closing the Modality Gap in Shared Representation Learning”
As mentioned in previous articles, these explanations aim to chart the progress of research in a particular domain of machine learning. So, today, we will be looking at the paper titled “Transformer Decoders with Multimodal Regularization for Cross-Modal Food Retrieval” published in 2022. This paper furthers research on the im2recipe problem introduced here and explained in Part I of this series, primarily by 1) using cross-modal vanilla transformers and attention, that is, attention over the image and text encodings together instead of just the text as in Part III; 2) using the very powerful Vision and Language Pre-trained (VLP) model CLIP; 3) using a dynamic triplet loss that mimics curriculum training; 4) using a multi-modal regularization technique instead of regularizing via the task of classification as in Part I or via generation as in Part III.
An aside here: VLP aims to learn multi-modal representations from a large dataset of image-text pairs. This pretrained model can then be used in downstream vision-language tasks such as this one after finetuning on the Recipe1M dataset. The VLP models adopt a CNN-Transformer architecture, which embeds images with a CNN, and then aligns images and texts with a Transformer. A very good overview of VLP models is presented here
One: In previous papers we saw that the text can be encoded either separately for each component (ingredients, instructions, title) or jointly. The improvement made in this paper is to encode the text both independently and jointly: the authors use a hierarchical transformer module to encode the components separately first, and then use those outputs to obtain the final text encoding.
Two: The image encoder is no longer just a convolutional neural network but rather a vision transformer pre-trained on VLP tasks. This ensures the power of the transformer is leveraged while learning image encodings as well, and not just text encodings, as before.
Three: Previously, we had a separate regularization module either through classification (incorporating Food-101 or other classification dataset information) or image generation (making generated image distribution similar to actual image distribution). Here, we again have a regularizer but this one is more complex. It uses vanilla transformers and cross-attention between the image and text encodings to make them as similar to each other as possible, for the same recipe. What’s more is that this module is used only during training, and removed during test and inference.
Four: A dynamic triplet loss that acts similarly to curriculum training is introduced.
The overall architecture can be seen above and is quite similar to all the other architectures we have seen till now. We have an image and an image encoder, the corresponding text and text encoder. The encodings are then projected onto a shared space. Some losses are applied, while these encodings are also passed to a “MMR” module. This module works as both an alignment and regularizer module thus complementing triplet loss and replacing GANs / semantic regularization via classification, respectively.
Image Encoder: As mentioned before, the image encoder is a Vision Transformer (specifically, finetuned CLIP ViT B/16) ensuring that we reap the benefits of a transformer for image encodings as well. As before, with transformers we use the output of the [CLS] token as the encoding for the image.
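The [CLS]-token trick can be sketched with a toy ViT-style encoder. All dimensions here are illustrative, not the paper's (which uses a finetuned CLIP ViT-B/16): a learnable [CLS] token is prepended to the patch tokens, and its output is taken as the image encoding.

```python
import torch
import torch.nn as nn

# Minimal sketch of a ViT-style encoder whose [CLS] output is the image
# encoding. Toy sizes; the paper uses a finetuned CLIP ViT-B/16.
class TinyViTEncoder(nn.Module):
    def __init__(self, img_size=32, patch=8, dim=64):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, images):                      # images: (B, 3, H, W)
        x = self.patchify(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos   # prepend [CLS] token
        x = self.encoder(x)
        return x[:, 0], x[:, 1:]                    # [CLS] embedding, patch tokens

enc = TinyViTEncoder()
cls_emb, patch_tokens = enc(torch.randn(2, 3, 32, 32))
```

The patch tokens are kept around as well, since the MMR module later attends over individual image tokens rather than only the pooled embedding.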
Text Encoder: The text encoder is hierarchical. As in Part II, we have separate transformers T encoding the title, instructions and ingredients. Here, the instructions and ingredients are encoded one ingredient or instruction at a time. Next, we have separate transformers HT that encode the sequence of instruction and ingredient encodings. Now, encoding each component separately obviously means the model cannot learn how the components relate to each other, so the authors use transformer decoders HTD to achieve this.
The way these decoders work is that for each decoder, the query Q is the output of the corresponding HT, while the key K and value V are the concatenation of the outputs of the other two HTs. For example, for the ingredient decoder, the query is the ingredient encoding, and the key and value are the concatenation of the title and instruction encodings. This means we are cross-attending the ingredients over both the title and the instructions. This achieves complete independence in how the different components are encoded, while still learning the relations between them.
Note here that for text, we work with the whole output of the transformer and not just the [CLS] token. The final recipe encoding is obtained by averaging values of all output tokens of the HTD, concatenating this averaged output across all HTD and projecting to a shared space.
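The HTD stage described above can be sketched as follows. This is a simplified view under assumed toy dimensions; the inputs stand in for the token sequences produced by the T/HT transformers, and the pooling/projection follows the averaging-then-concatenation described in the text.

```python
import torch
import torch.nn as nn

# Sketch of the HTD cross-attention stage (dimensions are illustrative).
dim, shared_dim = 64, 128

def make_decoder():
    return nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True),
        num_layers=2)

class RecipeHTD(nn.Module):
    def __init__(self):
        super().__init__()
        self.dec_title = make_decoder()
        self.dec_ingr = make_decoder()
        self.dec_instr = make_decoder()
        self.proj = nn.Linear(3 * dim, shared_dim)  # concat of 3 averaged outputs

    def forward(self, title, ingr, instr):          # each: (B, L, dim) from HT
        # Each component queries the concatenation of the other two.
        t = self.dec_title(title, torch.cat([ingr, instr], dim=1))
        g = self.dec_ingr(ingr, torch.cat([title, instr], dim=1))
        s = self.dec_instr(instr, torch.cat([title, ingr], dim=1))
        # Average all output tokens of each HTD, concatenate, project.
        pooled = torch.cat([t.mean(1), g.mean(1), s.mean(1)], dim=-1)
        return self.proj(pooled)

htd = RecipeHTD()
recipe_emb = htd(torch.randn(2, 4, dim), torch.randn(2, 10, dim),
                 torch.randn(2, 12, dim))
```

Note that no causal mask is passed to the decoders, matching the unmasked decoders the paper describes.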
Multimodal Regularization: Note that this module is used only during training time and it replaces the simple projection onto a shared space followed by contrastive (triplet) loss that was being done previously to align the text and image encodings. Instead we have a transformer decoder using cross-attention where the queries Q come from one modality and the keys and values K, V (which are the same) come from the other modality.
In this module, we have two sub-modules. The Image Tokens Enhancement Module (ITEM) is a “transformer decoder that takes the image tokens as queries and text tokens as keys and values”. This “enriches the image tokens by attending to the text elements”. The Multimodal Transformer Decoder (MTD) is the submodule that actually applies cross-attention on the recipe tokens and the enhanced image tokens. The matching score that measures the level of alignment between image and text tokens is obtained as given in the diagram below. The matching score is not the cosine similarity as in previous papers but a score in the range [0,1] output by the decoder directly.
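A rough sketch of the ITEM + MTD pipeline, assuming toy dimensions. The score head (mean-pool, then a linear layer and a sigmoid) is my assumption for how a [0,1] score could be read out of the decoder; the paper's exact head may differ.

```python
import torch
import torch.nn as nn

# Sketch of the MMR module (used only at training time).
dim = 64

def make_decoder(num_layers):
    return nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True),
        num_layers=num_layers)

class MMR(nn.Module):
    def __init__(self):
        super().__init__()
        self.item = make_decoder(1)   # Image Tokens Enhancement Module
        self.mtd = make_decoder(4)    # Multimodal Transformer Decoder
        self.score_head = nn.Linear(dim, 1)  # assumed score read-out

    def forward(self, image_tokens, text_tokens):   # (B, Li, dim), (B, Lt, dim)
        # ITEM: image tokens attend to text tokens (Q = image, K = V = text).
        enhanced = self.item(image_tokens, text_tokens)
        # MTD: recipe tokens cross-attend to the enhanced image tokens.
        fused = self.mtd(text_tokens, enhanced)
        # Matching score in [0, 1].
        return torch.sigmoid(self.score_head(fused.mean(1))).squeeze(-1)

mmr = MMR()
score = mmr(torch.randn(2, 16, dim), torch.randn(2, 20, dim))
```

Since the whole module is dropped at inference, none of this cost is paid at retrieval time; only the projected embeddings are compared.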
So, in summary we have a VLP pre-trained Vision Transformer for encoding images, a hierarchical transformer module to encode the text (first for individual components and then by cross-attention between these components). Both the image and text encodings are then projected to a shared space where: 1) the image encodings are passed through the encoding enhancer ITEM which attends to the text elements; 2) the enhanced image encodings and text encodings are passed through a multimodal transformer decoder that is used to calculate the matching or alignment score.
Image-Text Matching Loss: ITM is a BCE loss that tracks whether or not an image-text pair matches. The sampling process followed for calculating this loss is that of hard negative mining. We see below that the equation is pretty similar to the standard BCE loss function: y is 1 for a matching image-text pair and 0 otherwise, and s is the score output by the decoder. The cross-entropy is then calculated between these two.
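In code, the ITM loss is just binary cross-entropy between the decoder's matching score and the pair label (the score values below are made up for illustration):

```python
import torch
import torch.nn.functional as F

# ITM loss: binary cross-entropy between the decoder's matching score s
# (already in [0, 1]) and the ground-truth label y (1 = matching pair).
def itm_loss(scores, labels):
    return F.binary_cross_entropy(scores, labels)

scores = torch.tensor([0.9, 0.2, 0.7])   # illustrative decoder outputs
labels = torch.tensor([1.0, 0.0, 1.0])   # matching / non-matching pairs
loss = itm_loss(scores, labels)
```

In practice the non-matching pairs fed into this loss come from the hard negative mining step mentioned above, not uniform sampling.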
Triplet Loss / IncMargin Loss: The triplet loss proposed is the normal triplet loss with an adaptive margin. The margin is kept within an acceptable range, starting small (and hence easier to optimize) and becoming increasingly larger over training.
Now, the authors also adapt the dynamic triplet loss (called IncMargin) by using an adaptive weighting strategy for the triplets as in the Adamine paper. Normally, the triplet loss works by averaging the gradient over every triplet in the mini-batch to obtain the update. The problem is that after many epochs, some of the triplets have converged and contribute zero gradient. Using some math magic (not gonna go into details here), the adaptive weighting strategy manages to give more weight to triplets that have not yet converged.
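Both ideas can be sketched together. The linear margin schedule and the margin bounds below are assumptions for illustration; the "adaptive weighting" is shown in its simplest Adamine-style form, averaging only over triplets that still violate the margin so that converged (zero-loss) triplets stop diluting the update.

```python
import torch

# Sketch of an increasing-margin triplet loss with Adamine-style weighting.
# The linear schedule and margin bounds are assumed, not from the paper.
def inc_margin_triplet(anchor, pos, neg, step, total_steps,
                       m_min=0.05, m_max=0.3):
    # Margin grows from m_min (easy) to m_max (hard) over training.
    margin = m_min + (m_max - m_min) * min(step / total_steps, 1.0)
    d_pos = (anchor - pos).pow(2).sum(dim=1)
    d_neg = (anchor - neg).pow(2).sum(dim=1)
    losses = torch.clamp(d_pos - d_neg + margin, min=0.0)
    # Average only over triplets that still violate the margin.
    active = (losses > 0).sum().clamp(min=1)
    return losses.sum() / active

a, p, n = torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16)
early = inc_margin_triplet(a, p, n, step=0, total_steps=1000)
late = inc_margin_triplet(a, p, n, step=1000, total_steps=1000)
```

The early/late calls show the curriculum effect: the same batch is scored against a small margin at the start of training and a larger one at the end.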
Semantic Loss: The authors also use another triplet-type loss which captures the semantics of a given query image and recipe images similar to the query. For example, any two pizzas should be closer in the latent space than a pizza and another item from any other class, like salad. This optimization is done directly in the latent space together with the other optimizations. This contrasts with using a separate classification regularization module as in Part I.
Here, x_q is the query, x_p belongs to the same semantic class as the query, and x_n belongs to a semantic class different from the query.
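A minimal sketch of this semantic constraint, assuming a cosine-distance triplet form and a made-up margin value (the paper's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

# Sketch of the semantic loss: in the shared latent space, an item of the
# same class (e.g. another pizza) must be closer to the query than an item
# of any other class (e.g. a salad), by a margin (value assumed here).
def semantic_loss(x_q, x_p, x_n, margin=0.1):
    d_same = 1 - F.cosine_similarity(x_q, x_p)   # query vs same-class item
    d_diff = 1 - F.cosine_similarity(x_q, x_n)   # query vs other-class item
    return torch.clamp(d_same - d_diff + margin, min=0.0).mean()

x_q = torch.randn(4, 32)                  # query embeddings
x_p = x_q + 0.05 * torch.randn(4, 32)     # same semantic class (nearby)
x_n = torch.randn(4, 32)                  # different semantic class
loss = semantic_loss(x_q, x_p, x_n)
```

Because this operates directly on the shared latent space, it regularizes the embeddings without needing the separate classification head of Part I.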
Different models used in the experiments: ViT-B/16 and CLIP-ViT-B/16 as image encoders; for the recipe encoder, the transformer encoders have 2 layers and 4 heads for the hierarchical transformers T and HT. HTD uses a transformer decoder (without masking) with 2 layers and 4 heads. The hidden layer dimension is kept at 512 in the recipe encoder. The image and recipe embeddings are obtained with different linear layers of output dimension 1024. The image and recipe tokens are then projected using different linear layers of the same output dimension of 1024 before going to the MMR module. The ITEM module consists of a transformer decoder of only 1 layer and 4 heads with hidden size 1024. The MTD consists of a transformer decoder with 4 layers, 4 heads and hidden dimension 1024.
The actual numbers aside, we can see from the results in the paper that using CLIP-ViT initialized from the CLIP weights works best in this architecture. Another notable observation is that as the sample size for testing increases beyond 10k, the difference in performance between TFood and the others increases, meaning TFood scales better.
From the ablation studies, the following points are validated: 1) HTD leads to better alignment hence proving that the recipe components are entangled and should not be encoded separately; 2) MTD brings additional improvement showing multi-modal transformers are a good module to add in problems like these. Furthermore, both adaptive triplet loss and ViT also bring significant additional improvements.
This paper looks very tough to understand but is actually very simple because the only thing it uses is transformers. I am of the opinion that it benefits very much from the discovery of Vision Transformers. The idea of using transformers everywhere also seems to have worked. This paper is also more satisfying to the mind because the authors have been able to separate the learning of every encoding, and also the relationship between these encodings, using the transformer decoder. One issue with the method in this paper, though, is the complexity in terms of computation and resources that is required.