Intuitive deep dive of im2recipe related paper “Cross-Modal Retrieval and Synthesis (X-MRS): Closing the Modality Gap in Shared Representation Learning”
Welcome to part three of the Food AI series of papers!
Part 1: “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”
Part 2: “Dividing and Conquering Cross-Modal Recipe Retrieval: from Nearest Neighbours Baselines to SoTA”
As mentioned in previous articles, these explanations aim at charting the progress of research in a particular domain of machine learning. So, today, we will be looking at the paper titled “Cross-Modal Retrieval and Synthesis (X-MRS): Closing the Modality Gap in Shared Representation Learning” published in 2020. This paper furthers research on the im2recipe problem, introduced here and explained in Part I of this series, by using a transformer-based text encoder instead of the LSTM used in the original work or the Average Word Embeddings used in the paper here (explained in Part II). Furthermore, the authors of X-MRS use multilingual translation to regularize the model while adding multilingual functionality, and show the power of the learned embeddings through a generative model that generates food images.
As we have seen in previous parts of this series, previous work on the cross-modal recipe task featured either models that were trained end-to-end, or pre-trained encoders that were finetuned on the recipe data on top of which a cross-modal alignment module was applied. In the latter case, there are also methods which encoded each textual component (viz. title, ingredients and instructions) separately and then concatenated the individual embeddings.
Previous work also used a regularization module, mainly using some external data to set up a classification or clustering task over the learnt representations. Some other papers, which we have not seen, used GANs to regularize the model. How would this work, you ask? Well, based on the performance of the GAN, i.e., its ability to generate realistic food images, the representations being learnt would be modified. Here, the GANs effectively work as decoders (from embeddings to actual images), and “learning decoders in high-dimensional space is a complex task, which can lead to sub-optimal representations.”
The authors’ approach differs from the aforementioned ones in three ways: 1) they do not treat and encode the different components of the textual data independently; 2) their models are trained end-to-end and not pre-trained; 3) they do not use GANs or set up a classification task to regularize the learnt representations. Instead, they use multilingual encoding and back-translation for this.
This offers an opportunity to attend to the real-world use case of having the im2recipe application be applied to different languages around the world. Back-translation is a concept where, for example, a sentence is translated from English to French and back to English. Then the original English sentence and the “back-translated” English sentence, which are in effect paraphrases of each other, can be compared and used for regularization.
Aside: Many famous and super-large generative image models in the world right now are text-induced conditional image synthesis models. This means that these models are trained to generate images on the basis of an encoding of some text description given to them. These current models can take in a long description of the image and synthesize it, but the authors of the X-MRS paper refer to papers, and to a time, when the textual descriptions for these kinds of models had to be short to make them work. In addition, none of those textual descriptions had an inherent sequence of modifications over time, as is the case for recipe ingredients and instructions. And this is the improvement the authors make when building their own image synthesis model from the learned embeddings.
For encoding the images, ResNet-50 is used. The “last fully connected layer of ResNet-50 is replaced by 2, 1024-dimensional fully connected layers with the last layer being shared with the textual encoder”.
The textual data is not treated separately for each component. Instead, the title, instructions and ingredients are treated as one long text description, tokenized using the WordPiece tokenizer, embedded into a dimension size of 768 and passed through a 2-layer, 2-head transformer encoder (emphasis: just the encoder of the transformer). Note here that the output of the tokenizer also includes a [CLS] (classification) token, and the resulting sequence is trimmed to just 512 tokens to control the memory footprint of the model.
Those who have worked with transformers would know that the [CLS] token can be thought of and used as an aggregate sequence representation, which is what the authors also do. The output of the [CLS] token is “passed through two 1024 dimensional FC layers, before a final 1024 dimensional FC layer projects the text encoding into the shared representation space”.
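To make the input preparation concrete, here is a minimal sketch of how a recipe could be flattened into a single token sequence. It is only an illustration: whitespace splitting stands in for the WordPiece tokenizer, and the function name `build_recipe_tokens` and the sample recipe are made up for this example, not taken from the paper.

```python
CLS = "[CLS]"
MAX_TOKENS = 512  # sequence limit mentioned in the article


def build_recipe_tokens(title, ingredients, instructions, max_tokens=MAX_TOKENS):
    """Flatten a recipe into one token list that starts with [CLS].

    Assumption: whitespace splitting replaces WordPiece to keep the
    sketch dependency-free; a real tokenizer would produce sub-words.
    """
    # All components are joined into one long description, not encoded separately.
    text = " ".join([title, *ingredients, *instructions])
    tokens = [CLS] + text.lower().split()
    # Truncating from the right keeps [CLS] at position 0.
    return tokens[:max_tokens]


tokens = build_recipe_tokens(
    "Tomato Soup",
    ["2 cups chopped tomatoes", "1 tsp salt"],
    ["Simmer the tomatoes.", "Season and blend."],
)
```

Because [CLS] is always the first token, the encoder output at index 0 is the one that would be fed to the fully connected layers described above.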
GAN Synthesis Module
The logic behind using a GAN is two-fold: 1) if the learnt representations are good enough, the GAN should be able to generate images that actually show the recipe ingredients in the image, apart from the image being realistic; 2) just using the trained GAN in a stand-alone application, we would be able to generate a realistic image of a given recipe if the ingredients and instructions are provided, even if it’s not in the training data.
The GAN module is based on StackGAN where the “intermediate resolution images are discarded”. The discriminator of the GAN also has a recipe classifier, similar to the one seen in Part I leveraging class information constructed from the Recipe1M and other datasets. Next, as discussed previously, the GAN has to be conditioned on text to generate an image. So, in the module, the recipe encoding is passed through a “conditional augmentation sub-network” to create a conditioning code which is passed to the decoder (along with noise) to generate the image. The discriminator then attempts to classify real from fake, and put the image into its correct recipe class.
Note: Some notation is added below for better understanding. For the full notation, please refer to the paper.
For learning the representations, the encoders use a simple triplet loss with margin, where the similarity between the anchors, positives and negatives is calculated using the cosine similarity function. The anchor can be either an image or a text encoding; the positive is the corresponding text or image encoding from the same recipe, while the negative is the image or text encoding from a different recipe.
The sampling strategy followed for the anchors, positives and negatives is that of hard negative mining. In this, the negative that is selected is of a different class but it has the highest similarity with the anchor among all other negatives.
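The triplet loss with hard negative mining described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' code: it mines negatives inside a small batch, treats "different class" as "different recipe in the batch", and uses a placeholder margin of 0.3 that is not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hard_negative_triplet_loss(anchors, candidates, margin=0.3):
    """Mean triplet loss where anchors[i] and candidates[i] share a recipe.

    For each anchor, the negative is the candidate from another recipe
    with the highest cosine similarity (the "hardest" one).
    Needs at least two recipes in the batch.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        pos_sim = cosine(anchor, candidates[i])
        neg_sim = max(
            cosine(anchor, cand) for j, cand in enumerate(candidates) if j != i
        )
        # Hinge: the positive should beat the hardest negative by the margin.
        losses.append(max(0.0, neg_sim - pos_sim + margin))
    return sum(losses) / len(losses)


image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[0.9, 0.1], [0.2, 0.8]]
# Both retrieval directions: image as anchor, then text as anchor.
loss = hard_negative_triplet_loss(image_emb, text_emb) + hard_negative_triplet_loss(
    text_emb, image_emb
)
```

Calling the function twice with the arguments swapped mirrors the statement that the anchor can be either an image or a text encoding.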
As mentioned before, the discriminator tries to classify real from fake and also classify the image into its correct recipe class. Losses corresponding to both of these tasks are added to get the final discriminator loss. The loss is a simple cross-entropy loss.
x₁ is the actual image encoding, G(v₂, z) is the generated image where z is Gaussian noise. E(x₁ ~ x₁data) represents that the data comes from the image encoder. E(v₂ ~ v₂data) represents that the data comes from the text encoder which is then passed onto the generator. Dᵣ and D𝒸 are sub-networks inside the discriminator representing the networks that classify real from fake, and recipe classes respectively.
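Using the notation above, a discriminator loss of this kind can be written out as follows. This is a sketch assuming the standard cross-entropy form for both tasks; the exact expression and weighting in the paper may differ, and the class label symbol c is introduced here only for illustration.

```latex
\mathcal{L}_D =
  - \mathbb{E}_{x_1 \sim x_{1data}} \left[ \log D_r(x_1) \right]
  - \mathbb{E}_{v_2 \sim v_{2data}} \left[ \log \left( 1 - D_r(G(v_2, z)) \right) \right]
  - \mathbb{E}_{x_1 \sim x_{1data}} \left[ \log D_c(c \mid x_1) \right]
  - \mathbb{E}_{v_2 \sim v_{2data}} \left[ \log D_c(c \mid G(v_2, z)) \right]
```

The first two terms are the real-versus-fake cross-entropy, and the last two are the recipe-class cross-entropy on real and generated images respectively.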
The generator loss is basically one minus the discriminator loss plus the regularization terms from the Conditional Augmentation (CA) subnetwork and the supervision term ret. The CA loss is the KL divergence between the Gaussian whose mean and standard deviation come from the text encoding and the unit Gaussian, while the supervision term enforces that the generator generates the “correct” food images having the “correct” ingredients by forcing the distances between the representations of the (fake image, text) and (fake image, real image) pairs to be minimized. The latter part is pretty self-explanatory.
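The CA regularization term has a well-known closed form, which the sketch below computes alongside the sampling step that produces the conditioning code. It assumes a diagonal Gaussian parameterized by a mean and a log-variance (a common convention, not confirmed by the article), and the function names are made up for this example.

```python
import math
import random


def sample_condition(mu, log_var, rng=random):
    """Draw a conditioning code c = mu + sigma * eps with eps ~ N(0, 1).

    Assumption: the CA sub-network outputs mu and log_var per dimension.
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)
    ]


def kl_to_unit_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


mu = [0.5, -0.2]
log_var = [0.0, -1.0]
code = sample_condition(mu, log_var)     # passed to the generator along with noise z
ca_loss = kl_to_unit_gaussian(mu, log_var)
```

The KL term is zero only when the predicted Gaussian is exactly the unit Gaussian, so it pulls the conditioning distribution towards a smooth, well-behaved region of the latent space.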
The experiments are conducted on the Recipe1M dataset. “R1M English (EN) recipes were augmented via back translation from German (EN-DE-EN) and Russian (EN-RU-EN). Translations from/to German, Russian and French were obtained using pre-trained models from the fairseq neural translation toolbox.”
Retrieval results are reported using the same metrics as in Part II. “For the GAN synthesis model, besides reporting the retrieval performance on synthetic food images, Fréchet Inception Distance (FID) score, which measures the similarity between real and synthetic image distributions, is also calculated.” Lower FID values indicate better performance.
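For reference, the FID between real and generated images is usually computed from Inception features with the formula below. The symbols are the conventional ones and are not taken from the article: μ_r, Σ_r are the mean and covariance of the features of real images, and μ_g, Σ_g those of generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Both terms vanish when the two feature distributions have the same mean and covariance, which is why lower values indicate that synthetic images are statistically closer to real ones.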
During training and evaluation, some augmentation performed by the authors is:
Image augmentation: 1) Random choice of input image for recipes with more than one image. 2) Random choice between padding to a square (with zeros, edge duplication or image reflection, picked at random) and then resizing to 256, or resizing to 256 directly. 3) Random rotation of ±10°. 4) Random crop to 224. 5) Random horizontal flip.
Recipe augmentation: 1) Randomly select between original EN representation and back translation from EN-DE-EN or EN-RU-EN. 2) Random selection between previous EN choice and either KO, DE, RU or FR.
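The two-step recipe augmentation above can be sketched as a small sampling routine. The article does not state the sampling probabilities, so uniform choices are assumed here, and the function name and dictionary keys are hypothetical.

```python
import random

ENGLISH_VARIANTS = ("EN", "EN-DE-EN", "EN-RU-EN")
OTHER_LANGUAGES = ("KO", "DE", "RU", "FR")


def pick_recipe_text(versions, rng):
    """Pick one textual version of a recipe for the current training step.

    `versions` maps a key such as "EN", "EN-DE-EN" or "FR" to the recipe
    text in that form. Uniform sampling is an assumption of this sketch.
    """
    # Step 1: original English or one of its back-translations.
    english_key = rng.choice(ENGLISH_VARIANTS)
    # Step 2: keep that English choice or switch to another language.
    final_key = rng.choice([english_key, rng.choice(OTHER_LANGUAGES)])
    return final_key, versions[final_key]


versions = {key: f"recipe text in {key}" for key in ENGLISH_VARIANTS + OTHER_LANGUAGES}
key, text = pick_recipe_text(versions, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which is handy when comparing training runs.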
As always, exact results can be referred to from the paper. Here, we focus on the analysis of the results. One very significant result is that the X-MRS model with a transformer-based text encoder, when trained without multilingual data, is not able to surpass the baseline we walked through in Part II of this series.
The authors also perform ablation studies where they test retrieval performance using different textual data components. It is observed that using all the information leads to the best performance (duh!), followed by instructions + ingredients. This makes the title the least informative. Again, not surprising! Very early research into food classification was already working with just titles and not getting far. One interesting finding here is that using the full ingredient text as given in the dataset, and not just the extracted ingredient names, leads to better retrieval performance.
Retrieval results for different languages are also presented. It should be noted here that the paper’s overall best performing retrieval model, which also beats the CkNN baseline, is the one trained on multilingual data and using English for retrieval. However, the paper does not mention exactly how the multilingual training is done.
For the synthesis module, the realism of the generated images and the retrieval performance of the recipe given the synthesized image are tested. It is observed that there are models that do better in generating more realistic food images, but the proposed model trained on multilingual data is the best in retrieval. Plus, the retrieval using synthetic images is better than using real images.
Because the generator is trained to generate food images based on a full recipe embedding, meaning it also has the corresponding textual information, it is easier to have a higher recall than when using the original images (which do not contain textual information). So, to test the purity of the image-text learnt representations, the authors perform an experiment where they train the generator to generate images conditioned on the image embedding (and not the text embedding). They find that the realism of the generated images does not degrade, but the retrieval performance does. This makes sense because now the generated images do not contain textual information, showing that the textual and image embeddings are independent, but close enough in the shared space not to affect the realism of the generated images.
The fact that the X-MRS model without multilingual training cannot surpass the CkNN baseline from Part II goes to show the power of a strong, theoretically sound pipeline, no matter if it uses traditional methods or modern “cool” deep networks. That being said, the transformer has been used properly in the pipeline, and the idea of augmenting and regularizing with the help of multilingual data and back-translation is a cool one, which also works.
This is one of the papers that relies more on experimental analysis than theoretical intuitions. The analysis of generated recipe images offers a different perspective on how the learnt embeddings can be analyzed, and it even shows that, if learnt properly, the image encoding and text encoding can be interchanged, as it should be.