An intuitive deep dive into the im2recipe-related paper “Dividing and Conquering Cross-Modal Recipe Retrieval: from Nearest Neighbours Baselines to SoTA”
Welcome to part two of the Food AI series of papers! Last week, I wrote an article explaining the seminal im2recipe paper, which introduced cross-modality into machine learning for food applications such as searching for the correct recipe using a photo, automatically estimating the number of calories in a dish, or improving the performance of various recipe recommendation and ranking systems. I refer the reader to that article for an introduction and motivation to the problem, and for details on the Recipe1M dataset and evaluation metrics.
As mentioned in the previous article, these explanations aim to chart the progress of research in a particular domain of machine learning. So, today, we will be looking at the paper titled “Dividing and Conquering Cross-Modal Recipe Retrieval: from Nearest Neighbours Baselines to SoTA”, published in 2019. This paper furthers research on the im2recipe problem (introduced in the original paper and explained in my previous article) by defining a different and better baseline for the retrieval task at hand. This baseline is intended to replace the CCA baseline from the original paper.
The authors give two reasons for improving the baseline:
- “the CCA baseline for Recall@1 was 14.0 on a test set of 1,000 recipes. This result was eclipsed and quadrupled inside two years to reach 51.8”. The authors argue that this has more to do with “mis-specified baselines” than with genuine improvement in retrieval methods.
- In the original im2recipe paper, the model was trained end-to-end, making it difficult to understand how each element of the model (the image encoder, the text encoder, etc.) performed individually. This is overcome by building a conceptually simple and interpretable model to serve as a baseline for the problem.
Specifically, the retrieval function in the model is a cross-modal adaptation of the kNN algorithm called CkNN, “applied on top of pre-computed image-text embeddings where the image and text encoders are not trained end-to-end with CkNN but independently using self-supervision.” This makes it easy to test and analyze each component’s performance in isolation.
Lastly, the authors extend this method to obtain a new state-of-the-art performance on the problem.
Some important related work that the paper references (apart from the im2recipe paper):
- AdaMine – improved the alignment module by using a “double triplet loss” as the loss function. Simply put, this is triplet loss applied to two sets of triplets: one set built from individual image-recipe pairs, which helps the model learn to map the embeddings of the two modalities close together for each individual recipe, and another built from classes of image-recipe pairs, which helps the model learn, through the class information, which recipes are similar to each other. The latter is similar to semantic regularization in the im2recipe paper.
- R2GAN – used GANs to help learn better image and text embeddings
- ACME – This was the state-of-the-art when the CkNN paper was published. It used “GANs + cross-modal triplet loss together + modality alignment using an adversarial learning strategy + a cross-modal translation consistency loss”. Not to get into the specifics of how it works (that is probably for another article), but the point is that CkNN can achieve a result really competitive with ACME, as we will see soon.
Text Encoder (Average Word Embedding): Average word embeddings are a very basic way of encoding text, depicted in the image below. For a document (which can be a sentence, a paragraph, or anything else), the document embedding is generated by averaging the word embeddings of the words it contains. The word embeddings themselves can be generated in many ways.
In the paper, this is done as follows: “instructions and ingredients are treated as the document and the class labels are the most frequent unigrams and bigrams contained in the recipe title (class generation is similar to im2recipe)”. The word embedding size is d=300. The embeddings are randomly initialized and then learned by training with softmax activation and Binary Cross Entropy loss on the average of the word embeddings in the document.
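As a concrete illustration, here is a minimal numpy sketch of the averaging step. The toy vocabulary and the random initialization are assumptions for illustration only; in the paper, d=300 and the embeddings are learned through the classification objective described above.

```python
import numpy as np

# Hypothetical toy vocabulary; real recipes have thousands of tokens.
rng = np.random.default_rng(0)
d = 300
vocab = {"preheat": 0, "oven": 1, "mix": 2, "flour": 3, "sugar": 4}
E = rng.normal(size=(len(vocab), d))  # word embedding matrix (randomly initialized)

def encode_text(tokens):
    """Encode a document (ingredients + instructions) as the mean of its
    word embeddings, skipping out-of-vocabulary tokens."""
    ids = [vocab[t] for t in tokens if t in vocab]
    return E[ids].mean(axis=0)

doc_emb = encode_text(["preheat", "the", "oven", "and", "mix", "flour"])
print(doc_emb.shape)  # (300,)
```

Despite its simplicity, this averaging step is the entire encoding logic; all the learning happens in the classification head that shapes the word embedding matrix.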
Image Encoder: The image embedding is obtained from the last layer before the classification layer of ResNet-50, again regularized using Binary Cross Entropy on a set of class labels generated from the recipe title using the same technique.
Note here that the sets of labels generated for the text encoder and the image encoder are different (3,453 for text and 5,036 for image, to be precise). The reason is that since the encoders are trained independently, we are not restricted to using only image-text pairs as data: we can also use recipes that include only text or only images. Hence, we have different numbers of data points, and different classes, for text and image.
CkNN – Cross-Modal kNN
CkNN, like kNN, is a non-parametric model. It serves as an image-text embedding alignment module, and as with the other components in this proposed system, it is applied independently: first the encoders are trained to get the image and text embeddings, and then CkNN is applied on top of them. This differs from the im2recipe paper, where the system was trained end-to-end and learned transformations brought the embeddings into a shared space.
There are two parts to CkNN: one works in the text embedding space and the other in the image embedding space. Let’s see how it works in the text embedding space (follow the numbers in the image):
- Encode a candidate text document (that is, recipe instructions and ingredients) T using the text encoder e(T)
- Find the kₜ nearest neighbours based on the text embeddings using cosine similarity in the text embedding space, denoted by Rₜ.
- Extract the set of images Iₜ associated with Rₜ leveraging the Recipe1M dataset
- Encode each I ∈ Iₜ in the image embedding space using the image encoder e(I).
- Return the mean of the obtained image embeddings as the result
The corresponding algorithm for the image embedding space would be:
- Encode a candidate image I using the image encoder e(I)
- Find the kᵢ nearest neighbours based on the image embeddings using cosine similarity in the image embedding space, denoted by Rᵢ.
- Extract the set of texts Tᵢ associated with Rᵢ leveraging the Recipe1M dataset
- Encode each T ∈ Tᵢ in the text embedding space using the text encoder e(T).
- Return the mean of the obtained text embeddings as the result
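The two procedures above are the same algorithm with the roles of the modalities swapped. A minimal numpy sketch (with toy data and assumed array shapes, not the authors’ code) could look like this:

```python
import numpy as np

def cknn(query_emb, source_embs, paired_embs, k):
    """Cross-modal kNN: find the k nearest neighbours of the query in its
    own embedding space (cosine similarity), then return the mean of the
    paired embeddings of those neighbours in the OTHER modality.

    source_embs: embeddings in the query's modality (e.g. all text embeddings)
    paired_embs: the corresponding embeddings in the other modality
    """
    # cosine similarity = dot product of L2-normalized vectors
    q = query_emb / np.linalg.norm(query_emb)
    S = source_embs / np.linalg.norm(source_embs, axis=1, keepdims=True)
    sims = S @ q
    nn_idx = np.argsort(-sims)[:k]           # indices of the k nearest neighbours
    return paired_embs[nn_idx].mean(axis=0)  # mean embedding in the other modality
```

With this one function, Mₜ is `cknn(e(T), text_embs, image_embs, k_t)` and Mᵢ is `cknn(e(I), image_embs, text_embs, k_i)`, where `text_embs` and `image_embs` are the pre-computed embeddings of the paired Recipe1M data.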
Let’s call the component in the image space as Mᵢ and the component in the text space as Mₜ.
The retrieval distance (the analogue of a loss here, since CkNN itself has no trained parameters) is basically a linear combination of the distances between the obtained image and text embeddings in 1) the image embedding space and 2) the text embedding space. Easy!
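Sketched in numpy, with the combination weight `alpha` as an assumed hyper-parameter (the exact weighting is a detail of the paper):

```python
import numpy as np

def cosine_dist(a, b):
    """Cosine distance between two vectors."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieval_distance(img_emb, txt_emb, m_i_out, m_t_out, alpha=0.5):
    """Combine the two cross-modal distances:
    - in image space: the candidate image e(I) vs. the text mapped through Mt
    - in text space:  the candidate text e(T) vs. the image mapped through Mi
    `alpha` is an illustrative weight, not the paper's exact value."""
    d_img = cosine_dist(img_emb, m_t_out)
    d_txt = cosine_dist(txt_emb, m_i_out)
    return alpha * d_img + (1.0 - alpha) * d_txt
```

Ranking candidates by this combined distance is all the “retrieval” there is; everything interesting lives in the encoders and the neighbour averaging.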
Rather than focusing on the exact results, which can be seen in the tables in the paper, I will focus on the authors’ analysis. In short, the results report retrieval performance, measured the same way as in the original im2recipe paper (plus an additional sample size of 10,000), for the new CkNN baseline and the other methods used for comparison.
To show that training the encoders separately and then applying CkNN on top works about as well as training the system end-to-end as in ACME (the state-of-the-art at the time), different combinations of pre-trained image and text encoders from related works are used to generate embeddings, on which CkNN is then applied to measure retrieval performance. The end-to-end Recall@1 is 20.6, while with CkNN it is 17.9, which is competitive. This analysis also suggests that the quality of the encoders, and not cross-modal alignment, is the primary driving force behind the improving results in previous research.
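For reference, the Recall@K metric quoted throughout these papers can be computed as below, assuming a paired evaluation set where the correct candidate for query i sits at index i (a sketch, not the papers’ evaluation code):

```python
import numpy as np

def recall_at_k(dist_matrix, k=1):
    """Fraction of queries whose true match (candidate i for query i)
    appears among the k candidates with the smallest distance.
    dist_matrix[i, j] is the distance between query i and candidate j."""
    ranks = np.argsort(dist_matrix, axis=1)[:, :k]
    hits = [i in ranks[i] for i in range(dist_matrix.shape[0])]
    return float(np.mean(hits))
```

So “Recall@1 of 17.9” means the correct recipe is the single top-ranked candidate for 17.9% of queries over the sampled test pool.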
Extension to SoTA
Finally, encouraged by the above competitive results with ACME, the authors add their own alignment module on top of CkNN, using triplet loss to create a joint embedding space for images and text.
“Two feed-forward neural networks with one hidden layer, dropout and batch normalization are trained with triplet loss with margin: one for image (gᵢ) and another for textual (gₜ) features.”
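A bare-bones numpy sketch of such a module follows. Dropout and batch normalization (which the paper uses) are omitted, and the layer sizes and margin value are illustrative assumptions:

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    """One-hidden-layer feed-forward projection into the joint space
    (the g_i / g_t networks; dropout and batch norm omitted here)."""
    h = np.maximum(0.0, x @ W1 + b1)  # hidden layer with ReLU
    return h @ W2 + b2                # linear projection to the joint space

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Triplet loss with margin: pull the matching cross-modal pair
    together and push a mismatched pair at least `margin` further apart."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```

During training, the anchor would be (say) a projected image embedding, the positive its matching projected text embedding, and the negative the projected text of a different recipe.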
Reading this paper makes me think that deep learning is not the only way to go. Sure, the image encoder used here is deep, but ResNet has been ruling that domain for some time now, and the actual alignment is a simple kNN. Adding an MLP on top gets you state-of-the-art performance! This is a clever paper that takes the concept of cross-modality introduced in the im2recipe paper and, instead of improving the results (as most papers would), asks the question from the other side: what if the baseline is too low? That question gives you a sanity check.
In conclusion, this paper redefined the baseline for the im2recipe task. The new baseline uses a very simple architecture that is not trained end-to-end, which simplifies both model exploration and model comparison. In the next post, we will look at how this new baseline is used in another paper that again improves SoTA performance. Stay tuned!