Pre-trained models have helped boost the popularity of semantic search. By applying a pre-trained model to raw data, we can easily obtain an embedding (i.e., a vector) for different kinds of media (e.g., text, images, audio). These embeddings can then be indexed to support semantic search use cases. This article walks through the top 5 pre-trained models for getting image embeddings, which are lower-dimensional representations of images.
We can get image embeddings in two steps:
- Given an image, convert it to the input format required by the deep learning framework, such as TFRecords for TensorFlow or tensors for PyTorch.
- Apply a pre-trained model to that representation and fetch the embedding.
Pre-trained models are typically trained on the image classification task on ImageNet. Figure 1 shows the state-of-the-art models from the paperswithcode website. As it shows, since AlexNet was introduced in 2012, more and more models have been developed to improve accuracy. These models can be downloaded from model repositories and used immediately.
The top 5 list consists of:
The first four are convolutional neural network (CNN) models, while the last one is based on the transformer architecture. According to a Google post, ViT can achieve state-of-the-art results with efficient computation. Since there are numerous tutorials on each model, this article doesn't elaborate on them further. We can simply pick one, such as ResNet50, to get embeddings and see what the semantic search results look like.