In this post, I’ll walk through a project that uses pre-trained models and a basic level of transfer learning to recognize the faces of the seven members of the Korean band BTS.
Motivation — A big fan of BTS challenged me to identify the seven members of the band after just looking at a few music videos (and listening to her real-time audio annotations). It was a good challenge for me, and it took a while for me to get all seven members correct. Since it took me a few days to actually train myself to recognize the members, I decided to pass this challenge on to a machine (learning algorithm), and see how soon and how accurately it could learn to recognize their faces.
A brief flow of steps followed:
- Data Collection — Member images were downloaded by web scraping with Selenium
- Annotation and QC — Labeling quality was verified through manual checks
- Face detection using MTCNN 
- Convert the face images to vector embeddings using the pre-trained FaceNet model
- Train an SVM classifier using the embeddings as the features and the member’s names as classes
- Evaluate the classification results
- Classify the members’ faces in a music video using the classifier
The data that we require for face recognition is image data. Since we need to train a model to differentiate between the seven members of the band, we need to scrape a set of images for each of the band members.
The data source that we are using for this is Google Images, and we will be scraping the image URLs off the website using Selenium for Python after searching for the images of each of the seven band members.
Also, to get accurate results from the Google Images search, we will use the band members’ real names instead of their stage names, e.g. searching “Nam Joon BTS” instead of “RM BTS”.
Let’s start by creating the folder structure.
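Below is a minimal sketch of the folder setup. The `data/` root and the exact member folder names are my assumptions, not the original code:

```python
import os

# Assumed real-name folder labels for the seven members (per the naming note above)
MEMBERS = ["Nam Joon", "Seok Jin", "Yoongi", "Hoseok", "Jimin", "Taehyung", "Jungkook"]

# One folder per member under all/, train/ and test/
for split in ["all", "train", "test"]:
    for member in MEMBERS:
        os.makedirs(os.path.join("data", split, member), exist_ok=True)
```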
Now that we have the folder structure created for test, train and all, let’s scrape the images off Google Images and save them in the respective folders.
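Here is a rough sketch of the scraping step. Google changes its page markup frequently, so treat the selector and scrolling logic as assumptions to adapt, not working code from the original project:

```python
import os
import time
import urllib.request
from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_image_urls(query, n_images=200):
    """Search Google Images for `query` and collect image URLs from the results page."""
    driver = webdriver.Chrome()  # requires a chromedriver on the PATH
    driver.get("https://www.google.com/search?tbm=isch&q=" + query.replace(" ", "+"))
    urls = set()
    for _ in range(30):  # bounded number of scrolls to load more thumbnails
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)
        for img in driver.find_elements(By.CSS_SELECTOR, "img"):
            src = img.get_attribute("src")
            if src and src.startswith("http"):
                urls.add(src)
        if len(urls) >= n_images:
            break
    driver.quit()
    return list(urls)[:n_images]

# Download each member's images into their "all" folder (MEMBERS from the previous snippet)
for member in MEMBERS:
    for i, url in enumerate(scrape_image_urls(member + " BTS")):
        urllib.request.urlretrieve(url, os.path.join("data", "all", member, f"{i}.jpg"))
```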
Once we have all the images saved for all the members, we can move on to the data cleaning phase.
In this stage, in order to ensure that only the relevant images were present in the dataset, a manual eyeball check was done. Since we already scraped and saved the band member images separately, we don’t need to manually annotate the images.
Also, since each image is already labeled, we proceed only with the images that contain the relevant band member as per the labels. A few images in the dataset contained more than one band member; we use MTCNN to detect the faces, and the images with multiple detected faces were deleted from the dataset. Upon inspection, this does lead to a slight imbalance in the number of samples per class, but the difference is small. Below is how we count the number of faces in each image and delete the images with multiple faces.
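A sketch of that cleanup, reusing the assumed folder layout (the MTCNN `detect_faces` call is the library’s real API; the rest is illustrative):

```python
import os
import numpy as np
from mtcnn import MTCNN
from PIL import Image

detector = MTCNN()

def count_faces(path):
    """Return the number of faces MTCNN detects in the image at `path`."""
    pixels = np.asarray(Image.open(path).convert("RGB"))
    return len(detector.detect_faces(pixels))

# Delete any image in which more than one face is detected
for member in MEMBERS:
    member_dir = os.path.join("data", "all", member)
    for fname in os.listdir(member_dir):
        fpath = os.path.join(member_dir, fname)
        if count_faces(fpath) > 1:
            os.remove(fpath)
```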
Total number of images that we scraped across all the members = 1226
The next step requires us to split the entire image repository into train and test datasets. This step just requires us to copy the images from the “All” folder to the train and test folders. We will be using an 80–20 train-test split for this use case.
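A minimal sketch of the split, assuming the folder layout from earlier (shuffling per member keeps the split roughly stratified; the seed is arbitrary):

```python
import os
import random
import shutil

random.seed(42)  # arbitrary fixed seed for reproducibility

for member in os.listdir(os.path.join("data", "all")):
    src_dir = os.path.join("data", "all", member)
    images = sorted(os.listdir(src_dir))
    random.shuffle(images)
    n_train = int(0.8 * len(images))  # 80-20 split per member
    for i, fname in enumerate(images):
        split = "train" if i < n_train else "test"
        shutil.copy(os.path.join(src_dir, fname),
                    os.path.join("data", split, member, fname))
```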
The train-test split has resulted in 978 in the training set and 248 in the testing set.
The dataset that we have so far is in the form of images, and the images consist of more than just faces. As a result, we need to first extract just the faces, and then embed them into numerical vectors, so that we can perform classification on them.
1. For face extraction, we will use MTCNN once again. But this time, instead of just counting the faces, we will use the bounding-box values to crop just the face out of each image. Once we have the face-extraction function, we save the pixel values of the face crops as arrays, which we can later use for getting face embeddings.
2. With these functions in place, we can create a wrapper function that reads in the images for the different members and uses the face-extraction function above, saving the results as arrays. A sketch of both steps follows this list.
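A sketch of both steps, closely following the FaceNet tutorial cited at the end of this post (the folder paths are my assumptions):

```python
import os
import numpy as np
from mtcnn import MTCNN
from PIL import Image

detector = MTCNN()

def extract_face(path, required_size=(160, 160)):
    """Detect the (single) face in an image and return its pixels as an array."""
    pixels = np.asarray(Image.open(path).convert("RGB"))
    box = detector.detect_faces(pixels)[0]["box"]  # single-face images after cleaning
    x, y, w, h = [abs(v) for v in box]             # MTCNN can return negative coords
    face = pixels[y:y + h, x:x + w]
    # FaceNet expects 160x160 inputs
    return np.asarray(Image.fromarray(face).resize(required_size))

def load_dataset(root):
    """Read every member folder under `root`; return face arrays and name labels."""
    X, y = [], []
    for member in sorted(os.listdir(root)):
        for fname in os.listdir(os.path.join(root, member)):
            X.append(extract_face(os.path.join(root, member, fname)))
            y.append(member)
    return np.asarray(X), np.asarray(y)

train_X, train_y = load_dataset(os.path.join("data", "train"))
test_X, test_y = load_dataset(os.path.join("data", "test"))
```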
Firstly, let’s load the pre-trained model with its weights. The pre-trained model that I found was a TensorFlow model, and I converted it to a Keras model referring to keras-inception-resnet-v2.
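Loading it is then a one-liner; `facenet_keras.h5` is the converted weights file shipped with the keras-facenet repo referenced at the end of this post:

```python
from keras.models import load_model

model = load_model("facenet_keras.h5")
print(model.inputs)   # expects (None, 160, 160, 3) face crops
print(model.outputs)  # one embedding vector per face
```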
Once we have the model loaded, we can use the face pixel values as input to the FaceNet model and get the face embeddings as the model output.
The embeddings will serve as the input for training the face classification model. FaceNet gives us 512 face-embedding features per sample.
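A sketch of the embedding step (the per-image standardization mirrors the cited FaceNet tutorial):

```python
import numpy as np

def get_embedding(model, face_pixels):
    """Standardize one face crop and run a forward pass through FaceNet."""
    face_pixels = face_pixels.astype("float32")
    mean, std = face_pixels.mean(), face_pixels.std()
    face_pixels = (face_pixels - mean) / std      # per-image standardization
    sample = np.expand_dims(face_pixels, axis=0)  # the model expects a batch
    return model.predict(sample)[0]

train_emb = np.asarray([get_embedding(model, f) for f in train_X])
test_emb = np.asarray([get_embedding(model, f) for f in test_X])
```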
What is SVM?
A Support Vector Machine (SVM) is an algorithm that creates a separating boundary between data points in a high-dimensional space. Since we have 512 dimensions from our face embeddings and around 978 training samples, an SVM is an effective choice. Also, since the number of features is smaller than the number of samples, we do not need to worry too much about overfitting the model (or we can look at regularization later, if required).
Before we get into model training, we will normalize the embedding values to ensure there is no scale difference among the features that go into the model, which could affect the model’s ability to judge the importance of the features. We will also encode the labels as numerical values using LabelEncoder to make it easier for the model to deal with the classes. We will be using sklearn for these common transformations and treatments.
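A minimal sketch of both treatments:

```python
from sklearn.preprocessing import Normalizer, LabelEncoder

# L2-normalize each embedding vector to unit length
normalizer = Normalizer(norm="l2")
train_emb_n = normalizer.transform(train_emb)
test_emb_n = normalizer.transform(test_emb)

# Encode the member names as integer class labels
encoder = LabelEncoder()
encoder.fit(train_y)
train_y_enc = encoder.transform(train_y)
test_y_enc = encoder.transform(test_y)
```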
We will now train an SVM classifier from the sklearn library on the training set we have created.
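A linear kernel is a reasonable default for this feature space; `probability=True` is my addition so we can inspect class probabilities later:

```python
from sklearn.svm import SVC

classifier = SVC(kernel="linear", probability=True)
classifier.fit(train_emb_n, train_y_enc)
```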
Moment of truth: let’s do some predictions and check out the accuracy of our face recognition model.
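Something along these lines:

```python
from sklearn.metrics import accuracy_score

pred_train = classifier.predict(train_emb_n)
pred_test = classifier.predict(test_emb_n)

print("Accuracy: train=%.3f test=%.3f" % (
    accuracy_score(train_y_enc, pred_train) * 100,
    accuracy_score(test_y_enc, pred_test) * 100))
```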
Accuracy: train=82.515 test=84.677
Not a bad result for a simple model without any hyper-parameter tuning! The model also seems to generalize well rather than overfit, since the test-set accuracy is slightly higher than the train-set accuracy. I will still perform a K-Fold CV at a later point to ensure a good balance of bias and variance.
Let’s also look at an individual image classification:
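A sketch of one such check, picking a random test face and plotting it with its predicted name and probability (the matplotlib plotting is my assumption):

```python
import random
import matplotlib.pyplot as plt

i = random.randrange(len(test_emb_n))
pred = classifier.predict(test_emb_n[i:i + 1])
proba = classifier.predict_proba(test_emb_n[i:i + 1])

name = encoder.inverse_transform(pred)[0]
plt.imshow(test_X[i].astype("uint8"))
plt.title("%s (%.1f%%)" % (name, proba[0, pred[0]] * 100))
plt.axis("off")
plt.show()
```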
For most of the examples, we’re seeing good results. Time to use this on a music video and see it in action! I have downloaded a BTS music video — Life Goes On. It’s a little slow-paced compared to some others, and I wanted the members’ faces to be relatively stable to see the classifier in effect.
For this purpose, we will process and render the video frame by frame using the moviepy package.
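A sketch of the rendering loop. `fl_image` is moviepy’s real per-frame filter; the input filename and the OpenCV drawing calls are my choices, not necessarily the original code:

```python
import numpy as np
import cv2
from PIL import Image
from moviepy.editor import VideoFileClip

def annotate_frame(frame):
    """Detect faces in one frame, classify each, and draw labeled boxes."""
    frame = frame.copy()
    # Reuses detector, model, normalizer, encoder, classifier and
    # get_embedding from the earlier steps
    for res in detector.detect_faces(frame):
        x, y, w, h = [abs(v) for v in res["box"]]
        face = np.asarray(Image.fromarray(frame[y:y + h, x:x + w]).resize((160, 160)))
        emb = normalizer.transform([get_embedding(model, face)])
        name = encoder.inverse_transform(classifier.predict(emb))[0]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, name, (x, y - 8), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
    return frame

clip = VideoFileClip("life_goes_on.mp4")  # assumed local filename
clip.fl_image(annotate_frame).write_videofile("life_goes_on_labeled.mp4")
```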
It took a little while to process, and voila! Below is a link to the output. It isn’t 100% accurate, but it is a good POC for face recognition and similar applications.
 H. Taniai, “keras-facenet.” Aug. 30, 2022. Accessed: Aug. 30, 2022. [Online]. Available: https://github.com/nyoki-mtl/keras-facenet
 I. de P. Centeno, “ipazc/mtcnn.” Aug. 31, 2022. Accessed: Aug. 30, 2022. [Online]. Available: https://github.com/ipazc/mtcnn
 SeleniumHQ, “selenium/py at trunk · SeleniumHQ/selenium,” GitHub. Accessed: Aug. 30, 2022. [Online]. Available: https://github.com/SeleniumHQ/selenium
 J. Brownlee, “How to Develop a Face Recognition System Using FaceNet in Keras,” Machine Learning Mastery, Jun. 06, 2019. Accessed: Aug. 31, 2022. [Online]. Available: https://machinelearningmastery.com/how-to-develop-a-face-recognition-system-using-facenet-in-keras-and-an-svm-classifier/
 scikit-learn, “scikit-learn/scikit-learn.” Aug. 31, 2022. Accessed: Aug. 31, 2022. [Online]. Available: https://github.com/scikit-learn/scikit-learn
 Zulko, “MoviePy.” Aug. 31, 2022. Accessed: Aug. 31, 2022. [Online]. Available: https://github.com/Zulko/moviepy