Review of the algorithm for automatic anchor selection in YOLOv5 and YOLOv7
Some convolutional neural networks, including later versions of YOLO, rely on anchors. So before you start training your network, you need to decide which anchors to use, and that decision should be based on your data. The anchor set is a constant you pass to the model: a prior that encodes the typical object sizes in the dataset.
According to MathWorks:
“Anchor boxes are a set of predefined bounding boxes of a certain height and width. These boxes are defined to capture the scale and aspect ratio of specific object classes you want to detect and are typically chosen based on object sizes in your training datasets.”
Selecting good anchors is important because YOLO does not predict bounding boxes directly, but as displacements from anchor boxes. Naturally, neural networks predict small displacements more accurately than large ones. So the better you choose the anchor boxes, the less “work” the network has to do, and the higher the accuracy of the resulting model.
Before training, the script checks how well the provided anchors fit the data; if they do not fit well, it recalculates them, and the model is trained with the new, more appropriate anchors. Sounds like an extremely useful feature, doesn’t it?
And it is. This post is devoted to the Auto-anchor algorithm: how it works and what is the intuition behind it. If you are interested — continue reading 🙂
After reviewing the Auto-anchor code, I believe it is better to explain it as a 4-step algorithm:
Step 1. Get bounding box sizes from the train data
Step 2. Choose a metric to define anchor fitness
Step 3. Do clustering to get an initial guess for anchors
Step 4. Evolve anchors to improve anchor fitness
What you need is the height and width of all bounding boxes (labels) in all training images. Note that height and width should be measured in pixels on images already resized to the model input size.
The model input size for YOLOv5 and YOLOv7 is 640×640 by default, which means that the larger side of the image is resized to 640, the aspect ratio is preserved, and the shorter side is padded. See the visualization below.
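As a sketch of this preprocessing step, scaling label sizes to the model input resolution could look like the following. The function name and API here are mine, not from the YOLO codebase; note that padding shifts box positions but does not change box sizes, so sizes only need the scale factor.

```python
import numpy as np

def letterbox_wh(labels_wh, img_wh, input_size=640):
    """Scale label (width, height) pairs from original-image pixels to the
    model input resolution: the larger image side maps to `input_size`,
    the aspect ratio is preserved, and padding does not affect box sizes."""
    scale = input_size / max(img_wh)
    return np.asarray(labels_wh, dtype=float) * scale

# a 1280x960 image gets scale 640/1280 = 0.5, so a 100x80 box becomes 50x40
resized = letterbox_wh([[100, 80]], (1280, 960))
```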
Next, we need a metric to compare sets of anchor boxes and understand which one of them fits the data better.
Ideally, the metric should be connected to the loss function (to the box loss, in particular): the better the metric, the lower the loss. And if anchor boxes are selected using this metric, the model starts training having lower loss already. Perfect!
This metric will be used in the evolutionary algorithm, which means that you may use literally any metric and forget about constraints that some optimization algorithms may impose.
The metric used in the YOLO auto-anchor algorithm is tricky, and you may not need to know all the details, but for those who are interested, an explanation is below:
- There is a threshold defined as a hyperparameter (called anchor_t, 4 by default; sometimes used as 1/anchor_t, i.e. 0.25). The threshold means that if an anchor box is no more than 4 times larger or smaller than the bounding box label, we consider it a good anchor for that box.
- We want each bounding box label to be as close as possible to at least one anchor box. And we want it to be close within the threshold (to be no more than 4 times larger or smaller).
- Good fitness is achieved on average, which means that some bounding boxes (probably outliers) may still be far from anchors.
- For each bounding box we select the best-fitting anchor, but an anchor’s fit is computed from its worse-fitting dimension, width or height.
I really recommend going through the calculations below if you want to understand exactly how the metric is computed.
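The bullet points above can be condensed into a few lines of NumPy. This mirrors the ratio-based fitness used in YOLOv5’s autoanchor code, though the function name and exact shape of the code are my own:

```python
import numpy as np

def anchor_fitness(wh, anchors, thr=4.0):
    """Ratio-based anchor fitness as described above.

    wh:      (n, 2) label widths/heights at model input resolution
    anchors: (k, 2) anchor widths/heights
    thr:     the anchor_t hyperparameter (4 by default)
    """
    r = wh[:, None] / anchors[None]        # (n, k, 2) label-to-anchor ratios
    x = np.minimum(r, 1 / r).min(axis=2)   # worse-fitting side of each anchor
    best = x.max(axis=1)                   # best anchor per label
    # labels whose best anchor is within the threshold contribute; others add 0
    return (best * (best > 1 / thr)).mean()

wh = np.array([[10.0, 20.0], [30.0, 30.0]])
anchors = np.array([[11.0, 19.0], [100.0, 10.0]])
fitness = anchor_fitness(wh, anchors)  # a single number in [0, 1], higher is better
```

A perfectly matching anchor contributes 1 per label; a label whose best anchor violates the 4× threshold contributes 0, which is exactly the “good on average, outliers allowed” behavior described above.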
In YOLOv2, anchor boxes were calculated with a k-means clustering algorithm only. The typical distance metric for k-means is Euclidean distance; however, with this distance larger boxes generate more error than small boxes, so the authors used 1−IoU (Intersection over Union) as the distance metric. IoU, by the way, is much more closely related to the YOLO loss function than Euclidean distance.
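For reference, here is a minimal sketch of that 1−IoU distance. When clustering sizes only, boxes are compared as if they shared a common corner; the function name is mine:

```python
import numpy as np

def iou_distance(wh, anchors):
    """1 - IoU between boxes, assuming they share a common corner,
    which is the usual trick when clustering box *sizes* only."""
    wh, anchors = np.asarray(wh, float), np.asarray(anchors, float)
    inter = (np.minimum(wh[:, None, 0], anchors[None, :, 0])
             * np.minimum(wh[:, None, 1], anchors[None, :, 1]))
    union = (wh[:, 0] * wh[:, 1])[:, None] \
            + (anchors[:, 0] * anchors[:, 1])[None] - inter
    return 1.0 - inter / union  # shape (n_boxes, n_anchors)

d = iou_distance([[10, 10]], [[10, 10], [5, 10]])  # distances 0.0 and 0.5
```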
Doing k-means clustering alone is already a good approach; it gives much better results than hand-picked anchor boxes. However, the authors of later YOLO versions decided to go even further:
- K-means (with simple Euclidean distance) is used to get the initial guess for anchor boxes.
- An evolutionary algorithm (more on that in Step 4) is used to find the best anchor set with respect to the metric selected in Step 2.
How to select the number of clusters? YOLOv5 and YOLOv7 use 9 anchor boxes by default, so the number of clusters is 9.
Note for beginners. The k-means algorithm runs on all bounding box labels from the training set. The features used for clustering are the width and height in pixels; the final cluster centers become the anchor boxes.
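A self-contained sketch of this step in plain NumPy follows. YOLOv5 itself uses scipy’s kmeans on sizes normalized by their standard deviation; the function name and the farthest-point initialization here are my own simplifications:

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=30, seed=0):
    """Plain k-means on (width, height) pairs; the final cluster centers
    become the initial anchor guess."""
    wh = np.asarray(wh, dtype=float)
    rng = np.random.default_rng(seed)
    # farthest-point initialization: spreads the initial centers out
    centers = [wh[rng.integers(len(wh))]]
    for _ in range(k - 1):
        d = np.min([((wh - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(wh[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):  # standard Lloyd iterations
        assign = ((wh[:, None] - centers[None]) ** 2).sum(2).argmin(1)
        for j in range(k):
            pts = wh[assign == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers[np.argsort(centers.prod(1))]  # sorted by area

# demo: synthetic box sizes around three typical object scales
rng = np.random.default_rng(1)
data = np.vstack([
    rng.normal((10, 10), 1, (100, 2)),
    rng.normal((60, 40), 3, (100, 2)),
    rng.normal((200, 180), 10, (100, 2)),
])
anchors0 = kmeans_anchors(data, k=3)  # roughly [[10, 10], [60, 40], [200, 180]]
```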
The evolutionary algorithm is inspired by nature and beautiful in its simplicity. We take the anchor set from k-means, slightly and randomly change the height and width of some anchor boxes (mutation), then recalculate the fitness metric. If the mutated anchor set is better, the next mutation is performed on it; otherwise, the old anchors are kept. If you prefer to consume information visually, below is a scheme of how this evolutionary algorithm works.
In YOLOv5 and YOLOv7 evolution runs for 1000 iterations (by default) and may change the initial anchor set by a lot.
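The loop described above can be sketched in a few lines. The function name and the toy `ratio_fitness` below are my own; the mutation defaults loosely mirror YOLOv5’s settings, but this is a simplified stand-in, not the repo’s implementation:

```python
import numpy as np

def evolve_anchors(anchors, wh, fitness_fn, iters=1000, prob=0.9, sigma=0.1, seed=0):
    """Random-mutation hill climbing over anchor sizes: mutate, keep if better."""
    rng = np.random.default_rng(seed)
    best = np.asarray(anchors, dtype=float).copy()
    best_fit = fitness_fn(wh, best)
    for _ in range(iters):
        # mutate each anchor dimension with probability `prob`
        mask = rng.random(best.shape) < prob
        factors = np.where(mask, rng.normal(1.0, sigma, best.shape), 1.0)
        cand = np.clip(best * factors, 2.0, None)  # keep sizes at least 2 px
        fit = fitness_fn(wh, cand)
        if fit > best_fit:  # greedy: accept only improving mutations
            best, best_fit = cand, fit
    return best, best_fit

# toy demo with a simple symmetric-ratio fitness as a stand-in for the real metric
def ratio_fitness(wh, a):
    r = wh[:, None] / a[None]
    return np.minimum(r, 1 / r).min(axis=2).max(axis=1).mean()

wh = np.array([[9.0, 11.0], [10.0, 10.0], [11.0, 9.0]])
start = np.array([[3.0, 30.0]])  # deliberately poor initial anchor
evolved, best_fit = evolve_anchors(start, wh, ratio_fitness, iters=300)
```

Because worse mutations are always discarded, the evolved set is never less fit than the k-means starting point; with enough iterations it usually improves noticeably.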
Now you know how the YOLO Auto-anchor algorithm works, in full detail. I am planning to review other parts of YOLO in the future, so if you are interested — subscribe and stay updated.
Meanwhile, I encourage you to look through some of my other posts: