A review of different approaches to vision-based rep counting
In this article, I walk through my exploration of different vision-based repetition counting techniques and discuss their pros and cons. Specifically, I highlight five major ways in which computer vision has been employed for rep counting.
Wearable sensors have long been popular for rep and set counting. However, because these sensors are expensive and, in most cases, limited to tracking a single body part, much recent work has focused on vision-based approaches to rep counting.
With applications ranging from activity monitoring, sports, and gaming to counting how often a biological event (a heartbeat, a pulse, etc.) occurs, rep counting is a problem being actively worked on in both academia and industry.
Keywords: Rep Counting, Computer Vision, Pose Estimation.
- RepNet: Class Agnostic rep counting in the Wild
- Rule-based exercise rep counting using Pose Estimation
- Exercise rep counting using ideas from Signal Processing
- GymCam: Exercise rep counting from motion trajectories
- Rep counting using a DL-based Optical Flow Approach
Most of the techniques discussed in this blog are not generic but specific to a particular problem (for example, workouts). For a deeper understanding of each technique, please refer to the references provided.
One of the most prominent works on rep counting is RepNet, an end-to-end deep learning model that can accurately predict counts on a broad range of repetitive movements.
The RepNet model takes in a video stream as input and predicts two outputs:
Per-frame period length: For each frame that is part of a repetitive action, we want to know the period length (in time units) of that action.
Per-frame periodicity: A score indicating whether the current frame is part of a repetition or not.
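To make the two outputs concrete, here is a minimal sketch of how they could be turned into a single count. It assumes the period length is expressed in frames and that a frame counts as periodic when its score exceeds a threshold; each periodic frame then contributes 1 / period to the total. The function name `count_reps` and the 0.5 threshold are my own illustrative choices, not part of the RepNet code.

```python
def count_reps(period_lengths, periodicity_scores, threshold=0.5):
    """Sum per-frame count rates (1 / period) over frames judged periodic.

    Assumption: period lengths are given in frames, one value per frame.
    """
    total = 0.0
    for period, score in zip(period_lengths, periodicity_scores):
        if score > threshold and period > 0:
            total += 1.0 / period
    return total


# 12 frames of a motion that repeats every 4 frames, then 3 idle frames.
periods = [4] * 12 + [4] * 3
scores = [0.9] * 12 + [0.1] * 3
print(count_reps(periods, scores))  # 3.0
```

The idle frames are ignored because their periodicity score is low, even though a period length is still predicted for them.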
Some of the key highlights of the RepNet model center on its Temporal Self-similarity Matrix (TSM):
- The TSM is the highlight of this rep counting technique. It is the information bottleneck of the RepNet architecture. This matrix relates the frames to each other by computing a pairwise similarity function between every pair of frame embeddings.
- One can also infer (using heuristics) the number of repetitions from these TSMs, which makes predictions from the RepNet model interpretable.
- Because real-world repetition videos are diverse, the resulting TSMs are diverse too, which gives RepNet a range of applications beyond rep counting.
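A rough sketch of how such a matrix can be computed from per-frame embeddings is shown below. It assumes the similarity is the negative squared Euclidean distance followed by a row-wise softmax, which is my understanding of the RepNet paper; the function name, the `temperature` parameter, and the toy embeddings are illustrative only.

```python
import numpy as np


def self_similarity_matrix(embeddings, temperature=1.0):
    """Return an (N, N) row-normalized similarity matrix for N frame embeddings."""
    emb = np.asarray(embeddings, dtype=float)
    # Pairwise squared Euclidean distances between all frame embeddings.
    diff = emb[:, None, :] - emb[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Similarity is the negative distance; a softmax turns each row into weights.
    logits = -sq_dist / temperature
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


# Toy embeddings that cycle through 3 states, twice.
states = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
embeddings = np.tile(states, (2, 1))
tsm = self_similarity_matrix(embeddings)
print(tsm.shape)  # (6, 6)
```

In this toy case, frame 0 is as similar to frame 3 as it is to itself, which is the kind of repeating structure one can read a period from.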
One of the most impressive things about this rep counting method is that it is class agnostic (generic) and applies to a wide range of repetitive motions. The RepNet model is a classic application of Transformers in computer vision.
However, the model is constrained in that the number of frames in the input video has to be limited. This is because each TSM has one row and one column per input frame, so its size grows quadratically with the number of frames.
The model is also quite heavy and complex, so deploying it in a mobile app or any production environment would be challenging and could introduce latency issues.
This is the most common idea used in industry. A number of health and fitness startups have been building accurate, lightweight pose estimation models that can be used to count reps during exercise and to provide feedback such as posture correction.
Major Steps involved:
- Given a specific exercise, you first come up with definitions (rules) for states in that exercise. There can be multiple states in an exercise. A squat exercise, for example, can be broken into two states, say a lower state and an upper state. During the course of movement, the person doing exercise will shift from one state to the other. These state rules can be thought of as representing activation regions during movement.
- E.g., for a squat, these rules can be (th refers to threshold values):
down: (left_knee_hip_dist_y < th1 and right_knee_hip_dist_y < th2)
up: (left_knee_hip_dist_y > th3 and right_knee_hip_dist_y > th4)
- During inference, we compute the metrics (angles and normalized distances) from the model's pose keypoints in real time, check whether a particular rule gets activated, and count reps using a state flag.
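The steps above can be sketched as a small state machine. This is a simplified illustration, not a production implementation: it uses one shared threshold per state instead of the four thresholds (th1 to th4) in the rules above, and the threshold values and the name `count_squats` are made up for the example.

```python
def count_squats(frames, down_th=0.15, up_th=0.30):
    """Count down -> up transitions.

    `frames` is a sequence of (left_knee_hip_dist_y, right_knee_hip_dist_y)
    pairs, assumed to be normalized distances computed from pose keypoints.
    """
    reps = 0
    state = "up"
    for left, right in frames:
        if state == "up" and left < down_th and right < down_th:
            state = "down"  # the "down" rule is activated
        elif state == "down" and left > up_th and right > up_th:
            state = "up"  # the "up" rule is activated: one rep is complete
            reps += 1
    return reps


frames = [
    (0.35, 0.35), (0.20, 0.20), (0.10, 0.10), (0.20, 0.20),
    (0.35, 0.34), (0.10, 0.12), (0.40, 0.40),
]
print(count_squats(frames))  # 2
```

Keeping a gap between the two thresholds means small jitter in the keypoints around one threshold does not toggle the state back and forth.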
One of the major upsides of the approach is that rep counting is fast and accurate, and latency is very low. However, some major downsides include the following:
- It is not a generic rep counter.
- The pose estimation model is highly sensitive to background noise, and hence so is the rep count.
- Scalability issues: writing rules manually is a time-intensive process. We also need to test the rules against variations in angle, orientation, posture, etc. Imagine writing rules for hundreds of exercises in the corpus.
Goal: Use Signal Processing ideas like zero-crossing and peak detection to make an exercise rep counter.
This approach is very similar to rule-based rep counting but removes the hassle of manually writing the rules for the different states of a rep. It semi-automates the state calculation by inferring a reference line (which can be thought of as a state boundary) for a specific movement/exercise from a trainer's video and then using that reference line to count reps in any video of that exercise.
Here, we treat an exercise as a set of waves of keypoint metrics. These metrics include angles and distances between combinations of different body keypoints, and the keypoints are computed using a pose estimation model (TensorFlow's MoveNet pose estimation model).
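As an illustration of what such metrics look like, the sketch below computes a joint angle and a normalized distance from 2D keypoints. It assumes keypoints are (x, y) pairs in normalized image coordinates; the function names and the choice of normalizing by a caller-supplied scale (for example, torso length) are my own assumptions.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at keypoint b, between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        return 0.0  # degenerate case: two keypoints coincide
    cos_theta = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_theta))


def normalized_distance(p, q, scale):
    """Euclidean distance between two keypoints divided by a body-size scale."""
    return math.dist(p, q) / scale


# A straight leg: hip, knee, and ankle on one vertical line.
hip, knee, ankle = (0.5, 0.5), (0.5, 0.7), (0.5, 0.9)
knee_angle = joint_angle(hip, knee, ankle)
print(round(knee_angle))  # 180
```

Computing one such value per frame gives the wave-like signal that the rest of this approach works with.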
Major Steps involved:
- We first compute metrics (distances and angles) between a combination of keypoints using a trainer reference video (as input). These metrics represent a signal temporally.
- We filter out all the stationary signals and create a combined signal of the non-stationary ones. Then we compute the reference line using the mean of the summed-up signal.
- During inference, we again compute the metrics on the user's test video and build the overall combined signal in real time.
- We create a fixed-size moving window and check for the intersection of the overall signal (from the previous step) with the reference line (from the second step). This intersection indicates that a rep is complete.
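The steps above can be sketched in a few lines. This is a simplified version under several assumptions of mine: a metric counts as stationary when its variance falls below a small threshold, the moving window is reduced to a pair of consecutive samples, and only upward crossings are counted so that each cycle yields one rep. All names and sample values are illustrative.

```python
from statistics import mean, pvariance


def moving_metric_indices(signals, min_variance=1e-3):
    """Indices of metrics that actually change over the trainer video."""
    return [i for i, s in enumerate(signals) if pvariance(s) > min_variance]


def combine(signals, indices):
    """Sum the selected metric signals frame by frame."""
    return [sum(frame) for frame in zip(*(signals[i] for i in indices))]


def count_crossings(signal, ref):
    """Count upward crossings of the reference line (one per cycle)."""
    return sum(1 for prev, curr in zip(signal, signal[1:]) if prev < ref <= curr)


# Trainer video: a knee angle that oscillates and a distance that stays flat.
trainer = [[170, 130, 90, 130, 170, 130, 90, 130, 170], [0.3] * 9]
indices = moving_metric_indices(trainer)
ref = mean(combine(trainer, indices))  # the reference line

# User video: the same metrics, with slightly different values.
user = [[165, 120, 85, 125, 168, 128, 92, 131, 169, 120, 88, 126, 171], [0.31] * 13]
reps = count_crossings(combine(user, indices), ref)
print(reps)  # 3
```

Note that the set of non-stationary metrics is chosen from the trainer video and reused for the user video, so both combined signals are built from the same metrics.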
This approach is fast, easy to implement, and fairly accurate. However, some major downsides include the following:
- Rep counting is exclusive and non-generic.
- Highly sensitive to background noise.
- Scaling issues: One needs to compute the reference (zero-crossing) line from a reference video for every activity, and that video must itself be free of noise.
GymCam is a vision-based system used for automated exercise rep counting and tracking. It is based on the assumption that any repetitive motion inside the gym is some sort of exercise. Again, here the input to the system is a video stream from the camera, and the output is several exercise-related metrics, including rep count.
Summary of the Steps Involved
- Detect all potential motion trajectories in a video using a dense optical flow algorithm. A motion trajectory might also result from non-exercise activities, for example, warming up, a user's gait, or walking around.
- Detect all exercise motion trajectories in a scene. How do they do so? First, they perform a feature extraction step that extracts handcrafted features from a 5-sec window of each trajectory. They then use an MLP-based binary classifier, which takes in the features and outputs the probability that the input trajectory is an exercise-related activity.
- Clustering exercise motion trajectories in space and time. After clustering, an average motion trajectory is generated by combining all trajectories belonging to a given cluster. Note here that the number of clusters is pre-defined. These average trajectories are then used for exercise rep counting and tracking.
- Rep Counting and Exercise Recognition: Average trajectories are then converted into feature vectors, which are then fed to an MLP Regressor and an MLP Classifier model to infer rep counts and exercise labels, respectively.
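To give a feel for the feature extraction step, here is a small sketch that cuts a one-dimensional trajectory into 5-second windows and computes one hypothetical handcrafted feature: the share of spectral energy held by the dominant frequency. This particular feature is my own stand-in for illustration; the actual GymCam features are described in the paper and may differ. The frame rate, step size, and function names are likewise assumptions.

```python
import numpy as np


def sliding_windows(trajectory, fps, window_sec=5.0, step_sec=1.0):
    """Split a 1-D trajectory into overlapping fixed-length windows."""
    size, step = int(window_sec * fps), int(step_sec * fps)
    return [trajectory[i:i + size]
            for i in range(0, len(trajectory) - size + 1, step)]


def dominant_frequency_share(window):
    """Fraction of non-DC spectral energy in the strongest frequency bin."""
    x = np.asarray(window, dtype=float)
    x = x - x.mean()
    power = np.abs(np.fft.rfft(x)) ** 2
    power = power[1:]  # drop the DC bin
    total = power.sum()
    return float(power.max() / total) if total > 0 else 0.0


fps = 30
t = np.arange(300)
periodic = np.sin(2 * np.pi * t / 30)  # one cycle per second
noise = np.random.default_rng(0).normal(size=300)

windows = sliding_windows(periodic, fps)
print(len(windows))  # 6
print(dominant_frequency_share(windows[0]) > dominant_frequency_share(noise[:150]))  # True
```

A feature like this is close to 1 for a clean periodic motion and much lower for irregular motion, which is the kind of signal a binary classifier could pick up on.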
Some of the noteworthy features of this system are:
- It is an end-to-end system that performs rep counting in a real-world setting.
- Optical flow picks up all movement, so the system can track the exercise and count reps even if the user is barely visible.
Issues with this system:
- Overlapping users: when multiple users overlap in a video while exercising, it becomes very difficult to figure out their exact boundaries and infer the rep counts.
- Noise sensitive: noisy human behavior such as warming up, resting, or a user's gait might exhibit periodicity and hence contribute spuriously to the rep count.
- Rep counting is not generic: the system is limited to just the exercise rep counting.
Another interesting idea that employs vision for rep counting is the DL-based optical flow approach.
Major Steps Involved
- Compute color-coded representations of the video frames of a repetitive activity using a dense optical flow algorithm. The key idea is that different states of a repetitive movement produce different color codings. For details about the optical flow algorithm, please refer to the OpenCV documentation on optical flow, which also includes an implementation.
- Dataset creation: Next, generate a dataset of color-coded images and videos and label them with the different states of the movement (say, up or down).
- Model training: Next, train a vanilla CNN model to perform multiclass classification of the frames. At test time, color-coded frames from optical flow are fed to the model, which predicts one of the movement states as the class label. This is essentially a color-matching problem, solved with a model because a model is more robust.
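The color coding in the first step can be sketched as follows. The sketch assumes a dense flow field of shape (H, W, 2) is already available (in practice it could come from a function such as OpenCV's `cv2.calcOpticalFlowFarneback`) and follows the common convention of mapping direction to hue and speed to brightness. The function name `flow_to_hsv` is my own, and the conversion from HSV to a displayable image is left out.

```python
import numpy as np


def flow_to_hsv(flow):
    """Map a dense flow field (H, W, 2) to an HSV image.

    Hue encodes the direction of motion, value encodes its speed.
    """
    flow = np.asarray(flow, dtype=float)
    dx, dy = flow[..., 0], flow[..., 1]
    magnitude = np.hypot(dx, dy)
    angle = np.arctan2(dy, dx) % (2 * np.pi)  # direction in [0, 2*pi)
    hsv = np.zeros(flow.shape[:2] + (3,), dtype=np.uint8)
    # Hue in the 0-179 range that OpenCV uses for 8-bit HSV images.
    hsv[..., 0] = (angle / (2 * np.pi) * 180).astype(np.uint8) % 180
    hsv[..., 1] = 255  # full saturation
    peak = magnitude.max()
    if peak > 0:
        hsv[..., 2] = (magnitude / peak * 255).astype(np.uint8)
    return hsv


# Two tiny flow fields: everything moving down vs. everything moving up
# (in image coordinates, y grows downward).
down = np.zeros((2, 2, 2))
down[..., 1] = 1.0
up = np.zeros((2, 2, 2))
up[..., 1] = -1.0

hue_down = int(flow_to_hsv(down)[0, 0, 0])
hue_up = int(flow_to_hsv(up)[0, 0, 0])
print(hue_down != hue_up)  # True
```

Opposite directions of motion land on different hues, which is what lets a classifier separate the up and down states. It is also why the same movement filmed from a different orientation gets a different color encoding.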
The approach is accurate and easily deployable in production. However, its cons clearly outweigh its pros:
- Rep counting is exclusive and class-dependent.
- Scaling issues: one needs to annotate the dataset and train a model each time a new exercise gets added to the corpus.
- Orientation Sensitive: The same movement in different orientations produces different color encodings, resulting in wrong model predictions. This is one of the major limitations of the approach.
- Noise Sensitive: Any slight noise in the background would change these color encodings and hence the model’s prediction.