I wanted to explore what I could build by combining computer vision and LLMs. Most of the LLM use cases I had come across were chatbots, search agents, or summarization tools - I thought it would be fun to try something different.
I had a nagging side project on my list that seemed like a good fit - a bot that would look at your lifting technique and tell you whether you were using good form. When I was new to the gym, I was worried about lifting with poor form (getting injured, learning bad habits, etc.). In theory, one shouldn’t need to hire a coach for this kind of basic advice. Ideally, I’d like something free that lets me hit the gym by myself, on my own schedule, but still lets me know if I’m doing something wrong.
To keep things narrow, I defined my MVP as the following:
First, I had to see what I had to work with in terms of computer vision models. I came across the PoseNet TensorFlow project, which gives real-time coordinates of key body parts as you move in front of the webcam. This felt like a promising start - although I wasn’t yet sure how to use those estimations to check lifting form.
I used this tutorial to bring PoseNet into a React.js web app and start detecting poses in real time.
While playing around with this code, I found the function that actually generates the pose estimations at some interval, as the user moves through the frame -
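As a minimal sketch of what such a polling function can look like (not the tutorial's exact code), the loop below calls the model on a fixed interval and hands each result to a callback. It assumes `net` is a loaded PoseNet model exposing `estimateSinglePose(input, config)`, as in the 2.x `@tensorflow-models/posenet` package; the helper name `startPoseLoop`, the `onPose` callback, and the 100 ms default are hypothetical choices for illustration.

```javascript
// Hypothetical helper: polls a PoseNet-like model on a fixed interval.
// Assumes `net.estimateSinglePose(input, config)` returns a Promise of a pose,
// and `video` is the webcam <video> element the model reads frames from.
function startPoseLoop(net, video, onPose, intervalMs = 100) {
  let busy = false; // skip a tick if the previous estimate is still running

  const timer = setInterval(async () => {
    if (busy) return;
    busy = true;
    try {
      const pose = await net.estimateSinglePose(video, { flipHorizontal: false });
      onPose(pose); // hand the estimation to the caller (e.g. log or draw it)
    } catch (err) {
      console.error('Pose estimation failed:', err);
    } finally {
      busy = false;
    }
  }, intervalMs);

  // Return a function that stops the loop.
  return () => clearInterval(timer);
}

// Usage sketch inside the app (assumes posenet is imported and the webcam is ready):
//   const net = await posenet.load();
//   const stop = startPoseLoop(net, videoElement, (pose) => console.log(pose));
//   ...later: stop();
```

The `busy` flag is one simple way to avoid overlapping calls when an estimate takes longer than the interval.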
To figure out how I could work with these estimations, I first printed each one to the console, just to get a sense of what I was looking at. I saw that PoseNet estimations take the following shape:
Each body part gets an x and y value representing its position on the canvas, along with an associated confidence score. I get a new set of coordinates for every body part with each PoseNet estimation as the user moves through the frame.
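To make that structure easier to work with, here is a small sketch that turns one estimation into a lookup keyed by body part, keeping only keypoints above a confidence threshold. It assumes the `{ score, keypoints: [{ part, score, position: { x, y } }] }` layout that PoseNet's `estimateSinglePose` returns; the helper name `confidentKeypoints`, the 0.5 threshold, and the sample values are hypothetical, for illustration only.

```javascript
// Hypothetical helper: index a pose's keypoints by part name,
// dropping any keypoint whose confidence score is below minScore.
function confidentKeypoints(pose, minScore = 0.5) {
  const byPart = {};
  for (const kp of pose.keypoints) {
    if (kp.score >= minScore) {
      byPart[kp.part] = { x: kp.position.x, y: kp.position.y };
    }
  }
  return byPart;
}

// Made-up sample estimation (not real model output).
const samplePose = {
  score: 0.9,
  keypoints: [
    { part: 'leftShoulder', score: 0.98, position: { x: 310, y: 180 } },
    { part: 'leftElbow', score: 0.35, position: { x: 330, y: 260 } },
    { part: 'leftWrist', score: 0.91, position: { x: 325, y: 340 } },
  ],
};

// leftElbow is dropped because its score (0.35) is below the threshold.
console.log(confidentKeypoints(samplePose));
```

Filtering on the score matters because low-confidence keypoints tend to jump around between frames, which would make any later form check noisy.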
Next, I had to figure out how to use these coordinates to evaluate someone’s form. At first, I thought of downloading thousands of “good lifts”, having PoseNet extract patterns from them, and using the resulting model to evaluate form. While that’s probably the smarter approach, I didn’t want to sink all that time into training a model so early in the project. So I tried a scrappier approach to start with -