I wanted to explore what I could build by combining computer vision and LLMs. Most of the LLM use cases I'd come across were chatbots, search agents, or summarization tools - I thought it would be fun to try something different.

I had a nagging side project on my list that seemed like a good fit - a bot that would look at your lifting technique and tell you whether you were using good form. When I was new to the gym, I worried about lifting with poor form (getting injured, learning bad habits, etc.). In theory, one shouldn't need to hire a coach for this kind of basic advice. Ideally, I'd have something that lets me hit the gym by myself, on my own schedule, for free - but still lets me know if I'm doing something wrong.

To keep things narrow, I defined my MVP as the following:

  1. A single lift - overhead press (correct form felt relatively easy to describe)
  2. The bot should be able to look at a single rep and say whether it was good form or not. (Giving advice or suggestions is out of scope for now - although that would be interesting to explore in future iterations!)

Making computers see

First, I had to see what I had to work with in terms of computer vision models. I came across the PoseNet TensorFlow project, which gives real-time coordinates of key body parts as you move in front of the webcam. This felt like a promising start - although I wasn't yet sure how to use those estimations to check lifting form.

[GIF: PoseNet tracking key body parts in real time through the webcam]

I used this tutorial to bring PoseNet into a React.js web app and start detecting poses in real time.
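The wiring ends up looking roughly like this - a minimal sketch in the spirit of that tutorial, not its exact code. I'm using react-webcam for the video feed; the detect() function is shown in the next snippet:

```jsx
import React, { useRef, useEffect } from "react";
import "@tensorflow/tfjs"; // registers the tfjs backend PoseNet runs on
import * as posenet from "@tensorflow-models/posenet";
import Webcam from "react-webcam";

function App() {
  const webcamRef = useRef(null);

  // detect() runs one pose estimation - shown in the next snippet.
  const detect = async (net) => {
    /* ... */
  };

  useEffect(() => {
    const runPosenet = async () => {
      const net = await posenet.load(); // load the model once
      setInterval(() => detect(net), 100); // then estimate every 100ms
    };
    runPosenet();
  }, []);

  return <Webcam ref={webcamRef} style={{ width: 640, height: 480 }} />;
}

export default App;
```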

While playing around with this code, I found the function that actually generates the pose estimations on an interval as the user moves through the frame.

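It follows the standard PoseNet pattern - roughly this (a sketch, not the tutorial's verbatim code): grab the current frame from the webcam ref and run a single-person estimation on it.

```jsx
const detect = async (net) => {
  // Only run estimation once the webcam stream is mounted and has data.
  if (webcamRef.current && webcamRef.current.video.readyState === 4) {
    const video = webcamRef.current.video;

    // Single-person pose estimation on the current frame.
    const pose = await net.estimateSinglePose(video);

    // For now, just log each estimation to see what we're working with.
    console.log(pose);
  }
};
```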

To figure out how I could work with these estimations, I first printed each one to the console, just to get a sense of what I was looking at. PoseNet estimations take the following shape:

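The exact values below are illustrative, but the structure matches PoseNet's single-pose output:

```js
{
  score: 0.97, // overall confidence for the whole pose
  keypoints: [
    { part: "nose",          position: { x: 301.4, y: 142.8 }, score: 0.99 },
    { part: "leftShoulder",  position: { x: 250.7, y: 248.2 }, score: 0.95 },
    { part: "rightShoulder", position: { x: 355.1, y: 251.9 }, score: 0.96 },
    { part: "leftElbow",     position: { x: 231.3, y: 329.6 }, score: 0.92 },
    // ...and so on for all 17 keypoints (eyes, ears, wrists, hips, knees, ankles)
  ]
}
```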

Each body part gets an x and y value representing its position on the canvas, plus an associated confidence score. And I get a new set of coordinates for every body part with each PoseNet estimation as the user moves through the frame.
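So pulling out any particular joint from an estimation is just a lookup. For example, with `pose` being one estimation from the detect() function above (getPart is a hypothetical helper of my own, not part of PoseNet):

```js
// Hypothetical helper: find one body part's keypoint in an estimation.
const getPart = (pose, partName) =>
  pose.keypoints.find((kp) => kp.part === partName);

const leftShoulder = getPart(pose, "leftShoulder");
console.log(leftShoulder.position.x, leftShoulder.position.y, leftShoulder.score);
```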

Creating Coach LLaMa 🧢🦙

Next, I had to figure out how to use these coordinates to evaluate someone's form. At first, I thought of downloading thousands of videos of "good lifts", running PoseNet over them, and training a model on those patterns to evaluate form. While that's probably the smarter approach, I didn't want to sink all that time into training a model so early in the project. So I tried a scrappier approach to start with:

  1. When the user "starts recording", save all the coordinates of relevant body parts to a list, and convert that list to a string. (Stop adding new coordinates when the user "stops recording".)