Recently, I have spent a lot of time on the Fast.ai Practical Deep Learning for Coders course, and at the same time I have been delving more deeply into the inner structures and workings of neural networks with Jovian's Deep Learning: Zero to GANs. Working through both at once, bouncing back and forth between the two like a model with too high a learning rate, and immersing myself in papers on convolutional layers, how they work down to the tiniest details, and so many other aspects of neural networks, has given me a deep relationship with what I consider to be an art form. Even so, I was quite amazed at the methods used in this project and the accuracy of the results.
This project is one of two discussed in chapter 6 of Fast.ai's book for the course. At first, based on the concept, I was not sure exactly how regression with these images would work. I was so accustomed to specific types of models being used for specific tasks. But as I worked through the process, I quickly saw it all come together, and it made complete sense. The effectiveness of the model was rather mind-blowing. So here is my interpretation and walkthrough of the project, which I did in order to more fully ingrain these concepts.
This dataset is a collection of over 15,000 images of 20 different men and women with their heads in various positions. It includes typical RGB images as well as depth images and annotations for each image. For the purposes of this project, I will be using the RGB images as input, and the target is the very center of the subject's head/face, i.e. a single coordinate point. So the model is basically predicting two numbers for each image: the horizontal coordinate and the corresponding vertical coordinate of the center of the person's face.
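To make that target concrete, here is a minimal sketch of how a pixel coordinate can be mapped into a normalized range and back. It assumes the convention fastai uses for point targets, where coordinates are rescaled to run from -1 to 1 across the image. The function names and the 640x480 image size are made up for illustration and do not come from the notebook.

```python
def to_unit_range(x_px, y_px, width, height):
    """Map pixel coordinates to [-1, 1] on each axis (assumed convention)."""
    return 2 * x_px / width - 1, 2 * y_px / height - 1


def to_pixels(x, y, width, height):
    """Inverse mapping: from [-1, 1] back to pixel coordinates."""
    return (x + 1) * width / 2, (y + 1) * height / 2


# Hypothetical 640x480 image with the head center exactly in the middle.
center = to_unit_range(320, 240, 640, 480)
print(center)  # (0.0, 0.0)
```

With targets in this form, the regression task is just predicting a pair of small numbers, which is why an ordinary image model with a two-value output can handle it.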
Prepping the image-label data took slightly more work than the typical computer vision projects I have worked on before, due to the way the original files are organized and the very nature of the project itself. But that soon became a non-issue, and I was on my way to predicting the center of each subject's head with great precision.
I achieved great results with Resnet-34 on this project, with a final validation loss of 0.000064, and that took only 5 epochs and a little over 10 minutes of training. I experimented a bit further to see if I could get even better results, but this ended up being the best I saw from Resnet-34.
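For a sense of scale, here is a rough sketch of what a loss like that means. It assumes the loss is mean squared error on coordinates normalized to the range -1 to 1, which is the usual fastai setup for point regression but is an assumption here, since the loss function is not named above. The `mse` helper, the sample points, and the 320-pixel image width are purely illustrative.

```python
import math


def mse(preds, targets):
    """Mean squared error over every x and y value in the batch."""
    squared = [
        (p - t) ** 2
        for pred, target in zip(preds, targets)
        for p, t in zip(pred, target)
    ]
    return sum(squared) / len(squared)


# Made-up predictions that are each off by 0.008 on both axes.
targets = [(0.10, -0.20), (0.00, 0.30)]
preds = [(0.108, -0.192), (0.008, 0.308)]
print(mse(preds, targets))  # roughly 0.000064

# The square root of the loss gives a typical per-coordinate error.
rmse = math.sqrt(0.000064)  # about 0.008 in normalized units

# Assumption: an image 320 pixels wide spans 2 normalized units.
error_px = rmse * 320 / 2  # about 1.3 pixels
print(rmse, error_px)
```

Under those assumptions, a loss of 0.000064 corresponds to a typical miss of about 0.4% of the image width or height.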
Below are images of the predicted center points and the actual centers. It is a little tricky to see the tiny red dot marking the center. Zoom in for clarity.
I then decided to try Resnet-101 as well, which in hindsight did not make much sense given the data. But on every project I do, it is hard to stop myself from giving in to my curiosity. The results did not outshine Resnet-34; in fact, the final validation loss was slightly higher. I found this surprising. If I had experimented further, I might have gotten it at least to the level Resnet-34 achieved. But with such good results on Resnet-34, I just did not see the purpose in beating this poor horse any further (Wow, did I really just use that metaphor? OUCH! I would backspace it, but...)
Overall, I found this project very enlightening, and it opened my eyes to so many more ways of using models than I had previously considered. I look forward to future projects of this nature and unexpected success in applying models that I might not have considered before this project.
Or you can scroll through the notebook here 👇