If you’re anything like me, you’ve probably heard all the hype about AI, machine learning, neural networks, etc., but have no idea how any of it works. That’s okay, join the club.
I used virtual images to train a computer to recognize real-world objects.
This is an extension of Edje Electronics’ tutorial, where he re-trains a neural network to classify playing cards. I wanted to expand his tutorial to include the entire card deck… but with a twist. Instead of taking thousands of images of cards and manually labeling them, I used the Unreal Engine to take virtual images and auto-label them with the appropriate tag.
(It’ll drain your soul)
The bounding box is determined by converting the object’s max/min bounds in 3D space into 2D screen-space bounds.
(Epic did most of the work)
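For anyone curious what that 3D-to-2D conversion looks like, here is a minimal Python sketch of the idea, not the engine setup itself. It assumes a simple pinhole camera with the box already in camera space (x right, y up, z forward, which is not Unreal's own axis convention), and the names `project_point`, `screen_bbox`, and `focal_px` are made up for illustration. It projects the eight corners of the 3D box and takes the min/max of the results.

```python
from itertools import product


def project_point(point, focal_px, width, height):
    """Pinhole projection of a camera-space point to pixel coordinates.

    Assumed convention (hypothetical, not Unreal's): x right, y up, z forward.
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("point is behind the camera")
    u = width / 2 + focal_px * x / z
    v = height / 2 - focal_px * y / z  # image y grows downward
    return u, v


def screen_bbox(bounds_min, bounds_max, focal_px, width, height):
    """Return (xmin, ymin, xmax, ymax) in pixels for a 3D axis-aligned box."""
    # Pair up (min, max) per axis, then take every combination: 8 corners.
    corners = product(*zip(bounds_min, bounds_max))
    pixels = [project_point(c, focal_px, width, height) for c in corners]
    us = [p[0] for p in pixels]
    vs = [p[1] for p in pixels]

    def clamp(value, upper):
        # Keep the label inside the image even if the box pokes off-screen.
        return max(0.0, min(float(upper), value))

    return (
        clamp(min(us), width),
        clamp(min(vs), height),
        clamp(max(us), width),
        clamp(max(vs), height),
    )


if __name__ == "__main__":
    print(screen_bbox((-1, -1, 4), (1, 1, 6), focal_px=100, width=640, height=480))
```

All eight corners are needed, not just the min and max points, because after perspective projection the extreme pixels can come from any corner of the box.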
The cards (and their corresponding tags/labels) are updated in real time, so you can modify the type of card, location, backdrop, etc. as you need.
I tried to automate the system with a camera rig that can capture as many images as needed. It lets you define certain behaviors (number of rotations, pitch increments, distance from object, etc.), and you can set whether a random range should be added to each setting. I also added some “rules” to ensure you get “good” images each time, which helps cut down on the post-processing. In this instance, the camera does a ray trace to each corner of the card; if it can reach 3 of the 4, it’s a “good” image. This allows the system to track cards that are partially occluded by an object (like your hand).
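To make the rig and the 3-out-of-4 rule a bit more concrete, here is a rough Python sketch under some loud assumptions: the camera orbits a target at the origin, "number of rotations" is read as the number of yaw stops per orbit, occluders are plain spheres, and a segment-vs-sphere test stands in for the engine's ray trace. Every name here (`camera_poses`, `segment_hits_sphere`, `is_good_image`) is hypothetical.

```python
import math
import random


def camera_poses(rotations, pitch_steps, pitch_increment_deg, distance,
                 jitter=0.0, rng=None):
    """Yield camera positions orbiting the origin (assumed target location)."""
    rng = rng or random.Random()
    for p in range(1, pitch_steps + 1):
        pitch = math.radians(p * pitch_increment_deg)
        for r in range(rotations):
            yaw = math.radians(360.0 * r / rotations)
            # Optional random range added to the configured distance.
            d = distance + rng.uniform(-jitter, jitter)
            yield (
                d * math.cos(pitch) * math.cos(yaw),
                d * math.cos(pitch) * math.sin(yaw),
                d * math.sin(pitch),
            )


def segment_hits_sphere(start, end, center, radius):
    """True if the straight segment start->end passes through the sphere."""
    d = [e - s for s, e in zip(start, end)]
    f = [s - c for s, c in zip(start, center)]
    length_sq = sum(x * x for x in d)
    if length_sq == 0:
        return sum(x * x for x in f) <= radius * radius
    # Parameter of the point on the segment closest to the sphere center.
    t = -sum(a * b for a, b in zip(f, d)) / length_sq
    t = max(0.0, min(1.0, t))
    closest = [fi + t * di for fi, di in zip(f, d)]
    return sum(x * x for x in closest) <= radius * radius


def is_good_image(camera, corners, occluders, min_visible=3):
    """Apply the rule: at least min_visible card corners must be unobstructed.

    occluders is a list of (center, radius) spheres, a stand-in for scene geometry.
    """
    visible = sum(
        not any(segment_hits_sphere(camera, corner, c, r) for c, r in occluders)
        for corner in corners
    )
    return visible >= min_visible


if __name__ == "__main__":
    card = [(1, 1.5, 0), (1, -1.5, 0), (-1, 1.5, 0), (-1, -1.5, 0)]
    hand = [((1, 1.5, 1), 0.5)]  # blocks one corner when viewed from above
    for cam in camera_poses(rotations=4, pitch_steps=2,
                            pitch_increment_deg=30, distance=10.0, jitter=0.5):
        print(cam, is_good_image(cam, card, hand))
```

Keeping the rule as a separate function means the threshold (3 of 4 here) can be tightened or loosened without touching the pose generation.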
Here’s a comparison of a virtual image and a real one:
It seems to work fairly well (I was honestly surprised it worked at all).
I’ll try a more complicated scenario (object) now that I have a sort of benchmark to test with. I also want to test how “real” the images need to be and how well I can match that in the engine.
I also need to post some of the training data to the TensorFlow forums(?). There were definitely some differences between training on synthetic data vs. real data… but it actually seemed like it worked better. Still, I don’t know enough either way to be sure. Any feedback or critiques are absolutely welcome. I don’t really know what I did or why I did it. It just seemed like an interesting experiment.