I’ve created an autoencoder-based image processing model that accepts a 256×256 pixel RGB image and outputs a 256×256 pixel RGB image. One use of this model is single-class segmentation: I’ve successfully trained it to identify people and generate a black and white output image in which every pixel the model believes is part of a person is white and every other pixel is black.
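As a rough illustration of how such an output can be turned into a clean mask, here is a minimal sketch. It assumes the model emits floats in the range 0 to 1 that are only approximately black or white; the function names and the 0.5 threshold are hypothetical choices for this example, not part of the model itself.

```python
def to_binary_mask(output, threshold=0.5):
    """Convert an H x W x 3 output (floats in [0, 1]) into an H x W boolean mask.

    A pixel counts as "person" (True) when the mean of its three channels
    exceeds `threshold`; otherwise it counts as background (False).
    The threshold value is an assumption made for this sketch.
    """
    return [[(sum(pixel) / 3.0) > threshold for pixel in row] for row in output]


def mask_to_rgb(mask):
    """Render a boolean mask as a black and white RGB image with 0-255 values."""
    white, black = (255, 255, 255), (0, 0, 0)
    return [[white if is_person else black for is_person in row] for row in mask]


# Tiny 2 x 2 stand-in for a real 256 x 256 x 3 output.
example_output = [
    [(0.9, 0.8, 0.95), (0.1, 0.2, 0.05)],
    [(0.4, 0.5, 0.45), (0.7, 0.6, 0.65)],
]
example_mask = to_binary_mask(example_output)
example_image = mask_to_rgb(example_mask)
```

Keeping the boolean mask separate from the rendered image makes it easier to reuse the mask later, for example to decide which pixels a filter should touch.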
The model’s success at this segmentation task indicates that it identifies features and makes decisions based on them. This feature extraction and processing is carried out in the model’s encoder, which uses a special combination of feature extraction techniques. The decoder is much simpler: it just transforms the encoder output into the desired 256×256×3 image output. The decoder can be replaced with other submodels so that the encoder output can be used for tasks such as classification, multi-class segmentation, and style transfer.
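The idea of keeping one encoder and swapping what sits behind it can be sketched structurally as follows. Everything here is a toy stand-in: `toy_encoder`, the two heads, and `build_model` are hypothetical names invented for this example, and the "features" are just per-channel means rather than anything the real encoder computes.

```python
from typing import Callable, List, Sequence, Tuple

Pixel = Tuple[float, float, float]
Image = Sequence[Sequence[Pixel]]
Features = List[float]


def toy_encoder(image: Image) -> Features:
    """Hypothetical stand-in encoder: the mean of each colour channel."""
    pixels = [pixel for row in image for pixel in row]
    return [sum(p[c] for p in pixels) / len(pixels) for c in range(3)]


def toy_classification_head(features: Features) -> str:
    """Hypothetical head: names the dominant colour channel."""
    names = ["red", "green", "blue"]
    return names[features.index(max(features))]


def toy_brightness_head(features: Features) -> float:
    """Hypothetical head: a single overall brightness score."""
    return sum(features) / len(features)


def build_model(encoder: Callable, head: Callable) -> Callable:
    """Compose a shared encoder with whichever head the task needs."""
    def model(image):
        return head(encoder(image))
    return model


# The same encoder is reused; only the head changes per task.
classifier = build_model(toy_encoder, toy_classification_head)
brightness = build_model(toy_encoder, toy_brightness_head)
```

In a real deep learning framework the heads would be trainable submodels, but the composition pattern is the same: the encoder is shared and only the final stage is exchanged.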
For the single-class segmentation task I’ve trained it for, the model has 9 million trainable parameters and needs 4 billion FLOPs (floating point operations) to process an image, which makes it a good fit for mobile applications. For comparison, a Vortal photo filter applied to a high resolution 3600×4800×3 image could use trillions of FLOPs. With this new model, I could develop a feature for Vortal that lets a user choose not to apply the Vortal transformation to automatically detected people in the photo or, alternatively, to apply it only to the people.
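One way such a feature could work is to blend the original and filtered photos using the person mask. The sketch below is only an assumption about how this might be wired up: it supposes the low resolution mask is stretched to the photo size with nearest-neighbour sampling, and the names `upscale_mask_nearest`, `composite`, and `filter_people` are hypothetical.

```python
def upscale_mask_nearest(mask, out_height, out_width):
    """Stretch a small boolean mask to out_height x out_width (nearest neighbour)."""
    in_height, in_width = len(mask), len(mask[0])
    return [
        [
            mask[y * in_height // out_height][x * in_width // out_width]
            for x in range(out_width)
        ]
        for y in range(out_height)
    ]


def composite(original, filtered, person_mask, filter_people):
    """Pick each pixel from `filtered` or `original` according to the mask.

    filter_people=True  -> only person pixels take the filtered value.
    filter_people=False -> person pixels are left untouched; the rest is filtered.
    All three inputs must share the same height and width.
    """
    return [
        [
            filt if is_person == filter_people else orig
            for orig, filt, is_person in zip(orig_row, filt_row, mask_row)
        ]
        for orig_row, filt_row, mask_row in zip(original, filtered, person_mask)
    ]


# Toy data: a 1 x 2 mask stretched over a 2 x 4 "photo" of placeholder pixels.
small_mask = [[True, False]]
full_mask = upscale_mask_nearest(small_mask, 2, 4)
original = [["o"] * 4 for _ in range(2)]
filtered = [["f"] * 4 for _ in range(2)]

people_untouched = composite(original, filtered, full_mask, filter_people=False)
people_only = composite(original, filtered, full_mask, filter_people=True)
```

A hard mask like this gives sharp edges around each person; a production feature would likely want a softer blend near the boundary, which this sketch does not attempt.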