Computer Vision in One Hour
Half of computer vision is about 3D reconstruction; the other half is about recognition. Reconstruction uses many images of the same scene to build a 3D model of it. Recognition labels images with words that describe what they depict.

Reconstruction: Two recent developments, both with Microsoft Research involvement, have brought 3D reconstruction into the mainstream. The Building Rome in a Day project produced a system that reconstructs the 3D shape of famous buildings from tourist pictures harvested from the web. The KinectFusion project takes relatively noisy, narrow-field-of-view depth images from a $200 depth sensor made by the Israeli company PrimeSense and stitches them into gorgeous, detailed, and accurate 3D models in real time. We will discuss the underlying technologies, achievements, trade-offs, and opportunities. We will also clarify why it is still useful to do reconstruction the hard way, from 2D images, and why reconstruction is still needed even with a 3D sensor.

Recognition: Recognition is less mature than reconstruction. In 2012, a recognition system based on deep convolutional networks improved impressively on the state of the art in the ImageNet competition. Why performance is so good is still an open question, but there is no doubt that deep networks have been shown empirically to be among the most promising architectures for image recognition. We will scratch the surface of these techniques and contrast them with other approaches.