Computer Vision in One Hour

    Half of computer vision is about 3D reconstruction, and the other half is recognition. Reconstruction uses many images of the same scene to build a 3D model of it. Recognition labels images with words that describe what the images depict.

    Reconstruction: Two recent projects involving Microsoft Research have brought 3D reconstruction into the mainstream. The project on Building Rome in a Day produced a system that reconstructs the 3D shape of famous buildings from tourist pictures harvested from the web. The KinectFusion project takes relatively noisy and narrow-field-of-view depth images from a $200 depth sensor made by Israeli company PrimeSense and stitches them into gorgeous, detailed, and accurate 3D models in real time. We will discuss underlying technologies, achievements, trade-offs, and opportunities. We will also clarify why it is still useful to do reconstruction the hard way, from 2D images, and why one still needs to do reconstruction even with a 3D sensor.

    Recognition: Recognition is less mature than reconstruction. In 2012, a recognition system based on deep convolutional networks produced an impressive performance improvement over the state of the art in the ImageNet competition. Why these networks perform so well is still an open question, but there is no doubt that deep networks have been shown empirically to be among the most promising architectures for image recognition. We will scratch the surface of these techniques and contrast them with other approaches.

    Required Readings:

  • Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M. Seitz, and Richard Szeliski. Building Rome in a Day. Communications of the ACM 54(10):105-112, October 2011.
  • Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. Proceedings of the International Symposium on Circuits and Systems, 253-256, 2010.

    Optional Readings:

  • Project page for Building Rome in a Day
  • Project page for KinectFusion
  • Carlo Dal Mutto, Pietro Zanuttigh, and Guido M. Cortelazzo. Microsoft Kinect Range Camera. Chapter 3, pages 33-47 in Time-of-Flight Cameras and Microsoft Kinect, Springer Briefs in Electrical and Computer Engineering, 2012.
  • Carlo Tomasi. Visual Reconstruction: Technical Perspective. Communications of the ACM 54(10):104, October 2011.
  • Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, and Andrew Fitzgibbon. KinectFusion: Real-time 3D Reconstruction and Interaction Using a Moving Depth Camera. ACM Symposium on User Interface Software and Technology, October 2011.
  • Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems (NIPS), 1097-1105, December 2012.
  • Li Deng and Dong Yu. Deep Learning: Methods and Applications. Foundations and Trends in Signal Processing 7(3-4):197-387, 2014.
  • Home page for the 2014 ImageNet competition