From Visual Recognition to Visual Understanding
This talk gives an overview of some of our recent work on visual recognition and visual understanding.
The first part of the talk considers visual recognition. It presents results of experiments in which we train convolutional networks on billions of weakly supervised web images. These experiments reveal the benefits of this type of training: for example, we report the highest ImageNet-1k single-crop, top-1 accuracy to date: 85.4% (increasing to 86.4% after resolution correction).
The second part of the talk raises the question of whether recent successes in visual recognition also pave the way towards visual understanding. It highlights the challenges of visual understanding by uncovering representation biases in current image classification, visual question answering, and image captioning evaluations. To address these problems, we developed the BISON and CLEVR benchmarks in an attempt to provide better tools for studying visual understanding. Finally, the talk presents a benchmark, called PHYRE, that builds on ideas from CLEVR but is intended for the study of systems that possess physical understanding.
Laurens van der Maaten is a Research Scientist at Facebook AI Research in New York. Previously, he worked as an Assistant Professor at Delft University of Technology (The Netherlands) and as a post-doctoral researcher at the University of California, San Diego. He received his PhD from Tilburg University (The Netherlands) in 2009. With collaborators from Cornell University, he won the Best Paper Award at CVPR 2017. He is an editorial board member of IEEE Transactions on Pattern Analysis and Machine Intelligence and regularly serves as an area chair for the NeurIPS, ICML, and CVPR conferences. Laurens is interested in a variety of topics in machine learning and computer vision.