Syllabus

The textbook for this course is Computer Vision — Models, Learning, and Inference by Simon Prince, Cambridge University Press, 2012. Numbers in the topic bars below are page numbers in the textbook. Additional materials will be posted below as appropriate, and a '+' will then be added to the relevant page range. Please see the book's accompanying web page for a PDF version of the book, a list of errata, and complementary material.

This syllabus is a plan, not a commitment. Depending on class interest and the time needed to cover the various topics, it may be necessary to skip some of the topics below. Page numbers in the topic bars will be updated to reflect these choices, and the topic in color is the topic currently being covered.

Paper references in the topics below are specified through their Digital Object Identifier, when available, or through a link that recognizes Duke affiliation. These links will get you to the full article if you or your institution have proper access privileges. For Duke students, this typically means that the link will work from a Duke computer, but not from elsewhere.

To separate the conceptual aspects of parts I and II of the textbook from their technical aspects, we cover chapters 2-6 in the following order: 2, 6.1, 6.2 (with aspects of 9.1 and 9.12.1), 4.1, 4.2, 4.3, 3, 5, remainder of 4, remainder of 6. We skip chapters 8 and 9. After covering chapters 7 and 13, we come back to the parts we need from chapters 8 and 9 while reading chapter 20.


Recognition is half of computer vision, and the other half is reconstruction. We briefly review recent success stories in visual recognition and the conceptual history of the field. Many examples of what works well today are based on machine learning (ML). After examining the key steps in ML, we introduce a broad taxonomy of ML approaches to computer vision. Specifically, we consider classification and regression problems, and the differences between supervised, unsupervised, semi-supervised and weakly supervised learning. These slides from the first lecture may jog your memory about some of the topics we discussed.

Probability is a prerequisite for this course. If you have never had probability, this refresher may be too fast for you. Chapters 2-5 are a blend of general probabilistic concepts and specific formulas for basic distributions. You need to understand the former and have a general sense for the latter, but you won't need to memorize formulas. An early homework assignment will help you determine if this course fits your background. These notes may help with the basic definitions for probabilities, both discrete and continuous. If you understand chapter 2 of the book as is, you need not read these notes. Here are the slides I used in some of the lectures.

This module is a preview of things to come, and outlines a taxonomy of machine learning problems and approaches for computer vision.

In particular, learning, or training, is the problem of estimating the parameters of a model that relates a world quantity of interest to what we observe in the images. Once these parameters are determined, the model is used to garner information about the world quantity from a particular image measurement. This second step is called inference.

The estimation problem is called classification when the set of possible values for the world quantity is discrete (and typically finite), and regression when the set is continuous. Either type of model is said to be deterministic when it is in the form of a function, and probabilistic when it is expressed through probability distributions. Specifically, a discriminative probabilistic model describes the posterior probability of the world quantity given the image, and a generative probabilistic model describes the joint probability of world and image.
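In symbols, a brief sketch in the textbook's style of notation (with w for the world quantity, x for the image measurement, and theta for the model parameters):

    \text{discriminative:}\quad Pr(w \mid x, \theta)
    \text{generative:}\quad Pr(x, w \mid \theta) = Pr(x \mid w, \theta)\,Pr(w)
    \text{inference with a generative model (Bayes' rule):}\quad
    Pr(w \mid x, \theta) = \frac{Pr(x \mid w, \theta)\,Pr(w)}{\sum_{w'} Pr(x \mid w', \theta)\,Pr(w')}

With a discriminative model the posterior is modeled directly; with a generative model, inference requires turning the joint distribution around with Bayes' rule.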

A short note integrates material in chapter 4 about conjugacy for categorical and Dirichlet distributions.
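For orientation, the conjugacy result that the note develops has the following shape (a sketch only; lambda denotes the categorical parameters, alpha the Dirichlet hyperparameters, and N_k the number of training samples observed in category k):

    Pr(\lambda) = \mathrm{Dir}(\lambda;\ \alpha_1,\dots,\alpha_K) \propto \prod_{k=1}^{K} \lambda_k^{\alpha_k - 1}
    Pr(x_{1:N} \mid \lambda) = \prod_{k=1}^{K} \lambda_k^{N_k}
    Pr(\lambda \mid x_{1:N}) \propto \prod_{k=1}^{K} \lambda_k^{\alpha_k + N_k - 1},
    \quad\text{so}\quad Pr(\lambda \mid x_{1:N}) = \mathrm{Dir}(\lambda;\ \alpha_1 + N_1,\dots,\alpha_K + N_K).

Because the posterior is again a Dirichlet distribution, learning amounts to adding the observed category counts to the prior hyperparameters.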

Probability densities based on hidden variables add richness to probabilistic models. Specifically, mixtures of Gaussians introduce multiple modes, Student's t distribution accommodates data outliers, and factor analysis models distributions that live in small subspaces of larger spaces. Expectation Maximization (EM) is the key algorithm for estimating the parameters of these densities from data.
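To make the E and M steps concrete, here is a minimal numerical sketch of EM for a two-component, one-dimensional mixture of Gaussians; the synthetic data and all variable names are illustrative assumptions, not course code.

    # Minimal EM sketch for a two-component 1-D mixture of Gaussians.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 1.0, 300)])

    # Initialize mixture weights, means, and variances.
    pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

    def gauss(x, mu, var):
        return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

    for _ in range(100):
        # E step: posterior responsibility of each component for each sample.
        r = pi * gauss(x[:, None], mu, var)          # shape (N, 2)
        r /= r.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from the responsibilities.
        Nk = r.sum(axis=0)
        pi = Nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

    print(pi, mu, var)   # recovered weights, means, and variances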

Simple applications to face detection, recognition, and pose estimation as well as to object recognition and image segmentation illustrate the role of hidden variables and the use of EM. Results won't be great, but even complex techniques build on simple concepts.

Regression learns functions that estimate the values of continuously varying parameters from data. For example, one might want to construct a function that estimates the joint angles of a human body from its silhouette in an image. In the simplest case, a deterministic linear function is learned. To account for model uncertainty, a more general model envisions the function parameters as random variables instead.
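As a minimal illustration of the learning and inference steps for deterministic linear regression, the following sketch fits a linear function by least squares; the synthetic data and all names are assumptions made here for illustration.

    # Least-squares linear regression: learn parameters, then predict.
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, size=(100, 3))               # 100 samples, 3 input dimensions
    w_true = np.array([2.0, -1.0, 0.5])
    y = X @ w_true + 0.3 + 0.1 * rng.normal(size=100)   # linear model plus noise

    A = np.hstack([X, np.ones((100, 1))])               # column of ones for the offset
    w_hat, *_ = np.linalg.lstsq(A, y, rcond=None)       # learning: estimate the parameters
    y_new = np.array([[0.2, 0.1, -0.4, 1.0]]) @ w_hat   # inference: predict for a new input

The Bayesian variant would instead place a prior on the parameter vector and compute a posterior distribution over it, rather than a single estimate.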

Nonlinear regression functions can be built as linear combinations of fixed nonlinear transformations of the input data. In the so-called dual representation, the regression parameters are in turn expressed as linear combinations of the transformed training samples. The resulting (still linear) regression problem can be written entirely in terms of inner products of the transformed data points, without computing the nonlinear transformations explicitly. This is called the kernel trick, and the resulting regression method is called Gaussian process regression.
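The sketch below computes the predictive mean of such a kernelized (Gaussian process) regressor with a radial basis function kernel; the kernel width, noise level, and synthetic data are illustrative assumptions. Note that both training and prediction touch the data only through kernel evaluations.

    # Kernel (Gaussian process) regression sketch with an RBF kernel.
    import numpy as np

    def rbf(A, B, ell=0.5):
        # k(a, b) = exp(-|a - b|^2 / (2 ell^2)) for all pairs of rows of A and B
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / ell ** 2)

    rng = np.random.default_rng(2)
    X = rng.uniform(-3, 3, size=(50, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

    sigma2 = 0.01                                   # assumed observation noise variance
    K = rbf(X, X) + sigma2 * np.eye(50)
    alpha = np.linalg.solve(K, y)                   # dual parameters: one per training sample

    X_test = np.linspace(-3, 3, 10)[:, None]
    y_pred = rbf(X_test, X) @ alpha                 # predictions use only kernel evaluations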

Better generalization can be obtained by imposing a prior on the coefficients of the regression function. This prior, constructed as a product of one-dimensional t-distributions, has high ridges along the axis directions in parameter space and is low elsewhere, and therefore favors sparse regressors, that is, linear combinations with few nonzero parameters. The resulting method is called sparse regression when sparsity is imposed in the primal space of regression-function coefficients, and relevance vector regression when it is imposed in the dual space of training samples.
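In symbols, the sparsity-inducing prior has a product form along the following lines (a sketch only; phi denotes the vector of D coefficients, the unit scale is an assumption here, and nu is the degrees of freedom of each one-dimensional Student's t factor):

    Pr(\boldsymbol{\phi}) \;=\; \prod_{d=1}^{D} \mathrm{Stud}\!\left(\phi_d;\ 0,\ 1,\ \nu\right)

Because each factor is heavy-tailed and sharply peaked at zero, the product places most of its mass along the coordinate axes, which is why maximum a posteriori solutions tend to have many coefficients at or near zero.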

Regression techniques are demonstrated on the estimation of human body pose from images and on video motion estimation.

Classification learns functions that associate discrete values with data. These values are typically interpreted as class labels. One of the simplest classifiers is based on what is called logistic regression, which leads to a linear discrimination boundary and to a convex training problem. Nonlinear classifiers can be built as linear classifiers of nonlinear functions of the data, and the dual representation, kernel trick, and sparsification ideas used for regression apply to classification as well. The resulting classifier is called a relevance vector classifier.
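The following sketch trains a logistic-regression classifier by gradient ascent on the log-likelihood, using synthetic two-dimensional data; the step size, iteration count, and names are illustrative assumptions.

    # Logistic regression by gradient ascent on the log-likelihood.
    import numpy as np

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
    X = np.hstack([X, np.ones((200, 1))])           # append 1 for the offset term
    w = np.r_[np.zeros(100), np.ones(100)]          # class labels in {0, 1}

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    phi = np.zeros(3)                               # classifier parameters
    for _ in range(500):
        p = sigmoid(X @ phi)                        # predicted Pr(label = 1 | x)
        grad = X.T @ (w - p)                        # gradient of the log-likelihood
        phi += 0.01 * grad                          # gradient ascent step

The decision boundary, where X @ phi = 0, is linear because the activation passed to the sigmoid is a linear function of the input.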

Rather than relying on suitable priors, one can also achieve sparsity through the greedy method of adding one data dimension (primal formulation) or one data sample (dual formulation) at a time and stopping when the classifier is good enough. This approach leads to space- and time-efficient learning algorithms, and is called incremental fitting in the primal space and boosting in the dual space.

Complex classification boundaries can also be obtained without kernels by partitioning the data space into nested regions and using a different classifier in each region. This idea underlies classification trees and forests.
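The following sketch shows the core step of growing such a tree: exhaustively searching for the axis-aligned split that best separates two classes, after which the same procedure would be applied recursively within each region. All data and names here are illustrative.

    # Best axis-aligned split for a two-class problem (the core of tree growing).
    import numpy as np

    def best_split(X, labels):
        """Return (dimension, threshold) minimizing total misclassification error."""
        best = (None, None, np.inf)
        for d in range(X.shape[1]):
            for t in np.unique(X[:, d]):
                left, right = labels[X[:, d] <= t], labels[X[:, d] > t]
                # Each region predicts its majority label; count the mistakes.
                err = sum(min(np.sum(part == 0), np.sum(part == 1)) for part in (left, right))
                if err < best[2]:
                    best = (d, t, err)
        return best[0], best[1]

    rng = np.random.default_rng(4)
    X = rng.uniform(0, 1, size=(200, 2))
    labels = (X[:, 0] > 0.6).astype(int)           # ground truth depends on dimension 0
    print(best_split(X, labels))                   # should recover a split near (0, 0.6)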

Classification techniques are demonstrated on classifying face images as male or female, on detecting faces or pedestrians in images, and through a study of the techniques the Kinect system uses to recognize human body parts in depth images.

Regression and classification systems that use raw image pixels as their input are sensitive to changes in viewing parameters and potentially inefficient. In most successful vision systems, the input to a regressor or classifier is not the raw image, but a set of features, that is, of quantities that describe either the whole image or part of it. Features reduce data dimensionality and, more importantly, they attempt to preserve what is relevant to the task and to discard what is not. Several descriptors frequently used in computer vision will be discussed. A note on histogram equalization derives the surprisingly simple formula for this operation, and a note on image filtering goes a bit deeper into convolution and image differentiation. A brief note on Gaussian and Laplacian pyramids introduces these useful structures in a straightforward manner. A paper by David Lowe describes the SIFT image feature (only section 6.1 is required reading). A paper by Navneet Dalal and Bill Triggs describes the HoG feature.
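As one small, concrete example of a feature computation, the sketch below implements the standard cumulative-histogram mapping for histogram equalization, the formula that the note derives; the toy image and names are assumptions for illustration.

    # Histogram equalization: map each pixel through the normalized cumulative histogram.
    import numpy as np

    def equalize(image):
        """image: 2-D array of integer gray levels in [0, 255]."""
        hist = np.bincount(image.ravel(), minlength=256)
        cdf = np.cumsum(hist) / image.size            # cumulative distribution in [0, 1]
        return (255 * cdf[image]).astype(np.uint8)    # look up each pixel's new level

    rng = np.random.default_rng(5)
    dark = rng.integers(0, 64, size=(100, 100))       # a low-contrast toy "image"
    print(equalize(dark).max())                       # output now spans nearly the full range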

Visual words are prototypical image features; a bag of features describes a whole image or a part of it by counting how often each visual word occurs in it. Techniques for learning visual words and summarizing images with them are inherited from document retrieval and have become pervasive in computer vision. Constellation models start from similar descriptors, but also account for the relative positions of the features in the image.
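A sketch of this pipeline follows, with random vectors standing in for real descriptors such as SIFT; the codebook size and all names are assumptions made here for illustration.

    # Bag of visual words: learn a codebook with k-means, then histogram word assignments.
    import numpy as np

    rng = np.random.default_rng(6)
    train_desc = rng.normal(size=(1000, 128))          # descriptors pooled from training images
    K = 8                                              # codebook size (number of visual words)

    # Plain k-means: alternate nearest-word assignment and centroid update.
    words = train_desc[rng.choice(len(train_desc), K, replace=False)]
    for _ in range(20):
        d = ((train_desc[:, None, :] - words[None, :, :]) ** 2).sum(-1)
        nearest = d.argmin(axis=1)
        words = np.array([train_desc[nearest == k].mean(axis=0) if np.any(nearest == k)
                          else words[k] for k in range(K)])

    def bag_of_words(desc):
        """Histogram of nearest visual words for one image's descriptors."""
        d = ((desc[:, None, :] - words[None, :, :]) ** 2).sum(-1)
        return np.bincount(d.argmin(axis=1), minlength=K) / len(desc)

    image_desc = rng.normal(size=(150, 128))           # descriptors from one new image
    print(bag_of_words(image_desc))                    # a K-dimensional image descriptor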

Chapter 20 of the textbook outlines the principles. An early paper by Josef Sivic and Andrew Zisserman digs into some of the realities, and a more recent paper by Pedro Felzenszwalb, David McAllester, and Deva Ramanan describes a state-of-the-art constellation model. My class notes on deformable-part models are not required reading.

While image features for computer vision are traditionally hand-crafted, some recent neural-net systems for visual classification learn their features through unsupervised methods. This leads to multi-layer or deep neural-net systems in which the first few layers detect learned features from the images at increasing levels of complexity and abstraction, and the last few layers implement a classifier. While training these systems is still a challenge, their performance has been shown empirically to improve the state of the art in classification by significant margins.

COMPSCI 527, Duke University