How to Solve Problems in Computer Vision? Preface
Latest Update: May 20, 2010
The Feynman Problem-Solving Algorithm:
(1) Write down the problem
(2) Think very hard
(3) Write down the answer
When writing the article series “How to come up with new research ideas in computer vision?”, I realized that there is another thing that is also essential to the research process, namely, “How to solve problems in computer vision?”. It may not be difficult for junior researchers (especially the imaginative ones) to come up with new research ideas. However, it often takes years of research experience to learn how to describe and analyze a problem via scientific approaches, deal with data uncertainty through statistical approaches, and realize an effective and efficient implementation using engineering approaches.
The gap between abstract thinking and problem-solving skills is often the major communication barrier between graduate students and their advisors. For example, students might feel that the feedback from the advisor is useless or impractical, while advisors might worry that students have gotten lost in technical details. In fact, these two things (i.e., knowing what to do and knowing how to do it) are both of great importance and complement each other. As the MIT motto says:
Mens et Manus (Mind and Hand) - MIT's motto
In “How to come up with new research ideas?” I addressed the Mens (mind) part. In this series, I will focus on the Manus (hand) part.
After introducing the basic concepts in computer vision, I will follow the Feynman Problem-Solving Algorithm to show how to solve problems in computer vision. The three problem-solving steps will be accompanied by various examples in the following articles. Through these examples, readers can learn how these vision problems have been addressed and how to apply the same methodologies to their own problems. In other words, I am a believer in this quote:
Note: The content of this article series is inspired by the new computer vision textbook Computer Vision: Algorithms and Applications, by Richard Szeliski.
Three Levels of Analysis in Computer Vision
David Marr, one of the pioneers in computer vision, proposed that vision (i.e., an information processing system) be understood at three different levels of analysis.
1. Computational Level
What does the system do, and why?
2. Representation and Algorithm Level
How does the system do what it wants to do? What are the representations, and what are the algorithms that build and manipulate them?
3. Implementation Level
How is the system physically realized?
Even after thirty years, Marr’s view of vision remains useful and serves as a good guide for formulating and solving problems in computer vision.
What Kinds of Problems Are There in Computer Vision?
Computer vision processes and interprets visual information (e.g., images or videos). We can roughly classify vision problems into three categories by the level of the output labels: 1) Low-level, 2) Mid-level, and 3) High-level vision.
(Note that this is my own interpretation. There may be other viewpoints on how to define low-, mid-, and high-level vision.)
The output from low-level vision problems is a label map, i.e., one label for each pixel. Mid-level vision problems produce meaningful representations from images or videos. High-level vision predicts (semantic) class labels.
Input：Visual information (e.g., image or video)
Output：A label map
Example problem | Label meaning
Depth estimation | Depth from the viewer
Figure / ground estimation | Foreground or background
Edge detection | Edge or non-edge
Segmentation | Region membership
Motion (optical flow) | Motion vector
Intrinsic image | Reflectance
Image restoration | High-resolution, clear, clean images
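To make the idea of a label map concrete, here is a minimal sketch of edge detection in plain Python. It is only an illustration under stated assumptions: the function name `edge_label_map`, the toy image, and the threshold are all hypothetical, and the simple forward-difference gradient stands in for the far more robust detectors used in practice.

```python
def edge_label_map(image, threshold):
    """Label each pixel 1 (edge) or 0 (non-edge) using forward differences."""
    height, width = len(image), len(image[0])
    labels = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # Forward differences; pixels on the last row/column get a zero gradient.
            dx = image[r][c + 1] - image[r][c] if c + 1 < width else 0
            dy = image[r + 1][c] - image[r][c] if r + 1 < height else 0
            if abs(dx) + abs(dy) > threshold:
                labels[r][c] = 1
    return labels


# Hypothetical 4x6 grayscale image: dark left half, bright right half.
image = [[10, 10, 10, 200, 200, 200] for _ in range(4)]
labels = edge_label_map(image, threshold=50)
for row in labels:
    print(row)
```

The output has exactly one label per input pixel, which is what makes this a low-level problem in the sense used here: the pixels just left of the intensity jump are marked as edges and every other pixel as non-edge.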
Output：Representation of images
Human/head pose estimation
Hand gesture representation
(Snakes / Active shape models / Active appearance models)
Texture representation and synthesis
(Bag-of-features / Spatial pyramid, i.e., coding and pooling of low-level features)
Output：Semantic label of an image
Object detection / localization
Optical character recognition
Hand-written digit recognition
Relationship between Computer Vision and Computer Graphics
From the three levels of vision problems above, we can see that all computer vision problems aim to recover the information Y that we need from the input visual information X. In other words, the goal of computer vision is to accurately estimate the conditional probability P(Y|X). Computer graphics, on the other hand, cares about P(X|Y): given the information Y, what would a realistic image X look like?
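One way to see how the two directions relate is Bayes' rule, P(Y|X) = P(X|Y) P(Y) / P(X). The sketch below is a toy illustration, not a method from this article: the class names, the observed feature values, and all the probabilities are made-up assumptions. It takes a "graphics-style" model P(X|Y) and inverts it to obtain the "vision-style" posterior P(Y|X).

```python
# Hypothetical toy world: Y is a hidden scene property, X is an observed image feature.
prior = {"cat": 0.5, "dog": 0.5}  # P(Y), assumed uniform here
likelihood = {  # P(X | Y): the "graphics" direction, with made-up numbers
    "cat": {"pointy_ears": 0.8, "floppy_ears": 0.2},
    "dog": {"pointy_ears": 0.3, "floppy_ears": 0.7},
}


def posterior(x):
    """Return P(Y | X = x) via Bayes' rule: the "vision" direction."""
    joint = {y: likelihood[y][x] * prior[y] for y in prior}  # P(X = x, Y = y)
    evidence = sum(joint.values())  # P(X = x)
    return {y: p / evidence for y, p in joint.items()}


post = posterior("pointy_ears")
print(post)
```

With these assumed numbers, observing pointy ears makes "cat" the more probable explanation. Real vision systems may also model P(Y|X) directly rather than going through P(X|Y).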
Take Project Natal as an example: computer vision attempts to recover the underlying human pose (or facial expression) from image and depth sensors, while computer graphics generates the corresponding animation (the one you saw on your TV) from the estimated pose.
Another practical example is the 3D cities in Google Earth. Computer vision reconstructs the 3D structure from a large collection of photos, while computer graphics renders realistic scenes from these 3D models.
Over the past few decades, we have witnessed increasing interaction between computer vision and computer graphics in areas such as image-based modeling and rendering, morphing, light field capture and rendering, panoramic image stitching, and computational photography.
How to Solve Problems in Computer Vision?
How do we solve these difficult inverse problems in computer vision? We will follow Richard Feynman’s Problem-Solving Algorithm and show how three different kinds of approaches, namely the scientific, statistical, and engineering approaches, are applied within it.
The Feynman Problem-Solving Algorithm
Step 1: Write down the problem.
Step 2: Think very hard.
Step 3: Write down the answer.
The problem-solving algorithm seems trivial at first glance. However, it can be widely applied to solve complex problems. Below is a tentative outline:
1. Write down the problem (Scientific)
- Low-level vision
- Mid-level vision
- High-level vision
2. Think very hard (Statistical)
- Bayesian Modeling
- Inference: Maximum Likelihood (ML), Maximum a Posteriori (MAP) and Minimum Mean Squared Error estimation (MMSE)
- Approximate inference: Monte Carlo methods, Variational methods, and (loopy) belief propagation.
- Structured prediction: Markov Random Field, Conditional Random Field, Max-Margin Markov Networks, Structured SVM, Joint Kernel Support Estimation
3. Write down the answer (Engineering)
- Implementation algorithms
- Testing: synthetic experiments, noise and deviation from the model, real-world images
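As a small preview of the inference step in the outline, the following sketch compares the ML, MAP, and MMSE estimates on a one-dimensional toy problem. Everything in it is an assumption made for illustration: the Gaussian prior and likelihood, their parameters, the single observation, and the grid-based approximation of the posterior.

```python
import math


def gaussian(x, mean, std):
    """Gaussian density N(x; mean, std^2)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Candidate values of the unknown Y on a grid from 0.0 to 10.0 (assumed range).
grid = [i * 0.01 for i in range(1001)]
x_observed = 6.0  # a single noisy measurement X (made-up value)

likelihood = [gaussian(x_observed, y, 2.0) for y in grid]  # P(X | Y = y)
prior = [gaussian(y, 3.0, 1.0) for y in grid]  # P(Y = y), up to grid scaling
unnormalized = [l * p for l, p in zip(likelihood, prior)]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]  # P(Y = y | X) on the grid

# ML: maximize the likelihood alone, ignoring the prior.
y_ml = grid[max(range(len(grid)), key=lambda i: likelihood[i])]
# MAP: maximize the posterior.
y_map = grid[max(range(len(grid)), key=lambda i: posterior[i])]
# MMSE: the posterior mean.
y_mmse = sum(y * p for y, p in zip(grid, posterior))

print(y_ml, y_map, y_mmse)
```

Under these assumptions, the ML estimate sits at the observation itself, while the MAP and MMSE estimates are pulled toward the prior mean. Because a Gaussian prior combined with a Gaussian likelihood gives a Gaussian posterior, MAP and MMSE nearly coincide here; for skewed or multi-modal posteriors they can differ substantially.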