Introduction

Computer vs human vision

What is (computer) vision? First let’s discuss human vision that’s closely related. Objects bombarded by light (image) -> sensors in the eye -> brain interpret image -> interpretations.

Then what’s computer vision? Sensor is not an eye. They are spread across sensing elements on the image plane inside the camera. Then the sensor responses are analyzed by computer vision algorithms.

Vision vs Graphics

Computer graphics is almost dual of computer vision. For computer graphics, we start with computer models. Computer vision starts with images, and tries to infer models. There’s a huge intersection between these two areas: modeling.

Moreover, computer vision is a highly interdisciplinary area. Images are obtained in different sciences: biology, physics and so on. Computer vision is associated with computer science of course. Also, there are some optimization, numerical questions.

The goal of computer vision is to bridge the gap between pixels and “meaning”. The computer actual sees 2d table of numbers/pixels.

Applications:

optical character recognition
Object Detection and tracking
Segmentation (pixel accurate labeling)
User-assisted Segmentation (photo/video editing)
Activity recognition
and so on.

Image Modalities

Camera obscura (aka dark room)

Consider a model consisting a source light, and it arrives at the surface of the film. This system has a major flaw. The right dot and yellow dot reach not only one point, but also other points on the film. In the end, this film will not register an interesting image. There will be no variation of information across the surface of the film, and it will be overexposed.

there should be a image...

Consider a better idea which requires only one additional element. There is a barrier in the middle, which stops most of the light from the object except light passing the hole. Then the barrier blocks off most of the rays, reduces blurring.

there should be a image...

Picture adapted from topic 2 slide, CS 484.

With some intuition, we know that distant objects are smaller.

Moreover, the size of the aperture affects the quality of the image. From the examples, it seems that the smaller the aperture, the better. Why not make it as small as possible? Because less light gets through, and diffraction effects. Once the size of the aperture is comparable to the wavelength of the light waves, we will have diffraction effects and blur.

Lenses

The aperture suffers from the weak light. We cannot increase the size of the aperture because of the blur. Lenses offers a solution to this problem. It redirects lights, and converging to a point. Note that the lens must be large, so the light passing through is bigger.

Also note that only objects in a specific depth that “in focus” will be clear. Other points will result in “circle of confusion”. Changing image distance changes this depth.

If the object is far enough away from the lens, we can say the lights come in parallel. Lens’ focal length is image distance where objects at infinity appear in focus. Also note that to find the location of the source, we can just use the light among the parallel lights, passing the center of the lens.

Then we use simplified “pin hole” model instead of lens. For pin hole camera model “focal length” (f) is defined as image distance (to the “hole”). Now we will use a model consisting camera’s “optical center” or “view point” and a (virtual) image plane, which tells us the direction where the camera looks.

Projection geometry

Now we need a camera-centered world coordinate system for 3D world points. And we need an image coordinate system for image pixels locations, 2D. For camera-centered world coordinate system, we choose \(C \) (optical center) to be the origin, \(x \) and \(y \) are parallel to the image plane. Moreover, we choose \(x \)-axis to be parallel to the basis \(u \) in image plane, and \(y \)-axis parallel to the basis \(v \). We define \(z \)-axis to be optical axis, intersecting image at its coordinate center.

there should be a image...

Picture taken from Topic 2, CS 484. \(f \) is the focus length.

The conversion between two coordinate system is \[ (x,y,z)\to \left( f {x\over z}, f{y\over z} \right) \] Observe that \(z \) is the depth of the point. Size of any 3D object image is inversely proportional to object’s distance from the camera (z-coordinate value).

If we take a look at the biological model of the eye, the human eye is a camera.

Digital image formation

We can think of an image as a function \(f(x,y) \) where \( (x,y) \) are continuous coordinates in an image plane. Here we can think of \(f \) as intensity. Here we can define \[ f(x,y) = \text{reflectance}(x,y) \times \text{illumination}(x,y) \] where reflectance \(\in [0,1] \), illumination \(\in [0,\infty) \)