An experimental system
for reconstructing a scene from the image sequences produced by a moving
camera
Jiri Walder,
jiri.walder.fei@vsb.cz
Technical University of Ostrava
Ostrava / Czech Republic
Abstract
In this paper, a method and an experimental computer
program for reconstructing a three-dimensional static scene from the sequence
of images produced by a moving camera are presented. A method for determining
the camera trajectory and for reconstructing the coordinates of feature
points from multiple images is proposed. The efficiency and the robustness
of this method are verified experimentally. The feature points and the
correspondence relation are detected automatically by a heuristic algorithm.
Keywords: scene reconstruction,
camera calibration, image processing, computer vision, point tracking.
1. Introduction
The paper focuses on the following problem: Consider
a three-dimensional static world that will be referred to as a scene.
A camera moving in this world produces a sequence of images called frames.
The frames are two-dimensional pixel arrays. The two-dimensional coordinates
of points can be measured in the frames. The sequence of the frames is
used to create a model of the scene. The three-dimensional coordinates
(in a global reference system) of the feature points that are visible in
the frames are computed.
The solution to the problem of reconstructing a scene
from a pair of stereo images has been known for a long time [5]. The reconstruction
from three and more images was later studied from various viewpoints
[4], [6]. The difficult problem involved in reconstruction is detecting the
correspondence (we say that, in different frames, the images of a unique
point of a scene correspond to one another). Finding the correspondence
by a computer algorithm is hard: (1) In order to achieve a good accuracy
of reconstruction, the images should be taken from camera positions that
are distant enough. (2) As the distance between the camera positions increases,
the images become more and more different, and the correspondence becomes
more and more difficult to find automatically. If a whole sequence of frames
produced by a moving camera is available, the problem becomes easier. Since
the change between two consecutive frames is usually small, the correspondence
relation for the consecutive frames can be found easily. Also, the effects
of occlusion can be detected reliably. The correspondence relation for
non-consecutive frames can be determined as the product of the relations
for the consecutive frames. We suppose that the sequence of frames is long
enough. From this sequence, a certain set of significant frames can be
chosen for reconstruction.
In this paper, a method and an experimental computer
program for reconstructing a scene from the sequence of images produced
by a moving camera are presented. A method for determining the camera trajectory
(i.e., for calibrating the camera) and for reconstructing the coordinates
of points of interest from multiple images is proposed. The system performing
the scene reconstruction consists of the following parts: (1) Detecting
the points of interest in the particular frames. (2) Tracking the points
in the sequence of frames. (3) Determining the camera trajectory. (4) Reconstructing
the coordinates of the points of interest. Figure 1 shows the block scheme
of the system.
The paper is organised as follows: In Section 2,
the model of the camera and camera motion is described. The method for
calibrating the camera and the method for reconstructing the coordinates
of points are presented in Sections 3.1 and 3.2, respectively. In Sections
3.3 and 3.4, the problems of detecting and tracking the points of interest
are discussed.
Figure 1. The block scheme of the system.
2. The model of the camera
The camera model considered is the classical pinhole
model. The basic assumption is that the relationship between the world
coordinates and the pixel coordinates is linear projective. We suppose
that the possible nonlinear distortions of the images are corrected in
advance, before the images are used for calibration and for reconstruction.
As the camera moves in a three-dimensional scene, it produces a sequence
of frames. We suppose that the sequence I0,I1,...,In
of significant frames taken at times t0,t1,...,tn
is available for calibration and for reconstruction. The sequence need
not include all the frames produced by the camera. Instead, we may wait
until the changes in the frames are large enough to give new information.
The information in the dropped frames is then used indirectly (Section
3.3). Let (O,x,y,z) be a global coordinate
system in the scene (Figure 2). For each ti,
a camera coordinate system (Oi,xi,yi,zi)
is considered (Oi is the centre of projection,
zi
is the optical axis of the camera). Let
X be a point in the scene,
let x=(x,y,z)T and xi=(xi,yi,zi)T
represent X in (O,x,y,z) and (Oi,xi,yi,zi),
respectively. We have
x = Ri xi + oi ,    (1)
where Ri describes
the rotation of (Oi,xi,yi,zi)
into (O,x,y,z), and oi
represents Oi in (O,x,y,z).
We define Ri,j,bi,j
such that
xi = Ri,j xj + bi,j .    (2)
Obviously, Ri,i=I
(unit matrix), and bi,i=0.
It is easy to see that (we consider i<j)

Ri,j = Ri^T Rj = Ri,i+1 Ri+1,i+2 ... Rj-1,j ,    (3)

bi,j = Ri^T (oj - oi) = bi,i+1 + Ri,i+1 bi+1,j .    (4)
Recall that Ri,j
is an orthonormal matrix and that Ri,j = Rxi,j Ryi,j Rzi,j, where
Rxi,j, Ryi,j, Rzi,j
are the matrices corresponding to the rotations around the x-, y-
and z-axis, respectively. We introduce the
vector φi,j = (φxi,j, φyi,j, φzi,j)^T
containing the rotation angles.
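The chaining of the consecutive-frame motions in Eqs.(3),(4) can be sketched as follows (a minimal numpy sketch; the function name `compose` is ours, not the paper's):

```python
import numpy as np

def compose(R_pairs, b_pairs):
    """Chain consecutive-frame motions (R_{k,k+1}, b_{k,k+1}) into the
    overall motion (R_{i,j}, b_{i,j}) of x_i = R_{i,j} x_j + b_{i,j}."""
    R, b = np.eye(3), np.zeros(3)
    for Rk, bk in zip(R_pairs, b_pairs):
        # x_i = R_{i,k} (R_{k,k+1} x_{k+1} + b_{k,k+1}) + b_{i,k}
        b = b + R @ bk
        R = R @ Rk
    return R, b
```

Chaining two rotations about the same axis yields the rotation by the sum of the angles, and the translations accumulate as in Eq.(4).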
containing the rotation angles. Let fi = dist(Oi,Oi')
be the focal length of the camera in ti. In
each significant frame, we introduce a coordinate system (Oi',xi',yi')
(Figure 2). The image coordinates are measured in a coordinate system (Zi,ui,vi).
Let Xi be the image of X in the ith
frame, and let ui=(ui,vi,1)^{T}
represent Xi in (Zi,ui,vi).
In (Oi',xi',yi'),
Xi
is
represented by Qiui
where Qi describes the transformation
of (Zi,ui,vi)
into (Oi',xi',yi').
It follows that
xi = λi Fi Qi ui ,    (5)

where

Fi = diag(1, 1, fi) ,    Qi ui = ( si(ui - u0i) + (vi - v0i)cos θi ,  (vi - v0i)sin θi ,  1 )^T .    (6)

The values u0i, v0i
are the coordinates (in (Zi,ui,vi))
of the point in which the optical axis pierces the projection plane. The
angle θi
models a possible deviation from orthogonality of the cell array of the
sensor or a possible misalignment of the projection plane with respect
to the optical axis. In practice, θi
is usually close to π/2.
The scaling factor si takes into account the
possibly different sizes of the sensor cells along the u- and
v-axes
and the distortion introduced by the frame grabber; λi
is a real parameter. The parameters in Fi,
Qi
(fi, u0i, v0i, θi, si)
are the camera intrinsic parameters. The parameters describing the trajectory
of the camera in space (Ri, oi
or Ri,i+1, bi,i+1)
are the extrinsic parameters.
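As an illustration, the direction of the optical ray through a pixel can be obtained from the intrinsic parameters. The sketch below uses one standard parameterisation of Fi, Qi; the exact entries in the paper may differ, and the function name is ours:

```python
import numpy as np

def ray_direction(u, v, f, u0, v0, s=1.0, theta=np.pi / 2):
    """Unit direction (in camera coordinates) of the optical ray through
    pixel (u, v); one standard reading of x_i = lambda_i F_i Q_i u_i."""
    # map pixel coordinates to metric image-plane coordinates
    x = s * (u - u0) + np.cos(theta) * (v - v0)   # skew angle theta
    y = np.sin(theta) * (v - v0)
    d = np.array([x, y, f])                       # z is the optical axis
    return d / np.linalg.norm(d)
```

For θ = π/2 (no skew) the principal point (u0, v0) maps to the optical axis itself.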
3. The description of the system parts
3.1 Camera calibration
The purpose of camera calibration is to determine the intrinsic and the
extrinsic parameters of the camera. Once these parameters are known, three-dimensional
information can be inferred from two-dimensional images and vice versa.
The calibration method proposed in this paper focuses on the problem of
finding the extrinsic parameters. Thus, we suppose that Fi, Qi
are known for each i. Note that a variety of methods for determining
the camera intrinsic parameters have been presented [3].
Figure 2. The model of the camera and camera motion.
The information needed for calibration is provided
by detecting and tracking the points of interest (Sections 3.3,3.4). In
this section, we suppose that from the significant frames I0,I1,...,In,
the lists of the points of interest along with their image coordinates
and along with the description of the correspondence relation can be obtained.
From the set of the pairs of corresponding points of interest, reliable
pairs are chosen for calibration. Consider two different values ti,
tj (i<j) and a point X in the scene. Let pi, pj
represent the directions of the lines ⟨OiXi⟩ and ⟨OjXj⟩,
respectively, in (Oi,xi,yi,zi).
Since ⟨OiXi⟩, ⟨OjXj⟩ and ⟨OiOj⟩
are coplanar, the condition pi·(bi,j×pj) = 0
must be satisfied. This condition yields [1]

(Fi Qi ui)^T ( bi,j × (Ri,j Fj Qj uj) ) = 0 .    (7)

Let s = (b0,1^T, φ0,1^T, b1,2^T, φ1,2^T, ... , bn-1,n^T, φn-1,n^T)^T
be the vector of the values that are to be found during the calibration.
Let q be the number of calibration points, and let X^(k)
be the k-th such point. We use u^(k)
to denote the vector of the coordinates of the images of X^(k)
in all the frames in which X^(k)
is used for calibration, and, finally, we introduce the vector u
= (u^(1)T, ... , u^(q)T)^T
of the image coordinates of all the calibration points. The coplanarity
equations Eq.(7) can be rewritten as follows

f(s, u) = 0 .    (8)

Usually, many more points than the minimum required
for calibration are detected in the frames. This can be used to reduce
the influence of noise, i.e., the fact that the coordinates in the frames
are measured with an error can be taken into account. Let u contain
the exact coordinates. We use u0 to denote
the vector of the coordinates observed in the frames, and we introduce the
vector Δu = u - u0
of the differences and a diagonal matrix W expressing the reliability
of the particular observations. The vector s can then be found by
minimising the value

Δu^T W Δu .    (9)

The following coplanarity equations must be satisfied

f(s, u0 + Δu) = 0 .    (10)

We solve the nonlinear problem formulated by Eqs.(9),
(10) by linearisation. Suppose that an initial estimation s0
of s is known. Let Δs
be the correction of s0, i.e., s = s0 + Δs.
By the Taylor expansion of Eq.(10), neglecting the higher-order terms,
we obtain

S Δs + U Δu + e = 0 ,    (11)

where S, U are the Jacobian matrices of f with respect to
s and u, respectively, evaluated at (s0, u0), and e = f(s0, u0).    (12)

The minimisation in Eq.(9) with the condition in Eq.(11) may be solved
using the method of Lagrange multipliers. Let μ
be the vector of Lagrange multipliers. The problem is solved by minimising
the function

Φ = Δu^T W Δu + 2 μ^T (S Δs + U Δu + e) .    (13)

The computation yields the result

Δs = -(S^T M^{-1} S)^{-1} S^T M^{-1} e ,    where M = U W^{-1} U^T .    (14)

We use Eq.(14) in the iterative process. In each step of this process,
we compute S, U, e and we use Eq.(14)
to determine the new value of Δs. We
then update s0 ← s0 + Δs.
As a rule, only a few iterations are needed (usually, no more than ten).
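One step of the constrained least-squares correction, Eqs.(9)-(14), can be sketched in numpy as follows (a minimal sketch; S, U, e denote the Jacobians and the residual of the coplanarity constraints, and the function name is ours):

```python
import numpy as np

def calibration_step(S, U, e, W):
    """One step of  min du^T W du  subject to  S ds + U du + e = 0
    (Eqs.(9)-(14)); returns the corrections ds and du."""
    M = U @ np.linalg.inv(W) @ U.T               # M = U W^-1 U^T
    Minv = np.linalg.inv(M)
    ds = -np.linalg.solve(S.T @ Minv @ S, S.T @ Minv @ e)   # Eq.(14)
    mu = Minv @ (S @ ds + e)                     # Lagrange multipliers
    du = -np.linalg.inv(W) @ U.T @ mu            # residual correction
    return ds, du
```

By construction the returned corrections satisfy the linearised constraint S ds + U du + e = 0 exactly.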
Note that in the practical implementation, we employ a heuristic deciding
which points (from the set of all the points of interest detected in the
frames) will be used for calibration. For the particular points, the heuristic
also chooses the pairs of frames for which the coplanarity equations will
be assembled. To obtain the initial value of s0,
we use the classical two-image approach [7], where the significant frames
are processed in the pairs Ii, Ii+1.
Note that the multi-image calibration process described
in this section can be used simultaneously (and repeatedly) with making
the initial estimation of s0. Whenever the
initial values of bi,i+1, φi,i+1
(for a certain i) are estimated by the two-image method, the multi-image
calibration process using a chosen set of significant frames can be launched
to immediately improve this initial estimation.
3.2 Reconstruction
Provided that a point is observed in at least two significant
frames, its coordinates can be reconstructed. Let X be such a point,
let x be the sought vector of its coordinates in (O,x,y,z),
and let ui = (ui,vi,1)^T
represent Xi, the image of X, in (Zi,ui,vi).
Theoretically, X should lie on all the lines oi + λi Ri Fi Qi ui
simultaneously. If noise is present, X will not lie on the lines
exactly. We measure the distances between X and the lines and minimise
the sum of their squares, i.e., we carry out

Σi || (I - ni ni^T)(x - oi) ||^2 → min .    (15)

The computation yields the result

x = ( Σi (I - ni ni^T) )^{-1} Σi (I - ni ni^T) oi ,    (16)

where

ni = Ri Fi Qi ui / || Ri Fi Qi ui || .    (17)

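The least-squares intersection of the rays in Eqs.(15)-(17) can be sketched as follows (a minimal numpy sketch; the function name is ours):

```python
import numpy as np

def triangulate(origins, directions):
    """Point minimising the sum of squared distances to the rays
    o_i + lambda_i n_i  (Eqs.(15)-(17))."""
    A = np.zeros((3, 3))
    c = np.zeros(3)
    for o, d in zip(origins, directions):
        n = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(n, n)   # projector orthogonal to the ray
        A += P                           # sum in Eq.(16)
        c += P @ o
    return np.linalg.solve(A, c)
```

For two or more non-parallel rays the matrix A is invertible, and for noise-free rays through a common point the exact point is recovered.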
3.3 Detection of points of interest
The points of interest extracted from the frames, both
for calibration and for reconstruction, are corners. Corners abound
in the images of both natural and man-made scenes. They often correspond
to the corners of three-dimensional objects and can be unambiguously localised
in the frames. For detecting the corners, we use the Beaudet corner detector
[2].
In 1978, P. R. Beaudet proposed the first corner
detector that works directly with the image intensity. The Beaudet detector
locates the corners in an image at the local maxima of the Gaussian curvature.
Firstly, Beaudet's algorithm applies the operator called DET to the whole
image. In this way, we obtain an image in which we search for the local
maxima. The operator DET has local maxima at the positions where the corner
points are situated. The maxima should be greater than a certain threshold.
The operator DET is very simple and can be written as follows

DET = guu gvv - guv^2 ,    (18)
where g = g(u,v) is the input image, represented
by a two-dimensional function that describes the image intensity, and guu,
gvv, guv
are its second partial derivatives. Figure 3 shows an example of the points of
interest detected in an image.
Figure 3. Points of interest detected in an image.
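The DET operator of Eq.(18) can be sketched with simple finite differences as follows (a minimal sketch; the thresholding and the local-maximum search are omitted):

```python
import numpy as np

def beaudet_det(g):
    """Beaudet's DET operator: DET = g_uu * g_vv - g_uv**2,
    with derivatives approximated by np.gradient."""
    gu = np.gradient(g, axis=0)
    guu = np.gradient(gu, axis=0)        # second derivative along u
    gvv = np.gradient(np.gradient(g, axis=1), axis=1)
    guv = np.gradient(gu, axis=1)        # mixed derivative
    return guu * gvv - guv ** 2
```

On the quadratic surface g = u² + v² the interior response is exactly guu·gvv − guv² = 2·2 − 0 = 4.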
3.4 Tracking the points of interest
Once a point of interest is detected in a frame, it
is tracked in the subsequent frames, i.e., in the subsequent frames, we
search for the images of the same point of the scene. For tracking, we use
all the frames produced by the camera, i.e., not only the significant frames.
In this section, therefore, by 1,2,...,n we mean the indices in
the sequence of all frames. Let ui-k,...,ui
be the coordinates of the images of a point X in the frames Ii-k,...,Ii
(k>0). The task of finding the coordinates ui+1
in Ii+1 of the image of
X will be easier if some estimation u*i+1
of ui+1
is available. We use the Kalman filter for this purpose. Recall that in
Kalman filtering, the system model, which describes the expected evolution
of the vector si of state variables
over time, is of the form si+1 = Fi+1,i si + Di di + Gi ξi.
The model of measurement is zi = Hi si + ηi.
The values si, di, zi, ξi, ηi
are the state vector, deterministic input, vector of measured values, process
noise and measurement noise, respectively. In order to determine
the system model for tracking the points in the frames, we use Eqs.(2),(5).
We obtain
λi Fi Qi ui = Ri,i+1 λi+1 Fi+1 Qi+1 ui+1 + bi,i+1 .    (19)

Hence

ui+1 = (1/λi+1) Qi+1^{-1} Fi+1^{-1} Ri,i+1^T ( λi Fi Qi ui - bi,i+1 ) .    (20)

As can be easily seen, Eq.(20) is of the form that is
expected by the Kalman filter. Its use, however, is not easy. In order
to determine the value of ui+1,
the matrices Qi+1, Fi+1, Ri,i+1
and the vector bi,i+1
are needed. These, however, are not known at the time when the value of
ui+1
is to be computed. Therefore, we use an approximate model. The model
is based on the assumption that the acceleration of the motion of the points
of interest is constant. We use two separate Kalman filters for the u, v
coordinates of each point. In the following text, we restrict ourselves
to the u-coordinate. The state vector is si = (ui, ui', ui'')^T.
The matrices of the filter are of the form
,

(21)

, ,

(22)

where Δt = ti+1 - ti. In our model, Gi
is a diagonal matrix, constant over time, compensating for the difference
between the exact and the approximate model. Note that besides predicting
the coordinates of the points in the future frames, the Kalman filter is also used
for determining the values of the coordinates used for calibration and reconstruction.
The filter reduces the influence of noise and gives the coordinates with
subpixel accuracy, which improves the precision achieved in calibration
and reconstruction.
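The constant-acceleration filter of Eqs.(21),(22) can be sketched as follows (a minimal sketch for one coordinate; the noise covariances Q, R are assumed inputs, not values from the paper):

```python
import numpy as np

def kalman_predict(s, P, dt, Q):
    """Predict step with the constant-acceleration model, Eq.(21);
    the state is s = (u, u', u'')."""
    F = np.array([[1.0, dt, 0.5 * dt * dt],
                  [0.0, 1.0, dt],
                  [0.0, 0.0, 1.0]])
    return F @ s, F @ P @ F.T + Q

def kalman_update(s, P, z, R):
    """Update step with the measured coordinate z; H = (1, 0, 0)."""
    H = np.array([[1.0, 0.0, 0.0]])
    y = z - H @ s                                   # innovation
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)    # Kalman gain
    return s + K @ y, (np.eye(3) - K @ H) @ P
```

Tracking a point that moves at a constant velocity, the filtered position converges to the measurements, and the one-step prediction supplies the estimate u*i+1 used for the search vicinity.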
The algorithm for tracking the points of interest
in the sequence of frames is presented in the following text. Consider
the sequence of frames I1, I2, ..., In
produced by a camera at times t = 0, 1, ..., n.
The correspondence is a symmetric and transitive
relation between two images of the same point. Finding all the corresponding
images of one point amounts to computing the transitive closure. We
implement the detection of correspondence as finding the pairs of corresponding
points in the consecutive images It and It+1.
The algorithm we propose solves this task. At time t, we find the
correspondence between the points Xt and Xt+1.
To understand the algorithm correctly, one has to realise that the point
Xt has already been found at time t-1, when it was needed
for finding the correspondence with the point Xt-1.
The detailed description of the algorithm for finding the correspondence
follows. We describe the steps that are executed for every frame taken
at time t. The algorithm consists of three sections, denoted by
A, B, C.
A. Preparing the image It+1.
(1) We prepare the image It+1
for further processing. The image is read from a file produced by a camera
or obtained in any other way.
(2) In the image It+1,
we find all the points of interest. The points of interest are detected
by one of the corner detectors. Every point found in this way is inserted
into a search array and is denoted by a unique index. The corresponding
points have the same index.
B. Finding the filtered value and the prediction
We denote by X't
the value corrected by the Kalman filter at time t, and by X*t+1
the prediction provided by the Kalman filter for time t+1.
In this section, the algorithm falls into two parts. If a new point Xt
is found in the frame It, we continue with Step
3. If the point Xt is found as a point that
corresponds to the point Xt-1,
we continue with Step 4.
(3) If a new point is found, we initialise two Kalman
filters, one for the coordinate u and the other for the coordinate
v.
Both the prediction X*t+1
and the filtered value X't
are equal to the coordinates of the new point Xt.
(4) We update the Kalman filters for both coordinates
u and v of Xt, which has been found as
corresponding to Xt-1.
The filters provide the prediction X*t+1
and the filtered value X't
in which the influence of noise is reduced.
C. Finding the point corresponding to Xt
(5) For the point Xt, we determine a vicinity in which we will
look for the candidates for correspondence. The vicinity is a circular
area whose radius is r and whose centre lies at the predicted position
X*t+1. We choose the size of r to be inversely proportional
to the time t since, as t grows, the prediction X*t+1
of the Kalman filter becomes more accurate. For the size of r, we can
write

r = max( rmin , rmax/(k·t) ) .    (23)

If Xt is a new point, we
must choose the size of the vicinity large enough to ensure that the first
candidates for correspondence are found reliably. The starting size of the radius is
rmax. Later, the size is computed by Eq.(23). The size r is never less
than the minimum rmin. The parameter k
determines how fast the size of the vicinity decreases.
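Under the reading of Eq.(23) suggested by the text (a radius inversely proportional to t, clamped between rmin and rmax), the radius can be sketched as follows; the exact expression and the parameter values are assumptions, not taken from the paper:

```python
def search_radius(t, r_min=3.0, r_max=30.0, k=1.0):
    """One plausible form of Eq.(23): the radius shrinks as 1/(k*t),
    starting at r_max and never dropping below r_min.
    Default values are examples only, not from the paper."""
    if t < 1:
        return r_max          # a new point starts with the full radius
    return max(r_min, min(r_max, r_max / (k * t)))
```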
(6) All the points Y found in the vicinity
are regarded as candidates for correspondence, and the correspondence between
Xt and the candidates is then checked. In a sequence of frames, there are
usually only small differences between consecutive images; hence, the
image functions in the vicinities of corresponding points
are approximately the same. Therefore, we compare the image function of
the current frame in a vicinity of the tracked point Xt
with the image function of the subsequent frame in the vicinities of all the candidates
Y.
In practice, we use a rectangular area of size w×h. Both w and h are set to
an appropriate value; the recommended size is 3-9. The vicinities are compared
by a function rel that gives values from the interval
[0,1]. If the value of rel is one, the images in both vicinities are the same;
if the value is zero, the images are different. Let ut, vt
be the coordinates of the tracked point Xt
at time t, and let ut+1 and vt+1 be the coordinates
of the candidate Yt+1 at time t+1. Let λ
denote one of the colour components. In a colour image, each pixel has
three colour components r, g, b. We introduce L = {r, g, b}.
Suppose that the values of the colour components of a pixel are chosen from
the interval [Jmin, Jmax]. We define the function rel as follows

rel = (1/card(L)) Σλ∈L ( 1 - sλ / (w·h·(Jmax - Jmin)) ) ,    (24)

where

sλ = ΣΔu,Δv | Jλ,t(ut+Δu, vt+Δv) - Jλ,t+1(ut+1+Δu, vt+1+Δv) | .    (25)
The value sλ measures the difference of the vicinities of Xt and
Yt+1; Jλ,t and Jλ,t+1
are the image functions in the particular frames. The function card returns
the number of elements of the set L.
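The similarity measure of Eqs.(24),(25) can be sketched as follows (a minimal sketch; the windows are w×h×card(L) arrays, and the function and parameter names are ours):

```python
import numpy as np

def rel(win_a, win_b, j_min=0.0, j_max=255.0):
    """Similarity of two w x h x card(L) colour windows, Eqs.(24),(25):
    1 for identical windows, 0 for maximally different ones."""
    w, h, n_comp = win_a.shape
    out = 0.0
    for l in range(n_comp):                          # colour components L
        s_l = np.abs(win_a[:, :, l] - win_b[:, :, l]).sum()   # Eq.(25)
        out += 1.0 - s_l / (w * h * (j_max - j_min))
    return out / n_comp                              # 1/card(L) in Eq.(24)
```

Identical windows give 1, a window of Jmin against a window of Jmax gives 0, and partial overlap falls strictly between the two.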
(7) Knowing the value of rel for all the candidates Y, we choose
the candidate for which the value is considerably greater than for the
others. If the function (24) gives the same value for two or more candidates,
we choose the candidate that is closer to the predicted position X*t+1.
If the value of rel is below a threshold th, we say that
no corresponding counterpart can be found for the tracked point. Otherwise,
the point Yt+1 is declared
to be a point corresponding to Xt and is given
the same index as Xt. The value of th
depends on the number of candidates, denoted by n. For determining
th, we propose the following expression

(26)

where thmin is a minimum
threshold, which is chosen from the interval [0,1].
(8) For the radius r of the scanned vicinity,
we introduce an additional condition. If no point of interest was found
in the scanned vicinity in Step 6, the radius of the vicinity is doubled
and we return to Step 6. At the same time, the minimum threshold thmin
is modified because we require a greater reliability of the function rel
for the candidates. Therefore, the value of thmin changes according
to the expression

(27)

This dependence has been designed experimentally.
(9) If any point that was not classified as corresponding
remains in the search array, it is regarded as a new point. Every
such point gets a unique index.
After the computation for every image It,
we get an array of indexed points. We can now determine the trajectories
of the points for all the images Ii, i ≤ t.
The procedure described above is heuristic, but our
experiments have shown that it works well. We tested this procedure on
a number of both synthetic and real sequences. Generally, the trajectories
are identified correctly. In some cases, however, certain problems may
occur. It may happen, for example, that the algorithm starts to track a
wrong trajectory. Heuristics revealing and correcting these and similar erroneous
situations are implemented. Examples of the trajectories found in
a real sequence of images are depicted in Figure 4.
Figure 4. Trajectories of points.
4. Conclusion
In this paper, a method and a computer program for reconstructing
three-dimensional scenes from the images produced by a moving camera
were presented. A method for calibrating the camera and for reconstructing
the scene from multiple images was proposed, implemented and verified experimentally.
As expected, in comparison with the classical two-image approach, the multi-image
approach reduces the sensitivity to noise, which is the key problem in
reconstruction. On the other hand, as the number of significant frames
simultaneously used for calibration increases, the method becomes computationally
expensive. Fortunately, only a few significant frames in the time window
(e.g., 3-10 frames) usually suffice to guarantee a good precision of reconstruction.
Although we intended our implementation as an experimental tool, and although
we did not aim at constructing a real-time vision system, we are aware
of the fact that the capability to work in real time is important. The
current implementation, however, does not work in real time. Finding and
tracking the points of interest proved to be the most computationally
expensive part. In calibration and reconstruction, the computational load
can be reduced by processing only an adequate number of significant frames
in the time window.
5. References
[1] Sojka, E.: Reconstructing Three-Dimensional Objects from the Images Produced by a Moving Camera, in Proc. 8th ICECGDG, Austin, Texas, USA, pp. 160-164, 1998.
[2] Beaudet, P. R.: Rotationally Invariant Image Operators, in Proc. Fourth Int. Joint Conf. on Pattern Recognition, Tokyo, pp. 579-583, 1978.
[3] Daniilidis, K. and Ernst, J.: Active intrinsic calibration using vanishing points, in Pattern Recognition Letters, Vol. 17, No. 11, pp. 1179-1189, 1996.
[4] Ito, M. and Ishii, A.: Range and shape measurement using three-view stereo analysis, in IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 8, No. 4, pp. 524-532, 1986.
[5] Longuet-Higgins, H. C.: A computer algorithm for reconstructing a scene from two projections, in Nature, Vol. 293, pp. 133-135, 1981.
[6] Luong, Q. T. and Faugeras, O. D.: Self-calibration of a moving camera from point correspondences and fundamental matrices, in International Journal of Computer Vision, Vol. 22, No. 3, pp. 261-289, 1997.
[7] Bosak, R.: Relative calibration and solving the calibration equations, Diploma thesis, Technical University of Ostrava, Czech Republic, 1998.