GLOBAL TRANSFORMATION ESTIMATION VIA LOCAL REGION CONSENSUS

Final project by

Erez Farhan

erezfarhan@gmail.com


1. Introduction and Problem Description

One of the key sources of variability in the appearance of an image is the underlying scene geometry: images of the same scene taken from different angles may look very different. For point matching, the process of finding corresponding scene points in two images, this variability has been intensively explored over the last decade [1, 2, 3, 4], particularly for local patches. For many applications, existing solutions are sufficiently good in terms of the number of matches, the fraction of correct matches, their localization accuracy, and the computational cost. Many other applications, however, require improvement in one or more of these aspects. The geometric variability of a local patch can be approximated by an affine transformation. All the popular point matching methods [2, 4, 1] account for some geometric variation, and accounting for the full affine model has proved beneficial [3, 1]. However, estimating the geometric transformation from small patches is of limited accuracy, insufficient in many cases, and computationally demanding. The key difficulty in estimating an affine transformation from a small region lies in the scarcity and low quality of the available information. The flip side of the local affine approximation is that richer transformations, such as projective transformations of planes, cannot be accurately estimated from small regions.

In this work, we show how forging a consensus among the local affine transformations of several regions that share the same scene plane can improve the local transformation estimates and also allow accurate estimation of the perspective transformation of the plane.

2. Approach and Method

This work relies on the idea that the geometric transformation of regions between images can be estimated much more accurately from larger regions. The methodology presented here is therefore to assemble several small regions, together with their estimated transformations, into a unified region with a richer and much more accurate transformation estimate. This is done through a process of consensus forging, illustrated here for the case where affine transformation estimates of small regions are given, yielding perspective transformation estimates for unions of regions. A high-level description of the system is given in the following chart:

Top-level flowchart for the system

2.1 Detect & Match MSERs

We assume we are given region matches produced by MSER after affine normalization. The normalization is typically done using measurements of the first- and second-order moments, bringing all detected regions to a canonical reference frame up to an unknown rotation, which is resolved using standard methods [3]. The normalized regions are then matched using the SIFT algorithm. All in all, the derived matches are expected to provide some degree of affine invariance.
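As a rough illustration of the moment-based normalization (written in Python/NumPy rather than the project's MATLAB code; the exact normalization used by the detector may differ), the following sketch maps a region's pixel set to a canonical frame, up to the unresolved rotation:

```python
import numpy as np

def normalizing_transform(pixels):
    """Map a detected region to a canonical frame (up to rotation),
    using its first- and second-order moments.

    pixels: (N, 2) array of (x, y) coordinates of the region's pixels.
    Returns a 3x3 affine matrix that centers the region and whitens
    its second-moment (covariance) matrix."""
    mu = pixels.mean(axis=0)            # first-order moments (centroid)
    C = np.cov(pixels.T)                # second-order central moments
    # Inverse square root of C via eigendecomposition
    w, V = np.linalg.eigh(C)
    C_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    N = np.eye(3)
    N[:2, :2] = C_inv_sqrt
    N[:2, 2] = -C_inv_sqrt @ mu
    return N
```

After this transform, every region has zero mean and an identity covariance, so two matching regions differ only by the residual rotation mentioned above.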

2.2 Extract Local Affine Transformations

From each normalized region match, the affine transformation between the matching (un-normalized) regions is easily derived by composing the normalizing transforms of the two regions. At this point we have region matches between the images, each with an underlying affine transform. We call these phase zero region matches.
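The composition above can be sketched as follows (a NumPy illustration, not the project's code): if N_src and N_tgt normalize the two regions to the canonical frame, and R is the residual rotation between the canonical frames, the local affine transform is the composition going into the canonical frame and back out.

```python
import numpy as np

def rotation(theta):
    """3x3 homogeneous rotation by theta (the residual canonical-frame
    rotation, resolved e.g. from a dominant gradient orientation)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def local_affine(N_src, N_tgt, theta=0.0):
    """Affine transform mapping the source region onto the target region:
    go to the canonical frame via N_src, rotate, then leave it via the
    inverse of N_tgt."""
    return np.linalg.inv(N_tgt) @ rotation(theta) @ N_src
```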

2.3 Consensus For Region Unions

Using the affine transformation of each region, we can start looking for mutual relations between proximate regions. The transformation of every phase zero region is used to predict the locations of neighboring regions, and is then validated against the estimated transformations of those neighbors. This is done as follows:
  1. From each region in the source image (call it the anchor region), we estimate the pixel locations of other regions in the target image, using the anchor's local transform estimate.

Region prediction

  2. Since local transformations are not expected to be accurate for farther regions, even ones lying on the same plane, we take the distance between the anchor and each predicted region into account when adjusting the expected prediction error.
  3. The expected prediction error is also adjusted to the extent of the anchor region.
  4. Using this knowledge, together with an assumption on the typical affine estimation errors of local regions, we determine a maximum expected error in predicting the locations of neighboring regions in the target image.
  5. Given the expected error, we compare the predicted location with the location derived from the phase zero match of the neighboring region.
  6. This comparison gives an estimate of the prediction error, which is divided by the expected prediction error to obtain the normalized prediction error.

Region validation

  7. The same process is carried out vice versa, with the neighboring region predicting the location of the anchor region in the target image, yielding another normalized prediction error.
  8. The maximum of the two normalized prediction errors is then used as an agreement measure between the regions. If it is smaller than a fixed threshold, we consider the two regions to have a co-planar consensus, and we unify them into one bigger phase one region.
  9. This process is repeated until no new consensus pairs are found.
  10. Finally, phase zero regions that do not meet the extent requirements for estimating a projective transformation are discarded.
  11. Note that outlier phase zero region matches are much less likely to forge a consensus with neighboring regions. This prevents them from becoming part of a bigger union region, so they are naturally discarded (step 8).
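The symmetric consensus test above can be sketched as follows (Python/NumPy for illustration; the error model and its sigma parameter are assumptions of this sketch, not the article's exact formula):

```python
import numpy as np

def predict(A, p):
    """Apply a 3x3 affine transform A to a 2D point p."""
    q = A @ np.array([p[0], p[1], 1.0])
    return q[:2]

def normalized_prediction_error(A_anchor, c_anchor, c_neigh, c_neigh_tgt,
                                sigma=2.0):
    """Predict the neighbor's target-image location with the anchor's
    affine transform, then normalize the error by an expected error that
    grows with the anchor-to-neighbor distance (sigma is an assumed
    per-unit-distance error scale)."""
    pred = predict(A_anchor, c_neigh)
    err = np.linalg.norm(pred - c_neigh_tgt)
    dist = np.linalg.norm(np.asarray(c_neigh) - np.asarray(c_anchor))
    expected = sigma * (1.0 + dist)  # crude distance-dependent error model
    return err / expected

def coplanar_consensus(A1, c1, c1_tgt, A2, c2, c2_tgt, thresh=1.0):
    """Symmetric consensus test: each region predicts the other's target
    location; the pair agrees if the worse normalized error is small."""
    e12 = normalized_prediction_error(A1, c1, c2, c2_tgt)
    e21 = normalized_prediction_error(A2, c2, c1, c1_tgt)
    return max(e12, e21) < thresh
```

Two regions obeying the same planar transform pass the test with near-zero error, while an outlier match fails it and is naturally excluded from the union.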

2.4 Perspective Estimations From Unions

After applying the process described in 2.3 to the full set of regions, we are left with unions of regions that appear to be co-planar and agree on the local transformation. We can use each union to estimate the perspective transformation of that union from the source frame to the target frame. This can be done very accurately, since the extent of a union is large, which makes the expected estimation error small.
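A standard way to estimate such a perspective (projective) transformation from the pooled point correspondences of a union is the direct linear transform (DLT); the sketch below is a minimal NumPy version, not the project's code (a production version would also pre-normalize the points for numerical stability):

```python
import numpy as np

def fit_homography(src, tgt):
    """Least-squares homography (DLT) from point correspondences pooled
    from a union of co-planar regions. src, tgt: (N, 2) arrays, N >= 4."""
    A = []
    for (x, y), (u, v) in zip(src, tgt):
        # Each correspondence contributes two rows to the DLT system.
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]
```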

2.5 Predict arbitrary point locations

Having aggregated many co-planar local transformations into a more global transformation of the plane, we can now predict the target image locations of arbitrary source image points that lie on a plane with an attached perspective transform estimate. The problem is to determine which estimated perspective transform should be applied to the desired point. Different unified regions may correspond to different scene planes, so the choice of region, and thus of perspective transform, may be crucial. In this work we make the simplifying assumption that the scene contains a single highly-textured plane, so that all unified regions, as well as the desired point, lie on the same plane. The decision then reduces to choosing the most appropriate phase one region, with its underlying transform estimate, for predicting the location of the desired point in the target image. We choose the region that has the smallest normalized expected prediction error (see full article) for the target location of the desired point. Naturally, this favors regions that lie relatively close to the desired point in the source image and have a long extent in the direction of the translation vector between the region and the point.
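The selection-and-prediction step can be sketched as follows (again in Python/NumPy; the expected-error model below is an illustrative assumption, not the article's exact formula):

```python
import numpy as np

def predict_point(p, unions, sigma=2.0):
    """Predict the target-image location of an arbitrary source point p.

    unions: list of (H, center, extent) tuples, one per phase one region:
    its estimated 3x3 homography, its centroid, and a scalar extent
    (e.g. its radius). The region with the smallest normalized expected
    prediction error is chosen, so nearby, large-extent regions win."""
    def expected_error(center, extent):
        dist = np.linalg.norm(np.asarray(p) - np.asarray(center))
        return sigma * (1.0 + dist / extent)
    H, _, _ = min(unions, key=lambda u: expected_error(u[1], u[2]))
    # Apply the chosen homography in homogeneous coordinates.
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]
```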

3. Application: Matches of Arbitrary Points on the Plane

Using the point prediction mechanism described in Section 2.5, we built a MATLAB application in which two images of a mostly planar scene can be loaded and correspondences of arbitrary points on that plane can be found.

3.1 How to use the Matlab application

The dedicated GUI is fairly straightforward to use; the standard steps are:
  1. Extract the code package, enter its root directory in MATLAB, and run the script 'setup.m'.
  2. Run 'ConsensusGui.m'. A simple GUI should open. Notice that the 'Try it!' button is initially disabled.
  3. Hit 'Load Source Image' and choose a source image to work with. This triggers some processing (finding phase zero regions), so be patient.
  4. Do the same with 'Load Target Image'.
  5. The two images should now appear side by side in one figure.
  6. Press 'Process' - this runs the actual algorithm proposed here.
  7. The 'Try it!' button is now enabled; press it, then pick a desired point in the left image. The corresponding point immediately appears in the right image.
  8. Hit 'Try it!' again to pick another point.
  9. If you want to load different images, make sure you hit 'Process' again after loading them.


4. Results

We evaluated the point matching application on several standard feature matching benchmark images from [6]. To verify the contribution of the proposed consensus algorithm, for each source image of the benchmark we sampled 100 random points and tried to predict their locations in the target image. Since the ground truth homography is available from [6], we can measure the pixel error in the target image. We compare this error to the error produced by predicting the target point using the collection of phase zero region matches alone. We pooled the results from all 10 benchmark image challenges that introduce viewpoint changes ('graffiti' and 'wall') and present the pixel error histograms of both methods. Note that errors above 100 pixels were discarded for a more convenient presentation. Such errors occurred only with the method without consensus; had they been included, the mean error of that method would have been above 66 pixels. It is clearly evident that the consensus mechanism prevents cases of very large error (all its errors are below 15 pixels) and significantly improves the overall accuracy of arbitrary point prediction.


5. Summary and future ideas

The ability to estimate image transformations accurately, using a simple existing technique as an initializer and then enforcing region consensus, enabled us to accurately predict the locations of arbitrary points in the target image. We saw how the consensus mechanism increased accuracy and enabled richer transformations, such as the perspective transform, to be considered. The mechanism also helped reject large prediction errors. The work here was restricted to mostly planar scenes, but the idea can certainly be extended to more complicated situations and richer transformations. An immediate future development would be to detect multiple distinct planes and find an accurate consensus for each plane. The idea could also be adapted to elastic transformations, though possibly with some algorithmic variations in the estimation model.

6. Additional Information

References

[1] Bay, Herbert and Tuytelaars, Tinne and Van Gool, Luc, "Surf: Speeded up robust features", in Computer Vision--ECCV 2006 (Springer, 2006), pp. 404--417.
[2] Bentolila, Jacob and Francos, Joseph M, "Affine consistency graphs for image representation and elastic matching", in Image Processing (ICIP), 2012 19th IEEE International Conference on (, 2012), pp. 2365--2368.
[3] Lowe, David G, "Distinctive image features from scale-invariant keypoints", International journal of computer vision 60, 2 (2004), pp. 91--110.
[4] Matas, Jiri and Chum, Ondrej and Urban, Martin and Pajdla, Tomás, "Robust wide-baseline stereo from maximally stable extremal regions", Image and vision computing 22, 10 (2004), pp. 761--767.
[5] Mikolajczyk, Krystian, "Affine Covariant Features". http://www.robots.ox.ac.uk/~vgg/research/affine/index.html
[6] Mikolajczyk, Krystian and Schmid, Cordelia, "Scale & affine invariant interest point detectors", International journal of computer vision 60, 1 (2004), pp. 63--86.
[7] Morel, Jean-Michel and Yu, Guoshen, "ASIFT: A new framework for fully affine invariant image comparison", SIAM Journal on Imaging Sciences 2, 2 (2009), pp. 438--469.