A Sunday on La Grande Jatte, Georges Seurat, 1884-1886 (image source)
This article explains how to map pixel distances on 2D images to the corresponding real-world distances in 3D scenes using homography estimation and applies this approach to a practical problem as a use case.
Basic knowledge of linear algebra, e.g., matrix multiplication and systems of linear equations, is recommended but not required to follow this article.
Measuring world coordinates; the challenges
Let us start off with an example to illustrate the problem that we are trying to solve here. Suppose that you want to measure the distance between people who cross a particular street using the videos captured by CCTV cameras installed there (we will refer to this example a few times throughout the rest of this article). Even if you manage to detect pedestrians in each video frame, calculating the real-world distance (according to the real-world coordinate system) between the people using the pixel coordinates in the video frame is not straightforward.
Some challenges arise when you want to calculate the world coordinates (or distances) using the pixel coordinates. For instance, in a 2D image, since the third dimension cannot be displayed, we cannot take the depth of the scene into account easily. This issue results in some inaccuracies when calculating world coordinates from pixel coordinates. In Figure 1, the two men passing by each other are keeping a distance of almost three meters. However, the distance seems less in the image view.
In some applications, this error cannot be ignored. For example, if you want to find out if people are adhering to social distancing, you need to make sure that your calculations are exact and accurate. Otherwise, your application reports inaccurate social distancing violation cases, and therefore, will be ineffective.
We need a stable approach to map 2D pixel coordinates to 3D world coordinates of each point in order to calculate the world distances of the points of interest. This article explains camera calibration using homography estimation to mitigate the challenge of mapping pixel coordinates to world coordinates without losing much accuracy.
From 3D scene to 2D image and vice versa
Cameras use perspective projection to map the world coordinates in 3D scenes into pixel coordinates in 2D images. Figure 2 illustrates an example of how cameras project the object of interest into a 2D plane using perspective projection.
We are interested in doing the exact opposite of what cameras do: finding a mapping from the pixel coordinates of a point in an image to the world coordinates of that point in the actual 3D scene.
In the social distancing application example that we mentioned earlier, if we find the pedestrians’ world coordinates, calculating the world distance between the people becomes trivial. To obtain the world coordinates, we need an approach that maps the 2D coordinates on the image to 3D coordinates in the real-world scene, in the opposite direction of what cameras do.
Pinhole camera model
Let us quickly explain the mathematics behind projecting a point from the 3D scene onto the 2D image. We will assume that we are working with an ideal pinhole camera with an aperture described as a point (see Figure 3).
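The relationship between the world coordinates and the image coordinates can be expressed using the following formula:
$$ s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \, [R \mid T] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \tag{1} $$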
On the left-hand side, we see the scale factor $s$ and a vector $[u, v, 1]^T$ representing the image coordinates. Note that we can always divide this vector by $s$ such that the third element becomes equal to 1. We calculate $s$ based on the calculations on the right-hand side to map 3D world coordinates to 2D image coordinates.
On the right-hand side, we see two matrices, $K$ and $[R \mid T]$, and a vector $[X, Y, Z, 1]^T$. The vector contains the world coordinates of the point that we are trying to map to the 2D space, with the fourth element set equal to 1.
The relationship between the world coordinates and the image coordinates depends on both intrinsic and extrinsic characteristics of the camera. Therefore, two matrices, the intrinsic matrix $K$ and the extrinsic matrix $[R \mid T]$, are introduced into the formula to capture these characteristics. Let us have a closer look at the parameters included in the intrinsic and extrinsic matrices:
The intrinsic matrix $K$ describes the camera’s internal specifications, such as the focal length and the principal point.
$f_x$, $f_y$: camera’s focal length, expressed in pixels (the physical focal length divided by the physical width and height of a pixel, respectively). Two parameters are introduced to describe cameras with rectangular pixels. If your camera uses square pixels, set $f_x = f_y$.
The extrinsic matrix $[R \mid T]$ defines the camera’s external properties, i.e., the camera’s position and rotation angle in the real-world scene.
$R$: camera’s rotation matrix that describes the camera’s rotation angle in the installed environment.
$T$: the camera’s offset from the origin of the world coordinate system.
Once you measure the mentioned parameters, you can plug them into the formula and do a simple matrix multiplication to obtain the image coordinates.
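The projection described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `project_point` is hypothetical, the intrinsic matrix uses the common layout with the focal lengths on the diagonal and the principal point in the last column, and all numeric values are made up rather than measured from a real camera.

```python
import numpy as np

def project_point(K, R, T, world_point):
    """Project a 3D world point to pixel coordinates with the pinhole model."""
    # Homogeneous world vector [X, Y, Z, 1].
    Xw = np.append(np.asarray(world_point, dtype=float), 1.0)
    # Extrinsic matrix [R | T] has shape 3x4.
    extrinsic = np.hstack([R, np.asarray(T, dtype=float).reshape(3, 1)])
    # s * [u, v, 1]^T = K [R | T] [X, Y, Z, 1]^T
    p = K @ extrinsic @ Xw
    s = p[2]  # scale factor: the third element before normalization
    return p[:2] / s, s

# Illustrative values only (assumed, not measured from a real camera).
K = np.array([[800.0, 0.0, 320.0],   # f_x, 0, principal point x
              [0.0, 800.0, 240.0],   # 0, f_y, principal point y
              [0.0, 0.0, 1.0]])
R = np.eye(3)                        # camera axes aligned with the world axes
T = np.array([0.0, 0.0, 0.0])        # no offset from the world origin

pixel, s = project_point(K, R, T, [1.0, 0.5, 4.0])
print(pixel, s)
```

Dividing by the third element is what makes the last entry of the image vector equal to 1, as noted for the left-hand side of the formula.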
Besides direct methods of measuring the camera’s intrinsic and extrinsic parameters, some alternative approaches can be used to estimate these parameters, such as Zhang’s method and Tsai’s algorithm. We encourage you to investigate these papers (1, 2) to learn more about these approaches.
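The pinhole camera model maps the 3D world coordinates to 2D image coordinates by parameterizing the intrinsic and extrinsic characteristics of the camera. If we want to do the opposite and find the 3D world coordinates from the 2D image coordinates, we need to undo both matrices: multiply the image coordinates vector by $K^{-1}$, and then reverse the offset $T$ and the rotation $R$ of the extrinsic matrix. Note that $[R \mid T]$ is a 3*4 matrix and therefore has no ordinary inverse, and that the scale factor $s$ (which encodes the depth of the point) must be known, because a single pixel corresponds to a whole ray in the 3D scene. These calculations can be described using the following formula:
$$ \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = R^{-1} \left( s \, K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} - T \right) \tag{2} $$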
You can now plug the image coordinates (together with the scale factor) into this formula and obtain the corresponding world coordinates by doing simple matrix multiplications.
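As a rough sketch of this inverse mapping, the snippet below projects a point with the pinhole model and then recovers it again. It assumes the scale factor $s$ is known, since a single pixel alone does not determine depth. The function name `back_project` and all numeric values are hypothetical, and reversing the extrinsic matrix is written as subtracting $T$ and multiplying by $R^T$ (the inverse of a rotation matrix).

```python
import numpy as np

def back_project(K, R, T, pixel, s):
    """Recover the 3D world point of a pixel, given the scale factor s."""
    p = np.array([pixel[0], pixel[1], 1.0])
    # Point in the camera frame: s * K^{-1} [u, v, 1]^T
    cam_point = s * np.linalg.solve(K, p)
    # Undo the extrinsics: X = R^{-1} (cam_point - T), with R^{-1} = R^T.
    return R.T @ (cam_point - np.asarray(T, dtype=float))

# Illustrative camera parameters (assumed values).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta), np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
T = np.array([0.1, -0.2, 0.5])

# Forward projection of a known world point, to obtain a pixel and its scale.
world = np.array([1.0, 0.5, 4.0])
p = K @ (R @ world + T)
s = p[2]
pixel = p[:2] / s

recovered = back_project(K, R, T, pixel, s)
print(recovered)
```

The round trip only works because $s$ is carried over from the forward projection; in a real image this value is usually not available, which is one reason the plane-based approach in the next section is attractive.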
Don’t need that much information? Here’s the easier way.
The previous approach maps every point in the image coordinates to the corresponding point in the real-world’s coordinate system. However, in some applications, we do not need that much information to reach our final goal.
Let us picture the example of calculating the distances between the people again. If we want to measure the distances between the pedestrians using images (or videos) taken with a camera, do we need to map every point in the image to its corresponding coordinates in the world scene?
We know that the pedestrians are walking on the same plane, which is the ground. If we represent each pedestrian with a point on the ground plane and find the world distance between these points, the problem would be solved because these points are all on the same plane. Thus, we only need to calculate the world coordinates of the points on the ground plane for each pedestrian in order to find the pairwise distances between them.
In other words, we want to map the points that lie on one plane from the image coordinates to the world coordinates. Therefore, we will be using 2D vectors to represent both image coordinates and world coordinates that reside on the same plane. In Figure 4, for example, mapping the points that reside on the blue plane is sufficient to calculate the distance between the people who pass the corridor.
In this example, the problem is reduced to determining the world coordinates of the points residing in a two-dimensional plane instead of mapping the whole space. In such cases, we can apply a more straightforward method, called homography estimation, to find the world coordinates of points that lie on a plane (rather than the whole space) using fewer parameters.
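In this approach, we use a matrix, called the homography matrix, to map the points residing on a plane from world coordinates to the corresponding image coordinates. Note that all the points that we are mapping are coplanar; therefore, we can represent the world coordinates with 2D vectors. The homography matrix $H$ is a 3*3 matrix that contains the parameters $h_{11}, h_{12}, \dots, h_{33}$. If we manage to figure out the values of these parameters, we can use the following formula to map the world coordinates to image coordinates (the scale factor $s$ again normalizes the third element of the image vector to 1):
$$ s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = H \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} \tag{3} $$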
In this formula, $[u, v, 1]^T$ is a normalized vector whose third element equals 1, and $(u, v)$ are the coordinates of a point within the image. Similarly, $[X, Y, 1]^T$ is a normalized vector whose third element equals 1, and $(X, Y)$ describes the coordinates of the same point in the world coordinate system. By substituting the values of $h_{11}, \dots, h_{33}$ in the homography matrix $H$, we can calculate the image coordinates for each given point expressed in world coordinates.
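If we want to do the mapping in the opposite direction, i.e., from the image coordinates to the world coordinates, we need to calculate the inverse of the homography matrix to get $H^{-1}$ and multiply the image coordinates vector by $H^{-1}$ to obtain the corresponding world coordinates, dividing the result by its third element so that it becomes 1. The following formula shows how we can find the world coordinates based on the image coordinates, where $s'$ is the scale factor removed by that division:
$$ s' \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} = H^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \tag{4} $$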
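To make the two directions concrete, here is a minimal NumPy sketch that maps a ground-plane point to the image with a homography matrix and back with its inverse. The helper `apply_homography` is a hypothetical name, and the matrix values are invented purely for illustration.

```python
import numpy as np

def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography and normalize the result."""
    x, y = point
    q = H @ np.array([x, y, 1.0])
    return q[:2] / q[2]  # divide by the third element so that it becomes 1

# Hypothetical homography (world -> image), chosen only for illustration.
H = np.array([[100.0, 20.0, 300.0],
              [5.0, 80.0, 200.0],
              [0.0, 0.1, 1.0]])

world_pt = np.array([2.0, 3.0])
image_pt = apply_homography(H, world_pt)                  # world -> image
recovered = apply_homography(np.linalg.inv(H), image_pt)  # image -> world
print(image_pt, recovered)
```

The same helper serves both directions because the only difference between them is which matrix is passed in.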
But how can we compute $H$?
Suppose that we have the world coordinates of four points, $(X_i, Y_i)$, with their corresponding mappings in the image coordinate system, $(u_i, v_i)$, where $i = 1, \dots, 4$. Using these pairs of points, we can write a system of eight linear equations with eight unknowns that describe $H$, after normalizing the last parameter by setting $h_{33} = 1$.
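To solve this system of linear equations, we rearrange the equations to gather all the variables on one side and form a matrix equation as follows, with two rows for each of the four point pairs:
$$ \begin{bmatrix} X_i & Y_i & 1 & 0 & 0 & 0 & -u_i X_i & -u_i Y_i \\ 0 & 0 & 0 & X_i & Y_i & 1 & -v_i X_i & -v_i Y_i \end{bmatrix} \begin{bmatrix} h_{11} \\ h_{12} \\ h_{13} \\ h_{21} \\ h_{22} \\ h_{23} \\ h_{31} \\ h_{32} \end{bmatrix} = \begin{bmatrix} u_i \\ v_i \end{bmatrix}, \quad i = 1, \dots, 4 \tag{5} $$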
We can now use linear algebra algorithms, such as SVD, to solve this system for $h_{11}, \dots, h_{32}$. Having calculated $H$, we can find the image coordinates from world coordinates using formula 3. Finding world coordinates from image coordinates is also possible by plugging $H^{-1}$ into formula 4.
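The estimation procedure can be sketched directly in NumPy. This is an illustrative version under stated assumptions, not the implementation used by any particular library: `estimate_homography` is a hypothetical name, the last parameter is fixed to 1 as described above, the pixel values are invented, and NumPy's general least-squares solver `np.linalg.lstsq` stands in for a hand-written SVD step.

```python
import numpy as np

def estimate_homography(world_pts, image_pts):
    """Estimate H (world -> image) from point pairs, with h33 fixed to 1."""
    A, b = [], []
    for (X, Y), (u, v) in zip(world_pts, image_pts):
        # Two equations per point pair, unknowns h11..h32 on the left.
        A.append([X, Y, 1, 0, 0, 0, -u * X, -u * Y])
        b.append(u)
        A.append([0, 0, 0, X, Y, 1, -v * X, -v * Y])
        b.append(v)
    A = np.array(A, dtype=float)
    b = np.array(b, dtype=float)
    h, *_ = np.linalg.lstsq(A, b, rcond=None)
    # Append h33 = 1 and reshape the nine parameters into a 3x3 matrix.
    return np.append(h, 1.0).reshape(3, 3)

# Corners of a 1m x 1m square on the ground (meters) ...
world_pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
# ... and their hypothetical positions in the image (pixels).
image_pts = [(320.0, 400.0), (420.0, 410.0), (400.0, 330.0), (310.0, 325.0)]

H = estimate_homography(world_pts, image_pts)
print(H)
```

With exactly four point pairs the system is square, so the least-squares solution reproduces the four correspondences; with more than four pairs the same code would return a best-fit estimate instead.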
All of the mathematics we explained above are implemented in a function in the OpenCV library. You can use this function to estimate the homography matrix in a single line of code. Let us explain how this method works.
Homography estimation in a single line of code
To restate the problem, you have installed and fixed your camera somewhere, and you want to obtain the world coordinates (or distances) using a single image captured by your camera.
To do so, first, mark the four corners of an imaginary square (a 1m x 1m square is recommended for ease of calculations) on the ground and choose one of these points as the origin of the world coordinate system. Then, measure the coordinates of these four points according to the origin you just set. We call these numbers the source points.
Then, without changing the camera position or the heading angle, find the coordinates of the four points from the previous step, but this time in image (pixel) coordinates. Let us call these coordinates the destination points.
We are trying to estimate the homography matrix by finding a mapping between the source points and the destination points. Luckily, the OpenCV library has a function that does the calculations for us. The findHomography function takes in two sets of points, the source points and the destination points, and estimates the homography matrix based on the corresponding coordinates of these two sets. We can use this function to calculate $H$ in a single line of code as follows:
h, status = cv2.findHomography(pts_src, pts_dst)
You can refer to the OpenCV documentation to learn more about this function. Figure 5 illustrates how this function works in practice. The four blue dots are the reference points used to calculate $H$. The green dots, computed using the homography matrix, show the ground plane. The key point is that every two consecutive green dots are the same distance apart in world coordinates, yet the closer the points are to the camera, the farther apart they appear in the image.
Camera calibration using homography estimation; a use case
To see how this camera calibration method works in practice, we implemented this method in our open-source Smart Social Distancing application. This application calculates the world distances between people in an input video to measure how well social distancing is being practiced. You can learn more about this application here.
The Smart Social Distancing application implements three methods to calculate the world distance between the people: (1) the calibration-less method comparing bounding box center points, (2) the calibration-less method comparing bounding box corners, and (3) the camera calibration method using homography estimation. The calibration-less methods are explained in this article in detail.
The user can select the method they want to use by specifying the method name in the application configuration file. If the camera calibration method is selected, the user should also specify, in the config file, the path to a .txt file that contains the inverse of the homography matrix ($H^{-1}$). In future updates, we will implement an interface for the user to mark the corners of a square that is 1 meter long on each side, so that the user can leave the rest of the calculations to the app.
Our pedestrian detection model outputs a rectangular bounding box around each detected pedestrian. For each bounding box, the bottom side of the rectangle lies on the ground plane. Thus, to measure the distance between the pedestrians, we take the middle point of each rectangle’s bottom side to estimate the intersection of the person with the ground and find its world coordinates (by multiplying $H^{-1}$ by the pixel coordinates vector). Finally, we calculate and report the pairwise Euclidean distance as the distance between the detected pedestrians.
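The steps in this paragraph can be sketched as follows. This is not the application's actual code: the function names are hypothetical, bounding boxes are assumed to be given as (x_min, y_min, x_max, y_max) in pixels with the y axis pointing down, and the inverse homography used here is a made-up pure scaling of 1 cm per pixel so that the numbers are easy to follow.

```python
import numpy as np

def ground_points(boxes):
    """Bottom-center pixel of each (x_min, y_min, x_max, y_max) box."""
    boxes = np.asarray(boxes, dtype=float)
    x_mid = (boxes[:, 0] + boxes[:, 2]) / 2.0
    y_bottom = boxes[:, 3]  # y grows downward, so y_max is the bottom side
    return np.stack([x_mid, y_bottom], axis=1)

def pixels_to_world(h_inv, pixels):
    """Map pixel coordinates to ground-plane coordinates using H^{-1}."""
    homog = np.hstack([pixels, np.ones((len(pixels), 1))])
    mapped = homog @ h_inv.T
    return mapped[:, :2] / mapped[:, 2:3]  # normalize the third element to 1

def pairwise_distances(world_pts):
    """Euclidean distance between every pair of world points."""
    diff = world_pts[:, None, :] - world_pts[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Hypothetical inverse homography: 0.01 m per pixel, no perspective effect.
h_inv = np.diag([0.01, 0.01, 1.0])
boxes = [[100, 50, 140, 200], [400, 60, 440, 200]]

world = pixels_to_world(h_inv, ground_points(boxes))
distances = pairwise_distances(world)
print(distances)
```

In practice `h_inv` would be the inverse of the matrix estimated during calibration, and the resulting distances would be compared against the chosen social distancing threshold.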
Note that since the reference square that we marked on the ground was a 1m x 1m square, the calculated distances preserve the real-world scale and are in meters. Therefore, no further calculation, such as rescaling or normalization, is needed.
Cameras project 3D scenes onto 2D images. The pinhole camera model shows the relationship between the world coordinates and image coordinates using the intrinsic and extrinsic matrices that capture different characteristics of the camera.
If we want to find the relationship between the image coordinates and world coordinates of points that reside on the same plane (coplanar points), we can use homography estimation. The homography matrix can be estimated using two corresponding sets of points, one in world coordinates and the other in image coordinates. The OpenCV library provides the findHomography function, which takes in the two sets of points as input and returns the homography matrix as output.
We have implemented the homography estimation algorithm for camera calibration in our Smart Social Distancing application. In future updates, we will add an interface for camera calibration in our app.