Using EasyAR in 3D engines
Using EasyAR in a 3D engine requires rendering both the camera image and the virtual objects, and the virtual objects must stay aligned with the camera image. When rendering the camera image, several parameters of image capture may not match those of the display: the physical camera's position, orientation, frame size, and aspect ratio can all differ from those of the screen. These factors must be taken into account during rendering. When integrating EasyAR into a 3D engine that is not officially supported, pay special attention to the following details.
Cropping of camera image boundary padding
Image cropping, transposition, and encoding all require significant computational resources. To minimize computation and reduce latency, raw formats are generally used. To facilitate video encoding, images output by physical cameras are often aligned to grids such as 8x8, 16x16, 32x32, or 64x64. For example, when selecting a resolution of 1920x1080 on some mobile phones, the output image might become 1920x1088 because 1080 is not a multiple of 64.

These excess padded areas must be removed during rendering. Several approaches are possible: one is to specify the row length when uploading the image to video memory, which can be done in OpenGL with glPixelStorei(GL_UNPACK_ROW_LENGTH, ...); another is to compute the UV coordinates manually in the fragment shader and clamp sampling so that the area beyond the valid image boundary is never sampled.
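For example, a minimal sketch of the first approach (the function name is illustrative; it assumes an OpenGL ES 3.0 context, where GL_UNPACK_ROW_LENGTH is available, and a frame that has already been converted to RGBA):

```cpp
#include <GLES3/gl3.h>

// Sketch: upload only the valid region of a padded camera frame.
// paddedWidth is the buffer row stride in pixels; width/height is the valid
// image size (e.g. 1920x1080 inside a 1920x1088 buffer).
void uploadCameraFrame(GLuint texture, const unsigned char* pixels,
                       int width, int height, int paddedWidth)
{
    glBindTexture(GL_TEXTURE_2D, texture);
    // Each source row is paddedWidth pixels long, so any extra columns on
    // the right are skipped while reading from client memory.
    glPixelStorei(GL_UNPACK_ROW_LENGTH, paddedWidth);
    glPixelStorei(GL_UNPACK_ALIGNMENT, 1);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, pixels);
    // Rows beyond 'height' are never read, so bottom padding is dropped too.
    glPixelStorei(GL_UNPACK_ROW_LENGTH, 0);
}
```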
Rendering with screen rotation direction
On mobile phones, images captured by the physical camera are typically fixed relative to the device body and do not change with the screen orientation. However, changes in the device's physical orientation affect our definition of top, bottom, left, and right directions for the image. During rendering, the current screen display orientation also influences the direction of the displayed image.
Usually, when rendering, it is necessary to determine the rotation angle of the camera image relative to the screen display orientation.
Denote by \(\theta_{screen}\) the clockwise rotation in radians of the screen image relative to the screen's natural orientation, by \(\theta_{phycam}\) the clockwise rotation in radians required for the physical camera image to display correctly on a naturally oriented screen, and by \(\theta\) the clockwise rotation in radians required for the physical camera image to display correctly on the current screen.
For the rear camera, we have \(\theta = (\theta_{phycam} - \theta_{screen}) \bmod 2\pi\).
For example, on an Android phone used in natural orientation, \(\theta_{screen} = 0, \theta_{phycam} = \frac{\pi}{2}\), thus \(\theta = \frac{\pi}{2}\).
For the front camera, if a left-right flip is performed after rotation, we have \(\theta = (\theta_{phycam} + \theta_{screen}) \bmod 2\pi\).
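A small sketch of this calculation (the helper name is illustrative; the two angles are assumed to be obtained from the platform's display and camera APIs):

```cpp
#include <cmath>

// Sketch: clockwise rotation (in radians) to apply to the camera image for
// the current screen orientation, following the relations above.
double cameraImageRotation(double thetaScreen, double thetaPhycam,
                           bool frontCamera)
{
    const double twoPi = 2.0 * M_PI;
    // Rear camera:  theta = (theta_phycam - theta_screen) mod 2*pi
    // Front camera (rotate first, then flip left-right):
    //               theta = (theta_phycam + theta_screen) mod 2*pi
    double theta = frontCamera ? thetaPhycam + thetaScreen
                               : thetaPhycam - thetaScreen;
    return std::fmod(std::fmod(theta, twoPi) + twoPi, twoPi);
}
```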
Note
When the screen orientation changes, \(\theta\) must be recalculated immediately, in the very first frame after the rotation occurs; otherwise the camera image may momentarily appear in the wrong orientation.
Camera background and virtual object rendering
To render virtual objects on a mobile phone, the virtual objects must be aligned with the camera image. This requires placing the rendering camera and the objects in a virtual space that corresponds exactly to the real space, and rendering with the same field of view and aspect ratio as the physical camera. The perspective projection transformations applied to the camera image and to the virtual objects are almost identical, with one difference: for the camera image the perspective projection mostly happens optically inside the physical camera, while for the virtual objects it is entirely a computational process.
The following adopts the OpenGL convention. If other conventions are used, corresponding coordinate axis mappings are required. Assume the camera coordinate system is defined as follows: the x-axis points to the right, the y-axis points upward, and the z-axis points outward from the screen. The clip coordinate system is defined as follows: the x-axis points to the right, the y-axis points upward, the z-axis points outward from the screen, and the w-axis is a virtual axis.
At this point, the perspective projection transformation matrix required to render the camera image can be written as follows (one possible explicit form; the factors apply, from right to left, a scale, a clockwise rotation by \(\theta\), and a flip):
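\[
\begin{bmatrix}
(-1)^{flip} & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
\cos\theta & \sin\theta & 0 & 0 \\
-\sin\theta & \cos\theta & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
s_x & 0 & 0 & 0 \\
0 & s_y & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\]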
Here, \(flip\) indicates whether the image is mirrored left-right (as for the front camera described above), taking the value 1 when flipped and 0 when not; \(\theta\) is the clockwise rotation angle of the image in radians; \(s_x\) and \(s_y\) are scaling coefficients used for aspect-preserving fit or fill scaling, and they vary with \(\theta\). This transformation matrix first scales the camera image, then rotates it, and finally flips it. When rendering, a rectangle should be drawn to fill the screen. For example, in OpenGL the rectangle's vertices can be placed at \((-1, -1, 0)\), \((1, -1, 0)\), \((1, 1, 0)\), \((-1, 1, 0)\), with UV coordinates set at the corresponding four corners, and the rectangle rendered using this perspective projection matrix.
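A sketch of how this matrix might be assembled on the CPU (column-major storage, as expected by glUniformMatrix4fv with transpose set to GL_FALSE; \(s_x\) and \(s_y\) are assumed to have been computed beforehand for fit or fill scaling):

```cpp
#include <array>
#include <cmath>

// Sketch: 4x4 matrix used to draw the camera image, composed as
// flip * rotation(theta, clockwise) * scale, stored in column-major order.
std::array<float, 16> cameraImageProjection(float theta, bool flip,
                                            float sx, float sy)
{
    const float c = std::cos(theta);
    const float s = std::sin(theta);
    const float f = flip ? -1.0f : 1.0f;
    return {
        f * c * sx, -s * sx, 0.0f, 0.0f,  // column 0
        f * s * sy,  c * sy, 0.0f, 0.0f,  // column 1
        0.0f,        0.0f,   1.0f, 0.0f,  // column 2
        0.0f,        0.0f,   0.0f, 1.0f   // column 3
    };
}
```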
The perspective projection matrix required to render virtual objects is as follows:
Where:
- \(n\) and \(f\) are the near and far clipping parameters commonly used in 3D rendering perspective projection matrices;
- \(w\) and \(h\) are the pixel width and height of the camera image;
- \(f_x\), \(f_y\), \(c_x\), \(c_y\) are the intrinsic parameters commonly used in the camera model, where \(f_x\) and \(f_y\) are the pixel focal lengths, and \(c_x\) and \(c_y\) are the principal point pixel positions.
This projection matrix performs the following transformations in sequence:
- The perspective projection transformation of the camera intrinsics (since the y and z-axis directions in the OpenCV image coordinate system are opposite to those in the OpenGL camera coordinate system, two coordinate system transformations are performed);
- The transformation from the image pixel coordinate system to the image rectangle coordinate system;
- The near and far clipping transformations;
- The perspective projection transformation when rendering the camera image.
After simplification, we obtain:
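A sketch of how the intrinsics and clipping part of this matrix can be built (the helper name and column-major layout are illustrative assumptions; in practice the result is premultiplied by the camera image projection matrix above to account for rotation and flipping):

```cpp
#include <array>

// Sketch: OpenGL-style projection matrix from pinhole intrinsics
// (fx, fy, cx, cy in pixels), image size (w, h in pixels) and near/far
// clipping distances (n, f). Row-major form:
// [ 2*fx/w   0        1 - 2*cx/w     0            ]
// [ 0        2*fy/h   2*cy/h - 1     0            ]
// [ 0        0        -(f+n)/(f-n)   -2*f*n/(f-n) ]
// [ 0        0        -1             0            ]
// Returned in column-major order for glUniformMatrix4fv.
std::array<float, 16> projectionFromIntrinsics(float fx, float fy,
                                               float cx, float cy,
                                               float w, float h,
                                               float n, float f)
{
    return {
        2.0f * fx / w,        0.0f,                 0.0f,                    0.0f,   // column 0
        0.0f,                 2.0f * fy / h,        0.0f,                    0.0f,   // column 1
        1.0f - 2.0f * cx / w, 2.0f * cy / h - 1.0f, -(f + n) / (f - n),      -1.0f,  // column 2
        0.0f,                 0.0f,                 -2.0f * f * n / (f - n), 0.0f    // column 3
    };
}
```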
From the above process, it can be seen that rendering typically needs to be performed in two passes: one for the camera image and one for the virtual objects, with the virtual objects overlaid on top of the camera image.
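In plain OpenGL, the two passes might be structured like this (a sketch; drawCameraQuad and drawVirtualObjects are hypothetical functions standing for the two draw calls described above, each binding its own shader and projection matrix):

```cpp
#include <GLES3/gl3.h>

void drawCameraQuad();      // hypothetical: full-screen rectangle with the camera image
void drawVirtualObjects();  // hypothetical: scene rendered with the intrinsics projection

// Sketch of the two-pass structure: the camera background is drawn first
// with depth writes disabled, then the virtual objects are drawn on top
// with normal depth testing.
void renderFrame()
{
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    glDisable(GL_DEPTH_TEST);
    glDepthMask(GL_FALSE);
    drawCameraQuad();

    glEnable(GL_DEPTH_TEST);
    glDepthMask(GL_TRUE);
    drawVirtualObjects();
}
```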
Some 3D engines represent the perspective projection matrix using parameters such as the horizontal field of view and the aspect ratio. Ignoring rotation, flipping, and the principal point offset, these can be calculated as follows (see the short sketch after this list):
- Horizontal field of view: \(\alpha=2 \arctan{\frac{w}{2 f_x}}\);
- Aspect ratio: \(r=\frac{w}{h}\).
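For engines that take these parameters directly, a minimal conversion sketch (the function name is illustrative):

```cpp
#include <cmath>

// Sketch: horizontal field of view (radians) and aspect ratio from the
// intrinsics, ignoring rotation, flipping and the principal point offset.
void fovAndAspectFromIntrinsics(float fx, float w, float h,
                                float& horizontalFov, float& aspectRatio)
{
    horizontalFov = 2.0f * std::atan(w / (2.0f * fx));
    aspectRatio = w / h;
}
```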
Note that this process does not account for camera distortion, as the distortion in most modern mobile phone cameras is very slight.