Using EasyAR in a 3D engine

To use EasyAR in a 3D engine, you must render both the camera feed and the virtual objects, and the virtual objects must be aligned with the camera feed. When rendering the camera feed, parameters such as the physical camera's position, orientation, frame size, and aspect ratio may not match those of the display screen, and this must be accounted for during rendering. If you are integrating EasyAR into a 3D engine that is not officially supported, pay special attention to the following details.

Cropping camera feed boundary padding

Cropping, transposing, and encoding images require significant computation. To minimize calculations and reduce latency, raw formats are typically used. For video encoding convenience, images output by the physical camera often align to blocks like 8x8, 16x16, 32x32, or 64x64. For instance, selecting a 1920x1080 resolution on some phones might output a 1920x1088 image because 1080 isn't a multiple of 64.
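The padded size is simply the requested size rounded up to the next multiple of the block size. A minimal sketch (plain Python; the helper name is illustrative):

```python
def align_up(x, block):
    """Round x up to the next multiple of block (ceiling division)."""
    return -(-x // block) * block

# 1080 is not a multiple of 64, so the camera pads the frame height:
padded_h = align_up(1080, 64)  # 1088
padded_w = align_up(1920, 64)  # 1920 (already aligned, unchanged)
```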

(Figure: camera image with boundary padding)

This padding must be removed during rendering. Several approaches exist: one specifies the row length when uploading the image to video memory (e.g., using glPixelStorei(GL_UNPACK_ROW_LENGTH, ...) in OpenGL); another rescales the UV coordinates in the fragment shader so that sampling never reaches beyond the valid area.
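For the UV approach, the valid region occupies only a fraction of the padded texture, so the [0, 1] UV range must be scaled down accordingly before sampling. A numeric sketch of that scale factor (plain Python, not shader code; the function name is illustrative):

```python
def uv_crop_scale(valid_w, valid_h, padded_w, padded_h):
    """Scale factors that map [0, 1] texture coordinates onto the
    valid (unpadded) region of a padded camera image."""
    return valid_w / padded_w, valid_h / padded_h

# A 1920x1080 frame delivered in a 1920x1088 buffer:
u_scale, v_scale = uv_crop_scale(1920, 1080, 1920, 1088)
# u_scale = 1.0, v_scale = 1080/1088; the shader multiplies UVs by these.
```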

Rendering according to screen rotation

On mobile devices, images captured by the physical camera are typically fixed relative to the device body and do not change with the screen orientation. However, changes in the device's orientation affect the definition of up, down, left, and right in the image. During rendering, the current screen display orientation also influences the displayed image's direction.

It is therefore usually necessary to determine the rotation angle of the camera image relative to the screen's current display orientation.

Let \(\theta_{screen}\) represent the clockwise rotation radians of the screen image relative to its natural orientation. Let \(\theta_{phycam}\) represent the clockwise rotation radians the physical camera image needs to display correctly on a naturally oriented screen. Let \(\theta\) represent the clockwise rotation radians the physical camera image needs to display on the current screen.

For the rear camera, we have:

\[ \theta = \theta_{phycam} - \theta_{screen} \]

For example, on an Android phone used in its natural orientation, \(\theta_{screen} = 0, \theta_{phycam} = \frac{\pi}{2}\), thus \(\theta = \frac{\pi}{2}\).

For the front camera, if a left-right flip is applied after rotation, we have:

\[ \theta = \theta_{phycam} + \theta_{screen} \]
Note

When the screen orientation changes, \(\theta\) must be recalculated in the first frame after the rotation occurs; otherwise the camera image may briefly be displayed in the wrong orientation.
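The two formulas above can be sketched as a single function (plain Python; the function name and the normalization to \([0, 2\pi)\) are illustrative choices, not part of the EasyAR API):

```python
import math

def camera_rotation(theta_phycam, theta_screen, front_facing):
    """Clockwise rotation (radians) to apply to the camera image for
    the current screen orientation. For the front camera, the
    left-right flip applied after rotation changes the sign of the
    screen term, hence addition instead of subtraction."""
    if front_facing:
        theta = theta_phycam + theta_screen
    else:
        theta = theta_phycam - theta_screen
    return theta % (2 * math.pi)

# Rear camera on an Android phone in its natural (portrait) orientation:
theta = camera_rotation(math.pi / 2, 0.0, front_facing=False)  # pi/2
```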

Rendering the camera background and virtual objects

Rendering virtual objects on a mobile device requires aligning them with the camera feed. This necessitates placing the rendering camera and objects in virtual space corresponding exactly to real space and rendering with the physical camera's identical field of view and aspect ratio. The perspective projection transformation applied to the camera feed and virtual objects is almost identical, differing mainly in that most of the perspective transformation for the camera feed occurs within the physical camera, while for virtual objects, it's entirely a computational process.

The following uses the OpenGL convention; other conventions require a corresponding mapping of coordinate axes. Assume the camera coordinate system is defined as: x-axis pointing right, y-axis pointing up, z-axis pointing out of the screen (toward the viewer). The clip coordinate system is defined as: x-axis pointing right, y-axis pointing up, z-axis pointing away from the viewer (into the screen), plus a homogeneous w component.

The perspective projection matrix required for rendering the camera feed is:

\[ P_i=\left( \begin{array}{cccc} (-1)^{\text{flip}} & \phantom{0} & \phantom{0} & \phantom{0} \\ \phantom{0} & 1 & \phantom{0} & \phantom{0} \\ \phantom{0} & \phantom{0} & 1 & \phantom{0} \\ \phantom{0} & \phantom{0} & \phantom{0} & 1 \\ \end{array} \right)\left( \begin{array}{cccc} \cos (-\theta ) & -\sin (-\theta ) & \phantom{0} & \phantom{0} \\ \sin (-\theta ) & \cos (-\theta ) & \phantom{0} & \phantom{0} \\ \phantom{0} & \phantom{0} & 1 & \phantom{0} \\ \phantom{0} & \phantom{0} & \phantom{0} & 1 \\ \end{array} \right)\left( \begin{array}{cccc} s_x & \phantom{0} & \phantom{0} & \phantom{0} \\ \phantom{0} & s_y & \phantom{0} & \phantom{0} \\ \phantom{0} & \phantom{0} & 1 & \phantom{0} \\ \phantom{0} & \phantom{0} & \phantom{0} & 1 \\ \end{array} \right) \]

Where: flip indicates whether the image is flipped left-right (1 for flipped, 0 for not flipped); \(\theta\) is the clockwise rotation angle of the image in radians; \(s_x\), \(s_y\) are scaling factors for proportional scaling or padding, which vary with \(\theta\). This transformation matrix first scales the camera image, then rotates it, and finally flips it. Render using a rectangle covering the entire screen, e.g., in OpenGL, place rectangle vertices at \((-1, -1, 0)\), \((1, -1, 0)\), \((1, 1, 0)\), \((-1, 1, 0)\) with UV coordinates at the corresponding corners, then render using this perspective projection matrix.
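The product flip · rotation(−θ) · scale can be built numerically as a sanity check. The sketch below uses plain Python lists rather than engine matrix types, and the function names are illustrative:

```python
import math

def mat_mul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def background_projection(theta, flip, sx, sy):
    """P_i = Flip * Rot(-theta) * Scale, per the formula above."""
    flip_m = [[(-1) ** flip, 0, 0, 0], [0, 1, 0, 0],
              [0, 0, 1, 0], [0, 0, 0, 1]]
    c, s = math.cos(-theta), math.sin(-theta)
    rot = [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    scale = [[sx, 0, 0, 0], [0, sy, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    return mat_mul(flip_m, mat_mul(rot, scale))

# theta = pi/2, no flip, unit scale: the image is rotated clockwise 90 deg.
p = background_projection(math.pi / 2, 0, 1.0, 1.0)
```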

The perspective projection matrix required for rendering virtual objects is:

\[ P=P_i\left( \begin{array}{cccc} 1 & \phantom{0} & \phantom{0} & \phantom{0} \\ \phantom{0} & 1 & \phantom{0} & \phantom{0} \\ \phantom{0} & \phantom{0} & -\frac{f+n}{f-n} & -\frac{2 f n}{f-n} \\ \phantom{0} & \phantom{0} & -1 & \phantom{0} \\ \end{array} \right)\left( \begin{array}{cccc} \frac{2}{w} & \phantom{0} & 1 & \phantom{0} \\ \phantom{0} & \frac{2}{h} & -1 & \phantom{0} \\ \phantom{0} & \phantom{0} & 1 & \phantom{0} \\ \phantom{0} & \phantom{0} & \phantom{0} & 1 \\ \end{array} \right)\left( \begin{array}{cccc} 1 & \phantom{0} & \phantom{0} & \phantom{0} \\ \phantom{0} & -1 & \phantom{0} & \phantom{0} \\ \phantom{0} & \phantom{0} & -1 & \phantom{0} \\ \phantom{0} & \phantom{0} & \phantom{0} & 1 \\ \end{array} \right)\left( \begin{array}{cccc} f_x & \phantom{0} & c_x & \phantom{0} \\ \phantom{0} & f_y & c_y & \phantom{0} \\ \phantom{0} & \phantom{0} & 1 & \phantom{0} \\ \phantom{0} & \phantom{0} & \phantom{0} & 1 \\ \end{array} \right)\left( \begin{array}{cccc} 1 & \phantom{0} & \phantom{0} & \phantom{0} \\ \phantom{0} & -1 & \phantom{0} & \phantom{0} \\ \phantom{0} & \phantom{0} & -1 & \phantom{0} \\ \phantom{0} & \phantom{0} & \phantom{0} & 1 \\ \end{array} \right) \]

Where: \(n\), \(f\) are the near and far clip parameters used in typical 3D rendering perspective projection matrices; \(w\), \(h\) are the pixel width and height of the camera image; \(f_x\), \(f_y\), \(c_x\), \(c_y\) are intrinsic parameters common in camera models, with \(f_x\), \(f_y\) being the pixel focal lengths and \(c_x\), \(c_y\) being the principal point pixel locations. This projection matrix performs the following transformations sequentially: the intrinsic perspective projection transformation (including two coordinate system transformations due to the opposite y and z axis directions between the OpenCV image coordinate system and the OpenGL camera coordinate system), transformation from the image pixel coordinate system to the image rectangle coordinate system, near and far clip transformation, and the perspective projection transformation used when rendering the camera feed.

Simplifying, we get:

\[ P=P_i\left( \begin{array}{cccc} \frac{2 f_x}{w} & \phantom{0} & 1-\frac{2 c_x}{w} & \phantom{0} \\ \phantom{0} & \frac{2 f_y}{h} & -1+\frac{2 c_y}{h} & \phantom{0} \\ \phantom{0} & \phantom{0} & -\frac{f+n}{f-n} & -\frac{2 f n}{f-n} \\ \phantom{0} & \phantom{0} & -1 & \phantom{0} \\ \end{array} \right) \]
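As a sanity check on the simplified matrix, the sketch below (plain Python; function names are illustrative, and \(P_i\) is taken as the identity) builds it from the intrinsics and verifies the expected behavior: a point on the optical axis projects to the screen center when the principal point is at the image center, and the near and far clip planes map to NDC depths \(-1\) and \(+1\):

```python
def virtual_projection(fx, fy, cx, cy, w, h, n, f):
    """Simplified projection matrix for virtual objects, per the
    formula above, with the camera-background term P_i omitted
    (i.e. P_i = identity)."""
    return [
        [2 * fx / w, 0, 1 - 2 * cx / w, 0],
        [0, 2 * fy / h, -1 + 2 * cy / h, 0],
        [0, 0, -(f + n) / (f - n), -2 * f * n / (f - n)],
        [0, 0, -1, 0],
    ]

def project(m, p):
    """Apply a 4x4 matrix to a homogeneous point and divide by w."""
    out = [sum(m[i][j] * p[j] for j in range(4)) for i in range(4)]
    return [c / out[3] for c in out[:3]]

# Illustrative intrinsics: 1920x1080 image, principal point at center.
m = virtual_projection(1400, 1400, 960, 540, 1920, 1080, 0.1, 100)
ndc = project(m, [0, 0, -0.1, 1])  # optical axis, on the near plane
```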

From this process, we see rendering typically requires two passes: one for the camera feed and one for virtual objects, with virtual objects overlaid on top of the camera feed.

Some 3D engines represent the perspective projection matrix using parameters such as the horizontal field of view angle and the aspect ratio. Ignoring rotation, flip, and principal point offset, these can be computed as: horizontal field of view angle \(\alpha=2 \arctan{\frac{w}{2 f_x}}\) and aspect ratio \(r=\frac{w}{h}\).
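A direct numeric sketch of that conversion (plain Python; the function name is illustrative):

```python
import math

def fov_params(fx, w, h):
    """Horizontal field of view (radians) and aspect ratio derived
    from intrinsics, ignoring rotation, flip, and principal-point
    offset."""
    return 2 * math.atan(w / (2 * fx)), w / h

# With fx equal to half the image width, the horizontal FOV is 90 degrees:
fov, aspect = fov_params(960, 1920, 1080)  # pi/2, 16/9
```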

Note that this process does not account for camera distortion, as distortion is generally very minor in most modern phone cameras.