AR-driven 3D rendering

Developing AR applications requires solving a fundamental problem: rendering AR content. This article will use planar image tracking as an example to describe the basic modules, processes, and rendering implementation of AR applications.

Typical AR application process

A typical AR application usually involves recognizing specific images, objects, or scenes from camera images, tracking their positions and poses, and rendering virtual content (3D models) accordingly.

(Figure: planar image tracking)

For example, the image above shows an AR application performing planar image tracking.

Below is a schematic diagram of the application process.

```mermaid
flowchart TD
    CameraDevice[Camera Device]
    Tracker[Tracker]
    Renderer[Renderer]

    CameraDevice -->|Image Frame| Tracker
    Tracker -->|Image Frame + Tracked Pose| Renderer
```

The process includes the following modules.

| Module | Function |
| --- | --- |
| Physical camera | Provides a sequence of input image frames. Each frame includes the image, a timestamp of when it was captured, and sometimes the camera's position and pose in space |
| Tracker | Computes the position and pose of the tracking target from image frames. Depending on the tracking target, various trackers exist, such as planar image trackers and 3D object trackers |
| Renderer | Renders the camera image and the 3D models corresponding to tracked objects onto the screen. On some AR glasses, only the 3D models are rendered, not the camera image |
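
The flow between these three modules can be sketched as a simple per-frame loop. The types and method names below are hypothetical stand-ins for whatever a real AR SDK provides; the point is only the data flow from camera to tracker to renderer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    image: bytes                                # raw pixel data
    timestamp: float                            # capture time in seconds
    camera_pose: Optional[List[float]] = None   # optional 4x4 pose, row-major


@dataclass
class TrackedPose:
    target_id: str
    pose: List[float]                           # 4x4 pose of the target, row-major


def run_ar_loop(camera, tracker, renderer, num_frames=1):
    """One camera frame flows through the tracker and renderer per iteration."""
    for _ in range(num_frames):
        frame = camera.capture()          # Camera Device -> image frame
        poses = tracker.track(frame)      # Tracker -> image frame + tracked poses
        renderer.render(frame, poses)     # Renderer -> screen
```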

Rendering on mobile phones

Rendering on mobile phones is divided into two parts: rendering camera images and rendering virtual objects.

Rendering camera images

(Figure: camera image)

When rendering camera images, some parameters need attention.

  • Scaling mode

    Usually, the camera image needs to fill the entire screen or a window, which may lead to aspect ratio mismatches between the camera image and the screen/window.

    Assuming we require the center of the camera image to align with the center of the screen/window while maintaining the aspect ratio, there are two common scaling modes: scale to fit and scale to fill.

    | Scaling mode | Effect |
    | --- | --- |
    | Scale to fit | Displays all content on the screen, but may leave black bars on the sides or top/bottom |
    | Scale to fill | No black bars, but crops part of the image on the sides or top/bottom |
  • Camera image rotation

    On mobile phones, images captured by the physical camera are typically fixed relative to the device body and do not change with the screen orientation. However, changes in the device's orientation affect our definition of the image's up, down, left, and right directions. During rendering, the current screen display orientation also affects the direction of the displayed image.

    Usually, during rendering, the rotation angle of the camera image relative to the screen display direction needs to be determined.

  • Camera image flip

    When the front camera is used, the image usually needs to be flipped horizontally so that it appears mirrored, as users expect.
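
The three adjustments above (scaling mode, rotation, and mirroring) can be combined into a small sketch that computes how large the camera image appears in a view. The function names and parameters are illustrative, not taken from any particular SDK.

```python
def camera_image_display_size(image_w, image_h, view_w, view_h,
                              rotation_deg=0, fill=False):
    """Compute the displayed size of a camera image centered in a view.

    rotation_deg: rotation of the camera image relative to the screen,
                  assumed to be a multiple of 90 degrees (0, 90, 180, 270).
    fill: False -> scale to fit (black bars possible),
          True  -> scale to fill (cropping possible).
    """
    # A 90- or 270-degree rotation swaps the image's effective width/height.
    if rotation_deg % 180 == 90:
        image_w, image_h = image_h, image_w

    scale_x = view_w / image_w
    scale_y = view_h / image_h
    # Scale to fit keeps everything visible (smaller scale);
    # scale to fill covers the whole view (larger scale).
    scale = max(scale_x, scale_y) if fill else min(scale_x, scale_y)
    return image_w * scale, image_h * scale


def mirror_u(u):
    """Horizontally flip a texture coordinate for front-camera mirroring."""
    return 1.0 - u
```

For instance, a 1280x720 landscape frame shown un-rotated in a 1080x1920 portrait view fits at 1080x607.5, leaving black bars above and below; rotating it 90 degrees first makes its effective aspect match the view.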

Rendering virtual objects

(Figure: virtual object)

Rendering virtual objects on mobile phones requires aligning them with the camera image. Both the rendering camera and the virtual objects must be placed in a virtual space that corresponds exactly to real space, and rendered with the same field of view and aspect ratio as the physical camera. The perspective projection applied to the camera image and to the virtual objects is identical; the difference is that for the camera image the projection happens optically inside the physical camera, while for virtual objects it is entirely a computational process.
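
One common way to match the physical camera is to build an OpenGL-style perspective projection matrix from the camera's intrinsics (focal lengths and principal point in pixels). This is a sketch under the standard pinhole model; the signs of the principal-point terms depend on the image-origin convention, and a real SDK may supply this matrix directly.

```python
import math


def projection_from_intrinsics(fx, fy, cx, cy, width, height,
                               near=0.1, far=100.0):
    """OpenGL-style projection matrix (camera looking down -z) built from
    pinhole intrinsics, returned row-major as a list of 4 rows."""
    return [
        [2 * fx / width, 0.0, 1.0 - 2 * cx / width, 0.0],
        [0.0, 2 * fy / height, 2 * cy / height - 1.0, 0.0],
        [0.0, 0.0, -(far + near) / (far - near),
         -2 * far * near / (far - near)],
        [0.0, 0.0, -1.0, 0.0],
    ]


def vertical_fov_deg(fy, height):
    """Vertical field of view implied by the intrinsics, in degrees."""
    return math.degrees(2 * math.atan(height / (2 * fy)))
```

Because both the camera image and the virtual objects are drawn with this same matrix, a virtual object placed at the tracked pose lines up with its real-world counterpart on screen.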

Rendering on head-mounted displays

Rendering on head-mounted displays differs somewhat from mobile phones and can be divided into two scenarios.

  • VST

    Video See-Through refers to AR technology in which the headset captures images with a physical camera and displays both the camera image and virtual content on the headset's screen; a typical example is Vision Pro. The perspective projection matrices for the camera image and virtual content are usually provided by the headset's SDK, and the application only needs to supply the position and pose of the virtual content. The physical camera used for tracking and the virtual camera that renders to the screen may sit at different positions, so coordinate transformations are needed during rendering.

  • OST

    Optical See-Through refers to AR technology in which the headset's display is transparent and only virtual content is drawn on it; a typical example is HoloLens. The perspective projection matrix for virtual content is usually provided by the headset's SDK, and the application only needs to supply the position and pose of the virtual content. As with VST, the physical camera used for tracking and the virtual camera that renders to the screen may sit at different positions, so coordinate transformations are needed during rendering.
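
Both cases need the same coordinate transformation: a target pose expressed in the tracking camera's frame must be re-expressed in the rendering camera's frame. A minimal sketch with 4x4 row-major matrices follows; the extrinsic transform between the two cameras is an assumed input that a real headset SDK would supply.

```python
def mat4_mul(a, b):
    """Multiply two 4x4 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def pose_in_render_camera(render_from_track, target_in_track):
    """Re-express a tracked target's pose in the rendering camera's frame.

    render_from_track: 4x4 extrinsic transform from the tracking camera's
                       frame to the rendering camera's frame (from the SDK).
    target_in_track:   4x4 pose of the tracked target in the tracking
                       camera's frame (output of the tracker).
    """
    return mat4_mul(render_from_track, target_in_track)
```

For example, if the rendering camera sits 5 cm to the side of the tracking camera, the extrinsic transform is a 5 cm translation, and applying it shifts every tracked pose accordingly before rendering.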

Platform-specific guides

AR-driven 3D rendering is closely tied to the platform. Please refer to the guide for your target platform: