Input frame data requirements for external frame data sources
For an external frame data source to function properly, the most critical and most challenging task is guaranteeing data accuracy. This document describes the input frame data requirements for external frame data sources.
Before you begin
- Understand fundamental concepts such as cameras and input frames.
- Grasp the basic concepts and common types of external frame data sources.
Input frame data types
In Unity, external frame data sources typically need to receive different data at two distinct times. Based on when the external data is input and on its characteristics, these two sets of data are categorized as:
- Camera frame data
- Rendering frame data
Different types of external frame data sources have varying requirements for these two data sets:
- Image and device motion data input extension: Requires both camera frame data and rendering frame data.
- Image input extension: Only requires camera frame data.
Camera frame data
Data requirements:
- Timestamp
- Raw physical camera image data
- Intrinsics (including image size, focal length, principal point. Distortion model and parameters are also needed if distortion exists)
- Extrinsics (Tcw or Twc, calibrated matrix expressing the physical offset of the physical camera relative to the device/head pose origin)
- Tracking status
- Device pose
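The requirements above can be sketched as a single data bundle. The type and field names below are illustrative only, not part of the EasyAR API, and `System.Numerics` types stand in for Unity's:

```csharp
using System;
using System.Numerics;

// Illustrative bundle of the camera frame data listed above;
// all names are hypothetical, not part of the EasyAR API.
public struct CameraFrameData
{
    public double Timestamp;            // seconds, at the midpoint of exposure
    public IntPtr ImageData;            // raw physical camera image data
    public int ImageWidth, ImageHeight; // intrinsics: image size
    public double Fx, Fy, Cx, Cy;       // intrinsics: focal length and principal point (pixels)
    public double[] Distortion;         // distortion parameters, if distortion exists
    public Matrix4x4 Extrinsics;        // Tcw or Twc: physical camera vs. device/head pose origin
    public int TrackingStatus;          // device-defined tracking state
    public Vector3 DevicePosition;      // device pose at the timestamp
    public Quaternion DeviceRotation;
}
```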
Data timing:
- Midpoint of the physical camera exposure.
Data usage:
- API call timing: Can vary based on external code design. A common approach, used by most devices, is to query during the 3D engine's render update, then decide based on the device data's timestamp whether to proceed with further processing.
- API call thread: 3D engine's game thread or any other thread (if all external APIs used are thread-safe).
A Unity API call example follows:
void TryInputCameraFrameData()
{
    // Query the latest frame from the device SDK (device-specific);
    // skip processing if no new frame has arrived since the last call.
    double timestamp = 0; // frame timestamp from the device, in seconds
    if (timestamp == curTimestamp) { return; }
    curTimestamp = timestamp;

    // Frame description, filled from the device SDK.
    PixelFormat format = default;
    Vector2Int size = default;
    Vector2Int pixelSize = default;
    int bufferSize = 0;

    var bufferO = TryAcquireBuffer(bufferSize);
    if (bufferO.OnNone) { return; }
    var buffer = bufferO.Value;

    IntPtr imageData = IntPtr.Zero; // pointer to the raw image data from the device
    buffer.tryCopyFrom(imageData, 0, 0, bufferSize);

    var historicalHeadPose = new Pose(); // device pose at the frame timestamp
    MotionTrackingStatus trackingStatus = (MotionTrackingStatus)(-1); // map from the device tracking state

    using (buffer)
    using (var image = Image.create(buffer, format, size.x, size.y, pixelSize.x, pixelSize.y))
    {
        // deviceCamera and cameraParameters are created elsewhere to match the physical camera.
        HandleCameraFrameData(deviceCamera, timestamp, image, cameraParameters, historicalHeadPose, trackingStatus);
    }
}
Rendering frame data
Data requirements:
- Timestamp
- Tracking status
- Device pose
Data timing:
- Display time. TimeWarp is not accounted for. The device pose data for the same moment will be used by the external system (e.g., device SDK) to set the virtual camera's transform for rendering the current frame.
Note
TimeWarp (sometimes called Reprojection or ATW/PTW) is a common latency-reduction technique in VR/AR headsets. It warps the image after rendering is complete, based on the latest head pose, to compensate for head movement during rendering. EasyAR requires the pose data corresponding to the moment when the virtual camera is set at the start of rendering, not the actual display time after TimeWarp.
Data usage:
- API call timing: Every render frame of the 3D engine.
- API call thread: 3D engine's game thread.
A Unity API call example follows:
private void InputRenderFrameMotionData()
{
    // Query pose data for the display time of the current frame (device-specific).
    double timestamp = 0; // display-time timestamp from the device, in seconds
    var headPose = new Pose(); // device pose at the display time
    MotionTrackingStatus trackingStatus = (MotionTrackingStatus)(-1); // map from the device tracking state
    HandleRenderFrameData(timestamp, headPose, trackingStatus);
}
Data requirement details
Physical camera image data:
- Image coordinate system: Data acquired when the sensor is level should also be level. Data should be stored with the top-left corner as the origin, in row-major order. Images should not be flipped or inverted.
- Image FPS: Normal 30 or 60 fps data is acceptable. If a high fps has a notable performance impact, the minimum frame rate at which the algorithms still perform reasonably is 2 fps. Using an fps higher than 2 is recommended; the raw data frame rate is typically sufficient.
- Image size: For better calculation results, the maximum side should be 960 or larger. Performing time-consuming image scaling in the data pipeline is discouraged; use raw data directly unless copying full-size data takes unacceptably long. Image resolution must not be smaller than 640x480.
- Pixel format: Prioritizing tracking performance while considering overall performance, the typical order of preference is YUV > RGB > RGBA > Gray (the Y channel of YUV). When using YUV data, a complete data definition is required, including packing and padding details. Color images generally yield better Mega results than single-channel images; other features are less affected.
- Data access: a data pointer or an equivalent implementation. Eliminate all avoidable copies in the data pipeline. In HandleCameraFrameData, EasyAR makes a copy of the data for asynchronous use; the image data is no longer used after this synchronous call completes. Pay attention to data ownership.
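The size and packing constraints above can be checked before handing data to EasyAR. A minimal sketch; the NV12 layout used here (a full-resolution Y plane followed by a half-height interleaved UV plane, with optional per-row padding) is just one common YUV packing, chosen as an example:

```csharp
using System;

public static class FrameChecks
{
    // Limits taken from this document: minimum 640x480, max side of 960 or larger recommended.
    public const int MinWidth = 640, MinHeight = 480, RecommendedMaxSide = 960;

    public static bool MeetsMinimumResolution(int width, int height)
        => width >= MinWidth && height >= MinHeight;

    public static bool MeetsRecommendedSize(int width, int height)
        => Math.Max(width, height) >= RecommendedMaxSide;

    // Buffer size for NV12 with per-row padding (rowStride >= width):
    // full-resolution Y plane plus a half-height interleaved UV plane.
    public static int Nv12BufferSize(int width, int height, int rowStride)
    {
        if (rowStride < width) throw new ArgumentException("rowStride < width");
        return rowStride * height + rowStride * (height / 2);
    }
}
```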
Timestamps:
- All timestamps must be clock-synchronized, preferably in hardware. Timestamps are expressed in seconds, but their precision should reach nanoseconds, or be as high as possible.
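For example, a device clock that reports integer nanoseconds can be converted to the required double seconds. Rebasing to a local epoch is a sketch of one way to keep nanosecond resolution, since a double carries only about 15-16 significant digits:

```csharp
public static class Timestamps
{
    // Rebase device timestamps to the first value seen so that a double in
    // seconds still resolves nanoseconds over a long session.
    static long epochNanoseconds = -1;

    public static double ToSeconds(long deviceNanoseconds)
    {
        if (epochNanoseconds < 0) { epochNanoseconds = deviceNanoseconds; }
        return (deviceNanoseconds - epochNanoseconds) * 1e-9;
    }
}
```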
Tracking state:
- The tracking state is defined by the device and must include a state for tracking loss (VIO unavailable). More granularity is better if available.
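For example, a device SDK that distinguishes more states than needed can be mapped down to a few levels that still preserve the required tracking-loss state. All state names here are hypothetical, not taken from any real SDK:

```csharp
// Three-level status; "Lost" covers the required VIO-unavailable case.
public enum TrackingLevel { Lost, Limited, Normal }

// Hypothetical device-side states; real names depend on the device SDK.
public enum DeviceState { Initializing, Relocalizing, TrackingPoor, TrackingGood }

public static class TrackingMap
{
    public static TrackingLevel Map(DeviceState s) => s switch
    {
        DeviceState.Initializing => TrackingLevel.Lost,   // VIO not yet available
        DeviceState.Relocalizing => TrackingLevel.Lost,   // tracking lost
        DeviceState.TrackingPoor => TrackingLevel.Limited,
        _ => TrackingLevel.Normal,
    };
}
```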
Device pose:
- All poses (including the transform of the virtual camera in the 3D engine) should use the same origin.
- All poses and extrinsic parameters should use the same coordinate system.
- In Unity, the coordinate system type for pose data should be either the Unity coordinate system or the EasyAR coordinate system. If the input extension is implemented by EasyAR and uses other coordinate system definitions, a clear coordinate system definition must be provided, or a method for converting to the Unity or EasyAR coordinate system should be given.
- In Unity, if using the Unity XR framework, only compatibility with <xref:Unity.XR.CoreUtils.XROrigin.TrackingOriginMode.Device?displayProperty=nameWithType> mode is required.
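To illustrate the kind of coordinate system conversion mentioned above, here is a sketch that assumes the external SDK reports poses in an OpenGL-style right-handed system (X right, Y up, Z backward) and converts them to a Unity-style left-handed system (Z forward) by flipping the Z axis. The actual conversion depends entirely on the SDK's coordinate system definition; `System.Numerics` types stand in for Unity's:

```csharp
using System.Numerics;

public static class PoseConvert
{
    // Conjugating with a Z-axis flip converts a right-handed (X right, Y up,
    // Z backward) pose to a left-handed Unity-style pose:
    // p' = (x, y, -z), q' = (-qx, -qy, qz, qw).
    // The source convention is an assumption, not an EasyAR requirement.
    public static (Vector3, Quaternion) RightHandedToLeftHanded(Vector3 p, Quaternion q)
        => (new Vector3(p.X, p.Y, -p.Z), new Quaternion(-q.X, -q.Y, q.Z, q.W));
}
```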
Intrinsic parameters:
- All values must match the image data. Scaling of intrinsic parameters should be performed before inputting to EasyAR if necessary.
- If the input extension is implemented by EasyAR, it should be specified whether the intrinsic parameters change per frame (indicating whether the corresponding API should be called once or every frame).
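For example, if the image is downscaled before input, the pinhole intrinsics scale with it. A minimal sketch (the half-pixel center-of-pixel adjustment is deliberately ignored here for simplicity):

```csharp
public readonly struct Intrinsics
{
    public readonly double Fx, Fy, Cx, Cy; // focal length and principal point, in pixels
    public readonly int Width, Height;     // image size the values refer to

    public Intrinsics(double fx, double fy, double cx, double cy, int w, int h)
        { Fx = fx; Fy = fy; Cx = cx; Cy = cy; Width = w; Height = h; }

    // Scale intrinsics to match a resized image; per this document, any such
    // scaling must happen before the data is input to EasyAR.
    public Intrinsics ScaledTo(int newWidth, int newHeight)
    {
        double sx = (double)newWidth / Width, sy = (double)newHeight / Height;
        return new Intrinsics(Fx * sx, Fy * sy, Cx * sx, Cy * sy, newWidth, newHeight);
    }
}
```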
Extrinsic parameters:
- Real data must be provided on head-mounted displays.
- It is a calibration matrix expressing the physical offset of the physical camera relative to the device/head pose origin. If the device's pose and the physical camera pose are the same, it should be an identity matrix.
- For Apple Vision Pro, the corresponding interface is: CameraFrame.Sample.Parameters.extrinsics. Note that its data definition differs from the required interface data; EasyAR internally converts it before use.
- In Unity, the coordinate system type for extrinsic parameters should be either the Unity coordinate system or the EasyAR coordinate system. If the input extension is implemented by EasyAR and uses other coordinate system definitions, a clear coordinate system definition must be provided, or a method for converting to the Unity or EasyAR coordinate system should be given.
- In head-mounted devices, multiple coordinate systems with different definitions usually exist. These differences may include origin, orientation, left/right-handed expression, etc. Extrinsic parameters should be calculated within the same coordinate system. The interface data requires coordinate transformation within the same coordinate system, not a transformation matrix between two differently defined coordinate systems.
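To make the "coordinate transformation within the same coordinate system" point concrete: given the device pose in world coordinates and a device-from-camera extrinsic expressed in that same system, the world pose of the physical camera is a plain composition, and an identity extrinsic leaves the camera pose equal to the device pose. Names are illustrative; `System.Numerics` types stand in for Unity's:

```csharp
using System.Numerics;

public static class ExtrinsicCompose
{
    // World pose of the physical camera from the device pose and the
    // device-from-camera extrinsic (offset expressed in the device frame).
    public static (Vector3 pos, Quaternion rot) CameraWorldPose(
        Vector3 devicePos, Quaternion deviceRot,
        Vector3 extrinsicPos, Quaternion extrinsicRot)
    {
        // Rotate the calibrated offset into world space, then translate.
        var pos = devicePos + Vector3.Transform(extrinsicPos, deviceRot);
        // Concatenate(a, b): rotation a followed by rotation b.
        var rot = Quaternion.Concatenate(extrinsicRot, deviceRot);
        return (pos, rot);
    }
}
```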
Performance:
- Data should be provided as efficiently as possible. In most implementations, the API calls occur during rendering, so it is recommended that these calls not block even when the underlying operations are time-consuming, or that the APIs be used in a way that tolerates the cost.
- If the input extension is implemented by EasyAR, all time-consuming API calls must be documented.
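One way to keep per-frame API calls non-blocking, as recommended above, is to poll the device on a worker thread and let the game thread only dequeue the latest result. A generic sketch, not tied to any particular SDK:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Worker thread runs the (potentially slow) device call; the game thread's
// per-frame call only dequeues, so it never blocks on the device.
public sealed class FramePump<T> : IDisposable
{
    readonly ConcurrentQueue<T> queue = new ConcurrentQueue<T>();
    readonly Thread worker;
    volatile bool running = true;

    public FramePump(Func<T> slowAcquire, int intervalMs)
    {
        worker = new Thread(() =>
        {
            while (running)
            {
                queue.Enqueue(slowAcquire()); // time-consuming device call
                Thread.Sleep(intervalMs);
            }
        }) { IsBackground = true };
        worker.Start();
    }

    // Called every render frame on the game thread; returns false if no new data.
    public bool TryGetLatest(out T frame)
    {
        bool got = false;
        frame = default;
        while (queue.TryDequeue(out var f)) { frame = f; got = true; } // keep only the newest
        return got;
    }

    public void Dispose() { running = false; worker.Join(); }
}
```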
Multi-camera:
- Data from at least one camera is required. This camera can be an RGB camera, a VST camera, a positioning camera, or any other type. On a head-mounted display, if only one camera's data is input, it is generally recommended to use an RGB or VST camera located centrally or near the eye.
- Using multiple cameras can enhance the performance of EasyAR algorithms. Camera frame data from all available cameras for a given moment should be input together at the same point in time.
Multi-camera support is not yet fully implemented; contact EasyAR for more details.
Next steps
- Create an image and device motion data input extension
- Create an image input extension
- Create a headset extension package
Related topics
- EasyAR coordinate system
- Image input extension sample: Workflow_FrameSource_ExternalImageStream