Meta unveils HOT3D dataset for advanced computer vision training

HOT3D overview. The dataset includes multi-view egocentric image streams from Aria [13] and Quest 3 [41] annotated with high-quality ground-truth 3D poses and models of hands and objects. Three multi-view frames from Aria are shown on the left, with contours of 3D models of hands and objects in the ground-truth poses in white and green, respectively. Aria also provides 3D point clouds from SLAM and eye gaze information (right). Credit: Banerjee et al.

While most humans can innately use their hands to communicate with others or grab and manipulate objects, many existing robotic systems only excel at simple manual tasks. In recent years, computer scientists worldwide have been developing machine learning-based models that can process images of humans completing manual tasks, using acquired information to improve robot manipulation, which could in turn enhance a robot’s interactions with both humans and objects in its surroundings.

Similar models could also be used to create human-machine interfaces that rely on computer vision or broaden the capabilities of augmented and virtual reality (AR and VR) systems. To train these machine learning models, researchers need access to high-quality datasets containing annotated footage of humans completing various real-world manual tasks.

Researchers at Meta Reality Labs recently introduced HOT3D, a new dataset that could help accelerate machine learning research on hand-object interactions. This dataset, presented in a paper published on the arXiv preprint server, contains high-quality 3D videos of human users grabbing and manipulating various objects, taken from an egocentric point of view (i.e., mirroring what the person completing the task would see).

“We introduce HOT3D, a publicly available dataset for egocentric hand and object tracking in 3D,” wrote Prithviraj Banerjee, Sindi Shkodrani and their colleagues in their paper.

“The dataset offers over 833 minutes (more than 3.7M images) of multi-view RGB/monochrome image streams showing 19 subjects interacting with 33 diverse rigid objects, multi-modal signals such as eye gaze or scene point clouds, as well as comprehensive ground-truth annotations including 3D poses of objects, hands, and cameras, and 3D models of hands and objects.”







The new dataset compiled by the team at Meta Reality Labs contains simple demonstrations of humans picking up and observing objects, as well as placing them back down on a surface. Yet it also includes more elaborate demonstrations showing users performing actions commonly observed in office and household environments, such as picking up and using kitchen utensils, manipulating various foods, typing on a keyboard, and so on.

The annotated footage included in the dataset was collected using two devices developed at Meta, namely Project Aria glasses and the Quest 3 headset. The Project Aria glasses are lightweight prototype sensing glasses for augmented reality (AR) applications.

Project Aria glasses can capture video while also tracking the eye movements of users wearing them and collecting information about the location of objects in their field of view. Quest 3, the second device used to collect data, is a commercially available virtual reality (VR) headset developed at Meta.

Example results of 2D segmentation of in-hand objects. Credit: arXiv (2024). DOI: 10.48550/arxiv.2411.19167

Motion-capture lab. The HOT3D dataset was collected using a motion-capture rig equipped with a few dozen infrared exocentric OptiTrack cameras and light diffuser panels for illumination variability. Credit: arXiv (2024). DOI: 10.48550/arxiv.2411.19167

“Ground-truth poses were obtained by a professional motion-capture system using small optical markers attached to hands and objects,” wrote Banerjee, Shkodrani and their colleagues. “Hand annotations are provided in the UmeTrack and MANO formats and objects are represented by 3D meshes with PBR materials obtained by an in-house scanner.”
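To make the idea of a ground-truth 3D pose more concrete, the sketch below shows how a pose (a rotation plus a translation) places the vertices of a 3D model in front of a camera and how those points map to pixels. This is only an illustration under simplifying assumptions, not the dataset's own tooling: the ideal pinhole camera, the focal length, the image center, and the three "vertices" are all hypothetical values, and real headset cameras typically need a lens distortion model as well.

```python
import math

def rotation_z(angle_rad):
    """3x3 rotation matrix for a rotation about the z axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_pose(rotation, translation, point):
    """Map a point from model coordinates to camera coordinates: R @ p + t."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

def project_pinhole(point_cam, focal_px, cx, cy):
    """Project a camera-frame 3D point to pixels with an ideal pinhole model."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (focal_px * x / z + cx, focal_px * y / z + cy)

# Hypothetical stand-ins for a few mesh vertices (meters, model frame).
vertices = [(0.05, 0.0, 0.0), (0.0, 0.05, 0.0), (-0.05, 0.0, 0.0)]

# Hypothetical pose: object rotated 90 degrees about z, 0.5 m in front of the camera.
R = rotation_z(math.pi / 2)
t = (0.0, 0.0, 0.5)

# Hypothetical intrinsics: 600 px focal length, image center at (320, 240).
pixels = [project_pinhole(apply_pose(R, t, v), 600.0, 320.0, 240.0) for v in vertices]

for vertex, pixel in zip(vertices, pixels):
    print(vertex, "->", tuple(round(value, 2) for value in pixel))
```

Drawing the projected outline of every model vertex in this way is, in spirit, how contours of hands and objects can be overlaid on an image once their poses are known.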

To assess the potential of the HOT3D dataset for research in robotics and computer vision, the researchers used it to train baseline models on three different tasks. They found that these models performed significantly better when trained on the multi-view data contained in HOT3D than when trained on demonstrations capturing a single point of view.

“In our experiments, we demonstrate the effectiveness of multi-view egocentric data for three popular tasks: 3D hand tracking, 6DoF object pose estimation, and 3D lifting of unknown in-hand objects,” wrote Banerjee, Shkodrani and their colleagues. “The evaluated multi-view methods, whose benchmarking is uniquely enabled by HOT3D, significantly outperform their single-view counterparts.”
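One geometric intuition for why a second view can help is that a single camera only constrains a point to lie somewhere along a viewing ray, whereas two rays from different positions pin down its depth. The sketch below illustrates this with classic midpoint triangulation. It is not one of the methods evaluated in the paper, and the camera positions and the target point are hypothetical values chosen for illustration.

```python
def dot(a, b):
    """Dot product of two 3D vectors."""
    return sum(x * y for x, y in zip(a, b))

def triangulate_midpoint(origin1, dir1, origin2, dir2):
    """Return the midpoint of the shortest segment between two viewing rays."""
    w = tuple(p - q for p, q in zip(origin1, origin2))
    a, b, c = dot(dir1, dir1), dot(dir1, dir2), dot(dir2, dir2)
    d, e = dot(dir1, w), dot(dir2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel, so depth cannot be recovered")
    s = (b * e - c * d) / denom  # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom  # parameter of the closest point on ray 2
    p1 = tuple(o + s * v for o, v in zip(origin1, dir1))
    p2 = tuple(o + t * v for o, v in zip(origin2, dir2))
    return tuple((x + y) / 2 for x, y in zip(p1, p2))

# Hypothetical setup: two cameras 0.2 m apart observing the same 3D point.
target = (0.1, 0.0, 0.5)
cam1, cam2 = (0.0, 0.0, 0.0), (0.2, 0.0, 0.0)
ray1 = tuple(p - o for p, o in zip(target, cam1))
ray2 = tuple(p - o for p, o in zip(target, cam2))

# Every point cam1 + s * ray1 projects to the same pixel in camera 1,
# so one view alone leaves s (the depth) undetermined.
estimate = triangulate_midpoint(cam1, ray1, cam2, ray2)
print(tuple(round(value, 6) for value in estimate))
```

With noisy rays the two lines no longer intersect exactly, which is why the midpoint of their closest approach is used rather than an exact intersection.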

The HOT3D dataset is open-source and can be downloaded by researchers worldwide from the Project Aria website. In the future, it could contribute to the development and advancement of various technologies, including human-machine interfaces, robots, and other computer vision-based systems.

More information:
Prithviraj Banerjee et al, HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos, arXiv (2024). DOI: 10.48550/arxiv.2411.19167

Journal information:
arXiv


© 2025 Science X Network

Citation:
Meta unveils HOT3D dataset for advanced computer vision training (2025, January 3)
retrieved 3 January 2025
from https://techxplore.com/news/2025-01-meta-unveils-hot3d-dataset-advanced.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no
part may be reproduced without the written permission. The content is provided for information purposes only.
