A Minecraft-based benchmark to train and test multi-modal multi-agent systems

More than 30 target objects or resources are used in TeamCraft tasks. Credit: UCLA.

Researchers at the University of California, Los Angeles (UCLA) have recently developed TeamCraft, a new open-world environment for training and evaluating algorithms for embodied artificial intelligence (AI) agents, including teams of multiple robots. This benchmark, introduced in a paper published on the arXiv preprint server, is based on the popular videogame Minecraft.

“There is a lack of multi-modal, multi-agent benchmarks for open-world environments,” Qian Long, a Ph.D. student at UCLA, told Tech Xplore.

“Minecraft, one of the most popular games, offers a multidimensional, visually immersive realm characterized by procedurally generated landscapes and versatile game mechanics. Its dynamic nature supports a wide range of activities, which made it an ideal platform for creating our visually rich multi-agent benchmark: TeamCraft.”

TeamCraft, the platform created by Long and his colleagues, can be used to train algorithms on four different types of tasks, namely building, clearing, farming and smelting. As part of their study, the researchers also used their platform to evaluate existing vision-language models (VLMs), which allowed them to better understand their limitations.

“TeamCraft is a multi-modal, multi-agent benchmark that addresses a significant challenge for AI,” said Zhi Li, a Ph.D. student at UCLA. “Specifically, it helps to address the question: How well can embodied agents collaborate in complex environments with human-like perception?”

Agents collaborate to cook mutton in a desert village. Credit: UCLA.

In the TeamCraft benchmarking platform, every agent is provided with first-person RGB data and status information, which mirrors what a human agent would perceive. AI agents can be trained and tested on various tasks that require them to collaborate with each other, understand the environment via first-person vision and utilize available tools.

To complete each task, the agents need to perform specific actions, similar to those that a human player would perform in Minecraft. These actions are pre-defined (i.e., can be picked from a limited set of options) and self-descriptive (i.e., clearly named/labeled).
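
As a rough illustration of this setup, the Python sketch below models a per-agent observation (a first-person RGB frame plus status information) and a small, pre-defined action set. All class, field and action names are hypothetical and are not taken from the TeamCraft codebase.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    """Hypothetical pre-defined, self-descriptive action set."""
    PLACE_BLOCK = "place_block"
    MINE_BLOCK = "mine_block"
    FARM = "farm"
    PUT_ITEM_IN_FURNACE = "put_item_in_furnace"


@dataclass
class Observation:
    """What one agent perceives: a first-person image plus status info."""
    rgb: list  # H x W x 3 nested list of ints in [0, 255]
    inventory: dict = field(default_factory=dict)  # item name -> count


def is_valid_action(name: str) -> bool:
    """Return True only if the name belongs to the pre-defined action set."""
    return name in {action.value for action in Action}


def blank_frame(height: int, width: int) -> list:
    """Build an all-black RGB frame, standing in for a rendered game view."""
    return [[[0, 0, 0] for _ in range(width)] for _ in range(height)]
```

Restricting agents to a closed, named action set means that a model's output can be checked before it is executed, for example with `is_valid_action("mine_block")`.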

“The first advantage of TeamCraft is that it enables multi-modal task specification,” explained Li. “Unlike prior systems such as ALFRED and MineDojo, which rely solely on text instructions, TeamCraft supports multi-modal prompts. This expands the scope for richer and more diverse task specifications.”

Another unique characteristic of TeamCraft is that it equips agents with first-person RGB vision while they navigate the visually rich Minecraft environment. This is in contrast with previous approaches such as Watch&Help and RoCoBench, which relied on state-based observations, Neural MMO 2.0, which provides simplified pixel-based visuals, and Overcooked-AI, which only allows agents to view 2D worlds.

“While most prior works like MineDojo and VIMA-Bench focus on single-agent setups, TeamCraft prioritizes multi-agent environments to better simulate real-world challenges requiring collaboration,” said Li.

“It supports both centralized and decentralized control strategies, enhancing flexibility in agent coordination and challenging capabilities of model understanding.”
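
The difference between the two control strategies can be sketched in a few lines of Python. This is only a conceptual illustration under simple assumptions: the function names and the dictionary-based observation format are hypothetical, not TeamCraft's interface.

```python
def centralized_step(joint_policy, observations):
    """One controller sees every agent's observation and assigns all actions."""
    return joint_policy(observations)


def decentralized_step(policies, observations):
    """Each agent chooses its own action from its own observation only."""
    return {agent: policies[agent](obs) for agent, obs in observations.items()}


# Toy observations: each agent reports whether a block is directly in front of it.
observations = {"agent_0": {"block_ahead": True}, "agent_1": {"block_ahead": False}}


def joint_policy(all_obs):
    # The central controller can use global information, such as how many
    # agents currently face a block, before deciding anyone's action.
    facing = sum(1 for obs in all_obs.values() if obs["block_ahead"])
    return {agent: "mine_block" if facing > 0 else "move" for agent in all_obs}


def local_policy(obs):
    # A decentralized agent only knows what it observes itself.
    return "mine_block" if obs["block_ahead"] else "move"


central = centralized_step(joint_policy, observations)
local = decentralized_step({a: local_policy for a in observations}, observations)
```

In this toy case, the centralized controller sends both agents to mine because it knows a block is present somewhere, while the decentralized agent that sees no block simply moves.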

The tasks included in TeamCraft are designed to assess the agents’ planning, coordination and execution while they navigate a dynamic setting.

In contrast with some other benchmarks, like FurnMove, the system supports the evaluation not only of agents that are equally capable across tasks, but also of agents with different responsibilities.

In other words, it allows users to distribute different roles to different agents in a team, by providing them with distinct capabilities. It can also be used to train and test the agents’ decision-making skills in real-time and their adaptability to changing environments.

TeamCraft features a total of 55,000 task variants. These variants are defined based on various factors, including biomes (i.e., distinct regions within the open-world environment), base blocks, task goals, target materials, agent counts and unique inventories.
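
To give a sense of how combining such factors multiplies the number of variants, here is a minimal Python sketch. The factor values are placeholders chosen for illustration (only the four task types and the desert setting appear in this article), and the resulting count is not meant to reproduce the 55,000 figure.

```python
from itertools import product

# Placeholder values for illustration only; the real benchmark's options differ.
FACTORS = {
    "task": ["building", "clearing", "farming", "smelting"],
    "biome": ["desert", "forest", "plains"],
    "agent_count": [2, 3],
}


def enumerate_variants(factors: dict) -> list:
    """Return one dict per combination of factor values (Cartesian product)."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


variants = enumerate_variants(FACTORS)
print(len(variants))  # 4 * 3 * 2 = 24 combinations
```

Each additional factor, such as base blocks or per-agent inventories, multiplies the total again, which is how a modest number of options per factor can yield tens of thousands of variants.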

“Operating in the Minecraft environment, TeamCraft enables agents to perceive, think, and act like human players without perfect information,” said Li.

“Unlike prior systems that provide agents with complete data (e.g., unseen teammate locations), TeamCraft requires agents to actively explore their surroundings. This shift fosters more realistic behaviors and reduces dependence on artificially perfect data, enabling agents to better handle real-world scenarios and reduce the gap of deploying models to real world application.”

The benchmark created by the researchers also includes ‘plug-and-play’ interfaces. This means that it can be used both to test existing models and to train new ones, all within a single standardized environment. It can also serve as a gym-like playground to train reinforcement learning (RL) algorithms that support multi-agent collaboration.
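
For readers unfamiliar with the term, a "gym-like" environment typically exposes a `reset` method that starts an episode and a `step` method that applies actions and returns new observations, a reward and a done flag. The toy Python class below follows that general convention for a multi-agent clearing task. It is a hypothetical stand-in written for this article, not TeamCraft's actual API.

```python
class ToyClearingEnv:
    """Toy stand-in for a gym-like multi-agent environment (not TeamCraft's API)."""

    def __init__(self, agent_ids, num_blocks):
        self.agent_ids = list(agent_ids)
        self.num_blocks = num_blocks
        self.remaining = num_blocks

    def reset(self):
        """Start a new episode and return one observation per agent."""
        self.remaining = self.num_blocks
        return self._observations()

    def step(self, actions):
        """Apply one action per agent; return (observations, reward, done)."""
        cleared = 0
        for agent_id in self.agent_ids:
            if actions.get(agent_id) == "mine_block" and self.remaining > 0:
                self.remaining -= 1
                cleared += 1
        done = self.remaining == 0
        return self._observations(), float(cleared), done

    def _observations(self):
        # Every agent sees the same count here; a real benchmark would
        # give each agent its own first-person view instead.
        return {a: {"blocks_remaining": self.remaining} for a in self.agent_ids}


def run_episode(env, policy, max_steps=100):
    """Roll out one episode; policy maps (agent_id, observation) to an action."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        actions = {a: policy(a, obs[a]) for a in env.agent_ids}
        obs, reward, done = env.step(actions)
        total += reward
        if done:
            break
    return total
```

Because the loop in `run_episode` only depends on `reset` and `step`, the same code can evaluate a fixed model or collect experience for an RL algorithm, which is the practical appeal of a standardized interface.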

“TeamCraft demonstrates the possibility of vision-based multi-agent collaboration in the open-world video game Minecraft,” said Ran Gong, former Ph.D. student at UCLA.

“Minecraft’s rich and procedurally generated world provides a challenging yet flexible platform to explore collaborative problem-solving, resource management, and task execution among multiple AI agents. By focusing on vision-based inputs, TeamCraft emphasizes how agents can interpret complex visual cues to make decisions, coordinate actions, and achieve shared goals, all without relying on predefined rules.”

By running tests on TeamCraft, the researchers demonstrated the existence of data scaling laws, which are a key aspect of AI model performance. These laws show that there is a consistent pattern in the training of AI models, where an agent’s ability to perform complex tasks and coordinate with other agents improves as the training data it has access to increases.

“This finding suggests that one of the most promising avenues for developing a more effective and robust system is to scale up the amount of high-quality training data,” said Gong. “By leveraging larger datasets, models can learn richer patterns, adapt better to diverse scenarios, and enhance their collaborative capabilities.”

In the future, TeamCraft could be used by computer scientists worldwide to train and evaluate their machine learning-based models. In addition, it could aid the design of new AI-based general-purpose videogame characters, which could collaborate better with other characters or assist human players as they are playing a game.

“Through natural interactions, these AI agents can help human players strategize, solve challenges, and enjoy a more engaging gaming experience,” said Gong. “Such advancements could redefine the role of AI in gaming, transforming it into an intelligent teammate or assistant capable of adapting to human behavior and preferences.”

The code underpinning the TeamCraft benchmark is open-source and can be downloaded from GitHub. The new benchmark could soon inspire the development of other open-world environments for training or testing AI agents that also support multi-modal multi-agent interactions.

“Currently, the agents in TeamCraft rely on implicit communication to coordinate their actions,” added Xiaofeng Gao, former Ph.D. student at UCLA.

“Enabling the agents to communicate explicitly via natural language would be an interesting direction to explore. Moreover, we plan to make TeamCraft a testbed for human-AI collaboration by including human players in the games.”

More information:
Qian Long et al, TeamCraft: A Benchmark for Multi-Modal Multi-Agent Systems in Minecraft, arXiv (2024). DOI: 10.48550/arxiv.2412.05255

Journal information:
arXiv


© 2025 Science X Network

Citation:
A Minecraft-based benchmark to train and test multi-modal multi-agent systems (2025, January 10)
retrieved 10 January 2025
from https://techxplore.com/news/2025-01-minecraft-based-benchmark-multi-modal.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no
part may be reproduced without the written permission. The content is provided for information purposes only.
