DeepMind's SIMA 2: A Gemini-Powered AI Agent for Virtual Worlds

Google DeepMind's SIMA 2 is the successor to SIMA, the company's Scalable Instructable Multiworld Agent. Powered by Gemini, DeepMind's flagship large language model, SIMA 2 can carry out a wide range of tasks within virtual worlds, work out how to solve challenges on its own, and converse with users. It also improves by attempting complex tasks repeatedly.

According to Joe Marino, a research scientist at Google DeepMind, video games are valuable for AI research because of their complexity. Even a seemingly simple action, such as lighting a lantern, requires the agent to solve multiple tasks.

Applications and Future Goals

The ultimate objective is to create AI agents that can follow instructions and carry out open-ended tasks in environments more complex than web browsers. DeepMind envisions such agents eventually controlling real-world robots, building on skills that SIMA 2 learns in virtual worlds, such as navigation, tool use, and collaboration with humans.

Unlike earlier game-playing agents such as AlphaZero and AlphaStar, SIMA 2 is designed for open-ended games without predefined objectives. It learns to follow human instructions given through text chat, voice, or drawings on the screen, and it processes visual input to decide which actions to take.

Training and Capabilities

SIMA 2 was trained using footage of humans playing commercial video games like No Man’s Sky and Goat Simulator 3, as well as custom-built virtual worlds. Its integration with Gemini has dramatically improved its ability to follow instructions, ask questions, and autonomously solve complex tasks.

In testing, SIMA 2 demonstrated adaptability by navigating environments generated by Genie 3, DeepMind's world model. Gemini provided new tasks and guidance, allowing SIMA 2 to learn through trial and error.
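The trial-and-error loop described above can be pictured with a toy sketch. This is only an illustration under loud assumptions, not DeepMind's training code: `propose_task`, `attempt`, and the single numeric `skill` value are hypothetical stand-ins for Gemini setting tasks, SIMA 2 trying them, and whatever the agent actually learns.

```python
import random

def propose_task(skill: float) -> float:
    """Hypothetical task setter: pick a difficulty slightly above current skill."""
    return skill + 0.1

def attempt(skill: float, difficulty: float, rng: random.Random) -> bool:
    """Hypothetical attempt: success is likelier when the task is close to current skill."""
    chance = max(0.0, min(1.0, 1.0 - (difficulty - skill) * 5))
    return rng.random() < chance

def self_improve(rounds: int, seed: int = 0) -> float:
    """Run a number of propose/attempt rounds and return the final toy skill value."""
    rng = random.Random(seed)
    skill = 0.0
    for _ in range(rounds):
        difficulty = propose_task(skill)
        if attempt(skill, difficulty, rng):
            # Toy stand-in for learning: a successful attempt raises skill.
            skill = difficulty
        # On failure the same difficulty is proposed again next round (retry).
    return skill
```

Calling `self_improve(50)` returns a skill value that never decreases as rounds are added for a fixed seed; the only point of the sketch is that repeated attempts at slightly harder tasks can ratchet ability upward.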

Challenges and Expert Perspectives

Despite this progress, SIMA 2 still struggles with multi-step tasks and retains only its most recent interactions in memory. Its control of a mouse and keyboard also remains less proficient than a human's.
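To make the memory limitation concrete, here is a minimal sketch of a memory that keeps only recent interactions. The source does not describe how SIMA 2 actually stores context, so the `RecentMemory` class, its `capacity` parameter, and the example steps are all hypothetical.

```python
from collections import deque

class RecentMemory:
    """Keeps only the most recent `capacity` interactions; older ones are dropped."""

    def __init__(self, capacity: int):
        # A deque with maxlen discards the oldest item when a new one is appended.
        self._items = deque(maxlen=capacity)

    def remember(self, interaction: str) -> None:
        self._items.append(interaction)

    def recall(self) -> list[str]:
        return list(self._items)

memory = RecentMemory(capacity=3)
for step in ["find wood", "craft torch", "light lantern", "enter cave"]:
    memory.remember(step)

print(memory.recall())  # ['craft torch', 'light lantern', 'enter cave']
```

With a capacity of three, the earliest step ("find wood") is already gone by the fourth interaction, which is the kind of forgetting that makes long multi-step tasks hard.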

Julian Togelius, an AI researcher at New York University, notes how difficult it is to train a single system to perform well across multiple games using only visual input. Matthew Guzdial, an AI researcher at the University of Alberta, is skeptical that SIMA 2's skills will transfer to real-world robotics, because real-world visuals are more complex than those in video games.

Even so, Marino and his colleagues aim to explore Genie 3's potential to create endless virtual training environments for SIMA 2, with Gemini's feedback guiding its learning.