DeepMind's SIMA 2: Gemini-Powered AI Agent Masters Virtual Worlds, Eyes Real-World Robotics

Google DeepMind's SIMA 2, the successor to the "scalable instructable multiworld agent" first demoed last year, is a substantial upgrade in capability thanks to its integration with Gemini, DeepMind's flagship large language model. The new agent can perform a wider range of tasks within virtual worlds, devise its own solutions to challenges, and hold a dialogue with users. It can also learn and improve through repeated attempts at difficult tasks.

According to Joe Marino, a research scientist at Google DeepMind, video games are a powerful tool for AI agent research. He highlighted how much is involved in even seemingly simple in-game actions, such as lighting a lantern, which requires the agent to solve a series of tasks. The ultimate goal is to develop AI agents that can follow instructions and carry out open-ended tasks in environments more complex than a web browser. DeepMind envisions these agents eventually driving real-world robots, with skills learned by SIMA 2, such as navigation, tool use, and collaboration with humans, forming crucial building blocks.

Unlike previous game-playing AI such as AlphaZero and AlphaStar, SIMA is designed to play open-ended games without predefined objectives. It learns to follow human instructions given through text chat, voice, or drawings on the screen, and it processes the game's visual input to determine which actions to take. SIMA 2 was trained on footage of humans playing various commercial video games, including No Man’s Sky and Goat Simulator 3, along with custom-built virtual worlds.

Connecting the agent to Gemini has made it dramatically better at following instructions, asking questions, and working out on its own how to accomplish more intricate tasks. In tests, SIMA 2 demonstrated its adaptability by navigating and carrying out instructions within environments generated from scratch by Genie 3, DeepMind's world model.
Gemini was also used to generate new tasks for SIMA 2 and, when the agent failed at first, to offer tips and guidance, allowing the agent to learn through trial and error.

Despite this progress, SIMA 2 still struggles with complex, multi-step tasks and retains only its most recent interactions in memory. Its proficiency with a mouse and keyboard also lags behind that of human players.

Julian Togelius, an AI researcher at New York University, acknowledges how difficult it is to train a single system to perform well across multiple games using only visual input. He notes that previous attempts, such as DeepMind's Gato, struggled to transfer skills across diverse virtual environments. Matthew Guzdial, an AI researcher at the University of Alberta, is skeptical that SIMA 2's skills will transfer to real-world robotics, because real-world visuals are far harder to interpret than the easily parsable visuals of video games.

Still, Marino and his colleagues aim to explore whether Genie 3 can create an endless virtual training environment for SIMA 2, guided by Gemini's feedback.