o3 AI Model Tops Science Question Ranking
Source: nature.com
o3, an artificial intelligence model from the creators of ChatGPT, has been ranked as the top AI tool for answering science questions across multiple fields, according to a benchmarking platform that launched last week.
SciArena, developed by the Allen Institute for Artificial Intelligence (Ai2) in Seattle, Washington, assessed 23 large language models (LLMs) on their answers to scientific questions. A total of 102 researchers voted on the quality of the answers.
After more than 13,000 votes, o3, created by OpenAI in San Francisco, California, ranked highest for its answers to questions in natural sciences, health care, engineering, and humanities and social science. DeepSeek-R1, built by DeepSeek in Hangzhou, China, ranked second in natural sciences and fourth in engineering. Google’s Gemini-2.5-Pro ranked third in natural sciences and fifth in both engineering and health care.
Arman Cohan, a research scientist at Ai2, suggests that users might prefer o3 because it provides detailed information on the literature it cites and gives technically nuanced responses. However, explaining why models perform differently is difficult because most are proprietary. He says that differences in training data and optimization could partially explain the variations.
SciArena evaluates how AI models perform on scientific tasks and ranks them using crowdsourced feedback. Rahul Shome, a robotics and AI researcher at the Australian National University in Canberra, says that SciArena is a positive step that encourages careful evaluation of LLM-assisted literature tasks.
To rank the 23 LLMs, SciArena asked researchers to submit scientific questions. For each question, the platform provided answers from two randomly selected models, with references drawn from Semantic Scholar, an AI research tool also created by Ai2. Users then voted on whether one model gave the better answer, the two models were comparable, or both performed poorly.
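The article does not say how these pairwise votes are converted into a leaderboard. As a rough illustration only, the sketch below assumes an Elo-style rating, a common choice for head-to-head comparisons. The constants `K` and `START`, the outcome labels, the model names, and the decision to score a "both performed poorly" vote like a tie are all assumptions made for this example, not details of SciArena.

```python
from collections import defaultdict

K = 32          # update step size (assumed, not from the article)
START = 1000.0  # initial rating for every model (assumed)

# Score credited to the first model for each vote outcome.
# Treating "both_bad" like a tie is an assumption of this sketch.
OUTCOME_SCORE = {"first": 1.0, "second": 0.0, "tie": 0.5, "both_bad": 0.5}


def expected_score(rating_a, rating_b):
    """Elo expectation that model A is preferred over model B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def rank_models(votes):
    """Turn (first_model, second_model, outcome) votes into a sorted leaderboard."""
    ratings = defaultdict(lambda: START)
    for first, second, outcome in votes:
        score = OUTCOME_SCORE[outcome]
        expected = expected_score(ratings[first], ratings[second])
        # Whatever the first model gains, the second model loses.
        delta = K * (score - expected)
        ratings[first] += delta
        ratings[second] -= delta
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes on three placeholder models.
votes = [
    ("model_x", "model_y", "first"),
    ("model_y", "model_z", "tie"),
    ("model_z", "model_x", "second"),
]
leaderboard = rank_models(votes)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

Because each update moves the two models by equal and opposite amounts, the total rating is conserved, and a model's position depends on whom it beat as well as how often.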
The platform is publicly available, allowing users to ask research questions for free. Users receive answers from two models and can vote on their performance. The leaderboard is updated frequently but only includes votes from verified users who agree to the terms.
Jonathan Kummerfeld, an AI researcher at the University of Sydney in Australia, says that being able to ask LLMs about science topics and trust the answers will help researchers keep up with the latest literature in their fields. “This will help researchers find work they may have otherwise missed,” he adds.