AI Models Struggle With Spatial Reasoning in Blueprint-Bench

A new benchmark called Blueprint-Bench has exposed significant limitations in AI models' ability to perform spatial reasoning tasks. The benchmark, developed by Andon Labs, evaluates AI systems on their capacity to convert apartment photographs into accurate 2D floor plans, a task requiring advanced spatial intelligence.

Leading AI models, including GPT-5, Claude 4 Opus, and image generation systems like GPT-Image, were tested using a dataset of 50 apartments. The results showed that most models performed at or below a random baseline, highlighting a major blind spot in current AI capabilities.

The Blueprint-Bench Evaluation

Blueprint-Bench targets spatial reasoning: the ability to understand and interpret physical spaces. Given a set of interior images, a model must infer room layout, connectivity, and scale. A custom scoring algorithm then measures how closely each generated floor plan matches the ground truth, based on room connectivity graphs and size rankings.
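
To make the idea of scoring by connectivity and size ranking concrete, here is a minimal Python sketch. It is not the benchmark's actual algorithm, which the article does not detail. It assumes that rooms carry matching labels in both plans, that connectivity is compared by edge overlap, and that size ranking is compared pair by pair. The function names, the dictionary layout, and the equal weighting are all hypothetical.

```python
from itertools import combinations


def edge_jaccard(pred_edges, true_edges):
    """Jaccard overlap between two sets of undirected room-to-room connections."""
    pred = {frozenset(e) for e in pred_edges}
    true = {frozenset(e) for e in true_edges}
    if not pred and not true:
        return 1.0  # two empty graphs are treated as identical
    return len(pred & true) / len(pred | true)


def size_rank_agreement(pred_areas, true_areas):
    """Fraction of room pairs whose relative size order matches the ground truth."""
    shared = sorted(set(pred_areas) & set(true_areas))
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 0.0  # nothing to compare
    agree = sum(
        (pred_areas[a] > pred_areas[b]) == (true_areas[a] > true_areas[b])
        for a, b in pairs
    )
    return agree / len(pairs)


def floor_plan_similarity(pred, true, w_graph=0.5):
    """Weighted mix of connectivity overlap and size-rank agreement, in [0, 1].

    `pred` and `true` are dicts with an "edges" list of room-name pairs and an
    "areas" mapping of room name to area. The weighting is an assumption.
    """
    graph = edge_jaccard(pred["edges"], true["edges"])
    rank = size_rank_agreement(pred["areas"], true["areas"])
    return w_graph * graph + (1 - w_graph) * rank
```

Under these assumptions, a plan identical to the ground truth scores 1.0, and the score falls as connections are misplaced or rooms are drawn in the wrong size order. A scorer for real model output would also have to cope with unlabeled or missing rooms, which this sketch ignores.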

The dataset includes 50 apartments, each with approximately 20 images. Human evaluators significantly outperformed AI models, which shows that the task is solvable from the images and underscores the current gap in AI spatial reasoning.

Key Findings

The study found that image generation models struggled to follow instructions, while agent-based approaches did not improve significantly over single-pass generation. These results suggest that converting visual information into accurate spatial representations remains a hard problem for existing AI architectures.

Blueprint-Bench provides a numerical framework for comparing spatial intelligence across different AI models. Ongoing evaluations are planned to monitor improvements in spatial reasoning as new models and community submissions emerge.

Implications for AI Research

The gap between human and AI performance on this task underscores the need for further research on spatial reasoning. AI systems have made significant strides in other domains, but spatial intelligence remains a weakness that must be addressed before they can be considered robust and versatile.

Blueprint-Bench serves as both a benchmark and a call to action for the AI community to prioritize spatial reasoning in model development and evaluation.