Blueprint-Bench Exposes Spatial Reasoning Blind Spot in Leading AI Models

By Lukas Petersson, Axel Backlund, Axel Wennström, Hanna Petersson, Callum Sharrock, and Arash Dabiri (andonlabs.com); rewritten by AI News staff

Source: andonlabs.com, September 24, 2025
Andon Labs has unveiled Blueprint-Bench, a benchmark designed to evaluate spatial reasoning capabilities in AI models. The benchmark tasks models with converting apartment photographs into accurate 2D floor plans, a job that requires genuine spatial intelligence: inferring room layouts, understanding connectivity between rooms, and maintaining consistent scale.

The study evaluated leading language models (GPT-5, Claude 4 Opus, Gemini 2.5 Pro, Grok-4), image generation models (GPT-Image, NanoBanana), and agent systems (Codex CLI, Claude Code) on a dataset of 50 apartments, each with approximately 20 interior images. Each apartment has a ground-truth floor plan adapted from the official floor plan in its listing. A custom scoring algorithm measured the similarity between generated and ground-truth floor plans based on room connectivity graphs and room-size rankings.

The results point to a significant blind spot in current AI capabilities: most models performed at or below a random baseline. Image generation models struggled with instruction following, and agent-based approaches showed no meaningful improvement over single-pass generation. Human performance, by contrast, remained substantially superior. This gap between humans and all tested models suggests that converting visual information into accurate spatial representations remains a hard problem for existing architectures, even when the input modality is well represented in their training data.

Blueprint-Bench provides a numerical framework for comparing spatial intelligence across different model architectures, with ongoing evaluations planned for new models and community submissions to monitor the emergence of spatial intelligence in generalist AI systems.
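The article does not publish the scoring algorithm itself, but a metric built on "room connectivity graphs and size rankings" could plausibly combine two components: overlap between the sets of room-to-room connections, and pairwise agreement on which rooms are larger. The sketch below is a hypothetical illustration of that idea, not Andon Labs' actual implementation; the function names, the Jaccard/Kendall-style sub-scores, and the equal weighting are all assumptions.

```python
from itertools import combinations

def connectivity_score(pred_edges, true_edges):
    """Hypothetical: Jaccard overlap of undirected room-to-room connections."""
    pred = set(map(frozenset, pred_edges))
    true = set(map(frozenset, true_edges))
    if not pred and not true:
        return 1.0
    return len(pred & true) / len(pred | true)

def size_ranking_score(pred_rank, true_rank):
    """Hypothetical: fraction of room pairs ordered the same way by size
    (a Kendall-tau-style agreement), given rank dicts room -> rank index."""
    pairs = list(combinations(true_rank, 2))
    if not pairs:
        return 1.0
    agree = sum(
        (pred_rank[a] < pred_rank[b]) == (true_rank[a] < true_rank[b])
        for a, b in pairs
    )
    return agree / len(pairs)

def blueprint_score(pred_edges, true_edges, pred_rank, true_rank):
    """Hypothetical combined score in [0, 1], weighting both parts equally."""
    return 0.5 * connectivity_score(pred_edges, true_edges) + \
           0.5 * size_ranking_score(pred_rank, true_rank)
```

Under this toy metric, a model that recovers every doorway but inverts all size orderings would score 0.5, which matches the article's framing that connectivity and scale are evaluated as separate aspects of spatial understanding.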
"Blueprint-Bench reveals that current AI systems struggle significantly with spatial reasoning tasks," the researchers noted.