Study Benchmarks Environmental Efficiency of AI Models for COVID-19 X-Ray Diagnosis

A recent paper from Liam Kearns at AuraQ examines the integration of AI tools, particularly large language models (LLMs) such as ChatGPT and Claude, into medical applications to improve diagnostic efficiency. The study benchmarks 14 model configurations, comparing both accuracy and environmental impact when detecting COVID-19 in chest X-rays within a Mendix application. The findings indicate that smaller, custom models reduce the carbon footprint, but their output can be biased towards positive diagnoses and given with lower confidence. Restricting LLMs to probabilistic outputs led to poor performance. The most efficient solution identified was the Covid-Net model: despite a slightly larger carbon footprint than other small models, it showed a 99.9% reduction in carbon footprint compared to GPT-4.5-Preview while achieving an accuracy of 95.5%, the highest among all models tested. The research highlights major accuracy and carbon footprint concerns when LLMs are used to produce probabilistic outputs for classifying diseases from X-rays, and emphasizes that local models deployed alongside applications can address both concerns. Knowledge bases for LLMs were found to increase detection accuracy but had varying impacts on carbon footprint. The study contributes to the understanding of generative and discriminative models in COVID-19 detection and highlights the environmental risks of using generative AI tools for classification tasks.
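
For readers unfamiliar with how a figure like the 99.9% reduction is typically derived, the following is a minimal sketch of a relative-reduction calculation. The function name `relative_reduction` and the emission values are hypothetical and chosen only for illustration; they are not measurements from the study, and the paper may compute its figure differently.

```python
def relative_reduction(baseline: float, candidate: float) -> float:
    """Return the fractional reduction of `candidate` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - candidate) / baseline


# Hypothetical emissions in grams of CO2-equivalent.
# These are illustrative placeholders, NOT figures reported in the study.
baseline_g = 1000.0   # assumed footprint of a large hosted LLM
candidate_g = 1.0     # assumed footprint of a small local model

reduction = relative_reduction(baseline_g, candidate_g)
print(f"Relative reduction: {reduction:.1%}")
```

Under these assumed values, a model emitting one thousandth of the baseline corresponds to a 99.9% reduction, which conveys the scale of the gap the study describes between a local model and a large hosted LLM.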