Microsoft's MMCTAgent: Multimodal AI Reasoning Over Large Video and Image Collections

Published on November 12, 2025 at 12:00 PM
Microsoft Research has unveiled MMCTAgent (Multi-modal Critical Thinking Agent), a novel AI system designed for sophisticated reasoning across large video and image collections. To overcome the limitations of existing models on this kind of content, MMCTAgent leverages Microsoft's AutoGen framework to integrate language, vision, and temporal understanding for complex analytical tasks.

Key Features of MMCTAgent
  • Dynamic Multimodal Reasoning: Employs iterative planning and reflection for in-depth analysis.
  • AutoGen Framework: Built upon Microsoft’s open-source multi-agent system.
  • Modality-Specific Agents: Features ImageAgent and VideoAgent with specialized tools for each modality.
  • Planner-Critic Architecture: Enables structured self-evaluation and refinement of conclusions.
MMCTAgent's architecture pairs two coordinated agents: a Planner that decomposes queries and a Critic that validates the Planner's reasoning. This iterative loop improves the accuracy and consistency of the system's responses, and the architecture is designed for extensibility, so developers can integrate domain-specific tools. (Illustrative sketches of this loop and of the VideoAgent's ingestion pipeline appear at the end of this article.)

The VideoAgent component processes long-form video through an ingestion pipeline covering transcription, key-frame identification, semantic chunking, and multimodal embedding creation. The resulting structured metadata is indexed in Azure AI Search for scalable semantic retrieval and downstream reasoning. The ImageAgent mirrors this structured approach for static visual analysis, integrating recognition, detection, and OCR tools.

Evaluations show significant performance gains: for example, accuracy on the MM-Vet benchmark rises from 60.20% to 74.24% when GPT-4V is paired with the appropriate tools. MMCTAgent is available on GitHub and in Azure AI Foundry Labs.

Looking Ahead
Microsoft aims to improve MMCTAgent's efficiency and adaptability and to extend its applications to real-world domains through projects such as Project Gecko.
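To make the Planner-Critic loop described above concrete, here is a minimal, framework-agnostic sketch in Python. It does not use the actual AutoGen agent classes or MMCTAgent's real prompts; the call_llm helper, the prompts, and the acceptance check are all illustrative assumptions.

```python
# Minimal sketch of an iterative Planner-Critic loop (illustrative only).
# `call_llm` is a hypothetical helper standing in for any chat-completion API;
# it is NOT part of MMCTAgent or AutoGen.

def call_llm(system: str, prompt: str) -> str:
    """Placeholder: send a prompt to a language model and return its reply."""
    raise NotImplementedError("wire this to your model provider")


def answer_with_critique(query: str, max_rounds: int = 3) -> str:
    feedback = ""
    answer = ""
    for _ in range(max_rounds):
        # Planner: decompose the query (optionally using prior critic feedback)
        # and draft an answer.
        answer = call_llm(
            system="You are a Planner. Break the query into steps, "
                   "use available tools, and draft an answer.",
            prompt=f"Query: {query}\nCritic feedback so far: {feedback or 'none'}",
        )

        # Critic: validate the Planner's reasoning and either accept it or
        # return concrete feedback for another round.
        verdict = call_llm(
            system="You are a Critic. Check the answer for factual and "
                   "logical errors. Reply 'ACCEPT' or list the problems.",
            prompt=f"Query: {query}\nProposed answer: {answer}",
        )
        if verdict.strip().upper().startswith("ACCEPT"):
            break
        feedback = verdict  # feed the critique into the next planning round
    return answer
```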
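The VideoAgent's ingestion pipeline can be sketched in a similarly rough way, with the Azure AI Search Python SDK used for the indexing step. The transcription, key-frame, chunking, and embedding helpers below are hypothetical stand-ins, not MMCTAgent's actual implementation, and the index schema is assumed.

```python
# Illustrative video-ingestion sketch: transcribe, pick key frames, chunk,
# embed, and index in Azure AI Search. All helper functions are hypothetical.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient


def transcribe(video_path: str) -> list[dict]:
    """Placeholder: return [{'start': float, 'end': float, 'text': str}, ...]."""
    raise NotImplementedError


def extract_key_frames(video_path: str) -> list[dict]:
    """Placeholder: return [{'timestamp': float, 'caption': str}, ...]."""
    raise NotImplementedError


def chunk_semantically(segments: list[dict]) -> list[dict]:
    """Placeholder: merge transcript segments into topic-coherent chunks."""
    raise NotImplementedError


def embed(text: str) -> list[float]:
    """Placeholder: return an embedding vector for semantic retrieval."""
    raise NotImplementedError


def ingest_video(video_path: str, search_client: SearchClient) -> None:
    segments = transcribe(video_path)
    frames = extract_key_frames(video_path)
    chunks = chunk_semantically(segments)

    docs = []
    for i, chunk in enumerate(chunks):
        # Attach the key frames that fall inside this chunk's time span.
        frame_captions = [
            f["caption"] for f in frames
            if chunk["start"] <= f["timestamp"] <= chunk["end"]
        ]
        docs.append({
            "id": f"chunk-{i}",
            "video": video_path,
            "start": chunk["start"],
            "end": chunk["end"],
            "text": chunk["text"],
            "frame_captions": frame_captions,
            "embedding": embed(chunk["text"] + " " + " ".join(frame_captions)),
        })

    # Index the structured metadata so the agents can retrieve it later.
    search_client.upload_documents(documents=docs)


# Usage (endpoint, index name, and key are placeholders):
# client = SearchClient("https://<service>.search.windows.net",
#                       "video-chunks", AzureKeyCredential("<key>"))
# ingest_video("lecture.mp4", client)
```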