
AI Learns Vision and Sound Connection

Source: news.mit.edu

Published on May 22, 2025

Updated on May 22, 2025


AI Learns to Link Sight and Sound

Researchers from MIT and collaborators have developed a groundbreaking approach to enhance AI models, enabling them to learn by connecting sight and sound. This advancement mirrors human sensory learning, where visual and auditory cues are intrinsically linked. For instance, humans can observe someone playing the cello and immediately associate the musician's movements with the music being produced.

This innovation holds significant potential in fields like journalism and film production, where AI could streamline the process of curating multimodal content through automated video and audio retrieval. In the long term, it could enhance a robot’s ability to understand and navigate real-world environments, where auditory and visual information are often intertwined.

CAV-MAE Sync: A New Approach

The researchers introduced a method that helps machine-learning models align corresponding audio and visual data from video clips without the need for human labels. By refining the training process, the model learns to establish a more precise correspondence between specific video frames and the audio occurring at those moments. Architectural adjustments were also made to balance two distinct learning objectives, resulting in improved performance.

These enhancements increase the accuracy of video retrieval tasks and improve the model’s ability to classify actions within audiovisual scenes. For example, the new method can automatically and accurately match the sound of a door slamming with the visual of it closing in a video clip.

Andrew Rouditchenko, an MIT graduate student and co-author of the research paper, explained that the goal is to develop AI systems capable of processing audio and visual information simultaneously and seamlessly. He noted that integrating this audio-visual technology into tools like large language models could open up new applications and possibilities.

Advancements in Model Training

This work builds on a machine-learning method developed by the researchers a few years ago, which provided an efficient way to train multimodal models to process audio and visual data without human labels. The model, called CAV-MAE, encodes visual and audio data separately into tokens. Using the natural audio from the recordings, the model learns to map corresponding pairs of audio and visual tokens closely together within its internal representation space.
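To make the idea of mapping matching pairs close together concrete, the sketch below shows one common way to train such an alignment with an InfoNCE-style contrastive loss in PyTorch, assuming each clip's audio and visual tokens have already been pooled into a single embedding per modality. The function name, temperature value, and shapes are illustrative assumptions, not details taken from the CAV-MAE code.

    import torch
    import torch.nn.functional as F

    def contrastive_alignment_loss(audio_emb, video_emb, temperature=0.07):
        """Pull matching audio/video embeddings together and push mismatched
        pairs apart. Both inputs are (batch, dim) tensors produced by
        separate audio and visual encoders."""
        audio_emb = F.normalize(audio_emb, dim=-1)
        video_emb = F.normalize(video_emb, dim=-1)
        logits = audio_emb @ video_emb.t() / temperature    # pairwise similarities
        targets = torch.arange(len(logits), device=logits.device)  # i-th audio matches i-th video
        # Symmetric loss: audio-to-video and video-to-audio directions
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))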

The researchers discovered that using two learning objectives balances the model’s learning process, enabling CAV-MAE to understand corresponding audio and visual data while improving its ability to retrieve video clips that match user queries. However, CAV-MAE treats audio and visual samples as a single unit, meaning a 10-second video clip and the sound of a door slamming are mapped together, even if the audio event occurs in just one second of the video.
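As a rough illustration of how two objectives can be balanced during training, the snippet below simply combines a contrastive loss and a reconstruction loss with a weighted sum; the weights shown are placeholders, not values reported by the researchers.

    def combined_objective(contrastive_loss, reconstruction_loss,
                           contrastive_weight=0.1, reconstruction_weight=1.0):
        """Weighted sum of the two training objectives; tuning the weights
        trades alignment quality against reconstruction quality."""
        return (contrastive_weight * contrastive_loss
                + reconstruction_weight * reconstruction_loss)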

In their improved model, named CAV-MAE Sync, the researchers split the audio into smaller windows before the model computes its representations of the data. This allows the model to generate a separate representation for each smaller window of audio. During training, the model learns to associate one video frame with the audio occurring during just that frame. Co-author Edson Araujo explained that this approach helps the model learn a finer-grained correspondence, which improves performance later when the information is aggregated.
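The windowing step can be sketched as follows, assuming the clip's audio has already been encoded into a time-ordered sequence of tokens; the function name and tensor shapes are assumptions made for illustration rather than the paper's implementation.

    import torch

    def split_audio_into_windows(audio_tokens, num_windows):
        """Split a clip's audio tokens into shorter temporal windows so the
        model can build a separate representation per window instead of one
        representation for the whole clip. audio_tokens: (time_steps, dim) ->
        (num_windows, steps_per_window, dim); trailing steps that do not fill
        a full window are dropped for simplicity."""
        time_steps, dim = audio_tokens.shape
        steps_per_window = time_steps // num_windows
        usable = steps_per_window * num_windows
        return audio_tokens[:usable].reshape(num_windows, steps_per_window, dim)

Each window embedding can then be aligned with the video frame sampled at the same moment, for example by applying the contrastive loss from the earlier sketch to frame-window pairs instead of whole clips.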

The model incorporates a contrastive objective, where it learns to associate similar audio and visual data, and a reconstruction objective, which aims to recover the original audio and visual data from partially masked inputs. In CAV-MAE Sync, the researchers introduced two new types of data representations, or tokens, to enhance the model’s learning ability. These include dedicated "global tokens" that assist with the contrastive learning objective and dedicated "register tokens" that help the model focus on important details for the reconstruction objective.
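One plausible way to realize such dedicated tokens is to prepend learnable vectors to the encoder's input sequence, as in the sketch below; the class name, token counts, and placement are assumptions for illustration, not the layout used in CAV-MAE Sync.

    import torch
    import torch.nn as nn

    class WithGlobalAndRegisterTokens(nn.Module):
        """Prepend learnable 'global' tokens (intended for the contrastive
        objective) and 'register' tokens (extra slots the model can use while
        reconstructing) to a sequence of patch tokens."""
        def __init__(self, dim, num_global=1, num_register=4):
            super().__init__()
            self.global_tokens = nn.Parameter(torch.randn(num_global, dim) * 0.02)
            self.register_tokens = nn.Parameter(torch.randn(num_register, dim) * 0.02)

        def forward(self, patch_tokens):
            # patch_tokens: (batch, seq_len, dim)
            batch = patch_tokens.shape[0]
            g = self.global_tokens.unsqueeze(0).expand(batch, -1, -1)
            r = self.register_tokens.unsqueeze(0).expand(batch, -1, -1)
            return torch.cat([g, r, patch_tokens], dim=1)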

Araujo added that they allowed the model more flexibility to perform each of these two tasks—contrastive and reconstructive—more independently, which benefited overall performance. The researchers’ enhancements improved the model’s ability to retrieve videos based on audio queries and to predict the class of audiovisual scenes, such as a dog barking or an instrument playing. The results were more accurate than those of their previous work and outperformed more complex methods that require larger amounts of training data.
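Once audio and video share an embedding space, retrieval reduces to a nearest-neighbor search; the sketch below ranks stored video embeddings against an audio query by cosine similarity and is an illustrative assumption, not the evaluation code used by the researchers.

    import torch
    import torch.nn.functional as F

    def retrieve_videos(audio_query_emb, video_embs, top_k=5):
        """Return the indices of the stored video embeddings most similar to
        an audio query. audio_query_emb: (dim,), video_embs: (num_videos, dim)."""
        query = F.normalize(audio_query_emb, dim=-1)
        library = F.normalize(video_embs, dim=-1)
        scores = library @ query                 # cosine similarities
        k = min(top_k, scores.numel())
        return torch.topk(scores, k=k).indices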

In the future, the researchers aim to incorporate new models that generate better data representations into CAV-MAE Sync, which could further improve performance. They also plan to enable their system to handle text data, an important step toward developing an audiovisual large language model.