How Audio and Visual Signals Move Inside Multimodal LLMs

The paper "From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs" was released on arXiv and asks a basic but important question: if a multimodal large language model can both hear and see, how do audio and visual signals actually travel through the model and shape its final answer? The work focuses on the internal pathways of audio-visual large language models rather than only their output behavior. [S1]

Paper overview: what it studies

This paper studies the internal information flow of audio and visual perception in multimodal LLMs, specifically in audio-visual large language models. According to the arXiv abstract, the authors start from the observation that these systems are increasingly used in research and practical settings, yet the pathways through which audio and visual tokens influence predictions remain poorly understood. In other words, the paper is not on...