The video “AI Reasoning is Textual, not VISUAL #superintelligence,” uploaded on October 4, 2025, by the “Discover AI” channel, explores the idea that reasoning in AI takes shape in textual embeddings rather than in visual processing. It is based on the paper “Learning to See Before Seeing” from Meta Superintelligence Labs and the University of Oxford. The core argument is that text-only pre-training can produce reasoning abilities that do not initially require visual input, which challenges conventional assumptions about the need for multimodal training data.
The video's dense, logic-heavy discussion of how text and visual data intertwine highlights the Platonic Representation Hypothesis, which asserts that robust AI systems converge on similar latent representations within mathematical vector spaces.
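To make "similar latent representations" concrete, here is a minimal sketch of one way such convergence can be quantified: a mutual nearest-neighbor overlap between two embedding spaces. This is an illustration under stated assumptions, not the metric used in the video or the paper. The function names (`cosine`, `knn_indices`, `mutual_knn_alignment`) and the toy vectors are hypothetical and chosen only for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_indices(vectors, i, k):
    """Indices of the k items most similar to item i (excluding i itself)."""
    others = [j for j in range(len(vectors)) if j != i]
    others.sort(key=lambda j: cosine(vectors[i], vectors[j]), reverse=True)
    return set(others[:k])

def mutual_knn_alignment(space_a, space_b, k):
    """Average fraction of shared k-nearest neighbors across two spaces.

    Both spaces must embed the same items in the same order; their
    dimensionalities may differ.
    """
    n = len(space_a)
    overlaps = [
        len(knn_indices(space_a, i, k) & knn_indices(space_b, i, k)) / k
        for i in range(n)
    ]
    return sum(overlaps) / n

# Toy data (hypothetical): four concepts embedded by a "text" model in 2-D
# and by a "vision" model in 3-D, with the same neighborhood structure.
text_emb = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
vision_emb = [(0.0, 2.0, 0.0), (0.2, 1.8, 0.0), (2.0, 0.0, 0.0), (1.8, 0.2, 0.0)]

# Swapping two items breaks the correspondence between the spaces.
vision_shuffled = [vision_emb[0], vision_emb[2], vision_emb[1], vision_emb[3]]

aligned_score = mutual_knn_alignment(text_emb, vision_emb, k=1)
shuffled_score = mutual_knn_alignment(text_emb, vision_shuffled, k=1)
print(aligned_score, shuffled_score)
```

A score near 1 means the two models agree on which items are neighbors even though their coordinate systems differ, which is the kind of agreement the hypothesis is about. A score near 0 means they do not.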
What stands out is the premise of pre-training on predominantly logical, reasoning-centric data, which makes up as much as 75% of the total, complemented by only 15% visual information. The video claims this composition is critical for getting the model to combine logical reasoning with visual comprehension, and cites computational experiments at a scale of one trillion tokens and half a million GPU hours in support.
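As a rough illustration of what those proportions mean in absolute terms, the sketch below applies the stated shares to a one-trillion-token budget. It assumes the 75% and 15% figures apply to that full budget, which the summary does not state explicitly. The remaining 10% is not described, so it is labeled as unspecified rather than guessed. The names `MIX` and `token_budget` are hypothetical.

```python
from fractions import Fraction

# Scale cited in the video: one trillion tokens.
TOTAL_TOKENS = 1_000_000_000_000

# Shares as stated in the summary above; what fills the rest is not described.
MIX = {
    "reasoning_centric": Fraction(75, 100),
    "visual_information": Fraction(15, 100),
}

def token_budget(total, mix):
    """Convert fractional shares into token counts and report the remainder."""
    budget = {name: int(total * share) for name, share in mix.items()}
    budget["unspecified_remainder"] = total - sum(budget.values())
    return budget

budget = token_budget(TOTAL_TOKENS, MIX)
for name, tokens in budget.items():
    print(f"{name}: {tokens:,} tokens")
```

Using `Fraction` rather than floats keeps the arithmetic exact, so the three parts always sum back to the total.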
However, while the video presents the synergy between text and visual embeddings well, it would benefit from spelling out the limits of current AI systems in reasoning and visual understanding, since these limits have broader implications for understanding what such models can and cannot interpret.
The narrative makes a convincing case that, although little visual data is required, the depth and breadth of reasoning-centric data significantly shape an AI’s cognitive abilities. This challenges the preconception that substantial multimodal data is mandatory for accurate AI reasoning and posits instead that textual logic, when properly mapped, yields a versatile reasoning agent that can integrate visual input.