
Today’s artificial intelligence systems often display perplexing inconsistencies. Ask for a video of a dog, for instance, and you may find that the dog’s collar disappears as it walks behind a love seat. The glitch arises from the predictive nature of many AI models, including those powering ChatGPT: they are designed to forecast the most probable visual or textual output, without a well-defined, continually updated understanding of the world.
However, researchers across various AI domains are beginning to develop what are known as “world models.” These models promise to extend AI’s capabilities significantly, with implications for augmented reality, robotics, autonomous vehicles, and the pursuit of artificial general intelligence (AGI). The concept is easiest to grasp through the notion of 4D models, which incorporate three spatial dimensions plus time.
To visualize 4D modeling, consider the film Titanic, which was converted to stereoscopic 3D for its 2012 re-release. If every frame of the film existed as a full 3D scene, viewers could navigate to different timeframes and perspectives, generating countless new variations of the film. Recent research, such as the preprint “NeoVerse: Enhancing 4D World Model with in-the-Wild Monocular Videos,” outlines methods for transforming ordinary videos into 4D models, enabling new video to be created from arbitrary viewpoints.
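To make the “three spatial dimensions plus time” idea concrete, here is a minimal Python sketch: a toy 4D scene stored as a point cloud whose positions depend on time, queried from any camera position at any timestamp. Real systems such as the one in the NeoVerse preprint use far richer learned representations; the function names and structures below are illustrative assumptions, not that paper’s method.

```python
# A toy 4D scene: 3D geometry as a function of time, renderable from any view.
import numpy as np

def scene_at(t: float) -> np.ndarray:
    """Return the scene's 3D points at time t (a toy rotating point set)."""
    base = np.random.default_rng(0).uniform(-1.0, 1.0, size=(500, 3))
    angle = 0.5 * t  # the whole point set slowly rotates about the y axis
    rot = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                    [0.0, 1.0, 0.0],
                    [-np.sin(angle), 0.0, np.cos(angle)]])
    return base @ rot.T

def project(points: np.ndarray, cam_pos: np.ndarray, focal: float = 1.0) -> np.ndarray:
    """Pinhole-project 3D points into a camera at cam_pos looking down -z."""
    rel = points - cam_pos
    rel = rel[rel[:, 2] < 0]  # keep only points in front of the camera
    return focal * rel[:, :2] / -rel[:, 2:3]

# Time is just another query axis: any (timestamp, viewpoint) pair yields a
# new frame, which is what lets a 4D model "navigate" a captured video.
frame = project(scene_at(t=2.0), cam_pos=np.array([0.0, 0.0, 3.0]))
print(frame.shape)  # (num_visible_points, 2) image-plane coordinates
```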
Applying 4D modeling techniques can also make generated video more stable. In the earlier scenario of the dog behind the love seat, a continuously updated 4D world model would help keep the dog’s collar visible and preserve the identity of the furniture. The capability is still nascent, but it points to a broader ambition: AI systems with dynamic internal scene maps that evolve over time.
4D models matter not only for video generation but also for augmented reality systems such as Meta’s prototype glasses. By maintaining an evolving map of a user’s environment, these models help keep virtual objects stable, render realistic lighting, and preserve a spatial memory of recent events. They also support the critical concept of occlusion, where digital items overlap with real-world ones. As a 2023 paper details, achieving convincing occlusion requires a solid 3D model of the physical environment.
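The occlusion test itself is simple once that 3D model exists. Here is a minimal sketch, assuming the AR system supplies a per-pixel depth map of the real scene, which is one practical form a 3D model of the environment can take; the array names and single-object setup are simplifying assumptions, not any product’s actual pipeline.

```python
# Depth-based occlusion: draw a virtual pixel only where it is closer to the
# viewer than the real surface at the same pixel.
import numpy as np

H, W = 120, 160
real_depth = np.full((H, W), 4.0)        # real scene ~4 m away everywhere...
real_depth[60:, :] = 1.5                 # ...except a love seat at 1.5 m

virtual_depth = np.full((H, W), np.inf)  # inf = no virtual content here
virtual_depth[40:90, 50:110] = 2.5       # a virtual dog rendered at 2.5 m

visible = virtual_depth < real_depth     # the occlusion test

print(f"virtual pixels rendered: {visible.sum()} of "
      f"{np.isfinite(virtual_depth).sum()}")
# Rows 60 and below fall behind the 1.5 m love seat, so the dog's lower
# half is correctly hidden by the real furniture.
```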
Looking toward AGI, a world model takes on even greater importance. Current leading large language models, such as those behind ChatGPT, possess an implicit understanding of the world drawn from their training data but cannot integrate new information in real time. As Angjoo Kanazawa of UC Berkeley notes, there is substantial room for growth in building an intelligent LLM vision system that updates in real time as it gathers new experiences. The open challenge is to create a world model with spatiotemporal memory that extends what LLMs can do.
Several prominent AI researchers are now pivoting toward world models. Notably, Fei-Fei Li founded World Labs in 2024, which introduced the Marble software for generating 3D environments from various media. Yann LeCun, after leaving Meta, aims to build systems that specialize in understanding the physical world, reasoning, and planning through his startup AMI Labs. His earlier research emphasizes the importance of internal modeling, suggesting that humans cope well with unfamiliar situations partly because they can construct dynamic world models.
Recent advances, such as those reported in an April 2025 study highlighting DreamerV3, point to significant changes ahead. By learning a world model, an AI agent can improve its performance by “imagining” potential future scenarios before acting. So while the term “world model” covers any comprehensive internal representation of physical reality, 4D models may serve as the building blocks that help emerging AI systems understand and navigate their environments more effectively.
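The “imagining” step can be sketched in a few lines. Below, a hand-written dynamics model stands in for the learned one an agent like DreamerV3 would train, and a simple random-shooting planner scores candidate action sequences by rolling them out inside the model before acting. All names and interfaces are illustrative assumptions, not DreamerV3’s actual architecture.

```python
# Planning by imagination: roll out candidate action sequences inside a
# world model (no environment interaction) and act on the best first step.
import numpy as np

def world_model(state: np.ndarray, action: float):
    """Predict (next_state, reward). A real agent would learn this mapping."""
    next_state = state + np.array([action, -0.1])  # toy linear dynamics
    reward = -np.abs(next_state).sum()             # reward: stay near the origin
    return next_state, reward

def plan(state: np.ndarray, horizon: int = 5, candidates: int = 64) -> float:
    """Return the first action of the imagined rollout with the best return."""
    rng = np.random.default_rng(1)
    best_return, best_action = -np.inf, 0.0
    for _ in range(candidates):
        actions = rng.uniform(-1.0, 1.0, size=horizon)
        s, total = state.copy(), 0.0
        for a in actions:          # imagined rollout, entirely inside the model
            s, r = world_model(s, a)
            total += r
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action

state = np.array([2.0, 1.0])
print("chosen action:", plan(state))  # should push the state back toward 0
```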