Advanced AI systems that can process and generate information across multiple data modalities, such as text, images, audio, and video.
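A Large Multimodal Model (LMM) is a neural network trained on large datasets spanning several modalities, typically text and images and sometimes audio or video. Given a prompt, which may itself combine modalities, it can produce outputs such as captions, answers to questions about an image, and, depending on the model, new images or spoken words.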
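To make the idea of one network handling several modalities more concrete, here is a minimal sketch of one common design: each modality gets its own encoder that maps raw features into a shared embedding space, and the resulting vectors form a single sequence for the model to consume. This design is an assumption for illustration, not something the definition above prescribes. The names `EMBED_DIM`, `make_projection`, `ENCODERS`, and `embed_prompt` are hypothetical, and the fixed random linear maps only stand in for trained encoders.

```python
import random
from typing import Callable, Dict, List, Tuple

Vector = List[float]

EMBED_DIM = 4  # hypothetical width of the shared embedding space


def make_projection(in_dim: int, out_dim: int, seed: int) -> Callable[[Vector], Vector]:
    """Return a fixed random linear map that stands in for a trained encoder."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def project(features: Vector) -> Vector:
        if len(features) != in_dim:
            raise ValueError(f"expected {in_dim} features, got {len(features)}")
        # Plain matrix-vector product: one output value per weight row.
        return [sum(w * x for w, x in zip(row, features)) for row in weights]

    return project


# One stand-in encoder per modality; the raw feature sizes are arbitrary.
ENCODERS: Dict[str, Callable[[Vector], Vector]] = {
    "text": make_projection(3, EMBED_DIM, seed=0),
    "image": make_projection(5, EMBED_DIM, seed=1),
    "audio": make_projection(2, EMBED_DIM, seed=2),
}


def embed_prompt(parts: List[Tuple[str, Vector]]) -> List[Vector]:
    """Map each (modality, features) part into the shared space, keeping order."""
    return [ENCODERS[modality](features) for modality, features in parts]


# A prompt that mixes three modalities, using made-up feature values.
prompt = [
    ("text", [0.1, 0.2, 0.3]),
    ("image", [1.0, 0.0, 0.5, 0.2, 0.9]),
    ("audio", [0.4, 0.6]),
]
sequence = embed_prompt(prompt)
```

After this step every element of `sequence` has the same length regardless of which modality it came from, which is what would let a single downstream network process the mixed prompt as one input.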