State-of-the-Art Natural Language and Computer Vision Models by LLaVA

The article introduces LLaVA-NeXT, an advancement over LLaVA-1.5 with improved reasoning, OCR, and world knowledge capabilities. It reports better performance on benchmarks, stronger visual reasoning and OCR enabled by a new data mixture, and efficient deployment via SGLang. Even the largest 34B variant of LLaVA-NeXT is trained on fewer than 1M visual instruction tuning samples and keeps the minimalist design of its predecessor. The open-source release includes code, data, and models, with some components to be made available soon.

  • Highlights: LLaVA-NeXT achieves state-of-the-art performance, offers zero-shot Chinese capability, and is trained with low computational and data costs. It outperforms other open-source and commercial LMMs on selected benchmarks.
  • Technical Details: The article details technical improvements such as dynamic high-resolution input, a high-quality user instruction data mixture, and scaling of the LLM backbone. It also provides a model card for the LLaVA-NeXT variants, detailing model sizes, components, and training data.
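The article does not spell out how dynamic high-resolution input works internally. Below is a minimal, hypothetical sketch of the underlying grid-tiling idea: a large image is resized to a multiple of the vision encoder's base resolution and cut into fixed-size tiles that are encoded separately. Function names, the 336px tile size, and the `max_tiles` cap are illustrative assumptions, not LLaVA-NeXT's actual implementation.

```python
import math

def dynamic_tiles(width, height, tile=336, max_tiles=4):
    """Sketch of grid tiling for high-resolution input (illustrative only).

    Returns the resized canvas size and a list of (left, top, right, bottom)
    tile boxes covering it. A real pipeline would also keep a downscaled
    overview of the whole image alongside the tiles.
    """
    # Choose how many tile columns/rows the image needs, capped at max_tiles.
    cols = min(max(1, math.ceil(width / tile)), max_tiles)
    rows = min(max(1, math.ceil(height / tile)), max_tiles)
    canvas = (cols * tile, rows * tile)  # image is resized to this before slicing
    boxes = [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]
    return canvas, boxes

# Example: a 672x336 image maps to a 2x1 grid of 336px tiles.
canvas, boxes = dynamic_tiles(672, 336)
```

Each tile can then be passed through the vision encoder independently, which lets the model see fine detail (helping OCR) without retraining the encoder at a larger native resolution.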
March 3, 2024