The Holistic Evaluation of Text-to-Image Models (HEIM) benchmark, from the Stanford CRFM team behind the Holistic Evaluation of Language Models (HELM) project, is distributed as part of the crfm-helm Python package, which adds support for evaluating text-to-image generation models. The effort addresses the pressing need to understand the capabilities and risks of these models as they see increasing use in real-world applications. Where earlier evaluations focused mainly on text-image alignment and image quality, HEIM assesses models across 12 aspects crucial for real-world deployment: alignment, quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. The results, available on the project’s website and in the accompanying paper, show that no single model excels across all aspects, underscoring the value of this holistic approach.
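
As a rough illustration of how an evaluation might be launched with the package, the sketch below installs crfm-helm with the HEIM extras and runs a small text-to-image evaluation. The run-entry string, model identifier, and suite name are illustrative placeholders, and flag names have changed across crfm-helm versions, so consult the HELM documentation for the exact invocation supported by your installed version:

```bash
# Install crfm-helm with the extra dependencies needed for HEIM
pip install "crfm-helm[heim]"

# Run a small evaluation; the run entry and model name below are
# illustrative placeholders (older versions use --run-specs instead
# of --run-entries)
helm-run --run-entries "mscoco:model=huggingface/stable-diffusion-v1-4" \
         --suite my-heim-suite \
         --max-eval-instances 10

# Aggregate the results for the suite
helm-summarize --suite my-heim-suite
```

Limiting `--max-eval-instances` keeps a first run cheap; dropping the flag evaluates the full scenario.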