The Allen Institute for AI (Ai2) announced the launch of Molmo, a family of state-of-the-art multimodal models. The family includes our best Molmo model, an open, efficient, and powerful multimodal model that closes the gap between closed and open models. Most advanced multimodal models today can perceive the world and communicate with us; Molmo goes beyond that to enable acting in the world, unlocking a new generation of capabilities, from sophisticated web agents to robotics:

  • Exceptional image understanding: Molmo can accurately understand a wide range of visual data, from everyday objects and signs to complex charts, messy whiteboards, clocks, and menus.
  • Actionable insights: To bridge the gap between perception and action, Molmo models can point to what they perceive, enabling capabilities that require spatial understanding. Molmo can point to UI elements on the screen, allowing developers to build web agents or robots that navigate complex interactions on screen and in the real world (a minimal sketch of using this pointing output follows this list).

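To make the pointing capability above concrete, here is a minimal sketch of turning a Molmo pointing answer into pixel coordinates that a web agent or robot controller could act on. In the public demo, points appear as inline tags whose x/y values are percentages of the image size; that tag format, and the helper names below, are assumptions for illustration and may differ from the released inference code.

```python
import re

# Molmo answers pointing queries with inline tags such as
#   <point x="23.4" y="61.0" alt="Submit button">Submit button</point>
# where x/y are percentages of the image width/height (assumption based on
# the public Molmo demo; the released format may differ in detail).
POINT_TAG = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"[^>]*>')

def points_to_pixels(answer: str, width: int, height: int) -> list[tuple[int, int]]:
    """Convert percentage-based point tags in a Molmo answer to pixel coordinates."""
    return [
        (round(float(x) / 100 * width), round(float(y) / 100 * height))
        for x, y in POINT_TAG.findall(answer)
    ]

# Example: drive a click in a hypothetical browser-automation or robotics layer.
answer = '<point x="23.4" y="61.0" alt="Submit button">Submit button</point>'
for x, y in points_to_pixels(answer, width=1920, height=1080):
    print(f"click at ({x}, {y})")  # hand off to a web agent or robot controller
```
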
Molmo was designed and built in the open, and Ai2 will be releasing all model weights, captioning and fine-tuning data, and source code. Select model weights, inference code, and a demo are available now, providing open access to enable continued research and innovation in the AI community.

https://molmo.allenai.org/blog