Gemini Omni is Google’s first any-to-any multimodal “world model.” You give it any combination of text, image, audio, and video, and it generates video grounded in real-world knowledge. The first model, Gemini Omni Flash, is shipping now.
Text → video. Describe a scene; get a 10-second clip with synced audio.
Image → video. Upload a photo (a product, a person, a sketch) and bring it to life.
Video → video. Restyle, edit, or add effects to footage you already have.
Audio → video. Upload a song or voiceover and get video that matches its pacing, emotion, and beats, which is genuinely unique to Omni.
Conversational editing. Refine a clip by chatting: “make the lights dimmer,” “remove the violin,” “change the angle.” Every instruction builds on the last, and the scene remembers what came before.
World-model physics. It has an intuitive grasp of gravity, kinetic energy, and fluid dynamics, so a marble rolling down a track bounces and sounds right.
Think of it less as a video generator and more as a creative collaborator that already understands how the world looks, moves, and sounds.
