Tutorials

How to run Genmo Mochi 1 video generation on a cloud GPU

Deepak Manoor
Posted: November 12, 2024

    Video generation is the next frontier for generative AI. It is harder than generating images or text because it needs more compute, its training datasets are less accessible, and it involves more variables, such as smooth motion, temporal coherence, and frame aesthetics.

    Models like Llama, Pixtral, Flux, and many others have demonstrated how open-source AI drives faster and more widespread innovation across the field. That’s why Genmo’s announcement of the Mochi 1 model is a key step forward in advancing generative AI. 

    Here’s a snapshot of Mochi specs:

    Mochi 1 Specifications

    Model Architecture: Diffusion model built on the Asymmetric Diffusion Transformer (AsymmDiT) architecture
    Parameters: 10B
    Context: Context window of 44,520 video tokens
    Resolution: 480p
    Frames: Up to 30 frames per second, with clips up to 5.4 seconds long
    Licensing: Apache 2.0 (personal and commercial use)

    Genmo AI's benchmark results showcase state-of-the-art (SOTA) performance in both prompt adherence and Elo scores, metrics that indicate how closely the output matches the user's prompt and how fluid the motion is.

    Mochi 1 prompt adherence benchmark (Source: Genmo AI)


    Deploy Genmo Mochi Video with ComfyUI on an Ori GPU instance

    ComfyUI is an open-source AI tool developed and maintained by Comfy Org for running image and video generation models. Check out their GitHub repository.

    Prerequisites:

    Create a GPU virtual machine (VM) on Ori Global Cloud. We chose the NVIDIA H100 SXM with 80 GB VRAM for this demo, but the optimized ComfyUI version also runs on GPUs with less memory. More VRAM lets the variational autoencoder (VAE) decode at full quality; if memory is limited, ComfyUI will switch to tiled VAE decoding automatically.
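
    Once the VM is up, it's worth confirming that the GPU is visible before installing anything. This assumes the NVIDIA drivers are preinstalled on your image (they ship with nvidia-smi); if not, install them first:

    Bash/Shell
    # Verify that the GPU and driver are detected
    nvidia-smi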

    Install ComfyUI:

    Step 1: Update packages and install Git

    Bash/Shell
    sudo apt update
    sudo apt install git

    Step 2: Download the ComfyUI files

    Bash/Shell
    git clone https://github.com/comfyanonymous/ComfyUI.git
    cd ComfyUI

    Step 3: Install PyTorch, if you didn't add the init script for it when creating the virtual machine

    Bash/Shell
    pip install torch torchvision torchaudio
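
    To confirm that PyTorch can see the GPU, this one-liner should print True on a correctly configured CUDA machine:

    Bash/Shell
    # Prints "True" if PyTorch detects a CUDA-capable GPU
    python3 -c "import torch; print(torch.cuda.is_available())"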

    Step 4: Install dependencies

    Bash/Shell
    pip install -r requirements.txt

    Step 5: Install ComfyUI Manager, which helps you manage your custom nodes and instance.

    Bash/Shell
    cd custom_nodes
    git clone https://github.com/ltdrdata/ComfyUI-Manager.git
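
    With everything installed, launch the ComfyUI server from the repository root. The --listen 0.0.0.0 flag exposes the web UI on all interfaces (ComfyUI serves on port 8188 by default); if your VM is open to the internet, consider an SSH tunnel instead:

    Bash/Shell
    # Return to the ComfyUI root and start the server (web UI on port 8188 by default)
    cd ..
    python3 main.py --listen 0.0.0.0

    Then open http://<your-vm-ip>:8188 in a browser to load the ComfyUI workflow editor.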

    How good is Genmo Mochi 1?

    Mochi 1 Preview demonstrated strong prompt adherence and impressive video dynamics. Reducing the iteration count and frame count can significantly shorten video generation time, which could otherwise take 45 minutes for a 5-second clip at 200 iterations and 30 fps.

    We recommend testing the model several times to find a frame rate and iteration count that balance output quality against generation time. We observed that detailed prompts, with specifics on camera angle, motion type, lighting, and environment, yielded better results; for instance, a prompt like 'a slow dolly shot of a lighthouse at dusk, warm golden light, gentle waves rolling in' gives the model far more to work with than 'a lighthouse'.

    With the right prompts, Mochi 1 showed remarkable flexibility in frame aesthetics. While the motion occasionally appeared glitchy, it was generally smooth and fluid. The Genmo Mochi AI model also excelled in executing close-up shots with impressive clarity.

    The model struggles with text insertion, a challenge that has also affected image generation models in the past. However, recent improvements in image models show promise, and we expect Genmo to enhance this capability in future updates. Each model iteration takes about 14 seconds, which means generating a 5-second clip with 200 iterations takes more than 45 minutes (14 × 200 = 2,800 seconds), whereas closed-source video generation models are typically much faster.
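
    As a rough planning aid, total generation time is simply the per-iteration time multiplied by the iteration count. The sketch below uses the ~14 seconds per iteration we observed on the H100; treat the figures as illustrative rather than fixed:

    Bash/Shell
    # Back-of-the-envelope generation time: seconds per iteration x iterations
    SECS_PER_ITER=14   # observed on an H100 SXM in this demo
    ITERATIONS=200
    echo "~$(( SECS_PER_ITER * ITERATIONS / 60 )) minutes"   # prints "~46 minutes"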

    Though Genmo currently limits the model to 480p resolution and 5-second videos, we’re excited about their upcoming 720p version of the text-to-video model and the future of open source video generation. 

    Imagine another AI reality. Build it on Ori.

    Ori Global Cloud is the first AI infrastructure provider with the native expertise, comprehensive capabilities and end-to-endless flexibility to support any model, team, or scale.
