Meta’s new Make-a-Video AI can generate quick movie clips from text prompts

Meta unveiled its Make-a-Scene text-to-image generation AI in July. Like Dall-E and Midjourney, it uses machine learning algorithms (and massive databases of scraped online artwork) to create fantastical depictions of written prompts. On Thursday, Meta CEO Mark Zuckerberg revealed Make-a-Scene’s more animated counterpart, Make-a-Video.

As its name implies, Make-a-Video is “a new AI system that lets people turn text prompts into brief, high-quality video clips,” Zuckerberg wrote in a Meta blog post Thursday. Functionally, Video works the same way that Scene does, relying on a mix of natural language processing and generative neural networks to convert non-visual prompts into images; it just produces its output in a different format.

“Our intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage,” a team of Meta researchers wrote in a research paper published Thursday morning. Doing so enabled the team to reduce the amount of time needed to train the Video model and eliminate the need for paired text-video data, while preserving “the vastness (diversity in aesthetic, fantastical depictions, etc.) of today’s image generation models.”
