Signs point to a general-use Sora-like model coming very soon, maybe even with open weights.
I’m bringing you this a day early as I’m off on Wednesday this week. Cheers!
Now that AI is mainstream, companies are too strongly incentivized to show us what they have built before we get to use it. For the previous transformative AI modalities, in the form of Stable Diffusion for images and ChatGPT for text, new capabilities came onto the scene with a blog post and general access. Video generation models are poised to be the next major transformation in how people use, and in this case consume, AI-generated content. With the rise of TikTok and its downstream transformation (a.k.a. shortification) of our media internet, video is now the default medium. AI is ready to accelerate this arena.
OpenAI’s text-to-video model Sora was announced on February 16th (the same day as Gemini’s million-token context length), and it spawned substantial discussions on everything from whether OpenAI should release the model to whether video can be used as a world model. It seemed like these models were a thing of the future that we wouldn’t get any time soon, mostly due to cost and closely guarded technical complexity.
[Video: “Ships in Coffee,” generated by OpenAI’s Sora]
The assumption in February seemed to be that OpenAI wouldn’t be willing to take the risk of making this model publicly available before the 2024 American election due to destabilization concerns, whether or not you think that caution is warranted. Today, looking at the text-to-video space makes it seem like a publicly available model of Sora’s caliber is around the corner, and it may even be released via open weights (more on this later).
Text-to-video is now the most concentrated area of foundation model competition. The capabilities across the players are remarkably similar. What changed?
This week, Runway ML announced their Gen-3 model, which they also brand as a general world model, similar to OpenAI. Given my background in linear systems and some chaos theory, I don’t think these models capture any real dynamics. For short timescales and dynamics where precision is a low priority, they do fine, but they’re not creating a world that is measurable. They’re great at creating background scenes and randomization to use within many other AI stacks.
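To make that chaos-theory point concrete, here is a minimal sketch (my illustration, not anything from these models): the logistic map is a one-line chaotic system, and two trajectories that start a billionth apart become uncorrelated within a few dozen steps. Any generative model that accumulates even tiny per-frame errors faces the same compounding, which is why short-horizon plausibility does not imply a measurable long-horizon world model.

```python
# Illustrative only: sensitive dependence on initial conditions in the
# logistic map x_{t+1} = r * x_t * (1 - x_t), which is chaotic at r = 4.

def logistic(x: float, r: float = 4.0) -> float:
    return r * x * (1.0 - x)

a, b = 0.400000000, 0.400000001  # initial conditions differing by 1e-9
for t in range(1, 51):
    a, b = logistic(a), logistic(b)
    if t % 10 == 0:
        print(f"step {t:2d}: |a - b| = {abs(a - b):.3e}")

# The gap roughly doubles each step (the Lyapunov exponent is ln 2), growing
# from 1e-9 to order 1 in about 30 steps: fine for short clips, hopeless for
# long-horizon, measurable dynamics.
```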
Just before Runway’s announcement, a Chinese company, Kling AI, opened up a Sora competitor to influencers and public figures with a Chinese phone number.
Back at Google I/O this year, Google announced Veo, which seemed extremely similar to Sora.
Just last week, Luma Labs announced and released their model, which you can sign up for on their website. It seems a half tier below the likes of Sora and Veo in quality, but general availability matters more. Now that people understand generative video is coming, they want to get their hands on it and learn what it can do. Sora cannot fulfill those needs.
A similar player to Luma is Pika. Pika was a bit earlier to make their model public, doing so last December, but the quality then was clearly a level behind Sora, so it didn’t make the breakthrough splash (like many research papers on the topic before Sora).
The text-to-video space is expansive if you start including the long tail of companies doing similar things. Eventually, I expect the companies that win here to have a specialty but also offerings across many other modalities in order to please their users. The most important thing these companies can do is acquire users. If adding audio or editing a video requires going to a competitor, that’s a lost customer.
Some similar offerings include Apparate’s video avatars, Wayve’s reconstructive imaging for self-driving, ElevenLabs’ or Google’s text-to-sound-effects, image-to-video tools, video-to-video tools, Viggle’s character rendering (built on Discord, like Midjourney), my friend’s startup Cartwheel for animations, Common Sense Machines for rendering worlds, and likely many other integrations with text.
We know there are more big players that will join the space. Midjourney’s CEO has said it will “begin training its video models starting in January,” and video generation seems closely aligned with Meta’s core business, so we can expect an offering from them shortly as well.