Last week OpenAI posted an update on its Sora text-to-video AI project. Specifically, it shared a technical paper along with a series of example videos.
Today I’d like to share some thoughts, particularly from my future-of-higher-education viewpoint.
To begin with, please note that Sora is NOT available for general users yet. They’re doing “red team testing” on it, presumably with a mix of in-house staff and selected (doubtless NDA-covered) external consultants. It’s not clear when we’ll be able to use it.
Now, the given examples are fascinating. They are 10- to 60-second clips, each with a text prompt, and some just look amazing. For example, for “A young man at his 20s is sitting on a piece of cloud in the sky, reading a book” we see this:
It is surreal in a way, but all textures look tangible and the effect is comforting rather than distancing.
I can’t help but smile at “Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee”:
The most ambitious clip the Sora team shared is also the biggest, both in length (a full minute) and in the detail and size of its prompt:
A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.
For me, this one enters the uncanny valley with closeups of the woman’s skin. Otherwise it works pretty well.
We should take these examples with some grains of salt. We can’t tell to what extent people have edited and revised them to make them look good. We don’t know how much work it took to get to these results from the initial prompts - how much prompt iteration, how many failed prompts. We also don’t know how representative these examples are of the application as it now stands. They could be the best 1% and users will have to wade through an awful lot of uncanny to get to similar output.
OpenAI hedged their bets carefully with this release. They openly admitted to limitations and mistakes:
The current model has weaknesses. It may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect. For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark.
The model may also confuse spatial details of a prompt, for example, mixing up left and right, and may struggle with precise descriptions of events that take place over time, like following a specific camera trajectory.
They also showed problematic examples, like this one for “Archeologists discover a generic plastic chair in the desert, excavating and dusting it with great care”:
(I actually enjoy the surrealism here, but you see the problem.)
Note the spatial dimension of these clips. Objects move in three dimensions, as does the point of view, sometimes.
What else can we tell about Sora from this limited publication?
The creative potential is huge. We’ve already seen people developing all kinds of results with LLM text and image generators, so it’s not a stretch to imagine users generating all kinds of video content. (For me, one milestone is the capacity to create a feature length video on demand. I’ve already posted about some ideas along this line.)
It clearly ramps up disinformation possibilities as well. Our increasingly visual society loves images and adores video. We should plan on some people using Sora to produce all kinds of fakes for political, economic, and personal purposes.
So what does Sora mean? How might it unfold in the world?
If we look at OpenAI’s brief history alongside the history of other generative AI projects and extrapolate, we might expect some of the following:
a series of releases, each trying to outdo its predecessors in terms of input size and output quality.
there should be missteps and hideous output as well. There will be unrepresentative, insensitive, and cruel results.
connections to other services, like we’ve seen with ChatGPT and DALL-E blending, or Microsoft and Google seeding all kinds of preexisting applications with their respective AI efforts. One way this might work is for a video generator to build on images created by tools like MidJourney.
more copyright lawsuits. This time I’d watch for the movie industry as well as the music world (as they produce videos). NB: OpenAI hasn’t identified Sora’s training dataset yet, according to Ars Technica.
torrents of hype and counter-hype. Sora will confirm pro- and anti-LLM sides in their beliefs.
the creation of tiered pricing plans, sometimes in the freemium model.
more efforts to create watermarks, now for video (see the sketch just after this list).
some video creators claiming their work is AI-free.
security problems, starting with users entering data in prompts which they shouldn’t.
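On the watermarking point: the simplest version is a visible label stamped on every frame, which is easy to sketch with today’s standard tools. Here is a minimal example using OpenCV; the filenames are placeholders, and real provenance schemes (invisible marks, C2PA-style metadata) would be far more involved.

```python
# Minimal sketch: stamp a visible watermark on every frame of a clip.
# Filenames are placeholders; serious provenance systems rely on invisible,
# tamper-resistant marks or signed metadata rather than an overlay like this.
import cv2

def watermark_video(src_path: str, dst_path: str, label: str = "AI-generated") -> None:
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    out = cv2.VideoWriter(dst_path, fourcc, fps, (width, height))

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Draw the label in the lower-left corner of each frame.
        cv2.putText(frame, label, (20, height - 20),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2)
        out.write(frame)

    cap.release()
    out.release()

# watermark_video("sora_clip.mp4", "sora_clip_marked.mp4")
```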
We should expect to see other text-to-video services appear. Several companies have posted about their own as-yet-unreleased projects, like Meta’s Make-A-Video and Google’s Lumiere and Imagen. Doubtless there are startups furiously working on such applications. One such is Synthesia, which already lets us create very short videos on demand (yes, I’ll share some of my results soon!). Runway has its Gen-2. And doubtless, too, there are open source projects burgeoning.
A big opening here is soundtracks. Sora’s releases are as silent as the first films. Will OpenAI release an audio generator, or will users combine Sora with someone else’s sound creator?
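The mechanics of pairing a silent clip with audio produced elsewhere are already mundane. A minimal sketch, assuming ffmpeg is installed and using placeholder filenames:

```python
# Minimal sketch: pair a silent, Sora-style clip with a separately produced
# audio track. Assumes ffmpeg is on the PATH; the filenames are placeholders
# for whatever the video and audio generators emit.
import subprocess

def add_soundtrack(video_path: str, audio_path: str, out_path: str) -> None:
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_path,   # silent video
        "-i", audio_path,   # generated or recorded soundtrack
        "-c:v", "copy",     # keep the video stream untouched
        "-c:a", "aac",      # encode audio for broad playback support
        "-shortest",        # stop when the shorter stream ends
        out_path,
    ], check=True)

# add_soundtrack("tokyo_walk.mp4", "street_ambience.wav", "tokyo_walk_with_sound.mp4")
```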
Looking further out… I see several paths for text-to-video to take. One is to further democratize video creation, with people adding LLM-created clips to many aspects of life, from PowerPoint presentations to avatars and passion projects. Beyond this, we see a threat to professional video and film creators: movie studios, TV show makers, animators, CGI professionals, advertisers, independents of all kinds. Will Sora et al clobber them the way the web caused a massive die-back of travel agencies? Or will the businesses apply capital to labor in the established way, automating aspects of video production and laying off staff?
At the same time, we should anticipate some AI-enabled creative forms appearing in both amateur and professional video worlds, features we haven’t seen (much of) before. Perhaps we’ll see generative AI follow CGI’s route of working first on topics which are easier to master: toys before human faces. We might see animation efforts more accepted than live action. Maybe we’ll see people go wild with camera angles and points of view, since there won’t be any physical constraints. And if computing power is an issue (for reasons of price or climate impact) we might see users opting for videos with less movement and fewer moving objects, leading to a calmer, simpler style.
One potential twist: if Sora et al become globally available and offer at least decent-quality output, perhaps text-to-video will replace storyboarding. Creators and designers of all sorts could turn to such apps to quickly churn out richer visualizations than those storyboards often offer. Imagine historians and police officers using something like Sora to visualize and test out hypotheses.
Let’s go a little further. The AI Explained YouTuber thinks we might see non-video generative AI subsumed within video tools. For example, we already know how to create images from film and video; why not fold DALL-E into Sora?
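The reverse direction, pulling stills out of a video, already takes only a few lines of code, which is part of why folding an image generator into a video tool seems plausible. A minimal sketch with OpenCV (the filename and one-frame-per-second sampling are arbitrary choices for illustration):

```python
# Minimal sketch: sample still frames from a video clip, roughly one per second.
# This is the easy direction ("images from film and video"); the filename and
# sampling rate are arbitrary choices for illustration.
import cv2

def extract_stills(video_path: str, out_prefix: str = "frame") -> int:
    cap = cv2.VideoCapture(video_path)
    fps = int(cap.get(cv2.CAP_PROP_FPS)) or 30  # fall back if FPS is unreported
    saved = 0
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % fps == 0:  # keep roughly one frame per second
            cv2.imwrite(f"{out_prefix}_{saved:04d}.png", frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# extract_stills("pirate_ships_coffee.mp4")
```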
“AE” goes further and coincidentally chimes in with what I was teaching at Georgetown last week: branching paths. Sora or a competitor/successor can generate variations of a prompt, like all generative AI tools do. Why not offer multiple versions of a video story? Users get to create iterations according to what they prefer. We could change the main character of a video, while keeping the background and other characters the same, or vice versa. It’s a new spin on the garden of forking paths.
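To make the branching idea concrete: hold most of a prompt constant, swap out one element, and render a clip per variant. The generate_clip function below is purely hypothetical, a stand-in for whatever API Sora or a competitor eventually exposes.

```python
# Sketch of the "forking paths" idea: keep the scene fixed, vary one element,
# and render one clip per branch. generate_clip() is a hypothetical stand-in
# for a future text-to-video API call.
from itertools import product

SCENE = ("walks down a Tokyo street filled with warm glowing neon, "
         "damp pavement reflecting the colorful lights, many pedestrians")

characters = ["a stylish woman in a black leather jacket",
              "an elderly man with a cane",
              "a courier on a bicycle"]
weather = ["on a clear night", "in light rain", "in falling snow"]

def generate_clip(prompt: str) -> str:
    """Hypothetical placeholder; a real call would hit a text-to-video service."""
    return f"<clip for: {prompt}>"

# Every combination of character and weather becomes its own branch of the story.
branches = {(who, w): generate_clip(f"{who} {SCENE}, {w}")
            for who, w in product(characters, weather)}

for (who, w), clip in branches.items():
    print(who, "|", w, "->", clip)
```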
…and that’s all for now. I’d like to follow up with a focus on Sora and education.