When a company releases a new AI video generator, it’s not long before someone uses it to make a video of actor Will Smith eating spaghetti.
It’s become something of a meme as well as a benchmark: seeing whether a new video generator can realistically render Smith slurping down a bowl of noodles. Smith himself parodied the trend in an Instagram post in February.
Google Veo 2 has done it.
“We are now eating spaghett at last.” — Jerrod Lew (@jerrod_lew), December 17, 2024 (pic.twitter.com/AZO81w8JC0)
Will Smith and pasta is but one of several bizarre “unofficial” benchmarks to take the AI community by storm in 2024. A 16-year-old developer built an app that gives AI control over Minecraft and tests its ability to design structures. Elsewhere, a British programmer created a platform where AI plays games like Pictionary and Connect 4 against each other.
It’s not as if there’s a shortage of more academic tests of AI performance. So why did the weirder ones blow up?
For one, many of the industry-standard AI benchmarks don’t tell the average person very much. Companies often cite their AI’s ability to answer questions on Math Olympiad exams, or figure out plausible solutions to Ph.D.-level problems. Yet most people — yours truly included — use chatbots for things like responding to emails and basic research.
Crowdsourced industry measures aren’t necessarily better or more informative.
Take, for example, Chatbot Arena, a public benchmark many AI enthusiasts and developers follow obsessively. Chatbot Arena lets anyone on the web rate how well AI performs on particular tasks, like creating a web app or generating an image. But raters tend not to be representative — most come from AI and tech industry circles — and cast their votes based on personal, hard-to-pin-down preferences.
Ethan Mollick, a professor of management at Wharton, recently pointed out another problem with many AI industry benchmarks in a post on X: they don’t compare a system’s performance to that of the average person.
“The fact that there are not 30 different benchmarks from different organizations in medicine, in law, in advice quality, and so on is a real shame, as people are using systems for these things, regardless,” Mollick wrote.
Weird AI benchmarks like Connect 4, Minecraft, and Will Smith eating spaghetti are most certainly not empirical — or even all that generalizable. Just because an AI nails the Will Smith test doesn’t mean it’ll generate, say, a burger well.
One expert I spoke to about AI benchmarks suggested that the AI community focus on the downstream impacts of AI instead of its ability in narrow domains. That’s sensible. But I have a feeling that weird benchmarks aren’t going away anytime soon. Not only are they entertaining — who doesn’t like watching AI build Minecraft castles? — but they’re easy to understand. And as my colleague Max Zeff wrote about recently, the industry continues to grapple with distilling a technology as complex as AI into digestible marketing.
The only question in my mind is which odd new benchmarks will go viral in 2025.