This Week in AI: Tech giants embrace synthetic data

Hiya, folks, welcome to TechCrunch’s regular AI newsletter. If you want this in your inbox every Wednesday, sign up here.

This week in AI, synthetic data rose to prominence.

OpenAI last Thursday introduced Canvas, a new way to interact with ChatGPT, its AI-powered chatbot platform. Canvas opens a window with a workspace for writing and coding projects. Users can generate text or code in Canvas, then, if necessary, highlight sections to edit using ChatGPT.

From a user perspective, Canvas is a big quality-of-life improvement. But what’s most interesting about the feature, to us, is the fine-tuned model powering it. OpenAI says it tailored its GPT-4o model using synthetic data to “enable new user interactions” in Canvas.

“We used novel synthetic data generation techniques, such as distilling outputs from OpenAI’s o1-preview, to fine-tune the GPT-4o to open canvas, make targeted edits, and leave high-quality comments inline,” ChatGPT head of product Nick Turley wrote in a post on X. “This approach allowed us to rapidly improve the model and enable new user interactions, all without relying on human-generated data.”

OpenAI isn’t the only Big Tech company increasingly relying on synthetic data to train its models.

In developing Movie Gen, a suite of AI-powered tools for creating and editing video clips, Meta partially relied on synthetic captions generated by an offshoot of its Llama 3 models. The company recruited a team of human annotators to fix errors in and add more detail to these captions, but the bulk of the groundwork was automated.

OpenAI CEO Sam Altman has argued that AI will someday produce synthetic data good enough to effectively train itself. That would be advantageous for firms like OpenAI, which spends a fortune on human annotators and data licenses.

Meta has fine-tuned the Llama 3 models themselves using synthetic data. And OpenAI is said to be sourcing synthetic training data from o1 for its next-generation model, code-named Orion.

But embracing a synthetic-data-first approach comes with risks. As a researcher recently pointed out to me, the models used to generate synthetic data unavoidably hallucinate (i.e., make things up) and contain biases and limitations. These flaws manifest in the models’ generated data.

Using synthetic data safely, then, requires thoroughly curating and filtering it — as is the standard practice with human-generated data. Failing to do so could lead to model collapse, where a model becomes less “creative” — and more biased — in its outputs, eventually seriously compromising its functionality.

This isn’t an easy task at scale. But with real-world training data becoming more costly (not to mention challenging to obtain), AI vendors may see synthetic data as the sole viable path forward. Let’s hope they exercise caution in adopting it.
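To make the curating-and-filtering idea concrete, here is a minimal Python sketch. It is not any vendor's actual pipeline: the function name `filter_synthetic`, the word-count thresholds, and the sample strings are all assumptions chosen for illustration, and real pipelines likely add far more (quality classifiers, bias checks, human review).

```python
def filter_synthetic(samples, min_words=3, max_words=200):
    """Keep samples that are reasonably sized and not duplicates.

    The thresholds are hypothetical defaults, not values from any
    published pipeline.
    """
    seen = set()
    kept = []
    for text in samples:
        # Normalize case and whitespace so near-identical copies match.
        normalized = " ".join(text.lower().split())
        n_words = len(normalized.split())
        if n_words < min_words or n_words > max_words:
            continue  # too short or too long to be useful
        if normalized in seen:
            continue  # exact duplicate after normalization
        seen.add(normalized)
        kept.append(text.strip())
    return kept


# Made-up samples standing in for model-generated text.
samples = [
    "The cat sat on the mat.",
    "the cat  sat on the mat.",  # duplicate after normalization
    "Ok",                        # too short
    "Synthetic captions can describe a video clip in detail.",
]
cleaned = filter_synthetic(samples)
print(cleaned)
```

Deduplication and length checks are only the cheapest filters; they do nothing about hallucinated facts or bias, which is where the hard work at scale comes in.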

News

Ads in AI Overviews: Google says it’ll soon begin to show ads in AI Overviews, the AI-generated summaries it supplies for certain Google Search queries.

Google Lens, now with video: Lens, Google’s visual search app, has been upgraded with the ability to answer questions about your surroundings in near real time. You can capture a video via Lens and ask questions about objects of interest in the video. (Ads probably coming for this too.)

From Sora to DeepMind: Tim Brooks, one of the leads on OpenAI’s video generator, Sora, has left for rival Google DeepMind. Brooks announced in a post on X that he’ll be working on video generation technologies and “world simulators.”

Fluxing it up: Black Forest Labs, the Andreessen Horowitz-backed startup behind the image generation component of xAI’s Grok assistant, has launched an API in beta — and released a new model.

Not so transparent: California’s recently passed AB-2013 bill requires companies developing generative AI systems to publish a high-level summary of the data that they used to train their systems. So far, few companies are willing to say whether they’ll comply. The law gives them until January 2026.

Research paper of the week

Apple researchers have been hard at work on computational photography for years, and an important aspect of that process is depth mapping. Originally this was done with stereoscopy or a dedicated depth sensor like a lidar unit, but those tend to be expensive and complex, and they take up valuable internal real estate. Doing it strictly in software is preferable in many ways. That’s what this paper, Depth Pro, is all about.

Aleksei Bochkovskii et al. share a method for zero-shot monocular depth estimation with high detail, meaning it uses a single camera, doesn’t need to be trained on specific things (it works on a camel despite never having seen one, for instance), and catches even difficult aspects like tufts of hair. It’s almost certainly in use on iPhones right now (though probably an improved, custom-built version), but you can give it a go if you want to do a little depth estimation of your own by using the code at this GitHub page.

Model of the week

Google has released a new model in its Gemini family, Gemini 1.5 Flash-8B, that it claims is among its most performant.

A “distilled” version of Gemini 1.5 Flash, which was already optimized for speed and efficiency, Gemini 1.5 Flash-8B costs 50% less to use, has lower latency, and comes with 2x higher rate limits in AI Studio, Google’s AI-focused developer environment.

“Flash-8B nearly matches the performance of the 1.5 Flash model launched in May across many benchmarks,” Google writes in a blog post. “Our models [continue] to be informed by developer feedback and our own testing of what is possible.”

Gemini 1.5 Flash-8B is well-suited for chat, transcription, and translation, Google says, or any other task that’s “simple” and “high-volume.” In addition to AI Studio, the model is available for free through Google’s Gemini API, rate-limited at 4,000 requests per minute.

Grab bag

Speaking of cheap AI, Anthropic has released a new feature, Message Batches API, that lets devs process large amounts of AI model queries asynchronously for less money.

Similar to Google’s batching requests for the Gemini API, devs using Anthropic’s Message Batches API can send batches of up to 10,000 queries each. Each batch is processed within a 24-hour period and costs 50% less than standard API calls.
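For a rough sense of what those limits mean in practice, here is a small Python sketch that splits a workload into batches of at most 10,000 queries and applies the 50% discount. It does not call Anthropic's API; the helper names and the per-query price are hypothetical and exist only for the arithmetic.

```python
MAX_BATCH_SIZE = 10_000  # per-batch query limit cited above
BATCH_DISCOUNT = 0.5     # batches cost 50% less than standard calls


def split_into_batches(queries, max_size=MAX_BATCH_SIZE):
    """Chunk a list of queries into lists of at most max_size items."""
    return [queries[i:i + max_size] for i in range(0, len(queries), max_size)]


def estimate_batch_cost(num_queries, standard_cost_per_query,
                        discount=BATCH_DISCOUNT):
    """Estimate the total cost of num_queries sent through batching."""
    return num_queries * standard_cost_per_query * (1 - discount)


# 25,000 placeholder queries; the strings stand in for real requests.
queries = [f"query-{i}" for i in range(25_000)]
batches = split_into_batches(queries)
print(len(batches), [len(b) for b in batches])

# Hypothetical price of $0.01 per standard query, for illustration only.
standard_total = len(queries) * 0.01
batched_total = estimate_batch_cost(len(queries), 0.01)
print(standard_total, batched_total)
```

With these assumed numbers, 25,000 queries need three batches, and the batched total comes to half the standard total, at the price of waiting up to 24 hours for results.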

Anthropic says that the Message Batches API is ideal for “large-scale” tasks like dataset analysis, classification of large datasets, and model evaluations. “For example,” the company writes in a post, “analyzing entire corporate document repositories — which might involve millions of files — becomes more economically viable by leveraging [this] batching discount.”

The Message Batches API is available in public beta with support for Anthropic’s Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Haiku models.
