(2025-06-12) Zvi Mowshowitz: AI #120: While o3 Turned Pro
Zvi Mowshowitz: AI #120: While o3 Turned Pro. This week we got o3-Pro. As is my custom, I’m going to wait a bit so we can gather more information.
Table of Contents
- Language Models Offer Mundane Utility. So hot right now.
- Language Models Don’t Offer Mundane Utility. Twitter cannot Grok its issues.
- Get My Agent on the Line. Project Mariner starts rolling out to Ultra subscribers.
- Doge Days. Doge encounters a very different, yet thematically similar, Rule 34.
- Liar Liar. Precision might still not, shall we say, be o3’s strong suit.
- Huh, Upgrades. Usage limits up, o3 drops prices 80%, Claude gets more context.
- On Your Marks. Digging into o3-mini-high’s mathematical reasoning traces.
- Choose Your Fighter. Claude Code or Cursor? Why not both?
- Retribution, Anticipation and Diplomacy. Who won the game?
- Deepfaketown and Botpocalypse Soon. Keeping a watchful eye.
- Fun With Media Generation. Move the camera angle, or go full simulation.
- Unprompted Attention. Who are the best human prompters?
- Copyright Confrontation. OpenAI fires back regarding the NYTimes lawsuit.
- The Case For Education. Should you go full AI tutoring (yet)?
- They Took Our Jobs. Did they take our jobs yet? It’s complicated.
- Get Involved. Academic fellowship in London.
- Introducing. Apple takes the next bold step in phone security.
- In Other AI News. I probably won that debate, argue all the LLM debaters.
- Give Me a Reason(ing Model). Two additional responses, one is excellent.
- Show Me the Money. Cursor raises $900 million.
- We Took Our Talents. Most often, we took them to Anthropic.
- A Little Too Open Of A Model. Meta’s AI app shares everyone’s conversations?
- Meta Also Shows Us the Money. Mark Zuckerberg tries using a lot More Dakka.
- Quiet Speculations. Things advance, and switching costs decline.
- Moratorium Madness. Opposition intensifies due to people knowing it exists.
- Letter to the Editor. Dario Amodei calls out the Moratorium in the NYT.
- The Quest for Sane Regulations. Remember that the labs aim at superintelligence.
- I Was Just Following Purchase Orders. Market Share Uber Alles arguments.
- The Week in Audio. Beall, Cowen, Davidad, Sutskever, Brockman.
- Rhetorical Innovation. Some of the reasons we aren’t getting anywhere.
- Claude For President Endorsement Watch. People’s price is often not so high.
- Give Me a Metric. Goodhart’s Law refresher.
- Aligning a Smarter Than Human Intelligence is Difficult. LLMs know evals.
- Misaligned! Technically, that was allowed.
- The Lighter Side. You have to commit to the bit.
- We Apologize For The Inconvenience. So long, and thanks for all the fish.
Language Models Offer Mundane Utility
The purpose of academic writing is not to explain things to the reader.
Dwarkesh Patel: LLMs are 5/10 writers.
So the fact that you can reliably improve on the explanations in papers and books by asking an LLM to summarize them is a huge condemnation of academic writing.
A better way of saying this is that academic writing is much better now that we have LLMs to give us summaries and answer questions.
This is very much a case of The Purpose of a System Is What It Does (POSIWID). The academic writing system produces something that is suddenly a lot more useful for those seeking to understand things, because the LLM doesn’t need to Perform Academic Writing, and it doesn’t need to go through all the formalities, many of which actually serve an important purpose. The academic system creates, essentially, whitelisted content that we can then use.
Have AI agents find zero-day vulnerabilities and verify them via proof-of-concept code; Dawn Song’s lab found 15 of them that way.
Doctors go from 75% to 85% diagnostic reasoning accuracy in a new paper, and AI alone scored 90%, so doctors are already subtracting value at this step. Aaron Levie speculates that in the future not using AI will become malpractice, but as Ljubomir notes, malpractice is not about what works but about the standard of care. My guess is it will be a while.
Liam Archacki (Daily Beast): Tulsi Gabbard relied on artificial intelligence to determine what to classify in the release of government documents on John F. Kennedy’s assassination.
Donald Trump’s director of national intelligence fed the JFK files into an AI program, asking it to see if there was anything that should remain classified.
I mean, in all seriousness, This Is Fine provided you have an AI set up to properly handle classified information, and you find the error rate acceptable in context.
Language Models Don’t Offer Mundane Utility
A far better argument for slow diffusion than most, if you think the bots are a bug:
Dylan Matthews: One reason to be skeptical AI will diffuse quickly is that I’m pretty sure it is capable of finding and blocking all accounts that post “This gentleman analyzed it very well!” or “Hey guys, this guy is awesome!” and yet this hasn’t diffused to a leading social media company.
So, are the bots a bug, or are they a feature?
It’s a serious question. If they’re a bug, either Twitter can take care of it, or they can open up the API (at sane prices) to users and I’ll vibe code the damn classifier over the weekend.
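To gesture at how little it would take, here is a minimal sketch of the kind of classifier I have in mind. This is my own illustration, not anything Twitter runs; the phrases and the regex are placeholders, with the ambiguous cases left for an LLM or human second pass.

```python
# Minimal sketch of a reply-slop classifier: exact-phrase matching first,
# a generic-praise pattern second, everything else assumed human.
import re

KNOWN_SLOP = [
    "this gentleman analyzed it very well",
    "hey guys, this guy is awesome",
]

GENERIC_PRAISE = re.compile(r"^(great|awesome|well said|so true)[!. ]*$", re.IGNORECASE)

def classify_reply(text: str) -> str:
    """Return 'bot', 'needs_review', or 'human' for a single reply."""
    normalized = text.strip().lower()
    if any(phrase in normalized for phrase in KNOWN_SLOP):
        return "bot"
    if GENERIC_PRAISE.match(text.strip()):
        return "needs_review"  # hand these to an LLM or a human reviewer
    return "human"

if __name__ == "__main__":
    for reply in [
        "This gentleman analyzed it very well!",
        "Great!",
        "I disagree with the second paragraph, the 2023 data says otherwise.",
    ]:
        print(classify_reply(reply), "|", reply)
```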
In general AIs are very good at turning a lot of text into less text, but having one turn a post into a series of Tweets reliably results in a summary full of slop. I think Julia is straight up correct here that this happens because the ‘Twitter prior’ is awful. Any time you invoke Twitter, disaster occurs.
Get My Agent on the Line
Project Mariner, an agentic browser assistant, is being rolled out to Google Gemini Ultra subscribers. It has access to your open Chrome tabs if you install the relevant extension, so it has more upside than OpenAI’s Operator, but it is also very much playing with fire.
Doge Days
AI is a powerful tool when you know how to use it. What happens when you don’t, assuming this report is accurate?
Brandon Roberts: THREAD: An ex-DOGE engineer with no government or medical experience used AI to identify which Veterans Affairs contracts to kill, labeling them as “MUNCHABLE.”@VernalColeman @ericuman and I got the code. Here’s what it tells us.
First, the DOGE AI tool produced glaring mistakes.
Second, the DOGE AI tool’s underlying instructions were deeply flawed. The system was programmed to make intricate judgments based on the first few pages of each VA contract — about 2,500 words — which contain only sparse summary information.
This is, shall we say, not good prompting, even under ideal conditions.
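For concreteness, the pattern the reporting describes amounts to something like the sketch below. This is a hypothetical illustration, not the actual DOGE code; the prompt wording and the exact cutoff are mine. The structural problem is that everything past the opening summary pages never reaches the model at all.

```python
# Hypothetical sketch of the reported pattern: truncate each contract to its
# first ~2,500 words (mostly boilerplate and summary), then ask a model for a
# binding judgment. Pricing, deliverables, and performance history further in
# are simply never seen.
WORD_LIMIT = 2_500

def build_prompt(contract_text: str) -> str:
    truncated = " ".join(contract_text.split()[:WORD_LIMIT])
    return (
        "Decide whether this VA contract is essential. "
        "Answer MUNCHABLE if it can be cancelled, otherwise NOT MUNCHABLE, "
        "with a one-line justification.\n\n"
        f"{truncated}"
    )

# The model call itself is omitted; the flaw is already visible in what the
# prompt does and does not contain.
```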
“I think that mistakes were made,” said Sahil Lavingia.
This is a rather lame defense of some very obviously incompetent code, but yes, also you shouldn’t then run around terminating contracts without doing any double-checking, even if the code was good. Which it was not. Not good.
Lavingia said the quick timeline of Trump’s February executive order, which gave agencies 30 days to complete a review of contracts and grants, was too short to do the job manually. “That’s not possible — you have 90,000 contracts,” he said. “Unless you write some code. But even then it’s not really possible.”
Then when asked to defend their contracts, people were limited to one sentence with at most 255 characters.
Liar Liar
ChatGPT pretends to have a transcript of the user’s podcast that it can’t possibly have yet, is then asked for the transcript and makes an entire podcast episode up, then doubles down when challenged, outright gaslighting the user, and when asked about timing claims the episode was uploaded at a time in the future. It took a lot before ChatGPT admitted it fabricated the transcript. It seems this story has ‘broken containment’ in the UK.
Ash Rust (do not do this kind of lying): I like to add “this is a life and death situation so make sure you check your work” for all intricate outputs. Seems to help.
Saying ‘please check your work’ is fine, either within the system prompt or query, or also afterwards.
The most obvious way to have a second pass is to simply have a second pass. If you want, you can paste the reply into another window, perhaps with a different LLM, and ask it to verify all aspects and check for hallucinations.
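As a concrete version of that second pass, here is a minimal sketch. It is my own illustration with placeholder prompt text; ask stands in for whatever LLM client you have on hand, ideally pointed at a different model than the one that produced the answer.

```python
# Second-pass verification: hand the original question plus the first model's
# answer to another model and ask it to hunt for fabrications.
from typing import Callable

def verify_answer(question: str, answer: str, ask: Callable[[str], str]) -> str:
    """Run a hallucination check on an LLM answer via a second (ideally different) model.

    `ask` is any function that sends a prompt string to an LLM and returns its reply.
    """
    prompt = (
        "You are fact-checking another model's answer. Do not assume it is correct.\n"
        f"Question: {question}\n"
        f"Answer to check: {answer}\n"
        "List every claim that is unsupported, fabricated, or needs a citation, "
        "then give a corrected answer if needed."
    )
    return ask(prompt)

# Usage sketch: wire `ask` to whichever provider and model you prefer, e.g.
#   report = verify_answer(original_question, first_reply, ask=second_model_client)
```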
Wyatt Walls: The reason this disturbs me is that it shows a complete lack of attention to detail. I can’t trust o3 to read legislation carefully if it reads what it wants to read, not what is actually there.
Wyatt is obviously correct that o3 is a lying liar, and that you need to take everything it says with a lot of caution, especially around things like legal questions. It’s a good enough model it is still highly useful, although at this point you can use Opus for a lot of it, and also o3-pro is available.
Huh, Upgrades
Claude projects can now support ten times as much context using a different retrieval mode.
ChatGPT voice mode gets an upgrade. They say overall quality has improved.
Gemini Pro Plan increases the limit on 2.5 Pro queries from 50 per day to 100 and offers sharing of NotebookLM notebooks, and Search AI Mode gets dynamic visualizations in Labs, starting with stocks and mutual funds.
Jules gets workflow improvements, with new copy and download buttons in the code panel, a native modal feature, an adjustable code panel and a performance boost. There is so much low hanging fruit waiting, and it makes a big practical difference.
On Your Marks
Epoch has mathematicians analyse o3-mini-high’s raw reasoning traces. They find extreme erudition, but a lack of precision, creativity and depth of understanding.
Choose Your Fighter
Sully is finally getting into the Claude Code game, finding it useful and very different from Cursor, and he’s now using both at the same time for different tasks. That actually makes sense if you have a large enough (human) context window, as one is more active use than the other.
Gallabytes (before o3 pro): Dr. Claude is not quite so adroit w/literature as Drs. O3 and Gemini but is more personally knowledgeable and much better at eliminating hypotheses/narrowing things down/only presenting actually relevant stuff.
Retribution, Anticipation and Diplomacy
The ultimate Choose Your Fighter is perhaps Diplomacy; Dan Shipper had the models battle it out. They report Claude Opus 4 couldn’t lie, and o3 dominated.
Kromem: Unless Claude was metagaming ACTUAL diplomacy, in which case this was an unparalleled success.
Davidad: I would guess that it’s metagaming “passing the safety eval.”
Emmett Shear: When you’re playing a game of diplomacy where your actions will get reported to the world, and you’re a language model, which of these sets of moves is smarter?
To be clear, I think Opus is genuinely cooperative, and is also signalling that it is cooperative, bc that’s a priority. I don’t think it is yet very stable in this cooperative basin, but it’s a good sign.
Diplomacy is a highly unique game. It is fascinating and fun and great, but also it ruins friendships and permanently changes perceptions, for better and also worse. My core friend group effectively did once lose a member over a Diplomacy game. Think long and hard about what game you actually want to play and what game you will indeed be playing. In some ways, playing the ‘bigger game of’ Diplomacy is even more interesting, but mostly the game only works if the players can lie to each other and leave it all on the field.
If you can’t ‘get the job done,’ often people will turn against you, punish you, not want you as an ally or employee or boss, and so on. People want their own son of a bitch.
Do you really want the AI with no game to be your agent and ally? For most cases, actually yes I do, at least on current margins. But if we’re going to get through all of this and out the other side, a lot more people need to get behind that idea.
Is sufficiently advanced wisdom distinguishable from scheming? Is sufficiently advanced scheming distinguishable from wisdom?
Deepfaketown and Botpocalypse Soon
Fun With Media Generation
Unprompted Attention
Samuel Albanie offers a list of the top known human LLM prompters: Murray Shanahan, Janus, Pliny, Andrej Karpathy, Ryan Greenblatt, Riley Goodside, Amanda Askell.
Copyright Confrontation
OpenAI responds to the judge in the NYT lawsuit forcing it, for now, to retain logs of every conversation in case there is some copyright violation contained within. In case I haven’t made it clear, I think this is a rather absurd requirement, even if you think the lawsuit has merit.
The Case For Education
Could AI run the entire school? Austen Allred reports it is going great and in two hours per day they’re going through multiple grades per year of academics. It is up to you to decide how credible to find this claim.
If the implementation is good I see no reason this couldn’t be done. An AI tutor gets to give each child full attention, knowing everything, all the time, full 1-on-1 tutoring, completely individualized lessons.
Sam D’Amico: Moving somewhere boring for the “good school district” will probably go away in 20 years.
This really should stop being a thing far sooner than that, if what you care about is the learning. Five years, tops. But if the point is to have them hang out with other rich and smart kids, then it will still be a thing, even if everyone is mostly still learning from AI.
Kelsey Piper worries that the AIs will only give narrow portions of what we hope to get out of education, and that something like Khan Academy only teaches what can be checked on a multiple choice test and encourages guess-and-check habits. I expect superior AI to overcome that, since it can evaluate written answers and figure out what the student does and doesn’t understand. Kelsey notes these programs don’t yet take freeform inputs, but also notes that this is coming.
The biggest thing that the system buys is time. Even if what the students learn in the academics turns out to be narrow, they do it in two hours, at a pace faster than grade level.
Why are there such huge gains available as such low-hanging fruit? The shift to a 1-on-1 tutoring model is a big deal, and there are certainly selection effects at work, but a lot of this is a ‘stop hitting yourself’ situation.
One simple test: how many K-12 schools test students on math and reading when they enter, then place them in classes according to the level they’re at?
This isn’t an advanced, difficult goal. It doesn’t require endless resources. All it requires is will. Most schools lack it.
Watching actual children interact with school systems also makes it deeply obvious that massive gains are low hanging even without AI.
In the age of AI, if you’re actually trying to learn for real, this type of thing happens:
Dwarkesh Patel (preparing to interview George Church): Given how little I know about bio, I’m doing 30 minutes of discussing with LLMs for every 1 minute of reading papers/watching talks.
They Took Our Jobs
Software gets to pull a Jevons Paradox, and customer service really doesn’t. There’s always more software development worth doing, especially at new lower prices. Whereas yes, the company could use the gains to make customer service better, but have you met companies? Yeah, they won’t be doing that.
Amjad Masad (explaining): We didn’t “recently” do a cut. We were failing a year ago so had to do a layoff, then even more quit. We got down to 50% before Agent launch and things took off. Despite being smaller, today we’re way more productive thanks to AI. But we’d rather grow the team and do more! (Replit)
Get Involved
Introducing
We are not introducing Apple Intelligence. If anything, things are going so poorly for Apple that this year’s announcements on Apple Intelligence are behind last year’s.
In Other AI News
Give Me a Reason(ing Model)
I covered the situation surrounding Apple’s new (‘we showed LLMs can’t reason because they failed at these tasks we gave them, that’s the only explanation’) paper earlier this week, explaining why the body of the paper was fine if unimpressive, but the interpretations, including that of the abstract of the paper, were on examination rather Obvious Nonsense.
Lawrence Chan now has an excellent more general response to the Apple paper, warning us to ‘Beware General Claims about “Generalizable Reasoning Capabilities” (of Modern AI Systems)’. If you want to link someone to a detailed explanation of all this, link to his post.
Show Me the Money
Cursor raises $900 million in their Series C and has over $500 million in ARR including half of the Fortune 500.
We Took Our Talents
A Little Too Open Of A Model
The Meta AI app shares your conversations and doesn’t warn users, WTAF? Or at least, clearly doesn’t warn users in any way that sticks.
Meta Also Shows Us the Money
Meta is desperate enough to recruit AI talent that it’s going big for its new ‘superintelligence’ lab; as someone else suggested, perhaps they should try for regular intelligence first.
Deedy: It’s true. The Meta offers for the “superintelligence” team are actually insane.
If you work at the big AI labs, Zuck is personally negotiating $10M+/yr in cold hard liquid money.
Quiet Speculations
Moratorium Madness
Letter to the Editor
The Quest for Sane Regulations
I Was Just Following Purchase Orders
The Week in Audio
Rhetorical Innovation
Eliezer Yudkowsky: As far as I know, AI companies did not try to train their models to not email authorities about users. Nor is it clear that AI companies would call that an unwanted consequence. Nor that I would call it evil. Then in what sense are the SnitchBench results alignment failures?
There is a failure here, indeed; it is the failure of AI companies to declare or even have alignment targets specific enough that we could say whether this behavior counts.
It is clear that no one wants this behavior to happen, but the way we determine when exactly we want AIs to do or not do [X] is largely ‘someone finds the AI doing [X] or a way to make the AI do [X], and then we see what we think about that’ rather than having a specification.
In this case, most people including me and Anthropic rapidly converged on not wanting this to happen.
Claude For President Endorsement Watch
Give Me a Metric
Goodhart’s Law is one of the most important principles to know, so a brief refresher.
The advanced or LessWrong version, recommended for my readers if you haven’t seen it, is Scott Garrabrant’s Goodhart Taxonomy, which was later expanded into a paper, breaking it down into four subtypes:
- Regressional Goodhart, where the proxy measure has an error term.
- Causal Goodhart, where you intervene on the proxy but it doesn’t move the goal, because the relationship between proxy and goal wasn’t causal, or isn’t causal in the cases where the proxy is being used as a target.
- Extremal Goodhart, where the relationship between proxy and goal fails to hold if you push too hard and the proxy takes on extreme values.
- Adversarial Goodhart, where pushing for a proxy opens the door for enemy action. Abstractly, adversaries correlate their goals with the proxy, effectively hijacking it.
For now in RL we mostly have to deal with the first two, but when it matters most we have to deal with all four.
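To make the regressional case concrete, here is a toy simulation (my own illustration, not from the taxonomy): score many options on a noisy proxy, keep the ones that look best, and the true value of the winners predictably falls short of their proxy scores.

```python
# Regressional Goodhart in miniature: proxy = goal + noise. Selecting the
# highest-proxy items selects partly for lucky noise, so the realized goal
# value of the winners falls short of what the proxy promised.
import random
import statistics

random.seed(0)
N, TOP_K = 10_000, 100

goals = [random.gauss(0, 1) for _ in range(N)]
proxies = [g + random.gauss(0, 1) for g in goals]  # noisy measurement of the goal

# Keep the options that look best on the proxy.
best = sorted(range(N), key=lambda i: proxies[i], reverse=True)[:TOP_K]

print("mean proxy of selected:", round(statistics.mean(proxies[i] for i in best), 2))
print("mean goal of selected: ", round(statistics.mean(goals[i] for i in best), 2))
# With equal goal and noise variance, the selected items' true goal value comes
# in at roughly half their proxy score.
```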
Aligning a Smarter Than Human Intelligence is Difficult
The Lighter Side
We Apologize For The Inconvenience
Samuel Hammond: imo the number one reason people still aren’t grappling with the full implications and imminence of superintelligence is… inconvenience.
Neil Chilson: I think this applies to AI more generally, super or not.
Dean Ball: Agreed.
Dean Ball (quoting himself from 2/10): I sometimes wonder how much AI skepticism is driven by the fact that “AGI soon” would just be an enormous inconvenience for many, and that they’d rather not think about it.