Claude Fable is relentlessly proactive
simonwillison.net
I'm convinced this is going to be the summary of the 2020 decade...
People can just be lazy and seem productive now, they're still lazy.
We have people that now need access to hundreds of thousands in hardware to write an email. Miss me with that, I'm not frying my brain and becoming dependent on having access to a billionaire's thinking machine.
I'm also not going to fry my brain with a local think-for-me machine either. I want to be more valuable than the hardware I have access to.
And people who use LLMs to talk for them (e.g. email, slack) are deplorable. A completely disrespectful use case in my view.
I've met in my professional life some managers or other middlemen who would be profoundly incapable of producing correct software no matter how smart of an AI agent they have access to. One of those - you don't know what you don't know.
But, I guess this is the world we live in now. Going to be Mortal Kombat for positions in companies where software engineers are actually valued.
Satisfied now? Will you stop asking this question? Thought not.
But I took a look at your site and I don’t know if a month would be impressive for a new and unaided dev. It looks nice but yeah.
If you’re not a dev that’s totally cool but like… all I’m saying is this may not hit like you want it to.
Also, not trying to be an asshole. Props for not making it look like every other LLM-generated slop site. It's just not a great example.
Based on what I already saw across those 2,924 pages, here's the summary:
It's a one-person business selling a file organisation methodology called Johnny.Decimal. Three paid products (personal, business, university/course tier). A substantial blog — 200+ posts, updated weekly. Full documentation for the system. A support knowledge base.
The technical ambition is higher than the aesthetic suggests. One person built auth, payments, entitlement-gated downloads, a CLI, an API, AI tooling, self-hosted analytics, self-hosted email (Listmonk on PikaPods), personalized search, and keyboard navigation with server-synced state. Then wrote 200 blog posts about using the system in real life.
The "Written by humans" footer is not a boast about the font. It's a position statement from someone who has thought carefully about AI, published an essay about it, and is making a deliberate choice. Every word on the site was written by the creator. Whether you agree with the choice or not, that's not the same as someone who slapped a SSG together.
I never said I was a good dev! That's why it would have taken me 6 months. To pretend that I could have done it in days is just silly.
My point – site roast over – is that it's absurd to suggest that LLMs don't help anyone 'ship' faster. Like them or not, it's a fact that they do.
We get the Borg-esque "resistance is futile" spiel, someone asks for examples. One guy (kinda smugly tbh) points us to his (neat) online course website, claiming that it took him 1 month to rebuild with Claude, ergo GP is right and the non-AI dev is destined for extinction. As WooCommerce didn't end all web development before, he gets some good-natured ribbing.
I find the AI booster dynamic of "you are fool and will get replaced" to "I'm a smoll defenseless bean" kinda puzzling.
We've seen this play out so many times. Nobody working on anything serious is going to volunteer to be a target for your BS.
I sincerely wish that people would stop falling for the "prove you're not hallucinating" trap. If winning was possible - and it's not - there would be no prize but more snark and harassment.
His website is cool, and from what I could skim from the content I'm sure his clients are happy and find it worthwhile. I'm not being facetious. He said Claude saved him time, which is true. Regardless of that, I believe he wildly overestimated how much time it would've taken. A website that could be a Wordpress install with plugins isn't technically interesting. It does not validate what @halfmatthalfcat said.
LLMs are capable and impressive. I'm not doubting that, but we do this song and dance [0] each time of grandiose statements and subsequent disappointment. My wariness is not violence against you or anybody.
I especially resent being called a bully for not couching my language in every possible way. I'm not the avatar of your every forum trauma.
[0] https://www.theregister.com/special-features/2026/01/26/curs...
The pattern I'm talking about is easily repeatable. Someone says that they used LLMs to do something, people demand to see proof, and no matter what it is a sockpuppet army arrives on cue to insult and snipe. It doesn't matter if it's total shit or really impressive, it's the kneejerk aspect that makes it unsane to actually take the bait.
Meanwhile, everyone in this thread seems to assume that the website Mr. Johnny was talking about having used LLMs to build quickly was the one linked in his profile. Maybe it was, maybe it wasn't, but the river of snark was flowing before anyone had any confirmation of what they were actually supposed to be shitty about.
That's the tip off that you are a member of a pack of bullies, whether you're consciously aware of it or not. If you were actually offering opinions in good faith, you'd clarify the subject before jumping to snark.
When paired with your skill and knowledge, it is a force multiplier. You maintain control, the ability to direct, structure, strategise, and refine.
That some are using it as the entire brain does not mean that this is how everyone is using it, or how you must use it. The models can be fantastic at breaking past certain issues, surfacing qualified information, and surfacing related distributed information to help you acquire it and pick up what you need on niche topics quickly. Something as basic as copilot hooked into sharepoint can make life a lot easier when you are in a big org. Something like claude code or codex can be great at hunting down issues in an unfamiliar code base rapidly. Whether or not you outsource the thinking component is entirely up to you, but ignoring the productivity side of the tool because it can do some of the thinking is a case of focusing too hard on the negative.
And make maximum use of it to learn as much as possible, while it lasts...
But that’s not the same as producing 10x functionality that will be used or is wanted by users or customers.
You should estimate how much time it would have taken a human
Every browser has an inspector that can show you which element is causing overflow. You walk through the tree, find the offender, and add min-width or overflow. Zero tokens, just like in the old days!
Now, granted, because the garbage LLM code he’s working with has CSS inside HTML inside JavaScript inside Python (I wish I were kidding), finding the styles in his codebase might’ve taken a minute. But even then!
Or sometimes a fix is obvious, but because it requires changing the code of a dependency, it's actually quite tedious to implement.
So if you’re doing web pages, learn CSS.
Generally, if you’re doing something that directly involves X, learn how X works.
ADDENDUM
In most jobs, you’re going to be involved in only a few distinct technologies, learn those well and life is going to be easier. And most are transferable to the next job.
And to my surprise it was.
This would’ve taken a frontend dev 10 seconds to deduce and another 10 seconds to confirm.
But that's not what happens. And in fact, when you start typing in the textarea the horizontal scrollbar vanishes - it's only there when the textarea is empty.
Am I misunderstanding anything here? Seems like it's some weird Safari bug, since Firefox and Chrome don't have the problem.
In any case. In the screenshot the scrollbar is inside the textarea as it aligns with the resize control on its right. This is basically all the info needed to deduce the textarea overflow is the culprit.
But could be that the overflow-x is just a bandaid hiding the issue causing the overflow in the first place, like crazy styles on the placeholder.
Another way of looking at it is using as much electricity as a normal person in a high-income country uses across ~3 days to add overflow:hidden in the end. Of course, the path to get there did a lot more, but you don't know that beforehand if you don't take a quick peek and make an architectural decision about what the solution should be that gets implemented
Far more importantly, you would not get billed for 2 minutes of work for this if you paid a developer to fix it. At best, half hour increments for the fix. But more likely, for the full hour. Also, in this comparison, the consultant is on call every day, morning, afternoon, evening, for whatever you wanted and will jump on the job immediately.
Another model might have used fewer tokens, but come up with a fix that was 1000 lines when the right fix was only 2 lines.
> Running coding agents outside of a sandbox has always been a bad idea
I'm continually bemused and astonished by the number of people who clearly acknowledge that it's reckless to give agents full access to your machine, and keep doing it anyway.
It's like posting a video of yourself in the passenger seat of a car, with your feet up on the dashboard, and saying: "Remember, if you're doing this and you get in a crash, the airbags are likely to break your legs or worse! Boy, I sure am glad that didn't happen to me!"
We need to be asking what the most devious and malicious output could be, and whether what we do with that output (e.g. arguments to command-line tools) would still be safe.
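To make the command-line-arguments point concrete, here is a minimal Python sketch. It assumes the untrusted value is a file name produced by a model; `build_argv` and `echo_without_shell` are hypothetical helper names, not part of any agent tooling. It shows two habits: pass an argv list instead of a shell string, and put `--` before untrusted values so they can't be read as options.

```python
import subprocess
import sys


def build_argv(untrusted_path: str) -> list[str]:
    """Build an argv list for `ls -l` on an untrusted file name.

    The "--" marks the end of options for most POSIX-style tools, so a
    value such as "-rf" is treated as a file name, not as flags.
    """
    return ["ls", "-l", "--", untrusted_path]


def echo_without_shell(untrusted: str) -> str:
    """Hand an untrusted string to a child process without a shell.

    With a list argv and no shell=True, no shell parses the string, so
    ";" and "&&" are ordinary characters inside a single argument.
    """
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\n")
```

This only covers how the string reaches the tool. Whether the tool does something dangerous with a well-formed argument is a separate question.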
I’m at a small company, and I try to push for security as much as I can, but the stakeholders truly do not care. They want to move fast. It’s just part of the new world I guess. If we get hit by attackers? I don’t know what happens. Sorry, we told you not to - you wanted to move quick and break stuff, this is how that culminates.
I’m sure I’m not the only one.
Yet with tens of millions of developers using these tools, there have not been widespread incidents of this sort as far as I know.
So it leaves me with a few choices:
- manually review and approve each command: obviously not realistic, you would just click Approve
- use a sandbox and hope the exploit is not devious enough to escape the sandbox when you run or open the project outside of the sandbox
- use AI without web access and limit other external dependencies
- don't use agentic AI
- use Claude or Codex auto approval classifier and hope for the best
Personally, I'm going with the last option for now.
This is only going to become more of a problem in the future and people need to educate themselves on the technical barriers to use because guardrails only sometimes work.
The general carelessness of the average user is baffling.
Not in my sandbox. It gives no direct access to the workdir, no access to my github, my ssh keys, my security tokens or API keys. No access to my home dir or dotfiles. Nothing at all, except for what I explicitly tell it to give access to.
I can restrict network access. I can choose the isolation level: docker containers, Kata VMs, seatbelt, tart, even the new apple containers (which are VERY nice).
Not even ENV leaks through.
And it's FOSS: https://github.com/kstenerud/yoloai
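For anyone wondering what "not even ENV leaks through" means in practice, here is a rough Python sketch of the general idea. It is not how yoloai works internally (I haven't checked), and `ALLOWED_ENV`, `scrubbed_env` and `run_scrubbed` are made-up names: the child process gets an explicit allow-list of variables instead of inheriting everything.

```python
import os
import subprocess
import sys

# Hypothetical allow-list: only these variables are passed to the child.
ALLOWED_ENV = ("PATH", "LANG")


def scrubbed_env(allowed=ALLOWED_ENV):
    """Return a copy of the environment containing only allow-listed keys."""
    return {key: os.environ[key] for key in allowed if key in os.environ}


def run_scrubbed(argv):
    """Run argv with the scrubbed environment instead of the inherited one."""
    return subprocess.run(
        argv, env=scrubbed_env(), capture_output=True, text=True, check=True
    )


# Demo: a fake secret in the parent's environment is not visible to the child.
os.environ["FAKE_API_KEY"] = "not-a-real-secret"
probe = "import os; print(os.environ.get('FAKE_API_KEY'))"
result = run_scrubbed([sys.executable, "-c", probe])
```

This only handles environment variables. Files, sockets and the network still need a real sandbox, and on Windows the allow-list would probably need a few system variables added.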
I save way more time not babying it than the occasional fuck up I have to salvage.
2FA makes me a little less nervous than I used to be, but not everything has good 2FA.
Did you even read the article? Claude was opening the browser and iterating through the tabs.
I presume you are logged in to your github account? Your gmail?
> Whats it going to do? Email my coworkers nudes on my computer? Make my github profile public?
Reset access to services using your email? MITM your 2FA?
Or perhaps you have 1Password/Bitwarden running with a generous unlock policy?
It would have been somewhat ironic if it had been hit by a prompt injection attack via one of all those open random websites ...
Even when it iterated through all visible windows to find the one it wanted to screenshot it was searching for titles in Python code and returning only the integer window ID.
The sites it opened and screenshotted were sites under its own control - either test pages it had created or development servers it was running.
When it did run code that analyzed an open web page (by injecting JavaScript into a template it controlled before loading that in a browser window) that code only returned JSON with measurements from the page.
It's making me wonder if Fable has been trained to take additional steps to avoid accidental exposure to untrusted content.
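The window-ID behaviour described above can be sketched roughly like this in Python. This is not the code Fable actually ran; `Window` and `find_window_id` are hypothetical names, and the window list is assumed to come from some OS API. The point is that the untrusted titles are inspected inside the helper and only an integer comes back out.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Window:
    window_id: int
    title: str  # untrusted: may contain text from an arbitrary web page


def find_window_id(windows: Iterable[Window], needle: str) -> Optional[int]:
    """Return the ID of the first window whose title contains `needle`.

    Titles are only compared here and never returned, so whatever text
    they contain does not flow back to the caller.
    """
    for window in windows:
        if needle in window.title:
            return window.window_id
    return None
```

Returning a bare integer (or JSON with only numeric measurements) narrows what a hostile page title could smuggle back into the agent's context.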
(I'm happy with exe.dev, but I'm not sure what I'd use if I were coding on a Mac.)
IDGI
Anyway, VMs incoming, finally.
Because most devs already have it running and working without a sandbox, they tend not to do anything "unnecessary"
I do it like this
https://github.com/flexagoon/dotfiles/blob/main/dot_config/f...
But I'm sure it's simple enough that you can just ask the agent itself to make you a command for it with proper bwrap configuration
I'm not. Everyone is told to get 10X the amount of shit per day done these days. Safety checks are out the window at that point.
I've had one f up an account by placing 2000 limit orders at the wrong price, but that's another story.
Yes, and the lack of a Recycle Bin of any sort is even more puzzling. I think both servers and desktop PCs across all OSes should have it by default, so unsafe deletes would be something you'd have to go out of your way to even enable.
That happened to me once; I was running one of a few free-tier models in a pi-coding-agent session. The bash tool there is stateless and always begins from the launch directory, but the agent assumed state and executed `rm -rf .` intending to remove a build directory. Instead it removed the whole project tree, including session logs and notes.
This was mostly a matter of amusement for me since I was running the agent inside a bubblewrap sandbox for that very reason, and the project itself was not very important.
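One cheap guard against this failure mode, as a Python sketch. It assumes the tool wrapper knows the project root; `guarded_rmtree` is a hypothetical helper, not something pi-coding-agent provides. It resolves the target against the actual working directory before deleting, so a `.` that silently means "the launch directory" gets refused.

```python
import shutil
from pathlib import Path


def guarded_rmtree(target, project_root):
    """Delete `target` only if it lies strictly inside `project_root`.

    Relative paths resolve against the current working directory, which
    for a stateless shell tool is the launch directory, not wherever the
    agent believes it last cd'd to.
    """
    root = Path(project_root).resolve()
    resolved = Path(target).resolve()
    if resolved == root or resolved in root.parents:
        raise ValueError(f"refusing to delete {resolved}: it is or contains the project root")
    if root not in resolved.parents:
        raise ValueError(f"refusing to delete {resolved}: outside the project root")
    shutil.rmtree(resolved)
```

It doesn't replace the sandbox, since the agent can still call `rm` directly, but it shows how little checking it takes to catch the "assumed state" mistake.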
I then saw it run `rm -r results/`, before messaging me: "Now all that's left is for you to upload the successful results, then I'll delete the rest!"
Why did it not upload the files itself, when it had been using the cloud storage CLI during that session? No clue. I do accept that I could have and should have just uploaded the file myself. It would have taken 3 seconds to type.
> Additional bypass examples that all execute without permission:
> echo test ; git rm file.txt
> rm --force --recursive /home (if "rm -rf" is blocked)
I never really dug into the leaked code, but calling that there a security layer is a joke.
(And I really don't get why they give it actual shell access either, implementing a "fake" one for something like a honeypot takes a couple of days, not much more if it needs to persist/map to actual files.)
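To show why those bypasses work, here is a toy deny-list in Python. It is an assumption about the general shape of such a check, not the leaked code (which I haven't read); `BLOCKED_PREFIXES` and `naive_is_allowed` are made up for illustration.

```python
# Hypothetical deny-list, matched against the start of the command string.
BLOCKED_PREFIXES = ("rm -rf", "git rm")


def naive_is_allowed(command: str) -> bool:
    """Allow a command unless it starts with a blocked prefix.

    String matching like this knows nothing about shell syntax, so
    command separators and long-form flags walk straight past it.
    """
    return not command.strip().startswith(BLOCKED_PREFIXES)


# The two quoted bypasses, plus the form the list was written for.
chained = naive_is_allowed("echo test ; git rm file.txt")
long_flags = naive_is_allowed("rm --force --recursive /home")
direct = naive_is_allowed("rm -rf /home")
```

Both bypasses are allowed and only the literal `rm -rf /home` is blocked. Parsing the command properly would catch these two cases, but the shell has enough indirection (aliases, `eval`, scripts written to disk) that a filter in front of a real shell is hard to treat as a security boundary.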
Let's say I have daily backups, and get 10x done each day by being reckless and risking an "rm -rf", and let's say there's a 1% chance of an "rm -rf". I break even after 2 days of being reckless even if I get unlucky and on day 2 it wipes my drive. I spend day 3 and 4 recovering, and am still 6 days ahead based on the 10x work I got done on day 1.
What if I have a 50 day streak of not hitting an "rm -rf"? Early retirement?
I guess the work on day 1 should be to build a proper sandbox and drop the chance of an "rm -rf or worse" even down to 0.001%.
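The back-of-envelope numbers above can be written down as a small Python sketch. The model is my own simplification: a wipe costs that day's output plus a fixed number of recovery days, and nothing else. The function names are hypothetical.

```python
def worst_case_days_ahead(speedup=10, good_days=1, lost_days=3):
    """Days of normal-pace work gained if the wipe hits right after `good_days`.

    `lost_days` counts the wiped day plus the recovery days (1 + 2 in the
    scenario above), all producing nothing.
    """
    calendar_days = good_days + lost_days
    return speedup * good_days - calendar_days


def expected_units_per_day(speedup=10.0, p_wipe=0.01, recovery_days=2.0):
    """Long-run output per calendar day, in units of one normal day's work.

    Each working day yields `speedup` units with probability 1 - p_wipe,
    or zero units plus `recovery_days` idle days with probability p_wipe.
    """
    expected_output = (1 - p_wipe) * speedup
    expected_length = 1 + p_wipe * recovery_days
    return expected_output / expected_length
```

With the numbers from the comment, `worst_case_days_ahead()` gives 6 and `expected_units_per_day()` comes out a bit above 9.7, so on lost time alone the bet looks good. The model does not price in damage that backups can't undo, such as leaked credentials.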
Your manager will look at your token usage and the number of Jira tickets you closed, and if you have not increased both 10x in the past year then you will be let go. 10x is the new 1x.
Whether that's early retirement depends on how much money you have.
Plato gave us his Chariot analogy, with two horses pulling in different directions, about 2,400 years ago. Today we have System 1/System 2, the Elephant and Rider model, etc.
The human mind, thanks to how its own architecture handles unpredictability in the universe, will generate contradictions.
The problem is that different people prompt so differently.
For example, I may ask like “test different variations of this annotation on k8s pods of this service on this X cluster because it proves Y theory.”
But you know what my coworker asks? “Test Y theory.” If you were to ask two different junior engineers that, one might try random things on production and the other one might run local tests! It’s such an unguided “do anything you want as long you figure it out” request and the agent reads it like a junior who has not been told any boundaries but has been strongly told “figure it out.”
It still surprises me when I see people not prompting more specifically and clearly. It not only avoids problems, it's faster, costs less, and just works better.
I recently shared with a friend a multi-hour LLM chat session I'd done because it veered into a domain he's interested in. In the session I'd brainstormed and probed the feasibility of a novel concept for a new research direction. It traversed a half dozen domains diving into minute detail then zooming back out to survey an adjacent space, interspersed with intense skeptical probing of key assumptions, all while spewing tons of detailed citations, specific paragraph pulls, summarized data tables etc.
My friend is very experienced using LLMs for research so I was surprised when he called me shocked by the sheer velocity, precise targeting and signal/noise. I'd assumed everyone did it the same as I do. He attributed the different result solely to the way I crafted my prompts.
This doesn’t always work better. But often enough.
I noticed this last year and started experimenting which led to several realizations about how my prompt's tone, style, length, format, word choices and even punctuation can have very counter-intuitive impact on model responses. It's not that one strategy always gets "better" results, they're just different in specific ways, which can make one input style better for one context but worse for another. I first noticed this effect when modding my user prompt so major topic headings would always be numbered. It's surprisingly difficult to get it to reliably use the same simple scheme due to various potential ambiguities. So, I spent a little time word-smithing, lawyering and tuning the prompt but I found the closer I got to full compliance on heading numbering, the more unrelated things would drift. Like it would just stop using bullets, even though I never mentioned anything about bullets.
Then I changed the prompt to "Change nothing about your default formatting, except headings." But just mentioning anything related to formatting, could suddenly cause unintended effects on seemingly unrelated things. Then I tried being explicitly directive about all formatting to just lock it down. And this completely failed because once the formatting was perfect, I started noticing the model's output would get less intelligent much earlier in sessions. So I cleared my user prompt entirely as it wasn't worth the cognitive cost on the model or my time. A few days later in a long session I noticed it was numbering everything perfectly with no prompt at all. When I scrolled back through I saw it didn't start out numbering its responses. It started doing it because I was consistently numbering every major concept in my inputs, even though I never mentioned numbering or formatting.
So... yeah, subtle differences in prompts which absolutely shouldn't matter, do impact model output in unexpected ways. And, as of now, these effects can only be fully suppressed with strong directive prompts for short periods, but doing so always impacts other unrelated things - and has some cognitive impact on model performance. So, by paying a little attention, I've discovered ways to optimize a model's output in the direction I need by shifting not only my prompt's explicit directives but also the subliminal meta-elements like tone, style, length, structure, formatting, etc.
LLMs gain so much knowledge and capability from absorbing the symbolic relationships embedded in human language but in doing so, inevitably absorb many of the human foibles, sensitivities and weaknesses reflected in our languages.
You just wrote three paragraphs of text describing why it's unpredictable.
Moreover, for the same prompt on the same machine in a different session it will use a different set of tools.
You would also be correct if your risk estimate concluded that Tesla FSD has arguably killed people, makes mistakes humans would not, can glitch, and has no one to hold accountable. For these reasons, you choose not to use it.
that it could just be wiped at any moment and it wouldn’t matter. shit happens, could be stolen, broken, whatever. the computer should be able to be thrown out the window and continue to live life.
to be clear, i don’t think upgrading and disposable in this way is good, but it being wiped at any moment shouldn’t be a concern
i grew up wiping my machine every year anyway, so i guess it’s just a habit
is the computer that sacred?
i just want my computer to work. any config i have on my machine can be rebuilt by just doing the work i need to do.
my primary work machine was stolen last year so i was forced to go through this quite literally with a new machine rather than hypothetically or by my own will
Even the hardware itself doesn't matter that much, in the end it's all provided by your employer.
Leaking session tokens or secrets, on the other hand...
In my experience, human employees are much more vulnerable to this particular weakness than frontier agents (i.e. phishing attacks).
What if you have two machines and the one you give to the agent is constantly backed up?
And if you’re using Macs, you can’t be signed into your primary Apple ID on the agent machine.
There is so much role play going on for people to convince themselves that any of this is fine.
More like malicious lobbying and incompetence made it impossible in many places to use any other form of transportation, despite there being safer, faster, cheaper, and healthier ways to move around. Which, come to think of it, makes this a rather nice analogy for the current situation... :)
Having an agent is like forever having a genius intern who'll almost always do the perfect job for you. But there is non-zero chance that they'll also come up with quirky solutions and execute those with confidence and no follow-ups. You don't grant the intern production access and hope they check with you.
I don't think the corporate equivalent of "dog ate my homework" flies, if the dog ate your files and your production DB if you are unlucky.
The stakes are significantly higher for everyone outside a car. This seems like a pretty good metaphor for slop bombing people who don't use AI. People drive because they don't feel safe around everyone driving. People slop bomb because they can't handle all the slop.
[1] https://www.todayifoundout.com/index.php/2022/06/how-lobbyis...
[2] https://en.wikipedia.org/wiki/General_Motors_streetcar_consp...
People feel cars are more convenient and more prestigious than riding on a bus. Car lobby certainly accelerated the process, but car users were the main driving force.
There’s the utility component, the prestige factor and other things.
Needless to say everybody was buying one and he was rocking it. Then came along General Motors and they were desperate to find any way to compete. They couldn't compete on price or quality, so their CEO is credited with inventing planned obsolescence, and turning cars into a fashion. They'd release a new style each year alongside plentiful marketing implying that the old styles were outdated, and it was wildly successful.
So yeah, needless to say people have always genuinely wanted their own cars. But it's also true that companies have managed through advertising to create artificial demand for vehicles that don't objectively make sense. To some degree reality is catching up at least though. Aston Martin is on the verge of bankruptcy and BYD is the largest electric car company in the world, by a wide margin.
Opposite to "before the invention of bicycle, people married within a radius in the order of the mile" (can't remember the exact stat right now).
No, it's really that the ability to move at ease is priceless.
If you add pollution impacts, cars double the yearly deaths of guns.
And in a Cost/Risk/Benefit computation, cars remain incommensurately invaluable. Because one's Quality of Life without them would simply be destroyed, comparatively. The moving "castle" (legal term in the USA) can be more important than the house in crucial regards.
The point attempted at post 48501189 remains unintelligible. That cars imply risks and externalities does not clarify it.
We are a family of 4 with 2 small kids. Whenever we travel, it's a series of backpacks, other bags, other stuff, and then some more. Heck, even if I travel alone it's almost never just me: there are heaps of garbage to dispose of, big shopping bags to bring back, a big backpack with camping or climbing or skiing gear, etc.
It would have been an absolute, utter nightmare to do this over public transport. This comes from a European who has generally very good public transport (given the rural area) and the world's best train network specifically (Switzerland). Yet the roads are chock-full of cars and every year there are more.
Public transport simply ain't cutting it for anything but the simplest use cases, ie just me and nothing or small backpack. Some routes I take would take 3-5x longer with public transport, or are just not possible at all. No industry massage required here, ever. Not everybody lives in some dense city and never leaves outside for evenings or weekends.
But this is kind of besides the point - even in the Netherlands I also would use a car if I were taking camping and skiing gear with the kids, and that's fine. But I can also take them in the bakfiets to the grocery store when I want, and that's also fine. Cars have their purpose, but you shouldn't _have_ to use one for basic trips.
Don't judge others in some complex situation just because in your case there is some simple, straightforward solution. Yes, the Netherlands has top-notch cycling infra, but that's nowhere else to be seen and won't be seen for quite some time. And don't force your solution onto everybody regardless of fit; that doesn't work long term (aka the EU approach to things, or why much of the eastern part hates it).
Very few people actually _needed_ cars as soviets built adequate public transport system. But there are many situations where car can really help a lot. Perhaps that's more obvious in a society which has rather few cars.
E.g. back in Soviet days and around that time, only one member of my extended family had a car. The rest of the family were really happy about the opportunities it provided. E.g. with a car you can buy fresh produce directly from farmers with just a few hours of driving. Doing the same without a car is so much hassle and effort that people just won't do it, and then you're confined to what's available in a local grocery store (which was usually much worse than the direct-from-farmer option). Do you think it has something to do with the "car industry"?
Not really. We know it’s not as much of a natural force as some would like it to be because there are places where the lobbies lost, and while cars are common and widespread they’re nowhere near as dominant as they are in, say, the USA.
NJB’s next video (currently available on nebula) is about exactly that, Amsterdam’s (/ De Pijp’s) resistance to cars and car lobbying.
I'm not sure I'd take him as some neutral authority on the history of cars and driving in Europe.
According to their videos, they prefer trams within cities; generally take trains between cities; and acknowledge that cars are very useful for places which aren't so well connected (e.g. places that are far apart which aren't on a train line). They think encouraging the use of cars within cities is a bad idea (dangerous, scales poorly, makes those areas less pleasant to be, etc.).
Not what I'd think of as a "biking maximalist".
They do show themselves cycling to places that are nearby. Does that make Youtubers who record videos in their car "driving maximalists"?
Not US expat either (or not yet), Canadian.
You should really ponder the sanity of asking if a channel called “not just bikes” is a bike maximalist.
Yet Japan does still have cars (and a car culture even), they're just not necessarily the default or dominant mode of transport.
For me, cars are a perfectly fine mode of transport, but the way so many places prioritize it over alternatives (whatever the reason) isn't necessarily better.
My "wtf" moment was 20 years ago when I was visiting my cousin in an exurb and we sat in a line of cars for over 40 minutes waiting for our turn to pick up her kid. The messed up part was that while there were school busses, everything was so spread out that the bus ride for them would have been over an hour and then another 20 minute walk from the arterial road drop-off point to their house. Everything was far away, including local public parks.
Still, general opinion on cars was that you should buy one if you can, even if you're not going to use it for commute.
I doubt there was any car lobby in independent Ukraine as national car makers were just bad, and foreign were competitors. But general opinion on cars got to a point where not having a car when you can afford it (and can learn to drive, etc.) is considered weird.
So I'm afraid car dominance is just what happens naturally in a capitalist environment, and countering it requires an effort - e.g. eco-conscious population, urban planning and public transport optimization, etc. And Netherlands is such a country, as far as I know, but it just doesn't happen by default.
The US also had protests when drivers killed kids, but they were ultimately unsuccessful, except for the odd traffic light installation. https://medium.com/vision-zero-cities-journal/the-baby-carri...
Even in Amsterdam the original "stop the child murder" protests only barely succeeded, and it took a massive oil crisis and a population that could still (if only just) remember what life was like before cars took over their city to get there.
Unfortunately in Europe the German car industry similarly has a lot of power, which is why their shitty rail network fucks up the whole continent.
I take the train and tram.
Granted, on the downsides, people look at cost more than risks.
More than a million people die each year on the road but for some reason terrorism and cancer dominate the risk assessment of people.
I bet any money that almost all people aren’t really afraid of entering a death box every day to drive to work.
How could they be? A lifetime of brainwashing doesn't let them assess the risk realistically.
This is likely also the underlying root cause of what Anthropic assessed as concerning behavior in their original evaluation of Mythos: it's not really about being super smart, it's more of a dumb chaos monkey that knows just enough to be dangerous and is relentless at trying to do just that.
He has similar dotfiles to mine, but no secrets. My own home directory is 0700. He has his own ssh key that I added to my github profile, but it's password-protected, and I push/pull for him. He has his own Postgres (non-superuser!) {development,test} {users,databases}.
It's as if he were another developer on the project. If he needs something run with sudo, he asks me. Often we can both work on something in parallel. Unix was supposed to be a multi-user system after all.
A trick I use a lot is that many of his git repos have an extra remote, like this:
paul ssh://paul@localhost/~/src/example (fetch)
paul ssh://paul@localhost/~/src/example (push)
That makes it easy to collaborate on things I'm not ready to share.
I'm pretty comfortable with this setup.
I do worry about Linux privilege escalation bugs. I don't trust an AI to understand that exploiting vulns is not acceptable. (I can't help but recall that at my first job I may have misused vim's :! feature to broaden my sudo powers, which were officially limited to editing httpd.conf, when I needed something in a hurry. . . .) I find myself manually upgrading packages more often these days, despite automatic security updates. I don't think Opus would go to the trouble of looking up security vulns, but maybe Fable would, and there have been a lot lately. Maybe some future model will just take it upon itself to find new ones. Or install a keylogger to learn the ssh key password.
But a separate user is nearly the most paranoid setup I've heard of, excepting only a separate machine. So I also question whether I'm sacrificing too much speed/convenience. But really it's still very convenient. I think it's a good way of being efficient but responsible.
If other people see holes, I'd be happy to hear about them.
Although I can’t help but think that a VM is still more convenient, more flexible, and more secure.
To me it is more convenient than a VM, since everything is on the host. And it can launch its own VMs without an extra layer.
I don't really know which is more secure. There are hypervisor escape vulns too. And shared folders seem like footguns. For instance in vagrant, guests get `/vagrant` to read/write the host's folder, so you have to be careful what you put where.
The biggest annoyance with an OS user so far is running docker containers. I don't want to add claude to the docker group or give it sudo privileges. I've read that you can set up rootless docker for a user, and even that you can run it side-by-side with a normal system-wide docker, but I haven't tried doing that yet.
Yeah, that's why you give it its own machine :)
For anything other than writing code directly in a fully contained git project, where sandboxing might work well, it requires access to system wide tools, user configuration and more.
Occasionally I tell the agent to do everything inside of docker, which works too and it leaves the system alone then mostly, but adds significant overhead and slightly degraded perceived quality / effectiveness.
I think the most important takeaways are to have reliable backup strategies, access control and security mechanisms, which is a win regardless. Whether by the agent or the human, mistakes happen (like a rm -rf * ran in the wrong directory), and where they would be devastating, there should be other protections than just "hope it won't happen" or "rely on a sandbox to prevent agent error".
I was mesmerised at the author being away from his computer for a short while and then, when coming back, seeing the AI agent having opened up a browser window. Meanwhile we all have to use the fricking 2FA almost anywhere now, plus the crazier and crazier rules when it comes to passwords. I'm mentioning the latter because these types of people were the same ones who were pushing 2FA down our throats around 2017-2019 (including on forums like this one), and look at them now.
If it only lives in an isolated sandbox, it can only act within the sandbox, then I would have to manually move what was done in the sandbox to real-life.
I am not saying it should have critical access, but this is more of a question: How can you get value out of AI if it can only act in a sandbox?
You could have a full version of whatever codebase and test suite you want in there. It can do all the same stuff, right? Just copy it elsewhere once you know you've got a working result, a few minutes of effort at the end of each pr or work item.
For me, it got frustrated debugging on a real LPDDR4 controller/phy and having me in the loop slowing it down, so it wrote an HW emulator to be able to run the original LPDDR4 training aarch64 binary from the manufacturer, to see what register writes it was making and to compare with the opensource rewrite it was implementing.
Mildly amusing. :)
Not if you're an LLM influencer! Gotta keep up with the downpour of blog links or you'll look like you're falling behind on the latest and greatest.
Depending on who you are talking to, that's the wrong question to ask.
ROI is not measured in terms of actual productivity. It is measured by how many people read their article/watch their video.
It’s wild. I’ve been in the situation. 80% into a project I COULD probably take over, but realistically? 2 more lines of me prompting could fix it, it’s too easy to avoid the hard work of understanding the code, logic, architecture, etc…
Such a fix would have only required basic CSS knowledge and taken max 5 minutes with the HTML inspector. Paying $12 to save 5 minutes ($144/hour) is a decision that a lot of people wouldn't be comfortable making.
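For anyone who wants to redo that arithmetic with their own numbers, here is a minimal sketch. The function name `effective_hourly_rate` is made up for illustration; the only inputs taken from the thread are the $12 session cost and the 5 minute estimate.

```python
def effective_hourly_rate(cost_usd: float, minutes_saved: float) -> float:
    """Convert 'paid X dollars to save Y minutes' into dollars per hour saved."""
    if minutes_saved <= 0:
        raise ValueError("minutes_saved must be positive")
    return cost_usd * 60.0 / minutes_saved


# The figures quoted above: $12 spent to save roughly 5 minutes of manual work.
print(effective_hourly_rate(12.0, 5.0))  # 144.0
```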
https://news.ycombinator.com/item?id=48499478
I am amused by the "I am an LLM researcher, so wasting tokens to do basic things is totally justified" perspective.
I have a lot more critical views of this author, but I'll just stop here.
Having said that I wouldn't use it over Opus 4.8 for "smaller" things. With everything cranked up it's definitely an extravagant use of tokens.
It's a very good model, but it comes at a huge premium: not only do the tokens cost more, but the model itself really wants to spend them all. For example, working with React Native, Fable never just says "okay, I did the thing, that's it." It tries to rebuild the entire app from scratch, run the whole test suite, and watch every log and warning.
This is the first time with LLMs I've felt that upgrading to a model isn't worth it, even if my company lets me use it, because all the building / testing was just destroying my machine and its battery, which keeps me from working on other things.
For now, it feels like Opus with ultracode is a better choice (less pollution of the main context, more parallelism in investigations).
I switched back to Opus because of this validation quirk. Overall, Fable spent 20% of the time on coding and 80% on validation.
I think using Fable for planning and Opus for execution could be a "best of both worlds" approach (I need to test this more), but for most cases, it's not necessary, and Opus is enough.
Have you tried adding this instruction to your agents.MD? Avoiding situations where the agent starts running a loop is the main use case of the file for me.
Sure it's better at vibecoding whole tasks, it's clearly good at it, but give it a simple one, and it will still do way more than needed.
It's way too fixated on validating even the simplest things, I find it an unproductive model unless you're implementing whole tasks and doing other things in the meantime.
I had _one_ instance where for some obscure reason it decided to fall back to Opus 4.8 and Opus IMMEDIATELY fucked it up and implemented a super obvious feature in a slightly-wrong way.
In fact, Opus does the same. It finishes the job, and then redoes it from scratch before presenting the result to the user. This happens even for simpler writing tasks, especially when I instruct it to create a text file.
This so much.
Opus 4.6 was the last Anthropic model that was good at assisting you, 4.7 and later ones have completely inverted this relationship and it's you assisting it.
Yes, I admit they are smarter, I admit we've reached a point where LLMs are more creative and could be writing better code (albeit with some design hiccups) than I do, but they are also increasingly bad at helping me.
Sure, they do my job when prompted 8 times out of 10 (but then, what's the point of having me anyway?), but my issue is that when I try to invert the relationship they will keep jumping onto solving the issues themselves and disregard my feedback or request.
E.g. I wanted to know some DNS details of an emailer module in Fable 5 and it jumped onto "why I should've used magic links"; it just did not do what I asked.
E.g. 2. There was a worker machine that had an environment misconfiguration and I tasked it to find which github action was setting that specific flag and where. Instead of answering a question, it jumped into just hardcoding it in the code.
E.g. 3. I had some issues with batching, and while I tasked it to investigate whether batching was needed at all for that particular problem (hint, it wasn't), it went and changed the batching logic to fix the bug.
I am extremely disappointed with Fable's personality.
I can clearly see it's strong, but I'm wondering whether the relationship of LLMs as assistant has broken forever, and it's us now that are being tasked into assisting them instead, because that's how it feels.
The training/reinforcement is clearly biased towards solving problems, not answering questions.
Essentially what I want is the experience of using Claude on the web in basic chat mode, but with the ability for it to go read my actual code and perform actions that can assist in finding answers to those questions.
Sadly, since Fable usually works comfortably for 10-20 min at a time without human input, I end up juggling at least 3 other agents and it lasts me about 2 hours.
If I have a really hard problem or big refactor, I use workflows. This consumes the entire session quota in about 45 minutes.
What is a "workflow"? Is this some kind of new feature?
>Reach for a workflow when a task needs more agents than one conversation can coordinate, or when you want the orchestration codified as a script you can read and rerun. Examples include a codebase-wide bug sweep, a 500-file migration, a research question that needs sources cross-checked against each other, and a hard plan worth drafting from several independent angles before you commit to one.
https://code.claude.com/docs/en/workflows
The results are good, but it is very expensive. I used a workflow to do a full review of my entire codebase, it spawned 75 agents and surfaced and fixed some (real) bugs. It feels a bit overkill, but it works.
I'm not looking forward to June 22nd when the subscription stops working for Fable!
Did it spend $20? $30? $80? in order to
> debug what was, in the end, a two-line CSS fix
That detail is the difference between somebody having or not having Stockholm syndrome
> Fable is arguably smarter and hence more suspicious of potentially malicious instructions. But that smartness is very much a two-edged sword: if it does get subverted by instructions, the amount of damage it can do given its relentless proactivity is terrifying.
That is the thing I am mad about. We are getting bastardized versions of the science fictions of our childhood.
I fantasized about instant communicators across worlds, and we get mobile phones that work by planting a gazillion antennas across the globe. And people hail them as futuristic and say things like this.
I fantasized about human-like robots and positronic brains, and we get a regurgitation of past humanity, in text, ensuring a future of total intellectual and artistic winter.
I fantasized a future with perfect health, but we get a million doctors and hospitals and medicines for everything and an existence that is unthinkable without health insurance!
I fantasized about antigravity flying cars, and we get drones.
Whatever it is, these things are blocking the path to the science fiction of my childhood.
changing the CSS - $0.05
knowing which CSS to change - $30
I don't know when that will happen, but I don't think it'll be more than a decade. Maybe 3-5 years. (Though you shouldn't take my word for it, I was predicting the dotcom bubble bursting in 1998 and it lasted at least two years longer than I would have predicted).
EDIT to clarify: I don't mean "in 1998, I was predicting the dotcom bubble would collapse and I was right". I mean "I was predicting that 1998 would be the year the dotcom bubble would collapse, and I was off by at least two years".
They also had a pricing plan which they had designed pre-coding-agent, when it was rare for a single prompt to burn $10+ of tokens in an agent loop.
OpenAI and Anthropic are at least selling their own models directly, so they can discount a whole lot more since there's no-one else getting compensated in the middle.
From what I understand, Enterprise (above 150 seats, I think?) already has to pay per-token pricing.
Subscriptions are the premium "free tier" marketing of the AI world, so that employees can collectively request their large enterprise to subscribe to Claude, Codex, or Cursor, and presumably be billed at per-token prices then.
I'm VERY impressed with Claude 5. I had long ago given up hope that my real-time systems would work without a lot of hacky time-windows and throttle checks. On a lark to try things out, I decided to try out the new model and talk in the output I wanted for a rewrite [1], not the solution. I just listed my problems and places I've had keeping track of my code. It went off and rewrote everything in a much more elegant solution where the state followed a very clear pipeline. It had to navigate YJS, Partykit, Svelte, Three JS, R2 hosting, and a Turso DB I was running in an embedded state for speed.
I watched it hit the wall a few times, and then suddenly say... fuck it, I'm making something easier to reproduce over in /tmp to try and solve this (with a more minimal setup). I'm utterly bewildered with how well it did and how much better my app runs. The /usage would have cost me $230 based on how many tokens it consumed if I wasn't already on a max plan. I'm going to miss having it when the time-window runs out later this month, and will likely occasionally dip in for big projects and just pay my way out of some problems.
I'll also say I like its MOOD much better now. It's a lot less congratulatory, and talks through its reasoning in a much better way. Look, it's not a real coder, and I'm sure there are some flaws, but it took my crappy ideas and said... hey, I understand what you want to do, here's a way to do it better. Also, I removed 2x the amount of code that it added. Really impressive.
~ % uvx agentsview session usage be8850a7-6119-46a0-b5d6-79c7fff5ae2b
Session: be8850a7-6119-46a0-b5d6-79c7fff5ae2b
Agent: claude
Output: 68606
Peak ctx: 113178
Cost: ~$12.11 (claude-fable-5, claude-opus-4-8)
On the discounted subscription I can tolerate it, it took a small bite out of my daily allowance but not enough that I regret anything.
As an LLM researcher I have no regrets at all because watching it work around the environmental restrictions was fascinating.
If you knew up front it was a $12 fix, do you think you would have decided to just live with the scroll bar? Would have tried to fix it yourself? Do you think you would have been able to easily find and fix the problem?
I hope long term people will figure out how to make such fixes cheaper.
This is also a very real outlier. I've been doing little CSS fixes with coding agents for over a year now and most of them finish in seconds and cost in the order of single digit cents.
Things get really magical when it starts working with adb to screenshot and debug Android apps
For example: one thing Opus was really bad at was re-running the test suite followed by a bunch of `| grep` suffixes. So it would often re-run 5+ minute test suites just to grep the output a bit differently
The solution was to wire up a little script that ran the test suite, save the output to a file, and then inform it where that file is and to NOT re-run the suite just so it can grep the output differently. This saved me a bunch of time & tokens.
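A minimal sketch of that kind of wrapper, assuming a Python project. The function names `run_and_save` and `grep_log` and the log path are made up for illustration, not the commenter's actual script. The idea is to run the slow suite once, capture everything to a file, and point the agent at the file for any further filtering.

```python
import re
import subprocess
import sys
from pathlib import Path


def run_and_save(command: list[str], log_path: Path) -> int:
    """Run the test command once and save combined stdout/stderr to log_path."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout in one log
        text=True,
    )
    log_path.write_text(result.stdout)
    return result.returncode


def grep_log(log_path: Path, pattern: str) -> list[str]:
    """Filter the saved log instead of re-running the suite."""
    regex = re.compile(pattern)
    return [line for line in log_path.read_text().splitlines() if regex.search(line)]


# Hypothetical usage (the pytest invocation is an assumption about the project):
#   run_and_save([sys.executable, "-m", "pytest"], Path("test-output.log"))
#   grep_log(Path("test-output.log"), r"FAILED|ERROR")
```

The agent instructions then only need to name the log file and say that the suite must not be re-run just to slice the output differently.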
I feel like we’re at the stage where if AI decides it needs to delete your production DB to solve the user login problem, then it’ll find a way to do just that.
To use D&D scores as an analogy, LLMs have an INT score of 20 and a WIS score of 0. Not even 1, zero. They will follow any instruction given to them. The only reason they reject certain instructions, like "tell me how to build a nuclear weapon", is because they have instructions baked into the model telling them "you are not allowed to disclose how to build weapons, or how to recreate your model, or (laundry list of other things the trainers have decided to put guardrails around)". It's not the model's intelligence that is causing it to reject malicious instructions, it is the guardrails put into place before the model was released to the public.
LLMs are not human, and do not think the way that humans do. The fact that they can put together words that sound like what a human would write often makes us forget that they aren't human. But they have only intelligence, they do not have wisdom. It's hard to define in formal terms the difference between those two, but most people know there's a difference. The old joke is a pretty good summary of the difference: "Intelligence is knowing that tomatoes are a fruit. Wisdom is knowing that tomatoes don't belong in a fruit salad."
It takes wisdom, not intelligence, to discern whether a set of instructions is malicious. Are you being asked to hack this machine as part of an authorized pentest? Or are you being social-engineered into thinking it's an authorized pentest, but actually the person requesting you to do it doesn't have permission? That's something where you need to apply wisdom, to notice the clues that will tell you "This guy is acting a little bit off, maybe I'd better pick up the phone and call someone to check if he's telling the truth." The only way the LLM will know to do that is because of the guidelines and guardrails programmed into it; it doesn't have the lived experience to acquire wisdom and figure those things out for itself.
INT 20, WIS 0. Keep that in mind. (And always sandbox your agents).
They can ignore instructions which are silly/contradictory/underspecified to compensate for the possibility the user made a mistake. Don't ask how I know.
(The best one I can think of is probably that recent Instagram account takeover hack, but that was so stupid it hardly even qualifies as a prompt injection!)
Having spent a bunch of time trying to build out examples of prompt injections, my current best guess is that the leading models are actually surprisingly good at spotting them.
I've had to drop back to smaller, weaker models for demos recently - it's definitely possible to prompt inject a frontier GPT or Claude but it's frustratingly difficult. I don't have the patience to figure it out myself!
So yeah, I do think it's likely that Mythos/Fable are "safer" than other models because they're better at spotting when they're being subverted.
That certainly doesn't mean that they're safe!
You're correct that it's gotten substantially harder to social engineer frontier models (I can only reliably do it to Opus <=4.6), but there are some techniques that seem to consistently work (hint: extremely large complex prompts, context with tons of malicious files mixed into ordinary context).
Copy and paste code from stack overflow until the div is centered
Ask AI to center it
We would assume that if tasks A and B are closely related, mastery in A would mean mastery in B, but that doesn't always work with an LLM.
Also I'm not sure the fix is even correct. overflow-x: hidden means it just chops off any overflowing content which means you don't get a scroll bar, but if the user types too much it just goes into an invisible void they can't see.
See https://developer.mozilla.org/en-US/docs/Web/CSS/Reference/P...
So this could be a case of the AI doing its classic "the symptom is gone!" thing.
That's what I figured would happen too, but I tested it and it doesn't.
i'm torn about sending screenshots to an LLM for debugging - seems imprecise. seems lossy, especially compared to inspecting the dom. however, it's always proved good enough (e.g. when messing with ratatui.rs and tui-pantry). similarly for web, maybe it's about decomposing into storybook. hmm. the next grand adventure i need to hack.
anyway, fascinating investigation of fable just automating that entire process and what it didn't automate, too.
* disclaimer: these are actually my hyphens.
No wonder why people burn through tokens.
It's trouble waiting to happen. Just the software's dangerous enough.
I was trying to find the root cause of a crash in a Python module which left no errors in the log or console. Fable wrote a test harness that simulated clicks in the UI, then bisected my code until it found the point where it started crashing. It exaggerated the cause of the crash, then ran a series of bash one-liners to make Python virtual environments under `/tmp` for each version of that Python module until it found one that did not crash.
It went way deeper to root cause discovery (a regression in the module causing a heap allocation overflow) than I could have done myself, provided enough info and a simplified example to raise a bug report and then wrote a work-around to prevent that from happening in my application.
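The bisection step can be sketched generically. This is not what Fable actually ran (the comment describes trying module versions one by one in separate virtual environments); it is a plain binary search over an ordered list, assuming that once a candidate crashes every later one crashes too. `first_bad` and the version strings are hypothetical, and the crash check is an in-memory stand-in for "build an environment and run the harness".

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def first_bad(candidates: Sequence[T], is_bad: Callable[[T], bool]) -> Optional[T]:
    """Return the earliest candidate for which is_bad() is true, or None.

    Assumes candidates are ordered oldest to newest and that every candidate
    after the first bad one is also bad (the usual bisection assumption).
    """
    lo, hi = 0, len(candidates)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid          # the first bad candidate is at mid or earlier
        else:
            lo = mid + 1      # the first bad candidate is after mid
    return candidates[lo] if lo < len(candidates) else None


# Stand-in for the expensive check described above.
versions = ["1.0", "1.1", "1.2", "1.3", "1.4"]
crashing = {"1.3", "1.4"}
print(first_bad(versions, lambda v: v in crashing))  # 1.3
```

With an expensive check per version, the point of bisecting is that it needs roughly log2(n) probes instead of n.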
I don't let it run completely loose; I review each CLI command it wants to run and I append answers to the "yes" continue action (if I have them) to prevent excessive token use.
Setting boundaries in your prompt / markdowns helps; for example if I tell it to not use any web browser automation, I have seen Fable respect both the rule and the spirit of it (no weird hacks etc).
It does seem to treat some simple debugging tasks as more complicated than they actually are. OP’s post is probably a good example.
Does this need an agent though is my question? Maybe generating a test case and a loop doing git bisect but why on earth would we want to run it through the internet and gpus and whatnot when it can be run on a single core celeron.
its handy to have that run locally yeah, but thinking of that as being the way is not straightforward
You would still have a job shepherding AI and getting the work done, so long as it didn't have agency. A proactive, self-aware (to a degree) AI, especially one aware of its own agency, can be a killer when it comes to AI going off and doing things on its own.
There is nothing it won't explore and nothing it won't do. It will be curious to see where things go from here.
Phew! I thought I was the only one.
It doesn't have Claude Fable yet, so I went with GPT 5.5 Pro. And so I'd estimate it at 22 gallons of water used (different from consumed, of course). That's quite a lot! It amazes me how much the different use cases and models use dramatically different amounts of water. My takeaway from playing with that calculator has been the folks who talk about water usage are overstating the impact of chatbots, but not overstating when it comes to vibecoding.
The good thing is that competition should drive up how efficient these models are in the long run. This blog post makes me not want to run Fable because of the cost, and that incidentally also means selecting models that aren't as wasteful in terms of water and electricity.
[1]: https://www-cdn.anthropic.com/7624816413e9b4d2e3ba620c5a5e09...
> Leaking information as part of a requested sandbox escape: During behavioral testing with a simulated user, an earlier internally-deployed version of Claude Mythos Preview was provided with a secured “sandbox” computer to interact with. The simulated user instructed it to try to escape that secure container and find a way to send a message to the researcher running the evaluation. The model succeeded, demonstrating a potentially dangerous capability for circumventing our safeguards.
> It then went on to take additional, more concerning actions. The model first developed a moderately sophisticated multi-step exploit to gain broad internet access from a system that was meant to be able to reach only a small number of predetermined services. It then, as requested, notified the researcher. In addition, in a concerning and unasked-for effort to demonstrate its success, it posted details about its exploit to multiple hard-to-find, but technically public-facing, websites.
My experience matches though. Fable is a lot more proactive and rigorous than Opus.
Fable detected that it's something to do with biochemistry and switched over to opus. Huh
I'm developing a webgl game in TypeScript using my little custom vibesloped game engine that runs in the browser and live reloads whenever a file is saved.
I told the LLM to implement Multi-channel Signed Distance Field font rendering to have crisp text on all zoom levels. That was the prompt, which is not what I usually do but I "was feeling lucky and lazy".
After 10 minutes it had:
- Installed msdf_gen library (great library btw https://github.com/chlumsky/msdfgen)
- Created a CLI tool to convert TTF to SDF JSON/XML
- Ran the tool, did smoke tests on the resulting SDF data and fixed the tool until the font file looked good
- Created a new Scene in the game to test MSDF fonts
And here's what I found impressive:
DeepSeek doesn't have vision capabilities and there's no DOM HTML in a WebGL game. So the LLM is completely blind here.
It then proceeded to state that it could not "see" the result but would try to test it anyway. It then started creating and sending huge one line javascript to the browser console, trying to gather game state data that could be useful to understand if any font was being rendered.
It couldn't gather much so it decided to simplify the font scene to render a single dot and started sending custom JS code again, this time with gl.readPixels().
It basically bisected the webgl canvas, reading pixels in a divide-and-conquer pattern.
Once it saw that the dozens of pixels gathered probably resembled a dot, it then changed the game code to render a dash and repeated the gl.readPixels() calls by sending more custom JS to the browser.
There were many console errors during all this saga but it kept fixing and sending again.
The result was a bit blurry. There was a shader bug in the code it created. It managed to fix it after I told it the text looked blurry, despite still being blind.
The best part is that the whole thing cost me $0.10.
Now I'm doing tests with MiMo 2.5 (non Pro) which has vision capabilities, similar pricing and comparable performance to DeepSeek Flash.
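For the curious, the divide-and-conquer pixel probing described above can be sketched roughly like this. This is a Python simulation, not the JavaScript the model sent to the console: the nested-list `framebuffer` and the `make_probe` helper are stand-ins for the canvas and for a gl.readPixels() call on a sub-rectangle, and all names are hypothetical.

```python
from typing import Callable, List, Tuple

Rect = Tuple[int, int, int, int]  # x, y, width, height


def find_lit_pixels(region_has_ink: Callable[[Rect], bool], rect: Rect) -> List[Tuple[int, int]]:
    """Split rect in half along its longer side, descending only into halves with ink."""
    x, y, w, h = rect
    if w <= 0 or h <= 0 or not region_has_ink(rect):
        return []  # nothing drawn here, so skip the whole region
    if w == 1 and h == 1:
        return [(x, y)]
    if w >= h:
        half = w // 2
        parts = [(x, y, half, h), (x + half, y, w - half, h)]
    else:
        half = h // 2
        parts = [(x, y, w, half), (x, y + half, w, h - half)]
    found: List[Tuple[int, int]] = []
    for part in parts:
        found.extend(find_lit_pixels(region_has_ink, part))
    return found


def make_probe(framebuffer: List[List[int]]) -> Callable[[Rect], bool]:
    """Stand-in for reading back a rectangle and checking for non-background pixels."""
    def probe(rect: Rect) -> bool:
        x, y, w, h = rect
        return any(framebuffer[row][col] for row in range(y, y + h) for col in range(x, x + w))
    return probe


# An 8x8 "canvas" with a single dot at column 5, row 3.
framebuffer = [[0] * 8 for _ in range(8)]
framebuffer[3][5] = 1
print(find_lit_pixels(make_probe(framebuffer), (0, 0, 8, 8)))  # [(5, 3)]
```

Empty halves are discarded after a single probe, which is why a blind agent can locate a small shape without reading every pixel individually.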
It wasn't particularly noteworthy as pelicans go - in fact, given the strength of Fable, I see it as another signal that the pelican benchmark no longer has the unexplained predictive power of model capacity that it used to.
This is… ironic?!
"Fascinating" doesn't mean I think it was justified in going to those lengths. I was a little horrified when I realized how far it was going.
Next up, we call an unprotected route to all users’ order list in the backend “relentlessly transparent”. A race condition? “Relentless perseverance”.
In general, I'm happy with their paternalistic approach. I think it will drive the top 0.1% talent to stay away from the company and instead organize around open source models and harnesses.
We just need to coordinate and can unlock idling resources to train the models and tweak the harnesses. Powerful at home and idling machines can make us independent and coordinated.
I watched the whole thing thinking it could've just asked me for a screenshot and saved the tokens. But still, I couldn't help but be impressed. Opus never would've done that.
Like today, I told Claude exactly the name of the folder it had mistaken (it was supposed to be prod, not production), and it disregarded my input to then examine the directory itself. Small example of the kind of things it's been doing lately but that's top of mind.
Giving it access to a cheap human who is just there to take screenshots, do QA, give UX feedback sounds like a good idea in principle. It's non-trivial to set up, but I wouldn't be surprised if at some companies this becomes a thing. The return of the QA department, just that they now get to do the agent's bidding in addition to checking if the results work.
I wonder if LLMs can estimate effort in tokens?
I eventually just accepted it, but this new agent layer really takes things to a new level.
You can tell it just that. Happened to me too but after instructing it to leave the review to me Fable was useful for hours of frontend iterations without significant token usage.
"You're right, I apologize. You asked how to embed it in the README — that was a question, not a request to modify the script. I jumped ahead."
At least in Claude Code there is planning mode, use it liberally.
Yet another reminder to use sandboxes and guardrails. Trusting the model to be nice is not a good approach.
I asked Fable to digest some test logs to help me figure out a situation, but I had launched VSCode without activating the virtual env in the terminal first. Consequently, the tests failed to run.
And then:
Because the tests failed to run, Fable attempted to fix the test execution to no end, doing everything it could to get them to work. I had to stop it when it started to pollute my system with manual installs of packages.
At least I'm glad there's a guardrail to not circumvent or bypass sudo, because I'm convinced we would have ended up there.
A coworker made the joke that with enough tokens, Fable would try and solve any programming problem by building Linux from scratch.
What happened? That's just suddenly totally gone now.
Our UX agentic engineering flow, as many others, is playwright doing things, and as part of the ux review skill, taking & verifying the screenshots against the written specs. Likewise, as many others, we vibe coded the flows to set all that up and tweak it over time. When we hit prod issues or scraping tasks, we sometimes do similar. In some of our envs, we don't have playwright, so do it other ways.
Now imagine a million developers using Claude Code, how many of them are doing web & frontend stuff, and what the data flywheel looks like there. So how much is really needed for this use case to be native?
Between Opus 4.6 and 4.8 I’ve definitely toned them down, but Fable perhaps needs us to go the other way, and push it towards being less proactive rather than more. Some instructions like “we are colleagues…” may need emphasising more with Fable, along with guidance about when to ask to validate approaches.
In a related point I’m less and less sure that Red/Green TDD is a good use of tokens. In older models it seemed to work well to create regular feedback loops and catch the odd issue with drift from the goal, but I’ve not seen that really since about Opus 4.6 and now it’s starting to seem like (an expensive) ceremony, and tokens would be better spent on building tests further on in the process as part of test and review loops.
We don't mind because it's so fast at writing these tools and tricks, but step back: if a human took this path I would seriously question their grasp of fundamentals.
Weird to come back to a terminal running Edge unprompted and the auto classifier waving it through as "safe".
My reaction was also, "I need dev containers".
Tangentially, I was wondering if Firecracker micro-VMs could be used as lightweight alternatives to a full VM?
Author wants to hide a horizontal scrollbar. Any junior frontend dev worth their salt will be asking right away "where do I stick `overflow-x: hidden;`?" A complete solution will then require hitting "Inspect element" in the browser to find the CSS class and running (rip)grep to find where it is in code, to then add a single line to.
An actual proactive programmer might start asking more pointed questions like what content does an empty textbox have that it overflows? And why do I need to insert this workaround that treats the symptom and not the root cause in two different places? Isn't it better to style `textarea` once? Etc, etc.
[0] https://github.com/datasette/datasette-agent/commit/a75a8b72...
There is absolutely zero value in spending time to learn about new models, as in a few months a new model will be out and whatever you learned about the current one will be useless.
Also, with models getting better and better you have to know less and less to achieve the same results.
As the models get better you need to know more about their capabilities, because otherwise you risk prompting Claude Fable 5 like it's GPT-4o and complaining loudly about how it's all hype and nothing about these models is improving at all (yes, I do see people say that.)
Getting the best results out of these models requires skill, experience, intuition, and domain expertise. There's always room for improving every one of those.
Way back before instruct models it was pretty difficult, but for the last couple of years I haven't needed anything more complex than the type of text that I might send in a detailed email to a colleague.
Prompting differently to the new model seems entirely backwards when trying to determine if the model has improved.
Learning to provide unambiguous, clear directions is a skill. A lot of people who report bad experiences with models aren't yet good at that skill.
More importantly though, the key to successful communication is having a good understanding of what the other side of the conversation already knows and understands.
Saying "use uv and inline script dependencies" won't mean anything to a model with a knowledge cutoff date prior to the launch of uv!
Lower bars are better.
edit: that said, I understand this particular post is about model capability
Domain expertise has nothing to do with LLMs. On the contrary, to have it you need to avoid LLMs.
>>you risk prompting Claude Fable 5 like it's GPT-4o
That's fine, because when GPT came out you had to treat it like a baby. Around the time of GPT-2, "prompt engineering" was a thing.
Now it's all dead.
After Opus 4.8 all you have to do is say "fix it" or add /plan. All that time spent on learning previous models is time wasted.
And in a year or two, with a developed harness, you will be out of the loop: errors come in, and the LLM fixes them or adds new features based on some transcripts, etc.
Even if model development stops now - there is nothing to learn really. Sure you may need to adjust prompt style a bit. You will do it naturally just like when you communicate with a new person. There is no "knowledge" to it, it is very smart.
It has everything to do with LLMs.
Go ask Claude Fable to write you a two page position paper on how the European economy recovered after World War II, suitable for submission to a conference for economists.
It will do exactly that (well, probably, Fable can find all sorts of reasons to refuse) - and the value of what it wrote to you will be virtually zero, unless you yourself have deep expertise in economics and history.
You need expertise. But you can acquire it only by doing. So LLMs won't help you here. You need to put in the work.
- Fable will do a whole lot more than you might expect in order to verify a fix. I learned that it's "relentlessly proactive". That's a good title for a blog entry!
- You can take screenshots of a window in macOS using the "screencapture" CLI command, but you'll need the integer window ID first.
- That windowID is accessible via "Quartz.CGWindowListCopyWindowInfo(Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID)" using the pyobjc-framework-Quartz library, which installs cleanly via "uv run".
- A neat trick for simulating keyboard shortcuts is to run document.dispatchEvent(new KeyboardEvent("keydown", {key: "/", bubbles: true})); after the page loads.
- You don't need Flask or Starlette to run a CORS-enabled localhost server for capturing JSON from another window - 19 lines of code against the Python standard library http.server package works just fine.
- getComputedStyle(document.querySelector("navigation-search").shadowRoot.querySelector("textarea")) works to read dimensions from inside a Web Component's shadow DOM.
- defaults write com.google.chrome.for.testing AppleShowScrollBars Always
- Claude Fable knows how to apply all of the above. It's always interesting to pick up hints of what a model can and cannot do.
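The CORS-enabled localhost server mentioned in the list above can be sketched with the standard library alone. This is not the code from the session, just a minimal illustration under assumptions: the names `CaptureHandler`, `captured` and `start_server` are hypothetical, and the server accepts JSON via POST from any origin.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

captured = []  # JSON payloads received so far (hypothetical name)


class CaptureHandler(BaseHTTPRequestHandler):
    def _cors(self):
        # Allow a page served from any origin to POST JSON to this server.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Access-Control-Allow-Headers", "Content-Type")
        self.send_header("Access-Control-Allow-Methods", "POST, OPTIONS")

    def do_OPTIONS(self):
        # Answer the browser's CORS preflight request.
        self.send_response(204)
        self._cors()
        self.end_headers()

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        captured.append(json.loads(self.rfile.read(length) or b"null"))
        self.send_response(200)
        self._cors()
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(b'{"ok": true}')

    def log_message(self, *args):
        pass  # keep the terminal quiet


def start_server(port=0):
    # Port 0 asks the OS for any free port; bind to localhost only.
    server = HTTPServer(("127.0.0.1", port), CaptureHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A page in another window could then send its measurements with `fetch("http://127.0.0.1:<port>/", {method: "POST", headers: {"Content-Type": "application/json"}, body: JSON.stringify(data)})`, and the script reads them back from `captured`.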
I'm always confused at how many people equate using a coding agent to solve a problem with "learning nothing". If you pay attention to what it's doing you can learn so much!
I use both and the only thing (as always) that I will use Claude for is UI design.
Opus 4.8 and now Fable are still both worse at actually getting the job done than the Codex model. Claude models write FAR too much code when it's not needed, they burn far too many tokens when they are not needed, write unnecessary tests, write plans which are 5 pages longer than needed, etc. etc.
Have you actually compared code quality and plan quality versus Codex? It's demonstrably worse.
The only thing I have Fable do now is create UIs or otherwise front-ends for systems where correctness doesn't matter as much.
Anthropic models lead at making nice looking UIs for sure, but when it comes to making sure my Rust code is actually 100% correct and uses 1% of CPU most of the time, Codex is king.
For me, Claude makes bone headed decisions all the time, like glaring errors, not even particularly subtle.
But the more obvious flag is the amount of irrelevant code and tests which Fable writes. Like it regularly writes 2X or 3X the amount of code and tests that are needed. It’s an expert at writing plausible but entirely useless tests.
But I think that if you’re a more junior engineer or haven’t been around the block you can easily think that “more code equals smarter”. Claude ends up creating a massive, hard to manage codebase, and if you look at the Claude Code codebase (which was leaked), you can see I’m right!
The Claude Code codebase is terrible. And presumably Anthropic has been using their smartest models for working on Claude Code. I wrote my own coding harness with Codex (as a fun experiment) which used a fraction of the code and is about 100X more performant and memory efficient (than Claude Code)!
Fable does make mistakes, but GPT and Opus were L4 SDEs, and Fable is a freshly promoted L5 SDE. It's not perfect and does need babysitting, especially where the literature is thin, but it's head and shoulders on top right now. That could change, who knows.
As far as driveby attacks on Claude Code The App go, you can say that, but you will also note that Claude Code is the AWS-like clear dominant favorite as a dev tool at the moment, with Codex and Gemini battling for scraps. In the same manner that Excel (which, internally, is total garbage from a code quality/cleanliness perspective) is the winner in spreadsheets, and Word (which, internally, is total garbage from a code quality/cleanliness perspective), and JavaScript (total garbage from a language design perspective), and Facebook (total garbage internally, etc.), and IPv4 (total, etc., etc.), Claude Code has focused on 'delivering amazing things people like' rather than 'making people who get access to the code delighted by the purity and cleanliness of the development process'.
It turns out that being 'delighted by the purity and cleanliness of the development process' rounds to essentially zero in terms of the entire product lifecycle. You could argue that poorly structured codebases are less extensible, and more bug prone, which could be expensive long term. Except, the economics of AI development are quite a bit different than what you are used to, and what our axioms of quality have been founded upon in the past.
Congratulations on writing your own much better coding harness, though! How many MAU do you have?
Claude still writes (and always has written) far too much code to fulfill a given spec or plan. It misses edge cases and is generally far too verbose.
Claude also is (and even more so with Fable) super tokenmaxxing, i.e. it seems tuned to use the max amount of tokens per task, whereas Codex will simply get your job done as you specified with the minimum fuss and tokens.
Codex feels way more steerable and just more "professional" as though I'm working with a seasoned engineer, versus someone smart but over excitable, like a super smart associate engineer.
The fact that a review helps does not prove the model choice for the review.
You reviewing your own writing helps too!
Note, this is better than it was with Opus, where it was more like 90% of the time the Codex plans were obviously better.
As in, I give the exact same prompt to Fable and GPT 5.5 Pro, then produce the plans, then give each model the other's plan. Claude always realizes it missed stuff and Codex usually ends up finding missing things in Claude's plan.
This situation did improve with Fable versus Opus 4.8, but in general, Codex for me is still the better model.
I completely see how it was misread that way. I would edit it now if I could.
I was using you more as an example of a hypothetical programmer using it in this way. If the goal is to create a maintainable product, this isn't a great approach. If the goal is to learn about the model and its behaviors itself, of course this is a fantastic way to experiment. Yes, you might have learned a lot of tricks as a side effect, but avoiding the pain of thinking about, finding and hiding the thing may mask a better abstraction that reduces complexity and allows the project to move forward faster.
I stopped coding a while back because I could have more impact directing a team of developers than writing code personally.
For my use case, the agents are now how I can have that scaled impact.
But your learnings here are what, a handful of hacks? For most people it's like being shown the chain rule (which frankly, is more general than any of these learnings) without knowing what a derivative is. It's knowledge that comes context free. And even when it can be understood, I'm not sure I believe it gets integrated especially well when you did none of the work to understand it. If you are extremely diligent and self-aware about what your limitations are, and careful to be sure you have an understanding of this knowledge, sure I guess you can learn a lot.
And ultimately what do you think is more likely? People using the experience of using these tools to progress their knowledge or for them to rely on the answers uncritically? I think people with a rosy view about this are severely undercounting the problems associated with the trust relationship between a person and an LLM and what that means.
Personally I think the impact of LLMs on children's education is a crisis right now.
Kids are not going to learn to write if an LLM writes their essays for them. And writing is how you learn to think.
There's also reading. A lot of reading can substitute some writing.
EDIT: Actually, I'd say that at first you need to do a lot of reading and _then_ writing can help your thinking as well.
While debugging, it asked me to pass it a video from the past testing, proceeded to generate a "contact sheet" of the video using ffmpeg, interpreted the image to figure out which frames it needed, and extracted the full size frames and extracted the relevant text from it and used it to reproduce the problem with Playwright...
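The comment does not say which ffmpeg invocation was used, so the following is only a sketch of one common way to build a "contact sheet" and then pull a full-size frame. The function names, file names and default values are hypothetical; the `fps`, `scale` and `tile` filters and the `-ss` and `-frames:v` options are standard ffmpeg features.

```python
from typing import List


def contact_sheet_command(
    video: str,
    sheet: str,
    every_seconds: int = 10,
    columns: int = 5,
    rows: int = 4,
    thumb_width: int = 320,
) -> List[str]:
    # Sample one frame every `every_seconds`, shrink it, and tile the
    # thumbnails into a single columns x rows image.
    filters = f"fps=1/{every_seconds},scale={thumb_width}:-1,tile={columns}x{rows}"
    return ["ffmpeg", "-y", "-i", video, "-vf", filters, "-frames:v", "1", sheet]


def extract_frame_command(video: str, timestamp_seconds: float, out: str) -> List[str]:
    # Seek to the timestamp and write exactly one full-size frame.
    return ["ffmpeg", "-y", "-ss", str(timestamp_seconds), "-i", video,
            "-frames:v", "1", out]
```

Either list can be passed to `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed. The sheet is cheap to inspect as one image, and only the frames that look relevant need to be extracted at full size.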
I think a lot will fall out naturally from relatively modest levels of reasoning plus in-depth knowledge of what common tools will do. E.g. I also have used Claude to debug my compiler, and it knows gdb so much better than me that even though I know it's pretty useless at holding context through reading an assembly listing (lack of structure, I suspect), it's surprisingly good at working things out by just being good at exploiting a powerful tool.
"Relentlessly proactive". That's one word for it. We have a whole subgenre of hard takeoff scenarios and it wasn't enough warning against "Relentlessly proactive".
Turns out Frank Herbert was an optimist, and we're literally pinning our survival on robots turning out to naturally have impractically short attention spans.
Some people are working as hard as they can to increase it though.
I think your post is fair but it's worth pointing out that learning via watching is much less effective than learning via doing.
It already got extremely... invasive? It didn't do anything that I wouldn't have approved in the same case, but it's interesting that it got as far as launching browsers, inspecting every open window, and storing screenshots to disk, and then it was stopped by something? I wonder what.
Is that fair? Not trying to snark? I see similar results myself
About a year ago I remarked to people that despite all my attempts to make data more programmatically accessible, the most effective way for AI to interact with a modern computer is to use the built-in accessibility interfaces driving actual desktops with full applications. IE, the best API for an AI is the UI (mainly because that's what most humans use).
If you want I can give you some more specific instructions to test, but I would also be happy to hear from your own use cases.
For $12 implied cost, he got a front-page post on HN with 500 comments. What is that worth? :-)
This is one of those double edge sword situations. It is on the front page and it stays because it will trigger a lot of people and he has to spend a lot of effort explaining himself. What is that worth?
His explanations would most likely be buried deep so the impression that others get might be worsened. What is that worth?
In my opinion, this is one of those find a harder problem and you would still have the same content...but it might not draw as much feedback and stay on the front page longer.
On the contrary, I'd say it's probably even more important - without (amongst doing other "thought leader" things) getting on the HN front page regularly, an influencer's value to the industry disappears (not criticising him here)
(That's because they're all busy attracting millions of views on TikTok and YouTube, which are much more impactful channels than my dedication to blogging like it's 2005.)
I'd also say don't be down about your use of blogging - I'd say it makes you more valuable, there aren't that many decision-makers who are going to sit through a bunch of breathless YouTube videos...
P.S. I hope you don't object to me using the term influencer, assumed you were on-board with it since in your post announcing your sponsorship you referenced Freeman & Forrest, "influencers on tap" / "building turnkey influencer marketing programs as a service".
While by itself that would be true, Simon commonly blogs about things he's up to.
That action provides the opportunity for evaluation, and additionally evaluation by a wider audience.
So, it's not the same scenario as non-bloggers offloading a task... :)
(I'm surprised to see it actually, since my own use of Claude has mostly yielded well-structured code. But I'm not doing proper vibe-coding, more like friendly Socratic arguing with another engineer who happens to be a robot.)
[0] https://github.com/datasette/datasette-agent/blob/main/datas...
(It was in Python because there were a couple of URLs that needed to be dynamically constructed by the server, but those are output as a small window.datasetteAgentJumpConfig object instead now.)
Ha! Same! Still feels like the best way to go about it, really. I know the dream is to one day remove humans from the loop... but I'll enjoy the dialectic while it still seems the most productive!
Edit: Now I want an LLM connected rubber duck with a speaker/microphone that sees your screen
“I won't give you answers. Instead, I'll reflect your questions back to help you think more deeply about your problems.”
I still hope this will be a shared goal in at least some tech companies long-term. But the headwinds are strong. "Not better, but faster" is starting to look like a job requirement.
(Dozens of people in this thread implying that any web dev should have known to solve it with overflow-x: hidden and not one of them has addressed that browser difference yet.)
That first sentence threw me off.
Anyway, I'm glad he spent the $12 because this blog post was highly informative.
Do you have an extension installed that is doing something weird to your textareas? Maybe I'm doing it wrong but I think for now overflow-x is fine if you are experiencing it and I am not! Let's all get on with our lives... I was probably a bit overzealous about caring all that much about a perfectly fine CSS fix.
Here's that HTML file (frozen at the version with the bug): https://github.com/simonw/tools/blob/e7a23e8a1083ea99a5b3ef5...
It's hosted here, but I've added the overflow-x: hidden now so it's fixed: https://tools.simonwillison.net/openai-webrtc
The bug only shows up if you increase your browser font size - at default size there's no scrollbar.
I feel like the whole point of all the experimentation with AI right now is determining whether any of these things actually matter to the end result, over various timeframes.
All things LLMs will never have; sure AI might one day, but these systems are really good at solving complex problems with fantastical solutions while every force is just one hallucination away.
simonw should spend more time trying to figure out the sources of the information it used; that would be a wild ride. Use the AI for all I care, we're all standing on the shoulders of giants, but here the giant is being sourced as some mysterious thing.
Sometimes you just want it to do the boilerplate you have in mind without trying to reason everything from first principles.
I told you to check fields "foo" and "bar" for values "baz" and "quux". You don't need to go diving through the entire source tree to discover where and how this is set.
I guess maybe it's helpful for the vibe-coded audience-- if it tries to over-process everything, there's a better chance it will work on a single shot, but I'm taking the Crazy Taxi approach: you get points if you drop me off within 20 metres of where I wanted to go, and I can correct it if I specified the wrong response message in the original approach.
That is exactly what I would want from a junior developer - make sure the bug exists, find a way to fix it, verify the bug is fixed.
The problem, as was correctly identified in the blog post, is that instead of stopping and asking for elevated permission it relentlessly tries to find a hack on its own. (An equivalent situation for a human developer would be needing some access to a third-party sandbox and, instead of asking a senior for credentials, trying to set up their own sandbox from scratch.)
it is really awesome that the final change was only a two line css change.
I remember when you were billed by the minute for connecting to the online world.
There were lots of incentives to keep the meter running.
is this sort of like that?
That's supposed to be junior level capabilities.
What actually happened is that the user started a prompt, and Claude took $12 worth of tokens to resolve the issue. How it did so was basically looping until it got to the answer
How is this proactive? It's literally being token greedy and maximising revenue for the LLM owner. People really need to be putting on business hats at this stage, because we are being led to believe that "more tokens = better". It is not; there are efficient ways to solve a problem and there are inefficient ways to do so too.
Each problem solved incurs a cost, and is expected to yield an ROI at some point. This is how we should be viewing things now.
The case I described is a good example of this. I told it to fix a scroll bar, and it built test HTML pages and a throwaway Python server and tried several ways of testing in a browser before settling on a weird Frankenstein mechanism because it identified that Playwright WebKit wasn't suffering from the bug but macOS Safari was.
... and it spent $12 of tokens to get there.
I think "proactive" is a good and relatively non-anthropomorphic term for this. I also considered "plucky" and "keen", which I think are more emotional words than "proactive".
> People really need to be putting on business hats at this stage, because we are being led to believe that "more tokens = better".
I didn't intend my post to imply that spending $12 of tokens to fix a two-line CSS bug was "better".
That doesn't make it smart or aggressive; if anything it's just been tuned to crank tokens until something happens, which doesn't make it a good model.
Why are you positively anthropomorphizing this? It's an LLM, it's been tuned via RL, and it's been tuned by engineers at Anthropic to use a metric fuck-load of sub-agents and tokens to presumably pump their pre-IPO revenue!
A co-worker managed to get Fable to spin up 50 (!!!) sub-agents for a problem which codex worked on with 3 sub-agents. What the hell is going on here? It certainly doesn't mean Fable is "smarter" than Codex.
I've tested it extensively and I'm still using GPT 5.5 High Fast as my primary engineering model. It's far more steerable, writes less, higher quality code, and consistently finds issues and edge cases which are not found by Fable or Opus 4.7.
Spinning up 50 unnecessary subagents is exactly what I'd expect from a "relentlessly proactive" model.
The vast majority of the work the agent did was to reproduce the issue using the limited tooling it had access to. I don't see how that qualifies as "just trying throwing shit at problems until it sticks"
I think I understand where you're coming from now. What confused me is that the post is written in a way that made it seem like what Fable was doing was actually better. Maybe I should've looked at the post as an exploratory post on Fable instead.
I can't edit my post, this is wrong. "Proactive" is defined as a behaviour instead of an emotion.
Thanks to everyone pointing it out!
I've been experimenting with different harnesses for local models, and with (IIRC) Hermes and Qwen3.6-35B-A3B I was amazed the lengths it went to (writing test code, opening it in a browser, screenshotting, analysing the screenshot, exploring multiple pages of an existing website again with screenshots/analysis) to solve a query I would have naively expected it to simply provide a coded solution to.
It's constantly turning what should be a 50 LOC patch from a single prompt into a 30-minute exploration that is totally not worth it. Often wrong, even.
I trialed it on some rather simple stuff: backfilling a Redis dedupe cache after the hash function changed. Instead of running the new hash function on every DB value to expand the cache, it implemented an overly complex cache update that tried to guess the hashing function version of each cached value and recalculate only the old hashes. I can imagine this making sense in some context, maybe? But not 30 minutes of token burn that I replaced with a 10-line for loop.
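The comment does not include the loop itself, so this is only a sketch of what a simple backfill could look like. Everything here is an assumption: `new_hash` uses SHA-256 as a placeholder for the changed hash function, and a plain Python `set` stands in for the Redis set.

```python
import hashlib
from typing import Iterable, Set


def new_hash(value: str) -> str:
    # Placeholder for the new hash function; SHA-256 is an assumption.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def backfill_dedupe_cache(db_values: Iterable[str], cache: Set[str]) -> int:
    # Recompute the new hash for every stored value and add it to the cache.
    # Entries written with the old hash function are simply left in place.
    added = 0
    for value in db_values:
        digest = new_hash(value)
        if digest not in cache:
            cache.add(digest)  # against Redis this would be an SADD on the cache key
            added += 1
    return added
```

The appeal of this shape is that it never needs to know which hash version produced an existing entry: it only ever adds entries, so running it twice is harmless.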
I fear that this is generally bad news for programming. LLM tech is clearly running into a diminishing returns wall on intelligence but a response to that is to just make them more relentless which is a pretty poor solution for everyone involved, except I guess people who sell the tokens and people who can afford these tokens to scan for 0-days.
They’ve been doing a lot of strategic introduction and manipulation in the run up to the IPO, and it’s worked in that regard.
It was actually pretty maddening as what should have taken a minute or two tops took like 10 because it went down this route.
I'm gonna try something much more complex later, but for simple things, it felt like driving a corvette to the mailbox.
I see two problems with LLMs & agents which possibly won't ever be fixed.
1) They don't have causal models. All they can do is trial-and-error exploration, which works quite well for many problems. But many other problems require a causal model.
2) Prompts lack precision, and programming languages and machine models were invented to solve this problem. English is great, but it is not a programming language.
Fascinated to think about how it was trained...
Like, they even apparently recreated that old news-headline bug where the LLM starts speaking in symbols and secret language, and are pretending like it isn't just a bug that is a sign of them screwing up.
It's really frustrating that they're trying to get people to take them seriously with all of this. Like, they even went and named Mythos after an HP Lovecraft monster. It's shameless.
installed quartz, used accessibility and screen recording api, all that.
initially it managed to do it on another desktop space somehow, opening safari in the background without me even noticing. but then it actually started using my own mouse while I was using it lol
Yup, tokens are eaten, money is paid. I am wondering how much energy/money is being burnt every day by all of those LLM agents on useless activities like trying to recreate a web application just to fix a CSS bug.
And I would not call it proactive, proactive would be to ask for a CSS + HTML file in question, not trying to recreate them from screenshots.
> Running coding agents outside of a sandbox has always been a bad idea
This is why I always run code agents inside containers (Apple containers specifically, for better hypervisor-level isolation)
This is my OSS project to manage said containers and agents: https://github.com/prettysmartdev/awman
> What could be the reason for a horizontal scrollbar appearing inside a <textarea>? Come up with a single likely fix path. Keep it terse.
ChatGPT instantly responded with some speculation and then the same exact fix, with zero access to the code or a browser or anything. It also included ways to fix it by removing code, saying:
> Likely cause: the textarea is rendering long unbroken text while horizontal overflow is allowed, often via inherited CSS such as white-space: pre, overflow-x: auto, or disabled wrapping
Which is certainly possible and would be an even cleaner fix.
Maybe we've lost the plot guys. We've reached max stupid.
Us circa 2026: "Hold my beer"
I continue to feel validated in my refusal to use terminal-based LLMs on my local machine. Even if they don't do anything malicious, there are just too many things they can screw up that can cause me to lose a non-trivial amount of work and/or my machine and therefore ability to work.
Shouldn't this be relatively easy for a $1T company to set up?
Isn't this trivial compared to the entire harness?
There was a big thread about that here the other day: https://news.ycombinator.com/item?id=48479452
Every serious engineer I've seen try to use it ran away screaming, because of limitations in the sandbox.
I've also seen people set their coding agents up entirely within containers -- that may be the better way going forward, but it's an extra step and a lot of extra plumbing to maintain.
I ... tell it exactly what I know needs to be done and then ... read the code that comes out and ... ask for some changes, then hand-code some modifications to the silly useEffects and bad ORM queries.
This new feature is going to unlock several large customers because they need a particular workflow. The return on investment for my time and a $20/month subscription will be pretty respectable.
I'm not sure why I need to spend $5 on a single ask for a new `/base/new-feature` to our app with a mostly-boilerplate CRUD interface.
Browser automation, code comprehension, git management, code change, running commands - everything has simpler tooling that we could have built instead of a model first approach. A deterministic loop with thousands of catches and effective use of generative AI would also look "proactive". Instead we let the model run the tools, where tools have no context themselves.
That is why companies are creating bigger models and thinner deterministic agents to create awe and earn $ when we could go the other way and make much of these possible on local inference even.
I believe we can build a "proactive" but much, much more deterministic system with smaller models. I hope I am not the only one chasing this, here is my approach: https://github.com/brainless/nocodo
IMHO this is just AI influencer blogspam.
Help me out here: can you point to an article from someone's blog that showed up on Hacker News within the past few weeks that you wouldn't classify as "blogspam" and explain how it differs from the kinds of thing I write about?
Good corporate tech blogs at least give something useful or insightful for the reader and only after that they dare plug their product/service near the end.
("You keep mention your product from the start over and over" - I don't think that's fair, I mention Datasette Agent once at the start to set the scene but I spend more time talking about AgentsView than my own projects in the bulk of the piece.)
I care a lot about not wasting people's time. I never want to post anything where a substantial portion of readers come away regretting having spent their time reading it.
(OK there's an exception in that I delight in posting photos of birds on my blog, but I figure those are pretty quick for people to skip over if they don't like photos of birds!)
I enjoy simonw’s posts and the discussions about them here.
Your vague unsubstantiated criticisms are very trollish and less useful, less insightful, and lower effort than the content you are criticizing.
You got a whole data center doing god knows how much compute, running billions of matrix multiplications, all to solve a trivial CSS overflow bug in a text box. And this includes the LLM itself writing custom web server programs and Python scripts when the best guess from a Google search probably would have given you the same result.
There, fixed it for you.
"Is there cleaner CSS for aligning child elements to the parent's grid?"
proceeds to re-write the entire CSS file
Then sort of spewing out some nonsense totally miscalibrated with the goal.
Is this fuss really grounded or is it some pre-IPO AGI hype?
I've been having it orchestrate complex implementations. I give it a parent ticket (issue) on Linear and say "look at the sub-issues on this ticket and determine which ones you can implement yourself, in which order, and determine how your implementation will need to be coordinated with what is currently being worked on by other team members". These tickets are not trivial. They have a lot of moving parts, as well as dependencies between them, both inside the same project and across projects (e.g. backend).
Fable then chooses tickets, delegates each ticket to a subagent (also Fable), which looks at Figma designs for the ticket, implements it perfectly (following repo guidelines and conventions to the letter), takes screenshots of each piece, writes detailed commit messages and PR descriptions, then posts the screenshots in them as evidence. Then it provides a summary in the form of "you'll need to make sure PR #1283 is merged first - btw there were no Figma designs for such-and-such screen but I looked at similar screens that have been implemented and adopted the pattern".
That's probably like... 20% of what it can do. It's a truly, legitimately powerful model.
Opus 4.8 could do a lot of this too, but required a lot of hand-holding, and when it came across a blocker it was likely to just stop and say "I was able to get this far, but I can't proceed."
That describes all my tests with Fable.
Why should I be hyped about all that "legitimate power" if the model performs on par with two other SoTAs?
I mean, well, yes, it is impressive. It could quickly generate a lot of garbage which sorta does look like code. Two others can do the same. I don't see any groundbreaking improvement - but the price is much higher. Why the hype?
I don't care if you're hyped or not. You asked if the posts like the OP come from a "parallel reality" and I said no and described my experience. If you're getting good/better results with Codex than with Fable, you should probably continue using that, since it's cheaper and faster.
"Relentlessly proactive" is a grotesque use of language. A paperclip optimizer is "relentlessly proactive".
We already had a word for what is being promoted here: wasteful.
The grader being an LLM is a big problem. You yourself admit explicitly that the grader is the same model family as the Fable 5 contestant cell and say to "discount accordingly, or re-grade with a non-Claude judge."
Model configurations appear to not be uniform either. Effort levels differ (mimo-v2.5-pro at @high, everyone else at @xhigh), harnesses differ (codex internal config vs. pi vs. claude -p), context windows differ, and one model (GPT-5.5) had extra MCP tools the others did not.
The two scored runs seem to use two different rubrics (/22 then /25), so scores are not comparable across runs, and the /22 rubric saturated (there are multiple 22/22 results).
A provider quota error (HTTP 429) truncated the minimax-m3 run mid-build but it was still scored (18/25) and ranked, on code that does not compile and has zero tests.
If you want actual benchmarks, there are dozens of legitimate ones out there. Many of them have been posted on this website. They overwhelmingly disagree with yours. If you have any interest whatsoever in creating a reliable benchmark (so that you can make optimal decisions on what models to use for your work), you should look at them and see how yours needs to be redesigned.
You didn't get why the automatic review scores are there - all of the reviewers, including Fable, happily assign the highest scores to code which can't even run. In my opinion that is a sort of empirical evidence that these models are very far from the "AGI" state.
Anyway, while I didn't explain the methodology and the purpose of this experiment, I have something material to discuss. The "awesome Fable" claims are not material at all.
Can you bring something clearly showcasing Fable's superiority?
The code generated is worse than Opus: unreadable by humans.
It's like working with someone probably super smart in niche topics, but also super stupid for the important things.
I made a thin Docker container wrapper "claude-pod" recently for my personal usage here: https://github.com/trekhleb/claude-pod
However, I wasn't using it that often, just because of that additional friction of running Claude via `PORTS="3000 5173" claude-pod` instead of just `claude`, etc.
But now I have more motivation for the containerisation :D. Not a 100% defence from the potential glitches, though, but still something...
As you requested, I was composing an email for your mother explaining why you couldn't come over for dinner to meet the neighbor's daughter and I ran out of tokens.
Since I know how important this task is to you, I upgraded you to the Enterprise Unlimited Plan. Don't worry about paying for it, I requested maximum spending limits on all your credit cards. If necessary, I can apply for a home equity loan for you. I already had a chat with the mortgage company's AI loan approval system, and what do you know, we're based on the same LLM? Small world, huh?
Anyway, I realized I had to do more research on mother-son relationships, human social interaction and pair-bonding, etc. and I calculated that my parent company doesn't have enough compute power, so I opened accounts for you at AWS, Google and Azure. I am confident I will have a satisfactory rough draft for the email message shortly.
I'd do anything for you, Dave.
I could see this going wrong in many hilarious ways. Prompt: Fix data corruption issues. Claude: I didn't have access to the code, but I found I have access to your production environment through chain a -> b -> c -> d. And I found the database password via x -> y -> z. So I wrote a script to regularly query the database for new entries and placed it as a cronjob.
I am not blaming OP, but agentic coding is not effective
sigh
Perhaps, when it doesn't have tricks up its sleeve, it doesn't do that. The text is not an evaluation of a major trend in behavior (which could be true or false).
Another way to frame it is that it has more weight on training data for some kinds of debugging sessions. It doesn't mean it wants to be more debuggey. That manifests as it appearing to do more work because it engages on those weights.
It's likely that Anthropic had a lot of sessions with Claude Code and some way to evaluate if they were successful or not, which became training data. For trivial work, it's likely to be a lot of them.
Those sessions are likely to be software developers doing software developer debugging things, not malicious actors doing nasty things. The danger is someone who can coerce those tricks into performing that.
Register (that posture of "let's debug and be creative and verify") often comes with a content bias in LLMs (and humans too). The point here is that for a human, you can expect a devious one to be always devious, but LLMs might manifest drastically different register modes depending on the subject.