Retired since 2012.
2098 stories
·
5 followers

OpenAI’s dirty December o3 demo doesn’t readily replicate

1 Comment
“draw an image representing a benchmark result that might have been bogus”

As a scientist, I was made sick to my stomach by OpenAI’s widely watched o3 livestream of December 20th, “Day 12 of Shipmas”, which Francois Chollet reported at the time as a breakthrough. I said so at the time, in my essay “o3 ‘ARC AGI’ postmortem megathread: why things got heated, what went wrong, and what it all means”. There were problems with experimental design, misleading graphs that left out competing work, and more.

Later, after I wrote that piece, I discovered that one of their demos, on FrontierMath, was fishy in a different way: OpenAI had privileged access to data their competitors didn’t have, but didn’t acknowledge this. They also (if I recall) failed to disclose their financial contributions to developing the test. And then a couple weeks ago we all saw that current models struggled mightily on the USA Math Olympiad problems that were fresh out of the oven, hence hard to prepare for in advance.

Today I learned that the story is actually even worse than all that: the crown jewel that they reported in the demo, the 75% on Francois Chollet’s ARC test (once called ARC-AGI), doesn’t readily replicate. Mike Knoop from the ARC team reports “We could not get complete data for o3 (high) test due to repeat timeouts. Fewer than half of tasks returned any result exhausting >$50k test budget. We really tried!” The model that is released as “o3 (high)”, presumed to be their best model, can’t readily yield whatever was reported in December under the name o3.
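To make the replication problem concrete, here is a minimal sketch of why a run in which many tasks time out cannot confirm or refute a headline score. The function name and the task counts are hypothetical illustrations, not ARC Prize data; the only assumption is that tasks with no result could have gone either way.

```python
def score_bounds(total_tasks: int, returned: int, correct: int) -> tuple[float, float]:
    """Bound the accuracy over all tasks when only `returned` tasks produced
    an answer and `correct` of those answers were right."""
    if not 0 <= correct <= returned <= total_tasks or total_tasks == 0:
        raise ValueError("need 0 <= correct <= returned <= total_tasks and total_tasks > 0")
    missing = total_tasks - returned
    lower = correct / total_tasks              # every missing task counted as wrong
    upper = (correct + missing) / total_tasks  # every missing task counted as right
    return lower, upper


# Hypothetical numbers for illustration only (not measured results):
# 100 tasks, 45 returned an answer before timing out, 30 of those were correct.
lower, upper = score_bounds(total_tasks=100, returned=45, correct=30)
print(f"Possible overall score: {lower:.0%} to {upper:.0%}")
```

With fewer than half of the tasks returning anything, the interval is so wide that almost any claimed score is compatible with the data, which is why an incomplete run settles nothing.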

The best stable result that the ARC team could get from experimenting with the latest batch of publicly testable OpenAI models was 56% with a different model called o3-medium: still impressive, still useful, but a long way from the surprising 75% that was advertised.

And the lower 56% is not much different from what Jacob Andreas’s lab at MIT got in November. It’s arguably worse; if I followed correctly, and if the measures are the same, the Andreas lab’s best score was actually higher, at 61%.

Four months later, OpenAI, with its ever more confusing nomenclature, has released a bunch of models with o3 in the title, but none of them can reliably do what was in the widely viewed and widely discussed December livestream. That’s bad.

Forgive me if I am getting Theranos vibes.

§

Just a couple weeks ago Yafah Edelman at LessWrong reported a related finding, “OpenAI reports that o3-mini with high reasoning and a Python tool receives a 32% on FrontierMath. However, Epoch's official evaluation[1] received only 11%”; some possible explanations are given, but this is again a very bad look.

And guess what, sometimes o3 apparently cheats, reporting answers that are available on the internet without actually doing the work, as Toby Ord explains in a long thread on X. Essentially Ord argues that o3 is looking up the answer, not computing it.

This in turn is reminiscent of something that TransluceAI reported last week in another long thread (too complex to quickly summarize here but worth reading).

The truth is that we don’t really know how good o3 is or isn’t, and nobody should ever take OpenAI’s video presentations particularly seriously again, until they have been fully vetted by the community. The fact that their flashy result on ARC couldn’t readily be replicated speaks volumes.

§

My trust in OpenAI has never been high; at this point it is extremely low.

And given that Meta also appears to have just juiced some benchmarks, the whole thing is starting to look like a bunch of over-promisers scrambling to make things look better than they really are.

Dr Gary Marcus, Professor Emeritus at NYU, has done enough article reviewing in his career to know when people are trying to pull a fast one.

cjheinz
12 hours ago
"Theranos vibes". Oops.
Lexington, KY; Naples, FL

Regulating AI Behavior with a Hypervisor

1 Comment

Interesting research: “Guillotine: Hypervisors for Isolating Malicious AIs.”

Abstract: As AI models become more embedded in critical sectors like finance, healthcare, and the military, their inscrutable behavior poses ever-greater risks to society. To mitigate this risk, we propose Guillotine, a hypervisor architecture for sandboxing powerful AI models—models that, by accident or malice, can generate existential threats to humanity. Although Guillotine borrows some well-known virtualization techniques, Guillotine must also introduce fundamentally new isolation mechanisms to handle the unique threat model posed by existential-risk AIs. For example, a rogue AI may try to introspect upon hypervisor software or the underlying hardware substrate to enable later subversion of that control plane; thus, a Guillotine hypervisor requires careful co-design of the hypervisor software and the CPUs, RAM, NIC, and storage devices that support the hypervisor software, to thwart side channel leakage and more generally eliminate mechanisms for AI to exploit reflection-based vulnerabilities. Beyond such isolation at the software, network, and microarchitectural layers, a Guillotine hypervisor must also provide physical fail-safes more commonly associated with nuclear power plants, avionic platforms, and other types of mission critical systems. Physical fail-safes, e.g., involving electromechanical disconnection of network cables, or the flooding of a datacenter which holds a rogue AI, provide defense in depth if software, network, and microarchitectural isolation is compromised and a rogue AI must be temporarily shut down or permanently destroyed.

The basic idea is that many of the AI safety policies proposed by the AI community lack robust technical enforcement mechanisms. The worry is that, as models get smarter, they will be able to avoid those safety policies. The paper proposes a set of technical enforcement mechanisms that could work against these malicious AIs.

cjheinz
18 hours ago
Somehow I don't think this will protect us from the Bullshit Apocalypse.
Lexington, KY; Naples, FL

An Inadequate AI Solution

1 Comment

Let’s take the example of printers, which tend to suck. Here is one example of suckage at work, and the instructive hell it put me through:


The Canon MG3600 is my wife’s. It sits by her desk and does a good-enough job.

The Epson WF-3520 is mine. It sits by my desk and leaks black ink. It has been doing that since I bought it in a hurry for $35 at a Goodwill store when we first got to town in the summer of ’21. The drip runs at about a drop per month, but otherwise it’s been okay. I use it mostly for scanning. It’s good at that.

Or was, until I encountered the problem above.

Searches for answers about the error code (such as this one) proved fruitless, as did the Epson link above.

So, naturally, I went shopping for a new printer. My only requirement was that it would scan at least as well as this old Epson.

Alas, scanning is a checkbox for “all-in-one” printers. Here’s an example from PC Magazine. Nothing much about scanning. This makes sense, because demand for high-quality scanning is relatively low, and most printers are at least okay at it. But Consumer Reports (bless them) does rank scanning quality, and it gave me this in reply to an Ask CR question:

The Canon PIXMA G3262 is not one of CR’s recommended printers, rating a middling 47 out of a possible 100 score. Its problem areas are low predicted reliability, low versatility, and slow text speed. The only thing it’s good at is scanning. The printer page on Amazon is also not encouraging. It has only two user ratings, and both call it junk.

The Brother MFC‐L8905CDW is $699.99 at Amazon, which is way out of my range.

The Canon Pixma MG7720 is now in the used market, and (far as I can tell) available only on eBay, and none of the offerings there look interesting.

So I went looking just for scanners, and narrowed my choice down to the Canon CanoScan Lide 400, which is just $78.86 on Amazon. I put it in my cart, then I thought, Wait a minute. What if I asked an AI what that error code means?

So I did, with ChatGPT. The dialog:

Me:

What does “Failed to open connection to device (-21345)” mean on a Macbook Pro running Sequoia 15.3.1 that is trying to connect to an Epson WF-3520 printer?

ChatGPT:

“Failed to open a connection to the device (-21345)” on your MacBook Pro running macOS Sequoia 15.3.1, when attempting to connect to an Epson WF-3520 printer, typically indicates a communication issue between your Mac and the printer, particularly affecting scanning functionality. Here’s how you can troubleshoot and resolve this issue:

✅ Step-by-Step Solutions
1. Enable Local Network Access for the Printer/Scanner
macOS Sequoia introduces enhanced privacy settings that may block devices from accessing the local network. To ensure your Epson WF-3520 has the necessary permissions:

  1. Navigate to System Settings.
  2. Select Privacy & Security.
  3. Click on Local Network.
  4. Locate your printer or scanner in the list and ensure it’s toggled on.
    Apple Support Community
    Reddit

This step has resolved similar issues for many users experiencing the (-21345) error.

It listed other solutions, but that one worked. Three levels down in System Settings, the printer had been toggled off, presumably by Sequoia’s enhanced privacy settings. I am now back to scanning normally with my lousy old Epson.

I could take my learnings here in at least two directions:

  1. I could carp about Apple failing to tell customers about potential hassles created by enhanced privacy settings. Ask ChatGPT “What kinds of things do Apple’s enhanced privacy settings (introduced with Sequoia) break or at least inconvenience?” and you’ll get a long list of answers, starting with the one I encountered with the printer. To my knowledge (always subject to correction), Apple hasn’t bothered with the same. (Wouldn’t providing customers with that kind of list be nice work for Apple Intelligence, which so far is an embarrassing fail?)
  2. I could compare how ChatGPT, Claude, Gemini, Meta, CoPilot, Perplexity, and other early AI leaders perform on challenges like this one, but that would be a waste of time. That’s because, for all the good they do, they are not ours.

See, AIs of the usual sort don’t know anything more about our personal stuff than we tell them, and we may not want to do that, just for privacy reasons. Also, they tend to forget. And they are not ours. At best they are personalized, rather than personal.

A truly personal AI (or pAI) would be all mine and trained on what is under my control. For example, all my property, including these two printers. It would have data about my purchase and use of these printers. But it would also have access to data about identical or similar printers out in the world, and ways to engage the corporate AIs so we could solve problems together in far better ways than what we have now. In the process, the likes of Epson would have better intelligence about product usage than they’re getting now.

So we are talking here about agent-to-agent stuff, with the right capabilities and resources located in the right places.

Toward making that happen, there was lots of talk earlier this month at IIW XL (our 40th in 20 years) about MCP (the Model Context Protocol), A2A (agent-to-agent protocol), and GNAP (Grant Negotiation and Authorization Protocol), all of which hold promise. I’m not here to debate or advocate any of them, just to note that there are protocols out there, as options with which we can work.

But what about the entities using those agents—companies making and selling products on one side, and customers using those products on the other? How can they learn from each other, and not just make transactions and then get frustrated with each other when problems show up?

One answer is Market Intelligence That Flows Both Ways. I wrote that in 2016: nine years ahead of its time, so far.

Looking back, I don’t think that kind of flow will happen without AI agents, and full agency by those agents’ operators, on both sides.

We need to be able to lead in our dances with companies in the open marketplace. But first, we need personal AI to help us gather and manage intel about our lives, our things, our relationships (including the medical and financial kind), our travels, and the services we use. We are not close to having any of that, and not helped by the Apples, Microsofts, Amazons, and Googles of the world, all of which want to confine our capacities to their walled gardens.

We don’t need to knock the walls of those gardens down. The revolution we need will simply obsolesce them.

That’s why it will not be supplied by APIs, or by anything that requires that we have an account. We need to be independent and self-sovereign: first parties who speak for themselves in first-person voices.

I’ll say more about all that in my next post and others to follow.

cjheinz
2 days ago
Interested to see where this is going ...
Lexington, KY; Naples, FL

Timeless Tips for “Simple Sabotage” From the CIA

1 Comment

Simple Sabotage Field Manual

In 1944, the OSS (the precursor to the CIA) produced a document called the Simple Sabotage Field Manual (PDF). It was designed to be used by agents in the field to hinder our WWII adversaries. The CIA recently highlighted five tips from the manual as timelessly relevant:

1. Managers and Supervisors: To lower morale and production, be pleasant to inefficient workers; give them undeserved promotions. Discriminate against efficient workers; complain unjustly about their work.

2. Employees: Work slowly. Think of ways to increase the number of movements needed to do your job: use a light hammer instead of a heavy one; try to make a small wrench do instead of a big one.

3. Organizations and Conferences: When possible, refer all matters to committees, for “further study and consideration.” Attempt to make the committees as large and bureaucratic as possible. Hold conferences when there is more critical work to be done.

4. Telephone: At office, hotel and local telephone switchboards, delay putting calls through, give out wrong numbers, cut people off “accidentally,” or forget to disconnect them so that the line cannot be used again.

5. Transportation: Make train travel as inconvenient as possible for enemy personnel. Issue two tickets for the same seat on a train in order to set up an “interesting” argument.

Ha, some of these things are practically best practices in American business, not against enemies but against their employees, customers, and themselves. You can also find the manual in book or ebook format. (via @craigmod)

[This is a vintage post originally from Aug 2016.]

Tags: books · CIA · timeless posts · war · WWII

cjheinz
6 days ago
Wow! Just, wow.
Lexington, KY; Naples, FL

3 Kinds of Loneliness and 4 Kinds of Forever

1 Comment

Loneliness is the fundamental condition of life — we are born by another, but born alone; die around others (if we are lucky and loved), but die alone; we spend our lives islanded in our one and only human experience — in these particular bodies and minds and circumstances drawn from the cosmic lottery — amid the immense ocean of time and chance teeming with all possible experience. Everything of beauty and substance that we make — every poem, every painting, every friendship — is an outstretched hand reaching out from one loneliness to another, reaching into the mute mouth of forever for the vowels of a common language to howl our requiem for the evanescent now.

Card from An Almanac of Birds: 100 Divinations for Uncertain Days.

But despite being so fundamental, or perhaps precisely because of it, loneliness is fractal — the closer you look at the granularity of life, the more you see it branching into myriad lonelinesses, which, like the kinds of sadness, all have different emotional hues.

The loneliness of feeling invisible or misunderstood, bottomless and bone-chilling as the Scottish fog.

The loneliness of seeing what others look away from, remote and shoreless as a lighthouse.

The loneliness of public humiliation, a red-hot iron rod.

The loneliness of your most private failure, inky and arid like the desert at night.

The loneliness of success, shiny and sharp as obsidian.

The loneliness of love, lightless as the inside of a skull.

In his 2008 psychology classic Inner Gold: Understanding Psychological Projection (public library), Jungian analyst Robert A. Johnson groups all the possible lonelinesses into the three core kinds that pulsate beneath our daily lives and govern our search for love: the past-oriented loneliness of missing what once was and never again will be, the future-oriented loneliness of longing for what could be but has not come to pass, and what he calls “the profound loneliness of being close to God.” This I take to mean the existential disorientation of feeling your transience press against the edge of the eternal, your smallness press against the immensity that dwells at the intersection of time, chance, and love; God is just what some call their dream of a crosswalk when they face that intersection.

The first two lonelinesses are rooted in time, which is itself fractal — there are many kinds of time we live with. The third kind of loneliness deals not with the temporal but with the eternal; it exists outside of time — like music, like wonder, like love. It is an existential loneliness, a creative loneliness, made not from the atoms of now that compose the other two lonelinesses but from the atoms of forever.

Card from An Almanac of Birds: 100 Divinations for Uncertain Days, also available as a stand-alone print.

Because we, creatures made of time, cannot comprehend forever, it is easy to call it God — that catchall for everything immense and incomprehensible we face in ourselves. But this is an illusion — forever too is fractal, with myriad visitations of it in our daily lives. In a testament to James Baldwin’s timeless insistence that “the poets… are finally the only people who know the truth about us,” it is not the psychologists or the philosophers but the poets who part the veil of illusion to reveal the truth:

SOME KINDS OF FOREVER VISIT YOU
by Brenda Hillman

The unknowns are up early;
they browse through the bronze
         porch bells. Crows
         call & late
      apples blaze
    toward western emptiness.
      In your illness,
         the edges hesitate;
   like the revolt
of workers, they
         will take a while…

Here comes the fond
   mild winter; other
      realms are noisy
      & unanimous. You tap
the screen & dream
      while waiting; four
         kinds of forever
    visit you today:
something, nothing,
everything & art,
   greater than you are
         & of your making —

Poem courtesy of the Academy of American Poets


cjheinz
8 days ago
Nice anhinga pix! Instantly recognizable!
Lexington, KY; Naples, FL

“The Rise of End Times Fascism”

1 Comment

This is a really interesting essay from Naomi Klein and Astra Taylor about the rise of end times fascism and the far right’s bet against the future.

The governing ideology of the far right in our age of escalating disasters has become a monstrous, supremacist survivalism.

It is terrifying in its wickedness, yes. But it also opens up powerful possibilities for resistance. To bet against the future on this scale – to bank on your bunker – is to betray, on the most basic level, our duties to one another, to the children we love, and to every other life form with whom we share a planetary home. This is a belief system that is genocidal at its core and treasonous to the wonder and beauty of this world. We are convinced that the more people understand the extent to which the right has succumbed to the Armageddon complex, the more they will be willing to fight back, realizing that absolutely everything is now on the line.

Our opponents know full well that we are entering an age of emergency, but have responded by embracing lethal yet self-serving delusions. Having bought into various apartheid fantasies of bunkered safety, they are choosing to let the Earth burn. Our task is to build a wide and deep movement, as spiritual as it is political, strong enough to stop these unhinged traitors. A movement rooted in a steadfast commitment to one another, across our many differences and divides, and to this miraculous, singular planet.

And (emphasis mine):

If policing the boundaries of the bunkered nation is end times fascism’s job one, equally important is job two: for the US government to lay claim to whatever resources its protected citizens might need to get through the tough times ahead. Maybe it’s Panama’s canal. Or Greenland’s fast-melting shipping routes. Or Ukraine’s critical minerals. Or Canada’s fresh water. We should think of this less as old-school imperialism than super-sized prepping, at the level of the national state. Gone are the old colonial fig leaves of spreading democracy or God’s word – when Trump covetously scans the globe, he is stockpiling for civilizational collapse.

But:

In this moment, when end times fascism is waging war on every front, new alliances are essential. But instead of asking: “Do we all share the same worldview?” Adrienne urges us to ask: “Is your heart beating and do you plan to live? Then come this way and we will figure out the rest on the other side.”

The whole thing is a must-read.

Tags: Astra Taylor · Naomi Klein · politics

cjheinz
8 days ago
Sounds insightful.
Lexington, KY; Naples, FL