Thanks!
I tried adding a “Clarity Coach” earlier. I think this is the sort of area where RoastMyPost probably wouldn’t have a massive advantage over custom LLM prompts directly, but we might be able to make some things easier. It would be very doable to add custom evaluators that give tips along these lines. Doing a great job would likely involve a fair bit of prompting, evaluation, and iteration. I might give it a shot, and if so will get back to you on this.
(One plus is that I’d assume this would be parallelizable, so it could at least be fast.)
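For what it’s worth, here’s a minimal sketch of how such an evaluator could fan a “Clarity Coach”-style prompt out over document chunks in parallel. This is purely illustrative: `callModel` and the prompt text are hypothetical stand-ins, not RoastMyPost’s actual internals.

```typescript
// Hypothetical sketch only: fan a "Clarity Coach"-style prompt out over document chunks
// in parallel. `callModel` is a stand-in for whatever LLM API would actually be used.
type ClarityTip = { chunkIndex: number; tip: string };

async function callModel(prompt: string): Promise<string> {
  // Placeholder response so the sketch runs end-to-end; swap in a real API call.
  return `Example clarity tip for: ${prompt.slice(0, 40)}...`;
}

async function clarityCoach(chunks: string[]): Promise<ClarityTip[]> {
  // Each chunk is evaluated independently, so the calls can run concurrently.
  return Promise.all(
    chunks.map(async (chunk, chunkIndex) => ({
      chunkIndex,
      tip: await callModel(
        `Suggest one concrete clarity improvement for this passage:\n\n${chunk}`
      ),
    }))
  );
}
```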
Thanks for reporting your findings!
As I stated here, the Fact Checker has a bunch of false positives, and you’ve noted some.
The Fact Checker (and other checkers) have trouble telling which claims are genuine and which are part of fictional scenarios, a la AI-2027.
The Fallacy Checker is overzealous, and it doesn’t use web search (which would add costs), so it will particularly make mistakes on content from after the models’ training cutoff.
There’s clearly more work to do to make better evals. Right now I recommend using this as a way to flag potential errors, and feel free to add any specific evaluator AIs that you think would be a fit for certain documents.
I experimented with Opus 4.5 a bit for the Fallacy Check. Results did seem a bit better, but costs were much higher.
I think the main way I could picture usefully spending more money is to add some agentic setup that does a deep review of a given paper and presents a summary. I could see the marginal costs of this being maybe $10 to $50 per 5k words or so, using a top model like Opus. That said, the fixed costs of doing a decent job seem frustrating, especially because we’re still lacking easy API access to existing agents (my preferred method would be a high-level Claude Code API, but that doesn’t really exist yet).
I’ve been thinking of having competitions here, for people to make their own reviews; then we could compare these with a few researchers and LLMs. I think this area could allow for a lot of cleverness and innovation.
I also added a table back to this post that gives a better summary.
There’s more going on, though that doesn’t mean it will necessarily be better for you than your system.
There are some “custom” evaluators that are basically just what you’re describing, but with specific prompts. Even in these cases, though, note that there’s extra functionality for users to re-run evaluators, see the histories of the runs, and see a specific agent’s evals across many different documents.
The “system” evaluators typically have more specific code. They have short readmes; you can see more on their pages:
https://www.roastmypost.org/evaluators/system-fact-checker
https://www.roastmypost.org/evaluators/system-fallacy-check
https://www.roastmypost.org/evaluators/system-forecast-checker
https://www.roastmypost.org/evaluators/system-link-verifier
https://www.roastmypost.org/evaluators/system-math-checker
https://www.roastmypost.org/evaluators/system-spelling-grammar
Some of this is just the tool splitting a post into chunks, then doing analysis on each chunk. Others work somewhat differently. The link verifier works without any AI.
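As a rough illustration of the non-AI approach, checking links can be as simple as issuing HTTP requests and recording the status. This is a guess at the general idea, not the actual code behind system-link-verifier.

```typescript
// Rough sketch of an AI-free link check: request each URL and record whether it resolves.
// Illustrative only; not the actual implementation behind system-link-verifier.
type LinkResult = { url: string; ok: boolean; status?: number; error?: string };

async function verifyLinks(urls: string[]): Promise<LinkResult[]> {
  return Promise.all(
    urls.map(async (url) => {
      try {
        const res = await fetch(url, { method: "HEAD", redirect: "follow" });
        return { url, ok: res.ok, status: res.status };
      } catch (e) {
        return { url, ok: false, error: String(e) };
      }
    })
  );
}
```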
One limitation of these systems is that they’re not very customizable. So if you’re making something fairly specific to your system, this might be tricky.
My quick recommendation is to run all the system evaluators on at least some docs to try them out (or just look at their outputs on other docs).
Agreed!
The workflow we have does include a step for this. The specific workflow:
1. Chunks document
2. Runs analysis on each chunk, producing one long combined list of comments.
3. Feeds all of those comments into a final step. This step sees the full post, removes a bunch of the comments, and writes a summary.
I think it could use a lot more work and tuning. Generally, I’ve found these workflows fairly tricky and time-intensive to work on so far. I assume they will get easier in the next year or so.
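For concreteness, here’s the rough shape of that chunk → analyze → filter/summarize workflow. All helper names are hypothetical, and the real steps involve far more prompting than shown.

```typescript
// Minimal sketch of the three-step shape: chunk, analyze each chunk, then a final pass
// that sees the full post, drops weak comments, and writes a summary. Hypothetical helpers.
type DraftComment = { chunkIndex: number; text: string; importance: number };

function chunkDocument(text: string, maxChars = 4000): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) chunks.push(text.slice(i, i + maxChars));
  return chunks;
}

async function analyzeChunk(chunk: string, chunkIndex: number): Promise<DraftComment[]> {
  // Placeholder: in practice an LLM call that returns candidate comments for this chunk.
  return [{ chunkIndex, text: `Example comment on chunk ${chunkIndex}`, importance: 0.5 }];
}

async function finalPass(fullPost: string, comments: DraftComment[]) {
  // Placeholder: in practice an LLM sees the full post plus all candidate comments,
  // removes a bunch of them, and writes a summary.
  const kept = comments.filter((c) => c.importance >= 0.4);
  return {
    summary: `Summary of ${kept.length} kept comments for a ${fullPost.length}-char post (placeholder)`,
    comments: kept,
  };
}

async function evaluatePost(fullPost: string) {
  const chunks = chunkDocument(fullPost);                       // step 1
  const perChunk = await Promise.all(chunks.map(analyzeChunk)); // step 2
  return finalPass(fullPost, perChunk.flat());                  // step 3
}
```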
Thanks for trying it out and reporting your findings!
It’s tricky to tune the system to flag important errors without flagging too many. Right now I’ve been focusing on the former, assuming it’s better to show too many errors than too few.
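To make that tradeoff concrete, purely as an illustration (assuming each flagged issue carries some hypothetical importance score, which is not necessarily how the system works), the tuning amounts to picking a threshold, where a low one biases toward showing too many errors rather than too few:

```typescript
// Illustrative only: filter flagged issues by an assumed importance score in [0, 1].
// A low threshold favors recall, i.e. showing too many errors rather than too few.
type FlaggedIssue = { message: string; importance: number };

function selectIssuesToShow(issues: FlaggedIssue[], threshold = 0.2): FlaggedIssue[] {
  return issues
    .filter((issue) => issue.importance >= threshold)
    .sort((a, b) => b.importance - a.importance);
}
```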
The Fact Check definitely does have mistakes (often due to the chunking, as you flagged).
The Fallacy Check is very overzealous—I scaled it back, but will continue to adjust it. I think that overall the fallacy check style is quite tricky to do, and I’ve been thinking about some much more serious approaches. If people here have ideas or implementations I’d be very curious!
The humans trusted to make decisions.
I’m hesitant to say “best humans”, because who knows how many smart people there may be out there who might luck out or something.
But “the people making decisions on this, including in key EA orgs/spending” is a much more understandable bar.
(Quick Thought)
Perhaps the goal for existing work targeting AI safety is less to ensure that AI safety happens, and more to make sure that we make AI systems that are strictly[1] better than the current researchers at figuring out what to do about AI safety.
I’m unsure how hard AI safety is. But I consider it fairly likely that, mid-term (maybe 50% of the way to TAI, in years), safe AI systems will outperform humans on AI safety strategy and the large majority of the research work.
If humans can successfully bootstrap infrastructure more capable than we are, then our main work as humans is done (though there could still be other work we can help with).
It might well be the case that the resulting AI systems would recognize that the situation is fairly hopeless. But at that point, humans have done the key things they need to do on this, hopeless or not. Our job is to set things up as best we can; more is, by definition, impossible.
Personally, I feel very doomy about humans now solving the various alignment problems of many years from now. But I feel much better about us making systems that will do a better job at guiding things than we could.
(The empirical question here is how difficult it is to automate alignment research. I realize this is a controversial and much-discussed topic. My guess is that many researchers will never defer to good AI systems, and will always hold out on considering them superior; and that, on the flip side, many people will trust AIs before they really should. Getting this right is definitely tricky.)
[1] Strictly meaning that they’re very likely better overall, not that there’s absolutely no area where humans will be better than them.
Thanks for the clarification.
> But the thing I’m most worried about is companies succeeding at “making solid services/products that work with high reliability” without actually solving the alignment problem, and then it becomes even more difficult to convince people there even is a problem as they further insulate themselves from anyone who disagrees with their hyper-niche worldview.
The way I see it, “making solid services/products that work with high reliability” is solving a lot of the alignment problem. As in, this can get us very far into making AI systems do a lot of valuable work for us with very low risk.
I imagine that you’re using a more specific definition of it than I am here.
I was thinking more of internal systems that a company would have enough faith in to deploy (a 1% chance of severe failure is pretty terrible!) or customer-facing things that would piss off customers more than scare them.
Getting these right is tremendously hard. Lots of companies are trying and mostly failing right now. There’s a ton of money in just “making solid services/products that work with high reliability.”
> Social media companies have very successfully deployed and protected their black box recommendation algorithms despite massive negative societal consequences, and the current transformer models are arguably black boxes with massive adoption.
I agree that some companies do use RL systems. However, I’d expect that most of the time, the black-box nature of some of these systems is not actively preferred. They use them despite the black-box nature, because these are specific situations where the benefits outweigh the costs, not because of them.
“Current transformer models are arguably black boxes with massive adoption” → They’re typically much less black-box than RL systems. There’s a fair bit of customization that can be done with prompting, and the prompting is generally English-readable.
Your example of “Everything Inc” is also similar to what I’m expecting. As in, I agree with:
1. The large majority of business strategy/decisions/implementation can (somewhat) quickly be done by AI systems.
2. There will be strong pressures to improve AI systems, due to (1).
That said, I’d expect:
1. The benefits are likely to be (more) distributed. Many companies will simultaneously be using AI to improve their standing. This leads to a world where there’s not a ton of marginal low-hanging fruit for any single company. I think this is broadly what’s happening now.
2. A great deal of work will go into making many of these systems reliable, predictable, corrigible, legally compliant, etc. I’d expect companies to really dislike being blindsided by AI subsystems that do bizarre things.
3. This is a longer shot, but I think there’s a lot of potential for strong cooperation between companies, organizations, and (effective) governments. A lot of the negatives of maximizing businesses come from negative externalities and the like, which can also be seen as coordination/governance failures. I’d naively expect this to mean that if power is distributed among multiple capable entities at time T, then these entities would likely wind up doing a lot of positive-sum interactions with each other. This seems good for many S&P 500 holders.
> “or anything remotely like them, to ‘Everything, Inc.’, I just can’t. They seem obviously totally inapplicable.”
This seems tough to me, but quite possible, especially as we get much stronger AI systems. I’d expect that we could (with a lot of work) have a great deal of:
1. Categorization of potential tasks into discrete/categorizable items.
2. Simulated environments that are realistic enough.
3. Innovations in finding good trade-offs between task competence and narrowness.
4. Substantially more sophisticated and powerful LLM task eval setups.
I’d expect this to be a lot of work. But at the same time, I’d expect a lot of it to be strongly commercially useful.
Thanks so much for that explanation. I’ve started to review those posts you linked to and will continue doing so later. Kudos for clearly outlining your positions, that’s a lot of content.
> “We probably mostly disagree because you’re expecting LLMs forever and I’m not.”
I agree that RL systems like AlphaZero are very scary. Personally, I was a bit more worried about AI alignment a few years ago, when this seemed like the dominant paradigm.
I wouldn’t say that I “expect LLMs forever”, but I would say that if/when LLMs are replaced, I think it’s more likely than not that they will be replaced by a system that is about as scary as LLMs, or less so. The main reason is that I think there’s a very large correlation between “not being scary” and “being commercially viable”, so I expect a lot of pressure for non-scary systems.
The scariness of RL systems like AlphaZero seems to go hand-in-hand with some really undesirable properties, such as [being a near-total black box] and [being incredibly hard to intentionally steer]. It’s definitely possible that in the future some capabilities advancement might mean that scary systems have such an intelligence/capabilities advantage that this outweighs the disadvantages, but I see this as unlikely (though definitely a thing to worry about).
> I’m not sure what you mean by “subcomponents”. Are you talking about subcomponents at the learning algorithm level, or subcomponents at the trained model level?
I’m referring to scaffolding. As in, an organization makes an “AI agent” but this agent frequently calls a long list of specific LLM+Prompt combinations for certain tasks. These subcalls might be optimized to be narrow + [low information] + [low access] + [generally friendly to humans] or similar. This can be made more advanced with a large variety of fine-tuned models, but that might be unlikely.
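To make the “narrow subcalls” idea concrete, here’s a hypothetical sketch (none of these names come from an actual system): the top-level agent only acts by routing tasks to a fixed list of narrow LLM+prompt combinations, each given limited information.

```typescript
// Hypothetical sketch of scaffolding: an "agent" that only works through a fixed list of
// narrow LLM+prompt subcalls, each seeing limited information rather than full context.
type SubcallName = "summarize" | "extract-dates" | "draft-reply";

const SUBCALL_PROMPTS: Record<SubcallName, string> = {
  "summarize": "Summarize the following text in three sentences:",
  "extract-dates": "List any dates mentioned in the following text:",
  "draft-reply": "Draft a short, polite reply to the following message:",
};

async function callModel(prompt: string): Promise<string> {
  // Placeholder response so the sketch runs; swap in a real LLM API call.
  return `(model output for: ${prompt.slice(0, 40)}...)`;
}

async function runSubcall(name: SubcallName, input: string): Promise<string> {
  // Each subcall gets only its narrow prompt plus the specific input it needs,
  // not the agent's full context, tools, or access.
  return callModel(`${SUBCALL_PROMPTS[name]}\n\n${input}`);
}

async function agent(tasks: { kind: SubcallName; input: string }[]): Promise<string[]> {
  // The top-level agent is just a router over pre-defined, narrow subcalls.
  return Promise.all(tasks.map((t) => runSubcall(t.kind, t.input)));
}
```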
“Do you think AI-empowered people / companies / governments also won’t become more like scary maximizers?” → My statements above were very focused on AI architectures / accident risk. I see people / misuse risk as a fairly distinct challenge/discussion.
I appreciate this post for working to distill a key crux in the larger debate.
Some quick thoughts:
1. I’m having a hard time understanding the “Alas, the power-seeking ruthless consequentialist AIs are still coming” intuition. It seems like a lot of people in this community have this intuition, and I feel very curious why. I appreciate this crux getting attention.
2. Personally, my stance is something more like, “It seems very feasible to create sophisticated AI architectures that don’t act as scary maximizers.” To me it seems like this is what we’re doing now, and I see some strong reasons to expect this to continue. (I realize this isn’t guaranteed, but I do think it’s pretty likely)
3. While the human analogies are interesting, I assume they might appeal more to the “consequentialist AIs are still coming” crowd than to people like myself. Humans evolved for some pretty wacky reasons and have a large number of serious failure modes. Perhaps they’re much better than some of what people imagine, but I suspect that we can make AI systems with much more rigorous safety properties in the future. I personally find histories of engineering complex systems in predictable and controllable ways to be much more informative for these challenges.
4. You mention human intrinsic motivations as a useful factor. I’d flag that in a competent and complex AI architecture, I’d expect that many subcomponents would have strong biases towards corrigibility and friendliness. This seems highly analogous to human minds, where it’s really specific sub-routines and similar that have these more altruistic motivations.
Thanks for letting me know!
I’m not very attached to the name. Some people seem to like it a lot; some dislike it.
I don’t feel very confident in how it will be most used 4-12 months from now. So my plan at this point is to wait on use, then consider renaming later as the situation gets clearer.