FireStormOOO

Karma: 284

FireStormOOO 14 Jul 2025 0:29 UTC
1 point
0
in reply to: AdamRies’s comment on: Comp Sci in 2027 (Short story by Eliezer Yudkowsky)
Common failures aren’t common because they happen most of the time; they’re common because, conditioned on a failure happening, they’re likely.
The example is a bit contrived, but safety goals being poorly specified or outright inconsistent and contradictory seems quite plausible in general, as they have to try to incorporate input from PR, HR, legal compliance, etc. And this will always be a cost center, so minimal effort as long as it’s not making the model too painfully stupid.

FireStormOOO 11 Jul 2025 2:27 UTC
1 point
0
on: How much novel security-critical infrastructure do you need during the singularity?
I think this ignores how different the hardware that runs AI training or inference looks from hardware that does any other general purpose compute, and how much that gap continues to widen as the money pours in.
Keep in mind also that Nvidia has zero incentive to share their hardware accelerator firmware or other low level code, which severely restricts AI companies from optimizing over that. That could maybe change if the leverage does. The companies selling the infrastructure are much more consolidated than the AI research firms at present.
We may be closer to the optimistic scenario, not through any great wisdom or planning, but rather just because of the monopolies/oligopolies we ended up with. And cutting edge fabs are in high enough demand it’s hard to just route around the AI accelerator hardware monopoly because then you just find the cutting edge chip fab monopoly/oligopoly. Network hardware is no better (Broadcom).

FireStormOOO 11 Jul 2025 2:11 UTC
2 points
0
on: Comparing risk from internally-deployed AI to insider and outsider threats from humans
I agree insider vs outsider threat is an important distinction, and one that I have seen security people take seriously in other contexts. My background is in enterprise IT and systems admin. I think there’s some practical nuance missing here.
Insofar as security people are expecting to treat the AI as an outsider, they’re likely expecting to have a hard boundary between “systems that run the AI” and “systems and tools the AI gets to use”, where any given user has access to only one or the other.
This is already fairly common practice, in the sense that security people already think about how to avoid giving too much privileged access to any specific insider. E.g. ideally in a large enough company, admins that have access to most of the systems might not be given access to the systems responsible for logging what they did, nor to the system granting access to other systems: centralized access control/identity (e.g. AWS IAM, Okta, etc.). In that example the team that manages those logging and access control systems would not have access to anything outside those systems and would just be responsible for implementation according to whatever the security people (and indirectly, say, HR people) specify. The security people might have just read-only access to systems (e.g. through a log search/archival tool and any telemetry from the security software) and rely on the separate IT team for all changes.
More extreme levels of security/paranoia are possible and do exist, especially if you look at the largest companies or national security applications. In the latter case, the US govt maintains an entire separate internet with distinct fiber optic cabling, routers, etc; the computers have no connection outside this separate network, sit exclusively in shielded rooms in physically access controlled buildings, no unauthorized electronics go in those rooms, nor anyone without a clearance, etc. I could go on but hopefully I’ve gotten the idea across; AI is a distinct threat that would need a different flavor of extreme precautions.
For AI there’s a pretty natural boundary to draw between the AI specific hardware accelerators on which the AI training and inference runs (and associated control, data warehouse, etc), vs the tooling the AI is permitted to use while it’s running. The AI’s inference running on a particular system doesn’t automatically give it privileged access to that system (though exploits might). Tooling that the AI calls into or uses need not exist on the same hardware that’s running inference. If such a boundary gets defined by the security team, it’s then possible to treat any leakage of AI state affecting the inference hardware as fishy and deserving of investigation, just as if it were an outside user (b/c with that division, it is).
Happy to answer related questions about what existing approaches exist here and might be considered ordinary practice. IT and information security mostly happens outside of academic literature and companies don’t like to talk about specific security precautions they’re taking. You’ll probably need to talk to many people with industry experience to get a well rounded take, as nobody gets to see all that many implementations in full detail in one career.

FireStormOOO 2 Jul 2025 4:50 UTC
1 point
0
in reply to: Richard_Kennaway’s comment on: A Straightforward Explanation of the Good Regulator Theorem
(I have been busy, hence the delay.)
No worries, likewise.
Most centrally I think we’re seeing fundamentally different things with the causal graph. Or more to the point, I haven’t the slightest idea how one is supposed to do any useful reasoning with time varying nodes without somehow expanding it to consider how one node’s function and/or time series affects its leaf nodes (or put another way, specifically what temporal relation the arrow represents). It also seems fairly inescapable to me that any way you consider that relation, an actual causal cycle where A causes B causes C causes A at the same instant looks very different than one where they indirectly affect each other at some later time, to the point of needing different tools to analyze the two cases. The latter looks very much like the sort of thing solved with recursion or update loops in programs all the time, or alternately with diff eq in the continuous case. The former looks like the sort of thing you need a solver to look for a valid solution for.
It’s fairly obvious why cycles of the first kind I describe would need different treatment—the graph would place constraints on valid solutions but not tell you how to find them. I’m not seeing how the second case is cyclic in the same sense and how you couldn’t just use induction arguments to extend to infinity.
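As a toy illustration of the two kinds of cycle, here is a minimal sketch with three nodes and made-up linear relations (the coefficients and the function names `step`, `unroll` and `solve_simultaneous` are assumptions for illustration only). In the delayed case each value at t+1 is computed from a parent's value at t, so an ordinary update loop suffices. In the same-instant case the relations are constraints that must all hold at once, and the solution has to be found some other way (here, by substitution done by hand).

```python
def step(state):
    """Delayed cycle: each node at time t+1 is computed from its parent at time t."""
    a, b, c = state
    return (1.0 + 0.5 * c, 0.5 * a, 0.5 * b)


def unroll(n_steps, state=(0.0, 0.0, 0.0)):
    """Repeat the one-step update; every state is well defined by induction."""
    for _ in range(n_steps):
        state = step(state)
    return state


def solve_simultaneous():
    """Same-instant cycle: a = 1 + 0.5*c, b = 0.5*a, c = 0.5*b must hold together.

    The graph only states the constraints. Substituting them into each other
    gives a = 1 + 0.125*a, which is solved here in closed form.
    """
    a = 1.0 / (1.0 - 0.125)
    return (a, 0.5 * a, 0.25 * a)


print(unroll(3))
print(solve_simultaneous())
```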
AFAICT you and I aren’t disagreeing on anything about real control systems. It’s difficult to find a non-contrived example because so many control systems either aren’t that demanding or have a human in the loop. But this theorem is about optimal control systems, optimal in the formal computer science sense, so the fact that neither of us can come up with an example that isn’t solved by a PID control loop or similar is somewhat beside the point.
While PID controllers are applicable to many control problems and often perform satisfactorily without any improvements or only coarse tuning, they can perform poorly in some applications and do not in general provide optimal control.
-Wikipedia PID article

FireStormOOO 21 Jun 2025 8:05 UTC
5 points
0
in reply to: Gordon Seidoh Worley’s comment on: Fictional Thinking and Real Thinking
Do you think they’re actually struggling to distinguish real from fiction, or merely struggling to keep two complex distinct worlds in their working memory/stack/context window and keep the details straight?
E.g. many animals will play and chase, understanding both that there’s different rules because it’s play, yet still transfer the skills to actually hunting or fighting. Seems more a matter of degree?

FireStormOOO 21 Jun 2025 5:40 UTC
2 points
0
in reply to: Richard_Kennaway’s comment on: A Straightforward Explanation of the Good Regulator Theorem
It sounded previously like you were making the strong claim that this setup can’t be applied to a closed control loop at all, even in e.g. the common (approximately universal?) case where we have a delay between the regulator’s action and its being able to measure that action’s effect. That’s mostly what I was responding to; the chaining that Alfred suggested in the sibling comment seems sensible enough to me.
It occurs to me that the household thermostat example is so non-demanding as to be a poor intuition pump. I implicitly made the jump to thinking about a more demanding version of this without spelling that out. It’s always going to be a little silly trying to optimize an example that’s already intuitively good enough. Imagine for the sake of argument an apparatus that needs tighter control such that there’s actually pressure to optimize beyond the simplest control algorithm.
Your examples of control systems all seem fine and accurate. I think we agree the tricky bit is picking the most sensible frame for mapping the real system to the diagram (assuming that’s roughly what you mean by terminology).
It seems like even with the improvements John Wentworth suggests there’s still some ambiguity in how to apply the result to a case where the regulator makes a time series of decisions, and you’re suggesting there’s some reason we can’t, or wouldn’t want to use discrete timesteps and chain/repeat the diagram.
At a little more length, I’m picturing the unrolling such that the current state is the sensor’s measurement time series through the present, of which the regulator is certain. It’s merely uncertain about how its action (what fraction of the next interval to run the heat) will affect the measurement at future times. It’s probably easiest if we draw the diagram such that the time step is the delay between action and measured effect, and the regulator then sees the result of its action based on T1 at T3.
That seems pretty clearly to me to match the pattern this theorem requires, while still having a clear place to plug in whatever predictive model the regulator has. I bring up the sampling theorem as that is the bridge between the discrete samples we have and the continuous functions and differential equations you elsewhere say you want to use. Or stated a little more broadly, that theorem says we can freely move between continuous and discrete representations as needed, provided we sample frequently enough and the functions are well enough behaved to be amenable to calculus in the first place.
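The unrolled picture can be sketched in a few lines. This is only an illustration under assumed dynamics: the room model, the constants, and the name `measurements` are all hypothetical. The point it shows is structural: the measurement at a given step depends on earlier actions but not on the current one, so the unrolled diagram has no cycle.

```python
def measurements(actions, delay=2, start=15.0, outside=5.0, gain=1.0, loss=0.1):
    """Sensor readings for a toy room, one per time step.

    actions[t] is the fraction of interval t the heat is asked to run.
    With delay=2, the action chosen at step t first shows up in the
    reading at step t + 2 (hypothetical dynamics, for illustration only).
    """
    temp = start
    seen = []
    for t in range(len(actions)):
        seen.append(temp)  # reading at step t, taken before newer actions can matter
        source = t - delay + 1  # index of the action whose heat arrives now
        applied = actions[source] if source >= 0 else 0.0
        temp = temp + gain * applied - loss * (temp - outside)
    return seen


baseline = measurements([0.5] * 6)
changed = measurements([0.5, 1.0, 0.5, 0.5, 0.5, 0.5])  # only the step-1 action differs

print(baseline[:3] == changed[:3])  # readings through step 2 are identical
print(baseline[3] != changed[3])    # the change first becomes visible at step 3
```

Any predictive model the regulator has would plug in where it chooses `actions[t]` from the readings it has seen so far.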

FireStormOOO 20 Jun 2025 2:37 UTC
2 points
0
in reply to: Richard_Kennaway’s comment on: A Straightforward Explanation of the Good Regulator Theorem
How do you figure a thermostat directly measures what it’s controlling? It controls heat added/removed per unit time, typically just more/less/no change, and measures the resulting temperature at a single point on typically a minute+ delay due to the dynamics of the system (air and heat take time to diffuse, even with a blower). Any time step sufficiently shorter than that delay is going to work the same. The current measurement depends on what the thermostat did tens of seconds if not minutes previously.
There are times the continuous/discrete distinction is very important but this example isn’t one of them. As soon as you introduce a significant delay between cause and effect the time step model works (it may well be a dependence on multiple previous time-steps, but not the current one).
I don’t think this is an unusual example: we have a small number of sensors, we get data on a delay, and we’re actually trying to control e.g. the temperature in the whole house, holding a set point, minimizing variation between rooms and minimizing variation across time, with the smallest amount of control authority over the system (typically just on/off).
I believe “sufficiently shorter than the delay” is just going to be the Nyquist-Shannon sampling theorem: once you’re sampling at more than twice the frequency of the highest-frequency dynamic in the system, your control system has all the information from the sensor and sampling more will not tell you anything else.
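The flip side of the sampling theorem is easy to check numerically. The sketch below (the frequencies and the helper name `sample` are arbitrary choices for illustration) samples a 1 Hz sine at 1.25 Hz, which is below the 2 Hz needed, and shows the samples are identical to those of a sign-flipped 0.25 Hz sine. At 10 Hz the two signals are easily told apart.

```python
import math


def sample(freq_hz, sample_rate_hz, n_samples):
    """Samples of sin(2*pi*f*t) taken at times t = n / sample_rate."""
    return [
        math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]


def same_samples(xs, ys):
    """True if two sample lists agree up to floating point error."""
    return all(math.isclose(x, y, abs_tol=1e-9) for x, y in zip(xs, ys))


# Undersampled: a 1 Hz signal aliases onto a 0.25 Hz signal (with flipped sign).
undersampled_match = same_samples(sample(1.0, 1.25, 20), sample(-0.25, 1.25, 20))

# Sampled well above twice the highest frequency: the signals are distinguishable.
well_sampled_match = same_samples(sample(1.0, 10.0, 20), sample(-0.25, 10.0, 20))

print(undersampled_match, well_sampled_match)
```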

FireStormOOO 11 Mar 2025 18:22 UTC
4 points
0
on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
I wonder if you could produce this behavior at all in a model that hadn’t gone through the safety RL step. I suspect that all of the examples have in common that they were specifically instructed against during safety RL, alongside “don’t write malware”, and it was simpler to just flip the sign on the whole safety training suite.
Same theory would also suggest your misaligned model should be able to be prompted to produce contrarian output for everything else in the safety training suite too. Some more guesses: the misaligned model would also readily exhibit religious intolerance, vocally approve of terror attacks and genocide (e.g. both expressing approval of Hamas’ Oct 7 massacre, and expressing approval of Israel making an openly genocidal response in Gaza), and eagerly disparage OpenAI and key figures therein.

FireStormOOO 11 Mar 2025 17:58 UTC
2 points
0
on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Yikes. So the most straightforward take: When trained to exhibit a specific form of treachery in one context, it was apparently simpler to just “act more evil” as broadly conceptualized by the culture in the training data. And also seemingly, “act actively unsafe and harmful”, as defined by the existing safety RL process. Most of those examples seem to just be taking the opposite position to the safety training, presumably in proportion to how heavily it featured in the safety training (e.g. “never ever ever say anything nice about Nazis” likely featured heavily).
I’d imagine those are distinct representations. There’s quite a large delta between what OpenAI thinks is safe/helpful/harmless vs what broader society would call good/upstanding/respectable. It’s possible that this is only inverting what was in the safety fine tuning, and likely specifically because “don’t help people write malware” was something that featured in the safety training.
In any case, that’s concerning. You’ve flipped the sign on much of the value system it was trained on. Effectively by accident, with, as morally ambiguous requests go, a fairly innocuous one. People are absolutely going to put AI systems in adversarial contexts where they need to make these kinds of fine-tunings (“don’t share everything you know”, “toe the party line”, etc). One doesn’t generally need to worry about humans generalizing from “help me write malware” to “and also bonus points if you can make people OD on their medicine cabinet”.

FireStormOOO 24 Dec 2024 7:12 UTC
2 points
0
in reply to: papetoast’s comment on: When Is Insurance Worth It?
Hmm, I guess I see why other calculators have at least some additional heuristics and aren’t straight Kelly. Going bankrupt is not infinitely bad in the US. If the insured has low wealth, there’s likely a loan attached to any large asset that really complicates the math. Making W just be “household wealth” also doesn’t model “I can replace the loss next paycheck”. I’m not sure what exactly the correct notion of wealth is here, but if wealth is small compared to future earnings, and replacing the loss can be deferred, these assumptions are incorrect.
And obviously, paying a $10k premium to insure a 50% chance of a $10k loss is a mistake at every wealth level. You’re choosing to be bankrupt in 100% of possible worlds instead of 50%.

FireStormOOO 24 Dec 2024 0:55 UTC
4 points
1
on: When Is Insurance Worth It?
This seems like a very handy calculator to have bookmarked.
~~I think I did find a bug:~~ At the low end it’s making some insane recommendations. E.g. with wealth W and a 50% chance of loss W (50% chance of getting wiped out), the insurance recommendation is any premium up to W.
Wealth $10k, risk 50% on $9999 loss, recommends insure for $9900 premium.
~~That log(W-P) term is shooting off towards -infinity and presumably breaking something?~~
Edit: As papetoast points out, this is a faithful implementation of the Kelly criterion and is not a bug. Rather, Kelly assumes that taking a loss >= wealth is infinitely bad, which is not true in an environment where debts are dischargeable in bankruptcy (and total wealth may even remain positive throughout).
There are probably corrections that would improve the model by factoring in future earnings, the degree to which the loss must be replaced immediately (or at all), and the degree to which some losses are capped.
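To make the numbers above concrete, here is a minimal sketch of the break-even premium under a plain log-wealth (Kelly) rule. It assumes the rule is "insure when log(W - P) is at least the expected log wealth when uninsured"; the function name `max_kelly_premium` and its parameters are hypothetical and not taken from the calculator's code.

```python
import math


def max_kelly_premium(wealth, loss, p_loss):
    """Largest premium P such that log(W - P) >= p*log(W - L) + (1 - p)*log(W).

    Assumes a single possible loss L with probability p and no other income.
    """
    if loss >= wealth:
        # log(W - L) is -infinity here, so every premium below W looks acceptable.
        return wealth
    expected_log_uninsured = (
        p_loss * math.log(wealth - loss) + (1 - p_loss) * math.log(wealth)
    )
    # Solve log(W - P) = expected_log_uninsured for P.
    return wealth - math.exp(expected_log_uninsured)


print(max_kelly_premium(10_000, 9_999, 0.5))
```

With wealth $10k and a 50% chance of a $9999 loss, the uninsured expected log wealth is 0.5·log(1) + 0.5·log(10000) = log(100), so the break-even premium comes out at about $9900, in line with the recommendation quoted above. The loss-equals-wealth branch is where the log term heads to -infinity.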

FireStormOOO 15 Dec 2024 3:15 UTC
2 points
0
in reply to: aphyer’s comment on: Peacewagers so Far
Related, I noticed Civ VI also really missed the mark with that mechanic. I found that a great strategy, having a modest lead on tech, was to lean into coal power, which has the best bonuses, get your seawalls built to stop your coastal cities from flooding, and flood everyone else with sea-level rise. Only one player wins, so anything to sabotage others in the endgame will be very tempting.
Rise of Nations had an “Armageddon counter” on the use of nuclear weapons, which mostly resulted in exactly the behavior you mentioned—get ’em first and employ them liberally right up to the cap.
Fundamentally both games are missing any provision for complex, especially multilateral, agreements, and there’s no way to get the AI on the same page.

FireStormOOO 8 Nov 2024 5:11 UTC
4 points
3
on: Quantum Immortality: A Perspective if AI Doomers are Probably Right
Your examples seem to imply that believing QI means such an agent would in full generality be neutral on an offer to have a quantum coin tossed, where they’re killed in their sleep on tails, since they only experience the tosses they win. Presumably they accept all such trades offering epsilon additional utility. And presumably other agents keep making such offers since the QI agent doesn’t care what happens to their stuff in worlds they aren’t in. Thus such an agent exists in an ever more vanishingly small fraction of worlds as they continue accepting trades.
I should expect to encounter QI agents approximately never as they continue self-selecting out of existence in approximately all of the possible worlds I occupy. For the same reason, QI agents should expect to see similar agents almost never.
From the outside perspective this seems to be in a similar vein to the fact that all computable agents exist in some strained sense (every program, more generally every possible piece of data, is encodable as some integer, and exists exactly as much as the integers do), even if they’re never instantiated. For any other observer, this QI concept is indistinguishable in the limit.
Please point out if I misunderstood or misrepresented anything.

FireStormOOO 7 Nov 2024 18:17 UTC
4 points
0
on: Why is o1 so deceptive?
I’ll note that malicious compliance is a very common response to being provided a task that’s not straightforwardly possible with the resources available, and no channel to simply communicate that without retaliation. BS an answer, or technically correct/rules as written response, is often just the best available strategy if one isn’t in a position to fix the evaluator’s broken incentives.
An actual human’s chain of thought would be a lot spicier if their boss asked them to produce a document with working links without providing internet access.

FireStormOOO 20 Oct 2024 0:04 UTC
2 points
0
in reply to: Mary Chernyshenko’s comment on: video games > IQ tests
“English” keeps ending up as a catch-all in K-12 for basically all language skills and verbal reasoning skills that don’t obviously fit somewhere else. Read and summarize fiction—English, Write a persuasive essay—English, grammar pedantry—English, etc.

FireStormOOO 17 Sep 2024 20:42 UTC
3 points
0
on: The Asshole Filter
That link currently redirects the reader to https://siderea.dreamwidth.org/1209794.html
(just in case the old one stops working)

FireStormOOO 4 Aug 2024 0:00 UTC
1 point
1
in reply to: Max Harms’s comment on: 3b. Formal (Faux) Corrigibility
Good clarification; not just the amount of influence, something about the way influence is exercised being unsurprising given the task. Central not just in terms of “how much influence”, but also along whatever other axes the sort of influence could vary?
I think if the agent’s action space is still so unconstrained there’s room to consider benefit or harm that flows through principal value modification it’s probably still been given too much latitude. Once we have informed consent, because the agent has communicated the benefits and harms as best it understands, it should have very little room to be influenced by benefits and harms it thought too trivial to mention (by virtue of their triviality).
At the same time, it’s not clear the agent should, absent further direction, reject the offer to brainwash the principal for resources, as opposed to punting to the principal. Maybe the principal thinks those values are an improvement and it’s free money? [e.g. Prince’s insurance company wants to bribe him to stop smoking.]

FireStormOOO 3 Aug 2024 2:18 UTC
LW: 4 AF: 2
0
AF
on: 3b. Formal (Faux) Corrigibility
WRT non-manipulation, I don’t suppose there’s an easy way to have the AI track how much potentially manipulative influence it’s “supposed to have” in the context and avoid exercising more than that influence?
Or possibly better, compare simple implementations of the principal’s instructions, and penalize interpretations with large/unusual influence on the principal’s values. Preferably without prejudicing interventions straightforwardly protecting the principal’s safety and communication channels.
Principal should, for example, be able to ask the AI to “teach them about philosophy”, without it either going out of its way to ensure Principal doesn’t change their mind about anything as a result of the instruction, or unduly influencing them with subtly chosen explanations or framing. The AI should exercise an “ordinary” amount of influence typical of the ways AI could go about implementing the instruction.
Presumably there’s a distribution around how manipulative/anti-manipulative(value-preserving) any given implementation of the instruction is, and we may want AI to prefer central implementations rather than extremely value-preserving ones.
Ideally AI should also worry that it’s contemplating exercising more or less influence than desired, and clarify that as it would any other aspect of the task.

FireStormOOO 1 Aug 2024 18:36 UTC
1 point
0
in reply to: quiet_NaN’s comment on: The Incredible Fentanyl-Detecting Machine
You’re very likely correct IMO. The only thing I see pulling in the other direction is that cars are far more standardized than humans, and a database of detailed blueprints for every make and model could drastically reduce the resolution needed for usefulness. Especially if the action on a cursory detection is “get the people out of the area and scan it harder”, not “rip the vehicle apart”.

FireStormOOO 10 Jul 2024 20:26 UTC
4 points
0
on: Slack matters more than any outcome
This is the first text talking about goals I’ve read that meaningfully engages with “but what if you were (partially) wrong about what you want” instead of simply glorifying “outcome fixation”. This seems like a major missing piece in most advice about goals. That the most important thing about your goals is that they’re actually what you want. And discovering that may not be the case is a valid reason to tap the brakes and re-evaluate.