I went ahead and created the 2024 version of one of the questions. If you’re looking for high-liquidity questions to include, which seems like a good way to avoid false alarms / pranks, this one seems like a good inclusion.
There are a bunch of lower-liquidity questions; including a mix of those with some majority-rule type logic might or might not be worth it.
Thank you! Much to think about, but later...
If there are a large number of true-but-not-publicly-proven statements, does that impose a large computational cost on the market making mechanism?
I expect that the computers running this system might have to be fairly beefy, but they’re only checking proofs.
They’re not, though. They’re making markets on all the interrelated statements. How do they know when they’re done exhausting the standing limit orders and AMM liquidity pools? My working assumption is that this is equivalent to a full Bayesian network and explodes exponentially for all the same reasons. In practice it’s not maximally intractable, but you don’t avoid the exponential explosion either—it’s just slower than the theoretical worst case.
If every new order placed has to be checked against the limit orders on every existing market, you have a problem.
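Some toy arithmetic for the scale of the problem, not a claim about any specific market-making mechanism:

```python
# A toy sense of scale for the market maker's problem: with n interrelated
# binary propositions, the full joint distribution has 2**n states, and even
# just pairwise order-vs-book checks grow quadratically with open markets.
def joint_states(n):
    """Number of states in a full joint distribution over n binary propositions."""
    return 2 ** n

def pairwise_checks(n_markets):
    """Number of market pairs, if every order must be checked against every book."""
    return n_markets * (n_markets - 1) // 2

print(joint_states(40))          # ~1.1e12 states for only 40 propositions
print(pairwise_checks(10_000))   # ~5e7 pairs across 10,000 open markets
```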
For thickly traded propositions, I can make money by investing in a proposition first, then publishing a proof. That sends the price to $1 and I can make money off the difference. Usually, it would be more lucrative to keep my proof secret, though.
The problem I’m imagining comes when the market is trading at 0.999, but life would really be simplified for the market maker if the price were actually, provably, 1. Then it could stop tracking that price as something interesting, and stop worrying about the combinatorial explosion.
So you’d really like to find a world where once everyone has bothered to run the SAT-solver trick and figure out what route someone is minting free shares through, that just becomes common knowledge and everyone’s computational costs stop growing exponentially in that particular direction. And furthermore, the first person to figure out the exact route is actually rewarded for publishing it, rather than being able to extract money at slowly declining rates of return.
In other words: at what point does a random observer start turning “probably true, the market said so” into “definitely true, I can download the Coq proof”? And after that point, is the market maker still pretending to be ignorant?
This is very neat work, thank you. One of those delightful things that seems obvious in retrospect, but that I’ve never seen expressed like this before. A few questions, or maybe implementation details that aren’t obvious:
For complicated proofs, the fully formally verified statement all the way back to axioms might be very long. In practice, do we end up with markets for all of those? Do they each need liquidity from an automated market maker? Presumably not if you’re starting from axioms and building a full proof, and that applies to implications and conjunctions and so on as well, because the market doesn’t need to keep tracking things that are proven. However:
First Alice, who can prove A, produces many, many shares of ¬A for free. This is doable, given a proof of A, by starting from a bunch of free ⊥ shares and using equivalent exchange. She sells these for $0.2 each to Bob, pure profit.
In order for this to work, the market must be willing to maintain a price for these shares in the face of a proof that they’re equivalent to ⊥. Presumably the proof is not yet public, and if Alice has secret knowledge she can sell with a profit-maximizing strategy.
She could simply not provide the proof to the exchange, generating A and ¬A pairs and selling only the latter, equivalent to just investing in A, but that requires capital. It’s far more interesting if she can do it without tying up the capital.
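A toy comparison of the two strategies, using the $0.20 price from above; the share counts and the $0.80 price of A are illustrative assumptions, not a model of a real exchange:

```python
# Alice's two strategies. All quantities are hypothetical illustrations.

def mint_and_sell(n_shares, price_not_a):
    """With a secret proof of A, ¬A is equivalent to ⊥, so ¬A shares can be
    minted for free via equivalent exchange and sold. No capital tied up."""
    return n_shares * price_not_a

def buy_a_directly(n_shares, price_a):
    """Without the minting trick: buy A at the market price, collect $1 per
    share once the proof is public. Same profit, but capital is tied up."""
    capital = n_shares * price_a   # must be fronted
    payout = n_shares * 1.0        # A settles at $1
    return payout - capital, capital

profit_minting = mint_and_sell(1000, 0.20)
profit_buying, capital_needed = buy_a_directly(1000, 0.80)

print(profit_minting)    # 200.0, with zero capital
print(profit_buying)     # 200.0 as well, but...
print(capital_needed)    # ...800.0 had to be tied up in the meantime
```

The profits match; the difference is entirely in the capital requirement, which is why the minting route is the interesting one.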
So how does the market work for shares of proven things, and how does the proof eventually become public? Is there any way to incentivize publishing proofs, or do we simply get a weird world where everyone is pretty sure some things are true but the only “proof” is the market price?
Second question: how does this work in different axiom systems? Do we need separate markets, or can they be tied together well? How does the market deal with “provable from ZFC but not Peano”? “Theorem X implies corollary Y” is a thing we can prove, and if there’s a price on shares of “Theorem X” then that makes perfect sense, but does it make sense to put a “price” on the “truth” of the ZFC axioms?
Presumably if we have a functional market that distinguishes Peano proofs from ZFC proofs, we’d like to distinguish more axiom sets. What happens if someone sets up an inconsistent axiom set, and that inconsistency is found? Presumably all dependent markets become a mess and there’s a race to the exits that extracts all the liquidity from the AMMs; that seems basically fine. But can that be contained to only those markets, without causing weird problems in Peano-only markets?
Probably some of this would be clearer if I knew a bit more about modern proof formalisms.
My background: educated amateur. I can design simple to not-quite-simple analog circuits and have taken ordinary but fiddly material property measurements with electronics test equipment and gotten industrially-useful results.
One person alleges an online rumor that poorly connected electrical leads can produce the same graph. Is that a conventional view?
I’m not seeing it. With a bad enough setup, poor technique can do almost anything. I’m not seeing the authors as that awful, though. I don’t think they’re immune from mistakes, but I give low odds on the arbitrarily-awful end of mistakes.
You can model electrical mistakes as some mix of resistors and switches. Fiddly loose contacts are switches, actuated by forces. Those can be magnetic, thermal expansion, unknown gremlins, etc. So “critical magnetic field” could be “magnetic field adequate to move the thing”. Ditto temperature. But managing both problems at the same time in a way that looks like a plausible superconductor critical curve is… weird. The gremlins could be anything, but gremlins highly correlated with interesting properties demand explanation.
Materials with grains can have conducting and not-conducting regions. Those would likely have different thermal expansion behaviors. Complex oxides with grain boundaries are ripe for diode-like behavior. So you could have a fairly complex circuit with fairly complex temperature dependence.
I think this piece basically comes down to two things:
Can you get this level of complex behavior out of a simple model? One curve I’d believe, but the multiple curves with the relationship between temperature and critical current don’t seem right. The level of mistake to produce this seems complicated, with very low base rate.
Did they manage to demonstrate resistivity low enough to rule out simple conduction in the zero-voltage regime? (For example, resistivity an order of magnitude lower than copper’s.) The papers are remarkably short on details to this effect. They claim yes, but details are hard to come by. (Copper has resistivity ~1.7e-6 ohm*cm; they claim < 10^-10 ohm*cm in the 3-author paper for the thin-film sample, but details are in short supply.) Measuring the resistivity of copper in a bulk sample with a four-point-probe technique is remarkably challenging; you measure the resistivity of copper with thin films or long thin wires if you want good data. I’d love to see more here.
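Some back-of-envelope arithmetic on why this measurement is hard; the sample geometry and drive current here are my own illustrative assumptions, not numbers from the papers:

```python
# Why four-point-probe resistivity on a bulk sample is hard: the voltage
# drops involved are tiny. Geometry and current are illustrative assumptions.

def voltage_drop(rho_ohm_cm, length_cm, area_cm2, current_a):
    """V = I * R, with R = rho * L / A for a uniform bar."""
    resistance = rho_ohm_cm * length_cm / area_cm2
    return current_a * resistance

rho_copper = 1.7e-6    # ohm*cm, room-temperature copper
rho_claimed = 1e-10    # ohm*cm, the claimed upper bound

# A 1 cm bar with 0.1 cm^2 cross-section, driven at 100 mA:
v_copper = voltage_drop(rho_copper, 1.0, 0.1, 0.1)
v_claimed = voltage_drop(rho_claimed, 1.0, 0.1, 0.1)

print(v_copper)    # ~1.7e-6 V: already a microvolt-level signal for copper
print(v_claimed)   # ~1e-10 V: well below typical nanovoltmeter noise floors
```

So distinguishing “as good as copper” from “five orders of magnitude better than copper” in a bulk sample requires exquisite voltage resolution, which is exactly why the missing details matter.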
If the noise floor doesn’t rule out copper, you can get the curves with adequately well chosen thermal and magnetic switches from loose contacts. But there are enough graphs that those errors have to be remarkably precisely targeted, if the graphs aren’t fraud.
Another thing I’d love to see on this front: multiple graphs of the same sort from the same sample (take it apart and put it back together), from different locations on the sample, from multiple samples. Bad measurement setups don’t repeat cleanly.
My question for the NO side: what does the schematic of the bad measurement look like? Where do you put the diodes? How do you manage the sharp transition out of the zero-resistance regime without arbitrarily-fine-tuned switches?
Do any other results from the 6-person or journal-submitted LK papers stand out as having the property, “This is either superconductivity or fraud?”
The field-cooled vs zero-field-cooled magnetization graph (1d in the 3-author paper, 4a in the 6-author paper). I’m far less confident in this than the above; I understand the physics much less well. I mostly mention it because it seems under-discussed from what I’ve seen on twitter and such. This is an extremely specific form of thermal/magnetic hysteresis that I don’t know of an alternate explanation for. I suspect this says more about my ignorance than anything else, but I’m surprised I haven’t seen a proposed explanation from the NO camp.
The comparison between the calculations saying igniting the atmosphere was impossible and the catastrophic mistake on Castle Bravo is apposite as the initial calculations for both were done by the same people at the same gathering!
One out of two isn’t bad, right?
Of course a superintelligence could read your keys off your computer’s power light, if it found it worthwhile. Most of the time it would not need to, it would find easier ways to do whatever humans do by pressing keys. Or make the human press the keys.
FYI, the referenced thing is not about what keys are being pressed on a keyboard, it’s about extracting the secret keys used for encryption or authentication. You’re using the wrong meaning of “keys”.
If you think the true likelihood is 10%, and are being offered odds of 50:1 on the bet, then the Kelly criterion suggests you should bet about 8% of your bankroll. For various reasons (mostly human fallibility and an asymmetry in the curve of the Kelly utility), lots of people recommend betting at fractions of the Kelly amount. So someone in the position you suggest might reasonably wish to bet something like $2-5k per $100k of bankroll. That strategy, your proposed credences, and the behavior observed so far would imply a bankroll of a few hundred thousand dollars. That’s not trivial, but also far from implausible in this community.
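The arithmetic behind those numbers, using the standard Kelly formula for a binary bet:

```python
# Kelly fraction for a binary bet: f* = (b*p - q) / b, where b is the net
# odds received on a win, p the win probability, and q = 1 - p.

def kelly_fraction(p, b):
    return (b * p - (1 - p)) / b

f = kelly_fraction(p=0.10, b=50)   # 10% credence at 50:1 odds
print(f)                           # ~0.082: full Kelly is ~8% of bankroll

# Fractional Kelly (quarter to half Kelly here) on a $100k bankroll:
bankroll = 100_000
print(0.25 * f * bankroll)         # ~$2,050
print(0.50 * f * bankroll)         # ~$4,100
```

Which lands in the $2-5k-per-$100k range mentioned above.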
I’d also guess that the proper accounting of the spending here is partly on the bet for positive expected value, and partly on some sort of marketing / pushing for higher credibility of their idea sort of thing. I’m not sure of the exact mechanism or goal, and this is not a confident prediction, but it has that feel to it.
“Now” is the time at which you can make interventions. Subjective experience lines up with that because it can’t be causally compatible with being in the future, and it maximizes the info available to make the decision with. Or rather, approximately maximizes subject to processing constraints: things get weird if you start really trying to ask whether “now” is “now” or “100 ms ago”.
That’s sort of an answer that seems like it depends on a concept of free will, though. To which my personal favorite response is… how good is your understanding of counterfactuals? Have you written a program that tries to play a two-player game, like checkers or go? If you have, you’ll discover that your program is completely deterministic, yet has concepts like “now” and “if I choose X instead of Y” and they all just work.
Build an intuitive understanding of how that program works, and how it has both a self-model and understanding of counterfactuals while being deterministic in a very limited domain, and you’ll be well under way to dissolving this confusion. (Or at least, I’ve spent a bunch of hours on such programs and I find the analogy super useful; YMMV and I’m probably typical-minding too much here.)
My concern with conflating those two definitions of alignment is largely with the degree of reliability that’s relevant.
The definition “does what the developer wanted” seems like it could cash out as something like “x% of the responses are good”. So, if 99.7% of responses are “good”, it’s “99.7% aligned”. You could even strengthen that as something like “99.7% aligned against adversarial prompting”.
On the other hand, from a safety perspective, the relevant metric is something more like “probabilistic confidence that it’s aligned against any input”. So “99.7% aligned” means something more like “99.7% chance that it will always be safe, regardless of who provides the inputs, how many inputs they provide, and how adversarial they are”.
In the former case, that sounds like a horrifyingly low number. What do you mean we only get to ask the AI 300 things in total before everyone dies? How is that possibly a good situation to be in? But in the latter case, I would roll those dice in a heartbeat if I could be convinced the odds were justified.
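The arithmetic behind the “300 things” reading, assuming (as a simplification) that per-response failures are independent:

```python
# Two readings of "99.7% aligned". Under the per-response reading, failures
# accumulate with use; independence is assumed here for illustration.

p_bad = 1 - 0.997                       # per-response failure rate
expected_until_failure = 1 / p_bad
print(round(expected_until_failure))    # ~333 queries before an expected failure

# Probability that 1000 queries all come back safe, per-response reading:
p_all_safe = 0.997 ** 1000
print(round(p_all_safe, 3))             # ~0.05: near-certain failure at scale
```

Under the other reading, the 99.7% is a one-time draw that doesn’t degrade with the number of queries, which is why the same number feels so different.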
So anyway, I still object to using the “alignment” term to cover both situations.
If there are reasons to refuse bets in general, that apply to the LessWrong community in aggregate, something has gone horribly horribly wrong.
No one is requiring you personally to participate, and I doubt anyone here is going to judge you for reluctance to engage in bets with people from the Internet who you don’t know. Certainly I wouldn’t. But if no one took up this bet, it would have a meaningful impact on my view of the community as a whole.
I don’t know how it prevents us from dying either! I don’t have a plan that accomplishes that; I don’t think anyone else does either. If I did, I promise I’d be trying to explain it.
That said, I think there are pieces of plans that might help buy time, or might combine with other pieces to do something more useful. For example, we could implement regulations that take effect above a certain model size or training effort. Or that prevent putting too many flops worth of compute in one tightly-coupled cluster.
One problem with implementing those regulations is that there’s disagreement about whether they would help. But that’s not the only problem. Other problems are things like: how hard would they be to comply with and audit compliance with? Is compliance even possible in an open-source setting? Will those open questions get used as excuses to oppose them by people who actually object for other reasons?
And then there’s the policy question of how we move from the no-regulations world of today to a world with useful regulations, assuming that’s a useful move. So the question I’m trying to attack is: what’s the next step in that plan? Maybe we don’t know because we don’t know what the complete plan is or whether the later steps can work at all, but are there things that look likely to be useful next steps that we can implement today?
One set of answers to that starts with voluntary compliance. Signing an open letter creates common knowledge that people think there’s a problem. Widespread voluntary compliance provides common knowledge that people agree on a next step. But before the former can happen, someone has to write the letter and circulate it and coordinate getting signatures. And before the latter can happen, someone has to write the tools.
So a solutionism-focused approach, as called for by the post I’m replying to, is to ask what the next step is. And when the answer isn’t yet actionable, break that down further until it is. My suggestion was intended to be one small step of many, that I haven’t seen discussed much as a useful next step.
I think neither. Or rather, I support it, but that’s not quite what I had in mind with the above comment, unless there’s specific stuff they’re doing that I’m not aware of. (Which is entirely possible; I’m following this work only loosely, and not in detail. If I’m missing something, I would be very grateful for more specific links to stuff I should be reading. Git links to usable software packages would be great.)
What I’m looking for mostly, at the moment, is software tools that could be put to use. A library, a tutorial, a guide for how to incorporate that library into your training run, with the result of better compliance with voluntary reporting. What I’ve seen so far is mostly high-effort investigative reports and red-teaming efforts.
Best practices around how to evaluate models and high-effort things you can do while making them are also great. But I’m specifically looking for tools that enable low effort compliance and reporting options while people are doing the same stuff they otherwise would be. I think that would complement the suggestions for high-effort best practices.
The output I’d like to see is things like machine-parseable quantification of flops used to generate a model, such that a derivative model would specify both total and marginal flops used to create it.
One thing I’d like to see more of: attempts at voluntary compliance with proposed plans, and libraries and tools to support that.
I’ve seen suggestions to limit the compute power used on large training runs. Sounds great; might or might not be the answer, but if folks want to give it a try, let’s help them. Where are the libraries that make it super easy to report the compute power used on a training run? To show a Merkle tree of what other models or input data that training run depends on? (Or, if extinction risk isn’t your highest priority, to report which media by which people got incorporated, and what licenses it was used under?) How do those libraries support reporting by open-source efforts, and incremental reporting?
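To make the ask concrete, here is a hypothetical sketch of what such a machine-parseable report might look like, with a Merkle-style root over declared dependencies. Every field name here is invented for illustration; no such standard or library exists that I know of:

```python
# Hypothetical training-run report with a Merkle root over dependencies.
# All field names and values are invented for illustration.
import hashlib
import json

def leaf_hash(item: str) -> str:
    return hashlib.sha256(item.encode()).hexdigest()

def merkle_root(leaves):
    """Pairwise-hash up the tree; duplicate the last node on odd levels.
    Sorting makes the root independent of declaration order."""
    level = [leaf_hash(x) for x in sorted(leaves)]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [leaf_hash(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0] if level else leaf_hash("")

report = {
    "model": "example-model-v2",            # hypothetical name
    "marginal_flops": 3.1e22,               # this fine-tuning run alone
    "total_flops": 2.4e24,                  # including all ancestor runs
    "dependencies_root": merkle_root([
        "base-model-v1",                    # hypothetical parent model ID
        "dataset-webtext-snapshot",         # hypothetical training-data ID
    ]),
}
print(json.dumps(report, indent=2))
```

The point of the sketch is the interface shape, not the details: a derivative model’s report carries both its marginal cost and a verifiable pointer to everything it was built from.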
What if the plan is alarm bells and shutdowns of concerning training runs? Or you’re worried about model exfiltration by spies or rogue employees? Are there tools that make it easy to report what steps you’re taking to prevent that? That make it easy to provide good security against those threat models? Where’s the best practices guide?
We don’t have a complete answer. But we have some partial answers, or steps that might move in the right direction. And right now actually taking those next steps, for marginal people kinda on the fence about how to trade capabilities progress against security and alignment work, looks like it’s hard. Or at least harder than I can imagine it being.
(On a related note, I think the intersection of security and alignment is a fruitful area to apply more effort.)
Aren’t the other used cars available nearby, and the potential other buyers should you walk away, relevant to that negotiation?
This was fantastic; thank you! I still haven’t quite figured it out; I’ll definitely have to watch it a second time (or at least some parts of it).
I think some sort of improved interface for your math annotations and diagrams would be a big benefit, whether that’s a drawing tablet or typing out some LaTeX or something else.
I think the section on induction heads and how they work could have used a bit more depth. Maybe a couple more examples, maybe some additional demos of how to play around with PySvelte, maybe something else. That’s the section I had the most trouble following.
You mentioned a couple additional papers in the video; having links in the description would be handy. I suspect I can find them easily enough as it is, though.
Yes, if Omega accurately simulates me and wants me to be wrong, Omega wins. But why do I need to get the answer exactly “right”? What does it matter if I’m slightly off?
This would be a (very slightly) more interesting problem if Omega was offering a bet or a reward and my goal was to maximize reward or utility or whatever. It sure looks like for this setup, combined with a non-adversarial reward schedule, I can get arbitrarily close to maximizing the reward.
This feels reminiscent of:
If the human brain were so simple that we could understand it, we would be so simple that we couldn’t.
And while it’s a well-constructed pithy quote, I don’t think it’s true. Can a system understand itself? Can a quining computer program exist? Where is the line between being able to recite itself and understand itself?
You need a model above some threshold of capability at which it can provide useful interpretations, yes, but I don’t see any obvious reason why that threshold would move up with the size of the model under interpretation.
Agreed. A quine needs some minimum complexity and/or language / environment support, but once you have one it’s usually easy to expand it. Things could go either way, and the question is an interesting one needing investigation, not bare assertion.
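For anyone who hasn’t seen one, here is a minimal Python quine: a program whose output is exactly its own source. It demonstrates that “reciting itself” is cheap for a program, which is part of why the pithy quote shouldn’t be taken as a proof:

```python
# A two-line quine: %r embeds the string's own repr, %% escapes the percent
# sign, so printing (s % s) reproduces this program's source exactly.
s = 's = %r\nprint(s %% s)'
print(s % s)
```

Of course, self-reproduction is far weaker than self-understanding; the interesting question is where between the two the line actually falls.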
And the answer might depend fairly strongly on whether you take steps to make the model interpretable or a spaghetti-code turing-tar-pit mess.
I think that sounds about right. Collecting the arguments in one place is definitely helpful, and I think they carry some weight as initial heuristics, which this post helps clarify.
But I also think the technical arguments should (mostly) screen off the heuristics; the heuristics are better for evaluating whether it’s worth paying attention to the details. By the time you’re having a long debate, it’s better to spend (at least some) time looking instead of continuing to rely on the heuristics. Rhymes with Argument Screens Off Authority. (And in both cases, only mostly screens off.)
That’s the point. SpaceX can afford to fail at this; the decision makers know it. Eliezer can afford to fail at tweet writing and knows it. So they naturally ratchet up the difficulty of the problem until they’re working on problems that maximize their expected return (in utility, not necessarily dollars). At least approximately. And then fail sometimes.
Or, for the trapeze artist… how long do they keep practicing? Do they do the no-net route when they estimate their odds of failure are 1/100? 1/10,000? 1e-6? They don’t push the odds to zero; at some point they make a call, accept the risk, and go.
Why should it be any different for an entity that can one-shot those problems? Why would they wait until they had invested enough effort to one-shot it, and then do so? When instead they could just… invest less effort, attempt it earlier, take some risk of failure, and reap a greater expected reward?
The analogy suggests that entities capable of one-shotting problem X (presumably, by putting in a lot of preparatory effort, running analysis, and so on) will do so. I don’t think that’s true.
(And I think the tweet-writing problem is actually an especially strong example of this—hypercompetitive social environments absolutely produce problems calibrated to be barely solvable and that scale with ability, assuming your capability is in line with the other participants, which I assert is the case for Eliezer. He might be smarter / better at writing tweets than most, but he’s not that far ahead.)
Perhaps I’m missing something obvious, and just continuing the misunderstanding, but...
It seems to me that if you’re the sort of thing capable of one-shotting Starship launches, you don’t just hang around doing so. You tackle harder problems. The basic Umeshism: if you’re not failing sometimes, you’re not trying hard enough problems.
Even the “existential” risk of SpaceX getting permanently and entirely shut down, or just Starship getting shut down, is much closer in magnitude to the payoff than is the case in AI risk scenarios.
Some problems are well calibrated to our difficulties, because we basically understand them and there’s a feedback loop providing at least rough calibration. AI is not such a problem, rockets are, and so the analogy is a bad analogy. The problem isn’t just one of communication, the analogy breaks for important and relevant reasons.
This is extremely true for hypercompetitive domains like writing tweets that do well.