On a related note, I recently had the thought, “Wow, I feel like the quality of TurnTrout’s writing/thinking has noticeably improved over the course of 2022. Nice.” So there’s at least one independent outside observer noticing effects related to the internal changes you discuss here.
Related: just for your amusement, here’s a link to a bet about AI timelines that I won, but which I incorrectly believed I would not win before the end of 2022. In other words, evidence of me being surprised by the high rate of AI progress… Interesting, eh? https://manifold.markets/MatthewBarnett/will-a-machine-learning-model-score-f0d93ee0119b#pzSuEYIhRiXoIFSjPQz2
Exciting thoughts here! One initial thought I have is that broadness might be visualizable as a topographic (contour) map over the loss landscape, with some threshold of ‘statistically indistinguishable’ loss defining the contour lines. The final loss would then sit in a ‘basin’ with a measurable area, and that area would give a broadness metric.
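Here’s a minimal sketch of the kind of metric I’m imagining, on a toy loss function; the loss, the ‘indistinguishable’ threshold, and the sampled radii below are all placeholders I made up for illustration, not anything principled:

```python
# Rough sketch of a "basin broadness" estimate: sample random perturbations
# around a trained parameter vector and measure how large a neighborhood
# stays within a "statistically indistinguishable" loss threshold.
import numpy as np

def loss(theta):
    # Toy stand-in for a real model's loss landscape.
    return np.sum(theta ** 2) + 0.1 * np.sum(np.sin(3 * theta) ** 2)

def basin_broadness(theta_star, threshold, radii, n_samples=1000, seed=0):
    """For each radius, estimate the fraction of random perturbations whose
    loss stays within `threshold` of the loss at theta_star. The largest
    radius where that fraction stays near 1 is one crude 'broadness' score."""
    rng = np.random.default_rng(seed)
    base = loss(theta_star)
    scores = {}
    for r in radii:
        directions = rng.normal(size=(n_samples, theta_star.size))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        perturbed = theta_star + r * directions
        losses = np.array([loss(p) for p in perturbed])
        scores[r] = np.mean(losses - base < threshold)
    return scores

theta_star = np.zeros(10)  # pretend this is the trained minimum
print(basin_broadness(theta_star, threshold=0.05, radii=[0.01, 0.1, 0.5, 1.0]))
```

Sweeping the threshold would then give you the contour-map picture, with each threshold defining one contour line around the basin.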
Addendum: I do believe that there are potentially excellent synergies between various strategies. While I think the convert-nn-to-labelled-bayes-net strategy might be worth just 5/1000 on its own, it might combine multiplicatively with several other strategies, each worth a similar amount alone. So if you do have an idea for how to accomplish this conversion strategy, please don’t let this discussion deter you from posting that.
As often happens when Paul C and Nate or Eliezer debate a concept, I find myself believing that the most accurate description of the world lies somewhere between their two viewpoints. The path described by Paul is, in my mind, something like the average or modal path. And I think that’s good, because there’s some hope that more resources will get funneled into alignment research and that sufficient progress will be made there that things aren’t doomed when we hit the critical threshold. I still think there’s some non-trivial chance of a foom-ier path. It seems plausible for a safety-oblivious or malicious researcher to deliberately set up an iteratively self-improving ML system, and such a system might reach its own critical performance threshold, and foom, sooner than the ‘normal’ path would. I don’t have any solution in mind for that scenario other than to hope that the AI governance folks can convince governments to crack down on that sort of experimentation.
Progress Report 6: get the tool working
I largely agree with all these points; my minor points of disagreement are insufficient to change the overall conclusions. One point I feel should be emphasized more is that our best hope for saving humanity lies in maximizing the non-linearly-intelligence-weighted researcher hours invested in AGI safety research before the advent of the first dangerously powerful unaligned AGI. To maximize this key metric, we need to get more and smarter people doing this research, and we need to slow down AGI capabilities research. Insofar as AI governance is a tactic worth pursuing, it must pursue one or both of these specific aims. Once a dangerously powerful unaligned AGI has been launched, it’s too late for politics or social movements or anything slower than, perhaps, decisive military action prepped ahead of time (e.g. the secret AGI-prevention department hitting the detonation switch for all the secret prepared explosives in all the world’s data centers).
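To make that metric slightly more concrete, the quantity I have in mind is roughly (my own loose formalization, nothing standard): $R = \sum_i w(c_i)\, h_i$, where $h_i$ is the hours researcher $i$ puts into safety research before the first dangerously powerful unaligned AGI arrives, $c_i$ is their research capability, and $w$ is convex, so the weighting is superlinear in capability. That’s what pushes toward recruiting more (and especially smarter) researchers and toward pushing the arrival date back.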
I agree, but I personally suspect that neuralink+ is way more research hours & dollars away than unaligned dangerously powerful AGI. Not sure how to switch society over to the safer path.
I feel like there is a valid point here about how one aspect of interpretability is “Can the model report low-confidence (or no confidence) vs high-confidence appropriately?”
My intuition is that this failure mode is a bit more likely-by-default in a deep neural net than in a hand-crafted logic model. That doesn’t seem like an insurmountable challenge, but certainly something we should keep in mind.
Overall, this article and the discussion in the comments seem to boil down to: “yeah, deep neural nets are (complexity held constant) probably not a lot harder, just somewhat harder, to interpret than big Bayes net blobs.”
I think this is probably true, but it misses a critical point. The critical point is that the expansion of compute hardware and the improvement of machine learning algorithms have allowed us to generate deep neural nets that can make useful decisions in the world, but which also carry a HUGE amount of complexity.
The value of what John Wentworth is saying here, in my eyes, is that we wouldn’t have solved the interpretability problem even if we could magically transform our deep neural net into a nicely labelled billion-node Bayes net, even if every node had an accompanying plain-text description a few paragraphs long which allowed us to translate the values of that particular node fairly closely into real-world observations (i.e. it was well symbol-grounded). We’d still be overwhelmed by the complexity. Would it be ‘more’ interpretable? I’d say yes, so I’d disagree with the strong claim of ‘exactly as interpretable, complexity held constant’. Would it be enough more interpretable that it would make sense to blindly trust this enormous flowchart with critical decisions involving the fate of humanity? I’d say no.
So there are several different valid aspects of interpretability being discussed across the comments here:
Alex Khripin’s discussion of robustness (perhaps paraphrasable as ‘trustworthy outputs over all possible inputs, no matter how far out-of-training-distribution’?)
Ash Gray’s discussion of symbol grounding. I think it’s valid to say that there’s an implication that a hand-crafted or well-generated Bayes net will be reasonably well symbol-grounded. If it weren’t, I’d say it was poor quality. A deep neural net doesn’t give you this by default, but it isn’t implausible to generate that symbol grounding. That is additional work that needs to be done, though, and an additional potential point of failure. So, addressable? Probably yes, but...
DragonGod and John Wentworth discussing “complexity held the same, is the Bayes net / decision flowchart a bit more interpretable?” I’d say probably yes, but...
Steven Byrnes’ point that, challenge-level of the task held constant, a slightly less complex (fewer parameters/nodes) Bayes net could probably accomplish an equivalent quality of result. I’d say probably yes, but...
And the big ‘but’ here is that mind-bogglingly huge amount of complexity: the remaining interpretability gap between models simple enough to wrap our heads around and SOTA models well beyond our comprehension threshold. I don’t think we are anywhere near understanding these very large models well enough to trust them on s-risk (much less x-risk) level issues even on-distribution, much less to declare them ‘robust’ enough for off-distribution use. Which is a significant problem, since the big problems humanity faces tend to be inherently off-distribution: they’re about planning actions for the future, and the future is at least potentially off-distribution.
I think if we had 1000 abstract units of ‘interpretability gap’ to close before it was safe to proceed with using big models for critical decisions, my guess is that transforming the deep neural net into a fully labelled, well symbol-grounded, slightly (10%? 20%?) less complex, slightly more interpretable Bayes net would get us something like 1-5 units closer. The ‘hard assertion’ of John Wentworth’s original article (which, based on his responses to comments, I don’t think is what he intends) would say 0 units closer. The soft assertion, which I think John Wentworth would endorse and which I agree with, is something more like ‘that change alone would make only a trivial difference, even if implemented perfectly’.
Some more relevant discussion of abstraction in ML here: https://www.deepmind.com/publications/abstraction-for-deep-reinforcement-learning
Here’s some work by folks at DeepMind looking at models’ relational understanding (verbs) vs. subjects and objects. It’s kind of relevant to the type of misunderstanding CLIP tends to exhibit. https://www.deepmind.com/publications/probing-image-language-transformers-for-verb-understanding
Dang, I’ve been missing out on juicy Gwern comments! I’d better follow them on Reddit...
Thanks for this! I’ve had similar things on my mind and haven’t had a good way to communicate them to the people I’m talking with. I think this cluster of ideas around ‘autonomy’ is pointing at an important point, and one which I’m quite glad isn’t being actively explored on the forefront of ML research, actually. I do think that this is a critical piece of AGI, and that deliberate attempts at this under-explored topic would probably turn up some low-hanging fruit. I also think it would be bad if we ‘stumbled’ onto such an agent without realizing we’d done so.
I feel like ‘autonomy’ is a decent but not quite right name. Exploring for succinct ways to describe the cluster of ideas, my first thoughts are ‘temporally coherent consequentialism in pursuit of persistent goals’, or ‘long-term goal pursuit across varying tasks with online-learning of meta-strategies’?
Anyway, I think this is exactly what we, as a society, shouldn’t pursue until we’ve got a much better handle on AI alignment.
Thanks Rohin. I also feel that interviewing after my 3 more months of independent work is probably the correct call.
This is great! I agree with a lot of what you’re saying here and am glad someone is writing these ideas up. Two points of possible disagreement (or misunderstanding on my part perhaps) are:
Highly competitive, pragmatic, no-nonsense culture
I think that competitiveness can certainly be helpful, but too much can be detrimental. Specifically, I think competitiveness needs to go hand-in-hand with cooperation and transparency. Work needs to be shared, and projects need to encompass groups of people. Competing for the most ‘effective research points’ among your colleagues while you work together on a group project and communicate clearly and effectively with each other: great. Hiding your work, waiting until you have something unique and impressive before sharing anything at all, and then reporting only the minimal amount of data needed to prove you did the impressive thing instead of sharing all the details and dead-ends you encountered along the way: not good.
Long-run research track records are necessary for success
I’m not sure how ‘long term’ you mean, but I think we do need a lot of new people coming into the field, and that we don’t have multiple decades for reputations to be gradually established. In particular, I think a failing that academia is vulnerable to is outdated paradigms getting stuck in power until the high-reputation, long-established professors finally retire and no longer reject the new paradigm’s viewpoints in reviews and grant requests.
Coming from academia to industry was in a lot of ways a breath of fresh air for me, because I was working with a team that actually wanted the project (making money for our company) to succeed, rather than with individual professors who wanted their names on papers in fancy journals. That is near, but not quite matching, what the goal should be: ‘we want as much accurate knowledge about this science topic to be known as soon as possible by everyone, whether or not we get credit for it.’
Anyway, some thought as to how to get the best of both worlds (industry and academia) may be worthwhile.
I’m potentially interested in the Research Engineer position on the Alignment Team, but I’m currently 3 months into a 6-month grant from the LTFF to reorient my career from general machine learning to AI safety specifically. My current plan is to keep doing solo work until the last month of my grant period and then begin applying to AI safety work at places like Anthropic, Redwood Research, OpenAI, and DeepMind.
Do you think there’s a significant advantage to applying soon vs 3 months from now?
Yeah!
[Question] How to balance between process and outcome?
Seems to me that what CLIP needs is a secondary training regime, where the image is a 2D render of a 3D scene produced by a simulator which can also generate one correct caption and several incorrect ones. Like: red vase on a brown table (correct), blue vase on a brown table (incorrect), red vase on a green table (incorrect), red vase under a brown table (incorrect). Then do the CLIP training with the text set deliberately including these near-miss text samples in addition to the usual random incorrect caption samples. I saw this idea in a paper a few years back; I’m not sure how to find that paper now, since I can’t seem to guess the right keywords to get Google to come up with it. But there’s a lot of work related to this idea out there, for example: https://developer.nvidia.com/blog/sim2sg-generating-sim-to-real-scene-graphs-for-transfer-learning/
Do you think that would fix CLIP’s not-precisely-the-right-object-in-not-precisely-the-right-positional-relationship problem? Maybe also, if the simulated data contained labelled text, it would fix the incoherent-text problem too?
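Here’s a minimal sketch of the training setup I have in mind; the toy scene generator, the caption templates, and the stand-in encoders are all placeholders I made up for illustration (a real version would render actual images and use CLIP’s image/text towers):

```python
# Rough sketch of the hard-negative caption idea: a toy "simulator" emits a
# scene description, we build a correct caption plus structured incorrect
# ones (wrong color, wrong object, wrong relation), and compute a contrastive
# loss so the image embedding must pick the correct caption out of its own
# hard negatives, not just out of random other captions.
import random
import torch
import torch.nn.functional as F

COLORS = ["red", "blue", "green", "brown"]
OBJECTS = ["vase", "cup", "bowl"]
RELATIONS = ["on", "under", "beside"]

def sample_scene():
    return {
        "color": random.choice(COLORS),
        "obj": random.choice(OBJECTS),
        "rel": random.choice(RELATIONS),
        "surface_color": random.choice(COLORS),
    }

def caption(scene):
    return f'{scene["color"]} {scene["obj"]} {scene["rel"]} a {scene["surface_color"]} table'

def hard_negatives(scene):
    # One near-miss caption per attribute: wrong color, wrong object, wrong relation.
    negs = []
    for key, options in [("color", COLORS), ("obj", OBJECTS), ("rel", RELATIONS)]:
        wrong = random.choice([o for o in options if o != scene[key]])
        negs.append(caption({**scene, key: wrong}))
    return negs

# Stand-in encoders; a real setup would use CLIP's towers and rendered images.
image_encoder = torch.nn.Linear(16, 32)
text_encoder = torch.nn.Linear(16, 32)

def fake_image_features(scene):   # placeholder for render + preprocess
    torch.manual_seed(hash(caption(scene)) % 10_000)
    return torch.randn(16)

def fake_text_features(text):     # placeholder for tokenization
    torch.manual_seed(hash(text) % 10_000)
    return torch.randn(16)

def loss_for_scene(scene):
    """Cross-entropy over [correct caption] + [structured hard negatives]."""
    img = F.normalize(image_encoder(fake_image_features(scene)), dim=-1)
    texts = [caption(scene)] + hard_negatives(scene)
    txt = F.normalize(text_encoder(torch.stack(
        [fake_text_features(t) for t in texts])), dim=-1)
    logits = img @ txt.T / 0.07          # temperature as in CLIP
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))

print(loss_for_scene(sample_scene()))
```

The point is just that each image’s softmax is taken over its own structured hard negatives (wrong color, wrong object, wrong relation) in addition to the usual random captions from the batch.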
I like your comment and think it’s insightful about why/when to wirehead (or not).
Nitpick about your endorsed-skills point: people don’t always have high overlap between what they know and what they wish they knew or endorse others knowing. I’ve had a lifelong obsession with learning, especially with acquiring skills. Unfortunately, my next-thing-to-learn selection is very unguided, so it has been a thematic struggle in my life to stay focused on learning the things I judge to be objectively valuable. I have a huge list of skills/hobbies I think are mostly or entirely impractical or useless (e.g. artistic woodworking, paleontology), and also lots of things I’ve been thinking for years that I ought to learn better (e.g. linear algebra). I’ve been wishing for years that I had a better way to reward myself for studying things I reflectively endorse knowing, rather than wasting time/energy studying unendorsed things. In other words, I’d love a method (like Max Harms’ fictional Zen Helmets) to better align my system 1 motivations with my system 2 motivations. The hard part is figuring out how to implement this change without corrupting the system 2 values or their value-discovery-and-updating processes.