Hello! I work at Lightcone and like LessWrong :-)
As a general matter, Anthropic has consistently found that working with frontier AI models is an essential ingredient in developing new methods to mitigate the risk of AI.
What are some examples of work that is most largeness-loaded and most risk-preventing? My understanding is that interpretability work doesn’t need large models (though I don’t know about things like influence functions). I imagine constitutional AI does. Is that the central example, or are there other pieces that are further in this direction?
I wasn’t in this dialogue and you didn’t invite me, so being a ‘backseat participant’ feels a tad odd
Thanks for sharing this. I generally want dialogues to feel open for comment afterwards
But I don’t know if it’s complete or ongoing …
Navigating an ecosystem that might or might not be bad for the world
I like the Mark Xu & Daniel Kokotajlo thread on that post too
Yes, the standard is different for private individuals than for public officials: for private individuals it is merely “negligence” rather than “actual malice”. (https://www.dmlp.org/legal-guide/proving-fault-actual-malice-and-negligence)
My housemate and I laughed at these a lot!
Thanks! The permutation-invariance of a bunch of theories is a helpful concept
I think that means one of the following should be surprising from theoretical perspectives:
That the model learns a representation of the board state
Or that a linear probe can recover it (see the sketch below)
That the board state is used causally
Does that seem right to you? If so, which is the surprising claim?
(I am not that informed on theoretical perspectives)
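To be concrete about the linear-probe claim: a probe here is just a linear classifier fit from residual-stream activations to board state. Here’s a minimal sketch; the arrays are synthetic stand-ins of my own (the real setup would use cached Othello-GPT activations and per-square board labels, with one probe per square).

```python
# Minimal sketch of what "a linear probe can recover the board state" means.
# Everything here is a synthetic stand-in: real experiments would use cached
# Othello-GPT residual-stream activations and true per-square labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_positions, d_model = 5000, 512

acts = rng.normal(size=(n_positions, d_model))    # stand-in activations
labels = rng.integers(0, 3, size=n_positions)     # one square: 0=empty, 1=mine, 2=theirs

# A linear probe is just a (multinomial) logistic regression on the activations.
probe = LogisticRegression(max_iter=1000).fit(acts[:4000], labels[:4000])
print("held-out probe accuracy:", probe.score(acts[4000:], labels[4000:]))
# ~0.33 (chance) on this random data; the empirical surprise is that accuracy
# is high on real Othello-GPT activations.
```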
What is the work that finds the algorithmic model of the game itself for Othello? I’m aware of (but not familiar with) some interpretability work on Othello-GPT (Neel Nanda’s and Kenneth Li’s), but thought it was just about board state representations.
Adding filler tokens seems like it should always be neutral or harm a model’s performance: a fixed prefix designed to be meaningless across all tasks cannot provide any information about each task to locate the task (so no meta-learning) and cannot store any information about the in-progress task (so no amortized computation combining results from multiple forward passes).
I thought the idea was that in a single forward pass, the model has more tokens to think in. That is, the task description on its own is, say, 100 tokens long. With the filler tokens, it’s now, say, 200 tokens long. In principle, since the filler tokens themselves are useless/unnecessary, the model can just put task-relevant computation into the residual stream at those positions.
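For concreteness, here’s a minimal sketch of the comparison I have in mind: the same task prompt run with and without a block of content-free filler tokens prepended. The model choice (“gpt2”), filler string, and task are placeholders of mine, not what the filler-token experiments actually used.

```python
# Same prompt with and without ~100 filler token positions prepended.
# The extra positions give the model more residual streams in the same forward
# pass, which it could in principle use for task-relevant computation.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

task = "Q: What is 17 + 25?\nA:"
filler = " ." * 100  # content-free filler

for prompt in (task, filler + "\n" + task):
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    print(len(inputs["input_ids"][0]), "prompt tokens ->", repr(tokenizer.decode(new_tokens)))
```

Whether the extra positions actually get used that way is, of course, exactly the empirical question.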
Link?
I think the “changed my mind” Delta should have varied line widths, like https://thenounproject.com/icon/delta-43529/ (reads too much like “triangle” to me at the moment).
Two, actually
Curated. I am excited about many more distillations and expositions of relevant math on the Alignment Forum. There are a lot of things I like about this post as a distillation:
Exercises throughout. They felt like they were simple enough that they helped me internalise definitions without disrupting the flow of reading.
Pictures! This post made me start thinking of finite factorisations as hyperrectangles, and histories as dimensions that a property does not extend fully along.
Clear links from Finite Factored Sets to Pearl. I think these are roughly the same links made in the original, but they felt clearer and more orienting here.
Highlighting which of Scott’s results are the “main” results (even more than the “Fundamental Theorem” name already did).
Magdalena Wache’s engagement in the comments.
I do think the pictures became less helpful to me towards the end, and I thus have worse intuitions about the causal inference part. I’m also not sure about the emphasis of this post on causal rather than temporal inference. But I still love the post overall.
What do you mean by “its outputs are the same as its conclusions”? If I had to guess I would translate it as “PA proves the same things as are true in every model of PA”. Is that right?
What does “logically coherent” mean?
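Writing out my guess from the first question (it’s the statement you get from soundness together with Gödel’s completeness theorem):

```latex
% My guessed reading of "its outputs are the same as its conclusions":
% PA proves exactly the sentences that are true in every model of PA.
\[
  \mathrm{PA} \vdash \varphi
  \iff
  \mathcal{M} \models \varphi \ \text{ for every model } \mathcal{M} \models \mathrm{PA}.
\]
```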
I think maybe Ra is the first post about the rationalist egregores to use the term