jsd

Karma: 389

jsd 4 Sep 2021 17:05 UTC
7 points
on: 18 possible meanings of “I Like Red”
Another meaning could be: I want to raise the salience of the issue ‘Red vs Not Red’, I want to convey that ‘Red vs Not Red’ is an underrated axis. I think this is also an example of level 4?

jsd 21 Oct 2021 16:13 UTC
5 points
on: Emergent modularity and safety
On one hand, Olah et al.’s (2020) investigations find circuits which implement human-comprehensible functions.
At a higher level, they also find that different branches (when the modularity is enforced already by the architecture) tend to contain different features.

jsd 21 Oct 2021 16:18 UTC
LW: 4 AF: 3
AF
in reply to: Jsevillamol’s comment on: Emergent modularity and safety
Relevant related work: NNs are surprisingly modular
https://arxiv.org/abs/2003.04881v2?ref=mlnew
I believe Richard linked to Clusterability in Neural Networks, which has superseded Pruned Neural Networks are Surprisingly Modular.
The same authors also recently published Detecting Modularity in Deep Neural Networks.

“Acquisition of Chess Knowledge in AlphaZero”: probing AZ over time

jsd 18 Nov 2021 23:24 UTC

11 points

9 comments 1 min read LW link

(arxiv.org)

jsd 9 Dec 2021 14:17 UTC
3 points
on: Some thoughts on why adversarial training might be useful
Another way adversarial training might be useful, that’s related to (1), is that it may make interpretability easier. Given that it weeds out some non-robust features, the features that remain (and the associated feature visualizations) tend to be clearer, cf e.g. Adversarial Robustness as a Prior for Learned Representations. One example of people using this is Leveraging Sparse Linear Layers for Debuggable Deep Networks (blog post, AN summary).
The above examples are from vision networks—I’d be curious about similar phenomena when adversarially training language models.

jsd 12 Jan 2022 10:22 UTC
4 points
on: The Best Software For Every Need
Software: streamlit.io
Need: making small webapps to display or visualize results
Other programs I’ve tried: R shiny, ipywidgets
I find streamlit extremely simple to use: it interoperates well with other libraries (e.g. pandas or matplotlib), and the webapps render well and are easy to share, either temporarily through ngrok or with https://share.streamlit.io/.

jsd 29 Mar 2022 4:21 UTC
6 points
in reply to: TLW’s comment on: Some thoughts about deceptive mesaoptimization
It’s an example first written about by Paul Christiano here (at the beginning of Part III).
The idea is this: suppose we want to ensure that our model has acceptable behavior even in worst-case situations. One idea would be to do adversarial training: at every step during training, train an adversary model to find inputs on which the model behaves unacceptably, and penalize the model accordingly.
If the adversary is able to uncover all the worst-case inputs, this penalization ensures we end up with a model that behaves acceptably on all inputs.
RSA-2048 is a somewhat contrived but illustrative example of how this strategy could fail:
As a simple but silly example, suppose our model works as follows:
- Pick a cryptographic puzzle (e.g. “factor RSA-2048”).
- When it sees a solution to that puzzle, it behaves badly.
- Otherwise, it behaves well.
Even if the adversary understands perfectly what this model is doing, they can’t find an input on which it will behave badly unless they can factor RSA-2048. But if deployed in the world, this model will eventually behave badly.
In particular:
Even if the adversary understands perfectly what this model is doing, they can’t find an input on which it will behave badly unless they can factor RSA-2048
This is because during training, as is the case now, we and the adversary we build are unable to factor RSA-2048.
But if deployed in the world, this model will eventually behave badly.
This is because (or assumes that) at some point in the future, a factorization of RSA-2048 will exist and become available.
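To make the failure mode concrete, here is a minimal toy sketch. It is my own illustration, not code from Paul's post: the small semiprime `MODULUS` stands in for RSA-2048, and `bounded_adversary` is a hypothetical random-search adversary with a fixed budget. The names and numbers are all assumptions chosen for illustration.

```python
import random

# Toy stand-in for RSA-2048 (assumption): a product of two known primes.
# The adversary below is treated as not knowing the factors.
PRIME_A, PRIME_B = 104729, 1299709
MODULUS = PRIME_A * PRIME_B


def solves_puzzle(x):
    """Return True if x is a nontrivial factor of MODULUS."""
    return 1 < x < MODULUS and MODULUS % x == 0


def model(x):
    """Behave badly only when shown a solution to the puzzle."""
    return "bad" if solves_puzzle(x) else "good"


def bounded_adversary(target_model, budget, seed=0):
    """Randomly search for an input that triggers bad behavior.

    Returns the offending input, or None if the budget runs out.
    """
    rng = random.Random(seed)
    for _ in range(budget):
        x = rng.randrange(2, MODULUS)
        if target_model(x) == "bad":
            return x
    return None


# During "training": the adversary searches but finds nothing to penalize.
found = bounded_adversary(model, budget=10_000)

# After "deployment": a factorization becomes available in the world.
deployed_behavior = model(PRIME_A)
```

The adversary can read `model` in full and still come up empty, because finding a bad input is exactly as hard as solving the puzzle. Once a factor shows up as an input, the model misbehaves.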

jsd 9 Apr 2022 4:44 UTC
1 point
in reply to: paulfchristiano’s comment on: [Link] A minimal viable product for alignment
Your link redirects back to this page. The quote is from one of Eliezer’s comments in Reply to Holden on Tool AI.

jsd 13 Apr 2022 17:12 UTC
4 points
on: jsd’s Shortform
I’ve been thinking about these two quotes from AXRP a lot lately:

From Richard Ngo’s interview:

Richard Ngo: Probably the main answer is just the thing I was saying before about how we want to be clear about where the work is being done in a specific alignment proposal. And it seems important to think about having something that doesn’t just shuffle the optimization pressure around, but really gives us some deeper reason to think that the problem is being solved. One example is when it comes to Paul Christiano’s work on amplification, I think one core insight that’s doing a lot of the work is that imitation can be very powerful without being equivalently dangerous. So yeah, this idea that instead of optimizing for a target, you can just optimize to be similar to humans, and that might still get you a very long way. And then another related insight that makes amplification promising is the idea that decomposing tasks can leverage human abilities in a powerful way.

Richard Ngo: Now, I don’t think that those are anywhere near complete ways of addressing the problem, but they gesture towards where the work is being done. Whereas for some other proposals, I don’t think there’s an equivalent story about what’s the deeper idea or principle that’s allowing the work to be done to solve this difficult problem.

From Paul Christiano’s interview:

Paul Christiano: And it’s nice to have a problem statement which is entirely external to the algorithm. If you want to just say, “here’s the assumption we’re making now; I want to solve that problem”, it’s great to have an assumption on the environment be your assumption. There’s some risk if you say, “Oh, our assumption is going to be that the agent’s going to internalize whatever objective we use to train it.” The definition of that assumption is stated in terms of, it’s kind of like helping yourself to some sort of magical ingredient. And, if you optimize for solving that problem, you’re going to push into a part of the space where that magical ingredient was doing a really large part of the work. Which I think is a much more dangerous dynamic. If the assumption is just on the environment, in some sense, you’re limited in how much of that you can do. You have to solve the remaining part of the problem you didn’t assume away. And I’m really scared of sub-problems which just assume that some part of the algorithm will work well, because I think you often just end up pushing an inordinate amount of the difficulty into that step.

jsd 15 Apr 2022 21:47 UTC
2 points
on: jsd’s Shortform
A few ways that StyleGAN is interesting for alignment and interpretability work:
- It was much easier to interpret than previous generative models, without trading off image quality.
- It seems like an even better example of “capturing natural abstractions” than GAN Dissection, which Wentworth mentions in Alignment By Default.
  - First, because it’s easier to map abstractions to StyleSpace directions than to go through the procedure in GAN Dissection.
  - Second, the architecture has 2 separate ways of generating diverse data: changing the style vectors, or adding noise. This captures the distinction between “natural abstraction” and “information that’s irrelevant at a distance”.
- Some interesting work was built on top of StyleGAN:
  - David Bau and colleagues started a sequence of work on rewriting and editing models with StyleGAN in Rewriting a Generative Model, before moving to image classifiers in Editing a classifier by rewriting its prediction rules and language models with ROME.
  - Explaining in Style is IMO one of the most promising interpretability methods for vision models.
However, StyleGAN is not super relevant in other ways:
- It generally works only on non-diverse data: you train StyleGAN to generate images of faces, or to generate images of churches. The space of possible faces is much smaller than e.g. the space of images that could make it into ImageNet. People recently released StyleGAN-XL, which is supposed to work well on diverse datasets such as ImageNet. I haven’t played around with it yet.
- It’s an image generation model. I’m more interested in language models, which work pretty differently. It’s not obvious how to extend StyleGAN’s architecture to build competitive yet interpretable language models. This paper tried something like this but didn’t seem super convincing (I’ve mostly skimmed it so far).

jsd 4 Sep 2022 22:06 UTC
1 point
on: The case for becoming a black-box investigator of language models
Some of the most interesting black box investigations I’ve found are Riley Goodside’s.

jsd 28 Oct 2022 15:32 UTC
7 points
0
on: Six (and a half) intuitions for KL divergence
Thanks for this post! Relatedly, Simon DeDeo had a thread on different ways the KL-divergence pops up in many fields:
Kullback-Leibler divergence has an enormous number of interpretations and uses: psychological, epistemic, thermodynamic, statistical, computational, geometrical… I am pretty sure I could teach an entire graduate seminar on it.
Psychological: an excellent predictor of where attention is directed. http://ilab.usc.edu/surprise/
Epistemic: a normative measure of where you ought to direct your experimental efforts (maximize expected model-breaking) http://www.jstor.org/stable/4623265
Thermodynamic: a measure of work you can extract from an out-of-equilibrium system as it relaxes to equilibrium.
Statistical: too many to count, but (e.g.) a measure of the failure of an approximation method. https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained
Computational (machine learning): a measure of model inefficiency—the extent to which it retains useless information. https://arxiv.org/abs/1203.3271
Computational (compression): the extent to which a compression algorithm designed for one system fails when applied to another.
Geometrical: the (non-metric!) connection when one extends differential geometry to the probability simplex.
Biological: the extent to which subsystems co-compute.
Machine learning: the basic loss function for autoencoders, deep learning, etc. (people call it the “cross-entropy”)
Algorithmic fairness. How to optimally constrain a prediction algorithm when ensuring compliance with laws on equitable treatment. https://arxiv.org/abs/1412.4643
Cultural evolution: a metric (we believe) for the study of individual exploration and innovation tasks… https://www.sciencedirect.com/science/article/pii/S0010027716302840 …
Digital humanism: Kullback-Leibler divergence is related to TFIDF, but with much nicer properties when it comes to coarse-graining. (The most distinctive words have the highest partial-KL when teasing apart documents; stopwords have the lowest) http://www.mdpi.com/1099-4300/15/6/2246
Mutual information: Well, it’s a special case of Kullback-Leibler—the extent to which you’re surprised by (arbitrary) correlations between a pair of variables if you believe they’re independent.
Statistics: it’s the underlying justification for the Akaike Information Criterion, used for model selection.
Philosophy of mind: It’s the “free energy” term in the predictive brain account of perception and consciousness. See Andy Clark’s new book or https://link.springer.com/article/10.1007%2Fs11229-017-1534-5
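A few of the items above are easy to check numerically. The sketch below is my own illustration, not part of the thread: it computes the KL divergence for small discrete distributions, then checks the cross-entropy decomposition H(p, q) = H(p) + KL(p || q) and the reading of mutual information as the KL divergence between a joint distribution and the product of its marginals. The example distributions are made up, and the functions assume q is nonzero wherever p is.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """I(X; Y) computed as KL(joint || product of marginals).

    `joint` is a 2D list with joint[i][j] = P(X = i, Y = j).
    """
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    flat_joint = [joint[i][j] for i in range(len(px)) for j in range(len(py))]
    flat_indep = [px[i] * py[j] for i in range(len(px)) for j in range(len(py))]
    return kl_divergence(flat_joint, flat_indep)


# Made-up example distributions.
p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]
correlated = [[0.4, 0.1], [0.1, 0.4]]
independent = [[0.25, 0.25], [0.25, 0.25]]

decomposition_gap = cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))
asymmetry = kl_divergence(p, q) - kl_divergence(q, p)
mi_correlated = mutual_information(correlated)
mi_independent = mutual_information(independent)
```

`decomposition_gap` should be zero up to floating-point error, `asymmetry` is nonzero (KL is not a metric), and mutual information vanishes for the independent table but not for the correlated one.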

jsd 25 Dec 2022 1:51 UTC
1 point
0
in reply to: paulfchristiano’s comment on: On sincerity
Thanks for this comment, I found it useful.

What did you want to write at the end of the penultimate paragraph?

jsd 23 Feb 2023 6:06 UTC
10 points
0
on: Please don’t throw your mind away
Thanks for this. I’ve been thinking about what to do, as well as where and with whom to live over the next few years. This post highlights important things missing from default plans.

It makes me more excited about having independence, space to think, and a close circle of trusted friends (vs being managed / managing, anxious about urgent todos, and part of a scene).

I’ve spent more time thinking about math completely unrelated to my work after reading this post.

The theoretical justifications are more subtle, and seem closer to true, than previous justifications I’ve seen for related ideas.

The dialog doesn’t overstate its case and acknowledges some tradeoffs that I think can be real—eg I do think there is some good urgent real thinking going on, that some people are a good fit for it, and can make a reasonable choice to do less serious play.

jsd 8 Mar 2023 2:12 UTC
2 points
0
on: Utilitarianism Meets Egalitarianism
I was a little bit confused about Egalitarianism not requiring (1). As an egalitarian, you may not need a full distribution over who you could be, but you do need the support of this distribution, to know what you are minimizing over?
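To spell out what I mean, here is a minimal sketch under my own reading of the post: the utilitarian score is an expectation over who you could be, and the egalitarian score is the worst-case utility. The names and numbers are hypothetical. The egalitarian score ignores the probabilities but still depends on which people are in the support.

```python
def utilitarian_value(utilities, probabilities):
    """Expected utility: needs the full distribution over who you could be."""
    return sum(probabilities[person] * u for person, u in utilities.items())


def egalitarian_value(utilities, support):
    """Worst-case utility: needs only the set of people you could be."""
    return min(utilities[person] for person in support)


# Hypothetical utilities for one outcome.
utilities = {"alice": 10.0, "bob": 2.0, "carol": 6.0}

# The probabilities matter for the utilitarian score...
probabilities = {"alice": 0.5, "bob": 0.25, "carol": 0.25}
expected = utilitarian_value(utilities, probabilities)

# ...while the egalitarian score changes only when the support changes.
worst_full = egalitarian_value(utilities, {"alice", "bob", "carol"})
worst_without_bob = egalitarian_value(utilities, {"alice", "carol"})
```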

jsd 11 Mar 2023 7:43 UTC
2 points
0
in reply to: Alok Singh’s comment on: When To Stop
You may enjoy: https://arxiv.org/abs/2207.08799

Notes on Teaching in Prison

jsd 19 Apr 2023 1:53 UTC

270 points

12 comments 12 min read LW link

jsd 20 Apr 2023 20:09 UTC
4 points
1
in reply to: M. Y. Zuo’s comment on: Notes on Teaching in Prison
How is it logistically possible for the guards to go on strike?
Who was doing all the routine work of operating cell doors, cameras, and other security facilities?
This is a good question. In France, prison guards are not allowed to strike (like most police, military, and judges). At the time, the penitentiary administration asked for sanctions against guards who took part in the strike, but I think most were not applied because there was a shortage of guards.
In practice, guards were replaced by gendarmes, and work was reduced to the basics (e.g. security and food). In particular, classes, visits, and yard time were greatly reduced and sometimes completely cut, which heightened tensions with inmates.

jsd 22 Apr 2023 0:14 UTC
10 points
0
in reply to: Vitor’s comment on: Notes on Teaching in Prison
How did you end up doing this work? Did you deliberately seek it out?
I went to a French engineering school which is also a military school. During the first year (which corresponds to junior year in US undergrad), each student typically spends around six months in an armed forces regiment after basic training.
Students get some amount of choice of where to spend these six months among a list of options, and there are also some opportunities outside of the military: these include working as a teaching assistant in some high schools, working for some charitable organizations, and working as a teacher in prison.
When the time came for me to choose, I had an injured shoulder, was generally physically weak, and did not have much of a “soldier mindset” (ha). So I chose the option that seemed most interesting among the nonmilitary ones: teaching in prison. Every year around five students in a class of about 500 do the same.
What are teachers, probation officers and so on (everyone who is not a guard) like? What drives them?
Overall, I thought the teachers were fairly similar to normal public school teachers—in fact, some of them worked part-time at a normal high school. In the prison I worked at, they seemed maybe more dedicated and open-minded than normal public school teachers, but I don’t remember them super well.
I don’t really remember my conversations with probation officers. One thing that struck me was that in my prison, maybe 90% of them were women. If I remember correctly, most were in their thirties.

jsd 6 May 2023 6:53 UTC
1 point
1
in reply to: Ruby’s comment on: Notes on Teaching in Prison
Thank you Ruby. Two other posts I like that I think fit this category are A Brief Introduction to Container Logistics and What it’s like to dissect a cadaver.
