Karma: 471

# Announcing Apollo Research

30 May 2023 16:17 UTC
212 points

# A small update to the Sparse Coding interim research report

30 Apr 2023 19:54 UTC
60 points
• Nice project and writeup. I particularly liked the walkthrough of the thought processes throughout the project.

> Decision square’s Euclidean distance to the top-right corner, positive ().
>
> We are confused and don’t fully understand which logical interactions produce this positive regression coefficient.

I’d be wary about interpreting the regression coefficients of correlated features (see multicollinearity). Even the sign may be misleading.

It might be worth making a cross-correlation plot of the features. This won’t give you new coefficients to put faith in, but it might help you decide how much to trust the ones you have. It can also be useful to look at how unstable the coefficients are during training (or, e.g., when trained on a different dataset).
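A minimal sketch of the diagnostic I have in mind, on synthetic data (the two features and all numbers here are hypothetical stand-ins, not the maze features from the post): with nearly collinear regressors, the feature correlation matrix flags the problem, and bootstrap resampling shows the individual coefficients swinging around even while their sum stays stable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two nearly collinear features (hypothetical stand-ins).
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)  # only x1 truly drives y

# Cross-correlation of the features: off-diagonal values near +/-1
# are the warning sign for multicollinearity.
corr = np.corrcoef(X, rowvar=False)
print("feature correlation matrix:\n", corr)

# Coefficient instability under bootstrap resampling: refit the
# regression on resampled data and watch how much each coefficient moves.
coefs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    coefs.append(beta)
coefs = np.array(coefs)

# Individual coefficients are unstable (large std, signs can flip),
# but the identified combination (their sum here) is stable.
print("per-coefficient std devs:", coefs.std(axis=0))
print("std dev of coefficient sum:", coefs.sum(axis=1).std())
```

The same resampling check applied to the actual maze features would show whether the positive coefficient on the distance feature is robust or an artifact of correlation with other features.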

# Navigating public AI x-risk hype while pursuing technical solutions

19 Feb 2023 12:22 UTC
18 points

# [Interim research report] Taking features out of superposition with sparse autoencoders

13 Dec 2022 15:41 UTC
109 points

# Interpreting Neural Networks through the Polytope Lens

23 Sep 2022 17:58 UTC
135 points
• I think the risk level becomes clearer when stepping back from stories of how pursuing specific utility functions leads to humanity’s demise. An AGI will have many powerful levers on the world at its disposal. Very few combinations of lever pulls result in a good outcome for humans.

From the perspective of ants in an anthill, the actual utility function(s) of the humans is of minor relevance; the ants will be destroyed by a nuclear bomb in much the same way as they will be destroyed by a new construction site or a group of mischievous kids playing around.

(I think your Fermi AGI paradox is a good point, I don’t quite know how to factor that into my AGI risk assessment.)

• I have a different intuition here; I would much prefer the alignment team at e.g. DeepMind to be working at DeepMind as opposed to doing their work for some “alignment-only” outfit. My guess is that there is a non-negligible influence that an alignment team can have on a capabilities org in the form of:

• The alignment team interacting with other staff either casually in the office or by e.g. running internal workshops open to all staff (like DeepMind apparently do)

• The org consulting with the alignment team (e.g. before releasing models or starting dangerous projects)

• Staff working on raw capabilities having somewhere easy to go if they want to shift to alignment work

I think the above benefits likely outweigh the impact of the influence in the other direction (such as the value drift from having economic or social incentives linked to capabilities work).

• Nice list!

Conditioned on the future containing AIs that are capable of suffering in a morally relevant way, interpretability work may also help identify and even reduce this suffering (and/or increase pleasure and happiness). While this may not directly reduce x-risk, it is a motivator for people persuaded by arguments about s-risks from sentient AIs to work on, or advocate for, interpretability research.