there’s a lot of discussion on these topics posted here. I’d suggest reading through some recent top posts; they vary significantly in opinion, but there are a lot of insightful perspectives. here are some I found interesting from the past year, loosely filtered by relevance — you’ll have to click through to decide which ones are good for you:
howtos:
https://www.lesswrong.com/posts/xEHy9oivifjgFbnvc/slack-matters-more-than-any-outcome
https://www.lesswrong.com/posts/Afdohjyt6gESu4ANf/most-people-start-with-the-same-few-bad-ideas
https://www.lesswrong.com/posts/9ezkEb9oGvEi6WoB3/concrete-steps-to-get-started-in-transformer-mechanistic
https://www.lesswrong.com/posts/h5CGM5qwivGk2f5T9/7-traps-that-we-think-new-alignment-researchers-often-fall
https://www.lesswrong.com/posts/zo9zKcz47JxDErFzQ/call-for-distillers
research overviews:
https://www.lesswrong.com/posts/QBAjndPuFbhEXKcCr/my-understanding-of-what-everyone-in-technical-alignment-is
https://www.lesswrong.com/posts/LbrPTJ4fmABEdEnLf/200-concrete-open-problems-in-mechanistic-interpretability
https://www.lesswrong.com/posts/BzYmJYECAc3xyCTt6/the-plan-2022-update
https://www.lesswrong.com/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-human-values
https://www.lesswrong.com/posts/27AWRKbKyXuzQoaSk/some-conceptual-alignment-research-projects
insights & reports:
https://www.lesswrong.com/posts/hsf7tQgjTZfHjiExn/my-take-on-jacob-cannell-s-take-on-agi-safety
https://www.lesswrong.com/posts/KLS3pADk4S9MSkbqB/review-love-in-a-simbox
https://www.lesswrong.com/posts/WKGZBCYAbZ6WGsKHc/love-in-a-simbox-is-all-you-need
https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target → https://www.lesswrong.com/posts/TWorNr22hhYegE4RT/models-don-t-get-reward
https://www.lesswrong.com/posts/xF7gBJYsy6qenmmCS/don-t-die-with-dignity-instead-play-to-your-outs
https://www.lesswrong.com/posts/8oMF8Lv5jiGaQSFvo/boundaries-part-1-a-key-missing-concept-from-utility-theory
https://www.lesswrong.com/posts/XYDsYSbBjqgPAgcoQ/why-the-focus-on-expected-utility-maximisers
https://www.lesswrong.com/posts/JLyWP2Y9LAruR2gi9/can-we-efficiently-distinguish-different-mechanisms
https://www.lesswrong.com/posts/A9tJFJY7DsGTFKKkh/high-stakes-alignment-via-adversarial-training-redwood
https://www.lesswrong.com/posts/LFNXiQuGrar3duBzJ/what-does-it-take-to-defend-the-world-against-out-of-control
https://www.lesswrong.com/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without
https://www.lesswrong.com/posts/kmpNkeqEGvFue7AvA/value-formation-an-overarching-model
https://www.lesswrong.com/posts/nwLQt4e7bstCyPEXs/internal-interfaces-are-a-high-priority-interpretability
communication:
https://www.lesswrong.com/posts/SqjQFhn5KTarfW8v7/lessons-learned-from-talking-to-greater-than-100-academics
https://www.lesswrong.com/posts/gpk8dARHBi7Mkmzt9/what-ai-safety-materials-do-ml-researchers-find-compelling
and if you’re gonna complain that I linked too many posts — yeah, I totally did, that’s fair
Thank you!