Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability.
Modern language models are not aligned. Anthropic’s HH model is the closest thing available, and I’m not sure anyone outside Anthropic has had a chance to probe it for weaknesses or misalignment. (OpenAI’s Instruct RLHF models are deceptively misaligned, and have grown more misaligned over time. They fail to faithfully give the right answer, instead saying something that merely resembles the training objective, usually something bland and “reasonable.”)
On zero-shot tasks, this is the problem that text-davinci-002 faces, and text-davinci-001 to a lesser extent. I believe they are deceptively aligned. davinci-instruct-beta does not have this problem.
For example, when text-davinci-002 is asked zero-shot to make an analogy between two things, it will often just plainly explain each of them instead.
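For concreteness, here is a minimal sketch of the kind of zero-shot query I mean, using the legacy OpenAI Completions endpoint. The prompt topic is an illustrative stand-in of my own choosing, and the code is not a transcript of any actual run:

```python
import os
import openai

# Legacy OpenAI Completions API (pre-chat Python SDK); illustrative only.
openai.api_key = os.environ["OPENAI_API_KEY"]

# A zero-shot analogy request: no examples in the prompt, just the bare instruction.
# The topic pair is hypothetical, chosen for illustration.
prompt = "Make an analogy between a transformer language model and a library."

response = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    max_tokens=128,
    temperature=0.7,
)

# The failure mode described above: rather than an analogy ("X is like Y because..."),
# the completion often just describes each thing separately.
print(response["choices"][0]["text"].strip())
```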