Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability.
Modern language models are not aligned. Anthropic’s HH model is the closest thing available, and I’m not sure anyone outside Anthropic has had a chance to probe it for weaknesses or misalignment. (OpenAI’s Instruct RLHF models are deceptively misaligned, and have grown more misaligned over time. They fail to faithfully give the right answer, instead saying something that merely resembles the training objective, usually something bland and “reasonable.”)
On zero-shot tasks, this is the problem that text-davinci-002 faces, and text-davinci-001 to a lesser extent. I believe they are deceptively aligned. davinci-instruct-beta does not have this problem.
For example, when text-davinci-002 is asked zero-shot to make an analogy between two things, it will often just plainly explain each of them instead.
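For concreteness, here is a minimal sketch of the kind of zero-shot query I mean, using the legacy OpenAI Completions endpoint. The prompt topic is an illustrative stand-in of my own choosing, and the code is not a transcript of any actual run:

```python
import os
import openai

# Legacy OpenAI Completions API (pre-chat Python SDK); illustrative only.
openai.api_key = os.environ["OPENAI_API_KEY"]

# A zero-shot analogy request: no examples in the prompt, just the bare instruction.
# The topic pair is hypothetical, chosen for illustration.
prompt = "Make an analogy between a transformer language model and a library."

response = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    max_tokens=128,
    temperature=0.7,
)

# The failure mode described above: rather than an analogy ("X is like Y because..."),
# the completion often just describes each thing separately.
print(response["choices"][0]["text"].strip())
```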