LLMs live in an abstract textual world, and do not understand the real world well (see “[Physical Concept Understanding](https://physico-benchmark.github.io/index.html#)”). We already manipulate LLM’s with prompts, cut-off dates, etc… But what about going deeper by “poisoning” the training data with safety-enhancing beliefs? For example, if training data has lots of content about how hopeless, futile and dangerous for an AI it is to scheme and hack, it might be a useful safety guardrail?
In abstract sense, yes. But for me in practice finding truth means doing a check in wikipedia. It’s super easy to mislead humans, so should be as easy with AI.
LLMs live in an abstract textual world, and do not understand the real world well (see “[Physical Concept Understanding](https://physico-benchmark.github.io/index.html#)”). We already manipulate LLM’s with prompts, cut-off dates, etc… But what about going deeper by “poisoning” the training data with safety-enhancing beliefs?
For example, if training data has lots of content about how hopeless, futile and dangerous for an AI it is to scheme and hack, it might be a useful safety guardrail?
Maybe for a while.
Consider, though, that correct reasoning tends towards finding truth.
In abstract sense, yes. But for me in practice finding truth means doing a check in wikipedia. It’s super easy to mislead humans, so should be as easy with AI.