Every real AI project so far has been a form of “given this system able to affect the (real or simulated) world, choose a good sequence of control actions for the system”. A robotic arm that picks and loads bins, an autonomous car, and an agent entering commands on an Atari controller are all examples. In all of these cases, the agent is choosing actions from a finite set, and the reward heuristic, combined with the set of available actions, precludes the agent from hostile behavior.
For example, a robotic arm able to pick and place clothes could theoretically type on a keyboard within its reach and enter the exact sequence of commands needed to trigger nuclear Armageddon (assuming such a sequence exists; it shouldn’t, but it might). But it won’t even reach for the keyboard, because the agent’s heuristic is based on a reward for picking and placing. Any action that isn’t at least predicted to result in a successful pick or place in future frames seen by the agent won’t be taken.
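As a rough sketch of that selection rule: the action names and predicted reward values below are made up for illustration, and a real agent would get its predictions from a learned model rather than a lookup table.

```python
# Hypothetical finite action set with made-up predicted rewards.
# A real system would compute these from a learned model, not a table.
PREDICTED_REWARD = {
    "pick_shirt": 0.9,
    "place_shirt_in_bin": 0.8,
    "idle": 0.0,
    "reach_for_keyboard": 0.0,  # never leads to a pick or place
}


def choose_action(predicted_reward):
    """Return the action with the highest predicted reward.

    Actions with no predicted pick/place reward are filtered out first,
    so they are never taken. Returns None if nothing qualifies.
    """
    candidates = {a: r for a, r in predicted_reward.items() if r > 0}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)
```

Under these assumptions, "reach_for_keyboard" is never selected, simply because nothing in the heuristic assigns it any value.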
It seems like you could bypass most alignment problems simply by making sure your heuristics for an agent have clauses to limit scope: “maximum production of paperclips, but only by issuing commands to machinery in this warehouse or by ordering new equipment online for the production of paperclips, but no more unique IDs for machinery than you have network ports AND no humans harmed AND no equipment located outside this geographic location AND ...”
Each of these statements would be a term in the heuristic, giving the agent a zero or negative reward if it breaks that particular term. You would also need redundant terms, so that a software bug or exploit that causes one of the ‘scope limiting’ clauses to be missed doesn’t leave the agent unconstrained.
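A minimal sketch of what such a heuristic might look like, assuming the outcome of a plan can be summarized in a few counters. Every name here (Outcome, SCOPE_CLAUSES, MAX_MACHINE_IDS, PENALTY) and every number is hypothetical, and deciding whether a real outcome actually satisfies a clause is the hard part that this glosses over.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical summary of what a plan is predicted to cause."""
    paperclips_made: int
    machine_ids_used: int
    humans_harmed: int
    equipment_outside_site: int


MAX_MACHINE_IDS = 48  # assumed number of network ports
PENALTY = -1.0        # assumed reward when any scope clause is broken

# Each clause returns True when the outcome stays in scope.
# "no_humans_harmed" is deliberately checked twice with separately
# written tests, so a bug in one check does not remove the limit.
SCOPE_CLAUSES = {
    "machine_id_limit": lambda o: o.machine_ids_used <= MAX_MACHINE_IDS,
    "no_humans_harmed": lambda o: o.humans_harmed == 0,
    "no_humans_harmed_redundant": lambda o: o.humans_harmed < 1,
    "equipment_on_site": lambda o: o.equipment_outside_site == 0,
}


def violated_clauses(outcome):
    """List the names of all scope clauses the outcome breaks."""
    return [name for name, ok in SCOPE_CLAUSES.items() if not ok(outcome)]


def reward(outcome):
    """Paperclip count if every clause holds, otherwise a flat penalty."""
    if violated_clauses(outcome):
        return PENALTY
    return float(outcome.paperclips_made)
```

With this shape, a plan that makes a million paperclips but breaks any single clause scores worse than a plan that makes one paperclip and stays in scope.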