I operate by Crocker’s rules.
It sounds like you want to use it as a component for alignment of a larger AI, which would somehow turn its natural-language directives into action. I say use it as the capability core: Ask it to do armchair alignment research. If we give it subjective time, a command line interface and internet access, I see no reason it would do worse than the rest of us.
The problem arises whenever the environment changes. Natural selection was a continual process, and yet humans still aren’t fitness-aligned.
We meant the linked proposal. Although I don’t think we need to do more than verify a GPT’s safety, this approach could be used to understand AI enough to design a safe one ourselves, so long as enforcing modularity does not compromise capability.
Consider also trying the other direction—after all, KL is asymmetric.
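A toy illustration of the asymmetry, with made-up two-outcome distributions:

```python
import math

def kl(p, q):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]

forward = kl(p, q)  # ≈ 0.368 nats
reverse = kl(q, p)  # ≈ 0.511 nats
```

The gap grows as one distribution puts mass where the other has almost none, which is why the two directions probe different things.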
I meant your latter interpretation.
Can you measure the KL-divergence at each layer from the input, rather than the output? KL does not satisfy the triangle inequality, so maybe most of the layers are KL-close to both input and output?
GPT uses ReLU, yes? Then the regularization would make it calculate using small values, which would be possible because ReLU is nonlinear on small values. If we used an activation function that’s linear on small values, I would therefore expect more of the calculation to be visible.
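To see what "linear on small values" buys, compare ReLU against tanh (approximately linear near zero) on a pair of arbitrary small inputs; a function that is linear there satisfies f(a + b) ≈ f(a) + f(b), so small-value computation passes through it without doing any work:

```python
import math

def relu(x):
    return max(0.0, x)

a, b = -0.01, 0.02

# Linearity check on small inputs: f(a + b) vs f(a) + f(b).
relu_err = abs(relu(a + b) - (relu(a) + relu(b)))  # large relative error: the kink at 0 computes
tanh_err = abs(math.tanh(a + b) - (math.tanh(a) + math.tanh(b)))  # tiny: tanh is ~linear here
```

ReLU's kink at zero is exactly what lets the network compute with small activations; tanh-like functions would force the computation up to visible magnitudes.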
Is it really trained to output the input offset by one, or just to have the last slot contain the next word? Because I would expect it to be better at copying the input over by one...
If each layer were trained to give its best guess at the next token, this myopia would prevent all sorts of hiding data for later. This would be a good experiment for your last story, yes? I expect this would perform very poorly, though if it doesn’t, hooray, for I really don’t expect that version to develop inner optimizers.
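A sketch of what that objective looks like at a single position, with made-up logits and a shared readout (plain Python for clarity; a real run would use a deep-learning framework):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target):
    return -math.log(softmax(logits)[target])

# Hypothetical logits each layer's hidden state decodes to, over a
# 3-token vocabulary, via the same readout.
layer_logits = [
    [0.1, 0.2, 0.0],  # layer 1's guess at the next token
    [0.0, 1.0, 0.2],  # layer 2's guess
    [0.0, 2.5, 0.1],  # layer 3's guess
]
target = 1  # index of the true next token

# Myopic objective: every layer is graded on the next token directly,
# so no layer gains by computing something useful only to later layers.
loss = sum(cross_entropy(logits, target) for logits in layer_logits)
```

Each layer's gradient then comes only from its own guess, which is the sense in which data can't be stashed for later.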
I expect that all that’s required for a Singularity is to wait a few years for the sort of language model that can replicate a human’s thoughts faithfully, then make it generate a thousand years’ worth of that researcher’s internal monologue, perhaps with access to the internet.
Neural networks should be good at this task—we have direct evidence that neural networks can run human brains.
Whether our world’s plot has a happy ending then merely depends on the details of that prompt/protocol—such as whether it decides to solve alignment before running a successor. Though it’s probably simple to check alignment of the character—we have access to his thoughts. A harder question is whether the first LM able to run humans is still inner aligned.
Can you locally replicate GPT? For example, can GPT-you compress WebText better than GPT-2?
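To make the comparison concrete: a model that assigns probability p to each actual next token can compress it to -log2 p bits via arithmetic coding, so the better predictor is the better compressor. A sketch with made-up per-token probabilities:

```python
import math

def bits_to_encode(token_probs):
    """Total bits an arithmetic coder needs, given the model's
    probability for each actual next token in the text."""
    return sum(-math.log2(p) for p in token_probs)

# Hypothetical probabilities two models assign to the same three tokens.
model_a = [0.5, 0.25, 0.125]    # 1 + 2 + 3 = 6 bits
model_b = [0.25, 0.25, 0.0625]  # 2 + 2 + 4 = 8 bits
```

Whichever of GPT-you and GPT-2 spends fewer bits on held-out WebText is the better local replica.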
SOTA: Penalize my action by how well a maximizer that takes my place after the action would maximize a wide variety of goals.
If we use me instead of the maximizer, paradoxes of self-reference arise, which we can resolve by inserting a modal operator: Penalize my action by how well I expect I would maximize a wide variety of goals (if given that goal). Then, when considering the action of stepping towards an omnipotence button, I would expect that, having decided to take one step, I would take more, and would therefore penalize the first step heavily. The exception is plausible deniability: if the first step towards the button is also a first step towards my concrete goal, then I might still expect to remain bound by the penalty.
I’ve suggested using myself before in the last sentence of this comment: https://www.lesswrong.com/posts/mdQEraEZQLg7jtozn/subagents-and-impact-measures-full-and-fully-illustrated?commentId=WGWtoKDrnN3o6cS6G
Long outputs will tend to deteriorate naturally, as the model tries to reproduce the existing deterioration and accidentally adds some more. Better: Sample one tag at a time. Shuffle the inputs every time to access different subdistributions. (I wonder how much the subdistributions differ for two random shuffles...) If you output the tag whose minimum probability across a hundred subdistributions is highest, I bet that’ll produce a tag that’s not in the inputs.
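The max-min rule in miniature, with made-up tag distributions standing in for what GPT produces under each shuffling:

```python
# One probability distribution over tags per input shuffling (made up).
subdistributions = [
    {"ai": 0.30, "math": 0.25, "fiction": 0.10},
    {"ai": 0.28, "math": 0.05, "fiction": 0.40},
    {"ai": 0.33, "math": 0.30, "fiction": 0.02},
]

def robust_tag(dists):
    """Pick the tag whose worst-case probability across shufflings is highest."""
    tags = set().union(*dists)
    return max(tags, key=lambda t: min(d.get(t, 0.0) for d in dists))

print(robust_tag(subdistributions))  # -> "ai" (its minimum, 0.28, beats the others')
```

A tag that every subdistribution likes a bit is a better bet than one that a single shuffle happens to love.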
You make it sound like it wants things. It could at most pretend to be something that wants things. If there’s a UFAI in there that is carefully managing its bits of anonymity (which sounds as unlikely as your usual conspiracy theory—a myopic neural net of this level should keep a secret no better than a conspiracy of a thousand people), it’s going to have better opportunities to influence the world soon enough.
Just ask GPT to do the tagging, people.
The WaveFunctionCollapse algorithm measures whichever tile currently has the lowest entropy. GPT-3 always just measures the next token. Of course in prose those are usually the same, but I expect some qualitative improvements once we get structured data with holes such that any might have low entropy, a transformer trained to fill holes, and the resulting ability to pick which hole to fill next.
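The hole-picking step in miniature, with made-up model distributions over two unfilled slots:

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical model beliefs about the value of each unfilled hole.
holes = {
    "title":  {"Foo": 0.95, "Bar": 0.05},        # nearly decided
    "author": {"A": 0.4, "B": 0.35, "C": 0.25},  # still uncertain
}

# WaveFunctionCollapse-style choice: fill the lowest-entropy hole first.
next_hole = min(holes, key=lambda h: entropy(holes[h]))
print(next_hole)  # -> "title"
```

Filling the near-certain hole first lets its value constrain the genuinely uncertain ones, instead of the other way around.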
Until then, I expect the prompts/GPT protocols that perform well to be those which happen to present the holes in your data in the order that wfc would have picked, i.e. ask it to show its work; don’t ask it to write the bottom line of its reasoning process first.
Long shortform short: Include the sequences in your prompt as instructions :)
The obvious answer to spammers being run by GPT is mods being run by GPT. Ask it whether every comment is high-quality/generated, then act on that as needed to keep the site functional.
It was meant as a submission, except that I couldn’t be bothered to actually implement my distribution on that website :) - even/especially after superintelligent AI, researchers might come to the conclusion that we weren’t prepared and *shouldn’t* build another—regardless of whether the existing sovereign would allow it.
Answering with a point estimate seems rather silly. Shouldn’t it answer with a distribution? Then one question would be enough.
Re claim 1: If you let it use the page as a scratch pad, you can also let it output commands to a command line interface so it can outsource these hard-to-emulate calculations to the CPU.
Not quite. Just look at the prior and draw the vertical line at 2030. Note that you’re incentivizing people to submit their guesses as late as possible, both to have time to read other comments and to place their guess just to one side of someone else’s.