David Jilk

Karma: 8

David Jilk 2 May 2023 18:27 UTC
3 points
7
on: Systems that cannot be unsafe cannot be safe
It’s worse than that. https://arxiv.org/abs/1604.06963

David Jilk 21 Jan 2024 14:08 UTC
3 points
0
on: Goals selected from learned knowledge: an alternative to RL alignment
>> I’ve been trying to understand and express why I find natural language alignment … so much more promising than any other alignment techniques I’ve found.
Could it be that we humans have millennia of experience aligning our new humans (children) using this method? Whereas every other method is entirely new to us, and has never been applied to a GI even if it has been tested on other AI systems; thus, predictions of outcomes are speculative.

But it still seems like there is something missing from specifying goals directly via expression through language or even representational manipulation. If the representations themselves do not contain any reference to motivational structure (i.e., they are “value free” representations), then the goals will not be particularly stable. Johnny knows that it’s bad to hit his friends because Mommy told him so, but he only cares because it’s Mommy who told him, and he has a rather strong psychological attachment to Mommy.

David Jilk 10 Feb 2024 20:41 UTC
5 points
0
on: Believing In
I’ve done some similar analysis on this question myself in the past, and I am running a long-term N=1 experiment by opting not to take the attitude of belief toward anything at all. Substituting words like “prefer,” “anticipate,” and “suspect” has worked just fine for me and removes the commitment and brittleness of thought associated with holding beliefs.

Also, in looking into these questions, I learned that other languages do not pack into one word the same set of disparate meanings (polysemy) as our word “belief.” In particular, the way we use it in American English to “hedge” (i.e., meaning “I think but I am not sure”) is not a typical usage, and my recollection (possibly flawed) is that it isn’t in British English either.