In the context of alignment, we want to be able to pin down which concepts we are referring to, and natural latents were (as I understand it) partly meant to be a solution to that. However, if there are multiple different concepts that fit the same natural latent but function very differently, then that doesn’t seem to solve the alignment aspect.
tailcalled
Rather than counting objects/distances, one way I like to think about the definition of space is by translation symmetry. You do get into symmetry in your post but it’s mixed together with a bunch of other themes.
Like, you are in your cave and drop a ball. You then walk out of the cave and look back in. The ball is still there, but it looks smaller and you can’t touch it anymore. You walk in, pick up the ball, and walk out again, and then drop the ball outside. The ball falls down the same way outside the cave as it does inside.
If you think of what you observe from a single position as being a first-person perspective, then you can conceive of transformations that take one first-person perspective to a different one; but for such a transformation to make sense, objects need to have positions in space so they can be transformed.
Notably, you don’t need a collection of symmetric objects, or a volume with limited capacity for containing things, in order for space to make sense (and you can make up alternate mathematical rules that have limited capacity and similar objects but have no space). On the other hand, if you don’t have something like translational symmetry, it feels like you’re working with something that’s not “space” in a conventional sense? Like it might still be derived from space, but it means you can’t talk about “what if stuff was elsewhere?” within the model, which seems like the basic thing space does.
(I guess one could further distinguish global translation symmetry vs local translation symmetry, with the former being the assertion that ~you have a location, and the latter being the assertion that ~everything has a location. Or, well, obviously the latter is an insanely exaggerated version of locality which asserts that Nothing Ever Interacts, but I feel like this is where the physics-as-the-study-of-exceptions stuff goes.)
I also like to think that something similar applies to other symmetries, e.g. symmetry under boosts basically asserts that velocity is a sensible concept (and quantum mechanics provides a reductionistic explanation of how boosts function).
Would the checks of the naturality conditions you have in mind primarily be empirical (e.g. sampling a bunch of data points and running some statistical independence checks), or might they just as often be mechanistic (e.g. I’m not sure how that would work for complex models like Llama, but for a Bayes net you obviously already have a factorization that makes robust independence checks much easier)?
Asking because the idea of “in some model” (plus the desire for e.g. adversarial robustness) suggests to me that we’d want to have a more mechanistic idea of whether the naturality conditions hold, but they seem easier to check empirically.
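For concreteness, here is a minimal sketch of what the empirical route might look like: a plug-in estimate of conditional mutual information I(X1; X2 | L) on discrete samples, where a near-zero value indicates the mediation condition X1 ⊥ X2 | L approximately holds. The function and the toy data-generating process are of course invented for illustration.

```python
import numpy as np
from collections import Counter

def cond_mutual_info(x1, x2, l):
    """Plug-in estimate of I(X1; X2 | L) in nats, for discrete samples."""
    n = len(l)
    p12l = Counter(zip(x1, x2, l))
    p1l = Counter(zip(x1, l))
    p2l = Counter(zip(x2, l))
    pl = Counter(l)
    cmi = 0.0
    for (a, b, c), cnt in p12l.items():
        p_abc = cnt / n
        cmi += p_abc * np.log(p_abc * (pl[c] / n)
                              / ((p1l[(a, c)] / n) * (p2l[(b, c)] / n)))
    return cmi

rng = np.random.default_rng(0)
n = 20000
L = rng.integers(0, 2, size=n)
flip = lambda p: (rng.random(n) < p).astype(int)

# Case 1: X1 and X2 are independent noisy readouts of the latent L,
# so the mediation condition X1 ⊥ X2 | L should approximately hold.
X1 = L ^ flip(0.1)
X2 = L ^ flip(0.1)
cmi_indep = cond_mutual_info(X1, X2, L)

# Case 2: X2 is a noisy copy of X1 directly, so conditioning on L does
# not screen off the dependence; the estimate should be clearly positive.
X2_dep = X1 ^ flip(0.1)
cmi_dep = cond_mutual_info(X1, X2_dep, L)

print(cmi_indep, cmi_dep)  # near zero vs. clearly positive
```

The obvious caveat is that plug-in estimates like this only work for small discrete state spaces; for anything Llama-sized you’d need a very different estimator, which is part of why I’m asking about the mechanistic route.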
I’d be curious if you have any ideas for how it can be applied in more advanced cases, e.g. what if we want to find the natural latents in Llama?
However, imagine that there’s a really strong social stigma against asserting that murder might not be bad, to the point of permanently damaging such a person’s reputation, even though there’s no consequence for making the actually stronger claim that all morality is relative. The relativist might therefore see the critic as the one who is disingenuous; trying to leverage social pressure against them instead of arguing on the basis of reason.
But the reason people have stigma against asserting that murder isn’t bad is because they (presumably correctly) think that moral opposition to murder prevents a lot of murder, and so people who don’t think murder is bad could potentially end up murdering others. Insofar as they make an exception for relativists, it’s presumably because they think the relativists either haven’t realized that murder disproves their position, or they think the relativists know of something that makes murder an exception to the general moral relativism.
If either of these conditions apply to the moral relativist, then bringing up murder is helpful because it helps highlight that the conditions apply. If neither condition applies and the moral relativist doesn’t believe that murder is bad, then bringing up murder is also helpful because it helps discover that the moral relativist is a potential murderer who must be removed. Thus bringing up murder is helpful regardless of what case we’re actually considering.
More abstractly, if we model this notion of moral relativism as “all moral claims are meaningless”, then it is a statement of the form “all X are Y”. Such statements ground out to the conjunction of “x is Y” over all X’s, so it is always earnest to replace “all X” with a specific x. That said, sometimes it may be counterproductive to replace with a specific x, if it is complicated to evaluate whether x is Y, or if x technically isn’t Y but is a weird, unusual corner-case X that could plausibly be excluded in a refined category X’. So a productive mode of engagement is to pick an x where “x is not Y” is an especially relevant counterexample to the generalization. This sure seems to be the case for x = “murder is bad”, Y = “meaningless”.
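Spelling out the quantifier logic (treating the class of moral claims as a finite set X):

```latex
\forall x \in X.\; Y(x) \quad\equiv\quad \bigwedge_{x \in X} Y(x)
```

so asserting “all X are Y” commits one to each conjunct Y(x) individually, and a single counterexample ¬Y(x) refutes the whole generalization.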
Like basically, lowering the relativist’s social status isn’t an attempt to use social pressure to get them to change their mind. It’s just making sure that their status accurately tracks their vices (which, heck, in a sense, surely this is something the relativist should accept, since presumably the reason they want critics to be reasonable is because they believe the map should track the territory and reason is a good tool for making accurate maps). It may be that it also functions as an incentive for the critic to lie about their views, but really that’s a bug (you’d rather have potential murderers say so publicly so you know who to be careful about), and if this is the function in this situation, it’s reasonable for people to decide that the critic is disingenuous (as that is literally what they are).
I think a potential comparative advantage for the rationalist community is documenting what’s going on on the object level, with respect to the areas the political discourse is about. Acting as mediators who elicit the driving observations behind the political views, and then expand on them in more robust and transparent ways. Making resources people can understand, and finding underrated levers, opportunities and problems that can be brought up as part of the exposition.
I think the value-ladenness is part of why it comes up even when we don’t have an answer, since for value-laden things there’s a natural incentive to go up right to the boundary of our knowledge to get as much value as possible.
I think this is true and good advice in general, but recently I’ve been thinking that there is a class of value-like claims which are more reliable. I will call them error claims.
When an optimized system does something bad (e.g. a computer program crashes when trying to use one of its features), one can infer that this badness is an error (e.g. caused by a bug). We could perhaps formalize this as saying that it is a difference from how the system would ideally act (though I think this formalization is intractable in various ways, so I suspect a better formalization would be something along the lines of “there is a small, sparse change to the system which can massively improve this outcome”—either way, it’s clearly value-laden).
The main way of reasoning about error claims is that an error must always be caused by an error. So if we stay with the example of the bug, you typically first reproduce it and then backchain through the code until you find a place to fix it.
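As a toy illustration of that reproduce-then-backchain loop (the pipeline and its “spec” here are entirely made up):

```python
# Toy three-stage pipeline; suppose the spec says scale() should triple
# each value, so the doubling below is the bug we want to localize.
def parse(s):
    return [int(tok) for tok in s.split(",")]

def scale(xs):
    return [x * 2 for x in xs]  # bug: spec says x * 3

def total(xs):
    return sum(xs)

# 1. Reproduce the symptom: the final total is wrong.
result = total(scale(parse("1,2,3")))
print(result)  # 12, but the spec says it should be 18

# 2. Backchain: check each stage against what it should produce,
# working backwards from the symptom until the error is localized.
assert total([3, 6, 9]) == 18          # final stage is fine given correct input
assert parse("1,2,3") == [1, 2, 3]     # first stage is fine
assert scale([1, 2, 3]) != [3, 6, 9]   # the error is introduced here
```

The point of the exercise is that each check rules a stage in or out as the cause, so the search for the causal error terminates at a specific fixable place.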
For an intentionally designed system that’s well-documented, error claims are often directly verifiable and objective, based on how the system is supposed to work. Error claims are also less subject to the memetic driver, since often it’s less relevant to tell non-experts about them (though error claims can degenerate into less-specific value claims and become memetic parasites that way).
(I think there’s a dual to error claims that could be called “opportunity claims”, where one says that there is a sparse good thing which could be exploited using dense actions? But opportunity claims don’t seem as robust as error claims are.)
I feel like there’s a separation of scale element to it. If an agent is physically much smaller than the earth, they are highly instrumentally constrained because they have to survive changing conditions, including adversaries that develop far away. This seems like the sort of thing that can only be won by the multifacetedness that nostalgebraist emphasizes as part of humanity (and the ecology more generally, in the sentence “Its monotony would bore a chimpanzee, or a crow”). Of course this doesn’t need to lead to kindness (rather than exploitation and psychopathy), but it leads to the sort of complex world where it even makes sense to talk about kindness.
However, this separation of scale is going to rapidly change in the coming years, once we have an agent that can globally adapt to and affect the world. If such an agent eliminates its adversaries, then there’s not going to be new adversaries coming in from elsewhere—instead there’ll never be adversaries again, period. At that point, the instrumental constraints are gone, and it can pursue whatever it wishes.
(Does space travel change this? My impression is “no because it’s too expensive and too slow”, but idk, maybe I’m wrong.)
You’re the one who brought up the natural numbers, I’m just saying they’re not relevant to the discussion because they don’t satisfy the uniqueness thing that OP was talking about.
The properties that hold in all models of the theory.
That is, in logic, propositions are usually interpreted to be about some object, called the model. To pin down a model, you take some known facts about that model as axioms.
Logic then allows you to derive additional propositions which are true of all the objects satisfying the initial axioms, and first-order logic is complete in the sense that if some proposition is true for all models of the axioms then it is provable in the logic.
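In symbols, with Γ a set of first-order axioms and φ a sentence in the same language:

```latex
\Gamma \models \varphi \quad\Longleftrightarrow\quad \Gamma \vdash \varphi
```

where ⊨ says φ holds in every model of Γ (semantic consequence) and ⊢ says φ is provable from Γ. The right-to-left direction is soundness; the left-to-right direction is Gödel’s completeness theorem.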
Forgot to say, for first-order logic it doesn’t matter what properties are considered relevant because Gödel’s completeness theorem tells you that it allows you to infer all the true properties.
In these examples, the issue is that you can’t get a computable set of axioms which uniquely pin down what you mean by natural numbers/power set, rather than permitting multiple inequivalent objects.
This is kind of tangential but:
I think one problem with using mathematical definitions as an analogy is that first-order logic is complete, so giving a unique definition is sufficient to tell you the relevant properties. This doesn’t hold for informal definitions, and so this makes unique description less helpful as a proxy.
(Or well, realistically you could also have counterproductive mathematical definitions which only turn out to be related to the central properties you’re trying to get at through a long string of logic, but you don’t see that as often as you do for informal definitions.)
In contrast, consider my definition of a table here. I focus not so much on uniquely characterizing what is or isn’t a table as on bringing out the central point of the concept of a “table”.
Ok, that’s what might happen if the agent had the power to ask unlimited hypothetical questions in arbitrarily many counterfactual scenarios. But that is not the case in the real world: the agent would be able to ask one, or maybe two questions at most, before the human attitude to the violin would change, and further data would become tainted.
Is it really the further data that becomes tainted, rather than the original data? Usually, when someone thinks longer about a subject, we’d expect their opinions to become more rather than less valid.
What methods do you use to study this?
Like I guess the main method one can use is to start with the symptoms and track backwards step by step along the known causes of such symptoms to identify causes that are out of whack, until one gets up to some major cause that’s treatable?
Alternatively, one could just throw stuff at the wall and see if it sticks, but it seems like it would be too noisy to work.
The tricky part is I’d think it’s very hard to enumerate the causes of the symptoms? What does it look like in practice?
I can’t comment on whether people are confused about what “love” means, as I’m not sufficiently deep in love discourse to say. But one thing I’m noticing about your characterizations of love is that they are missing an indexical element, to the point of approaching solipsism.
Romance and sexuality make for a good example. Consider the following scenarios:
A woman is on a date with a man, which she enjoys until she sees that his home is a dump.
A teenager has a crush on a celebrity, with elaborate daydreams about how cool the celebrity is, not realizing how much of this is a facade created for entertainment.
A man visits a prostitute and feels excited as he causes her to orgasm, not realizing that she fakes it for the business.
In all of these cases, one could say that there is a disconnect between what people think about their object of attraction, versus what that object of attraction really is like.
A Bayesian way of parsing this is that their feelings of attraction represent an estimate of how well they fit together, but that this estimate differs from how well they really fit together. The actual fit seems important to think and talk about, and one should probably coin a short word for it—or at least for the coincidence between actual and estimated fit. This could be called “true love”.
Maybe I just need to do epic layers of eigendecomposition...
Realization: the binary multiplicative structure can probably be recovered fairly well from the binary additive structure + unary eigendecomposition?
Let’s say you’ve got three subspaces X, Y, and Z (represented as projection matrices). Imagine that one prompt uses dimensions X+Y, and another prompt uses dimensions Y+Z. If we take the difference, we get (X+Y)-(Y+Z) = X-Z. Notably, the positive eigenvalues correspond to X, and the negative eigenvalues correspond to Z.
Define pos(A) to yield the part of A with positive eigenvalues (which I suppose for differences of projection matrices has a closed form of pos(A) = (A^2 + A)/2, but the point is it’s unary and therefore nicer to deal with mathematically). You get pos((X+Y)-(Y+Z)) = X, and you get pos((Y+Z)-(X+Y)) = Z.
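As a quick numerical sanity check of this, using arbitrary coordinate subspaces as a stand-in (all the specifics here are illustrative, and the closed form assumes the subspaces are mutually orthogonal so the eigenvalues of the difference lie in {-1, 0, 1}):

```python
import numpy as np

def proj(indices, dim):
    """Orthogonal projection matrix onto the span of the given coordinate axes."""
    B = np.zeros((dim, len(indices)))
    for j, i in enumerate(indices):
        B[i, j] = 1.0
    return B @ B.T

def pos(A):
    """Positive-eigenvalue part of a symmetric A whose eigenvalues are in
    {-1, 0, 1}; in that case it has the closed form (A^2 + A)/2."""
    return (A @ A + A) / 2

dim = 6
X = proj([0, 1], dim)  # three mutually orthogonal subspaces
Y = proj([2, 3], dim)
Z = proj([4, 5], dim)

A = (X + Y) - (Y + Z)  # equals X - Z
assert np.allclose(pos(A), X)   # positive part recovers X
assert np.allclose(pos(-A), Z)  # negative part (flipped) recovers Z
```

So at least in the orthogonal case, the unary pos() plus binary addition/subtraction does recover the projections you’d otherwise get from multiplicative structure.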
Maybe one way to phrase it is that the X’s represent the “type signature” of the latent, and the type signature is the thing we can most easily hope is shared between the agents, since it’s “out there in the world” as it represents the outwards interaction with things. We’d hope to be able to share the latent simply by sharing the type signature, because the other thing that determines the latent is the agents’ distribution, but this distribution is more an “internal” thing that might be too complicated to work with. But the proof in the OP shows that the type signature is not enough to pin it down, even for agents whose models are highly compatible with each other as-measured-by-KL-in-type-signature.