drocta

Karma: 79

drocta 7 May 2021 20:10 UTC
LW: 10 AF: 2
AF
on: Parsing Chris Mingard on Neural Networks
I’m writing this comment mostly because the post prompted me to try to see how feasible it would be to computationally enumerate the conditions on the weights of small networks (like the 2 input, 2 hidden layer, 1 output one) under which they implement each of the possible functions. So, I looked at the second smallest case by hand: I enumerated the conditions on the weights for a 2 input, 1 output perceptron with no hidden layer to implement each of the 2 input gates, and wanted to talk about it. This did not result in any insights, so if that doesn’t sound interesting, maybe skip reading the rest of this comment. I am willing to delete this comment if anyone would prefer I do that.
Of the 16 2-input-1-output gates, 2 of them, xor and xnor, can’t be done with a perceptron with no hidden layer (as is well known). For 8 of them, the conditions on the 2 weights and the bias for the function to be implemented can be expressed as an intersection of 3 half spaces, and the remaining 6 can of course be expressed as an intersection of 4. (4 is the maximum number that could be required: for each specific input and output, the condition on the weights and bias for that input to give that output is a half space, so specifying the half space for each of the 4 inputs is always enough.)

The ones that require 4 are: the constant 0 function, the constant 1 function, return the first input, return the second input, return the negation of the first input, and return the negation of the second input.
These seem, surprisingly, among the simplest possible behaviors. They are the ones which disregard at least one input. It seems a little surprising to me that these would be the ones that require an intersection of 4 half spaces.

I haven’t computed the proportions of the space taken up by each region so maybe the ones that require 4 planes aren’t particularly smaller. And I suppose with this few inputs, it may be hard to say that any of these functions are really substantially more simple than any of the rest of them. Or it may be that the tendency for simpler functions to occupy more space only shows up when we actually have hidden layers and/or have many more nodes.
Here is a table (x and y are the weights from a and b to the output, and z is the bias on the output):
| outputs for the different inputs (a,b) = 00, 01, 10, 11 when this function is computed | conditions on x, y, z |
| --- | --- |
| 0000 (i.e. the constant 0) | z<0, x+y+z<0, x+z<0, y+z<0 |
| 0001 (i.e. the and gate) | x+y+z>0, x+z<0, y+z<0 |
| 0010 (i.e. a and not b) | z<0, x+y+z<0, x+z>0 |
| 0011 (i.e. if input a) | z<0, x+y+z>0, x+z>0, y+z<0 |
| 0100 (i.e. b and not a) | z<0, x+y+z<0, y+z>0 |
| 0101 (i.e. if input b) | z<0, x+y+z>0, x+z<0, y+z>0 |
| 0110 (i.e. xor) | impossible |
| 0111 (i.e. or) | z<0, x+z>0, y+z>0 |
| 1000 (i.e. nor) | z>0, x+z<0, y+z<0 |
| 1001 (i.e. xnor) | impossible |
| 1010 (i.e. not b) | z>0, x+y+z<0, x+z>0, y+z<0 |
| 1011 (i.e. b->a) | z>0, x+y+z>0, y+z<0 |
| 1100 (i.e. not a) | z>0, x+y+z<0, x+z<0, y+z>0 |
| 1101 (i.e. a->b) | z>0, x+y+z>0, x+z<0 |
| 1110 (i.e. nand) | x+y+z<0, x+z>0, y+z>0 |
| 1111 (i.e. the constant 1) | z>0, x+z>0, y+z>0, x+y+z>0 |
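
The proportions of weight space taken up by each region, which I said I haven’t computed, could be estimated by sampling. The following is only a minimal sketch under assumptions that are not part of the comment above: the unit is taken to output 1 exactly when x·a + y·b + z > 0, and (x, y, z) is drawn from an isotropic Gaussian. (Since every condition in the table is a homogeneous linear inequality, each region is a cone, so any spherically symmetric distribution should give the same proportions.) The names `gate` and `estimate_proportions` are made up for illustration.

```python
import random
from collections import Counter

# Order of the output bits: (a, b) = 00, 01, 10, 11
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def gate(x, y, z):
    """Truth-table string computed by a threshold unit with weights x, y and bias z.

    Assumption: the unit outputs 1 exactly when x*a + y*b + z > 0.
    """
    return "".join("1" if x * a + y * b + z > 0 else "0" for a, b in INPUTS)

def estimate_proportions(samples=100_000, seed=0):
    """Monte Carlo estimate of the fraction of weight space implementing each gate.

    Assumption: weights and bias are sampled from an isotropic Gaussian.
    """
    rng = random.Random(seed)
    counts = Counter(
        gate(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1))
        for _ in range(samples)
    )
    return {bits: n / samples for bits, n in sorted(counts.items())}

proportions = estimate_proportions()
for bits, fraction in proportions.items():
    print(bits, round(fraction, 4))
```

If the table is right, xor (0110) and xnor (1001) should never show up in the output, and comparing the estimated fractions for the 4-half-space gates against the 3-half-space gates would say whether the former are actually smaller.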

drocta 9 May 2023 19:08 UTC
9 points
0
in reply to: mruwnik’s comment on: All AGI Safety questions welcome (especially basic ones) [May 2023]
I want to personally confirm a lot of what you’ve said here. As a Christian, I’m not entirely freaked out about AI risk because I don’t believe that God will allow it to be completely the end of the world (unless it is part of the planned end before the world is remade? But that seems unlikely to me.), but that’s no reason that it can’t still go very very badly (seeing as, well, the Holocaust happened).
In addition, the thing that seems to me most likely to be the way that God doesn’t allow AI doom, is for people working on AI safety to succeed. One shouldn’t rely on miracles and all that (unless [...]), so, basically I think we should plan/work as if it is up to humanity to prevent AI doom, only that I’m a bit less scared of the possibility of failure, but I would hope only in a way that results in better action (compared to panic) rather than it promoting inaction.
(And if we don’t succeed, a likely alternative, I think, is something like: really-bad-stuff happens, but then maybe an EMP (or many EMPs worldwide?) gets activated, solving that problem, but also causing large-scale damage to power grids, frying lots of equipment, and causing shortages of many things necessary for the economy, which also causes many people to die. idk.)

drocta 10 Jun 2022 0:56 UTC
8 points
on: How Do Selection Theorems Relate To Interpretability?
As another “why not just” which I’m sure there’s a reason for:
in the original circuits thread, they made a number of parameterized families of synthetic images which certain nodes in the network responded strongly to in a way that varied smoothly with the orientation parameter, and where these nodes detected e.g. boundaries between high-frequency and low-frequency regions at different orientations.
If you gave the same images to another such network of generally the same kind of architecture, and it also had analogous nodes, I’d expect those nodes to have much more similar responses to those images than any other nodes in the network. I would expect that cosine similarity of the “how strongly does this node respond to this image” vectors would be able to pick out the node(s) in question fairly well? Perhaps I’m wrong about that.
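
To make the matching idea concrete, here is a rough sketch, with made-up numbers rather than anything from the circuits thread. Each node is summarized by a vector of its responses to the same list of synthetic images, and the node in the second network whose vector has the highest cosine similarity to the reference node’s vector is picked. The function names and the example responses are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two response vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def best_match(reference_response, candidate_responses):
    """Return the candidate node most similar to the reference, plus all scores."""
    scores = {
        name: cosine_similarity(reference_response, response)
        for name, response in candidate_responses.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical responses to 4 synthetic images (e.g. 4 orientations).
reference = [0.9, 0.1, 0.0, 0.2]           # understood node in the old network
candidates = {                              # nodes in the new network
    "node_0": [0.1, 0.8, 0.7, 0.0],
    "node_1": [1.8, 0.3, 0.1, 0.5],         # roughly a scaled copy of the reference
    "node_2": [0.2, 0.2, 0.2, 0.2],
}
match, scores = best_match(reference, candidates)
print(match, scores)
```

Checking every node this way costs one pass over the nodes per reference node, which is the sense in which it hardly seems NP-hard.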
And, of course, this idea seems only directly applicable to feed-forward convolution networks that take an image as the input, and so, not so applicable when trying to like, understand how an agent works, probably.
(well, maybe it would work in things that aren’t just a convolutions-and-pooling-and-dilation-etc network, but it seems like it would be hard to make the analogous synthetic inputs which exemplify the sort of thing that the node responds to, for inputs other than images. Especially if the inputs are from a particularly discrete space, like sentences or something.)
But, this makes me a bit unclear about why the “NP-HARD” lights start blinking.
Of course, “find isomorphic structure”, sure.
But, if we have a set of situations which exemplify when a given node does and does not fire (rather, when it activates more and when it activates less) in one network, searching another network for a node that does/doesn’t activate in those same situations, hardly seems NP-hard. Just check all the nodes for whether they do or don’t light up. And then, if you also have similar characterizations for what causes activation in the nodes that came before the given node in the first network, apply the same process with those on the nodes in the second network that come before the nodes that matched the closest.
I suppose if you want to give an overall score for each combination of “this sub-network of nodes in the new network corresponds to this other network of nodes in the old-and-understood network”, and find the sub-network that gives the best score, then, sure, there could be exponentially many sub-networks to consider. But, if each well-understood node in the old network generally has basically only one plausibly corresponding node in the new network, then this seems like it might not really be an issue in practice?
But, I don’t have any real experience with this kind of thing, and I could be totally off.

drocta 13 May 2021 22:48 UTC
7 points
on: Agency in Conway’s Game of Life
nitpick : the appendix says $10^{60}$ possible configurations of the whole grid, while it should say $2^{(10^{60})}$ possible configurations. (Similarly for what it says about the number of possible configurations in the region that can be specified.)

drocta 5 Jun 2021 0:51 UTC
6 points
in reply to: axioman’s comment on: An Intuitive Guide to Garrabrant Induction
My understanding:
One could create a program which hard-codes the point about which the price oscillates (as well as some amount by which the price always eventually moves away from that point in either direction), and have it buy once when the price is below, then wait until the price is above to sell, then wait until the price is below to buy, etc.
The programs receive as input the prices which the market maker is offering.
It doesn’t need to predict ahead of time how long until the next peak or trough, it only needs to correctly assume that it does oscillate sufficiently, and respond when it does.
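
As a rough illustration of such a program (a sketch only: it ignores the continuity requirements and other details of the actual construction, and `oscillation_trader`, `center`, and `margin` are names made up here), it can look at each offered price in turn and react, without ever predicting when the next peak or trough comes:

```python
def oscillation_trader(prices, center, margin):
    """Buy one share when the price is at most center - margin,
    then sell it when the price is at least center + margin, and repeat.

    `center` and `margin` are the hard-coded oscillation point and amount.
    Returns the accumulated cash and whether a share is still held.
    """
    cash = 0.0
    holding = False
    for price in prices:
        if not holding and price <= center - margin:
            cash -= price          # buy low
            holding = True
        elif holding and price >= center + margin:
            cash += price          # sell high
            holding = False
    return cash, holding

# Hypothetical price sequence oscillating around 0.5.
prices = [0.5, 0.3, 0.45, 0.7, 0.6, 0.2, 0.5, 0.75]
profit, still_holding = oscillation_trader(prices, center=0.5, margin=0.2)
print(profit, still_holding)
```

Each completed buy/sell cycle gains at least twice the margin, so if the price really does keep oscillating that far, the gains accumulate without bound.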

drocta 1 Oct 2020 3:26 UTC
5 points
in reply to: TurnTrout’s comment on: “Zero Sum” is a misnomer.
As an example, I think the game “both players win if they choose the same option, and lose if they pick different options” has “the two players pick different options, and lose” as one of the feasible outcomes, and it is not on the Pareto frontier, because if they picked the same thing, they would both win, and that would be a Pareto improvement.
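
A small sketch of this example, assuming two options labelled "A" and "B" and payoffs of 1 for winning and 0 for losing (the labels and payoff values are assumptions made for illustration):

```python
def pareto_frontier(outcomes):
    """Return the outcomes whose payoff vector is not Pareto-dominated by any other."""
    def dominates(p, q):
        # p dominates q if it is at least as good for everyone and strictly better for someone.
        return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

    return {
        name: payoff
        for name, payoff in outcomes.items()
        if not any(dominates(other, payoff) for other in outcomes.values())
    }

# (choice of player 1, choice of player 2) -> (payoff to player 1, payoff to player 2)
outcomes = {
    ("A", "A"): (1, 1),
    ("A", "B"): (0, 0),
    ("B", "A"): (0, 0),
    ("B", "B"): (1, 1),
}
frontier = pareto_frontier(outcomes)
print(sorted(frontier))
```

The mismatched outcomes are feasible but dominated, so only the two matching outcomes remain on the frontier.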

drocta 3 Jan 2023 22:28 UTC
3 points
0
on: Verification Is Not Easier Than Generation In General
First, I want to summarize what I understand to be what your example is an example of:
“A triple consisting of
1) A predicate P
2) the task of generating any single input x for which P(x) is true
3) the task of, given any x (and given only x, not given any extra witness information), evaluating whether P(x) is true
”
For such triples, it is clear, as your example shows, that the second task (the 3rd entry) can be much harder than the first task (the 2nd entry).
_______
On the other hand, if instead one had the task of producing an exhaustive list of all x such that P(x), this, I think, cannot be easier than verifying whether such a list is correct (provided that one can easily evaluate whether x=y for whatever type x and y come from), as one can simply generate the list, and check if it is the same list.
Another question that comes to mind is: Are there predicates P such that the task of verifying instances which can be generated easily, is much harder than the task of generating those kinds of instances?
It seems that the answer to this is also “yes”: Consider P to be “is this the result of applying this cryptographic hash function to (e.g.) a prime number?”. It is fairly easy to generate large prime numbers, and then apply the hash function to it. It is quite difficult to determine whether something is the hash of a prime number (… maybe assume that the hash function produces more bits for longer inputs in order to avoid having pigeonhole principle stuff that might result in it being highly likely that all of the possible outputs are the hash of some prime number. Or just put a bound on how big the prime numbers can be in order for P to be true of the hash.)
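
Here is a sketch of the easy generating side of this example. The choices are assumptions made for illustration: SHA-256 as the hash function, the prime encoded as its decimal string, a fixed bit length (playing the role of the bound on how big the primes can be), and a Miller-Rabin test, which is probabilistic rather than a proof of primality.

```python
import hashlib
import random

def is_probable_prime(n, rounds=20, rng=random):
    """Miller-Rabin primality test (probabilistic)."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = rng.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True

def generate_instance(bits=256, seed=None):
    """Generate an x with P(x) true: the hash of a (probable) prime of the given bit length."""
    rng = random.Random(seed)
    while True:
        candidate = rng.getrandbits(bits) | (1 << (bits - 1)) | 1  # force top bit and oddness
        if is_probable_prime(candidate, rng=rng):
            digest = hashlib.sha256(str(candidate).encode()).hexdigest()
            return candidate, digest

prime, digest = generate_instance(seed=0)
print(digest)
```

Someone handed only `digest` has no comparably cheap way to decide whether it is the hash of a prime, short of inverting or searching the hash function.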
(Also, for the task “does this machine halt”, given a particular verifying process that only gets the specification of the machine and not e.g. a log of it running, it should probably(?) be reasonably easy to produce machines that halt but which that particular verifying process will not confirm quickly.
So, for any easy-way-to-verify there is an easy-way-to-generate which produces ones that the easy-way-to-verify cannot verify, so that seems to be another reason why “yes”, though there may be some subtleties here?)

drocta 5 Jun 2021 0:23 UTC
3 points
AF
on: Finite Factored Sets: Introduction and Factorizations
The part about Chimera functions was surprising, and I look forward to seeing where that will go, and to more of this in general.
In section 2.1 , Proposition 2 should presumably say that $\geq_{S}$ is a partial order on $Part (S)$ rather than on $S$ .

drocta 3 Jun 2021 22:36 UTC
3 points
on: An Intuitive Guide to Garrabrant Induction
In the section about Non-Dogmatism , I believe something was switched around. It says that if the logical inductor assigns prices converging to $1 to a proposition that cannot be proven, that the trader can buy shares in that proposition at prices of $ $2^{- n}$ and thereby gain infinite potential upside. I believe this should say that if the logical inductor assigns prices converging to $0 to a proposition that can’t be dis-proven, instead of prices converging to $1 for a proposition that can’t be proven .
(I think that if the price was converging to $1 for a proposition that cannot be proven, the trader would sell $1 / (1 - c_{n})$ shares at prices $ $1 - c_{n}$ , for potential gain of $1 each time, and potential losses of $(1 / (1 - c_{n})) - 1$ , so, to have this be $ $2^{- n}$ , this should be $1 - c_{n} = 1 / (1 + 2^{- n})$ .)
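
The arithmetic in that parenthetical can be checked with exact fractions. This is only a sanity check of the numbers as I stated them, assuming each share pays out $1 if the proposition turns out true and $0 otherwise; `short_sale` is a name made up here.

```python
from fractions import Fraction

def short_sale(n):
    """Sell 1 / (1 - c_n) shares at price 1 - c_n, with 1 - c_n = 1 / (1 + 2**-n)."""
    price = 1 / (1 + Fraction(1, 2**n))   # this is 1 - c_n
    shares = 1 / price                     # number of shares sold
    revenue = shares * price               # money received now (the potential gain)
    loss_if_true = shares * 1 - revenue    # pay $1 per share if the proposition is true
    return price, shares, revenue, loss_if_true

for n in range(1, 6):
    price, shares, revenue, loss = short_sale(n)
    print(n, price, shares, revenue, loss)
```

With this choice of price the revenue is exactly $1 on day n, and the potential loss is exactly $ 2^{-n} $ dollars.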

There’s also a little formatting error with the LaTeX in section 4.1.
Nice summary/guide! It made the idea behind the construction of the algorithm much more clear to me.
(I had a decent understanding of the criterion, but I hadn’t really understood the big picture of the algorithm. I think I had previously been tripped up by the details around the continuity and such, and not following these led to me not getting the big picture of it.)

drocta 28 May 2021 3:51 UTC
LW: 3 AF: 2
AF
on: Finite Factored Sets
You said that you thought that this could be done in a categorical way. I attempted something which appears to describe the same thing when applied to the category FinSet , but I’m not sure it’s the sort of thing you meant when you suggested that the combinatorial part could potentially be done in a categorical way instead, and I’m not sure that it is fully categorical.
Let S be an object.
For i from 1 to k, let $A_{i}$ be an object, (which is not anything isomorphic to the product of itself with itself, or at least is not the terminal object) .
Let $f : \prod_{i} A_{i} \to S$ be an isomorphism.
Then, say that $((A_{i})_{i}, f, S)$ is a representation of a factorization of S.
If $((A_{i})_{i}, f, S)$ and $((A_{i}^{'})_{i}, f^{'}, S)$ are each a representation of a factorization of S, then say that they represent the same factorization of S iff there exist isomorphisms $g_{i} : A_{i} \to A_{i}^{'}$ such that, where $\langle g_{1}, g_{2}, \ldots, g_{k} \rangle : \prod_{i} A_{i} \to \prod_{i} A_{i}^{'}$ is the isomorphism obtained from the $g_{i}$ with the usual product map, the composition of it with f’ is equal to f, that is, $\langle g_{1}, g_{2}, \ldots, g_{k} \rangle; f^{'} = f$ .
Then say that a factorization is an equivalence class of representations under this relation (representing the same factorization is an equivalence relation).
For FinSet , the factorizations defined this way correspond to the factorizations as originally defined.
However, I’ve no idea whether this definition remains interesting if applied to other categories.
For example, if it were to be applied to the closed disk in a category of topological spaces and continuous functions, it seems that most of the isomorphisms from [0,1] * [0,1] to the disk would be distinct factorizations, even though there would still be many which are identified, and I don’t really see talking about the different factorizations of the closed disk as saying much of note. I guess the factorizations using [0,1] and [0,1] correspond to different cosets of the group of automorphisms of the closed disk by a particular subgroup, but I’m pretty sure it isn’t a normal subgroup, so no luck there.
If instead we try the category of vector spaces and linear maps over a particular field, then I guess it looks more potentially interesting. I guess things over sets having good analogies over vector spaces is a common occurrence. But here still, the subgroups of the automorphism groups given largely by the products of the automorphism groups of the things in the product, seems like they still usually fail to be a normal subgroup, I think. But regardless, it still looks like there’s some ok properties to them, something kinda Grassmannian-ish ? idk. Better properties than in the topological spaces case anyway.

drocta 29 Jan 2021 3:59 UTC
3 points
on: Newcomb’s Paradox Assumes The Halting Problem Is Decidable
That something can be modeled using some Turing machine, doesn’t imply that it can be any Turing machine.

If I have some simple physical system, such that I can predict how it will behave, well, it can be modeled by a Turing machine, but me being able to predict it doesn’t imply that I’ve solved the halting problem.
A realistic conception of agents in an environment doesn’t involve all agents having unlimited compute at every time-step. An agent cannot prevent the universe from continuing simply by getting stuck in a loop and never producing its output for its next action.

drocta 6 Nov 2020 1:35 UTC
3 points
in reply to: Scott Garrabrant’s comment on: Subagents of Cartesian Frames
Thanks! (The way you phrased the conclusion is also much clearer/cleaner than how I phrased it)

drocta 5 Nov 2020 7:46 UTC
LW: 3 AF: 1
AF
on: Subagents of Cartesian Frames
I am trying to check that I am understanding this correctly by applying it, though probably not in a very meaningful way:
Am I right in reasoning that, for $S \subseteq W$ , that $1_{S} ⊲ C$ iff ( (C can ensure S), and (every element of S is a result of a combination of a possible configuration of the environment of C with a possible configuration of the agent for C, such that the agent configuration is one that ensures S regardless of the environment configuration)) ?
So, if S = {a,b,c,d} , then
$C = \begin{bmatrix} a & b & a \\ c & d & d \\ e & f & a \end{bmatrix}$
would have $1_{S} ⊲ C$ , but, say
$D = \begin{bmatrix} a & a & a \\ b & b & c \\ c & d & f \\ b & f & a \end{bmatrix}$
would have $1_{S} ⋪ D$ , because , while S can be ensured, there isn’t, for every outcome in S, an option which ensures S and which is compatible with that outcome ?
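
The condition I’m guessing at can be checked mechanically on these two examples. This sketch tests only my conjectured characterization (not the actual definition of the relation), and it assumes that rows are the agent’s options and columns are the environment’s options; the function name is made up.

```python
def satisfies_conjectured_condition(frame, S):
    """Check the guessed condition: some row ensures S, and every element of S
    appears in at least one row that ensures S.

    frame: list of rows; each row lists the outcomes of one agent option
    across all environment options (assumed convention).
    """
    S = set(S)
    # Rows all of whose outcomes lie in S, i.e. agent options that ensure S.
    ensuring_rows = [row for row in frame if set(row) <= S]
    if not ensuring_rows:
        return False  # S cannot be ensured at all
    covered = set().union(*(set(row) for row in ensuring_rows))
    return covered == S

S = {"a", "b", "c", "d"}
C = [list("aba"), list("cdd"), list("efa")]
D = [list("aaa"), list("bbc"), list("cdf"), list("bfa")]
print(satisfies_conjectured_condition(C, S))
print(satisfies_conjectured_condition(D, S))
```

For D the check fails because the outcome d only occurs in the row (c, d, f), which does not ensure S.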

drocta 27 Oct 2020 1:44 UTC
3 points
on: A Correspondence Theorem
There are a few places where I believe you mean to write $f_{2} (X_{2})$ but have $f (X_{2})$ instead. For example, in the line above the “Applicability” heading.
I like this.

drocta 19 Jun 2023 21:53 UTC
2 points
0
in reply to: Jeremy Gillen’s comment on: Infrafunctions and Robust Optimization
Well, I was kinda thinking of $ν$ as being, say, a distribution of human behaviors in a certain context (as filtered through a particular user interface), though, I guess that way of doing it would only make sense within limited contexts, not general contexts where whether the agent is physically a human or something else, would matter. And in this sort of situation, well, the action of “modify yourself to no-longer be a quantilizer” would not be in the human distribution, because the actions to do that are not applicable to humans (as humans are, presumably, not quantilizers, and the types of self-modification actions that would be available are not the same). Though, “create a successor agent” could still be in the human distribution.
Of course, one doesn’t have practical access to “the true probability distribution of human behaviors in context M”, so I guess I was imagining a trained approximation to this distribution.
Hm, well, suppose that the distribution over human-like behaviors includes both making an agent which is a quantilizer and making one which isn’t, both of equal probability. Hm. I don’t see why a general quantilizer in this case would pick the quantilizer over the plain optimizer, as the utility...
Hm...
I get the idea that the “quantilizers correspond to optimizing an infra-function of form [...]” thing is maybe dealing with a distribution over a single act?
Or.. if we have a utility function over histories until the end of the episode, then, if one has a model of how the environment will be and how one is likely to act in all future steps, given each of one’s potential actions in the current step, one gets an expected utility conditioned on each of the potential actions in the current step, and this works as a utility function over actions for the current step,
and if one acts as a quantilizer over that, each step.. does that give the same behavior as an agent optimizing an infra-function defined using the condition with the $L_{1}$ norm described in the post, in terms of the utility function over histories for an entire episode, and reference distributions for the whole episode?
argh, seems difficult...

drocta 19 Jun 2023 20:24 UTC
2 points
0
in reply to: Jeremy Gillen’s comment on: Infrafunctions and Robust Optimization
I thought CDT was considered not reflectively-consistent because it fails Newcomb’s problem?
(Well, not if you define reflective stability as meaning preservation of anti-Goodhart features, but, CDT doesn’t have an anti-Goodhart feature (compared to some base thing) to preserve, so I assume you meant something a little broader?)
Like, isn’t it true that a CDT agent who anticipates being in Newcomb-like scenarios would, given the opportunity to do so, modify itself to be not a CDT agent? (Well, assuming that the Newcomb-like scenarios are of the form “at some point in the future, you will be measured, and based on this measurement, your future response will be predicted, and based on this the boxes will be filled”)
My understanding of reflective stability was “the agent would not want to modify its method of reasoning”. (E.g., a person with an addiction is not reflectively stable, because they want the thing (and pursue the thing), but would rather not want (or pursue) the thing.)
The idea being that, any ideal way of reasoning, should be reflectively stable.
And, I thought that what was being described in the part of this article about recovering quantilizers, was not saying “here’s how you can use this framework to make quantilizers better”, so much as “quantilizers fit within this framework, and can be described within it, where the infrafunction that produces quantilizer-behavior is this one: [the (convex) set of utility functions which differ (in absolute value) from the given one, by, in expectation under the reference policy, at most epsilon]”
So, I think the idea is that, a quantilizer for a given utility function $U$ and reference distribution $ν$ is, in effect, optimizing for an infrafunction that is/corresponds-to the set of utility functions $V$ satisfying the bound in question,
and, therefore, any quantilizer, in a sense, is as if it “has this bound” (or, “believes this bound”)
And that therefore, any quantilizer should -
- wait.. that doesn’t seem right..? I was going to say that any quantilizer should therefore be reflectively stable, but that seems like it must be wrong? What if the reference distribution includes always taking actions to modify oneself in a way that would result in not being a quantilizer? uhhhhhh
Ah, hm, it seems to me like the way I was imagining the distribution $ν$ and the context in which you were considering it, are rather different. I was thinking of $ν$ as being an accurate distribution of behaviors of some known-to-be-acceptably-safe agent, whereas it seems like you were considering it as having a much larger support, being much more spread out in what behaviors it has as comparably likely to other behaviors, with things being more ruled-out rather than ruled-in ?

drocta 30 May 2023 17:06 UTC
2 points
0
in reply to: Yaakov T’s comment on: All AGI Safety questions welcome (especially basic ones) [May 2023]
In the line that ends with “even if God would not allow complete extinction.”, my impulse is to include “ (or other forms of permanent doom)” before the period, but I suspect that this is due to my tendency to include excessive details/notes/etc., and it is probably best not to actually include that in the sentence.
(Like, for example, if there were no more adult humans, only billions of babies grown in artificial wombs (in a way staggered in time) and then kept in a state of chemically induced euphoria until the age of 1, and then killed, that technically wouldn’t be human extinction, but, that scenario would still count as doom.)
Regarding the “it is secular scientific-materialists who are doing the research which is a threat to my values” part: I think it is good that it discusses this! (and I hadn’t thought about including it)
But, I’m personally somewhat skeptical that CEV really works as a solution to this problem? Or at least, in the simpler ways of it being described.
Like, I imagine there being a lot of path-dependence in how a culture’s values would “progress” over time, and I see little reason why a sequence of changes of the form “opinion/values changing in response to an argument that seems to make sense” would be that unlikely to produce values that the initial values would deem horrifying? (or, which would seem horrifying to those in an alternate possible future that just happened to take a different branch in how their values evolved)
[EDIT: at this point, I start going off on a tangent which is a fair bit less relevant to the question of improving Stampy’s response, so, you might want to skip reading it, idk]
My preferred solution is closer to, “we avoid applying large amounts of optimization pressure to most topics, instead applying it only to topics where there is near-unanimous agreement on what kinds of outcomes are better (such as, “humanity doesn’t get wiped out by a big space rock”, “it is better for people to not have terrible diseases”, etc.), while avoiding these optimizations having much effect on other areas where there is much disagreement as to what-is-good.”

Though, it does seem plausible to me, as a somewhat scary idea, that the thing I just described is perhaps not exactly coherent?
(that being said, even though I have my doubts about CEV, at least in the form described in the simpler ways it is described, I do think it would of course be better than doom.
Also, it is quite possible that I’m just misunderstanding the idea of CEV in a way that causes my concerns, and maybe it was always meant to exclude the kinds of things I describe being concerned about?)

drocta 27 Jun 2022 23:07 UTC
2 points
on: A Toy Model of Gradient Hacking
When you say “which yields a solution of the form $f (w) = c_{1} / (1 - w) + c_{2}$ ”, are you saying that $f^{'} (w) / f (w) = 1 / (1 - w)$ yields that, or are you saying that $(1 - w) f^{'} (w) - f (w) > 0$ yields that? Because, for the former, that seems wrong? Specifically, the former should yield only things of the form $f (w) = c_{1} / (1 - w)$ .
But, if the latter, then, I would think that the solutions would be more solutions than that?
Like, what about $g (w) := c_{1} / (1 - w) + c_{2} \cdot (1 - \frac{1}{10^{6}} + \frac{1}{10^{6}} \cos (w))$ ? (where, say, $c_{1} = ε + δ$ and $c_{2} = - δ$ )
$g^{'} (w) = \frac{c_{1}}{(1 - w)^{2}} + c_{2} \cdot (- \frac{1}{10^{6}} \sin (w))$ , so $(1 - w) g^{'} (w) - g (w) = c_{2} \cdot (- \frac{1}{10^{6}} (1 - w) \sin (w)) - c_{2} \cdot (1 - \frac{1}{10^{6}} + \frac{1}{10^{6}} \cos (w))$
$= - c_{2} \cdot (1 - 10^{- 6} \cdot (1 - \cos (w) - (1 - w) \sin (w)))$
which, for $c_{2} < 0$ , and $w \in [0, 1]$ , is positive, and so g should also be a solution to $(1 - w) f^{'} (w) - f (w) > 0$ , yes?
Or, were you just giving a subset of the solutions, of the differential inequality, which were all the ones you needed to make the point?
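
As a numerical sanity check of the claim about g (a sketch only: the values of ε and δ are arbitrary choices made here, and the grid stops short of w = 1, where $c_{1} / (1 - w)$ blows up):

```python
import math

def lhs(w, c1, c2):
    """Evaluate (1 - w) * g'(w) - g(w) for the candidate g defined above."""
    g = c1 / (1 - w) + c2 * (1 - 1e-6 + 1e-6 * math.cos(w))
    g_prime = c1 / (1 - w) ** 2 + c2 * (-1e-6 * math.sin(w))
    return (1 - w) * g_prime - g

# Arbitrary illustrative values for epsilon and delta.
epsilon, delta = 0.01, 0.5
c1, c2 = epsilon + delta, -delta

# Grid over w in [0, 0.999].
values = [lhs(k / 1000, c1, c2) for k in range(1000)]
print(min(values), max(values))
```

The $c_{1}$ terms cancel, so the result should stay close to $- c_{2} = δ$ across the grid, which is positive.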
Separately (btw, what’s the convention if one has unrelated comments? to put them in one comment or separate comments?) :
As for how the network could detect something about its current weights: if it is a recurrent network, it seems like some form of this should be learnable.
Suppose you set up the architecture for a recurrent neural network, and pick some arbitrary linear combination (but where the coefficients aren’t especially big) of the weights in the network, and then for the loss function, take the main part of the network’s output as doing some normal task, and then have another part of the output be judged for how well it approximates the current value of the linear combination of all the network’s weights, with the initially chosen coefficients. It seems plausible to me that this would do well. Of course, one way it might fail to work properly is if this linear combination ends up becoming roughly constant, or just if some weights stop changing during the training, leading to this second output not getting the signal needed to learn that part. Maybe if you used dropout, and took into account which weights were dropped out when evaluating the weighted sum of the weights in the loss (counting them as zero), it could fix that issue. (dropout being done the same in each recurrence, not separately.)
Why “a linear combination of the weights” instead of the particular weights? Well, I imagine getting all the individual weights would be not particularly feasible unless the architecture was specifically designed to do that (like in the architectures for those quine-ing neural nets), but, if we are looking at a mix between “how the net should be”, and “a net that is doing gradient hacking”, then it seems like “some particular weighted sum of the weights”, might be all that would be needed. Or perhaps two weighted sums with different coefficients if having the $w_{1}$ and $w_{2}$ separate.
On the other hand, if the network isn’t recurrent, or if all the recurrence is done in a vector-quantization-y way, I would imagine it would be more difficult for the network to estimate a given linear combination of its current weights. (Though, if the linear combination of weights is only of weights before a certain layer, then layers after that layer might be able to approximate it without any recurrence, but, the layers that do and use this approximation could be modified without this being detected, and therefore, I’d think it couldn’t protect these later parts from being changed by gradient descent, or at least, it would be harder.)

(I’m imagining, if people are trying to make this sort of thing work, that these networks are using just RELU, not like a smoothed version of RELU, so that the values of the weights are reflected more piecewise-linearly in the activations of one layer, given activations in the previous layer, which I think would facilitate estimating other linear combinations of the weights between the previous layers, and as such also for estimating this for a linear combination of weights in any layer.)

drocta 31 May 2022 19:58 UTC
2 points
on: Paper: Teaching GPT3 to express uncertainty in words
I was surprised by how the fine-tuning was done for the verbalized confidence.
My initial expectation was that it would make the loss be based on like, some scoring rule based on the probability expressed and the right answer.
Though, come to think of it, I guess seeing as it would be assigning logits values to different expressions of probabilities, it would have to… what, take the weighted average of the scores it would get if it gave the different probabilities? And, I suppose that if many training steps were done on the same question/answer pairs, then the confidences might just get pushed towards 0% or 100%?
Ah, but for the indirect logit it was trained using the “is it right or wrong” with the cross-entropy loss thing. Ok, cool.

drocta 20 Feb 2021 23:02 UTC
2 points
on: AI Safety Beginners Meetup
The link in the rss feed entry for this at https://agentfoundations.org/rss goes to https://www.alignmentforum.org/events/vvPYYTscRXFBvdkXe/ai-safety-beginners-meetup which is a broken link (though, easily fixed by replacing “events” with “posts” in the url) .
[edit: it appears that it is no longer in the rss feed? It showed up in my rss feed reader.]
I think this has also happened with other “event” type posts in the rss feed before, but I may be remembering wrong.
I suspect this is some bug in how the rss feed is generated, but possibly it is a known bug which just hasn’t been deemed important enough to fix yet.

I assume that when the event is updated, the additional information will include how to join the meetup?
I am interested in attending.