What if we think about it the following way? ML researchers range from _theorists_ (who try to produce theories that describe how ML/AI/intelligence works at a deep level and how to build it) to _experimenters_ (who put things together using some theory and lots of trial and error and try to make it perform well on benchmarks). Most people will be somewhere in between on this spectrum, but people focusing on interpretability will be further toward the theorist end than most of the field.
Now let’s say we boost the theorists and they produce a lot of explanations that make better sense of the state of the art that experimenters have been playing with. The immediate impact will be improved understanding of our best models, which is good for safety. However, when the experimenters read these papers, their search space (of architectures, hyperparameters, training regimes, etc.) shrinks and they can search more efficiently. Standing on the shoulders of the new theories, they produce even better-performing models (still incorporating a lot of trial and error, because that is what experimenters do).
So what we achieved is better understanding of the current state of the art models combined with new improved state of the art that we still don’t quite understand. It’s not immediately clear whether we’re better off this way. Or is this model too coarse to see what’s going on?
If geoengineering approaches successfully counteract climate change, and it’s cheaper to burn carbon and dim the sun than generate power a different way (or not use the power), then presumably civilization is better off burning carbon and dimming the sun.
AFAIK, the main arguments against solar radiation management (SRM) are:
1. A high level of CO2 in the atmosphere creates other problems too (e.g. ocean acidification), but those problems are less urgent and impactful, so we’ll end up not caring about them if we implement SRM. Reducing CO2 emissions allows us to “do the right thing” using already existing political momentum.
2. Having the climate depend on SRM gives a lot of power to those in control of it and makes civilization dependent on it. We are bad at global cooperation as it is, and having SRM to manage will put additional stress on that. This is a more fragile solution than reducing emissions.
It’s certainly possible to argue against either of these points, especially by introducing the assumption that humanity as a whole is close enough to a rational agent. My opinion is that geoengineering solutions lead to more fragility than reducing emissions, and we would be better off avoiding them, or at least doing something along the lines of carbon sequestration rather than SRM. It also seems increasingly likely that we won’t have that option: our emission-reduction efforts are too slow, and once we hit +5ºC and beyond, the option to “turn this off tomorrow” will look too attractive.
I think things are not so bad. If our talking of consciousness leads to a satisfactory functional theory, we might conclude that we have solved the hard problem (at least the “how” part). Not everyone will be satisfied, but it will be hard to make an argument that we should care about the hard problem of consciousness more than we currently care about the hard problem of gravity.
I haven’t read Nagel’s paper, but from what I have read _about_ it, his main point seems to be that it’s impossible to fully explain subjective experience just by talking about physical processes in the brain. It seems to me that we do get closer to such an explanation by thinking about analogies between conscious minds and AIs. Whether we’ll be able to get all the way there is hard to predict, but it seems plausible that at some point our theories of consciousness will be “good enough”.
The web of concepts where connections conduct karma between nodes is quite similar to a neural net (a biological one). It also seems to be a good model for System 1 moral reasoning and this explains why moral arguments based on linking things to agreed good or agreed bad concepts work so well. Thank you, this was enlightening.
I’ve been doing some ad hoc track-backs while trying to do anapanasati meditation and I found them quite interesting. Never tried to go for track-backs specifically but it does seem like a good idea and the explanations and arguments in this post were quite convincing. I’m going to try it in my next sessions.
I also learned about staring into regrets, which sounds like another great technique to try. This post is just a treasure trove, thank you!
I find that trees of claims don’t always work because context gets lost as you traverse the tree.
Imagine we have a claim A supported by B, which is in turn supported by C. If I think that C does support B in some cases but is irrelevant when specifically talking about A, there’s no good way to express this. Even arguing about the relevance of B to A isn’t really possible: there’s only the impact vote, and I often found that too limiting to express my point.
To some extent both of those cases can be addressed via comments. However, the comments are not very prominent in the UI, are often used for meta discussion instead, and don’t contribute to the scoring and visualization.
One idea that I thought about is creating an additional claim that states how B is relevant to A (and then it can have further sub-claims). However, “hows” are not binary claims, so they wouldn’t fit well into the format and into the visualization. It seems like complicating the model this way won’t be worth it for the limited improvement that we’re likely to see.
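For what it’s worth, here is a rough Python sketch of one way the relevance problem might be modeled (all names here are hypothetical illustrations, not a real argument-mapping API): store relevance votes on the support _edge_, keyed by context, so that C’s support for B can be discounted specifically when the root under discussion is A.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    supports: list = field(default_factory=list)  # edges to supporting claims

@dataclass
class Edge:
    child: Claim
    # Relevance votes keyed by the context (root) claim; 1.0 = fully relevant
    # in that context, 0.0 = irrelevant there.
    relevance_in_context: dict = field(default_factory=dict)

a, b, c = Claim("A"), Claim("B"), Claim("C")
b.supports.append(Edge(c, {"B": 1.0, "A": 0.0}))  # C supports B, but not under A
a.supports.append(Edge(b, {"A": 0.8}))

# C supports B in general, but is judged irrelevant in A's context:
assert b.supports[0].relevance_in_context["A"] == 0.0
```

This still doesn’t capture non-binary “how is B relevant to A” claims, so it shares the limitation described above; it only shows where per-context relevance data could live.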
If we can do this, then it would give us a possible route to a controlled intelligence explosion, in which the AI designs a more capable successor AI because that is the task it has been assigned, rather than for instrumental reasons, and humans can inspect the result and decide whether or not to run it.
How would humans decide whether something designed by a superintelligent AI is safe to run? It doesn’t sound safe by design: even if we rule out safety-compromising divergence in the toy intelligence explosion, how would we know that the successor AI is safe for the real world, when nobody has designed it for that? We certainly shouldn’t think that we can catch potential problems with the design by our own inspection; we can’t even do that reliably for designs produced by non-superintelligent software developers.
This reminds me of the problems with boxed AIs: all is good as long as they can’t affect the real world, but that limits their usefulness, and if they are superintelligent we might not notice them leaking out of the box.
I’m not sure if I can provide useful feedback but I’d be interested in reading it.
I enjoyed reading the hand-written text from images (although I found it a bit surprising that I did). I feel that the resulting slower reading pace fit the content well and that it allowed me to engage with it better. It was also aesthetically pleasant.
Content-wise, I found that it more or less agrees with my experience (I meditated every day for ~1 hour for a bit over a month, and non-regularly after that). It also gave me some insight into everyday mindfulness and some motivation for resuming regular practice, or at least making it more regular.
My favorite quote was this:
To act with equanimity is to be able to see a plan as having a 1% chance of success and see it as your best bet anyway, if best bet it is—and in that frame of mind, to be able to devote your whole being toward that plan; and yet, to be able to drop it in a moment if sufficient evidence accumulates in favor of another way.
(thanks to @gjm for transcribing it, so I didn’t have to :)
This reminded me of this post. I like that you specifically mention that reasoning vs. pattern-matching is a spectrum and context-dependent. The advice about using examples is also good, that definitely worked for me.
Both posts also remind me of Mappers and Packers. Seems like all three are exploring roughly the same personality feature from different angles.
To see if I understand this right, I’ll try to look at (in)adequacy in terms of utility maximization.
The systems that we looked at can be seen as having utility functions. Sometimes the utility function is explicitly declared by the system’s creators, but more often it’s implicit in its design or simply assumed by an observer. For markets it will be some combination of ease of trade, the adjacency of prices in sell and buy offers, etc.; for academia, the amount of useful scientific progress per dollar; for medicine, the number of lives saved and improved (appropriately weighted) per dollar; and so forth.
We might have different precise definitions of the utility functions (or none at all), but we all agree that “saving ten thousand lives for just ten dollars each” increases medicine’s utility value, and that completing modestly priced research with great benefit for humanity increases academia’s utility value.
Then the adequacy of the system is its ability to maximize its utility value, and the lack of such ability is inadequacy. It might also make sense to talk in comparatives: system A is more adequate than system B at maximizing utility function F, or system A is more adequate at maximizing F1 than it is at maximizing F2.
I see the following possible reasons for inadequacy:
1. The system in its current state maximizes a different function than its creators intended (in other words, it’s not aligned with the intention of its creators).
2. The observer has a different idea about the implied or intended utility function, or misunderstands how things work.
3. The system maximizes what we want but does it poorly.
4. The system is not rational and doesn’t consistently maximize any utility function (and perhaps the creators, if any, didn’t have anything precise in mind in the first place).
In real-world systems we usually see a combination of all of these reasons. Technically, (4) would rule out the other options, but insofar as (4) might be indistinguishable from maximizing a utility function imperfectly, they could still apply.
I know this sounds quite imprecise compared to typical utility maximization talk but I think it would be a non-trivial amount of work to define things more precisely. Does this seem to go in the right direction though?
Just in case anyone is interested, here’s a non-paywalled version of this article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3176828/
I’m not sure I’m completely solid on how FHE works, so perhaps this won’t work, but here’s an idea of how B can exploit this approach:
Let’s imagine that Check_trustworthy(A_source) = 1. After step 3 of the parent comment, B would know E1 = Encrypt(1, A_key). If Check_trustworthy(A_source) returned 0, B would instead know E0 = Encrypt(0, A_key) and the following steps would work similarly. B knows which one it is by looking at msg_3.
B has another program: Check_blackmail(X, source) that simulates behaviour of an agent with the given source code in situation X and returns 1 if it would be blackmailable or 0 if not.
B knows Encrypt(A_source, A_key) and they can compute F(X) = Encrypt(Check_blackmail(X, A_source), A_key) for any X using FHE properties of the encryption scheme.
Let’s define W(X) = if(F(X) = E1, 1, 0). It’s easy to see that W(X) = Check_blackmail(X, A_source), so now B can compute that for any X.
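To make the attack concrete, here is a toy Python model of it. The “encryption” is a deterministic mock (equal plaintexts give equal ciphertexts), which is exactly the property the equality test W(X) needs; real FHE schemes randomize ciphertexts, which would block this step, in line with the caveat above. ToyFHE, check_blackmail, and the agent/key names are all hypothetical stand-ins.

```python
class ToyFHE:
    """Deterministic mock FHE: a 'ciphertext' is just a tagged (key, plaintext) tuple."""
    def encrypt(self, plaintext, key):
        # Deterministic: encrypting the same value twice yields equal ciphertexts.
        return ("ct", key, plaintext)

    def eval(self, fn, ct):
        # Homomorphic evaluation: apply fn to the plaintext "inside" the ciphertext.
        _, key, pt = ct
        return self.encrypt(fn(pt), key)

def check_blackmail(situation, source):
    # Hypothetical stand-in for B's simulator of an agent with this source code.
    return 1 if ("threat" in situation and source == "naive-agent") else 0

fhe = ToyFHE()
a_key = "A-key"
a_source = "naive-agent"

# What B learns from the protocol: E1 = Encrypt(1, A_key) and Encrypt(A_source, A_key).
e1 = fhe.encrypt(1, a_key)
enc_source = fhe.encrypt(a_source, a_key)

def w(situation):
    # F(X) = Encrypt(Check_blackmail(X, A_source), A_key), computed homomorphically,
    # then compared against E1 by ciphertext equality.
    f_x = fhe.eval(lambda src: check_blackmail(situation, src), enc_source)
    return 1 if f_x == e1 else 0

# B can now evaluate Check_blackmail for any situation while holding only ciphertexts:
assert w("threat: pay up or else") == 1
assert w("friendly chat") == 0
```

Under the deterministic-encryption assumption, W(X) agrees with Check_blackmail(X, A_source) for every X, just as the argument above claims.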
This example is a lie that could be classified as “aggression light” (because it maximises my utility at the expense of the victim’s utility), whereas the examples in the post try to maximise the other person’s utility. What I find interesting is that the second example from the post (protecting Joe) almost fits your formula, yet it seems intuitively much more benign.
One of the reasons I feel better about lying to protect Joe is that there I maximise his utility (not mine) at the expense of yours (it’s not clear whether you lose anything, but what matters is that I’m mostly doing it for Joe). It’s much easier to morally justify aggression on someone else’s behalf, where I am just “protecting the weak”.