1) From the LW user perspective, AF is integrated in a way which signals there are two classes of users, where the AF members are something like “the officially approved experts” (specialists, etc.), complete with omega badges, special karma, an application process, etc. In such a setup it is hard for the status-tracking subsystem which humans generally have to not care about what is “high status”. At the same time: I went through the list of AF users, and it seems like a much better representation of something Rohin called “viewpoint X” than of the field of AI alignment in general. I would expect some subtle distortion as a result.
2) The LW team seems quite keen on e.g. karma, cash prizes on questions, omegas, daily karma updates, and similar technical measures which from an S2-centric view bring clear benefits (sorting of comments, credible signalling of interest in questions, creating a high-context environment for experts, ...). Often these likely have important effects on S1 motivations / social interactions / etc.: I’ve discussed karma and omegas before; creating an environment driven by prizes risks eroding the spirit of cooperativeness and sharing of ideas which is one of the virtues of the AI safety community; and so on. “Herding elephants with small electric jolts” is a poetic description of the effect downvotes and strong downvotes have on people’s S1.
As a datapoint—my reasons for mostly not participating in discussion here:
The karma system messes with my S1 motivations and research taste; I do not want to update toward “LW average taste”, and I don’t think LW average taste is that great. Also, IMO, on the margin it is better for the field to add people who are trying to orient themselves in AI alignment independently, rather than people guided by “what’s popular on LW”
Commenting seems costly; it feels like comments are expected to be written very clearly and in a reader-friendly way, which takes time
Posting seems super-costly; my impression is many readers are calibrated on the writing quality of Eliezer, Scott & the like, not on informal research conversation
Quality of debate on topics I find interesting is much worse than in person
Not the top reason, but still… the system of AF members vs. hoi polloi, omegas, etc. creates a subtle corruption/distortion field. My overall vague impression is that the LW team generally tends to like solutions which look theoretically nice, and tends not to see the subtler impacts on the elephants. Where my approach would be to try to move much of the elephants-playing-status-games out of the way, what’s attempted here sometimes feels a bit like herding the elephants with small electric jolts.
No. It’s planned so you can attend both events.
FWIW I also think it’s quite possible the current equilibrium is decent (which is part of the reason why I did not post something like “How I turned karma off”, with simple instructions on how to do it on the forum, which I did consider). On the other hand I’d be curious about more people trying it and reporting their experiences.
I suspect many people kind of don’t have this action in the space of things they usually consider; I’d expect what most people would do is 1) just stop posting, 2) write about their negative experience, or 3) complain privately.
Actually I turned off the karma for all comments, not just mine. The bold claim is that my individual taste in what’s good on the EA Forum is in important ways better than the karma system, and that the karma signal is similar to the sounds made by a noisy mob. If I want, I can actually predict reasonably well what sounds the crowd will make on average, so it is not a new source of information. But it still messes with your S1 processing and motivations.
Continuing with the party metaphor, I think it is generally not that difficult to understand what sort of behaviour will make you popular at a party, and what sorts of behaviours, even when they are quite good in the broader scheme of things, will make you unpopular at parties. Also, personally I often feel something like “I actually want to have good conversations about juicy topics in a quiet place; unfortunately all of you are congregating in this super loud space, with all these status games, social signals, and ethically problematic norms for how to treat other people” toward most parties.
Overall I posted this here because it seemed like an interesting datapoint. Generally I think it would be great if people moved toward writing information-rich feedback instead of voting, so such a shift seems good. From what I’ve seen on the EA Forum it’s quite rarely “many people” doing anything. More often it is something like 6 users upvoting a comment and 1 user strongly downvoting it, resulting in a karma of 2. I would guess you may be at greater risk of the distorted perception that this represents some meaningful opinion of the community. (Also I see some important practical cases where people are misled by the “noises of the crowd” and it influences them in a harmful way.)
What I noticed on the EA Forum is that the whole karma thing messes with my S1 processes and makes me unhappy on average. I’ve not only turned off the notifications, but also hidden all karma displays in comments via CSS, and the experience is much better.
Reasons for some cautious optimism
in Part I, it can be the case that human values are actually a complex combination of easy-to-measure goals + complex world models, so the structure of the proxies will be able to represent what we really care about. (I don’t know. Also, the result can still stop representing our values with further scaling and evolution.)
in Part II, it can be the case that influence-seeking patterns are more computationally costly than straightforward patterns, and they can be in part suppressed by optimising for processing costs, bounded-rationality style. To some extent, influence-seeking patterns attempting to grow and control the whole system seem to me to be something that also happens within our own minds. I would guess some combination of immune system + metacognition + bounded rationality + stabilisation by complexity is stabilising many human minds. (I don’t know if any of that can scale arbitrarily.)
Short summary of why the linked paper is important: you can think about bias as some sort of perturbation. You are then interested in how the perturbation cascades through the system, and especially in quantities like the distribution of cascade sizes. The universality classes tell you this can be predicted by just a few parameters (Table 1 in the linked paper), depending mainly on the local dynamics (forecaster-forecaster interactions). Now, if you have a good model of the local dynamics, you can determine the parameters and determine into which universality class the problem belongs. You can also try to infer the dynamics if you have good data on your interactions.
I’m afraid I don’t know enough about how “forecasting communities” work to be able to give you good guesses about the points of leverage. One quick idea, if you have everybody on the same platform, may be to do some sort of A/B experiment: manipulate the data so that some forecasters see the predictions of others with an artificially introduced perturbation, and see how their output differs from the control group. If you have data on “individual dynamics” like that, and some knowledge of the network structure, the theory can help you predict the cascade size distribution.
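To make the cascade-size part concrete, here is a minimal sketch (my own illustrative choices: a fixed-probability adoption rule on an Erdős–Rényi graph via networkx, not the model from the linked paper) of perturbing single nodes and recording how far the perturbation spreads:

```python
# Minimal sketch: measure the distribution of cascade sizes when single nodes
# are perturbed on a random "forecaster" network. The adoption rule and the
# Erdős–Rényi topology are illustrative choices, not the paper's model.
import networkx as nx
import numpy as np

def cascade_size(graph, seed_node, adopt_prob=0.3, rng=None):
    """Perturb seed_node and count how many nodes the perturbation reaches."""
    rng = rng or np.random.default_rng()
    affected = {seed_node}
    frontier = [seed_node]
    while frontier:
        node = frontier.pop()
        for neighbour in graph.neighbors(node):
            # each exposed neighbour picks up the perturbation with prob adopt_prob
            if neighbour not in affected and rng.random() < adopt_prob:
                affected.add(neighbour)
                frontier.append(neighbour)
    return len(affected)

rng = np.random.default_rng(0)
g = nx.erdos_renyi_graph(n=500, p=0.01, seed=0)   # toy forecaster-interaction network
sizes = [cascade_size(g, s, rng=rng) for s in g.nodes]

# The shape of this distribution (sharp cutoff vs. heavy tail) is the kind of
# quantity the universality-class results let you predict from a few parameters.
print("median / 90th / 99th percentile cascade size:", np.percentile(sizes, [50, 90, 99]))
```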
(I also apologize for not being more helpful, but I really don’t have time to work on this for you.)
I was a bit confused by “we … but aren’t sure how to reason quantitatively about the impacts, and how much the LW community could together build on top of our preliminary search”, which seemed to nudge toward original research. Outsourcing literature reviews, distillation or extrapolation seems great.
Generally, there is a substantial literature on the topic within the field of network science. The right keywords for Google Scholar are something like “spreading dynamics in complex networks”; “information cascades” does not seem to be the best choice of keywords.
There are many options for how you can model the state of a node (discrete states, oscillators, continuous variables, vectors of any of the above, ...), multiple options for how you represent the dynamics (something like the Ising model / softmax, versions of the voter model, oscillator coupling, …), and multiple options for how you model the topology (graphs with weighted or unweighted edges, adaptive wiring or not, topologies based on SBM, scale-free networks, Erdős–Rényi, Watts–Strogatz, or real-world network data, …). This creates a fairly large space of options, most of which have usually already been explored somewhere in the literature.
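As a toy illustration of one point in this space (binary node states, voter-model dynamics, Watts–Strogatz topology; these are my own arbitrary choices, and swapping the graph constructor or the update rule moves you elsewhere in the space):

```python
# Toy example of one point in the (state, dynamics, topology) space:
# binary node states, voter-model dynamics, Watts–Strogatz topology.
import networkx as nx
import numpy as np

rng = np.random.default_rng(1)

# Topology: swap this line for nx.erdos_renyi_graph, nx.barabasi_albert_graph,
# a stochastic block model, real-world network data, etc.
g = nx.watts_strogatz_graph(n=200, k=6, p=0.1, seed=1)

# State: one binary "opinion" per node (could be continuous, a vector, ...).
state = {node: int(rng.integers(0, 2)) for node in g.nodes}

# Dynamics: voter-model update (copy a random neighbour's opinion);
# swap this for Ising/softmax updates, threshold rules, oscillator coupling, ...
nodes = list(g.nodes)
for _ in range(10_000):
    node = nodes[rng.integers(len(nodes))]
    neighbours = list(g.neighbors(node))
    if neighbours:
        state[node] = state[neighbours[rng.integers(len(neighbours))]]

print("fraction holding opinion 1:", np.mean(list(state.values())))
```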
Possibly the single most important thing to know about this: there are universality classes of systems which exhibit similar behaviour, so you can often ignore the details of the dynamics/topology/state representation.
Overall I would suggest approaching this with some intellectual humility and studying the existing research more, rather than trying to reinvent a large part of network science on LessWrong. (My guess is something like >2000 research years have been spent on the topic, often by quite good people.)
It would be cool to try some style-matching between the text and images. Ultimately, there could be some “personality vector” used in both image and text generation. (A very crude version could be to create an NN translator from the style space to word2vec space and include the resulting words in the GPT prompts.)
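A minimal sketch of that crude version; everything here (the dimensions, the vocabulary, the random embeddings) is hypothetical and only illustrates the data flow:

```python
# Hypothetical sketch of the "crude version": map an image style vector into
# word2vec space and append the nearest vocabulary words to a text prompt.
# Dimensions, vocabulary, and embeddings below are made up for illustration.
import torch
import torch.nn as nn

STYLE_DIM, W2V_DIM = 64, 300

# The "NN translator" from style space to word2vec space. In practice it would
# be trained on (style vector, caption-word embedding) pairs; here it is left
# untrained just to show the data flow.
translator = nn.Sequential(
    nn.Linear(STYLE_DIM, 256), nn.ReLU(),
    nn.Linear(256, W2V_DIM),
)

vocab = ["serene", "chaotic", "minimal", "ornate", "gloomy", "playful"]
word_vectors = torch.randn(len(vocab), W2V_DIM)   # stand-in for real word2vec rows

style = torch.randn(STYLE_DIM)                    # stand-in for an image style embedding
projected = translator(style)

# Cosine similarity to every vocabulary word; take the closest few as prompt keywords.
sims = torch.cosine_similarity(projected.unsqueeze(0), word_vectors)
keywords = [vocab[i] for i in sims.topk(3).indices.tolist()]

prompt = "Write a short scene description. Style keywords: " + ", ".join(keywords)
print(prompt)
```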
As I see it, a big part of the problem is that there is an inherent tension between “concrete outcomes avoiding general concerns with human models” and “how systems interacting with humans must work”. I would expect that the more you want to avoid general concerns with human models, the more “impractical” suggestions you get; in other words, the tension between “Problems with h.m.” and “Difficulties without h.m.” is a tradeoff you cannot avoid by conceptualisation.
I would suggest using grounding in QFT not as an example of an obviously wrong conceptualisation, but as a useful benchmark of “actually human-model-free”. Comparison to the benchmark may then serve as a heuristic pointing to where (at least implicit) human modelling creeps in. In the above-mentioned example of avoiding side effects, the way the “coarse-graining” of the state space is done is actually a point where Goodharting may happen, and thinking in that direction can maybe even lead to some intuitions about how much information about humans got in.
One possible counterargument to the conclusion of the OP is that the main “tuneable” parameters we are dealing with are I. “modelling humans explicitly vs. modelling humans implicitly” and II. “total amount of human modelling”. Then it is possible that competitive systems exist only in some part of this space, and by pushing hard on the “total amount of human modelling” parameter we can get systems which do less human modelling, but when they do it, it happens mostly in implicit, hard-to-understand ways.
I’m afraid it is generally infeasible to avoid modelling humans at least implicitly. One reason for that is that basically any practical ontology we use is implicitly human. In a sense the only knowledge that is not even implicitly human is quantum field theory (and even that is not clear).
For example: while human-independent methods for measuring negative side effects seem human-independent, it seems to me a lot of ideas about humans creep into the details. The proposals I’ve seen generally depend on some coarse-graining of states: you at least want to somehow remove time from the state, but generally you do the coarse-graining based on… actually, what humans value. (If this research agenda were really trying to avoid implicit human models, I would expect people to spend a lot of effort on measures of quantum entanglement, decoherence, and similar topics.)
Just a few comments
In the abstract, one open problem about “non-goal-directed agents” is “when do they turn into goal-directed ones?”; this seems similar to the problem of inner optimizers, at least in the sense that solutions which would prevent the emergence of inner optimizers would likely also work for non-goal-directed things.
Among the “alternative solutions”, in my view, what is under-investigated are attempts to limit capabilities, i.e. to make “bounded agents”. One intuition behind this is that humans are functional precisely because goals and utilities are “broken” in a way compatible with our planning and computational bounds. I’m worried that efforts in this direction got bucketed with “boxing”, and boxing got a vibe of being uncool. (By making something bounded I mean, for example, making bit-flips costly in a way that is tied to physics, not naive solutions like “just don’t connect it to the internet”.)
I’m particularly happy about your points on the standard claims about expected utility maximization. My vague impression is that too many people on LW kind of read the standard texts, note that there is a persuasive text from Eliezer on the topic, and take the matter as settled.
Not only is it hard to disentangle manipulation and explanation; it is actually difficult to disentangle even manipulation and simply asking the human about their preferences (like here).
Manipulation via incorrect “understanding” is IMO a somewhat easier problem (understanding can possibly be tested by something like simulating the human’s capacity to predict). Manipulation via messing with our internal multi-agent system of values seems subtler and harder. (You can imagine an AI roughly in the shape of Robin Hanson, explaining to one part of the mind how some of the other parts work. Or just drawing the attention of consciousness to some sub-agents and not others.)
My impression is that in full generality it is unsolvable, but something like starting with an imprecise model of approval / a utility function learned via ambitious value learning and restricting explanations/questions/manipulation by that might work.
One hypothesis for why we do so well: we “simulate” other people on very similar hardware, and with a relatively similar mind (when compared to the abstract set of possible planners), which is a sort of strong implicit prior. (Some evidence for that: we have much more trouble inferring the goals of other people whose brains function far from what’s usual on some dimension.)
As Raemon noted, the mentorship bottleneck actually is a bottleneck. Senior researchers who should mentor are the most bottlenecked resource in the field, and the problem is unlikely to be solved by financial or similar incentives. Pushing motivation too hard is probably wrong, because mentoring competes with time to do research, evaluate grants, etc. What can be done is:
improve the utilization of mentors’ time (e.g. by mentoring teams of people instead of individuals)
do what can be done on peer-to-peer basis
use mentors from other fields to teach people generic skills, e.g. how to do research
prepare better materials for onboarding
“Is there another way to spend money that seems clearly more cost-effective at this point, and if so what?” In my opinion, for example, the AI safety camps were significantly more effective. I have maybe 2-3 ideas which would likely be more effective (sorry, but shareable in private only).
Btw, when it comes to any practical implications, both of these repugnant conclusions depend on a likely incorrect aggregation of utilities. If we aggregate utilities with logarithms/exponentiation in the right places, and assume the resources are limited, the answer to the question “what is the best population given the limited resources?” is not repugnant.
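A toy illustration of what I mean, with stylized numbers of my own (not a full population-ethics model): with a fixed resource budget and logarithmic individual utility, the welfare-maximising population is finite, and the huge “barely worth living” population comes out strictly worse:

```python
# Toy model: a fixed resource budget R is split equally among n people, each
# with logarithmic utility in their share, so total welfare = n * log(R / n).
# The maximum is at n = R / e: a finite population with per-capita utility ~1,
# not an enormous population with utility barely above zero.
import numpy as np

R = 1000.0                       # fixed resource budget (arbitrary units)
ns = np.arange(1, 5000)
welfare = ns * np.log(R / ns)

best = ns[np.argmax(welfare)]
print("welfare-maximising population:", best)                    # ~ R / e ≈ 368
print("per-capita utility at the optimum:", np.log(R / best))    # ≈ 1
print("welfare with 4000 people near subsistence:", 4000 * np.log(R / 4000))  # large and negative
```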