Another possible risk: Accidentally swallowing the iodine. This happened to me. I was using a squeezable nasal irrigation device, squirted some of the mixture into my mouth, and it went right down my throat. I called Poison Control, followed their instructions (IIRC they told me to eat a lot of starchy food, and I think I also took some activated charcoal), and ended up being fine.
The older I get and the more I use the internet, the more skeptical I become of downvoting.
Reddit is the only major social media site that has downvoting, and reddit is also (in my view) the social media site with the biggest groupthink problem. People really seem to dislike being downvoted, which causes them to cluster in subreddits full of the like-minded, taking potshots at those who disagree instead of having a dialogue. Reddit started out as one of the most intelligent sites on the internet due to its programmer-discussion origins; the decline has been fairly remarkable IMO. Especially when it comes to any sort of controversial or morality-related dialogue, reddit commenters seem to be participating in a Keynesian beauty contest more than they are thinking.
When I look at the stuff that other people downvote, their downvotes often seem arbitrary and capricious. (It can be hard to notice this, since my independent opinion of the content is hard to separate from my downvote-colored opinion.) When I get the impulse to downvote something, it’s usually not the best side of me that’s coming out. And yet getting downvoted still aggravates me a lot. My creativity and enthusiasm are noticeably diminished for perhaps 24-48 hours afterwards. Getting downvoted doesn’t teach me anything beyond just “don’t engage with those people”, often with an added helping of “screw them”.
We have good enough content-filtering mechanisms nowadays that in principle, I don’t think people should be punished for posting “bad” content. It should be easy to arrange things so “good” content gets the lion’s share of the attention.
I’d argue the threat of punishment is most valuable when people can clearly predict what’s going to produce punishment, e.g. committing a crime. For getting downvoted, the punishment is arbitrary enough that it causes a big behavioral no-go zone.
The problem isn’t that people might downvote your satire. The problem is that human psychology is such that even an estimated 5% chance of your satire being downvoted is enough to deter you from posting it, since in the ancestral environment social exclusion was asymmetrically deadly relative to social acceptance. Conformity is the natural result.
Specific proposals:
- Remove the downvote button, and when the user hits “submit” on their post or comment, an LLM reads the post or comment and checks it against a long list of site guidelines. The LLM flags potential issues to the user, and says: “You can still post this if you want, but since it violates 3 of the guidelines, it will start out with a score of −3. Alternatively, you can rewrite it and submit it to me again.” That gets you quality control without the capricious-social-exclusion aspect. (A rough sketch of this flow follows after this list.)
- Have specific sections of the site, or specific times of the year, where the voting gets turned off. Or keep the voting on, but anonymize the post score and the user who posted it, so your opinion isn’t colored by the content’s current score / user reputation.
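Here’s a rough sketch of what the submit-time guideline check from the first proposal might look like. The `call_llm` helper, the guideline list, and the scoring rule are all placeholder assumptions of mine, not any existing API:

```python
# Rough sketch of the "LLM checks your draft against the guidelines" proposal.
# `call_llm` is a placeholder for whatever chat-completion API the site uses;
# the guideline list and scoring rule are illustrative, not a real spec.

GUIDELINES = [
    "No personal attacks.",
    "Don't misrepresent the person you're responding to.",
    "Flag speculation as speculation.",
    # ...the site's full guideline list would go here
]

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for an actual LLM API call

def flagged_guidelines(draft: str) -> list[str]:
    """Ask the LLM which guidelines (if any) the draft plausibly violates."""
    numbered = "\n".join(f"{i + 1}. {g}" for i, g in enumerate(GUIDELINES))
    reply = call_llm(
        f"Draft comment:\n{draft}\n\nGuidelines:\n{numbered}\n\n"
        "Reply with the numbers of any guidelines this draft plausibly violates."
    )
    return [GUIDELINES[int(n) - 1] for n in reply.replace(",", " ").split()
            if n.isdigit() and 1 <= int(n) <= len(GUIDELINES)]

def submit(draft: str) -> None:
    flagged = flagged_guidelines(draft)
    if flagged:
        print(f"This violates {len(flagged)} guideline(s), so it will start at a score of -{len(flagged)}.")
        print("You can post it anyway, or rewrite it and submit it again.")
    # ...the site would then post the comment with initial_score = -len(flagged)
```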
This has been a bit of a rant, but here are a couple of links to help point at what I’m trying to say:
- https://vimeo.com/60898177 -- this Onion satire was made over a decade ago. I think it’s worth noting how absurd our internet-of-ubiquitous-feedback-mechanisms seems from the perspective of comedians from the past. (And it is in fact absurd in my view, but it can be hard to see the water you’re swimming in. Browsing an old-school forum without any feedback mechanisms makes the difference seem especially stark. The analogy that’s coming to mind is a party where everyone’s on cocaine, vs a party where everyone is sober.)
- https://celandine13.livejournal.com/33599.html -- classic post, “Errors vs. Bugs and the End of Stupidity”
- If a post gets enough comments that low karma comments can’t get much attention, they still compete with new high-quality comments, and cut into the attention for the latter.
Seems like this could be addressed by changing the comment sorting algorithm to favor recent comments more?
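For example, something like this sketch, where karma is blended with a decaying recency bonus (the bonus size and half-life are made-up illustrative numbers, not a proposal for specific values):

```python
# Illustrative sketch only: blend karma with a recency bonus so new comments
# get a window of visibility before old high-karma comments crowd them out.

def comment_sort_key(karma: float, age_hours: float,
                     recency_bonus: float = 5.0, half_life_hours: float = 24.0) -> float:
    # A brand-new comment gets +5 "virtual karma"; the bonus halves every 24 hours.
    return karma + recency_bonus * 0.5 ** (age_hours / half_life_hours)

# comments.sort(key=lambda c: comment_sort_key(c.karma, c.age_hours), reverse=True)
```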
Using game theory to elect a centrist in the 2024 US Presidential Election
If you think prediction markets are valuable it’s likely because you think they price things well—probably due to some kind of market efficiency… well why hasn’t that efficiency led to the creation of prediction markets...
Prediction markets generate information. Information is valuable as a public good. Failure of public good provision is not a failure of prediction markets.
I suspect the best structure long term will be something like: Use a dominant assurance contract (summary in this comment) to solve the public goods problem and generate a subsidy, then use that subsidy to sponsor a prediction market.
...I mean if you want to do the equivalent of a modern large training run you’ll need trillions of tokens of expert-generated text. So that’s a million experts generating a million tokens each? So, basically a million experts working full-time for years? So something like a hundred billion dollars minimum just to pay them all, plus probably more for the bureaucratic infrastructure needed to ensure they aren’t slacking off or cheating or trying to poison your dataset?
Where are these numbers coming from? They seem way too high. My suggestion is to do a modern large training run in the standard way (next-token prediction), and then fine-tune on experts playing the role of a helpful/honest/harmless chatbot doing CoT. Basically replace RLHF with finetuning on expert chatbot roleplay. Maybe I’m betraying my ignorance here and this idea doesn’t make sense for some reason?
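To make the disagreement concrete, here’s the back-of-envelope I have in mind. The finetuning-set size is my own assumed number, not a figure from the parent comment; the point is just the orders-of-magnitude gap between replacing pretraining data with expert text versus only replacing the finetuning/RLHF stage:

```python
# Back-of-envelope only; every number below is an illustrative assumption.
pretrain_tokens = 1e12        # "trillions of tokens" for a pretraining run (parent's figure)
finetune_tokens = 1e9         # assumed size of an expert chatbot-roleplay finetuning set
tokens_per_expert = 1e6       # "a million tokens each" (parent's figure)

print(pretrain_tokens / tokens_per_expert)  # ~1,000,000 experts if you replace pretraining data
print(finetune_tokens / tokens_per_expert)  # ~1,000 experts if you only replace the finetuning stage
```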
I was editing my comment a fair amount, perhaps you read an old version of it?
And, in terms of demonstrating feasibility, you don’t need to pay any experts to demonstrate the feasibility of this idea. Just take a bunch of ChatGPT responses that are known to be high quality, make a dataset out of them, and use them in the training pipeline I propose, as though they were written by human experts. Then evaluate the quality of the resulting model. If it’s nearly as good as the original ChatGPT, I think you should be good to go.
If I understand correctly, sophisticated steganography will not be incentivized by a language model that’s just trained on next-token prediction (no RLHF). Is that right? (Since the model didn’t generate any of the text in its context window, there’s no method it could use to pass info to itself. The best it could do is pick up on weak-to-nonexistent human-created steganography in its dataset.)
That suggests a simple method to prevent steganography: Instead of having human raters rate RLHF-generated responses, have humans write responses themselves, then finetune on the human-written responses.
That might sound labor-intensive, but I don’t think it has to be. Instead of scoring your AI’s response, you check if the AI-written response is adequate. If it’s adequate, you do nothing. If it’s inadequate, you send the question to an expert who writes an expert-level response, then add that response to the finetuning dataset.
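In code, the triage loop I’m picturing is something like this sketch. `is_adequate`, `ask_expert`, and the dataset store are placeholders; the point is just that expert time is only spent on inadequate responses:

```python
# Hypothetical sketch of the proposed triage loop; the helpers are placeholders, not a real API.

finetuning_dataset = []  # (question, expert_response) pairs for the next finetune

def is_adequate(question: str, ai_response: str) -> bool:
    raise NotImplementedError  # cheap check by a human rater: "is this response good enough?"

def ask_expert(question: str) -> str:
    raise NotImplementedError  # expensive: route to a domain expert who writes a response

def handle(question: str, ai_response: str) -> None:
    if is_adequate(question, ai_response):
        return  # do nothing; no expert time spent
    expert_response = ask_expert(question)
    finetuning_dataset.append((question, expert_response))
```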
Sure, experts are expensive, but so are massive training runs. We know data quality is important at the language corpus stage. Why wouldn’t it be equally important at the finetuning stage? (Also, I’m not sure experts are that expensive, given the glut of PhDs in many fields. And expertise in one field could transfer to others, if you split the expert task into 2 steps: identification of high-quality sources, and use of those sources to answer the user’s query.)
In any case, I imagine it should be possible to detect steganography to an extent, by adding noise to the contents of the context window and checking if that causes unexpectedly high performance degradation. You could have a second AI that creates a paraphrased view of everything in the context window except the most recent 20 words or so—I expect that’s a massive blow to steganography in practice. Note that the cost of such added-noise schemes only needs to be paid at training time, so they can be expensive—e.g. ensembling across the next-token generated from many different distortions of the current text.
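Here’s a sketch of that added-noise check. The paraphraser and the loss function are placeholders (the paraphraser would itself be a second model), and leaving the last ~20 words untouched is just the rule of thumb from above:

```python
# Hypothetical sketch of the paraphrase-based steganography check; both helpers are placeholders.

def paraphrase(text: str) -> str:
    raise NotImplementedError  # second AI rewrites the text, preserving meaning but not wording

def model_loss(context: str, continuation: str) -> float:
    raise NotImplementedError  # e.g. next-token loss of the model under test

def steganography_check(context: str, continuation: str, keep_last_words: int = 20) -> float:
    words = context.split()
    head, tail = words[:-keep_last_words], words[-keep_last_words:]
    noisy_context = paraphrase(" ".join(head)) + " " + " ".join(tail)
    # If performance degrades far more than paraphrasing alone should account for,
    # the model may have been relying on hidden (steganographic) structure in the context.
    return model_loss(noisy_context, continuation) - model_loss(context, continuation)
```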
Sorry, I think I must have misunderstood your comment. When you wrote:
As for LLM agents with weak forward passes: Yes, if we could achieve robust faithful CoT properties, we’d be in pretty damn good shape from an AI control perspective.
I interpreted this to mean that in addition to weak forward passes, there was another thing called “robust faithful CoT properties” that would also need to be achieved.
I now think you meant to indicate that “weak forward passes” was a particular method for achieving “robust faithful CoT properties”.
Thanks a lot for the reply, this is valuable info.
From my perspective, unlike the OP, you seem to generally know what you are doing.
I appreciate the kind words, but I’ve made no systematic effort to acquire knowledge—everything I posted in this thread is just bits and pieces I picked up over the years.
As you can see from elsewhere in this thread, I suspect I might have given myself an internal injury about a month ago from doing deep tissue massage, likely due to being on a low dose of an anticoagulant supplement (nattokinase).
However, I do think this sort of injury is generally rare. And my health would be in far worse shape if it wasn’t for massage.
You stated it as established fact rather than opinion, which caused me to believe that the argument had already been made somewhere, and someone could just send me a link to it.
If the argument hasn’t been made somewhere, perhaps you could write a short post making that argument. Could be a good way to either catalyze research in the area (you stated that you wish to encourage such research), or else convince people that the challenge is insurmountable and a different approach is needed.
No prob.
As a general rule, I think pinching is safer than poking, because you can be more certain that you are just massaging muscle rather than artery or bone. And it seems more effective too, especially if you create slack in the muscle you’re treating. However, pinching is rather hard on your fingers and forearms, so you’re liable to give yourself RSI if you overdo it (which in theory should be treatable with massage, I guess, but you might need to get a friend to do it if you’re no longer able to massage yourself!)
Another thing massage books mention is that you’re technically supposed to always treat a muscle and its antagonist (roughly, the muscle which performs the opposite motion, I believe?) in the same session. If you don’t do this, the antagonist is liable to tense up in response to its complement being released? However, the risk here is more like “annoying, hard-to-diagnose chronic pain” as opposed to the sort of injury that could send you to the ER.
I think there is a lot of alpha in massage therapy. I’ve been doing it for years, and it’s helped with a surprising variety of problems (e.g. had migraines at one point, massaging deep in my shoulder and use of the acupressure pillow I mentioned elsewhere seemed to help a lot). It’d be cool if there were people on LW who were true experts at it, including safety expertise obviously (I don’t consider myself an expert there).
One of the best massage therapists I ever visited was a practitioner of what he called “neuromuscular therapy”. He told me about this site called somasimple.com, made it sound like LessWrong but for discussing the science of chronic pain. That was many years ago though. I think maybe he got his training from the National Association of Myofascial Trigger Point Therapists. IIRC, there are a number of groups like that which are endorsed by the Trigger Point Therapy Workbook that I linked elsewhere in this thread.
I think massage therapist could be a good career for those concerned about AI-driven automation, because some people will always be willing to pay a premium for a human therapist. I believe licensing is done on a state-by-state basis in the US. Perhaps best to check for a state which has licensing reciprocity agreements with other states, if you want some flexibility in your living situation.
Viliam, can you recommend any resources for massage safety? I’ve been doing self-massage for years, it’s saved my career from multiple chronic pain conditions. I try to read about safety when I can, but I don’t know of any good central resource, and this is actually the first time I learned about the varicose veins thing...
It’s not just a matter of the neck being sensitive, it’s also the arteries that go through the neck. You don’t want to massage an artery in that area, because you could knock some plaque off of the inside of the artery and give someone a stroke. Rule of thumb is never massage a pulse, and know where the arteries in the neck go.
For example, this book advises against deep massage in the suboccipital triangle area in the back of your head—the author claims you could give someone a stroke that way. (BTW, I would suggest you probably not do the “quick vertebral artery test” described in that book, I remember finding some stuff online about how it is inaccurate and/or dangerous.)
Similarly, you might not necessarily think of the area behind your jaw and below your ear as part of the neck, but it’s a sensitive area because there’s an artery right around there, and there is a small stalactite-shaped bone you could break right off. That artery goes down the front of your neck and beneath your collarbone. Definitely read a guide for safety before attempting to treat that area—there’s one in the book I linked.
The temple is also not a place to apply heavy pressure. The book I linked recommends just letting your head rest on your fingertips if you want to massage that area. (Or you could make creative use of the acupressure mat I linked elsewhere in this thread, especially if you shave your head first.)
In general, light pressure is safer than heavy pressure, and can be more effective if you go about it right. I like to experiment with sustained pinches and pokes (like, over 2-3 minutes even, if I find a good spot) as I very gradually move inwards in response to tiny, barely perceptible sub-millimeter release sensations in the muscle. This can work as a very slow massage stroke too. (I know I’m not describing this very well, I’m just trying to give people ideas for experiments to try.) Careful not to overdo it though, this sort of approach is very hard on the small muscles in your forearm. Actually, learning forearm massage is a great place to start, because then you have a shot at repairing RSI or other overuse injuries (including from massage!) in your forearm. [EDIT: Note, RSI is often just the tip of the iceberg, you probably have lots of upper body tension if you get RSI, and that’s quite likely the root cause.] Buying massage hand tools online is another good way to save your forearm muscles. And the book I linked has great ergonomics advice.
I would be wary of deep massage in the abdominal region. You don’t want to damage organs or tear open someone’s abdominal aorta (or even weaken the wall of their aorta). Internal bleeding can be life-threatening. Important organs like the kidneys aren’t always well protected. You could cause organ bruising or worse. EDIT: Risks of internal bleeding or bruising are especially severe if someone is on an anticoagulant like warfarin. Avoiding acupressure could also be wise in that case.
I’m currently recovering from what I believe is an internal injury I gave myself from doing a super intense deep back massage [edit: while on a low dose of an anticoagulant supplement—likely just a bruise]. Prior to that I did many years of massage with ~no issues, although I did try to follow safety tips from massage books.
If you have tense muscles in your abdomen, I think finding creative ways to lie (or even wall sit) using a mat like this is a much safer option than doing massage:
https://www.amazon.com/ProSource-Acupressure-Pillow-Relief-Relaxation/dp/B00I1QCPIK/
It costs negative time to use an acupressure mat if it helps you fall asleep ;-) I’ve tried a lot of things for sleep, and the acupressure mat has been one of my most powerful tools.
The pillow that comes with the mat is a good tool for the back of your neck, another sensitive region I’m wary of massaging. Lots of people have tension there. I sometimes notice my cognition improving after I release the muscles in the back of my neck and the back of my head. I think it’s due to increased blood flow to my brain. The release from the pillow will be most intense if you shave your head first, for full contact.
Seems like there are 2 possibilities here:
1. The majority of the leadership, engineers, etc. at OpenAI/DeepMind/Anthropic don’t agree that we’d be collectively better off if they all shut down.
2. The majority do agree, they just aren’t solving the collective action problem.
If (2) is the case, has anyone thought about using a dominant assurance contract?
The dominant assurance contract adds a simple twist to the crowdfunding contract. An entrepreneur commits to produce a valuable public good if and only if enough people donate, but if not enough donate, the entrepreneur commits not just to return the donor’s funds but to give each donor a refund bonus. To see how this solves the public good problem consider the simplest case. Suppose that there is a public good worth $100 to each of 10 people. The cost of the public good is $800. If each person paid $80, they all would be better off. Each person, however, may choose not to donate, perhaps because they think others will not donate, or perhaps because they think that they can free ride.
Now consider a dominant assurance contract. An entrepreneur agrees to produce the public good if and only if each of 10 people pay $80. If fewer than 10 people donate, the contract is said to fail and the entrepreneur agrees to give a refund bonus of $5 to each of the donors. Now imagine that potential donor A thinks that potential donor B will not donate. In that case, it makes sense for A to donate, because by doing so he will earn $5 at no cost. Thus any donor who thinks that the contract will fail has an incentive to donate. Doing so earns free money. As a result, it cannot be an equilibrium for more than one person to fail to donate. We have only one more point to consider. What if donor A thinks that every other donor will donate? In this case, A knows that if he donates he won’t get the refund bonus, since the contract will succeed. But he also knows that if he doesn’t donate he won’t get anything, but if he does donate he will pay $80 and get a public good which is worth $100 to him, for a net gain of $20. Thus, A always has an incentive to donate. If others do not donate, he earns free money. If others do donate, he gets the value of the public good. Thus donating is a win-win, and the public good problem is solved.[2]
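To spell out the quoted payoffs in code (the numbers and the "all 10 donors are required" structure are taken straight from the quote):

```python
# Payoffs for one potential donor in the quoted example: the good is worth $100 to each
# person, the contribution is $80, the refund bonus is $5, and all 10 donors are required.
VALUE, CONTRIBUTION, REFUND_BONUS = 100, 80, 5

def payoff(i_donate: bool, everyone_else_donates: bool) -> int:
    succeeds = i_donate and everyone_else_donates  # all 10 contributions are required
    if succeeds:
        return VALUE - CONTRIBUTION             # good produced, I paid $80: net +$20
    return REFUND_BONUS if i_donate else 0      # contract fails: only donors get the $5 bonus

assert payoff(True, False) > payoff(False, False)   # others won't donate: $5 > $0
assert payoff(True, True) > payoff(False, True)     # others will donate: $20 > $0
# Donating wins either way, which is the "dominant" in dominant assurance contract.
```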
Maybe this would look something like: We offer a contract to engineers at specific major AI labs. If at least 90% of the engineers at each of those labs sign the contract by the end of 2024, everyone who signed agrees to quit their job en masse. If not, everyone who signed the contract gets $500 at the end of 2024.
I’m guessing that coordination among the leadership has already been tried and failed. But if not, another idea is to structure the dominant assurance contract as an investment round, so it ends up being a financial boost for safety-conscious organizations that are willing to sign the contract, if not enough organizations sign.
One story for why coordination does not materialize:
- Meta engineers self-select for being unconcerned with safety. They aren’t going to quit any time soon. If offered a dominant assurance contract, they won’t sign either early or late.
- DeepMind engineers feel that DeepMind is more responsible than Meta. They think a DeepMind AGI is more likely to be aligned than a Meta AGI, and they feel it would be irresponsible to quit and let Meta build AGI.
- OpenAI engineers feel that OpenAI is more responsible than Meta or DeepMind; by similar logic, it’s irresponsible for them to quit.
- Anthropic engineers feel that Anthropic is more responsible than OpenAI/DeepMind/Meta; by similar logic, it’s irresponsible for them to quit.
Overall, I think people are overrating the importance of a few major AI labs. There are lots of researchers at NeurIPS, mostly not from the big labs in the OP; it feels like people are over-focused on OpenAI/DeepMind/Anthropic due to their visibility and social adjacency.
- AutoGPT isn’t a company, it’s a little open-source project. Any companies working on agents aren’t publicizing their work so far.
They raised $12M: https://twitter.com/Auto_GPT/status/1713009267194974333
You could be right that they haven’t incorporated as a company. I wasn’t able to find information about that.
… How do you define “sufficiently clarified”, and why is that step not subject to miscommunication / the-problem-that-is-isomorphic-to-Goodharting?
Here’s what I wrote previously:
...AutoGPT could be superhuman at these calibration and clarification tasks, if the company collects a huge dataset of user interactions along with user complaints due to miscommunication. [Subtle miscommunications that go unreported are a potential problem—could be addressed with an internal tool that mines interaction logs to try and surface them for human labeling. If customer privacy is an issue, offer customers a discount if they’re willing to share their logs, have humans label a random subset of logs based on whether they feel there was insufficient/excessive clarification, and use that as training data.]
In more detail, the way I would do it would be: I give AutoGPT a task, and it says “OK, I think what you mean is: [much more detailed description of the task, clarifying points of uncertainty]. Is that right?” Then the user can effectively edit that detailed description until (a) the user is satisfied with it, and (b) a model trained on previous user interactions considers it sufficiently detailed. Once we have a detailed task description that’s mutually satisfactory, AutoGPT works from it. For simplicity, assume for now that nothing comes up during the task that would require further clarification (that scenario gets more complicated).
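Concretely, the loop I’m imagining looks something like this sketch (all three helpers are placeholders; the `sufficiently_detailed` check is the model trained on previous user interactions mentioned above):

```python
# Hypothetical sketch of the clarification step; all three helpers are placeholders.

def draft_detailed_description(task: str) -> str:
    raise NotImplementedError  # "OK, I think what you mean is: [detailed description]..."

def user_edits_until_satisfied(description: str) -> str:
    raise NotImplementedError  # the user edits the draft until they're happy with it

def sufficiently_detailed(description: str) -> bool:
    raise NotImplementedError  # model trained on past interactions plus human labels

def clarify_task(task: str) -> str:
    description = draft_detailed_description(task)
    while True:
        description = user_edits_until_satisfied(description)  # (a) user is satisfied
        if sufficiently_detailed(description):                  # (b) model agrees it's detailed enough
            return description
        # otherwise the loop repeats: AutoGPT surfaces what's still ambiguous and the user edits again
```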
So to answer your specific questions:
- The definition of “sufficiently clarified” is based on a model trained from examples of (a) a detailed task description and (b) whether that task description ended up being too ambiguous. Miscommunication shouldn’t be a huge issue because we’ve got a human labeling these examples, so the model has lots of concrete data about what is/is not a good task description.
- If the learned model for “sufficiently clarified” is bad, then sometimes AutoGPT will consider a task “sufficiently clarified” when it really isn’t (isomorphic to Goodharting, also similar to the hallucinations that ChatGPT is susceptible to). In these cases, the user is likely to complain that AutoGPT didn’t do what they wanted, and it gets added as a new training example to the dataset for the “sufficiently clarified” model. So the learned model for “sufficiently clarified” gets better over time. This isn’t necessarily the ideal setup, but it’s also basically what the ChatGPT team does. So I don’t think there is significant added risk. If one accepts the thesis of your OP that ChatGPT is OK, this seems OK too. In both cases we’re looking at the equivalent of an occasional hallucination, which hurts reliability a little bit.
Sure? I mean, presumably it doesn’t do the exact same operations. Surely it’s exploiting its ability to think faster in order to more closely micromanage its tasks, or something. If not, if it’s just ignoring its greater capabilities, then no, it’s not a stronger optimizer.
Recall your original claim: “inasmuch as AutoGPT optimizes strongly, it would end up implement something that looks precisely like what it understood the user to mean, but which would look like a weird unintended extreme from the user’s point of view.”
The thought experiment here is that we take the exact same AutoGPT code and just run it on a faster processor. So no, it’s not “exploiting its ability to think faster in order to more closely micromanage its tasks”. But it does have “greater capabilities” in the sense of doing everything faster—due to a faster processor.
Once AutoGPT is running on a faster processor, I might choose to use AutoGPT more ambitiously. Perhaps I could get a week’s worth of work done in an hour, instead of a day’s worth of work. Or just get a week’s worth of work done in well under an hour. But since it’s the exact same code, your original “inasmuch as AutoGPT optimizes strongly” claim would not appear to apply.
I really dislike how people use the word “optimization” because it bundles concepts together in a way that’s confusing. In this specific case, your “inasmuch as AutoGPT optimizes strongly” claim is true, but only in a very specific sense: it holds if AutoGPT has some model of what the user means, tries to identify the maximal state of the world that corresponds to that understanding, and then works to bring about that state of the world. In the broad sense of an “optimizer”, there are ways to make AutoGPT a stronger “optimizer” that don’t exacerbate this problem, such as running it on a faster processor, or giving it access to new APIs, or even (I would argue) having it micromanage its tasks more closely, as long as that doesn’t affect its notion of “desired states of the world” (e.g. for simplicity, no added task micromanagement when reasoning about “desired states of the world”, but it’s OK in other circumstances). [Caveat: giving access to e.g. new APIs could make AutoGPT more effective at implementing its model of user prefs, so it’s therefore a bigger footgun if that model happens to be bad. But I don’t think new APIs will worsen the user pref model.]
I don’t think that gets you to dangerous capabilities. I think you need the system to have a consequentialist component somewhere, which is actually focused on pursuing the goal.
Cool, well maybe we should get alignment people to work at AutoGPT to influence the AutoGPT people to not develop dangerous capabilities then, by focusing on e.g. imitating experts :-) I’m not actually seeing a disagreement here.
- Yes, but that would require it to be robustly aimed at the goal of faithfully eliciting the user’s preferences and following them. And if it’s not precisely robustly aimed at it, if we’ve miscommunicated what “faithfulness” means, then it’ll pursue its misaligned understanding of faithfulness, which would lead to it pursuing a non-intended interpretation of the users’ requests.
I think this argument only makes sense if it makes sense to think of the “AutoGPT clarification module” as trying to pursue this goal at all costs. If it’s just a while loop that asks clarification questions until the goal is “sufficiently clarified”, then this seems like a bad model. Maybe a while loop design like this would have other problems, but I don’t think this is one of them.
Ability to achieve real-world outcomes. For example, an AutoGPT instance that can overthrow a government is a stronger optimizer than an AutoGPT instance that can at best make you $100 in a week.
OK, so by this definition, using a more powerful processor with AutoGPT (so it just does the exact same operations faster) makes it a more “powerful optimizer”, even though it’s working exactly the same way and has exactly the same degree of issues with Goodharting etc. (just faster). Do I understand you correctly?
I mean, it’s trying to achieve some goal out in the world. The goal’s specification is the “metric”, and while it’s not trying to maliciously “game” it, it is trying to achieve it. The goal’s specification as it understands it, that is, not the goal as it’s intended. Which would be isomorphic to it Goodharting on the metric, if the two diverge.
This seems potentially false depending on the training method. If it’s e.g. being trained to imitate experts, I expect the key question is the degree to which the dataset contains examples of experts following the sort of procedure that would be vulnerable to Goodharting (step 1: identify the goal specification; step 2: try to achieve it as you understand it, without worrying about possible divergence from user intent).
I meant the general dynamic where we have some goal, we designate some formal specification for it, then point an optimization process at the specification, and inasmuch as the intended-goal diverges from the formal-goal, we get unintended results.
Yeah, I just don’t think this is the only way that a system like AutoGPT could be implemented. Maybe it is how current AutoGPT is implemented, but then I encourage alignment researchers to join the organization and change that.
But there could be practical mind designs that are approximately isomorphic to this sort of setup in the limit, and they could have properties that are approximately the same as those of a wrapper-mind.
They could, but people seem to assume they will, with poor justification. I agree it’s a reasonable heuristic for identifying potential problems, but it shouldn’t be the only heuristic.
I’d like to see justification of “under what conditions does speculation about ‘superintelligent consequentialism’ merit research attention at all?” and “why do we think ‘future architectures’ will have property X, or whatever?!”.
One of my mental models for alignment work is “contingency planning”. There are a lot of different ways AI research could go. Some might be dangerous. Others less so. If we can forecast possible dangers in advance, we can try to steer towards safer designs, and generate contingency plans with measures to take if a particular forecast for AI development ends up being correct.
The risk here is “person with a hammer” syndrome, where people try to apply mental models from thinking about superintelligent consequentialists to other AI systems in a tortured way, smashing round pegs into square holes. I wish people would look at the territory more, and do a little bit more blue sky security thinking about unknown unknowns, instead of endlessly trying to apply the classic arguments even when they don’t really apply.
A specific research proposal would be: Develop a big taxonomy or typology of how AGI could work by identifying the cruxes researchers have, then for each entry in your typology, give it an estimated safety rating, try to identify novel considerations which apply to it, and also summarize the alignment proposals which are most promising for that particular entry.
Some other possibilities:
Pretty people self-select towards interests and occupations that reward beauty. If you’re pretty, you’re more likely to be popular in high school, which interferes with the dedication necessary to become a great programmer.
A big reason people are prettier in LA is they put significant effort into their appearance—hair, makeup, orthodontics, weight loss, etc.
Perhaps hunter/gatherer tribes had gender-based specialization of labor. If men are handling the hunting and tribe defense which requires the big muscles, there’s less need for women to pay the big-muscle metabolic cost.