What sort of epistemic infrastructure do you think is importantly missing for the alignment research community?
Introduction to forecasting worksheet
Emotions and Effective Altruism
I remember reading Nate Soares' Replacing Guilt series and identifying strongly with the feeling of Cold Resolve described there. I have since tried to put it into other words and describe it using more familiar emotions, but nothing has really fit.
I think that Liget, an emotion found in an isolated tribe in the Philippines, might describe a similar feeling (except for the head-throwing part). I'm not sure I can explain it better than the linked article does.
Measuring Epistemic Rationality
A taxonomy of objections to AI Risk from the paper:
What’s your take on Elicit?
One project that implements something like this is 'Circles'. I remember it being on hold several years ago, but it seems to be running now—link
When posting a link post instead of a text post, it is not clear what the result will be. There is still an option to write text, which appears as regular text right after submitting, but when the post is viewed (reached from the search bar) only a portion of the text is visible and there is no indication that it is a link post.
It would be much more convenient if a post could be edited using only the keyboard. For example, when adding a link, apart from defining a keyboard shortcut, it should also be possible to press Enter to submit the link. I also think it would be useful to add HTML support.
Also, do you have MathJax, or something similar for writing math?
That’s right, and a poor framing on my part 😊
I am interested in a consensus among academic economists, or in economic arguments for rent control, specifically because I'm mostly interested in utilitarian reasoning. I'd also be curious about what other disciplines have to say.
Some thoughts:
1. More people would probably rank tags if it could be done directly through the tag icon instead of using the pop-up window.
2. When searching for new tags, I'd like the results sorted by relevance (say, with some preference for being a prefix, being a popular tag, and alphabetical ordering).
3. When browsing through all posts tagged with some tag, I'd prefer to see higher-karma posts first, or at least to have karma factored into the ordering.
4. Perhaps it would be easier to have a hierarchy of tags—so that voting for Value Learning also counts as a vote for AI Alignment, say (a minimal sketch of what I mean is below).
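To make point 4 concrete, here is a minimal sketch, assuming a hypothetical parent-tag mapping; it only illustrates the idea of votes propagating up a hierarchy, not anything the forum actually does.

```python
# A minimal sketch of the tag-hierarchy idea above (my own illustration, not how
# the forum actually works): a vote on a child tag also counts, possibly with a
# discount, toward each of its ancestors.
TAG_PARENTS = {"Value Learning": "AI Alignment", "AI Alignment": None}  # hypothetical hierarchy

def apply_vote(scores, tag, weight=1.0, decay=0.5):
    """Add `weight` to `tag`'s relevance score and a decayed share to every ancestor."""
    while tag is not None:
        scores[tag] = scores.get(tag, 0.0) + weight
        weight *= decay
        tag = TAG_PARENTS.get(tag)
    return scores

print(apply_vote({}, "Value Learning"))  # {'Value Learning': 1.0, 'AI Alignment': 0.5}
```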
And kudos for the neat explanation and an interesting theoretical framework :)
I may be mistaken. I tried reversing your argument, and I've bolded the part that doesn't feel right.
Optimistic errors are no big deal. The agent will randomly seek behaviours that get rewarded, but as long as these behaviours are reasonably rare (and are not that bad) then that’s not too costly.
But pessimistic errors are catastrophic. The agent will systematically make sure to avoid behaviors that get high punishment, and will use loopholes to avoid penalties even if that results in the loss of something really good. So even if these errors are extremely rare initially, they can totally mess up my agent.
So I think that maybe there is inherently an asymmetry between reward and punishment when dealing with maximizers.
But my intuition comes from somewhere else. If the difference between pessimism and optimism is just a shift by a constant, then it ought not to matter for a utility maximizer. But your definition concerns errors conditional on the actual outcome, which should perhaps behave differently.
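To spell out the constant-shift intuition (my notation, not the post's): if the pessimistic reward is just the true reward shifted down by a constant $c$, then for any policy $\pi$ over a fixed horizon of $T$ steps

$$\mathbb{E}_\pi\!\left[\sum_{t=1}^{T} \big(R(s_t,a_t) - c\big)\right] = \mathbb{E}_\pi\!\left[\sum_{t=1}^{T} R(s_t,a_t)\right] - cT,$$

so the ranking over policies is unchanged and the maximizer picks the same policy either way. Any real asymmetry has to come from errors that depend on the state or outcome, not from a uniform shift.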
[Question] Is there an academic consensus around Rent Control?
The (meta-)field of Digital Humanities is fairly new. TODO: Estimating its success and its challenges would help me form a stronger opinion on this matter.
I think that the debate around the incentives to build aligned systems is very interesting, and I'm curious whether Buck and Rohin will formalize a bet around it afterwards.
I feel like Rohin's point of view, compared to Buck's, is that people and companies are in general more responsible, in that they are willing to pay extra costs to ensure safety—not necessarily because a race-to-the-bottom situation has been solved. Is there another source of disagreement, conditional on convergence on the above?
Pessimistic errors are no big deal. The agent will randomly avoid behaviors that get penalized, but as long as those behaviors are reasonably rare (and aren’t the only way to get a good outcome) then that’s not too costly.
But optimistic errors are catastrophic. The agent will systematically seek out the behaviors that receive the high reward, and will use loopholes to avoid penalties when something actually bad happens. So even if these errors are extremely rare initially, they can totally mess up my agent.

I'd love to see someone analyze this thoroughly (or I'll do it if there is interest). I don't think it's that simple, and it seems like this is the main analytical argument.
For example, if the world is symmetric in the appropriate sense in terms of what actions get you rewarded or penalized, and you maximize expected utility instead of satisficing in some way, then the argument is wrong. I’m sure there is good literature on how to model evolution as a player, and the modeling of the environment shouldn’t be difficult.
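As a starting point, here is the kind of toy model I would begin with (entirely my own setup, nothing from the post): a one-shot choice among many actions, where the proxy reward equals the true utility except that a small fraction of actions get a large error, either all optimistic or all pessimistic, and the agent simply picks the argmax of the proxy. Comparing the regret in true utility between the two error signs is one way to probe whether the claimed asymmetry holds, and how it depends on the error rate and magnitude.

```python
import numpy as np

# A toy model (my own setup, not from the post): one-shot choice among N actions.
# The proxy reward equals the true utility, except that a small fraction of
# actions get a large error with sign `error_sign` (+1 optimistic, -1 pessimistic).
# The agent picks the argmax of the proxy; we measure regret in true utility.
def mean_regret(error_sign, n_actions=1000, error_rate=0.01, delta=5.0,
                trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        true_u = rng.normal(size=n_actions)
        proxy = true_u.copy()
        wrong = rng.random(n_actions) < error_rate
        proxy[wrong] += error_sign * delta
        chosen = int(np.argmax(proxy))
        total += true_u.max() - true_u[chosen]  # regret relative to the best action
    return total / trials

print("mean regret, optimistic errors: ", mean_regret(+1))
print("mean regret, pessimistic errors:", mean_regret(-1))
```

Whether any asymmetry found here survives under satisficing, or under the kind of symmetry I described above, is exactly the question this leaves open.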
I find the classification of the elements of robust agency helpful; thanks for the write-up and the recent edit.
I have some issues with Coherence and Consistency:
First, I'm not sure what you mean by these, so I'll take my best guess, which in its idealized form is something like: Coherence is being free of self-contradictions, and Consistency is having the tools to commit oneself to future actions. This is going by the last paragraph of that section:
There are benefits to reliably being able to make trades with your future-self, and with other agents. This is easier if your preferences aren’t contradictory, and easier if your preferences are either consistent over time, or at least predictable over time.
Second, the only case made for Coherence is that it helps you make trades with your future self. My reasons for it are more strongly related to avoiding compartmentalization, resolving confusions, and making clever choices in real time given my limited rationality.
Similarly, I do not view trades with my future self as the most important reason for Consistency. It seems that the main motivator for me is some sort of trade between various parts of me. Or, more accurately, hacking away at my motivation schemes and conscious focus, so that some parts of me will have more votes than others.
Third, there are other mechanisms for Consistency. Accountability is a major one. Also, reducing noise in the environment externally and building actual external constraints can be helpful.
Fourth, Coherence can be generalized to a skill that allows you to use your gears-level understanding of yourself and your agency to update your gears toward whatever would be most useful. This makes me wonder whether the scope here is too large, and whether gears-level understanding and deliberate agency are really that related to the main points. These may all help one to be trustworthy, in that one's reasoning can be judged to be adequate—including by oneself—which is the main thing I'm taking from here.
Fifth (sorta), I have reread the last section, and I think I now understand that your main motivation for Coherence and Consistency is that conversations between rationalists can be made much more effective, in that they can more easily understand each other's point of view. I view this as related to Game Theoretic Soundness more than to the internal benefits of Coherence and Consistency, which are probably more meaningful overall.
Non-Bayesian utilitarians that are ambiguity-averse sometimes need to sacrifice "expected utility" to gain more certainty (in quotes because it need not be well defined).
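A toy illustration with my own numbers, assuming a maximin-expected-utility style of ambiguity aversion (one standard way to model it, not the only one): act $A$ pays 10 if some hypothesis is true and 0 otherwise, the probability of the hypothesis is only known to lie in $[0.3, 0.7]$, and a safe act $B$ pays 4 for sure. Then

$$\mathbb{E}_{p=0.5}[A] = 5 > 4 = \mathbb{E}[B], \qquad \min_{p \in [0.3,\,0.7]} \mathbb{E}_p[A] = 3 < 4 = \mathbb{E}[B],$$

so a Bayesian using the best-guess prior takes the gamble, while the ambiguity-averse agent takes the sure 4, sacrificing "expected utility" for certainty.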
After posting, I tried to change a link post to a text post. It seemed possible when editing the original post, but I later discovered that the changes were not kept and that the post is still in the link format.
What are the best examples of progress in AI Safety research that we think have actually reduced x-risk?
(Instead of operationalizing this explicitly, I'll note that the motivation is to understand whether doing more work on technical AI Safety research is directly beneficial, as opposed to mostly irrelevant or having only second-order effects.)