one man’s modus tollens is another man’s modus ponens:
“making progress without empirical feedback loops is really hard, so we should get feedback loops where possible”
“in some cases (i.e close to x-risk), building feedback loops is not possible, so we need to figure out how to make progress without empirical feedback loops. this is (part of) why alignment is hard”
Yeah something in this space seems like a central crux to me.
I personally think (as a person generally in the MIRI-ish camp of “most attempts at empirical work are flawed/confused”), that it’s not crazy to look at the situation and say “okay, but, theoretical progress seems even more flawed/confused, we just need to figure out some way of getting empirical feedback loops.”
I think there are some constraints on how the empirical work can possibly work. (I don’t think I have a short thing I could write here, I have a vague hope of writing up a longer post on “what I think needs to be true, for empirical work to be helping rather than confusedly not-really-helping”)
you gain general logical facts from empirical work, which can aid in providing a blurry image of the manifold that the precise theoretical work is trying to build an exact representation of
This model is too oversimplified! Reality is more complex than this model suggests, making it less useful in practice. We should really be taking these into account. [optional: include jabs at outgroup]
This model is too complex! It takes into account a bunch of unimportant things, making it much harder to use in practice. We should use this simplified model instead. [optional: include jabs at outgroup]
Sometimes this even results in better models over time.
Quite a large proportion of my 1:1 arguments start when I express some low expectation of the other person’s argument being correct. This is almost always taken to mean that I believe that some opposing conclusion is correct. Usually I have to give up before being able to successfully communicate the distinction, let alone addressing the actual disagreement.
Schemes for taking multiple unaligned AIs and trying to build an aligned system out of the whole
I think this is just not possible.
Schemes for taking aligned but less powerful AIs and leveraging them to align a more powerful AI (possibly with amplification involved)
This breaks if there are cases where supervising is harder than generating, or if there is a discontinuity. I think it’s plausible something like this could work but I’m not super convinced.
Some aspirational personal epistemic rules for keeping discussions as truth seeking as possible (not at all novel whatsoever, I’m sure there exist 5 posts on every single one of these points that are more eloquent)
If I am arguing for a position, I must be open to the possibility that my interlocutor may turn out to be correct. (This does not mean that I should expect to be correct exactly 50% of the time, but it does mean that if I feel like I’m never wrong in discussions then that’s a warning sign: I’m either being epistemically unhealthy or I’m talking to the wrong crowd.)
If I become confident that I was previously incorrect about a belief, I should not be attached to my previous beliefs. I should not incorporate my beliefs into my identity. I should not be averse to evidence that may prove me wrong. I should always entertain the possibility that even things that feel obviously true to me may be wrong.
If I convince someone to change their mind, I should avoid saying things like “I told you so”, or otherwise trying to score status points out of it.
I think in practice I adhere closer to these principles than most people, but I definitely don’t think I’m perfect at it.
(Sidenote: it seems I tend to voice my disagreement on factual things far more often (though not maximally) compared to most people. I’m slightly worried that people will interpret this as me disliking them or being passive aggressive or something—this is typically not the case! I have big disagreements about the-way-the-world-is with a bunch of my closest friends and I think that’s a good thing! If anything I gravitate towards people I can have interesting disagreements with.)
I should always entertain the possibility that even things that feel obviously true to me may be wrong.
I find it a helpful framing to instead allow things that feel obviously false to become more familiar, giving them the opportunity to develop a strong enough voice to explain how they are right. That is, the action is on the side of unfamiliar false things, clarifying their meaning and justification, rather than on the side of familiar true things, refuting their correctness. It’s harder to break out of a familiar narrative from within.
No noticeable effects from vitamin D (both with and without K2), even though I used to live somewhere where the sun barely shines and also I never went outside, so I was almost certainly deficient.
I tried Selenium (200mg) twice and both times I felt like utter shit the next day.
Glycine (2g) for some odd reason makes me energetic, which makes it really bad as a sleep aid. 1g taken a few hours before bedtime is substantially less disruptive to sleep, but I haven’t noticed substantial improvements.
Unlike oral phenylephrine, intranasal phenylephrine does things, albeit very temporarily, and is undeniably the most effective thing I’ve tried, though apparently you’re not supposed to use it too often, so I only use it when it gets really bad.
If it ever becomes a point of dispute in an object level discussion what a word means, you should either use a commonly accepted definition, or taboo the term if the participants think those definitions are bad for the context of the current discussion.
(If the conversation participants are comfortable with it, the new term can occupy the same namespace as the old tabooed term (i.e going forward, we all agree that the definition of X is Y for the purposes of this conversation, and all other definitions no longer apply))
If any of the conversation participants want to switch to the separate discussion of “which definition of X is the best/most useful/etc”, this is fine if all the other participants are fine as well. However, this has to be explicitly announced as a change in topic from the original object level discussion.
Competence: An optimizer is more competent if it achieves the objective more frequently on distribution
Capabilities Robustness: An optimizer is more capabilities robust if it can handle a broader range of OOD world states (and thus possible perturbations) competently.
Generality: An optimizer is more general if it can represent and achieve a broader range of different objectives
Real-world objectives: whether the optimizer is capable of having objectives about things in the real world.
Some observations: it feels like capabilities robustness is one of the big things that makes deception dangerous, because it means that the model can figure out plans that you never intended for it to learn (something not very capabilities robust would just never learn how to deceive if you don’t show it). This feels like the critical controller/search-process difference: controller generalization across states is dependent on the generalization abilities of the model architecture, whereas search processes let you think about the particular state you find yourself in. The actions that lead to deception are extremely OOD, and a controller would have a hard time executing the strategy reliably without first having seen it, unless NN generalization is wildly better than I’m anticipating.
Real world objectives is definitely another big chunk of deception danger; caring about the real world leads to nonmyopic behavior (though maybe we’re worried about other causes of nonmyopia too? not sure tbh). I’m actually not sure how I feel about generality: on the one hand, it feels intuitive that systems that are only able to represent one objective have got to be in some sense less able to become more powerful just by thinking more; on the other hand I don’t know what a rigorous argument for this would look like. I think the intuition relates to the idea of general reasoning machinery being the same across lots of tasks, and this machinery being necessary to do better by thinking harder, and so any model without this machinery must be weaker in some sense. I think this feeds into capabilities robustness (or lack thereof) too.
Examples of where things fall on these axes:
A rock would be none of the properties.
A pure controller (i.e a thermostat, “pile of heuristics”) can be competent, but not as capabilities robust, not general at all, and have objectives over the real world.
An analytic equation solver would be perfectly competent and capabilities robust (if it always works), not very general (it can only solve equations), and not be capable of having real world objectives.
A search based process can be competent, would be more capabilities robust and general, and may have objectives over the real world.
A deceptive optimizer is competent, capabilities robust, and definitely has real world objectives
Another generator-discriminator gap: telling whether an outcome is good (outcome->R) is much easier than coming up with plans to achieve good outcomes. Telling whether a plan is good (plan->R) is much harder, because you need a world model (plan->outcome) as well, but for very difficult tasks it still seems easier than just coming up with good plans off the bat. However, it feels like the world model is the hardest part here, not just because of embeddedness problems, but in general because knowing the consequences of your actions is really really hard. So it seems like for most consequentialist optimizers, the quality of the world model actually becomes the main thing that matters.
This also suggests another dimension along which to classify our optimizers: the degree to which they care about consequences in the future (I want to say myopia but that term is already way too overloaded). This is relevant because the further in the future you care about, the more robust your world model has to be, as errors accumulate the more steps you roll the model out (or the more abstraction you do along the time axis). Very low confidence but maybe this suggests that mesaoptimizers probably won’t care about things very far in the future because building a robust world model is hard and so perform worse on the training distribution, so SGD pushes for more myopic mesaobjectives? Though note, this kind of myopia is not quite the kind we need for models to avoid caring about the real world/coordinating with itself.
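A rough way to see the error-accumulation point, under a toy assumption that is mine and not part of the argument above: suppose each step of a world-model rollout is independently correct with probability p. Then an n-step rollout is fully correct with probability p^n, and the usable horizon shrinks quickly as p drops. The function names and the 50% target are hypothetical, chosen only for illustration.

```python
import math


def rollout_accuracy(per_step_accuracy: float, horizon: int) -> float:
    """Probability that every step of a rollout is correct.

    Toy assumption: step errors are independent, so accuracies multiply.
    """
    return per_step_accuracy ** horizon


def max_horizon(per_step_accuracy: float, target: float) -> int:
    """Longest horizon whose rollout accuracy stays at or above `target`."""
    if not (0.0 < per_step_accuracy < 1.0 and 0.0 < target < 1.0):
        raise ValueError("both arguments must be strictly between 0 and 1")
    return math.floor(math.log(target) / math.log(per_step_accuracy))


if __name__ == "__main__":
    # How far ahead can the model look before it is more likely wrong than right?
    for p in (0.9, 0.99, 0.999):
        print(f"per-step accuracy {p}: horizon {max_horizon(p, target=0.5)}")
```

In this sketch a model that is 90% accurate per step is only usable for a handful of steps, while one that is 99% accurate per step gets roughly ten times further, which is the sense in which caring further into the future demands a much more robust world model.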
A thought pattern that I’ve noticed myself and others falling into sometimes: Sometimes I will make arguments about things from first principles that look something like “I don’t see any way X can be true, it clearly follows from [premises] that X is definitely false”, even though there are people who believe X is true. When this happens, it’s almost always unproductive to continue to argue on first principles, but rather I should do one of: a) try to better understand the argument and find a more specific crux to disagree on or b) decide that this topic isn’t worth investing more time in, register it as “not sure if X is true” in my mind, and move on.
For many such questions, “is X true” is the wrong question. This is common when X isn’t a testable proposition, it’s a model or assertion of causal weight. If you can’t think of existence proofs that would confirm it, try to reframe as “under what conditions is X a useful model?”.
One possible model of AI development is as follows: there exists some threshold beyond which capabilities are powerful enough to cause an x-risk, and such that we need alignment progress to be at the level needed to align that system before it comes into existence. I find it informative to think of this as a race where for capabilities the finish line is x-risk-capable AGI, and for alignment this is the ability to align x-risk-capable AGI. In this model, alignment reaching its finish line before capabilities does is necessary but not sufficient for good outcomes: if alignment doesn’t make it there first, then we automatically lose, but even if it does, if alignment doesn’t continue to improve proportional to capabilities, we might also fail at some later point. However, I think it’s plausible we’re not even on track for the necessary condition, so I’ll focus on that within this post.
Given my distributions over how difficult AGI and alignment respectively are, and the amount of effort brought to bear on each of these problems, I think there’s a worryingly large chance that we just won’t have the alignment progress needed at the critical juncture.
I also think it’s plausible that at some point before when x-risks are possible, capabilities will advance to the point that the majority of AI research will be done by AI systems. The worry is that after this point, both capabilities and alignment will be similarly benefitted by automation, and if alignment is behind at the point when this happens, then this lag will be “locked in” because an asymmetric benefit to alignment research is needed to overtake capabilities if capabilities is already ahead.
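The "locked in" worry can be made concrete with a toy version of the race. This is only a sketch under assumptions that are mine: progress is linear, both finish lines are normalized to 1.0, and automation multiplies each field's rate by a fixed factor from some year onward. All numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Field:
    rate: float     # progress per year from human effort (hypothetical units)
    speedup: float  # multiplicative boost once research is automated


def finish_time(field: Field, automation_start: float, finish_line: float = 1.0) -> float:
    """Year at which the field reaches its finish line."""
    progress_at_switch = field.rate * automation_start
    if progress_at_switch >= finish_line:
        # Finished before automation ever kicked in.
        return finish_line / field.rate
    remaining = finish_line - progress_at_switch
    return automation_start + remaining / (field.rate * field.speedup)


def alignment_wins(capabilities: Field, alignment: Field, automation_start: float) -> bool:
    """True if alignment reaches its finish line no later than capabilities."""
    return finish_time(alignment, automation_start) <= finish_time(capabilities, automation_start)


if __name__ == "__main__":
    # Alignment is behind when automation starts at year 10 (0.3 vs 0.5 progress).
    symmetric = alignment_wins(Field(0.05, 10), Field(0.03, 10), automation_start=10)
    asymmetric = alignment_wins(Field(0.05, 10), Field(0.03, 25), automation_start=10)
    print("same speedup for both:", symmetric)
    print("2.5x larger speedup for alignment:", asymmetric)
```

In this toy setup, an equal speedup never lets the lagging field catch up, no matter how large the speedup is; only a sufficiently asymmetric benefit to alignment changes the outcome.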
There are a number of areas where this model could be violated:
Capabilities could turn out to be less accelerated than alignment by AI assistance. It seems like capabilities is mostly just throwing more hardware at the problem and scaling up, whereas alignment is much more conceptually oriented.
After research is mostly/fully automated, orgs could simply allocate more auto-research time to alignment than AGI.
Alignment(/coordination to slow down) could turn out to be easy. It could turn out that applying the same amount of effort to alignment and AGI results in alignment being solved first.
However, I don’t think these violations are likely, for the following respective reasons:
It’s plausible that our current reliance on scaling is a product of our theory not being good enough and that it’s already possible to build AGI with current hardware if you have the textbook from the future. Even if the strong version of the claim isn’t true, one big reason that the bitter lesson is true is that bespoke engineering is currently expensive, and if it suddenly became a lot cheaper we would see a lot more of it and consequently squeeze more out of the same hardware. It also seems likely that before total automation, there will be a number of years where automation is best modelled as a multiplicative factor on human researcher effectiveness. In that case, because of the sheer number of capabilities researchers compared to alignment researchers, alignment researchers would have to benefit a lot more to just break even.
If it were the case that orgs would pivot, I would expect them to currently be allocating a lot more to alignment than they do currently. While it’s still plausible that orgs haven’t allocated more to alignment because they think AGI is far away, and that a world where automated research is a thing is a world where orgs would suddenly realize how close AGI is and pivot, that hypothesis hasn’t been very predictive so far. Further, because I expect the tech for research automation to be developed at roughly the same time by many different orgs, it seems like not only does one org have to prioritize alignment, but actually a majority weighted by auto research capacity have to prioritize alignment. To me, this seems difficult, although more tractable than the other alignment coordination problem, because there’s less of a unilateralist problem. The unilateralist problem still exists to some extent: orgs which prioritize alignment are inherently at a disadvantage compared to orgs that don’t, because capabilities progress feeds recursively into faster progress whereas alignment progress is less effective at making future alignment progress faster. However, on the relevant timescales this may become less important.
I think alignment is a very difficult problem, and that moreover by its nature it’s incredibly easy to underestimate. I should probably write a full post about my take on this at some point, and I don’t really have space to dive into it here, but a quick meta level argument for why we shouldn’t lean on alignment easiness even if there is a non negligible chance of easiness is that a) given the stakes, we should exercise extreme caution and b) there are very few problems we have that are in the same reference class as alignment, and the few that are even close, like computer security, don’t inspire a lot of confidence.
I think exploring the potential model violations further is a fruitful direction. I don’t think I’m very confident about this model.
Is the correlation between sleeping too long and bad health actually because sleeping too long is actually causally upstream of bad health effects, or only causally downstream of some common cause like illness?
Afaik, both. Like a lot of shit things—they are caused by depression, and they cause depression, horrible reinforcing loop. While the effect of bad health on sleep is obvious, you can also see this work in reverse; e.g. temporary severe sleep restriction has an anti-depressive effect. Notable, though with not many useful clinical applications, as constant sleep deprivation is also really unhealthy.
Unsupervised learning can learn things humans can’t supervise because there’s structure in the world that you need deeper understanding to predict accurately. For example, to predict how characters in a story will behave, you have to have some kind of understanding in some sense of how those characters think, even if their thoughts are never explicitly visible.
Unfortunately, this understanding only has to be structured in a way that makes reading off the actual unsupervised targets (i.e next observation) easy.
An incentive structure for scalable trusted prediction market resolutions
We might want to make a trustable committee for resolving prediction markets. We might be worried that individual resolvers might build up reputation only to exit-scam, due to finite time horizons and non transferability of reputational capital. However, shareholders of a public company are more incentivized to preserve the value of the reputational capital. Based on this idea, we can set something up as follows:
Market creators pay a fee for the services of a resolution company
There is a pool of resolvers who give a first-pass resolution. Each resolver locks up a deposit.
If an appeal is requested, a resolution passes up through a series of committees of more and more senior resolvers
At the top, a vote is triggered among all shareholders
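The escalation path above can be sketched as a small program. This is a hypothetical illustration and not a specification: the class and function names are invented, ties resolve to NO, and slashing the first-pass resolver's deposit when the verdict is overturned is my assumption (the proposal only says the deposit is locked up).

```python
from dataclasses import dataclass
from typing import Callable

Verdict = bool  # True = market resolves YES, False = NO


@dataclass
class Resolver:
    name: str
    deposit: float
    decide: Callable[[str], Verdict]


def majority(votes: list[Verdict]) -> Verdict:
    """Simple majority; a tie resolves to NO in this sketch."""
    return sum(votes) * 2 > len(votes)


def shareholder_vote(holdings: dict[str, float], ballots: dict[str, Verdict]) -> Verdict:
    """Share-weighted vote among all shareholders of the resolution company."""
    yes_shares = sum(shares for holder, shares in holdings.items() if ballots[holder])
    return yes_shares * 2 > sum(holdings.values())


def resolve(question: str, first_pass: Resolver, committees: list[list[Resolver]],
            holdings: dict[str, float], ballots: dict[str, Verdict], appeals: int) -> Verdict:
    """Run the first-pass resolution, then escalate once per appeal."""
    original = verdict = first_pass.decide(question)
    for level in range(appeals):
        if level < len(committees):
            # Each appeal goes to the next, more senior committee.
            verdict = majority([r.decide(question) for r in committees[level]])
        else:
            # Past the last committee, the shareholder vote is final.
            verdict = shareholder_vote(holdings, ballots)
            break
    if verdict != original:
        first_pass.deposit = 0.0  # ASSUMPTION: overturned resolvers lose their deposit
    return verdict
```

A call like `resolve("Did X happen?", resolver, [senior_committee], holdings, ballots, appeals=1)` would return the senior committee's majority verdict and zero out the first-pass resolver's deposit if it differs from their original call.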
It’s amazing how many proposals for dealing with institutional distrust sound a lot like “make a new institution, with the same structure, but with better actors.” You lose me at “trustable committee”, especially when you don’t describe how THOSE humans are motivated by truth and beauty, rather than filthy lucre. Adding more layers of committees doesn’t help, unless you define a “final, un-appealable decision” that’s sooner than the full shareholder vote.
the core of the proposal really boils down to “public companies have less incentive to cash in on reputation and exit scam than individuals”. this proposal is explicitly not “the same structure but with better actors”.
Mathematically proven to be impossible (i.e perfect compression)
Impossible under currently known laws of physics (i.e perpetual motion machines)
A lot of people have thought very hard about it and cannot prove that it’s impossible, but strongly suspect it is impossible (i.e solving NP problems in P)
A lot of people have thought very hard about it, and have not succeeded, but we have no strong reason to expect it to be impossible (i.e AGI)
There is a strong incentive for success, and the markets are very efficient, so that for participants with no edge, success is basically impossible (i.e beating the stock market)
There is a strong incentive for a thing, but a less efficient market, and it seems nobody has done it successfully (i.e a new startup idea that nobody seems to be doing)
Hopefully this is a useful reference for conversations that go like this:
A: Why can’t we just do X to solve Y?
B: You don’t realize how hard Y is, you can’t just think up a solution in 5 minutes
A: You’re just not thinking outside the box, [insert anecdote about some historical figure who figured out how to do a thing which was once considered impossible in some sense]
B: No you don’t understand, it’s like actually not possible, not just like really hard, because of Z
A: That’s what they said about [historical figure]!
(random shower thoughts written with basically no editing)
Sometimes arguments have a beat that looks like “there is extreme position X, and opposing extreme position Y. what about a moderate ‘Combination’ position?” (I’ve noticed this in both my own and others’ arguments)
I think there are sometimes some problems with this.
Usually almost nobody is on the most extreme ends of the spectrum. Nearly everyone falls into the “Combination” bucket technically, so in practice you have to draw the boundary between “combination enough” vs “not combination enough to count as combination”, which is sometimes fraught. (There is a dual argument beat that looks like “people too often bucket things into distinct buckets, what about thinking of things as a spectrum.” I think this makes the opposite mistake, because sometimes there really are relatively meaningful clusters to point to. (this seems quite reminiscent of one Scottpost that I can’t remember the name of rn))
In many cases, there is no easy 1d spectrum. Being a “combination” could refer to a whole set of mutually exclusive sets of views. This problem gets especially bad when the endpoints differ along many axes at once. (Another dual argument here that looks like “things are more nuanced than they seem” which has its own opposite problems)
Of the times where this is meaningful, I would guess it almost always happens when the axis one has identified is interesting and captures some interesting property of the world. That is to say, if you’ve identified some kind of quantity that seems to be very explanatory, just noting that fact actually produces lots of value, and then arguing about how or whether to bucket that quantity up into groups has sharply diminishing value.
In other words, introducing the frame that some particular latent in the world exists and is predictive is hugely valuable; when you say “and therefore my position is in between other people’s”, this is valuable due to the introduction of the frame. The actual heavy lifting happened in the frame, and the part where you point to some underexplored region of the space implied by that frame is actually not doing much work.
I hypothesize one common thing is that if you don’t draw this distinction, then it feels like the heavy lifting comes in the part where you do the pointing, and then you might want to do this within already commonly accepted frames. From the inside I think this feels like existing clusters of people being surprisingly closed minded, whereas the true reason is that the usefulness of the existing frame has been exhausted.
related take: “things are more nuanced than they seem” is valuable only as the summary of a detailed exploration of the nuance that engages heavily with object level cruxes; the heavy lifting is done by the exploration, not the summary
TL;DR: This is basically empty individualism except identity is disentangled from cooperation (accomplished via FDT), and each agent can have its own subjective views on what would count as continuity of identity and have preferences over that. I claim that:
Continuity is a property of the subjective experience of each observer-moment (OM), not necessarily of any underlying causal or temporal relation. (i.e I believe at this moment that I am experiencing continuity, but this belief is a fact of my current OM only. Being a Boltzmann brain that believes I experienced all the moments leading up to that moment feels exactly the same as “actually” experiencing things.)
Each OM may have beliefs about the existence of past OMs, and about causal/temporal relations between those past OMs and the current OM (i.e one may believe that a memory of the past did in fact result from the faithful recording of a past OM to memory, as opposed to being spawned out of thin air as a Boltzmann brain loaded with false memories.)
Something like preference utilitarianism is true and it is ok to have preferences about things you cannot observe, or prefer the world to be in one of two states that you cannot in any way distinguish. As a motivating example, one can have preferences between taking atomic actions (a) enter the experience machine and erase all memories of choosing to be in an experience machine and (b) doing nothing.
Each OM may have preferences for its subjective experience of continuity to correspond to some particular causal structure between OMs, despite this being impossible for that OM to observe or verify. This is where the subjectivity is introduced: each OM can have its own opinion on which other OMs it considers to also be “itself”, and it can have preferences over its self-OMs causally leading to itself in a particular way. This does not have to be symmetric; for instance, your past self may consider your future self to be more self-like than your future self considers your past self.
Continuity of self as viewed by each OM is decoupled from decision theoretic cooperation. i.e they coincide in a typical individual, who considers their past/future selves to be also themself, and cooperates decision theoretically (i.e you consider past/future you getting utility to both count as “you” getting utility). However it is also possible to cooperate to the same extent with OMs with whom you do not consider yourself to be the same self (i.e twin PD), or to not coordinate with yourself (i.e myopia/ADHD).
(related: FDT and myopia being much the same thing; you can think of caring about future selves’ rewards because you consider yourself to implement a similar enough algorithm to your future self as acausal trade. This has the nice property of unifying myopia and preventing acausal trade, in that acausal trade is really just caring about OMs that would not be considered the same “self”. This is super convenient because basically every time we talk about myopia for preventing deceptive mesaoptimization we have to hedge by saying “and also we need to prevent acausal trade somehow”, and this lets us unify the two things.)
Properties of this theory:
This theory allows one to have preferences such as “I want to have lots of subjective experiences into the future” or “I prefer to have physical continuity with my past self” despite rejecting any universal concept of identity, which seems pretty useful.
This theory is fully compatible with all sorts of thought experiments by simply not providing an answer as to which OM your current OM leads to “next”. This is philosophically unsatisfying but I think the theory is still useful nonetheless
Coordination is solved through decision theory, which completely disentangles it from identity.
Imagine if aliens showed up at your doorstep and tried to explain to you that making as many paperclips as possible was the ultimate source of value in the universe. They show pictures of things that count as paperclips and things that don’t count as paperclips. They show you the long rambling definition of what counts as a paperclip from Section 23(b)(iii) of the Declaration of Paperclippian Values. They show you pages and pages of philosophers waxing poetical about how paperclips are great because of their incredible aesthetic value. You would be like, “yeah I get it, you consider this thing to be a paperclip, and you care a lot about them.” You could probably pretty accurately tell whether the aliens would approve of anything you’d want to do. And then you wouldn’t really care, because you value human flourishing, not paperclips. I mean, it’s so silly to care about paperclips, right?
Of course, to the aliens, who have not so subtly indicated that they would blow up the planet and look for a new, more paperclip-loving planet if they were to detect any anti-paperclip sentiments, you say that you of course totally understand and would do anything for paperclips, and that you definitely wouldn’t protest being sent to the paperclip mines.
I think I’d be confused. Do they care about more or better paperclips, or do they care about worship of paperclips by thinking beings? Why would they care whether I say I would do anything for paperclips, when I’m not actually making paperclips (or disassembling myself to become paperclips)?
I thought it would be obvious from context but the answers are “doesn’t really matter, any of those examples work” and “because they will send everyone to the paperclip mines after ensuring there are no rebellious sentiments”, respectively. I’ve edited it to be clearer.
random thoughts. no pretense that any of this is original or useful for anyone but me or even correct
It’s ok to want the world to be better and to take actions to make that happen but unproductive to be frustrated about it or to complain that a plan which should work in a better world doesn’t work in this world. To make the world the way you want it to be, you have to first understand how it is. This sounds obvious when stated abstractly but is surprisingly hard to adhere to in practice.
It would be really nice to have some evolved version of calibration training where I take some historical events and try to predict concrete questions about what happened, and give myself immediate feedback and keep track of my accuracy and calibration. Backtesting my world model so to speak. Might be a bit difficult to measure accuracy improvements due to non-iid-ness of the world, but worth trying the naive thing regardless. Would be interesting to try and autogen using GPT-3.
Feedback loops are important. Unfortunately, from the inside it’s very easy to forget. In particular, setting up feedback loops is often high friction, because it’s hard to measure the thing we care about. Fixing this general problem is probably hard but in the meantime I can try to set up feedback loops for important things like productivity, world modelling, decision making, etc.
Lots of things have very counterintuitive or indirect values. If you don’t take this into account and you make decisions based on maximizing value you might end up McNamara-ing yourself hard.
The stages of learning something: (1) “this is super overwhelming! I don’t think I’ll ever understand it. there are so many things I need to keep track of. just trying to wrap my mind around it makes me feel slightly queasy” (2) “hmm this seems to actually make some sense, I’m starting to get the hang of this” (3) “this is so simple and obviously true, I’ve always known it to be true, I can’t believe anyone doesn’t understand this” (you start noticing that your explanations of the thing become indistinguishable from the things you originally felt overwhelmed by) (4) “this new thing [that builds on top of the thing you just learned] is super overwhelming! I don’t think I’ll ever understand it”
The feeling of regret really sucks. This is a bad thing, because it creates an incentive to never reflect on things or realize your mistakes. This shows up as a quite painful aversion to reflecting on mistakes, doing a postmortem, and improving. I would like to trick my brain into reframing things somehow. Maybe thinking of it as a strict improvement over the status quo of having done things wrong? Or maybe reminding myself that the regret will be even worse if I don’t do anything, because I’ll regret not reflecting in addition.
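For the calibration-training idea above (predicting concrete historical questions and tracking accuracy and calibration), the bookkeeping part is easy to sketch. This is a minimal illustration assuming each forecast is recorded as a (probability, outcome) pair; the function names are hypothetical, and generating the questions themselves (by hand or with a language model) is out of scope here.

```python
from collections import defaultdict


def brier_score(forecasts):
    """Mean squared error between stated probabilities and 0/1 outcomes.

    Lower is better; always answering 0.5 scores 0.25.
    """
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)


def calibration_table(forecasts, n_bins=10):
    """Bucket forecasts by stated probability.

    Returns {bin_index: (mean stated probability, observed hit rate, count)}.
    A well-calibrated forecaster has the first two numbers close in each bin.
    """
    bins = defaultdict(list)
    for p, outcome in forecasts:
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[index].append((p, outcome))
    table = {}
    for index in sorted(bins):
        items = bins[index]
        mean_p = sum(p for p, _ in items) / len(items)
        hit_rate = sum(outcome for _, outcome in items) / len(items)
        table[index] = (mean_p, hit_rate, len(items))
    return table


if __name__ == "__main__":
    # Made-up backtest log: (my stated probability, what actually happened)
    log = [(0.95, 1), (0.95, 1), (0.95, 0), (0.15, 0)]
    print("Brier score:", brier_score(log))
    for index, (mean_p, hit_rate, count) in calibration_table(log).items():
        print(f"bin {index}: said {mean_p:.2f}, happened {hit_rate:.2f}, n={count}")
```

Tracking both numbers over time gives the immediate feedback loop; the caveat about the world not being iid still applies to interpreting any trend in them.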
Thought pattern that I’ve noticed: I seem to have two sets of epistemic states at any time: one more stable set that more accurately reflects my “actual” beliefs that changes fairly slowly, and one set of “hypothesis” beliefs that changes rapidly. Usually when I think some direction is interesting, I alternate my hypothesis beliefs between assuming key claims are true or false and trying to convince myself either way, and if I succeed then I integrate it into my actual beliefs. In practice this might look like alternating between trying to prove something is impossible and trying to exhibit an example, or taking strange premises seriously and trying to figure out their consequences. I think this is probably very confusing to people because usually when talking to people who are already familiar with alignment I’m talking about implications of my hypothesis beliefs, because that’s the frontier of what I’m thinking about, and from the outside it looks like I’m constantly changing my mind about things. Writing this up partially to have something to point people to and partially to push myself to communicate this more clearly.
I think this pattern is common among intellectuals, and I’m surprised it’s causing confusion. Are you labeling your exploratory beliefs and statements appropriately? An “epistemic status” note for posts here goes a long way, and in conversations at work and with friends I often say out loud “I’m exploring here, don’t take it as what I fully believe”.
I think I do a poor job of labelling my statements (at least, in conversation. usually I do a bit better in post format). Something something illusion of transparency. To be honest, I didn’t even realize explicitly that I was doing this until fairly recent reflection on it.
one man’s modus tollens is another man’s modus ponens:
“making progress without empirical feedback loops is really hard, so we should get feedback loops where possible”
“in some cases (i.e close to x-risk), building feedback loops is not possible, so we need to figure out how to make progress without empirical feedback loops. this is (part of) why alignment is hard”
Yeah something in this space seems like a central crux to me.
I personally think (as a person generally in the MIRI-ish camp of “most attempts at empirical work are flawed/confused”), that it’s not crazy to look at the situation and say “okay, but, theoretical progress seems even more flawed/confused, we just need to figure out some way of getting empirical feedback loops.”
I think there are some constraints on how the empirical work can possibly work. (I don’t think I have a short thing I could write here, I have a vague hope of writing up a longer post on “what I think needs to be true, for empirical work to be helping rather than confusedly not-really-helping”)
you gain general logical facts from empirical work, which can aid in providing a blurry image of the manifold that the precise theoretical work is trying to build an exact representation of
A common cycle:
This model is too oversimplified! Reality is more complex than this model suggests, making it less useful in practice. We should really be taking these complexities into account. [optional: include jabs at outgroup]
This model is too complex! It takes into account a bunch of unimportant things, making it much harder to use in practice. We should use this simplified model instead. [optional: include jabs at outgroup]
Sometimes this even results in better models over time.
Corollary to Others are wrong != I am right (https://www.lesswrong.com/posts/4QemtxDFaGXyGSrGD/other-people-are-wrong-vs-i-am-right): It is far easier to convince me that I’m wrong than to convince me that you’re right.
Quite a large proportion of my 1:1 arguments start when I express some low expectation of the other person’s argument being correct. This is almost always taken to mean that I believe that some opposing conclusion is correct. Usually I have to give up before being able to successfully communicate the distinction, let alone addressing the actual disagreement.
The following things are not the same:
Schemes for taking multiple unaligned AIs and trying to build an aligned system out of the whole
I think this is just not possible.
Schemes for taking aligned but less powerful AIs and leveraging them to align a more powerful AI (possibly with amplification involved)
This breaks if there are cases where supervising is harder than generating, or if there is a discontinuity. I think it’s plausible something like this could work but I’m not super convinced.
Some aspirational personal epistemic rules for keeping discussions as truth seeking as possible (not at all novel whatsoever, I’m sure there exist 5 posts on every single one of these points that are more eloquent)
If I am arguing for a position, I must be open to the possibility that my interlocutor may turn out to be correct. (This does not mean that I should expect to be correct exactly 50% of the time, but it does mean that if I feel like I’m never wrong in discussions then that’s a warning sign: I’m either being epistemically unhealthy or I’m talking to the wrong crowd.)
If I become confident that I was previously incorrect about a belief, I should not be attached to my previous beliefs. I should not incorporate my beliefs into my identity. I should not be averse to evidence that may prove me wrong. I should always entertain the possibility that even things that feel obviously true to me may be wrong.
If I convince someone to change their mind, I should avoid saying things like “I told you so”, or otherwise trying to score status points out of it.
I think in practice I adhere closer to these principles than most people, but I definitely don’t think I’m perfect at it.
(Sidenote: it seems I tend to voice my disagreement on factual things far more often (though not maximally) compared to most people. I’m slightly worried that people will interpret this as me disliking them or being passive aggressive or something—this is typically not the case! I have big disagreements about the-way-the-world-is with a bunch of my closest friends and I think that’s a good thing! If anything I gravitate towards people I can have interesting disagreements with.)
I find it a helpful framing to instead allow things that feel obviously false to become more familiar, giving them the opportunity to develop a strong enough voice to explain how they are right. That is, the action is on the side of unfamiliar false things, clarifying their meaning and justification, rather than on the side of familiar true things, refuting their correctness. It’s harder to break out of a familiar narrative from within.
In the spirit of https://www.lesswrong.com/posts/fFY2HeC9i2Tx8FEnK/my-resentful-story-of-becoming-a-medical-miracle , some anecdotes about things I have tried, in the hopes that I can be someone else’s “one guy on a message board.” None of this is medical advice, etc.
No noticeable effects from vitamin D (both with and without K2), even though I used to live somewhere where the sun barely shines and also I never went outside, so I was almost certainly deficient.
I tried Selenium (200mg) twice and both times I felt like utter shit the next day.
Glycine (2g) for some odd reason makes me energetic, which makes it really bad as a sleep aid. 1g taken a few hours before bedtime is substantially less disruptive to sleep, but I haven’t noticed substantial improvements.
Unlike oral phenylephrine, intranasal phenylephrine does things, albeit very temporarily, and is undeniably the most effective thing I’ve tried, though apparently you’re not supposed to use it too often, so I only use it when it gets really bad.
House rules for definitional disputes:
If it ever becomes a point of dispute in an object level discussion what a word means, you should either use a commonly accepted definition, or taboo the term if the participants think those definitions are bad for the context of the current discussion. (If the conversation participants are comfortable with it, the new term can occupy the same namespace as the old tabooed term (i.e going forward, we all agree that the definition of X is Y for the purposes of this conversation, and all other definitions no longer apply))
If any of the conversation participants want to switch to the separate discussion of “which definition of X is the best/most useful/etc”, this is fine if all the other participants are fine as well. However, this has to be explicitly announced as a change in topic from the original object level discussion.
A few axes along which to classify optimizers:
Competence: An optimizer is more competent if it achieves the objective more frequently on distribution
Capabilities Robustness: An optimizer is more capabilities robust if it can handle a broader range of OOD world states (and thus possible perturbations) competently.
Generality: An optimizer is more general if it can represent and achieve a broader range of different objectives
Real-world objectives: whether the optimizer is capable of having objectives about things in the real world.
Some observations: it feels like capabilities robustness is one of the big things that makes deception dangerous, because it means that the model can figure out plans that you never intended for it to learn (something not very capabilities robust would just never learn how to deceive if you don’t show it). This feels like the critical controller/search-process difference: controller generalization across states is dependent on the generalization abilities of the model architecture, whereas search processes let you think about the particular state you find yourself in. The actions that lead to deception are extremely OOD, and a controller would have a hard time executing the strategy reliably without first having seen it, unless NN generalization is wildly better than I’m anticipating.
Real world objectives is definitely another big chunk of deception danger; caring about the real world leads to nonmyopic behavior (though maybe we’re worried about other causes of nonmyopia too? not sure tbh). I’m actually not sure how I feel about generality: on the one hand, it feels intuitive that systems that are only able to represent one objective have got to be in some sense less able to become more powerful just by thinking more; on the other hand I don’t know what a rigorous argument for this would look like. I think the intuition relates to the idea of general reasoning machinery being the same across lots of tasks, and this machinery being necessary to do better by thinking harder, and so any model without this machinery must be weaker in some sense. I think this feeds into capabilities robustness (or lack thereof) too.
Examples of where things fall on these axes:
A rock would have none of these properties.
A pure controller (e.g. a thermostat, “pile of heuristics”) can be competent, but is not as capabilities robust, is not general at all, and can have objectives over the real world.
An analytic equation solver would be perfectly competent and capabilities robust (if it always works), not very general (it can only solve equations), and not be capable of having real world objectives.
A search based process can be competent, would be more capabilities robust and general, and may have objectives over the real world.
A deceptive optimizer is competent, capabilities robust, and definitely has real world objectives.
Another generator-discriminator gap: telling whether an outcome is good (outcome->R) is much easier than coming up with plans to achieve good outcomes. Telling whether a plan is good (plan->R) is much harder, because you need a world model (plan->outcome) as well, but for very difficult tasks it still seems easier than just coming up with good plans off the bat. However, it feels like the world model is the hardest part here, not just because of embeddedness problems, but in general because knowing the consequences of your actions is really really hard. So it seems like for most consequentialist optimizers, the quality of the world model actually becomes the main thing that matters.
This also suggests another dimension along which to classify our optimizers: the degree to which they care about consequences in the future (I want to say myopia but that term is already way too overloaded). This is relevant because the further in the future you care about, the more robust your world model has to be, as errors accumulate the more steps you roll the model out (or the more abstraction you do along the time axis). Very low confidence, but maybe this suggests that mesaoptimizers probably won’t care about things very far in the future: building a robust world model is hard, so mesaoptimizers that care about the far future would perform worse on the training distribution, and SGD pushes for more myopic mesaobjectives? Though note, this kind of myopia is not quite the kind we need for models to avoid caring about the real world/coordinating with themselves.
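A toy illustration of the error accumulation point above. It assumes each rollout step of the world model goes wrong independently with the same probability, which is a simplifying assumption of this sketch and not a claim about real world models; the function names are hypothetical.

```python
import math

def rollout_reliability(per_step_error, horizon):
    """Probability that a rollout of `horizon` steps contains no errors,
    assuming each step fails independently with probability per_step_error."""
    return (1.0 - per_step_error) ** horizon

def max_horizon(per_step_error, min_reliability):
    """Largest horizon n with (1 - per_step_error)**n >= min_reliability.

    Both arguments must be strictly between 0 and 1.
    """
    return math.floor(math.log(min_reliability) / math.log(1.0 - per_step_error))

# Reliability decays geometrically as the rollout gets longer.
for horizon in (1, 10, 100):
    print(horizon, rollout_reliability(0.01, horizon))

# How far can a model with 1% per-step error roll out before it is
# more likely than not to have made at least one error?
print(max_horizon(0.01, 0.5))
```

Under this assumption, caring about twice as many steps into the future requires roughly halving the per-step error to keep the same overall reliability.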
A thought pattern that I’ve noticed myself and others falling into sometimes: Sometimes I will make arguments about things from first principles that look something like “I don’t see any way X can be true, it clearly follows from [premises] that X is definitely false”, even though there are people who believe X is true. When this happens, it’s almost always unproductive to continue to argue on first principles; rather, I should do one of: a) try to better understand the argument and find a more specific crux to disagree on or b) decide that this topic isn’t worth investing more time in, register it as “not sure if X is true” in my mind, and move on.
For many such questions, “is X true” is the wrong question. This is common when X isn’t a testable proposition, it’s a model or assertion of causal weight. If you can’t think of existence proofs that would confirm it, try to reframe as “under what conditions is X a useful model?”.
One possible model of AI development is as follows: there exists some threshold beyond which capabilities are powerful enough to cause an x-risk, and we need alignment progress to be at the level needed to align such a system before it comes into existence. I find it informative to think of this as a race where for capabilities the finish line is x-risk-capable AGI, and for alignment the finish line is the ability to align x-risk-capable AGI. In this model, it is necessary but not sufficient for good outcomes that alignment reach its finish line first: if alignment doesn’t make it there first, then we automatically lose, but even if it does, if alignment doesn’t continue to improve proportional to capabilities, we might also fail at some later point. However, I think it’s plausible we’re not even on track for the necessary condition, so I’ll focus on that within this post.
Given my distributions over how difficult AGI and alignment respectively are, and the amount of effort brought to bear on each of these problems, I think there’s a worryingly large chance that we just won’t have the alignment progress needed at the critical juncture.
I also think it’s plausible that at some point before x-risks are possible, capabilities will advance to the point that the majority of AI research will be done by AI systems. The worry is that after this point, both capabilities and alignment will be similarly benefitted by automation, and if alignment is behind at the point when this happens, then this lag will be “locked in” because an asymmetric benefit to alignment research is needed to overtake capabilities if capabilities is already ahead.
There are a number of areas where this model could be violated:
Capabilities could turn out to be less accelerated than alignment by AI assistance. It seems like capabilities is mostly just throwing more hardware at the problem and scaling up, whereas alignment is much more conceptually oriented.
After research is mostly/fully automated, orgs could simply allocate more auto-research time to alignment than AGI.
Alignment(/coordination to slow down) could turn out to be easy. It could turn out that applying the same amount of effort to alignment and AGI results in alignment being solved first.
However, I don’t think these violations are likely, for the following reasons, respectively:
It’s plausible that our current reliance on scaling is a product of our theory not being good enough and that it’s already possible to build AGI with current hardware if you have the textbook from the future. Even if the strong version of the claim isn’t true, one big reason that the bitter lesson is true is that bespoke engineering is currently expensive, and if it suddenly became a lot cheaper we would see a lot more of it and consequently squeeze more out of the same hardware. It also seems likely that before total automation, there will be a number of years where automation is best modelled as a multiplicative factor on human researcher effectiveness. In that case, because of the sheer number of capabilities researchers compared to alignment researchers, alignment researchers would have to benefit a lot more to just break even.
If it were the case that orgs would pivot, I would expect them to be allocating a lot more to alignment than they currently do. While it’s still plausible that orgs haven’t allocated more to alignment because they think AGI is far away, and that a world where automated research is a thing is a world where orgs would suddenly realize how close AGI is and pivot, that hypothesis hasn’t been very predictive so far. Further, because I expect the tech for research automation to be developed at roughly the same time by many different orgs, it seems like not only does one org have to prioritize alignment, but actually a majority weighted by auto research capacity have to prioritize alignment. To me, this seems difficult, although more tractable than the other alignment coordination problem, because there’s less of a unilateralist problem. The unilateralist problem still exists to some extent: orgs which prioritize alignment are inherently at a disadvantage compared to orgs that don’t, because capabilities progress feeds recursively into faster progress whereas alignment progress is less effective at making future alignment progress faster. However, on the relevant timescales this may become less important.
I think alignment is a very difficult problem, and that moreover by its nature it’s incredibly easy to underestimate. I should probably write a full post about my take on this at some point, and I don’t really have space to dive into it here, but a quick meta level argument for why we shouldn’t lean on alignment easiness even if there is a non negligible chance of easiness is that a) given the stakes, we should exercise extreme caution and b) there are very few problems we have that are in the same reference class as alignment, and the few that are even close, like computer security, don’t inspire a lot of confidence.
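The break-even point in the first reason above can be made concrete with a toy calculation. This sketch assumes a field’s output rate is simply headcount times the automation multiplier, and reads “break even” as automation adding as much absolute output to alignment as it adds to capabilities. Both the model and the numbers are hypothetical, chosen only for illustration.

```python
def break_even_multiplier(capabilities_researchers, alignment_researchers,
                          capabilities_multiplier):
    """Multiplier alignment needs so that automation adds as much absolute
    output to alignment as it adds to capabilities.

    Toy assumption: output rate = headcount * multiplier, so automation adds
    C * (m_c - 1) to capabilities and A * (m_a - 1) to alignment.
    Setting the two equal gives m_a = 1 + (C / A) * (m_c - 1).
    """
    ratio = capabilities_researchers / alignment_researchers
    return 1.0 + ratio * (capabilities_multiplier - 1.0)

# Hypothetical headcounts: ten capabilities researchers per alignment researcher.
# If automation doubles capabilities researchers' effectiveness, alignment
# researchers need an 11x multiplier just to keep the absolute gap from growing.
print(break_even_multiplier(1000, 100, 2.0))  # 11.0
```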
I think exploring the potential model violations further is a fruitful direction. I don’t think I’m very confident about this model.
Is the correlation between sleeping too long and bad health because sleeping too long is actually causally upstream of bad health effects, or only because both are causally downstream of some common cause like illness?
Afaik, both. Like a lot of shit things—they are caused by depression, and they cause depression, horrible reinforcing loop. While the effect of bad health on sleep is obvious, you can also see this work in reverse; e.g. temporary severe sleep restriction has an anti-depressive effect. Notable, though with not many useful clinical applications, as constant sleep deprivation is also really unhealthy.
GPT-2-xl unembedding matrix looks pretty close to full rank (plot is singular values)
Unsupervised learning can learn things humans can’t supervise because there’s structure in the world that you need deeper understanding to predict accurately. For example, to predict how characters in a story will behave, you have to have some kind of understanding in some sense of how those characters think, even if their thoughts are never explicitly visible.
Unfortunately, this understanding only has to be structured in a way that makes reading off the actual unsupervised targets (i.e next observation) easy.
An incentive structure for scalable trusted prediction market resolutions
We might want to make a trustable committee for resolving prediction markets. We might be worried that individual resolvers might build up reputation only to exit-scam, due to finite time horizons and non-transferability of reputational capital. However, shareholders of a public company are more incentivized to preserve the value of the reputational capital. Based on this idea, we can set something up as follows:
Market creators pay a fee for the services of a resolution company
There is a pool of resolvers who give a first-pass resolution. Each resolver locks up a deposit.
If an appeal is requested, a resolution passes up through a series of committees of more and more senior resolvers
At the top, a vote is triggered among all shareholders
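The escalation path in the steps above can be sketched as follows. This is only an illustration: the `Stage` type, the `resolve` function, and the choice to treat the shareholder vote as final are assumptions of this sketch, and fees and resolver deposits are not modelled at all.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Verdict = str  # e.g. "YES" or "NO"

@dataclass
class Stage:
    name: str
    decide: Callable[[str], Verdict]  # maps a market question to a verdict

def resolve(question: str, stages: List[Stage], appeals: int) -> Tuple[str, Verdict]:
    """Walk a question up the escalation ladder.

    stages[0] is the first-pass resolver pool, the middle stages are
    increasingly senior committees, and stages[-1] is the shareholder vote.
    Each appeal moves the question up one stage; this sketch assumes the
    shareholder vote cannot be appealed further.
    """
    level = min(appeals, len(stages) - 1)
    stage = stages[level]
    return stage.name, stage.decide(question)

# Hypothetical ladder with hard-coded verdicts standing in for real decisions.
ladder = [
    Stage("first-pass resolvers", lambda q: "YES"),
    Stage("senior committee", lambda q: "NO"),
    Stage("shareholder vote", lambda q: "NO"),
]
print(resolve("Did event E happen by the deadline?", ladder, appeals=0))
print(resolve("Did event E happen by the deadline?", ladder, appeals=5))
```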
It’s amazing how many proposals for dealing with institutional distrust sound a lot like “make a new institution, with the same structure, but with better actors.” You lose me at “trustable committee”, especially when you don’t describe how THOSE humans are motivated by truth and beauty, rather than filthy lucre. Adding more layers of committees doesn’t help, unless you define a “final, un-appealable decision” that’s sooner than the full shareholder vote.
the core of the proposal really boils down to “public companies have less incentive to cash in on reputation and exit scam than individuals”. this proposal is explicitly not “the same structure but with better actors”.
Levels of difficulty:
Mathematically proven to be impossible (e.g. perfect compression)
Impossible under currently known laws of physics (e.g. perpetual motion machines)
A lot of people have thought very hard about it and cannot prove that it’s impossible, but strongly suspect it is impossible (e.g. solving NP-complete problems in polynomial time)
A lot of people have thought very hard about it, and have not succeeded, but we have no strong reason to expect it to be impossible (e.g. AGI)
There is a strong incentive for success, and the markets are very efficient, so that for participants with no edge, success is basically impossible (e.g. beating the stock market)
There is a strong incentive for a thing, but a less efficient market, and it seems nobody has done it successfully (e.g. a new startup idea that nobody seems to be doing)
Hopefully this is a useful reference for conversations that go like this:
A: Why can’t we just do X to solve Y?
B: You don’t realize how hard Y is, you can’t just think up a solution in 5 minutes
A: You’re just not thinking outside the box, [insert anecdote about some historical figure who figured out how to do a thing which was once considered impossible in some sense]
B: No you don’t understand, it’s like actually not possible, not just like really hard, because of Z
A: That’s what they said about [historical figure]!
(random shower thoughts written with basically no editing)
Sometimes arguments have a beat that looks like “there is extreme position X, and opposing extreme position Y. what about a moderate ‘Combination’ position?” (I’ve noticed this in both my own and others’ arguments)
I think there are sometimes some problems with this.
Usually almost nobody is on the most extreme ends of the spectrum. Nearly everyone falls into the “Combination” bucket technically, so in practice you have to draw the boundary between “combination enough” vs “not combination enough to count as combination”, which is sometimes fraught. (There is a dual argument beat that looks like “people too often bucket things into distinct buckets, what about thinking of things as a spectrum.” I think this makes the opposite mistake, because sometimes there really are relatively meaningful clusters to point to. (this seems quite reminiscent of one Scottpost that I can’t remember the name of rn))
In many cases, there is no easy 1d spectrum. Being a “combination” could refer to a whole set of mutually exclusive sets of views. This problem gets especially bad when the endpoints differ along many axes at once. (Another dual argument here that looks like “things are more nuanced than they seem” which has its own opposite problems)
Of the times where this is meaningful, I would guess it almost always happens when the axis one has identified is interesting and captures some interesting property of the world. That is to say, if you’ve identified some kind of quantity that seems to be very explanatory, just noting that fact actually produces lots of value, and then arguing about how or whether to bucket that quantity up into groups has sharply diminishing value.
In other words, introducing the frame that some particular latent in the world exists and is predictive is hugely valuable; when you say “and therefore my position is in between other people’s”, this is valuable due to the introduction of the frame. The actual heavy lifting happened in the frame, and the part where you point to some underexplored region of the space implied by that frame is actually not doing much work.
I hypothesize one common thing is that if you don’t draw this distinction, then it feels like the heavy lifting comes in the part where you do the pointing, and then you might want to do this within already commonly accepted frames. From the inside I think this feels like existing clusters of people being surprisingly closed minded, whereas the true reason is that the usefulness of the existing frame has been exhausted.
related take: “things are more nuanced than they seem” is valuable only as the summary of a detailed exploration of the nuance that engages heavily with object level cruxes; the heavy lifting is done by the exploration, not the summary
Subjective Individualism
TL;DR: This is basically empty individualism except identity is disentangled from cooperation (accomplished via FDT), and each agent can have its own subjective views on what would count as continuity of identity and have preferences over that. I claim that:
Continuity is a property of the subjective experience of each observer-moment (OM), not necessarily of any underlying causal or temporal relation. (i.e I believe at this moment that I am experiencing continuity, but this belief is a fact of my current OM only. Being a Boltzmann brain that believes I experienced all the moments leading up to that moment feels exactly the same as “actually” experiencing things.)
Each OM may have beliefs about the existence of past OMs, and about causal/temporal relations between those past OMs and the current OM (i.e one may believe that a memory of the past did in fact result from the faithful recording of a past OM to memory, as opposed to being spawned out of thin air as a Boltzmann brain loaded with false memories.)
Something like preference utilitarianism is true and it is ok to have preferences about things you cannot observe, or prefer the world to be in one of two states that you cannot in any way distinguish. As a motivating example, one can have preferences between taking atomic actions (a) enter the experience machine and erase all memories of choosing to be in an experience machine and (b) doing nothing.
Each OM may have preferences for its subjective experience of continuity to correspond to some particular causal structure between OMs, despite this being impossible for that OM to observe or verify. This is where the subjectivity is introduced: each OM can have its own opinion on which other OMs it considers to also be “itself”, and it can have preferences over its self-OMs causally leading to itself in a particular way. This does not have to be symmetric; for instance, your past self may consider your future self to be more self-like than your future self considers your past self.
Continuity of self as viewed by each OM is decoupled from decision theoretic cooperation. i.e they coincide in a typical individual, who considers their past/future selves to be also themself, and cooperates decision theoretically (i.e you consider past/future you getting utility to both count as “you” getting utility). However it is also possible to cooperate to the same extent with OMs with whom you do not consider yourself to be the same self (i.e twin PD), or to not coordinate with yourself (i.e myopia/ADHD).
(related: FDT and myopia being much the same thing; you can think of caring about future selves’ rewards because you consider yourself to implement a similar enough algorithm to your future self as acausal trade. This has the nice property of unifying myopia and preventing acausal trade, in that acausal trade is really just caring about OMs that would not be considered the same “self”. This is super convenient because basically every time we talk about myopia for preventing deceptive mesaoptimization we have to hedge by saying “and also we need to prevent acausal trade somehow”, and this lets us unify the two things.)
Properties of this theory:
This theory allows one to have preferences such as “I want to have lots of subjective experiences into the future” or “I prefer to have physical continuity with my past self” despite rejecting any universal concept of identity, which seems pretty useful.
This theory is fully compatible with all sorts of thought experiments by simply not providing an answer as to which OM your current OM leads to “next”. This is philosophically unsatisfying but I think the theory is still useful nonetheless.
Coordination is solved through decision theory, which completely disentangles it from identity.
Imagine if aliens showed up at your doorstep and tried to explain to you that making as many paperclips as possible was the ultimate source of value in the universe. They show pictures of things that count as paperclips and things that don’t count as paperclips. They show you the long rambling definition of what counts as a paperclip from Section 23(b)(iii) of the Declaration of Paperclippian Values. They show you pages and pages of philosophers waxing poetical about how paperclips are great because of their incredible aesthetic value. You would be like, “yeah I get it, you consider this thing to be a paperclip, and you care a lot about them.” You could probably pretty accurately tell whether the aliens would approve of anything you’d want to do. And then you wouldn’t really care, because you value human flourishing, not paperclips. I mean, it’s so silly to care about paperclips, right?
Of course, to the aliens, who have not so subtly indicated that they would blow up the planet and look for a new, more paperclip-loving planet if they were to detect any anti-paperclip sentiments, you say that you of course totally understand and would do anything for paperclips, and that you definitely wouldn’t protest being sent to the paperclip mines.
I think I’d be confused. Do they care about more or better paperclips, or do they care about worship of paperclips by thinking beings? Why would they care whether I say I would do anything for paperclips, when I’m not actually making paperclips (or disassembling myself to become paperclips)?
I thought it would be obvious from context but the answers are “doesn’t really matter, any of those examples work” and “because they will send everyone to the paperclip mines after ensuring there are no rebellious sentiments”, respectively. I’ve edited it to be clearer.
random thoughts. no pretense that any of this is original or useful for anyone but me or even correct
It’s ok to want the world to be better and to take actions to make that happen but unproductive to be frustrated about it or to complain that a plan which should work in a better world doesn’t work in this world. To make the world the way you want it to be, you have to first understand how it is. This sounds obvious when stated abstractly but is surprisingly hard to adhere to in practice.
It would be really nice to have some evolved version of calibration training where I take some historical events, try to predict concrete questions about what happened, give myself immediate feedback, and keep track of my accuracy and calibration. Backtesting my world model, so to speak. It might be a bit difficult to measure accuracy improvements because the world is non-iid, but it’s worth trying the naive thing regardless. It would be interesting to try autogenerating the questions using GPT-3.
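A minimal sketch of what the bookkeeping for this kind of backtesting could look like, assuming each prediction is logged as a (stated probability, outcome) pair. The function names, the bin count, and the sample log are all hypothetical, and the Brier score is just one common choice of accuracy measure, not something the idea above depends on.

```python
from collections import defaultdict

def brier_score(forecasts):
    """Mean squared gap between stated probability and outcome (lower is better)."""
    return sum((p - float(o)) ** 2 for p, o in forecasts) / len(forecasts)

def calibration_table(forecasts, n_bins=5):
    """Bucket forecasts by stated probability and compare against what happened.

    Returns rows of (bin_low, bin_high, count, mean_stated_prob, observed_freq).
    Well-calibrated forecasts have mean_stated_prob close to observed_freq.
    """
    bins = defaultdict(list)
    for p, outcome in forecasts:
        # Clamp so that p == 1.0 lands in the top bin instead of overflowing.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, outcome))
    table = []
    for idx in sorted(bins):
        items = bins[idx]
        mean_p = sum(p for p, _ in items) / len(items)
        freq = sum(1 for _, o in items if o) / len(items)
        table.append((idx / n_bins, (idx + 1) / n_bins, len(items), mean_p, freq))
    return table

# Hypothetical log: (probability I assigned, whether the event actually happened).
log = [(0.9, True), (0.9, True), (0.8, False), (0.6, True), (0.3, False), (0.1, False)]

print(f"Brier score: {brier_score(log):.3f}")
for low, high, n, mean_p, freq in calibration_table(log):
    print(f"{low:.1f}-{high:.1f}: n={n}, stated={mean_p:.2f}, observed={freq:.2f}")
```

With only a handful of questions per bin the observed frequencies are very noisy, so this is more useful as a running log than as a one-off score.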
Feedback loops are important. Unfortunately, from the inside it’s very easy to forget this. In particular, setting up feedback loops is often high friction, because it’s hard to measure the thing we care about. Fixing this general problem is probably hard, but in the meantime I can try to set up feedback loops for important things like productivity, world modelling, decision making, etc.
self self improvement improvement: feeling guilty about not self improving enough and trying to fix your own ability to fix your own abilities
Lots of things have very counterintuitive or indirect value. If you don’t take this into account and you make decisions based on maximizing the value you can easily measure, you might end up McNamara-ing yourself hard.
The stages of learning something: (1) “this is super overwhelming! I don’t think I’ll ever understand it. there are so many things I need to keep track of. just trying to wrap my mind around it makes me feel slightly queasy” (2) “hmm this seems to actually make some sense, I’m starting to get the hang of this” (3) “this is so simple and obviously true, I’ve always known it to be true, I can’t believe anyone doesn’t understand this” (you start noticing that your explanations of the thing become indistinguishable from the things you originally felt overwhelmed by) (4) “this new thing [that builds on top of the thing you just learned] is super overwhelming! I don’t think I’ll ever understand it”
The feeling of regret really sucks. This is a bad thing, because it creates an incentive to never reflect on things or realize your mistakes. This shows up as a quite painful aversion to reflecting on mistakes, doing a postmortem, and improving. I would like to trick my brain into reframing things somehow. Maybe thinking of it as a strict improvement over the status quo of having done things wrong? Or maybe reminding myself that the regret will be even worse if I don’t do anything, because I’ll regret not reflecting in addition.
Thought pattern that I’ve noticed: I seem to have two sets of epistemic states at any time: one more stable set that more accurately reflects my “actual” beliefs and changes fairly slowly, and one set of “hypothesis” beliefs that changes rapidly. Usually when I think some direction is interesting, I alternate my hypothesis beliefs between assuming key claims are true or false and trying to convince myself either way, and if I succeed then I integrate the result into my actual beliefs. In practice this might look like alternating between trying to prove something is impossible and trying to exhibit an example, or taking strange premises seriously and trying to figure out their consequences. I think this is probably very confusing to people, because when talking to people who are already familiar with alignment I’m usually talking about implications of my hypothesis beliefs, since that’s the frontier of what I’m thinking about, and from the outside it looks like I’m constantly changing my mind about things. Writing this up partially to have something to point people to and partially to push myself to communicate this more clearly.
I think this pattern is common among intellectuals, and I’m surprised it’s causing confusion. Are you labeling your exploratory beliefs and statements appropriately? An “epistemic status” note for posts here goes a long way, and in private conversation, both at work and with friends, I often say out loud “I’m exploring here, don’t take it as what I fully believe”.
I think I do a poor job of labelling my statements (at least in conversation; I usually do a bit better in post format). Something something illusion of transparency. To be honest, I didn’t even realize explicitly that I was doing this until fairly recent reflection on it.