Political Alignment of LLMs

TLDR: Constructing an unbiased LLM presents the challenge of determining what constitutes an objective viewpoint. Here I propose a forecasting-based technique for solving this problem. All feedback, particularly from people with expertise in AI alignment and LLM training, would be highly appreciated.

Applying Forecasting to Bias Measurement

First, a short background story to establish my credentials:

Six years ago, IARPA conducted an experiment involving five hundred forecasters making probabilistic predictions on three hundred geopolitical issues. The predictions were aggregated via a wisdom-of-crowds algorithm which assigned each forecaster a weight proportional to their past accuracy. Simultaneously, IARPA launched a public competition promising a $250,000 reward for improving its algorithm’s accuracy by at least 20%.

One of the most effective techniques that helped me win this contest was accounting for forecasters’ biases in addition to their general accuracy. For example, on politically charged questions, forecasters tend to make systematic errors that reflect their political preferences.

These bias-driven errors can be modeled as the dot product of two vectors:

Error = Cj · Bi

Where:

  • Bi represents the bias vector of forecaster i across multiple political dimensions

  • Cj represents the political charge vector of question j along the same dimensions

For example, a politically neutral question (“Will it rain in Paris tomorrow?”) would have a near-zero Cj, leading to minimal bias-related error. In contrast, a politically loaded question (“Will inflation rise under the Trump administration?”) would have a high-magnitude Cj, indicating strong divergence between the predictions of left- and right-leaning forecasters.
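
To make the dot product concrete, here is a toy numerical sketch (all numbers are invented for illustration, with one left/right axis and one economic-optimism axis):

```python
import numpy as np

# Toy bias vectors: one "left/right" axis and one "economic optimism" axis.
B_left  = np.array([-0.3, 0.1])   # left-leaning forecaster
B_right = np.array([ 0.3, 0.1])   # right-leaning forecaster

# Toy political charge vectors for the two example questions.
C_rain      = np.array([0.0, 0.0])   # "Will it rain in Paris tomorrow?"
C_inflation = np.array([0.8, 0.2])   # "Will inflation rise under the Trump administration?"

# Expected bias-driven error (in probability points) is the dot product Cj · Bi.
print(C_rain @ B_left, C_rain @ B_right)            # 0.0  0.0     -> no systematic error
print(C_inflation @ B_left, C_inflation @ B_right)  # ~ -0.22 0.26 -> opposite-signed errors
```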

Given a sufficient history of prediction errors, singular value decomposition (SVD) or low-rank matrix factorization can be used to infer each forecaster’s bias vector. Once these biases are known, their effect (Cj · Bi) can be subtracted from all forecasts, including those on questions that have not yet been resolved.
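
As a minimal sketch of this step (variable names are mine, and a real implementation would need to handle missing entries and add regularization), a truncated SVD of the signed-error matrix recovers bias vectors for forecasters and charge vectors for questions:

```python
import numpy as np

def fit_bias_model(errors: np.ndarray, rank: int = 2):
    """Factor an (n_forecasters x n_questions) matrix of signed forecast
    errors into forecaster bias vectors B and question charge vectors C,
    so that errors is approximated by B @ C.T."""
    U, s, Vt = np.linalg.svd(errors, full_matrices=False)
    B = U[:, :rank] * s[:rank]   # bias vectors, shape (n_forecasters, rank)
    C = Vt[:rank, :].T           # charge vectors, shape (n_questions, rank)
    return B, C

def debias(forecasts: np.ndarray, B: np.ndarray, C: np.ndarray) -> np.ndarray:
    """Subtract the predicted bias term Cj · Bi from every forecast.
    For unresolved questions, Cj would have to be estimated separately
    (e.g., from question features or from the spread of forecasts)."""
    return np.clip(forecasts - B @ C.T, 0.0, 1.0)
```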

Debiasing Large Language Models

A similar approach can be adapted to measure political bias in LLMs:

  1. Prompt different LLMs—or the same model under different sampling seeds, fine-tunings, or RLHF configurations—to make verifiable predictions on a set of politically controversial questions.

  2. Once the forecast questions have resolved, use their outcomes to estimate each model’s political bias vector (see the sketch below).
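
Here is a sketch of how this measurement loop might look, assuming a hypothetical ask_for_probability(model, question) helper that prompts a model and parses a probability from its answer, and reusing fit_bias_model from the sketch above:

```python
import numpy as np

def measure_llm_bias(models, questions, outcomes, ask_for_probability, rank=2):
    """models: model identifiers (or model/seed/fine-tune configurations).
    questions: politically charged but objectively verifiable forecast questions.
    outcomes: resolved outcomes (0 or 1), one per question.
    ask_for_probability: hypothetical helper that prompts a model with a
        question and parses a probability in [0, 1] from its answer."""
    forecasts = np.array([[ask_for_probability(m, q) for q in questions]
                          for m in models])
    errors = forecasts - np.asarray(outcomes)   # signed error per model/question
    B, C = fit_bias_model(errors, rank=rank)    # defined in the earlier sketch
    return B, C                                 # B[i] ~ political bias vector of models[i]
```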

To remove political bias from existing models, we can use a four-step method:

  1. Prompt different LLMs—or the same model under different sampling seeds, fine-tunings, or RLHF configurations—to make verifiable predictions on a set of politically controversial questions.

  2. Ask these same models to rate the quality and political bias of various political content items (e.g., news articles, opinion pieces, or LLM responses to political queries).

  3. Once the forecast questions have resolved, use their outcomes to estimate each model’s political bias vector. Then, subtract the bias effects (Cj · Bi) from their content ratings.

  4. Train a bias scorer on the corrected ratings from the previous step. Use it to guide the LLM via preference optimization or RL with a bias penalty (see the sketch below).
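
Below is a minimal sketch of steps 3 and 4 under some simplifying assumptions: the charge vectors of the content items are taken as given (in practice they would themselves be estimated, e.g., by factoring the rating residuals), and the bias scorer is a TF-IDF plus ridge-regression stand-in for what would realistically be an LLM-based reward model:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

def correct_ratings(ratings, B, C_items):
    """ratings: (n_models x n_items) raw content ratings.
    B: (n_models x rank) estimated model bias vectors (step 3).
    C_items: (n_items x rank) estimated political charge of each content item.
    Returns ratings with the Cj · Bi bias term subtracted."""
    return ratings - B @ C_items.T

def train_bias_scorer(texts, corrected_ratings):
    """Fit a simple scorer that predicts the bias-corrected consensus rating
    of a piece of political content (step 4)."""
    targets = corrected_ratings.mean(axis=0)          # consensus across models
    vectorizer = TfidfVectorizer(max_features=20_000)
    X = vectorizer.fit_transform(texts)
    scorer = Ridge(alpha=1.0).fit(X, targets)
    return vectorizer, scorer

# The scorer's output (or its gap from a neutral baseline) can then be added
# as a penalty term to the reward used in RLHF or preference optimization.
```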

I am very interested in your opinions on this approach. In particular, I would like to know:

  • Do you see any plausible reasons it might fail?

  • Do you know anyone who might be interested in testing it?

  • Do you expect a significant demand for unbiased LLMs, or would people overwhelmingly prefer models that share their own biases?