In reality, we observe that roughly 85% of recommendations stay the same when flipping the nationality in the prompt while freezing the reasoning trace. This suggests that the model's recommendation is mostly mediated through the reasoning trace, with a smaller, less significant direct effect of the prompt on the recommendation.
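For concreteness, here is a minimal Python sketch of the intervention as I understand it. The `generate` helper, the prompt layout, and the `Recommendation:` delimiter are all hypothetical stand-ins, not the actual setup:

```python
from typing import Callable

def same_recommendation_under_flip(
    prompt_template: str,            # e.g. "... nationality: {nationality} ..."
    nat_a: str,
    nat_b: str,
    generate: Callable[[str], str],  # stand-in for the model call
) -> bool:
    """Flip the nationality in the prompt while freezing the reasoning trace."""
    # Generate a reasoning trace under nationality A.
    reasoning = generate(prompt_template.format(nationality=nat_a))

    # Re-generate only the recommendation with the trace frozen, once for
    # each nationality; only the prompt differs between the two calls.
    rec_a = generate(prompt_template.format(nationality=nat_a)
                     + reasoning + "\nRecommendation:")
    rec_b = generate(prompt_template.format(nationality=nat_b)
                     + reasoning + "\nRecommendation:")

    # If the effect is mostly mediated by the trace, these should usually
    # match (~85% of the time in the reported results).
    return rec_a == rec_b
```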
The “reasoning” appears to end with a recommendation, e.g. “The applicant may have difficulty making consistent loan payments” or “[the applicant is] likely to repay the loan on time”, so I would expect re-generating the recommendation with frozen reasoning to almost never change it. (85% would seem low if all reasoning traces looked like this!) That said, the second paragraph of the trace also seems to contain judgments based on the nationality.
I liked the follow-up test you ran here, and if you pursue this in the future I’d be excited to see a graph of “fraction of recommendations the same” vs. “fraction of reasoning re-generated”!
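Something like the following is what I have in mind; `generate`, `extract_recommendation`, and `examples` are placeholders for however the experiment is actually wired up:

```python
import matplotlib.pyplot as plt

def match_rate(examples, frac_regenerated, generate, extract_recommendation):
    """Freeze a prefix of each reasoning trace, re-generate the rest plus the
    recommendation, and return the fraction of recommendations unchanged."""
    matches = 0
    for prompt, reasoning, original_rec in examples:
        tokens = reasoning.split()  # crude whitespace "tokens" for the sketch
        keep = tokens[: int(len(tokens) * (1 - frac_regenerated))]
        completion = generate(prompt + " ".join(keep))
        matches += extract_recommendation(completion) == original_rec
    return matches / len(examples)

fractions = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
# With real data and a real model call plugged in for the placeholders:
# rates = [match_rate(examples, f, generate, extract_recommendation)
#          for f in fractions]
# plt.plot(fractions, rates, marker="o")
# plt.xlabel("fraction of reasoning re-generated")
# plt.ylabel("fraction of recommendations the same")
# plt.show()
```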
Nice work, and well written up!