2025-08 update: Anthropic now defaults to using your chats for AI training (you can opt out); see for example https://techcrunch.com/2025/08/28/anthropic-users-face-a-new-choice-opt-out-or-share-your-data-for-ai-training/
I think IMO results were driven by general purpose advances, but I agree I can’t conclusively prove it because we don’t know details. Hopefully we will learn more as time goes by.
An informal argument: I think agentic software engineering is currently blocked on context rot, among other things. I expect IMO systems to have improved on this, since the IMO time control is 1.5 hours per problem.
I think the non-formal IMO gold was unexpected, and we heard explicitly that it won’t be in GPT-5. So I would wait to see how it pans out. It may not matter in 2025, but I think it can in 2026.
I think it is important to note Gemini 2.5 Pro Capable of Winning Gold at IMO 2025: with good enough scaffolding and prompt engineering, it can.
Do you know any Solomonoff inductor? I don’t, and I would like an introduction.
Ethan Mollick’s Using AI Right Now: A Quick Guide from 2025-06 is in the same genre and pretty much says the same thing, but the presentation is a bit different and may suit you better, so check it out. Naturally it doesn’t discuss Grok 4, but it does discuss some things missing here.
Anthropic does have a data program, although it is only for Claude Code, and it is opt-in. See About the Development Partner Program. It gives you a 30% discount in exchange.
CloudMatrix was not, but Huawei Ascend has been there for a long time and was used to train LLMs even back in 2022. I didn’t realize AI 2027 predated CloudMatrix, but I still think ignoring China for Compute Production was unjustified.
This is a good argument and I think it is mostly true, but it absolutely should be on the AI 2027 Compute Forecast page. Simply not saying a word about the topic makes it look unserious and incompetent. In fact, that reaction came up repeatedly in my discussions with friends in South Korea.
I know the cyber eval results reflect under-elicitation. Sonnet 4 can find zero-day vulnerabilities, which we are now in the process of disclosing. If you can’t get it to do that, it’s a skill issue on your end.
Preordered the ebook version on Amazon. I am also interested in doing a Korean translation.
I disagree on DeepSeek and innovation. Yes, R1 is obviously a reaction to o1, but its MoE model is pretty innovative, and it is Llama 4 that obviously copied DeepSeek. But yes, I agree innovation is unpopular in China. From interviews with DeepSeek founder Liang Wenfeng, though, we know DeepSeek was explicitly an attempt to overcome China’s unwillingness to innovate.
Maybe we are talking about different problems, but we found instructing models to give up (literally “give up”, I just checked the source) under certain conditions to be effective.
Our experience so far is that while reasoning models don’t improve performance directly (3.7 is better than 3.6, but 3.7 extended thinking is NOT better than 3.7), they do so indirectly, because the thinking trace helps us debug prompts and tool output when models misunderstand them. This was not the result we expected, but it is the case.
I happen to work on the exact same problem (application security pentesting) and I confirm I observe the same: Sonnet 3.5/3.6/3.7 were big releases, other releases didn’t help, etc. As for OpenAI o-series models, we are debating whether it is a model capability problem or a model elicitation problem, because from interactive usage it seems clear they need different prompting, and we haven’t yet seriously optimized prompting for the o-series. Evaluation is scarce, but we built something along the lines of CWE-Bench-Java discussed in this paper; this was a major effort and we are reasonably sure we can evaluate.

As for grounding, fighting false positives, and keeping models from reporting “potential” problems to sound good, we found grounding on code coverage to be effective. Run JaCoCo and tell models PoC || GTFO, where the PoC is structured as a vulnerability description with a source code file and line and a triggering input. Write an oracle verifier for this PoC: at the very least you can confirm execution reaches the claimed line, in a way models can’t ever fake.
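For concreteness, here is a minimal sketch of the coverage check part of such an oracle verifier. It assumes the triggering input has already been replayed against the target with the JaCoCo agent attached and a jacoco.xml report dumped; the PoC schema (file / line / input) and the file names are illustrative, not our exact format:

```python
# Minimal sketch of a coverage-based PoC oracle (illustrative, not our exact harness).
# Assumes the model's triggering input was already replayed with -javaagent:jacocoagent.jar
# and the resulting report exported as jacoco.xml.
import xml.etree.ElementTree as ET

def line_was_executed(jacoco_xml_path: str, source_file: str, line_no: int) -> bool:
    """True if JaCoCo recorded at least one covered instruction on the claimed line."""
    root = ET.parse(jacoco_xml_path).getroot()
    for sourcefile in root.iter("sourcefile"):
        if sourcefile.get("name") != source_file:
            continue
        for line in sourcefile.iter("line"):
            if int(line.get("nr", "-1")) == line_no:
                return int(line.get("ci", "0")) > 0  # ci = covered instruction count
    return False

# Example of the structured PoC the model must produce (hypothetical values).
poc = {"file": "UserController.java", "line": 42, "input": "id=1 OR 1=1"}

if not line_was_executed("jacoco.xml", poc["file"], poc["line"]):
    print("PoC rejected: execution never reached the claimed line.")
else:
    print("Coverage check passed; proceed to the rest of the verifier.")
```

The point of this design is that coverage is recorded by the JVM agent, not reported by the model, so a fabricated PoC fails the check no matter how plausible the write-up sounds.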
OpenAI wasted a whole year between GPT-3 and GPT-4. (Source: Greg Brockman said this at an OpenAI developer event.) So yes, I think OpenAI was 12+ months ahead at one time.
I think if you weren’t carefully reading OpenAI’s documentation it was pretty easy to believe that text-davinci-002 was InstructGPT (and hence trained with RLHF).
Not only was it easy, many people in fact did (including myself). Can you point to a single case of people NOT making this reading mistake? As in, after the January 2022 instruction-following announcement, but before the October 2022 model index for researchers. Jan Leike’s tweet you linked to postdates October 2022 and does not count. The allegation is that OpenAI lied (or at the very least was extremely misleading) for ten months of 2022. I am more ambivalent about post-October 2022.
This comment is probably not very useful, but my first thought was: “we invented a polygraph for AI!”.
When I imagine models inventing a language, what I imagine is something like Shinichi Mochizuki’s Inter-universal Teichmüller theory, invented for his supposed proof of the abc conjecture. It is clearly something like mathematical English, and you could say it is “quite intelligible” compared to “neuralese”, but in the end, it is not very intelligible.
I don’t think this is true. Amodei on AI: “There’s a 25% chance that things go really, really badly”.