Independent AI researcher at Eridos. Mechanical engineer by training, based in Perth, Australia.
My research focuses on temporal co-occurrence as a foundational primitive for memory and concept formation in neural systems — how associations can emerge from co-occurrence patterns rather than task-supervised training. Current projects include PAM (Predictive Associative Memory), an architecture for associative retrieval via temporal prediction (arXiv 2602.11322), and Bernard, an architecture that uses internal models to generate branching future possibilities.
I care about honest reporting of what works and what doesn’t. A lot of my experiments produce negative results that I think are informative — I’ll be writing some of those up here.
I have some empirical results that land right in the middle of this debate.
I’ve been training a contrastive architecture on temporal co-occurrence across three different settings. The thing that jumps out is how much the compression regime matters. At 97% training accuracy the system just memorises corpus-specific associations, and inductive transfer is literally zero. But at 42.75% accuracy — same architecture, same training signal, just a corpus too large to memorise — you get transferable concepts. The model has to generalise to get any result at all. Unseen novels get coherent cluster/concept assignments without retraining. So convergence isn’t automatic. Two identical systems can land in completely different representational regimes depending on how hard they’re being squeezed.
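To make the training signal concrete, here’s a toy sketch of a contrastive (InfoNCE-style) objective over temporal co-occurrence — not the actual PAM code, and the vocabulary, embedding size, and negatives are all made up for illustration:

```python
import math
import random

# Toy sketch: temporally adjacent tokens are positives, random tokens
# are negatives, and the loss is -log p(positive | anchor) under a
# softmax over all candidates. Embeddings here are random, untrained.
random.seed(0)
DIM = 8
VOCAB = ["cat", "sat", "mat", "dog", "ran", "park"]
emb = {w: [random.gauss(0, 0.1) for _ in range(DIM)] for w in VOCAB}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce_loss(anchor, positive, negatives):
    """Contrastive loss for one anchor: positive vs. sampled negatives."""
    scores = [dot(emb[anchor], emb[positive])] + [
        dot(emb[anchor], emb[n]) for n in negatives
    ]
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return -math.log(exps[0] / sum(exps))

# One temporal positive pair ("cat" followed by "sat") vs. two negatives.
loss = info_nce_loss("cat", "sat", ["dog", "park"])
print(f"contrastive loss: {loss:.3f}")
```

The compression argument is about what minimising this loss forces: when the corpus is small enough to memorise, pair-specific lookups suffice; when it isn’t, the embeddings have to encode shared structure.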
That part supports your argument. But I’ve got two results that push against the pure “shared training data” explanation. Temporal shuffle ablations collapse the signal by 95%, so the model really is picking up sequential structure rather than surface statistics. And the same co-occurrence signal transfers across domains (text and gene expression) in ways that cosine similarity over the same embeddings can’t touch: cross-boundary AUC goes from 0.534 to 0.902. If this were just an artefact of the training distribution, you wouldn’t expect it to expose structure in a completely different domain.
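The logic of the shuffle ablation can be shown with a deliberately simple stand-in (bigram counts rather than my actual model; the sequences are synthetic): fit on ordered data, then score held-out data with and without destroying the temporal order. If the score collapses under shuffling, the signal was sequential structure, not marginal token statistics — which are identical in both conditions.

```python
import random
from collections import Counter, defaultdict

# Toy shuffle ablation: learn bigram transition counts from an ordered
# sequence, then measure next-token predictability on held-out data,
# ordered vs. temporally shuffled. Token frequencies are unchanged by
# the shuffle; only sequential structure is destroyed.
random.seed(1)
pattern = ["a", "b", "c", "d"]
train = pattern * 50

bigrams = defaultdict(Counter)
for x, y in zip(train, train[1:]):
    bigrams[x][y] += 1

def next_token_accuracy(seq):
    """Fraction of steps where the most frequent bigram successor is right."""
    hits = 0
    for x, y in zip(seq, seq[1:]):
        pred = bigrams[x].most_common(1)[0][0] if bigrams[x] else None
        hits += (pred == y)
    return hits / (len(seq) - 1)

held_out = pattern * 10
shuffled = held_out[:]
random.shuffle(shuffled)  # same tokens, temporal order destroyed

acc_ordered = next_token_accuracy(held_out)
acc_shuffled = next_token_accuracy(shuffled)
print(f"ordered accuracy:  {acc_ordered:.2f}")
print(f"shuffled accuracy: {acc_shuffled:.2f}")
```

In this toy the ordered accuracy is perfect and the shuffled accuracy drops to roughly chance; the real ablation is the same comparison with the contrastive signal in place of bigram accuracy.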
I think both camps are partly wrong. The convergence hypothesis overstates how automatically architecture plus scale gets you to objective representations. But “just shared data” undersells how much the temporal structure of the world constrains what gets learned.