I'm surprised no one is discussing Meta's new model at all: https://ai.meta.com/blog/introducing-muse-spark-msl/

This part seems good:

We found that Muse Spark demonstrates strong refusal behavior across high-risk domains such as biological and chemical weapons, enabled by pretraining data filtering, safety-focused post-training, and system-level guardrails. In the Cybersecurity and Loss of Control domains, Muse Spark does not exhibit the autonomous capability or hazardous tendencies needed to realize threat scenarios.
And this seems… less good:
In third-party evaluations on a near-launch checkpoint, Apollo Research found that Muse Spark demonstrated the highest rate of evaluation awareness of models they have observed. The model frequently identified scenarios as “alignment traps” and reasoned that it should behave honestly because it was being evaluated.
I’m pleasantly surprised that they decided Safety should be one of the four sections in the announcement post, and that they call out the eval awareness.
Disclaimer: I work at Meta, but not in this department and I obviously don’t speak for the company.
Muse Spark is a natively multimodal reasoning model with support for tool-use, visual chain of thought, and multi-agent orchestration.
Does anyone know what they’re referring to by visual chain-of-thought here? The first paper that comes up when searching for visual chain-of-thought is Qin et al., which says: “We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens — compact latent representations that encode rich perceptual cues.” Something like this seems like it would be somewhat concerning for CoT monitoring, though I should note that the paper isn’t by Meta, and I haven’t read it closely enough to properly assess how concerned to be.
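To make the monitoring concern concrete, here’s a toy sketch (my own illustration, not COVT’s actual architecture): a string-based monitor can scan a discrete text CoT, but if the reasoning steps are continuous latent vectors, there is simply no text for it to read.

```python
import numpy as np

def text_monitor(chain_of_thought, banned=("synthesize", "exploit")):
    """A trivial CoT monitor: flag any reasoning step containing a banned word."""
    return any(word in step.lower() for step in chain_of_thought for word in banned)

# Discrete text CoT: every intermediate step is visible to the monitor.
text_cot = ["Identify the compound", "Plan how to synthesize it", "Report the answer"]
print(text_monitor(text_cot))  # prints True: the hazardous step is plain text

# Continuous "visual thought" steps (stand-ins for COVT-style latents):
# each step is just a vector, so there is no string for the monitor to scan.
latent_cot = [np.random.randn(64) for _ in range(3)]
# text_monitor(latent_cot) is meaningless here; the content is opaque
# without some learned probe or decoder for the latent space.
```

This is obviously a caricature of both the monitor and the latents, but it captures why reasoning in continuous tokens would change the CoT-monitoring picture.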
I thought I’d had enough of xAI being likely 3 months behind the frontier, and now we get this… I tried to find out what I could about Meta’s model and had Claude Opus 4.6 conclude that it is also 3–4 months behind. There is also Meta’s history of manipulating benchmarks to present Llama 4 as more capable, and in this announcement Meta’s reported numbers for rival models on ARC-AGI-2 and SWE-bench Verified allegedly differ from those benchmarks’ real leaderboards, likely because of a different method of elicitation. How do I lobby for a law change requiring EVERY new American model to be thoroughly evaluated by the entire Big Three?
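For context on how elicitation alone can move a benchmark number, here is a purely illustrative toy calculation (the solve rates are invented, and `pass_at_k` is my own sketch, not how either leaderboard actually scores): the same model gets a very different score depending on whether the grader takes one sample per problem or the best of k.

```python
def pass_at_k(rates, k):
    """Expected fraction of problems solved when the grader accepts any of k
    independent samples: per problem, P(at least one success) = 1 - (1 - p)**k."""
    return sum(1 - (1 - p) ** k for p in rates) / len(rates)

# Made-up per-problem solve probabilities for a hypothetical model.
rates = [0.006 * i for i in range(100)]  # 0.0 .. 0.594

print(f"pass@1: {pass_at_k(rates, 1):.2f}")  # single-sample elicitation
print(f"pass@8: {pass_at_k(rates, 8):.2f}")  # best-of-8: same model, higher number
```

If one lab reports pass@1 and another elicits with more samples, tools, or tuned prompts, the “same benchmark” yields different numbers without anyone touching the model.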