Muse Spark is a natively multimodal reasoning model with support for tool-use, visual chain of thought, and multi-agent orchestration.
Does anyone know what they’re referring to by visual chain-of-thought here? The first paper that comes up when searching for visual chain-of-thought is Qin et al., which says: “We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens – compact latent representations that encode rich perceptual cues.” Something like this seems like it would be somewhat concerning for CoT monitoring, though I should note that this paper isn’t by Meta, and I haven’t read it closely enough to properly assess how concerned I should be.