However, GPT-4o gave results that were totally off, such as “the faces and bodies of various birds, the face of a rabbit, and the body of a dog”.
Trying the same image and prompt with Claude 3.5 seems to work. Here’s the response:
Important concepts:
Tree branches and foliage, particularly bright yellow-lit sections
Ground/grass in several upper images
Some small patches of sky
It’d also be interesting to see the same applied to the audio encoder of CLAP. Really curious to know what your thoughts are about mech interp efforts in the audio space. It seems to be largely ignored.
P.S. Thank you for the excellent post.