Yeah, seems consistent with the results I’ve seen where smaller models are much worse—and agreed that the gap is a useful testbed too!
32B seems pretty good here. How long are the side constraints?
not very long (3-5 word phrases)