Testing in full generality is certainly AGI-complete (and a nice ingredient for recursive self-improvement!), but I think you’re overestimating the difficulty of pattern-matching your way to decent tests. Chess used to be considered AGI-complete too; I’d guess testing is more like poetry+arithmetic, in that if you can handle context, style, and some details, it comes out pretty nicely.
I expect GPT-4 to be substantially better at this ‘out of the box’ due to:

- the usual combination of larger models, better generalisation, scaling laws, etc.
- super-linear performance gains on arithmetic-like tasks due to generalisation, with spillover to code-related tasks
- the extra GitHub (and blog, etc.) data, which is probably pretty helpful given steady adoption since ~2015 or so
Example outputs from Ghostwriter vs GPT-3:
```python
$ hypothesis write gzip.compress

import gzip
from hypothesis import given, strategies as st

@given(compresslevel=st.just(9), data=st.nothing())
def test_roundtrip_compress_decompress(compresslevel, data):
    value0 = gzip.compress(data=data, compresslevel=compresslevel)
    value1 = gzip.decompress(data=value0)
    assert data == value1, (data, value1)
```
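(As an aside: `st.nothing()` is a strategy which can't generate any examples, presumably emitted because the Ghostwriter couldn't infer a strategy for `data`; swapping in `st.binary()` would make the test actually run. The round-trip property itself can be sketched with the standard library alone, using a few hand-picked inputs in place of Hypothesis' generated examples:)

```python
import gzip

def check_roundtrip(data, compresslevel):
    # Property under test: decompress(compress(data)) == data.
    compressed = gzip.compress(data, compresslevel=compresslevel)
    restored = gzip.decompress(compressed)
    assert restored == data, (data, restored)

# Hand-picked byte strings and compression levels, standing in for
# the random examples that st.binary() would generate.
for level in (0, 1, 9):
    for data in (b"", b"hello", b"\x00" * 1024, bytes(range(256))):
        check_roundtrip(data, level)
```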
while GPT-3 tends to produce examples like (first four that I generated just now):
So it’s clearly ‘getting the right idea’, even without any fine-tuning at all, but not there yet. It’s also a lot worse at this without a natural-language description of the test we want in the prompt:
```
This document was written by a very smart AI programmer,
who is skilled in testing Python, friendly, and helpful.

Let's start by testing the two properties of sorting a list: the output
is a permutation of the input, and the elements of the output are
ordered:

@given(st.lists(st.integers()))
def test(xs):
    result = sorted(xs)
    assert set(result) == set(xs)
    assert is_in_order(result)

Our second test demonstrates a round-trip test. Given any string
of bytes, if you use gzip to compress and then decompress, the result
will be equal to the input:
```
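(To make the sorting example self-contained: `is_in_order` is a helper the prompt leaves undefined, and comparing `set`s would miss bugs that change the counts of duplicate elements. A stdlib-only sketch, with a stricter multiset comparison via `collections.Counter` and hand-picked examples in place of Hypothesis' generation:)

```python
from collections import Counter

def is_in_order(xs):
    # True if every element is <= its successor.
    return all(a <= b for a, b in zip(xs, xs[1:]))

def check_sort_properties(xs):
    result = sorted(xs)
    # Multiset equality: same elements with the same counts.
    # (Plain set equality would miss lost-duplicate bugs.)
    assert Counter(result) == Counter(xs)
    assert is_in_order(result)

for example in ([], [3, 1, 2], [2, 2, 1], [-5, 0, 5]):
    check_sort_properties(example)
```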
Of course translating natural language into Python code is pretty cool too!
I object to the characterization that it is “getting the right idea.” It seems to have latched on to “given a foo of bar” → “@given(foo.bar)” and that “assert” should be used, but the rest is word salad, not code.
It’s at least syntactically valid word salad composed of relevant words, which is a substantial advance. And per Gwern, I’m very cautious about generalising from “the first few results from this prompt are bad” to “GPT can’t X”.