Testing in full generality is certainly AGI-complete (and a nice ingredient for recursive self-improvement!), but I think you’re overestimating the difficulty of pattern-matching your way to decent tests. Chess used to be considered AGI-complete too; I’d guess testing is more like poetry+arithmetic, in that if you can handle context, style, and some details, it comes out pretty nicely.
I expect GPT-4 to be substantially better at this ‘out of the box’ due to:

- the usual combination of larger models, better generalisation, scaling laws, etc.
- super-linear performance gains on arithmetic-like tasks due to generalisation, with spillover to code-related tasks
- the extra GitHub (and blog, etc.) data, which is probably pretty helpful given steady adoption since ~2015 or so
Example outputs from Ghostwriter vs GPT-3:
```python
$ hypothesis write gzip.compress

import gzip
from hypothesis import given, strategies as st

@given(compresslevel=st.just(9), data=st.nothing())
def test_roundtrip_compress_decompress(compresslevel, data):
    value0 = gzip.compress(data=data, compresslevel=compresslevel)
    value1 = gzip.decompress(data=value0)
    assert data == value1, (data, value1)
```
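(As an aside: `st.nothing()` is a strategy which can't generate any examples, presumably emitted because the Ghostwriter couldn't infer a strategy for `data`; swapping in `st.binary()` would make the test actually run. The round-trip property itself can be sketched with the standard library alone, using a few hand-picked inputs in place of Hypothesis' generated examples:)

```python
import gzip

def check_roundtrip(data, compresslevel):
    # Property under test: decompress(compress(data)) == data.
    compressed = gzip.compress(data, compresslevel=compresslevel)
    restored = gzip.decompress(compressed)
    assert restored == data, (data, restored)

# Hand-picked byte strings and compression levels, standing in for
# the random examples that st.binary() would generate.
for level in (0, 1, 9):
    for data in (b"", b"hello", b"\x00" * 1024, bytes(range(256))):
        check_roundtrip(data, level)
```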
while GPT-3 tends to produce examples like (first four that I generated just now):
So it’s clearly ‘getting the right idea’, even without any fine-tuning at all, but not there yet. It’s also a lot worse at this without a natural-language description of the test we want in the prompt:
```
This document was written by a very smart AI programmer,
who is skilled in testing Python, friendly, and helpful.

Let's start by testing the two properties of sorting a list: the output
is a permutation of the input, and the elements of the output are
ordered:

@given(st.lists(st.integers()))
def test(xs):
    result = sorted(xs)
    assert set(result) == set(xs)
    assert is_in_order(result)

Our second test demonstrates a round-trip test. Given any string
of bytes, if you use gzip to compress and then decompress, the result
will be equal to the input:
```
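(To make the sorting example self-contained: `is_in_order` is a helper the prompt leaves undefined, and comparing `set`s would miss bugs that change the counts of duplicate elements. A stdlib-only sketch, with a stricter multiset comparison via `collections.Counter` and hand-picked examples in place of Hypothesis' generation:)

```python
from collections import Counter

def is_in_order(xs):
    # True if every element is <= its successor.
    return all(a <= b for a, b in zip(xs, xs[1:]))

def check_sort_properties(xs):
    result = sorted(xs)
    # Multiset equality: same elements with the same counts.
    # (Plain set equality would miss lost-duplicate bugs.)
    assert Counter(result) == Counter(xs)
    assert is_in_order(result)

for example in ([], [3, 1, 2], [2, 2, 1], [-5, 0, 5]):
    check_sort_properties(example)
```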
Of course translating natural language into Python code is pretty cool too!
I object to the characterization that it is “getting the right idea.” It seems to have latched on to “given a foo of bar” → “@given(foo.bar)” and that “assert” should be used, but the rest is word salad, not code.
It’s at least syntactically valid word salad composed of relevant words, which is a substantial advance. And per Gwern, I’m very cautious about generalising from “the first few results from this prompt are bad” to “GPT can’t X”.