I think the general answer to testing seems AGI-complete, in the sense that you need to understand the edge cases of a function (or the correct output for “normal” input).
If we take the simplest testing case, say Python using pytest, with typed code and some simple test for each type (e.g. 0 and 1 for integers, empty/random strings, etc.), then you could show it examples of how to generate tests from function names… but then you could also just do it with regex, or I guess with Hypothesis.
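For concreteness, a minimal sketch of that per-type idea with Hypothesis (the function under test and the range property are made up for illustration, not taken from any real codebase):

from hypothesis import given, strategies as st

# Hypothetical typed function, standing in for the code under test.
def clip(x: int, limit: int) -> int:
    return min(max(x, 0), limit)

# The strategies subsume the hand-picked per-type cases (0 and 1 for
# integers, etc.) by generating those values and many more.
@given(x=st.integers(), limit=st.integers(min_value=0))
def test_clip_stays_in_range(x, limit):
    assert 0 <= clip(x, limit) <= limit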
So maybe the right question to ask is: what do you expect GPT-4 to do better than GPT-3, relative to the training distribution (which will have maybe 1-2 more years of GitHub data) plus the context window? What’s the bottleneck? When would you say “I’m pretty sure there’s enough capacity to do it”? What few-shot examples would you feed your model?
Testing in full generality is certainly AGI-complete (and a nice ingredient for recursive self-improvement!), but I think you’re overestimating the difficulty of pattern-matching your way to decent tests. Chess used to be considered AGI-complete too; I’d guess testing is more like poetry plus arithmetic, in that if you can handle context, style, and some details, it comes out pretty nicely.
I expect GPT-4 to be substantially better at this ‘out of the box’, due to:
- the usual combination of larger scale, better generalisation, scaling laws, etc.
- super-linear performance gains on arithmetic-like tasks due to generalisation, with spillover into code-related tasks
- the extra GitHub (and blog, etc.) data, which is probably pretty helpful given steady adoption since ~2015 or so
Example outputs from Ghostwriter vs GPT-3:
$ hypothesis write gzip.compress

import gzip
from hypothesis import given, strategies as st

@given(compresslevel=st.just(9), data=st.nothing())
def test_roundtrip_compress_decompress(compresslevel, data):
    value0 = gzip.compress(data=data, compresslevel=compresslevel)
    value1 = gzip.decompress(data=value0)
    assert data == value1, (data, value1)
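(As I understand it, st.nothing() is the placeholder the Ghostwriter emits when it can’t infer a strategy for a parameter; substituting st.binary() for the bytes argument and varying the compression level, my tweak rather than Ghostwriter output, makes the test actually runnable:)

import gzip
from hypothesis import given, strategies as st

# st.binary() generates arbitrary bytes; gzip accepts levels 0-9.
@given(compresslevel=st.integers(min_value=0, max_value=9), data=st.binary())
def test_roundtrip_compress_decompress(compresslevel, data):
    value0 = gzip.compress(data=data, compresslevel=compresslevel)
    value1 = gzip.decompress(data=value0)
    assert data == value1, (data, value1)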
while GPT-3 tends to produce examples like (first four that I generated just now):
So it’s clearly ‘getting the right idea’, even without any fine-tuning at all, but not there yet. It’s also a lot worse at this without a natural-language description of the test we want in the prompt:
This document was written by a very smart AI programmer,
who is skilled in testing Python, friendly, and helpful.
Let's start by testing the two properties of sorting a list: the output
is a permutation of the input, and the elements of the output are
ordered:
@given(st.lists(st.integers()))
def test(xs):
    result = sorted(xs)
    assert set(result) == set(xs)
    assert is_in_order(result)
Our second test demonstrates a round-trip test. Given any string
of bytes, if you use gzip to compress and then decompress, the result
will be equal to the input:
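(For reference, a self-contained, runnable version of that sorting test might look like the sketch below. Note that set equality ignores duplicates, so I’ve used collections.Counter for a strict permutation check, and supplied a plausible definition of the is_in_order helper the prompt assumes.)

from collections import Counter
from hypothesis import given, strategies as st

def is_in_order(xs):
    # plausible definition of the helper assumed in the prompt
    return all(a <= b for a, b in zip(xs, xs[1:]))

@given(st.lists(st.integers()))
def test_sorted_is_ordered_permutation(xs):
    result = sorted(xs)
    assert Counter(result) == Counter(xs)  # multiset equality: a true permutation check
    assert is_in_order(result)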
Of course translating natural language into Python code is pretty cool too!
I object to the characterization that it is “getting the right idea.” It seems to have latched on to “given a foo of bar” → “@given(foo.bar)” and that “assert” should be used, but the rest is word salad, not code.
It’s at least syntactically valid word salad composed of relevant words, which is a substantial advance; and per Gwern, I’m very cautious about generalising from “the first few results from this prompt are bad” to “GPT can’t X”.