If you have a problem where:

1. The full solution is machine-verifiable (example: math, software).
2. Partial solutions are machine-scoreable (example: optimising code for minimum compute usage, memory usage, wallclock time, latency, big-O complexity, program length, etc.). A sketch of such a scoring harness follows this list.
3. Making progress involves trying a bag of heuristics known to human experts, which can therefore be found in some dataset (example: optimising code for program length (code golf) or big-O complexity (competitive programming)).
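To make constraints 1 and 2 concrete, here is a minimal sketch of a scoring harness for the code-golf objective, assuming Python solutions judged by exact stdout match. The test cases, the five-second timeout, and the length-based score are illustrative assumptions, not a description of any particular benchmark.

```python
import subprocess
import sys

# Illustrative test cases: (stdin, expected stdout) pairs for the target task.
# These are assumptions for the sketch, not a real benchmark.
TEST_CASES = [
    ("3 4\n", "7\n"),
    ("10 -2\n", "8\n"),
]

def is_correct(candidate_source: str) -> bool:
    """Constraint 1: the full solution is machine-verifiable.
    Run the candidate program on each test case and compare stdout exactly."""
    for stdin, expected in TEST_CASES:
        result = subprocess.run(
            [sys.executable, "-c", candidate_source],
            input=stdin, capture_output=True, text=True, timeout=5,
        )
        if result.returncode != 0 or result.stdout != expected:
            return False
    return True

def score(candidate_source: str) -> float:
    """Constraint 2: partial solutions are machine-scoreable.
    The objective here is program length (code golf): shorter is better,
    and incorrect programs get the worst possible score."""
    if not is_correct(candidate_source):
        return float("inf")
    return float(len(candidate_source))
```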
Then it seems to me almost guaranteed that AI will be at the 99.9th percentile, if not the 100th, when compared against human experts.
The 100th percentile is obtained when, even though a human expert could have found the solution by repeatedly applying the bag of tricks, no human expert actually bothered to do so for cost and time reasons.
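Concretely, once all three constraints hold, "repeatedly applying the bag of tricks" is just a greedy search loop that a lab can run for far more iterations than any human would bother with. A minimal sketch, reusing the hypothetical score harness above and assuming a hypothetical llm_apply_trick helper that asks an LLM to apply one named heuristic:

```python
import random

# A bag of heuristics human experts use for code golf; illustrative examples only.
TRICKS = [
    "inline single-use variables",
    "replace explicit loops with comprehensions or built-ins",
    "shorten identifiers to one character",
    "exploit default-argument and operator quirks",
]

def llm_apply_trick(source: str, trick: str) -> str:
    """Hypothetical LLM call: ask the model to rewrite `source` using one
    named heuristic while preserving behaviour. Stub only."""
    raise NotImplementedError("wire this to whichever LLM API you use")

def golf(source: str, iterations: int = 1000) -> str:
    """Greedy search: keep any rewrite that stays correct and improves the score.
    A human expert could run this loop by hand; they just would not bother
    to do it thousands of times."""
    best, best_score = source, score(source)
    for _ in range(iterations):
        candidate = llm_apply_trick(best, random.choice(TRICKS))
        candidate_score = score(candidate)
        if candidate_score < best_score:
            best, best_score = candidate, candidate_score
    return best
```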
For future progress:
The most obvious avenue is relaxing constraint 2. Even if you can't machine-score partial solutions, you can ask the LLM to guess whether it is making partial progress, and use that as the signal.
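A minimal sketch of what that might look like, assuming a hypothetical llm_call helper and an illustrative 0-to-10 prompt; the self-graded score simply takes the place of the machine score in the search loop above:

```python
def llm_call(prompt: str) -> str:
    """Hypothetical helper wrapping whichever LLM API you use. Stub only."""
    raise NotImplementedError

def llm_partial_score(problem: str, candidate: str) -> float:
    """Relaxed constraint 2: no machine score exists, so ask the model itself
    to guess how much partial progress the candidate represents (0 to 10,
    higher is better)."""
    prompt = (
        f"Problem:\n{problem}\n\nCandidate partial solution:\n{candidate}\n\n"
        "On a scale of 0 to 10, how much progress toward a full solution "
        "does this represent? Answer with a single number."
    )
    try:
        return float(llm_call(prompt).strip())
    except ValueError:
        return 0.0  # treat unparseable answers as no progress
```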
Relaxing constraints 1 and 2 will also probably work if the solutions are human-scoreable, both fully and partially. The bottleneck then becomes the amount of human feedback you can get. Using the bag of tricks is effectively O(1) in wallclock time, because you can parallelise the LLM calls and each LLM call is faster than a human.
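The parallelism point as a sketch, reusing the hypothetical helpers above: one round of the search fires off every trick at once, so its wallclock cost is roughly one LLM call plus one scoring pass, regardless of how many tricks are in the bag.

```python
from concurrent.futures import ThreadPoolExecutor

def one_round(source: str) -> str:
    """Apply every trick in the bag concurrently and keep the best candidate.
    Wallclock time per round is roughly one LLM call, not one call per trick."""
    with ThreadPoolExecutor(max_workers=len(TRICKS)) as pool:
        candidates = list(pool.map(lambda t: llm_apply_trick(source, t), TRICKS))
    return min(candidates + [source], key=score)
```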
I’m not sure what relaxing constraint 3 looks like. I’m also not sure what it looks like for a human to invent a new heuristic.
> Then it seems to me almost guaranteed that AI will be at the 99.9th percentile, if not the 100th, when compared against human experts.

Are you talking about current AI, or future AI? Before or after training on that task?
Concretely, “minimize program length while maintaining correctness” seems to be significantly beyond the capabilities of the best publicly available scaffolded LLMs today for all but the simplest programs, and the trends in conciseness for AI-generated code do not make me optimistic that this will change in the near future.
I think this is solvable with today’s software stack and compute; it is just that no lab has bothered to do it. Maybe check back in a year, and downgrade my reputation if I turn out to be wrong. I could set up a Manifold market if it is important.