Advameg, Inc. CEO
Founder, city-data.com
https://twitter.com/LechMazur
Author: County-level COVID-19 machine learning case prediction model.
Author: AI assistant for melody composition.
I’ve just created an NYT Connections benchmark: 267 puzzles, 3 prompts for each, uppercase and lowercase.
Results:
GPT-4 Turbo: 31.0
Claude 3 Opus: 27.3
Mistral Large: 17.7
Mistral Medium: 15.3
Gemini Pro: 14.2
Qwen 1.5 72B Chat: 10.7
Claude 3 Sonnet: 7.6
GPT-3.5 Turbo: 4.2
Mixtral 8x7B Instruct: 4.2
Llama 2 70B Chat: 3.5
Nous Hermes 2 Yi 34B: 1.5
Partial credit is given if the puzzle is not fully solved.
Only one attempt is allowed per puzzle, 0-shot. Humans get 4 attempts and a hint when they are one step away from solving a group.
Gemini Advanced is not yet available through the API.
(Edit: I’ve added bigger models from together.ai and from Mistral)
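For anyone curious about the mechanics, per-puzzle partial credit can be computed along these lines (a simplified sketch, not the exact harness; the proportional-credit rule and the names below are just illustrative):

```python
# Sketch: score one Connections attempt with partial credit.
# Assumes credit proportional to the number of groups matched exactly.

def score_puzzle(guessed_groups, true_groups):
    """Return a score in [0, 1]: the fraction of groups solved exactly.

    Each argument is a list of four groups of four words. Comparison is
    case-insensitive, since puzzles are presented in both uppercase and
    lowercase variants.
    """
    def norm(group):
        return frozenset(w.strip().lower() for w in group)

    truth = {norm(g) for g in true_groups}
    correct = sum(1 for g in guessed_groups if norm(g) in truth)
    return correct / len(true_groups)

# Example: two of four groups right -> 0.5 credit for this puzzle.
guess = [["bass", "flounder", "sole", "pike"],
         ["angel", "devil", "saint", "sinner"],
         ["bat", "ball", "glove", "ring"],
         ["cap", "necklace", "bracelet", "watch"]]
truth = [["BASS", "FLOUNDER", "SOLE", "PIKE"],
         ["ANGEL", "DEVIL", "SAINT", "SINNER"],
         ["BAT", "BALL", "GLOVE", "CAP"],
         ["RING", "NECKLACE", "BRACELET", "WATCH"]]
print(score_puzzle(guess, truth))  # 0.5
```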
“Anonymous” and 540B parameters, hmm… I’m sure it’s not from the company named after an even larger number.
GSM8K = grade school math word problems
DROP = reading comprehension benchmark requiring discrete reasoning over paragraphs
OpenBookQA = question-answering dataset modeled after open book exams for assessing human understanding of a subject. 5,957 multiple-choice elementary-level science questions
ANLI-A3 = adversarial benchmark designed to be challenging to current state-of-the-art models
Fox News’ Peter Doocy uses all his time at the White House press briefing to ask about an assessment that “literally everyone on Earth will die” because of artificial intelligence: “It sounds crazy, but is it?”
https://twitter.com/therecount/status/1641526864626720774
Related development: https://www.nature.com/articles/d41586-021-01968-y
“Meanwhile, an academic team has developed its own protein-prediction tool inspired by AlphaFold 2, which is already gaining popularity with scientists. That system, called RoseTTaFold, performs nearly as well as AlphaFold 2, and is described in a Science paper also published on 15 July[2].”
There is a new study out that found that 40% of Copilot’s code contributions in high-risk scenarios were vulnerable: https://arxiv.org/abs/2108.09293
Generated data can be low quality yet indistinguishable from real data. Unless your classifier has access to more data or is better in some other way (e.g., larger, better architecture), you won’t know. In fact, if you could tell without labeling that generated data is bad, why would you generate it in the first place? I’ve seen this in practice in my own project.
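A toy simulation of that point (everything here is made up for illustration; the `edge` knob stands in for having more data or a better architecture than the generator):

```python
# Toy model: filtering generated samples with a classifier only raises
# average quality to the extent the classifier has a real edge over the
# generator. With edge=0 the filter's scores are pure noise.
import random

random.seed(0)

def generate():
    """Stand-in generator: each sample has a hidden true quality in [0, 1]."""
    return random.random()

def classifier_score(true_quality, edge):
    """Stand-in classifier: edge=0 sees nothing the generator doesn't;
    edge=1 sees true quality perfectly."""
    return edge * true_quality + (1 - edge) * random.random()

def mean_kept_quality(edge, n=100_000, keep_frac=0.5):
    samples = [generate() for _ in range(n)]
    ranked = sorted(samples, key=lambda q: classifier_score(q, edge), reverse=True)
    kept = ranked[: int(n * keep_frac)]
    return sum(kept) / len(kept)

for edge in (0.0, 0.5, 1.0):
    print(f"edge={edge}: mean quality of kept half ~ {mean_kept_quality(edge):.2f}")
# edge=0.0 -> ~0.50 (filtering changed nothing); edge=1.0 -> ~0.75
```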
In Miami there absolutely are explicit payments for models to join tables. This can lead to all of them leaving on the dot at, say, 3:00 AM, since that’s how long they were required to stay at the club to earn their pay. NYC has a different dynamic.
As you probably know, there are multiple theoretically interesting ML ideas that achieve very good results on MNIST. Have you tried more challenging image recognition benchmarks, such as CIFAR-100, or some non-CV benchmark? Since you posted your code, I wouldn’t mind spending a bit of time looking over what you’ve accomplished. However, MNIST is now considered pretty much a toy benchmark (and I don’t consider PI-MNIST to be a better one), so it will likely be an obstacle to getting others to look at your work in depth; the results will be considered quite preliminary. Another practical point: using C and CUDA kernels also makes the code less accessible to a good percentage of researchers.
Following your forecast’s closing date, MATH has reached 84.3% as per this paper if counting GPT-4 Code Interpreter: https://arxiv.org/abs/2308.07921v1
I wouldn’t recommend this talk to someone unfamiliar with the AI risk arguments, and I think promoting it would be a mistake. Yudkowsky came across better on Lex Fridman’s podcast. A few more Rational Animations-style AI risk YouTube videos would be more effective.
“Squiggle Maximizer” and “Paperclip Maximizer” have to go. They’re misleading names for an AI pursuing a utility function orthogonal to human values, and they make the concept seem like a joke when communicating with the general public. Better to use a different term, preferably something that represents a goal that’s valuable to humans. All funny-sounding insider jargon should be avoided cough notkilleveryoneism cough.
Nanotech is too science-fictiony and distracting. More realistic near-term scenarios (hacks of nuclear facilities like Stuxnet to control energy, out-of-control trading causing world economies to crash and leading to a full-on nuclear war, large-scale environmental disaster that’s lethal to humans but not machines, gain-of-function virus engineering, controlling important people through blackmail) would resonate better and emphasize the fragility of human civilization.
The chess analogy (“you will lose, but I can’t tell you exactly how”) is effective. It’s challenging to illustrate to people how something can be significantly more intelligent than they are, and this analogy helps convey that by reminding them how easily they lose to computers.
Ethereum’s market cap is 47% of Bitcoin’s. While you can argue that the market cap of cryptocurrencies is arbitrary, the price of one coin is even more arbitrary.
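To spell that out with round illustrative numbers (not live figures): market cap is price times circulating supply, so the per-coin price ratio by itself tells you almost nothing.

```python
# Market cap = price per coin * circulating supply. Per-coin price depends
# on how many units each protocol happens to issue, so cross-coin price
# comparisons are meaningless. Round illustrative numbers only.
btc_supply, btc_price = 19_000_000, 30_000    # ~19M BTC
eth_supply, eth_price = 120_000_000, 2_250    # ~120M ETH

btc_cap = btc_supply * btc_price
eth_cap = eth_supply * eth_price
print(f"ETH/BTC price ratio:      {eth_price / btc_price:.3f}")  # 0.075
print(f"ETH/BTC market-cap ratio: {eth_cap / btc_cap:.3f}")      # ~0.474
```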
Yudkowsky argues his points well in longer formats, but he could make much better use of his Twitter account if he cares about popularizing his views. Despite having Musk respond to his tweets, his posts are very insider-like, with no chance of becoming widely impactful. I’m not sure whether he is present on other social media, and I understand that there are some health issues involved, but a YouTube channel would also be helpful if he hasn’t completely given up.
I do think it is a fact that many people involved in AI research and engineering, such as his example of Chollet, have simply not thought deeply about AGI and its consequences.
I’d like to, but it’ll have to wait until I’m finished with a commercial project where I’m using them, or until I replace these techniques with something else in my code. I’ll post a reply here once I do. I’d expect somebody else to discover at least one of them in the meantime; they’re not some stunning insights.
From what I gather, Alameda wasn’t worth quite as much but he had a larger stake in it. Both appear to be worthless now. There is also a U.S. subsidiary ftx.us, which according to them “is a separate entity with separate management personnel, tech infrastructure, and licensing.” Some calculations I’ve seen put SBF’s net worth below $1 billion now and I think it’s probable that he’ll have to deal with some big legal issues.
Baidu and Pony.ai have permits for fully driverless robotaxis in China: https://www.globaltimes.cn/page/202303/1287492.shtml
Gwern, have you actually tried Bing Chat yet? If it is GPT-4, then it’s a big disappointment compared to how unexpectedly good ChatGPT was. It fails on simple logic and math questions, just like ChatGPT. I don’t find the ability to retrieve text from the web to be too impressive—it’s low-hanging fruit that was long expected. It’s probably half-baked simply because Microsoft is in a hurry: they have limited time to gain market share before Google integrates Bard.
There have been a few papers with architectures showing performance that matches transformers on smaller datasets with scaling that looks promising. I can tell you that I’ve switched from attention to an architecture loosely based on one of these papers because it performed better on a smallish dataset in my project but I haven’t tested it on any standard vision or language datasets, so I don’t have any concrete evidence yet. Nevertheless, my guess is that indeed there is nothing special about transformers.
I don’t think PI-MNIST SOTA is really a thing. The OP even links to the original dropout paper from 2014, which shows this. MNIST SOTA is much less of a thing than it used to be but that’s at 99.9%+, not 98.9%.
Some anecdotal evidence: in the last few months I was able to improve on three 2021 conference-published, peer-reviewed DL papers. In each case, the reason I was able to do it was that the authors did not fully understand why the technique they used worked and obviously just wrote a paper around something that they experimentally found to be working. In addition, there are two pretty obvious bugs in a reasonably popular optimization library (100+ github stars) that reduce performance and haven’t been fixed or noticed in “Issues” for a long time. Seems that none of its users went step-by-step or tried to carefully understand what was going on.
What all four of these have in common is that they are still actually working, just not optimally. Their experimental results are not fake. This does not fill me with hope for the future of interpretability.
It won’t be able to multiply 5-digit integers (middle digits will be wrong).
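This is easy to check yourself with a minimal harness (a sketch only; `query_model` is a hypothetical stand-in for whichever model API you test):

```python
# Sketch: test an LLM on 5-digit multiplication against exact arithmetic.
# `query_model` is a hypothetical stand-in; wire it to a real API call.
import random

def query_model(prompt: str) -> str:
    raise NotImplementedError("plug in your model's API here")

random.seed(42)
trials, errors = 20, 0
for _ in range(trials):
    a = random.randint(10_000, 99_999)
    b = random.randint(10_000, 99_999)
    reply = query_model(f"Compute {a} * {b}. Reply with only the number.")
    if reply.strip().replace(",", "") != str(a * b):
        errors += 1  # the middle digits are usually where it goes wrong,
                     # since they accumulate the most partial products and carries
print(f"{errors}/{trials} incorrect")
```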