There is a paper that shows the overreliance of in-context learning on superficial cues. It is from 2022, and the tested models are old. So maybe newer ones are doing much better, but maybe it is not really learning, at least by some definitions.
Hopenope
I think the IMO result is best-of-32 and the USAMO result is not.
OpenAI is competing in the AtCoder World Tour Finals (Heuristic division) with a new model/agent. It is a 10-hour competition with an optimization-based problem, and OpenAI’s model is currently in 2nd place.
Is optimizing CoT to look nice a big concern? There are other ways to show a nice CoT without optimizing for it. The frontrunners also have some incentives not to show the real CoT. Additionally, there is a good chance that people will prefer a nicely structured summary of the CoT by a small LLM when the reasoning becomes very long and convoluted.
What is the point of these benchmarks without knowing the training compute and data? One of the main questions is their interpretability. Iterative refinement of these models may open new opportunities.
People with a history of seizures are usually excluded from these kinds of clinical trials, so it is not an apples-to-apples comparison. The problem is that bupropion interacts with a lot of drugs. Seizure rates are also highly dose-dependent (10 times higher if taking more than 450 mg daily). Generally, if you’re not taking any interacting medications, are on the 150–300 mg slow-release version, and have no history of seizures, then the risk is low.
As a doctor, I can tell you that even if you don’t have anxiety, it’s possible to develop some while taking bupropion/Wellbutrin. I used it personally and experienced the most severe anxiety I’ve ever had. It is also associated with a higher chance of seizures, and if you daydream a lot, it may make the daydreaming worse. However, on the positive side, it often decreases inattention. Generally I like the drug, but it is not a first-line treatment for depression, and for good reasons.
I lived for a while in a failing country with high unemployment. The businesses and jobs that pay well become saturated very quickly. People are less likely to spend money and often delay purchasing new stuff or maintaining their homes. Many jobs exist because we don’t have time to do them ourselves, and a significant number of these jobs will just vanish. It is really hard to prepare for a high-unemployment society.
Overrefusal issues were way more common 1-2 years ago. Models like Gemini 1 and Claude 1-2 had severe overrefusal issues.
Your argument is actually possible, but what evidence do you have that makes it the likely outcome?
The difficulty of alignment is still unknown. It may be totally impossible, or maybe some changes to current methods (deliberative alignment or constitutional AI) + some R&D automation can get us there.
The recurrent paper is actually scary, but some of the stuff there is questionable. Is 8 layers enough for a 3.5b model? Qwen 0.5b has 24 layers. There is also almost no difference between the 180b and 800b model when r=1 (Table 4). Is this just a case of overcoming an insufficient number of layers?
Is CoT faithfulness already obsolete? How does it survive concepts like latent space reasoning, or RL-based manipulations (R1-Zero)? Is it realistic to think that these highly competitive companies simply will not use them, and simply ignore the compute efficiency?
I am not sure if longer timelines are always safer. For example, when comparing a two-year timeline to a five-year one, there are a lot of advantages to the shorter timeline. In both cases you need to outsource a lot of alignment research to AI anyway, and with the shorter timeline the amount of compute and the number of players with significant compute are lower, which reduces both the racing pressure and takeoff speed.
What happened to the Waluigi effect? It used to be a big issue, some people were against it, and suddenly it is pretty much forgotten. Is there any related research, or are there recent demos, that examine it in more detail?
If you have a very short timeline, and you don’t think that alignment is solvable in such a short time, then what can you still do to reduce the chance of x-risk?
Many expert-level benchmarks totally overestimate the range and diversity of their experts’ knowledge. A person with a PhD in physics is probably undergraduate level in many parts of physics that are not related to his/her research area, and sometimes we even see that within an expert’s own domain (neurologists usually forget about nerves that are not clinically relevant).
ED is not the only problem with finasteride. I saw a couple of cases of gynecomastia in medical school and stopped using finasteride after that. Minoxidil worked fine solo for 4 years, but applying it every night was annoying, and when I stopped using it, I went bald fast (Norwood 6 in 5 months!).