Weekdays ^^
“Sweets only on weekdays starting with S”. Depending on your lifestyle and preference for sweets, this can be easy to implement.
Are there any disorders impairing spatial attention that you think would also impair empathy? I asked GPT-4 for disorders of spatial attention, and it gave me hemispatial neglect and Bálint's syndrome. If things were really convenient with hemispatial neglect, I can imagine that people always think of some of their thoughts and feelings as being on the "left" side. Then they would have difficulty having those feelings once they have trouble attending to anything on the left side. For a cliché example, someone associating his love with his heart on the left side (maybe that's a bad example; perhaps better would be something where someone has trouble telling whether a thought or feeling was their own or another person's).
Wouldn’t “Neuron Polysemanticity is not ‘just’ Superposition” be a more fitting title?
A piece of advice I frequently hear: always make sure you call somebody at the company you're applying to.
Is this still up-to-date advice? Or is messaging someone over LinkedIn or similar more appropriate? I'm mostly asking because I got the impression that the internet changed the norms such that no one makes phone calls anymore.
If you find that you're reluctant to delete computer files/emails, don't empty the trash.
In Gmail I like to scan the email headers and then bulk-select and archive them (`* a`, then `e`, thanks to vim shortcuts). After 5 years of doing this I still haven't run out of the free storage in Gmail. I already let Gmail sort the emails into "Primary", "Promotions", "Updates", etc. Usually the only important things are in "Primary", plus 1 or 2 in "Updates".
Can anyone here recommend particular tools to practice grammar? Or does anyone have strong opinions on the best workflow/tool to correct grammar on the fly? I already know Grammarly and LanguageTool, but Grammarly seems steep at $30 per month when I don't know if it is any good. I have tried GPT-4 before, but the main problems I have there are that it is too slow and that it changes my sentences more than I would like (I tried to make it do that less through prompting, which did not help much).
I notice that feeling unconfident about my grammar/punctuation leads me to write less online; applying for jobs or fellowships especially feels more icky because of it. That seems like an avoidable failure mode.
Ideally, I would like something like the German Orthografietrainer (it was created to teach middle and high school children spelling and grammar). It teaches you on a sentence-by-sentence basis where to put the commas and why, by explaining the sentence structure (illustrated through additional examples). Because it trains you on particularly tricky sentences, the training is effective, and within ~3 hours I got better at punctuation than my parents. Is there a similar tool for English that I have never heard of?
While writing this, I noticed that I no longer had the free version of Grammarly enabled, so I tried it while writing this post. One trick I noticed: it lists what kinds of errors you are making across the whole text, so it is easy to infer which particular mistake I made in which spot and then correct it myself. Also, Grammarly did not catch a few simple spelling and punctuation mistakes that LanguageTool caught (like "anymore", or the comma at the start of this sentence). At the end, I also tried ProWritingAid, which found additional issues.
Trying to learn a language from scratch, just from text, is a fun exercise for humans too. I recently tried this with Hindi after I had a disagreement with someone about the exact question of this post. I didn't get very far in 2 hours, though.
Tridactyl is amazing. You can disable the mode on specific websites by running the `blacklistadd` command. If you have configured that already, these settings can also be saved in your config file. Here's my config (though be careful before copying it: it has `fixamo_quiet` enabled, a command that almost got Tridactyl removed when it was enabled by default. You should read what it does before you enable it). Here are my ignore settings:
```
autocmd DocStart https://youtube.com mode ignore
autocmd DocStart https://todoist.com mode ignore
autocmd DocStart mail.google.com mode ignore
autocmd DocStart calendar.google.com mode ignore
autocmd DocStart keyma.sh mode ignore
autocmd DocStart monkeytype.com mode ignore
autocmd DocStart https://www.youtube.com mode ignore
autocmd DocStart https://ilias.studium.kit.edu/ mode ignore
autocmd DocStart localhost:888 mode ignore
autocmd DocStart getguestimate.com mode ignore
autocmd DocStart localhost:8888 mode ignore
```
Juggling: Anthony Gatto's juggling routine from 2000. Anthony Gatto holds several juggling world records. This routine is infamous in the juggling world (here's a decent juggler commenting on it), as is the fact that he gave up juggling to work with concrete instead (because it pays the bills). Here's more context on Gatto and his routine (the guy picking up the balls for him in the video is his father, for example):
Agreed. Especially the "electoral college is good actually" part is where I started laughing. If you don't want tyranny of the majority, perhaps a first step towards a saner system would be to stop crippling it with first-past-the-post voting.
Absolutely love this essay! Seeing green from the perspective of non-green thoughts really resonated with things I have thought in the past and made me notice how I have been confused by green. Helpful for AGI or not, this is giving me a bunch of fresh thoughts about problems/confusing areas in my own life, so thanks!
A quick intuitive check for whether something is a natural latent over some parts of a system consists of two questions:
Are the parts (approximately) independent given the candidate natural latent?
I first had some trouble checking this condition intuitively, and I might still not have it right. I think one of the main things that confused me at first is that if I want to reason about natural latents for "a" dog, I need to think about a group of dogs, even though there are also natural latents for the individual dog (like fur color being a natural latent across the patches of the dog's fur). Say I check the independence condition for a set of animals that are each either a cat or a dog. If I look at a single animal's shoulder height, it tells me which of the two clusters that animal is in, but once I have updated on that information, my guesses for the other dogs' heights will not improve any further.
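To make that independence check concrete, here is a toy sketch (the species means and spreads are made-up illustration numbers, not real zoological data): the species label plays the role of the candidate natural latent, and two animals' shoulder heights are the parts. Unconditionally, the two heights are strongly correlated; conditional on the label, the correlation roughly vanishes.

```python
import random

random.seed(42)

# Toy cats-vs-dogs check: the species label is the candidate natural
# latent, and the two shoulder heights are the "parts".
def draw_pair():
    species = random.choice(["cat", "dog"])
    mean = 25.0 if species == "cat" else 55.0  # made-up heights in cm
    a = random.gauss(mean, 4.0)
    b = random.gauss(mean, 4.0)
    return species, a, b

pairs = [draw_pair() for _ in range(20_000)]

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

# Without knowing the species, one animal's height predicts the other's...
all_corr = corr([a for _, a, _ in pairs], [b for _, _, b in pairs])

# ...but within one species, one height tells you ~nothing about the other.
dog_pairs = [(a, b) for s, a, b in pairs if s == "dog"]
dog_corr = corr([a for a, _ in dog_pairs], [b for _, b in dog_pairs])

print(round(all_corr, 2), round(dog_corr, 2))
```

The unconditional correlation is large because both heights share the between-species variance, while the conditional correlation is near zero, which is exactly the "(approximately) independent given the candidate natural latent" condition.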
An important example of something that is not a natural latent is the empirical mean of a fat-tailed distribution at real-world sample sizes, while it is one for thin-tailed distributions. This fact is what Nassim Taleb keeps harping on. It doesn't mean fat-tailed distributions have no natural latents, though: for Pareto distributions (think: pandemics, earthquakes, wealth), one still has natural latents like the tail index (estimated from plotting the data on a log-log plot by dilettantes like me, and more sophisticatedly by real professionals).
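A quick stdlib-only sketch of both halves of that claim (my own illustration, not Taleb's methodology): batch means of a Pareto sample with tail index α = 1.2 stay wildly dispersed at realistic sample sizes, while Gaussian batch means concentrate, and yet the tail index itself is recoverable from the same data with the Hill estimator.

```python
import math
import random

random.seed(0)

def batch_means(draw, n_batches=200, batch_size=1000):
    return [sum(draw() for _ in range(batch_size)) / batch_size
            for _ in range(n_batches)]

# Thin-tailed: the empirical mean concentrates quickly across batches.
thin = batch_means(lambda: random.gauss(0, 1))

# Fat-tailed: Pareto with tail index 1.2 has infinite variance, so the
# empirical mean keeps jumping around at realistic sample sizes.
fat = batch_means(lambda: random.paretovariate(1.2))

def spread(xs):
    return max(xs) - min(xs)

print(spread(thin))
print(spread(fat))

# The tail index is still a well-behaved summary: the Hill estimator
# recovers it from the largest order statistics.
sample = sorted(random.paretovariate(1.2) for _ in range(100_000))
k = 1000
top = sample[-k:]
x_min = sample[-k - 1]
alpha_hat = k / sum(math.log(x / x_min) for x in top)
print(round(alpha_hat, 2))
```

The spread of the fat-tailed batch means dwarfs the thin-tailed one, while `alpha_hat` lands close to the true 1.2, which is the sense in which the tail index, unlike the mean, is a stable summary of the data.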
If I had more time I would have written a shorter letter.
TLDR: I looked into what it would take to fine-tune gpt-4 to do Fermi estimates better. If you liked the post/paper on fine-tuning language models to make predictions, you might like reading this. I evaluated gpt-4 on the first dataset I found, but gpt-4 was already making better Fermi estimates than the examples in the dataset, so I stopped there (my code).
The first problem I encountered: there is no public access to fine-tuning gpt-4 so far. Ok, we might as well just do gpt-3.5, I guess.
First, I found this Fermi estimate dataset. (While doing this, I was thinking I should perhaps search more widely for what kinds of AI benchmarks exist, since a dataset evaluating a similar capability is probably already out there, but I don't know its name.)
Next I looked at this paper, where people used, among others, gpt-3.5 and gpt-4 on this benchmark. Clearly these people weren't even trying, though, because gpt-4 does worse than gpt-3.5. One of the main issues I saw was that they were trying to make the LLM output the answer as a program in the domain-specific language used in that dataset. They couldn't even get the LLM to output valid programs more than 60% of the time. (Their metric compares on a log scale whether the answer by the LLM is within 3 orders of magnitude of the real answer; 1 is best, 0 means more than 3 orders of magnitude away: fp-score(x) = max(0, 1 − (1/3)·|log₁₀(prediction/answer)|).)
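For reference, the fp-score formula above is a one-liner (my own restatement of the metric as given, not the paper's code):

```python
import math

def fp_score(prediction: float, answer: float) -> float:
    # 1.0 for an exact match, decaying linearly in orders of magnitude,
    # hitting 0.0 once the prediction is >= 3 orders of magnitude off.
    return max(0.0, 1.0 - abs(math.log10(prediction / answer)) / 3.0)

print(fp_score(1000, 1000))       # exact match -> 1.0
print(fp_score(10_000, 1000))     # one order of magnitude off
print(fp_score(2_000_000, 1000))  # more than 3 orders off -> 0.0
```

Note the metric is symmetric in over- vs. underestimation, since |log₁₀| treats a 10× overshoot the same as a 10× undershoot.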
My conjecture was that just using Python instead should give better results (this turned out to be true). I get a mean score of ~0.57 on 100 sample problems, so gpt-4-turbo gives me results as good as they get only after first providing "context" by giving the LLM the values for the key variables needed to compute the answer (why would the task even still be hard at that point?).
When gpt-4 turned out to get a worse fp-score than gpt-4-turbo on my 10 samples, I got suspicious. After looking at the samples where gpt-4 got a bad score, it was clear this was mostly to blame on the bad quality of the dataset: 2 answers were flat-out confused and not using the correct variables, while gpt-4 was answering correctly. Once, the question didn't make clear which unit to use. In 2 of the samples, gpt-4 gave the better answer: once by using a better approach (using geometry, instead of wrong figures for how much energy the earth gets from the sun, to determine the fraction of the sun's energy that the earth receives), and once by having better input estimates (like how many car miles are driven in total in the US).
So on this dataset, gpt-4 seems to be already at the point of data saturation. I was actually quite impressed by how well it was doing. When I had tried using gpt-4 for this task before, I had always felt like it was doing quite badly. One guess I have is that this is because when I ask gpt-4 for an estimate, it is often a practical question, which is actually harder than these artificial questions. In addition, the reason I ask gpt-4 at all is that the question is hard, and I expect to need a lot of cognitive labor to do it myself.
Another data point in this respect was the "Thinking Physics" exercises, which I tried with some of my friends. For that task, gpt-4 was better than the people who were bad at it, but worse than the people who were good at it and given 5–10 minutes of thinking time (although I did not evaluate that rigorously). GPT-4 is probably better than most humans at doing Fermi estimates given 10 minutes of time, especially in domains one is unfamiliar with, since it has so much more breadth.
I would be interested to see what one could get out of actually making a high-quality dataset, by taking Fermi estimates from people whose work in this area I deem high quality.
Not exactly what you were looking for, but recently I noticed that there were a bunch of John Wentworth's posts from the past 6 years that I had been missing out on. So if you get a lot out of them too, I recommend just sorting by 'old'. I really liked don't get distracted by the boilerplate (the first example made something click about math for me that hadn't clicked before, which would have helped me engage with some "boilerplate" more productively). I also liked constraints and slackness, but I haven't gone beyond the first exercise yet. There are also more technical posts that I haven't had time to dig into yet.
bhauth doesn't have as long a track record, but I got some interesting ideas from his blog that aren't on his lesswrong account. I really liked proposed future economies and the legibility bottleneck.
This post warms my heart. Thank you.
The pdf linked by @CstineSublime definitely goes in the direction of the textbook. I've started reading it, and it has been an excellent read so far. I will probably write a review later.
While there is currently a lot of attention on assessing language models, it puzzles me that no one seems to be independently assessing the quality of different search engines and recommender systems. Shouldn't this be easy to do? The only related thing I could find is this Russian site (it might be propaganda from Yandex, as Yandex is listed as the top-quality site?). Taking their "overall search quality" rating at face value does seem to support the popular hypothesis that Google's search quality has slightly deteriorated over the last 10 years (although compared to 2009-2012, quality has been basically the same according to this measure).
The gpt-4-translated version of their blog states that they gave up actively maintaining this project in 2014, because search engine quality had become reasonable according to them:
For the first time in the history of the project, we have decided to shut down one of the analyzers: SEO pressing as a phenomenon has essentially become a thing of the past, and the results of the analyzer have ceased to be interesting.
Despite the fact that search engine optimization as an industry continues to thrive, search engine developers have made significant progress in combating the clutter of search results with specially promoted commercial results. The progress of search engines is evident to the naked eye, including in the graph of our analyzer over the entire history of measurements:
[Image: SEO pressing analyzer graph]
The result of the analyzer is the share of commercial sites in the search results for queries that do not have a clearly commercial meaning; when there are too many such sites in the search results, it is called susceptibility to SEO pressing. It is easy to see that a few years ago, more than half (sometimes significantly more than half) of the search results from all leading search engines consisted of sites offering specific goods or services. This is, of course, a lot: a query can have different meanings, and the search results should cater to as many of them as possible. At the same time, a level of 2-3 such sites seems decent, since a user who queries “Thailand” might easily be interested in tours to that country, and one who queries “power station” might be interested in power stations for a country home.
If we are worried that current recommender systems are already doing damage and expect things to get worse in the future, it might be good to actively monitor this so that we don't end up as boiled frogs.
I've recently tried to play this again with @Towards_Keeperhood. We think it was still working a year ago. He would be happy to pay a $50 bounty for getting this fixed by reverting it to the previous version (or whatever happened there). If the code were public, that would also be helpful, because then I might get around to fixing it myself.