It’s an AWS firewall rule with bad defaults. We’ll fix it soon, but in the meantime, you can scrape if you change your user agent to something other than wget/curl/etc. Please put your name/project in the user agent so we can identify you in the logs if we need to, and rate-limit yourself conservatively.
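As a rough illustration of that advice, here is a minimal Python sketch that sends an identifying user agent and pauses between requests. The user-agent string, the contact address, and the five-second delay are placeholder assumptions, not values the site admins have specified.

```python
import time
import urllib.request

# Hypothetical identifying string: replace with your own name/project and contact.
USER_AGENT = "example-archive-project/0.1 (contact: you@example.com)"
# Assumed conservative delay between requests; not a figure given by the admins.
DELAY_SECONDS = 5.0


def build_request(url, user_agent=USER_AGENT):
    """Create a request that identifies the scraper instead of a default agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_all(urls, delay=DELAY_SECONDS, opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch URLs one at a time, sleeping `delay` seconds between requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # pause before every request except the first
        with opener(build_request(url)) as response:
            pages[url] = response.read()
    return pages
```

Calling `fetch_all([...])` with a list of post URLs returns a dict mapping each URL to its raw bytes. The `opener` and `sleep` parameters exist only so the function can be exercised without touching the network.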
thanks a lot for the answer!
This is something I’m curious about as well! A friend recently introduced me to LessWrong, and I’ve found myself really enjoying the posts here! I’d like to spend more focused time digging into them!
I’d like to create a dump of LessWrong so that I can use a tool like DocETL (https://www.docetl.org/) to better sift through articles that might be interesting to me. It’s been quite some time since jimrandomh replied to this post. So I just thought I’d check in before I attempted to crawl the site.
Also, it looks like https://www.lesswrong.com/robots.txt disallows hitting /allPosts?
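One way to check a path against robots.txt before crawling is Python's standard `urllib.robotparser`. The sketch below uses made-up sample rules for illustration; they are an assumption and not the actual contents of LessWrong's robots.txt.

```python
from urllib import robotparser

# Illustrative rules only: NOT the real contents of the site's robots.txt.
SAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /allPosts",
]


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines permit user_agent to fetch url."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)
```

To check the live file instead, `RobotFileParser` also offers `set_url(...)` followed by `read()`, which downloads and parses robots.txt from the site.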
In this related post, someone mentions another website called GreaterWrong, but I’m not sure I understand the relationship between that website and this one. I’m a total newbie to this community haha.
What’s the most thoughtful way to get a dump of LessWrong? Is that even desirable by the folks that run this site?
greaterwrong is a website with the same content as lesswrong but a different look; it gets its content from the lesswrong website. it’s basically just a different way to access the same posts.
lesswrong has a graphql api, which is probably the best way to read and dump posts, as long as you rate-limit conservatively. but that means some programming is involved.
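For a sense of what that programming might look like, here is a hedged Python sketch that builds a paginated GraphQL request. The endpoint path, the `posts` query shape, the `view`/`limit`/`offset` terms, and the field names are all assumptions to verify against the live schema, and the user-agent string is a placeholder.

```python
import json
import urllib.request

# Assumed endpoint; confirm it before relying on it.
GRAPHQL_URL = "https://www.lesswrong.com/graphql"

# Assumed query shape and field names; check them against the live schema.
QUERY_TEMPLATE = """
{
  posts(input: {terms: {view: "new", limit: %d, offset: %d}}) {
    results { _id title }
  }
}
"""


def build_payload(limit, offset):
    """Encode one page of the (assumed) posts query as a JSON request body."""
    query = QUERY_TEMPLATE % (limit, offset)
    return json.dumps({"query": query}).encode("utf-8")


def build_request(limit, offset, user_agent):
    """Build a POST request for one page, identifying the caller by user agent."""
    return urllib.request.Request(
        GRAPHQL_URL,
        data=build_payload(limit, offset),
        headers={"Content-Type": "application/json", "User-Agent": user_agent},
    )
```

Each request would then be sent with `urllib.request.urlopen(...)`, stepping `offset` forward by `limit` and sleeping between pages to stay within a conservative rate.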
for just a quick dump of everything, it’s probably best to use wget on greaterwrong with rate limiting. email admin@greaterwrong.com before doing so to make sure you do it in a way they approve of.
gotcha, thanks!