jimrandomh answers Why is lesswrong blocking wget and curl (scrape)?

jimrandomh 8 Nov 2023 22:06 UTC
21 points
0
It’s an AWS firewall rule with bad defaults. We’ll fix it soon, but in the meantime, you can scrape if you change your user agent to something other than wget/curl/etc. Please use your name/project in the user-agent so we can identify you in logs if we need to, and rate-limit yourself conservatively.
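A minimal sketch of that advice in Python, using only the standard library: it sets an identifying User-Agent and enforces a pause between requests. The user-agent string, the contact address, and the 5-second delay are placeholder assumptions for illustration, not values from the site admins.

```python
import time
import urllib.request

# Hypothetical identifying string: replace with your own name/project and contact.
USER_AGENT = "example-project-scraper/0.1 (contact: you@example.com)"
# Assumed conservative delay between requests; not a figure given by the admins.
MIN_DELAY_SECONDS = 5.0


def build_request(url: str, user_agent: str = USER_AGENT) -> urllib.request.Request:
    """Create a GET request that carries an identifying User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


class Throttle:
    """Enforce a minimum gap between consecutive calls to wait()."""

    def __init__(self, min_delay: float, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        now = self._clock()
        slept = 0.0
        if self._last is not None:
            remaining = self.min_delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


def fetch(url: str, throttle: Throttle) -> bytes:
    """Wait for the throttle, then download one page with the custom User-Agent."""
    throttle.wait()
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        return response.read()
```

To use it, create one `Throttle(MIN_DELAY_SECONDS)` and pass it to every `fetch` call so that all requests share the same pacing.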
- nick lacombe 9 Nov 2023 1:53 UTC
  3 points
  0
  thanks a lot for the answer!
- varungodbole 27 Jan 2025 16:32 UTC
  1 point
  0
  This is something I’m curious about as well! A friend recently introduced me to LessWrong, and I’ve found myself really enjoying the posts here! I’d like to spend more focused time digging into them!
  I’d like to create a dump of LessWrong so that I can use a tool like DocETL (https://www.docetl.org/) to better sift through articles that might be interesting to me. It’s been quite some time since jimrandomh replied to this post. So I just thought I’d check in before I attempted to crawl the site.
  Also, it looks like https://www.lesswrong.com/robots.txt disallows hitting /allPosts?
  In this related post, someone mentions another website called greater wrong. But I’m not sure I understand the relationship between that website and this website. I’m a total newbie to this community haha.
  What’s the most thoughtful way to get a dump of LessWrong? Is that even something the folks who run this site would want?
  - nick lacombe 30 Jan 2025 4:10 UTC
    1 point
    0
    greaterwrong is a website with the same content as lesswrong but a different look. it gets its content from the lesswrong website, so it’s basically just a different way to access the same posts.
    
    lesswrong has a graphql api, which is probably the best way to read and dump posts, if you rate limit conservatively. but that involves some programming.
    
    to just have a quick dump of everything it’s probably best to use wget on greaterwrong with rate limiting. email admin@greaterwrong.com before doing so to make sure you do it in a way they approve of.
    - varungodbole 3 Feb 2025 2:30 UTC
      1 point
      0
      gotcha, thanks!
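
For the GraphQL route mentioned in the thread, a rough sketch of building a request with the Python standard library might look like the following. The endpoint URL, the query shape, and every field name here are assumptions that should be checked against the live schema before use; the user-agent string is a placeholder.

```python
import json
import urllib.request

# Assumed endpoint; confirm it before relying on it.
GRAPHQL_URL = "https://www.lesswrong.com/graphql"
# Hypothetical identifying string: replace with your own name/project and contact.
USER_AGENT = "example-project-scraper/0.1 (contact: you@example.com)"

# Assumed query shape and field names, for illustration only.
POSTS_QUERY = """
query RecentPosts($limit: Int) {
  posts(input: {terms: {view: "new", limit: $limit}}) {
    results { _id title slug postedAt }
  }
}
"""


def build_graphql_request(query: str, variables: dict,
                          url: str = GRAPHQL_URL) -> urllib.request.Request:
    """Build a JSON POST request carrying a GraphQL query and its variables."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    headers = {"Content-Type": "application/json", "User-Agent": USER_AGENT}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def run_query(query: str, variables: dict) -> dict:
    """Send the query and decode the JSON response (performs a network call)."""
    request = build_graphql_request(query, variables)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `run_query(POSTS_QUERY, {"limit": 10})` would request one small page; pausing between pages keeps the load conservative, as the answer asks.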