Archiving link posts?

Link rot is a huge problem. At the same time, many posts on Less Wrong (including some of the most important posts, which discuss important concepts or otherwise advance our collective knowledge and understanding) are link posts, which means that a non-trivial chunk of our content is hosted elsewhere, across myriad other websites.

If Less Wrong means to be a repository of the rationality community’s canon, we must take seriously the fact that (as gwern’s research indicates) many or most of those externally-hosted pages will, in a few years, no longer be accessible.

I’ve taken the liberty of putting together a quick-and-dirty solution: a page that, when loaded, scrapes the external links (i.e., the link-post targets) from the front page of GreaterWrong and automatically submits them to archive.is (after first checking each link to see whether it’s already been submitted). A cronjob that loads the page daily ensures that as new link posts appear, they are automatically captured and submitted to archive.is.
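For the curious, here is a minimal sketch of that logic in Python. The `a.link-post-link` selector is an assumption about GreaterWrong’s markup, and the timegate-based “already archived?” check is an assumed (though simple) way to test for an existing snapshot; treat both as placeholders rather than guaranteed specifics:

```python
import requests
from bs4 import BeautifulSoup

FRONT_PAGE = "https://www.greaterwrong.com/"

def scrape_link_post_targets():
    """Collect the external URLs that front-page link posts point to."""
    html = requests.get(FRONT_PAGE).text
    soup = BeautifulSoup(html, "html.parser")
    # Assumed selector for link-post target anchors; verify against the real markup.
    return {a["href"] for a in soup.select("a.link-post-link") if a.get("href")}

def already_archived(url):
    """Check whether archive.is appears to have a snapshot of this URL."""
    r = requests.get("http://archive.is/timegate/" + url, allow_redirects=False)
    # Assumption: the timegate redirects (or responds 200) when a snapshot exists.
    return r.status_code in (200, 302)

def submit(url):
    """Ask archive.is to capture the page (single POST field: url)."""
    requests.post("http://archive.is/submit/", data={"url": url})

for target in scrape_link_post_targets():
    if not already_archived(target):
        submit(target)
```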

This solution does not currently have any way to scrape and submit links older than those on the front page today (2018-09-08). It is also not especially elegant.

It may be advisable to implement automatic link-post archiving as a feature of Less Wrong itself. (Programmatically submitting URLs to archive.is is extremely simple. You send a POST request to http://archive.is/submit/ with a single field, url, with the URL as its value. The archived content will then (after some time, as archiving is not instantaneous) be accessible via http://archive.is/timegate/[the complete original URL].)
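In Python, that whole exchange looks roughly like this (the example URL is, of course, a placeholder):

```python
import requests

# Hypothetical link-post target; any URL would do.
url = "https://example.com/some-essay"

# Submit the URL for archiving: a POST with a single field, "url".
requests.post("http://archive.is/submit/", data={"url": url})

# Later (archiving is not instantaneous), the snapshot should be reachable
# via the timegate endpoint, which resolves to the archived copy.
snapshot = requests.get("http://archive.is/timegate/" + url)
print(snapshot.url)
```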