I’ve been lurking on LW since 2013, but only started posting recently. My day job was “analytics broadly construed” although I’m currently exploring applied prio-like roles; my degree is in physics; I used to write on Quora and Substack but stopped, although I’m still on the EA Forum. I’m based in Kuala Lumpur, Malaysia.
Mo Putera
A while back johnswentworth wrote What Do GDP Growth Curves Really Mean?, noting that real GDP (as we actually calculate it) is a wildly misleading measure of growth because it effectively ignores major technological breakthroughs. Quoting the post: real GDP growth mostly tracks production of goods which aren’t revolutionized; goods whose prices drop dramatically are downweighted to near-zero, so that when we see slow, mostly-steady real GDP growth curves, that mostly tells us about the slow and steady increase in production of things which haven’t been revolutionized, and approximately-nothing about the huge revolutions in e.g. electronics. Some takeaways by the author that I think pertain to the “AI and GDP growth acceleration” discussion:
“even after a hypothetical massive jump in AI, real GDP would still look smooth, because it would be calculated based on post-jump prices, and it seems pretty likely that there will be something which isn’t revolutionized by AI… Whatever things don’t get much cheaper are the things which would dominate real GDP curves after a big AI jump”
“More generally, the smoothness of real GDP curves does not actually mean that technology progresses smoothly. It just means that we’re constantly updating the calculations, in hindsight, to focus on whatever goods were not revolutionized”
“using price as a proxy for value is just generally not great for purposes of thinking about long-term growth and technology shifts” (quote)
Ever since I read that post I’ve generally discounted claims that AI progress would accelerate GDP growth to double digits unless the claims address the methodological point about how GDP is calculated. This isn’t an argument against the potential for AI progress to revolutionize human civilization or whatever; it’s (as I interpret it) an argument against using anything remotely resembling GDP in trying to quantify how much AI progress is revolutionizing human civilization, because it just won’t capture it.
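To make the methodological point concrete, here’s a toy two-good sketch (my own made-up numbers, not Wentworth’s, and only a rough stand-in for how chained indices actually work): one good gets revolutionized and its price collapses, and growth measured in hindsight at post-jump prices barely registers the change.

```python
# Toy two-good illustration (made-up numbers): "chips" get revolutionized
# (100x the quantity at ~1/10,000th the price), "haircuts" don't.
qty_before   = {"chips": 1.0,   "haircuts": 1.0}
qty_after    = {"chips": 100.0, "haircuts": 1.1}
price_before = {"chips": 100.0, "haircuts": 100.0}
price_after  = {"chips": 0.01,  "haircuts": 100.0}

def basket_value(qty, price):
    """Value a basket of quantities at a fixed set of prices."""
    return sum(qty[g] * price[g] for g in qty)

# Growth valued in hindsight at post-jump prices (roughly the effect of chaining):
# the chip explosion is weighted by its collapsed price, so haircuts dominate.
print(basket_value(qty_after, price_after) / basket_value(qty_before, price_after))    # ~1.1x
# The same physical change valued at pre-jump prices looks like a ~50x explosion.
print(basket_value(qty_after, price_before) / basket_value(qty_before, price_before))  # ~50x
```

Same physical world, same change, and the measured answer swings from “basically nothing happened” to “a 50x explosion” depending purely on which prices you weight by; chained real GDP keeps re-anchoring to the post-revolution prices.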
I think your ‘Towards a coherent process for metric design’ section alone is worth its weight in gold. Since most LW readers aren’t going to click on your linked paper (click-through rates being as low in general as they are, from my experience in marketing analytics), let me quote that section wholesale:
Given the various strategies and considerations discussed in the paper, as well as failure modes and limitations, it is useful to lay out a simple and coherent outline of a process for metric design. While this will by necessity be far from complete, and will include items that may not be relevant for a particular application, it should provide at least an outline that can be adapted to various metric design processes. Outside of the specific issues discussed earlier, there is a wide breadth of expertise and understanding that may be needed for metric design. Citations in this section will also provide a variety of resources for at least introductory further reading on those topics.
- Understand the system being measured, including both technical (Blanchard & Fabrycky, 1990) and organizational (Berry & Houston, 1993) considerations.
- Determine scope
  - What is included in the system?
  - What will the metrics be used for?
- Understand the causal structure of the system
  - What is the logic model or theory? (Rogers, Petrosino, Huebner, & Hacsi, 2000)
  - Is there formal analysis (Gelman, 2010) or expert opinion (van Gelder, Vodicka, & Armstrong, 2016) that can inform this?
- Identify stakeholders (Kenny, 2014)
  - Who will be affected?
  - Who will use the metrics?
  - Whose goals are relevant?
- Identify the Goals
  - What immediate goals are being served by the metric(s)? How are individual impacts related to performance more broadly? (Ruch, 1994)
  - What longer term or broader goals are implicated?
- Identify Relevant Desiderata
  - Availability
  - Cost
  - Immediacy
  - Simplicity
  - Transparency
  - Fairness
  - Corruptibility
- Brainstorm potential metrics
  - What outcomes are important to capture?
  - What data sources exist?
  - What methods can be used to capture additional data?
  - What measurements are easy to capture?
  - What is the relationship between the measurements and the outcomes?
  - What isn’t captured by the metrics?
- Consider and Plan
  - Understand why and how the metric is useful. (Manheim, 2018)
  - Consider how the metrics will be used to diagnose issues or incentivize people. (Dai, Dietvorst, Tuckfield, Milkman, & Schweitzer, 2017)
  - Plan how to use the metrics to develop the system, avoiding the “reward/punish” dichotomy. (Wigert & Harter, 2017)
  - Perform a pre-mortem (Klein, 2007)
  - Plan to routinely revisit the metrics (Atkins, Wanick, & Wills, 2017)
It is worth remarking, though, that even a nuclear rocket might learn something useful from practicing the gravity turn maneuver. Just because you have an easy time leaving Earth’s atmosphere and have no need of finesse doesn’t mean your travels won’t land you on Venus someday.
I’m reminded of the career advice page on Terry Tao’s blog. When I first found it many years ago as a student, I wondered why someone like Tao would bother to write about stuff like “work hard” and “write down what you’ve done” and “be patient” and “learn and relearn your field”. Wasn’t this for “mere mortals” like me, who have to do the best we can with the (relatively) limited brains we’ve got, rather than prodigies who win IMO gold medals at 13 and get PhDs from Princeton at 21, etc.? But for whatever reason this particular nuclear rocket practiced the gravity turn maneuver pretty seriously; and (at least in math circles) we know how he turned out.
Chris Olah for machine learning (I’m thinking in particular of his backpropagation essay); Qiaochu Yuan for math (I’d been following his writing on Math Overflow and MSE for years before discovering, to my pleasant surprise, that he’s also a frequent LW poster), as well as John Baez and Tim Gowers (their blog posts are, to me, the gold standard for research-level math exposition); and Sabine Hossenfelder for theoretical physics.
I was wondering about this too. I thought of Eugene Wei writing about Edward Tufte’s classic book The Visual Display of Quantitative Information, which he considers “[one of] the most important books I’ve read”. He illustrates with an example, just like dynomight did above, starting with this chart auto-created in Excel:
and systematically applies Tufte’s principles to eventually end up with this:

Wei adds further commentary:
No issues for color blind users, but we’re stretching the limits of line styles past where I’m comfortable. To me, it’s somewhat easier with the colored lines above to trace different countries across time versus each other, though this monochrome version isn’t terrible. Still, this chart reminds me, in many ways, of the monochromatic look of my old Amazon Analytics Package, though it is missing data labels (wouldn’t fit here) and has horizontal gridlines (mine never did).
We’re running into some of these tradeoffs because of the sheer number of data series in play. Eight is not just enough, it is probably too many. Past some number of data series, it’s often easier and cleaner to display these as a series of small multiples. It all depends on the goal and what you’re trying to communicate.
At some point, no set of principles is one size fits all, and as the communicator you have to make some subjective judgments. For example, at Amazon, I knew that Joy wanted to see the data values marked on the graph, whenever they could be displayed. She was that detail-oriented. Once I included data values, gridlines were repetitive, and y-axis labels could be reduced in number as well.
Tufte advocates reducing non-data-ink, within reason, and gridlines are often just that. In some cases, if data values aren’t possible to fit onto a line graph, I sometimes include gridlines to allow for easy calculation of the relative ratio of one value to another (simply count gridlines between the values), but that’s an edge case.
For sharp changes, like an anomalous reversal in the slope of a line graph, I often inserted a note directly on the graph, to anticipate and head off any viewer questions. For example, in the graph above, if fewer data series were included, but Greece remained, one might wish to explain the decline in health expenditures starting in 2008 by adding a note in the plot area near that data point, noting the beginning of the Greek financial crisis (I don’t know if that’s the actual cause, but whatever the reason or theory, I’d place it there).
If we had company targets for a specific metric, I’d note those on the chart(s) in question as a labeled asymptote. You can never remind people of goals often enough.
And I thought, okay, sounds persuasive and all, but also this feels like Wei/Tufte is pushing their personal aesthetic on me, and I can’t really tell the difference (or whether it matters).
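For anyone who’d rather poke at this than take Wei’s (or my) word for it, here’s a minimal matplotlib sketch with entirely made-up numbers that applies a few of the tweaks he describes: monochrome lines labeled directly at their endpoints instead of a legend, no gridlines, a reduced frame, and an on-chart note anticipating a viewer question.

```python
import matplotlib.pyplot as plt

years = list(range(2000, 2013))
series = {  # hypothetical "health expenditure"-style numbers, purely illustrative
    "Greece":  [6.0, 6.2, 6.5, 6.9, 7.3, 7.8, 8.3, 8.9, 8.6, 8.2, 7.8, 7.4, 7.1],
    "Germany": [8.0, 8.1, 8.3, 8.4, 8.6, 8.7, 8.9, 9.0, 9.2, 9.5, 9.6, 9.7, 9.8],
    "US":      [9.0, 9.3, 9.7, 10.0, 10.3, 10.6, 10.9, 11.2, 11.5, 12.0, 12.2, 12.4, 12.6],
}

fig, ax = plt.subplots(figsize=(7, 4))
for name, values in series.items():
    ax.plot(years, values, color="0.3", linewidth=1.5)
    # Label each line at its right endpoint rather than using a legend.
    ax.annotate(name, xy=(years[-1], values[-1]), xytext=(5, 0),
                textcoords="offset points", va="center")

ax.grid(False)                    # gridlines are non-data-ink here
for side in ("top", "right"):     # reduce the frame to the two data axes
    ax.spines[side].set_visible(False)
ax.set_ylabel("Health expenditure, % of GDP (illustrative)")
# An on-chart note anticipating a question, as Wei suggests:
ax.annotate("Greek financial crisis?", xy=(2008, 8.6), xytext=(2002.5, 11.5),
            arrowprops=dict(arrowstyle="->", color="0.5"))
plt.tight_layout()
plt.show()
```

Even a sketch like this makes Wei’s point about subjective judgment obvious: half the “principles” are tradeoffs you have to resolve by hand.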
… most ratings range from 4 to 5. A 4.8 is a good rating. A 4.2 is underwhelming.
Is this US-centric? I was confused by this prior, since I see 2 and 3 star ratings all the time; 4.2 would catch my attention as being pretty good.
Your anecdote about the friend who’s overenthusiastic in their texts reminds me of my culture shock when I first arrived in the US. People were very often ‘fake-friendly’, for want of a better term; it was disorienting.
Nitpick that doesn’t bear upon the main thrust of the article:
2021: Here’s a random weightlifter I found coming in at over 400kg, I don’t have his DEXA but let’s say somewhere between 300 and 350kgs of muscle.
More plausibly, Josh Silvas weighs 220-ish kg, not 400 kg, and there’s no way he has anywhere near 300+ kg of muscle. To contextualize, the heaviest WSM winners ever weighed around 200-210 kg (Hafthor, Brian); Brian in particular had a lean body mass of 156 kg back when he weighed 200 kg peaking for competition (‘peaking’ implies unsustainability), which is the highest DEXA figure I’ve ever found in years of following strength-related statistics, and lean body mass already overstates muscle since it also includes bone, organs, and water. Even at the claimed 400 kg, 300-350 kg of skeletal muscle would be 75-88% of body weight in muscle alone, comparable to or beyond the most extreme documented lean mass fraction (Brian’s ~78%).
I concur with your last paragraph, and see it as a special case of rationalist taboo (taboo “AGI”). I’d personally like to see a set of AGI timeline questions on Metaculus where only the definitions differ. I think it would be useful for the same forecasters to see how their timeline predictions vary by definition; I suspect there would be a lot of personal updating to resolve emergent inconsistencies (extrapolating from my own experience, and also from ACX prediction market posts IIRC), and it would be interesting to see how those personal updates behave in the aggregate.
There’s a tangentially related comment by Scott Alexander from over a decade ago, on the subject of writing advice, which I still think about from time to time:
The best way to improve the natural flow of ideas, and your writing in general, is to read really good writers so much that you unconsciously pick up their turns of phrase and don’t even realize when you’re using them. The best time to do that is when you’re eight years old; the second best time is now.
Your role models here should be those vampires who hunt down the talented, suck out their souls, and absorb their powers. Which writers’ souls you feast upon depends on your own natural style and your goals. I’ve gained most from reading Eliezer, Mencius Moldbug, Aleister Crowley, and G.K. Chesterton (links go to writing samples from each I consider particularly good); I’m currently making my way through Chesterton’s collected works pretty much with the sole aim of imprinting his writing style into my brain.
Stepping from the sublime to the ridiculous, I took a lot from reading Dave Barry when I was a child. He has a very observational sense of humor, the sort where instead of going out looking for jokes, he just writes about a topic and it ends up funny. It’s not hard to copy if you’re familiar enough with it. And if you can be funny, people will read you whether you have any other redeeming qualities or not.
Getting imprinted with good writers like this will serve you for your entire life. It will serve you whether you’re on your fiftieth draft of a thesis paper, or you’re rushing a Less Wrong comment in the three minutes before you have to go to work. It will even serve you in regular old non-written conversation, because wit and clarity are independent of medium.
Thinking of myself as a vampire hunting down the talented and feasting on their souls to absorb their powers is somewhat dramatic, but vividly memorable...
I’m guessing you’re referring to Brian Potter’s post Where Are The Robotic Bricklayers?, which to me is a great example of reality being surprisingly detailed. Quoting Brian:
Masonry seemed like the perfect candidate for mechanization, but a hundred years of limited success suggests there’s some aspect to it that prevents a machine from easily doing it. This makes it an interesting case study, as it helps define exactly where mechanization becomes difficult—what makes laying a brick so different than, say, hammering a nail, such that the latter is almost completely mechanized and the former is almost completely manual?
[Question] How should we think about the decision relevance of models estimating p(doom)?
Upvoting for the multiple levels of summarization. Feels respectful of readers’ attention too.
If I take this claimed strategy as a hypothesis (that radical introspective speedup is possible and trainable), how might I falsify it? I ask because I can already feel myself wanting to believe it’s true and personally useful, which is an epistemic red flag. Bonus points if the falsification test isn’t high cost (e.g. I don’t have to try it for years).
In a slightly different direction than proof assistants, I’m reminded of Terry Tao’s recent experience trying out GPT-4 to play the role of collaborator:
As I noted at this MathOverflow answer (with a concurrence by Bill Thurston), one of the most intellectually satisfying experiences as a research mathematician is interacting at the blackboard with one or more human co-authors who are exactly on the same wavelength as oneself while working collaboratively on the same problem. I do look forward to the day that I can have a similar conversation with an AI attuned to my way of thinking, or (in the more distant future) talking to an attuned AI version of a human colleague when that human colleague is not available for whatever reason. (Though in the latter case there are some non-trivial issues regarding security, privacy, intellectual property, liability, etc. that would likely need to be resolved first before such public AI avatars could be safely deployed.)
I have experimented with prompting GPT-4 to play the role of precisely such a collaborator on a test problem, with the AI instructed to suggest techniques and directions rather than to directly attempt to solve the problem (which the current state-of-the-art LLMs are still quite terrible at). Thus far, the results have been only mildly promising; the AI collaborator certainly serves as an enthusiastic sounding board, and can sometimes suggest relevant references or potential things to try, though in most cases these are references and ideas that I was already aware of and could already evaluate, and were also mixed in with some less relevant citations and strategies. But I could see this style of prompting being useful for a more junior researcher, or someone such as myself exploring an area further from my own area of expertise. And there have been a few times now where this tool has suggested to me a concept that was relevant to the problem in a non-obvious fashion, even if it was not able to coherently state why it was in fact relevant. So while it certainly isn’t at the level of a genuinely competent collaborator yet, it does have potential to evolve into one as the technology improves (and is integrated with further tools, as I describe in my article).
Terry sounded more enthusiastic here:
I could feed GPT-4 the first few PDF pages of a recent math preprint and get it to generate a half-dozen intelligent questions that an expert attending a talk on the preprint could ask. I plan to use variants of such prompts to prepare my future presentations or to begin reading a technically complex paper. Initially, I labored to make the prompts as precise as possible, based on experience with programming or scripting languages. Eventually the best results came when I unlearned that caution and simply threw lots of raw text at the AI. …
I now routinely use GPT-4 to answer casual and vaguely phrased questions that I would previously have attempted with a carefully prepared search-engine query. I have asked it to suggest first drafts of complex documents I had to write.
Which isn’t to say that his experience has been all positive; the usual hallucination issues still crop up:
Current large language models (LLM) can often persuasively mimic correct expert response in a given knowledge domain (such as my own, research mathematics). But as is infamously known, the response often consists of nonsense when inspected closely. Both humans and AI need to develop skills to analyze this new type of text. The stylistic signals that I traditionally rely on to “smell out” a hopelessly incorrect math argument are of little use with LLM-generated mathematics. Only line-by-line reading can discern if there is any substance. Strangely, even nonsensical LLM-generated math often references relevant concepts. With effort, human experts can modify ideas that do not work as presented into a correct and original argument.
And going back to proof assistants:
One related direction where some progress is likely to be made in the near future is in using LLMs to semi-automate some aspects of formalizing a mathematical proof in a formal language such as Lean; see this recent talk by Jason Rute for a survey of the current state of the art. There are already some isolated examples in which a research paper is submitted in conjunction with a formally verified version of the proofs, and these new tools may make this practice more common. One could imagine journals offering an expedited refereeing process for such certified submissions in the near future, as the referee is freed to focus on other aspects of the paper such as exposition and impact.
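For readers who haven’t seen a proof assistant before, here’s a trivially small example of what a formally verified statement looks like in Lean 4 (my own toy illustration, not taken from Tao’s or Rute’s work):

```lean
-- A machine-checked proof that addition on the natural numbers is commutative,
-- discharged by appealing to the existing library lemma Nat.add_comm.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The point of a “certified submission” is that every step of the real proofs has been checked at this level of pedantry by the kernel, which is what would free a referee to focus on exposition and impact.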
Not an answer to your question, but Sarah Constantin’s essay seems relevant. As usual it’s hard not to just quote the entire piece:
But one thing I have noticed personally is that people have gotten intimidated by more formal and public kinds of online conversation. I know quite a few people who used to keep a “real blog” and have become afraid to touch it, preferring instead to chat on social media. It’s a weird kind of locus for perfectionism — nobody ever imagined that blogs were meant to be masterpieces. But I do see people fleeing towards more ephemeral, more stream-of-consciousness types of communication, or communication that involves no words at all (reblogging, image-sharing, etc.) There seems to be a fear of becoming too visible as a distinctive writing voice. …
What might be going on here?
Of course, there are pragmatic concerns about reputation and preserving anonymity. People don’t want their writing to be found by judgmental bosses or family members. But that’s always been true — and, at any rate, social networking sites are often less anonymous than forums and blogs.
It might be that people have become more afraid of trolls, or that trolling has gotten worse. Fear of being targeted by harassment or threats might make people less open and expressive. I’ve certainly heard many writers say that they’ve shut down a lot of their internet presence out of exhaustion or literal fear. And I’ve heard serious enough horror stories that I respect and sympathize with people who are on their guard.
But I don’t think that really explains why one would drift towards more ephemeral media. Why short-form instead of long-form? Why streaming feeds instead of searchable archives? Trolls are not known for their patience and rigor. Single tweets can attract storms of trolls. So troll-avoidance is not enough of an explanation, I think.
It’s almost as though the issue were accountability.
A blog is almost a perfect medium for personal accountability. It belongs to you, not your employer, and not the hivemind. The archives are easily searchable. The posts are permanently viewable. Everything embarrassing you’ve ever written is there. If there’s a comment section, people are free to come along and poke holes in your posts. This leaves people vulnerable in a certain way. Not just to trolls, but to critics.
You can preempt embarrassment by declaring that you’re doing something shitty on purpose. That puts you in a position of safety. You move to a space for trashy, casual, unedited talk, and you signal clearly that you don’t want to be taken seriously, in order to avoid looking pretentious and being deflated by criticism. I think that a lot of online mannerisms, like using all-lowercase punctuation, or using really self-deprecating language, or deeply nested meta-levels of meme irony, are ways of saying “I’m cool because I’m not putting myself out there where I can be judged. Only pompous idiots are so naive as to think their opinions are actually valuable.”
Here’s another angle on the same issue. If you earnestly, explicitly say what you think, in essay form, and if your writing attracts attention at all, you’ll attract swarms of earnest, bright-but-not-brilliant, mostly young white male, commenters, who want to share their opinions, because (perhaps naively) they think their contributions will be welcomed. It’s basically just “oh, are we playing a game? I wanna play too!” If you don’t want to play with them — maybe because you’re talking about a personal or highly technical topic and don’t value their input, maybe because your intention was just to talk to your friends and not the general public, whatever — you’ll find this style of interaction aversive. You’ll read it as sealioning. Or mansplaining. Or “well, actually”-ing. And you’ll gravitate to forms of writing and social media where it’s clear that debate is not welcome.
I think what’s going on with these kinds of terms is something like:
Author: “Hi! I just said a thing!”
Commenter: “Ooh cool, we’re playing the Discussion game! Can I join? Here’s my comment!” (Or, sometimes, “Ooh cool, we’re playing the Verbal Battle game! I wanna play! Here’s my retort!”)
Author: “Ew, no, I don’t want to play with you.”
There’s a bit of a race/gender/age/educational slant to the people playing the “commenter” role, probably because our society rewards some people more than others for playing the discussion game. Privileged people are more likely to assume that they’re automatically welcome wherever they show up, which is why others tend to get annoyed at them and want to avoid them.
Privileged people, in other words, are more likely to think they live in a high-trust society, where they can show up to strangers and be greeted as a potential new friend, where open discussion is an important priority, where they can trust and be trusted, since everybody is playing the “let’s discuss interesting things!” game.
The unfortunate reality is that most of the world doesn’t look like that high-trust society.
There’s more, do check it out.
At least 2 options to develop aligned AGI, in the context of this discussion:
Slow down capabilities and speed up alignment just enough that we solve alignment before developing AGI
e.g. the MTAIR project, in this paper, models the effect of a fire alarm for HLMI as “extra time” that speeds up safety research, leading to a higher chance that it succeeds before the HLMI timeline runs out (toy illustration of this intuition after this list)
this seems intuitively more feasible, hence more likely
Stop capabilities altogether—this is what you’re recommending in the OP
this seems intuitively far less feasible, hence ~unlikely (I interpret e.g. HarrisonDurland’s comment as elaborating on this intuition)
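To elaborate on the MTAIR bullet under #1, here’s a toy sketch of my own (an illustration of the underlying intuition, not the MTAIR model itself): treat the arrival of HLMI and of an alignment solution as random arrival times, and see how delaying HLMI by a few years of “extra time” moves the probability that alignment lands first.

```python
import random

def p_alignment_first(mean_hlmi=20.0, mean_alignment=40.0, extra_years=0.0, trials=100_000):
    """Toy race model: both arrival times are exponential; a fire alarm buys extra_years."""
    wins = 0
    for _ in range(trials):
        t_hlmi = random.expovariate(1.0 / mean_hlmi) + extra_years
        t_alignment = random.expovariate(1.0 / mean_alignment)
        wins += t_alignment < t_hlmi
    return wins / trials

for extra in (0, 5, 10, 20):
    print(extra, round(p_alignment_first(extra_years=extra), 3))
# The probability rises steadily with extra_years: that's the
# "extra time -> higher chance of success" effect the MTAIR bullet refers to.
```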
What I don’t yet understand is why you’re pushing for #2 over #1. You would probably be more persuasive if you addressed e.g. why my intuition that #1 is more feasible than #2 is wrong.
Edited to add: Matthijs Maas’ Strategic Perspectives on Transformative AI Governance: Introduction has this (oversimplified) mapping of strategic perspectives. I think you’d probably fall under (technical: pessimistic or very pessimistic; governance: very optimistic), while my sense is most LWers (me included) are either pessimistic or uncertain on both axes, so there’s that inferential gap to address in the OP.
Applied Divinity Studies wrote a related post that might be of interest: How Long Should you Take to Decide on a Career? They consider a modified version of the secretary problem that accounts for the 2 problematic assumptions you noted (binary payoff and ignorance of opportunity cost); you can play with the Colab notebook if you’re curious. Interestingly, varying the parameters tends to pull the optimal starting point earlier (contra my initial intuition), sometimes by a lot. The optimal solution is so parameter-dependent that it made me instinctively want to retreat to intuition, but of course ADS anticipates this and argues against it: “formal models may be bad, but at least they can disabuse us of even worse intuitive mental models”.
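For anyone who’d rather tinker than open the Colab, here’s a rough simulation of my own (not ADS’s notebook) of one such modification: with the classic binary payoff (you win only if you pick the single best candidate), the ~37% look-then-leap cutoff is optimal, but once the payoff is the chosen candidate’s actual value, much earlier cutoffs do better.

```python
import random

def avg_payoff(n=100, cutoff_frac=0.37, payoff="binary", trials=20_000):
    """Rough look-then-leap simulation: reject the first cutoff_frac of candidates,
    then take the first one better than everything seen so far (or the last one).

    payoff="binary": reward 1 only if the single best candidate was chosen
                     (the classic secretary problem).
    payoff="value":  reward is the chosen candidate's value (cardinal payoff).
    """
    cutoff = int(n * cutoff_frac)
    total = 0.0
    for _ in range(trials):
        candidates = [random.random() for _ in range(n)]
        best_seen = max(candidates[:cutoff]) if cutoff > 0 else float("-inf")
        chosen = candidates[-1]  # forced to settle for the last candidate if we never leap
        for value in candidates[cutoff:]:
            if value > best_seen:
                chosen = value
                break
        total += (chosen == max(candidates)) if payoff == "binary" else chosen
    return total / trials

# The ~37% cutoff shines for the binary payoff, but with a cardinal payoff the
# best cutoffs come much earlier.
for frac in (0.02, 0.05, 0.1, 0.2, 0.37, 0.5):
    print(f"{frac:.2f}  binary={avg_payoff(cutoff_frac=frac):.3f}  "
          f"value={avg_payoff(cutoff_frac=frac, payoff='value'):.3f}")
```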
Perhaps related is this classic post by Thrasymachus: https://www.lesswrong.com/posts/dC7mP5nSwvpL65Qu5/why-the-tails-come-apart. Scott Alexander uses that post as a jumping-off point to discuss a variety of topics, from ostensibly conflicting results in happiness research to the problem of figuring out a morality that can survive transhuman scenarios: https://www.lesswrong.com/posts/asmZvCPHcB4SkSCMW/the-tails-coming-apart-as-metaphor-for-life. (Or maybe I’m confused and this isn’t really related to what you’re talking about?)
This is obviously not a very realistic model, but it probably produces fairly realistic results. But again, this is an area for future improvement.
Curious from a modeling perspective: what improvements would be top of mind for you? Another way to phrase this: if someone else were to try modeling this, what aspects would you look at to tell if it’s an improvement or not?
What do you think about deep work (here’s a semi-arbitrarily-chosen explainer)? I suppose the Monday time block after the meeting lets you do that, but that’s maybe <10% of the workweek; you also did mention “If people want to focus deeply for a while, they can put on headphones”. That said, many of your points aren’t conducive to deep work (e.g. “If you need to be unblocked by someone, the fastest way is to just go to their desk and ask them in person” interrupts the other person’s deep work block, same with “use a real-time chat platform like Slack to communicate and add all team members to all channels relevant to the team”, and “if one of your teammates messages you directly, or if someone @ tags you, you want to respond basically immediately”).
I’ve always wondered about this, given my experience working at a few young-ish, high-growth, top-of-industry companies: I always hated the constant interruption but couldn’t deny how much faster everything moved. That said, I mostly did deep work well after office hours (so a workweek was basically 40 hours of getting interrupted to death followed by 20-30 hours of deep work and backlog-clearing), as did ~everyone else.