Field: Software Engineering. Issue: what are the determinants of efficiency in getting stuff done that entails writing software.
At the Paris LW meetup, I described to Alexandros the particular subtopic about which I noticed confusion (including my own) - people call it the “10x thesis”. According to this, in a typical workgroup of software professionals (people paid to write code), there will be a ten to one ratio between productivities of the best and worst. According to a stronger version, these disparities are unrelated to experience.
The research in this area typically has the following setup: you get a group of N people in one room, and give them the same task to perform. Usually there is some experimental condition that you want to measure the effect of (for instance “using design patterns” vs “not using design patterns”), so you split them into subgroups accordingly. You then measure how long each takes to finish the task.
The “10x” result comes from interpreting the same kind of experimental data, but instead of looking at the effect of the experimental condition, you look at the variance itself. (Historically, this got noticed because it vexed researchers that the variance was almost always swamping out the effects of the experimental conditions.)
The issue that perplexes me is that taking a best-to-worst ratio in each group, in such cases, will give a measurement of variance that is composed of two things: first, how variable the time required to complete a task is intrinsically, and second, how different people in the relevant population (which is itself hard to define) differ in their effectiveness at completing tasks.
When I discussed this with Alexandros I brought up the “ideal experiment” I would want to use to measure the first component: take one person, give them a task, measure how long they take. Repeat N times.
However this experiment isn’t valid, because remembering how you solved the task the first time around saves a huge amount of time in successive attempts.
So my “ideal” experiment has to be amended: the same, but you wipe the programmer’s short-term memory each time, resetting them to the state they were in before the task. Now it is merely an impossible experiment, rather than an invalid one.
What surprised me was Alexandros’ next remark: “You can measure the same thing by giving the same task to N programmers, instead”.
This seems clearly wrong to me. There are two different probability distributions involved: one is within-subject, the other inter-subject. They do not necessarily have the same shape. What you measure when giving one task to N programmers is a mixture of the two, whose shape could be consistent with infinitely many hypotheses about the shapes of the underlying distributions.
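(To illustrate the identification problem, a minimal simulation sketch in R, with made-up parameters: two opposite generative stories that produce indistinguishable data when each programmer is observed only once.)

    # Story A: identical programmers, all variance within-subject.
    # Story B: deterministic programmers, all variance between subjects.
    set.seed(1)
    N <- 1000
    times_a <- rlnorm(N, meanlog = 3, sdlog = 0.5)  # story A
    skill   <- rlnorm(N, meanlog = 3, sdlog = 0.5)  # latent per-person speed
    times_b <- skill                                # story B: no within-person noise
    qqplot(times_a, times_b)  # quantiles coincide: one draw per person cannot tell A from B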
Thus, my question - what would be a good experimental setup and statistical tools to infer within-subject variation, which cannot be measured, from what we can measure?
Bonus question: am I totally confused about the matter?
1. Give one task to N programmers.
2. Give a different task to the same N programmers.
3. Repeat #2 several times.
4. Say to self “I’ll bet the same guy was a super-programmer on all of those tasks. He just is better at programming”.
5. Repeat #4 several times.
6. Analyze the data by multiple regression. Independent variables are programmer ids and task ids. Intrinsic variability of tasks falls out of the analysis as unexplained variance, but what you are really interested in is the relative performance of programmers over all tasks.
Bonus: I don’t think you are confused. But you seem to be assuming that the 10x thesis applies to specific programming tasks (like writing a parser, or a diagram editor, or a pretty-printer). But I think the hypothesis is stronger than that. Some people are better at all types of programming than are lesser mortals. So, you can smooth the noise by aggregating several tasks without losing the 10x signal.
I’d appreciate practical advice on how to do that in R/RStudio. I have data from an empirical study, loaded in RStudio as “29 observations of 8 variables”. My variables are “Who, T1, T2, T3 (etc)” where “Who” is programmer id and T1, etc. are the times taken for tasks 1 through 8.
What R command will give me a multiple regression of times over programmer id and task id?
[ETA: OK, I figure what I’ve got to do is make this a data frame with 3 variables, those being Who, TaskId, Time. Right? Maybe I can figure it out. Worst case, I’ll create a spreadsheet organized that way.]
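[A minimal sketch of that reshaping in base R, assuming the wide data frame is named z and the task columns are literally named T1 through T8; both names are placeholders:]

    # Convert wide (Who, T1..T8) to long (Who, TaskId, Time).
    long <- reshape(z,
                    varying   = paste0("T", 1:8),  # columns holding the times
                    v.names   = "Time",
                    timevar   = "TaskId",
                    times     = paste0("T", 1:8),
                    idvar     = "Who",
                    direction = "long")
    long$Who    <- factor(long$Who)     # treat ids as categories, not numbers
    long$TaskId <- factor(long$TaskId)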
[ETA2: I’ve done the above, but I don’t know how to interpret the results. What do you expect to see—in terms of coefficients of regression?]
I think you need one variable per programmer (value 0 or 1), one variable per task (value 0 or 1), and one variable for time taken to complete the task (real number). So, with 8 tasks and 29 programmers, you have 38 (= 29 + 8 + 1) variables, all but 3 of which are zero for each observation. And you have 232 observations.
Since you have 37 independent variables, you will have 37 regression coefficients (each presumably in units of hours) plus one additional parameter that applies to all observations. The results claim that you get a good estimate of the time required for programmer j to complete task k by adding together the j-th programmer coefficient, the k-th task coefficient and the extra parameter.
I’m not seeing why the ProgID and TaskID variables need to be booleans—or maybe R implicitly converts them to that. I’ve left them in symbolic form.
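(R does indeed convert them: factor predictors are expanded into 0/1 indicator columns behind the scenes. A quick way to inspect the expansion lm() uses, assuming the long-format frame z2 described below:)

    head(model.matrix(Time ~ Who + TaskId, data = z2))
    # Note: R drops one reference level per factor, so there are
    # (levels - 1) dummy columns per factor, plus a common intercept.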
Here is a subset of the PatMain data massaged (by hand!) into the format I thought would let me get a regression, and the regression results as a comment. I got this into a data frame variable named z2 and ran the commands:
    fit <- lm(Time ~ ., data = z2)  # regress Time on all remaining columns (Who, TaskId)
    summary(fit)
I suck at statistics so I may be talking nonsense here, and you’re welcome to check my results. The bottom line seems to be that the task coefficients do a much better job of predicting the completion time than do the programmer coefficients, with t-values suggesting you could easily not care about who performs the task, with the exception of programmer A6, who was the slowest of the lot.
(For instance, the coefficients say that the best prediction for the time taken is “40 minutes”, from which you subtract 25 minutes if the task is ST2. This isn’t a bad approximation, except for programmer A4, who takes 40 minutes on ST2. It’s not that A4 is slow—just slow on that task.)
You had asked for assistance and expertise on using R/RStudio. Unfortunately, I have never used them.
maybe R implicitly converts them
Judging from your results, I’m sure you are right.
The bottom line seems to be that the task coefficients do a much better job of predicting the completion time than do the programmer coefficients.
Yes, and if you added some additional tasks into the mix—tasks which took hours or days to complete—then programmer ID would seem to make even less difference. This points out the defect in my suggested data-analysis strategy. A better approach might have been to divide each time by the average time for the task (over all programmers), optionally also taking the log of that, and then exclude the task id as an independent variable. After all, the hypothesis is that Achilles is 10x as fast as the Tortoise, not that he takes ~30 minutes less time regardless of task size.
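(A sketch of that normalization in R, again using the hypothetical long-format frame z2:)

    # Divide each time by the mean time for its task, then regress the
    # log relative time on programmer id alone.
    z2$RelTime <- with(z2, Time / ave(Time, TaskId, FUN = mean))
    fit2 <- lm(log(RelTime) ~ Who, data = z2)
    summary(fit2)  # a consistent 10x performer should show a large Who coefficient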
you seem to be assuming that the 10x thesis applies to specific programming tasks (like writing a parser, or a diagram editor, or a pretty-printer)
Where is that implied in what I wrote above?
Some people are better at all types of programming than are lesser mortals
Are you making that claim, or suggesting that this is what the 10x thesis means?
(Dijkstra once claimed that “the use of COBOL cripples the mind”. If true, it would follow that someone who is a great COBOL programmer would be a poor programmer in other languages.)
Some people are better at all types of programming than are lesser mortals
Are you making that claim, or suggesting that this is what the 10x thesis means?
Both.
(Dijkstra once claimed that “the use of COBOL cripples the mind”. If true, it would follow that someone who is a great COBOL programmer would be a poor programmer in other languages.)
Amusingly, that does not follow. A great COBOL programmer completes his COBOL tasks in 1/10 the time of lesser folk, and hence becomes 1/10 as crippled.
you seem to be assuming …
Where is that implied in what I wrote above?
It appears that I somehow misinterpreted your point and thereby offended you. That was not my intention.
You begin by mentioning the problem of testing the 10x hypothesis, and then switch to the problem of trying to separate out “how variable the time required to complete a task is intrinsically”. That is an odd problem to focus on, and my intuition tells me that it is best approached by identifying that variance as a residual rather than by inventing ideal thought experiments that measure it directly. But if someone else has better ideas, that is great.
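(Concretely, on the hypothetical z2 frame, the residual standard error of the two-way fit estimates whatever variability survives after programmer and task effects are removed, i.e. the intrinsic component:)

    fit <- lm(Time ~ Who + TaskId, data = z2)
    sigma(fit)  # residual standard error: variation explained by
                # neither who did the task nor which task it was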
No offense taken. Just curious to know. I’m declaring Crocker’s Rules in this thread.
You are asserting “some people are better at all types of programming than are lesser mortals”. In that case I’d like to know what evidence convinced you, so that I can have a better understanding of “better at”.
Some of the empirical data I have looked at contradicted your hypothesis “the same guy was a super-programmer on all of those tasks”. In that study, some people finished first on one task and last on some other task. (Prechelt’s “PatMain” study.)
the problem of testing the 10x hypothesis
One of my questions is, “is the 10x claim even a testable hypothesis?”. In other words, do we know what the world would look like if it was false?
When I brought this up in one venue, people asked me “well, have you seen any evidence suggesting that all people code at the same rate?” This is dumb. Just because there exists one alternate hypothesis which is obviously false does not immediately confirm the hypothesis being tested.
Rather, the question is “out of the space of possible hypotheses about how people’s rates of output when programming differ, how do we know that the best is the one which models each individual as represented by a single numerical value, such that the typical ratio between highest and lowest is one order of magnitude”.
This space includes hypotheses where rate of output is mostly explained by experience, which appear facially plausible—yet many versions of the 10x thesis explicitly discard these.
My reasons for believing the 10x hypothesis are mostly anecdotal. I’ve talked to people who observed Knuth and Harlan Mills in action. I know of the kinds of things accomplished more recently by Torvalds and Hudak. Plus, I have myself observed differences of at least 5x in industrial and college classwork environments.
I looked at the PatMain study. I’m not sure that the tasks there are large enough (roughly 3 hours) to test the 10x hypothesis. Worse, they are program maintenance tasks, and they exclude testing and debugging. My impression is that top programmers achieve their productivity mostly by being better at the design and debugging tasks. That is, they design so that they need less code, and they code so they need dramatically less debugging. So I wouldn’t expect PatMain data to back up the 10x hypothesis.
My reasons for believing the 10x hypothesis are mostly anecdotal.
Do you see it as a testable hypothesis though, as opposed to an applause light calling out the programming profession as one where remarkable individuals are to be found?
I’m not sure that the tasks there are large enough … they are program maintenance tasks
You said earlier that a great programmer is good at all types of programming tasks, and program maintenance certainly is a programming task. Why the reversal?
Anyway, suppose you’re correct and there are some experimental conditions which make for a poor test of 10x. Then we need to list all such exclusion criteria prior to the experiment, not come up with them a posteriori—or we’ll be suspected of excluding the experimental results we don’t like.
My impression is that top programmers achieve their productivity mostly by being better at the design and debugging tasks … they design so that they need less code
Now this sounds as if you’re defining “productivity” in such a way that it has less to do with “rate of output”. You’ve just ruled out, a priori, any experimental setup in which you hand programmers a fixed design and measure the time taken to implement it, for instance.
At this point ISTM we have still made surprisingly little headway on the two questions at hand:
what kind of claim is the 10x claim—is it a testable hypothesis, and if not, how do we turn it into one
what kind of experimental setup will give us a way to check whether 10x is indeed favored among credible alternatives
I believe it can be turned into one. For example, as stated, it doesn’t take into account sample or population size. The reductio (N=2) is that it seems to claim the faster of two programmers will be 10x as fast as the slower. There is also a need to clarify and delimit what is meant by task.
You said earlier that a great programmer is good at all types of programming tasks, and program maintenance certainly is a programming task. Why the reversal?
Because you and I meant different things by task. (I meant different types of systems—compilers vs financial vs telephone switching systems for example.) Typing and attending meetings are also programming tasks, but I wouldn’t select them out for measurement and exclude other, more significant tasks when trying to test the 10x hypothesis.
Now this sounds as if you’re defining “productivity” in such a way that it has less to do with “rate of output”. You’ve just ruled out, a priori, any experimental setup in which you hand programmers a fixed design and measure the time taken to implement it, for instance.
Yes, I have. And I think we are wasting time here. It is easy to refute a scientific hypothesis by uncharitably misinterpreting it so that it cannot possibly be true. So I’m sure you will succeed in doing so without my help.
It is easy to refute a scientific hypothesis by uncharitably misinterpreting it so that it cannot possibly be true.
Where specifically have I done that? (Is it the “applause light” part? Do you think it obviously false that the thesis serves as an applause light?)
And I think we are wasting time here.
Are you tapping out? This is frustrating as hell. Crocker’s Rules, dammit—feel free to call me an idiot, but please point out where I’m being one!
Without outside help I can certainly go on doubting—holding off on believing what others seem to believe. But I want something more—I want to form positive knowledge. (As one fictional rationalist would have it, “My bottom line is not yet written. I will figure out how to test the magical strength of Muggleborns, and the magical strength of purebloods. If my tests tell me that Muggleborns are weaker, I will believe they are weaker. If my tests tell me that Muggleborns are stronger, I will believe they are stronger. Knowing this and other truths, I will gain some measure of power.”)
For example, as stated, it doesn’t take into account sample or population size.
Yeah, good catch. The 10x ratio is supposed to hold for workgroup-sized samples (10 to 20). What the source population is, that’s less clearly defined. A 1983 quote from Mills refers to “programmers certified by their industrial position and pay”, and we could go with that: anyone who gets full time or better compensation for writing code and whose job description says “programmer” or a variation thereof.
We can add “how large is the programmer population” to our list of questions. A quick search turns up an estimate from Watts Humphrey of 3 million programmers in the US about ten years ago.
So let’s assume those parameters hold—population size of 3M and sample size of 10. Do we now have a testable hypothesis?
What is the math for finding out what distribution of “productivity” in the overall population gives rise to a typical 10x best-to-worst ratio when you take samples of that size? Is that even a useful line of inquiry?
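(One way to explore that numerically: a simulation sketch with a made-up lognormal spread parameter, not an empirical estimate.)

    # For an assumed productivity distribution, what best-to-worst ratio
    # does a sample of 10 typically show?
    set.seed(1)
    ratios <- replicate(10000, {
      s <- rlnorm(10, meanlog = 0, sdlog = 0.8)  # sdlog is a free knob
      max(s) / min(s)
    })
    median(ratios)  # tune sdlog until this is about 10 to back out the
                    # population spread the 10x claim implicitly asserts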
The misinterpretation that stood out to me was:

Now this sounds as if you’re defining “productivity” in such a way that it has less to do with “rate of output”. You’ve just ruled out, a priori, any experimental setup in which you hand programmers a fixed design and measure the time taken to implement it, for instance.
I’m not sure whether you meant “design” to refer to e.g. internal API or overall program behavior, but they’re both relevant in the same way:
The important metric of “rate of output” is how fast a programmer can solve real-world problems. Not how fast they can write lines of code—LOC is a cost, not an output. Design is not a constant. If Alice implements feature X using 1 day and 100 LOC, and Bob implements X using 10 days and 500 LOC, then Alice was 10x as productive as Bob, and she achieved that productivity by writing less code.
I would also expect that even having a fixed specification of what the program should do would somewhat compress the range of observed productivities compared to what actually happens in the wild. Because translating a problem into a desired program behavior is itself part of the task of programming, and is one of the opportunities for good programmers to distinguish themselves by finding a more efficient design. Although it’s harder to design an experiment to test this part of the hypothesis.
Yes.

A great COBOL programmer completes his COBOL tasks in 1/10 the time of lesser folk, and hence becomes 1/10 as crippled.
That has unfortunately not been my experience with similarly crippling languages. A great programmer finishes their crippling-language tasks much quicker than a poor programmer… and their reward is lots and lots more tasks in the crippling language. :-\
That has unfortunately not been my experience with similarly crippling languages. A great programmer finishes their crippling-language tasks much quicker than a poor programmer… and their reward is lots and lots more tasks in the crippling language
I’ve seen this too—if something sucks it can be a good idea to make sure you appear to suck at it!

If you do the job badly enough...
If being a good or bad programmer is an intrinsic quality that is independent of the task, then you could just give the same subject different tasks to solve. So you take N programmers, and give them all K tasks to solve. Then you can determine the mean difficulty of each task as well as the mean quality of each programmer. Given that, you should be able to infer the variance.
There are some details to be worked out, for example, is task difficulty multiplicative or additive? I.e. if task A is 5 times as hard as task B, will the standard deviation also be 5 times as large? But that can be solved with enough data and proper prior probabilities of different models.
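(A sketch of how the additive and multiplicative stories could be compared, again on the hypothetical z2 frame:)

    # Additive: a task shifts completion time by a fixed amount.
    fit_add  <- lm(Time ~ Who + TaskId, data = z2)
    # Multiplicative: a task scales completion time by a factor,
    # i.e. additive on the log scale.
    fit_mult <- lm(log(Time) ~ Who + TaskId, data = z2)
    # If residual spread grows with fitted time in fit_add but stays
    # roughly constant in fit_mult, the multiplicative story fits better.
    plot(fitted(fit_add),  resid(fit_add))
    plot(fitted(fit_mult), resid(fit_mult))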