Retrospective on a quantitative productivity logging attempt

I have a productivity scale I’ve used to log productivity data for several years. It’s a subjective rating system from 1 to 10 that looks something like:

1. Can’t do anything, even reading. Worktime 0%.
2. Can force myself to read or work, but I can barely parse anything. Worktime 5%.
...
5. I don’t want to work and am easily distracted, but I’m getting stuff done. Worktime 50%.
6. Some mild distractions, but I can stick to a pomodoro timer. Worktime 60%.
...
10. Insane art-level hyperfocus. Worktime 100%.

At the end of each workday I would record how well I thought I’d done by this scale.

I’d been dissatisfied with this for a while: there was no way my brain was accurately tracking what percentage of time I was working, the descriptions are not well defined, and they don’t cleanly map to any particular level of productivity. I also can’t prevent internal mapping drift, where my standards slowly rise or fall, so that a day I mark as productivity=4 this week may be much more productive than a day I marked as 4 several years ago.

I’m invested in having good measurements, because I’ve been iterating on antidepressants and ADD meds for years, and data on which ones are improving which metrics (I also track mood, exercise, sleep, and social time) would be very useful for having a better life.

So I wrote a small Python script that tracked how much time I spent working at my job. Every time I took a break or went back to work, I’d mark it. If I noticed I’d started working during breaktime or zoning out during worktime, I’d ‘revert’ however many minutes I thought had passed to the other type. I also had ‘dead’ time for when I wasn’t getting anything done for reasons unrelated to my productivity, which I used to mark meetings or lunch breaks. At the end of the session, it would spit out a summary of how much time I’d spent working vs. resting. This was a much more accurate, quantitative way of measuring what I wanted, or so I thought. I used it for a month and a week before I stopped.
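For concreteness, here’s a minimal sketch of the idea (an illustrative reconstruction, not the exact script; the class and method names are made up):

```python
import time

class WorkTracker:
    """Illustrative sketch: keep running totals of seconds per category."""

    CATEGORIES = ("work", "break", "dead")

    def __init__(self):
        self.totals = dict.fromkeys(self.CATEGORIES, 0.0)
        self.current = "break"            # sessions start on break
        self.since = time.monotonic()

    def switch(self, category):
        """Close out the current interval and start a new one."""
        now = time.monotonic()
        self.totals[self.current] += now - self.since
        self.current = category
        self.since = now

    def revert(self, minutes):
        """Reclassify the last `minutes` of the current interval as the
        other type and switch to it, e.g. after noticing I'd been zoning
        out mid-'work'. Only meant for work/break intervals."""
        other = "break" if self.current == "work" else "work"
        now = time.monotonic()
        elapsed = now - self.since
        tail = min(minutes * 60, elapsed)  # can't revert past the interval start
        self.totals[self.current] += elapsed - tail
        self.totals[other] += tail
        self.current = other
        self.since = now

    def summary(self):
        """Print minutes per category and work time as a share of non-dead time."""
        self.switch(self.current)          # flush the open interval
        tracked = self.totals["work"] + self.totals["break"]
        pct = 100 * self.totals["work"] / tracked if tracked else 0.0
        for cat in self.CATEGORIES:
            print(f"{cat:>5}: {self.totals[cat] / 60:6.1f} min")
        print(f"worked {pct:.0f}% of non-dead time")
```

A session then amounts to a handful of calls (`t = WorkTracker()`, `t.switch("work")`, `t.switch("dead")` for lunch, `t.revert(10)` after noticing ten minutes of zoning out, `t.summary()` at the end), wired up to whatever keypresses or CLI commands are convenient.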

Here’s why I quit it.

  • As is frequently the case, using part of a system to monitor that system changes the system enough that the monitoring’s output is less useful. On bad attention days, I would be switching between work and break every two minutes. Logging made those context switches even costlier, and seeing how short my work periods were lowered my mood.

  • The varying difficulty of tasks threw off the measurement. Some days I’d have monotonous but easy work; other days I’d have one tricky, complicated thing that took a lot of brainpower and persistence. When I was using the subjective scale, I’d adjust my score for perceived task easiness. But when I was using the time tracker, a day spent on “going through the codebase and adjusting every method call to do a new thing” would be logged as productive, even though I could have done that even on a bad day.

  • My personal productivity scale ranges from 0% to 100% work time, and my ratings usually fall between 2 (subjective 5% work time) and 6 (60%). But my time-tracked work time usually hovered around 50%. I have an internal sense of “I haven’t done enough work, I really need to do something” that kicks in and makes me do work-like activities badly, slowly, and ineffectively. For example, I’ll read some documentation, staring at one sentence at a time, forcing myself to process it before moving on to the next one. That counts as work time by the tracker (I’m certainly not resting), and I can fill half my workday with it. At the end, the tracker will say I worked 50% of the time, but my subjective scale would say it was a 2. And I think my subjective scale is more correct.

I considered the idea of rating work periods every time I ended one, so that after spending an hour laboriously shoving sentences against my eyeballs, I’d tell the program “taking break now; also, that last work period was only 10% of a real work period”. But that goes right back to the problem of subjectively rated work, and adds to the already-painful overhead I described in the first point.

That was an interesting lesson to learn. In the future, when I’m trying to measure something, I’ll try to ask myself:

  • How will integrating monitoring into my system change my system?

  • Describe a day that the proposed monitoring system would give a high score, but that actually should have a low score. How common do I expect these days to be?

  • Describe a day that would get a low score but should have a high score, and so on.

  • How much overhead do I expect this to add?