DARPA Digital Tutor: Four Months to Total Technical Expertise?

DARPA spent a few million dollars around 2009 to create the world’s best digital tutoring system for IT workers in the Navy. I am going to explain their results, the system itself, possible limitations, and where to go from here.

It is a truth universally acknowledged that a single nerd having read Ender’s Game must be in want of the Fantasy Game. The great draw of the Fantasy Game is that the game changes with the player, reflecting the needs of the learner and growing dynamically with him or her. This dream of the student is best realized in the world of tutoring, which, while not as fun, is known to be very, very effective. Individualized instruction can push students to the 98th percentile compared to non-tutored students. DARPA poked at this idea with their Digital Tutor, trying to answer this question: How close to the expertise and knowledge base of well-experienced IT experts can we get new recruits in 16 weeks using a digital tutoring system?

I will state the results upfront, but before I do, I want to do two things. First, pause to note the audacity of the project. Some project manager thought, “I bet we can design a training system that is as good as five years of on-the-job experience.” This is astoundingly ambitious. I love it! Second, a few caveats. Caveat 1) Don’t be confused: technical training is not the same as education. The goals of education are not merely to teach technical skills like reading, writing, and arithmetic, and usefully measuring things like inculturation, citizenship, moral uprightness, and social mores is not yet something any system can do, let alone a digital one. Caveat 2) Online classes have notoriously high attrition rates, drop rates, and no-shows. Caveat 3) Going in, we should not expect the digital tutor to be as good as a human tutor; a human tutor can likely catch nuances that a digital tutor, no matter how good, cannot. Caveat 4) Language processing technology, chatbots, and AI systems are significantly better in 2020 than they were in 2009, so we should be forgiving if the DARPA IT program is not as good as it would be if the experiment were rerun today.

All these caveats, I think, should give us reason to adjust our mental score of the Digital Tutor a few clicks upward and give it some credit. However, the charitable read I started with when reading the paper turned out to be unnecessary. The Digital Tutor students outperformed both traditionally taught students and field experts in solving IT problems on the final assessment. They did not merely meet the goal of being as good after 16 weeks as experts in the field; they actually outperformed them. This is a ridiculously positive outcome, and we need to look closely to see what parts of this story are believable, make some conjectures about why this happened, and place some bets about whether it will replicate.

The Digital Tutor Experience

We will start with the Digital Tutor student experience. This will give us the context we need to understand the results.

Students (cadets?) were on the same campus, in classrooms, working at computers that ran the Digital Tutor program. A uniformed Naval officer proctored each day of their 16-week course. The last ‘period’ of the day was a study hall with occasional hands-on practice sessions led by the Naval officer. This set-up is important for a few reasons, in my opinion. The students share the experience of working on IT training, and the added accountability of a proctor keeps everyone on task. This social aspect is very important and powerful compared to the dissipation experienced by the lone laborer at home on the computer. This social structure completely counteracts caveat 2 above: the Digital Tutor is embedded in a social world where the students are not given the same level of freedom to fail that a Coursera class offers.

Unlike many learning systems, the Digital Tutor had no option to finish early. Students had on average one week to complete a module, and the module would continue to teach, challenge, and assess students who reached the first benchmark. “Fast-paced learners who reached targeted levels of learning early were given more difficult problems, problems that dealt with related subtopics that were not otherwise presented in the time available, problems calling for higher levels of understanding and abstraction, or challenge problems with minimal (if any) tutorial assistance.” Thus the ceiling was very high and kept the high flyers engaged.
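To make that pacing concrete, here is a minimal, self-contained Python sketch of the escalation idea; the pool names, the 0–1 mastery scale, and the benchmark value are my own assumptions, not details from the report:

```python
# Minimal sketch of the "no finishing early" pacing described above.
# Pool names, the 0-1 mastery scale, and the benchmark value are assumptions.

def next_pool(mastery: float, benchmark: float, pools: dict) -> str:
    """Pick which pool the next problem comes from; never 'done' while module time remains."""
    if mastery < benchmark:
        return "core"                                   # keep teaching, challenging, assessing
    for pool in ("harder", "related_subtopics", "challenge_minimal_hints"):
        if pools.get(pool):                             # escalate for fast-paced learners
            return pool
    return "review"                                     # fall back to review, not an early finish

# Example: a student already past the benchmark gets harder problems instead of stopping.
pools = {"core": [], "harder": ["dns_forwarding"], "related_subtopics": ["vlan_trunking"]}
print(next_pool(mastery=0.92, benchmark=0.85, pools=pools))  # -> "harder"
```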

As for pedagogical method: “[The Digital Tutor] presents conceptual material followed by problems that apply the concepts and are intended to be as authentic, comprehensive, and epiphanic as those obtained from years of IT experience in the Fleet. Once the learner demonstrates sufficient understanding of the material presented and can explain and apply it successfully, the Digital Tutor advances either vertically, to the next higher level of conceptual abstraction in the topic area, or horizontally, to new but related topic areas.” Assessment throughout is done by the Conversation Module in the DT, which offers hints, asks leading questions, and requests clarifications of the student’s reasoning. If there is a problem or hang-up, the Digital Tutor will summon the human proctor to come help (the paper gives no indication of how often this happened).
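To picture that loop, here is a rough Python sketch of the advance-or-probe decision; the function, its inputs, and the “stuck after three hints” heuristic are invented for illustration and are not taken from the paper:

```python
# Rough sketch of the assess -> probe / advance-vertically / advance-horizontally / escalate loop.
# All names and the three-hint escalation heuristic are assumptions for illustration.

def tutor_step(understood: bool, explained: bool, hints_used: int,
               topic_fully_abstracted: bool) -> str:
    """Decide the tutor's next move after assessing one problem."""
    if not (understood and explained):
        if hints_used >= 3:
            return "summon_proctor"       # hand off to the human proctor
        return "hint_and_probe"           # Conversation Module: hints, leading questions, clarifications
    if not topic_fully_abstracted:
        return "advance_vertical"         # next higher level of conceptual abstraction
    return "advance_horizontal"           # new but related topic area

# Example: a student who can apply the concept but not yet explain it gets probed further.
print(tutor_step(understood=True, explained=False, hints_used=1,
                 topic_fully_abstracted=False))  # -> "hint_and_probe"
```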

At the end of the 16 weeks, the students trained by the Digital Tutor squared off in a three-way, two-week assessment against a group trained in a 35-week classroom program (the ITTC group below) and against experienced Fleet technicians. Those trained by the Digital Tutor significantly outperformed both groups.

At least four patterns were repeated across the different performance measures:

  • With the exception of the Security exercise, Digital Tutor participants outperformed the Fleet and ITTC participants on all other tests.

  • Differences between Fleet and ITTC participants were generally smaller and neither consistently positive nor negative.

  • On the Troubleshooting exercises, which closely resemble Navy duty station work, Digital Tutor teams substantially outscored Fleet ITs and ITTC graduates, with higher scores at every difficulty level, less harm to the system, and fewer unnecessary steps.

  • In individual tests of IT knowledge, Digital Tutor graduates also substantially outscored Fleet ITs and ITTC graduates.

How did they build the Digital Tutor?

This process was long, arduous, and expensive. First, they recruited subject-area experts and had them conduct example tutoring sessions. They took the best tutors from among those experts and had 24 of them tutor students one-on-one in their sub-domains of expertise. Those students essentially received a one-on-one 16-week course. Those sessions were all recorded and served as the template for the Digital Tutor.

A content author (usually a tutor) and content engineer would work together to create the module for each sub-domain while a course architect oversaw the whole course and made sure everything fit together.

The Digital Tutor itself has four layers: 1) a framework for the IT ontologies and feature extraction, 2) an Inference Engine to judge the student’s understanding and misunderstanding, 3) an Instruction Engine to decide what topics/problems to serve up next, together with a Conversation Module which uses natural language to prod the student to think through the problem and to create tests of their understanding, and 4) a Recommender to call in a human tutor when necessary.
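To keep those pieces straight, here is a skeletal Python sketch of how the layers might pass data to one another; only the division of labor comes from the paper, and every interface, method, and threshold below is an assumption:

```python
# Skeletal sketch of the four layers described above, with invented interfaces.
# Only the division of labor comes from the paper; the methods and thresholds are assumptions.

class OntologyLayer:                      # 1) IT ontologies + feature extraction
    def extract_features(self, student_action: str) -> dict:
        return {"action": student_action, "matched_concepts": []}

class InferenceEngine:                    # 2) judges the student's understanding/misunderstanding
    def assess(self, features: dict) -> float:
        return 1.0 if features["matched_concepts"] else 0.2

class InstructionEngine:                  # 3) picks the next topic/problem; the Conversation Module
    def next_move(self, understanding: float) -> str:   #    prods the student in natural language
        return "advance" if understanding > 0.8 else "converse"

class Recommender:                        # 4) calls in a human tutor/proctor when needed
    def __init__(self, stuck_threshold: int = 3):
        self.stuck_threshold, self.stuck_count = stuck_threshold, 0
    def should_escalate(self, move: str) -> bool:
        self.stuck_count = self.stuck_count + 1 if move == "converse" else 0
        return self.stuck_count >= self.stuck_threshold
```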

I would like to know a lot more about this, so if anyone could point me in a good direction to learn how to do some basic Knowledge Engineering efficiently, that would be much appreciated.

So in terms of personnel we are talking 24 tutors, about 6 content authors, a team of AI engineers, several proctors throughout the course, and maybe a few extra people to set up the virtual and physical problem configurations, plus several iterations through each module with test cohorts. Given this expense and effort, it will not be an easy task to replicate their results in a separate domain, or even in the same one. One note in the paper that I found obscure is the claim that the Digital Tutor “is, at present, expensive to use for instruction.” What does this mean? Once the thing is built, besides the tutors/teachers (which you would need for any course of study), what makes it expensive at present? I’m definitely confused here.

Digging into the results

The assessment of the three groups in seven categories showed the superiority of the Digital Tutor in everything but Security. For whatever reason, they could not get a tutor to be part of the development of the Security module, so that module was mostly lecture. Interestingly, though, if we assume all else equal, this hole in the Digital Tutor program serves to demonstrate the effectiveness of the program design via negativa.

In any case, the breakdown of performance in the seven categories is, I think, pretty well captured by the Troubleshooting assessment.

“Digital Tutor teams attempted a total of 140 problems and successfully solved 104 of them (74%), with an average score of 3.78 (1.91). Fleet teams attempted 100 problems and successfully solved 52 (52%) of them, with an average score of 2.00 (2.26). ITTC teams attempted 87 problems and successfully solved 33 (38%) of them, with an average score of 1.41 (2.09).”
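For easier eyeballing, here are the quoted numbers tabulated in a few lines of Python; reading the parenthesized values as standard deviations of the scores is my assumption:

```python
# The Troubleshooting numbers quoted above, tabulated for comparison.
# Treating the parenthesized values as standard deviations is an assumption.
groups = {
    "Digital Tutor": {"attempted": 140, "solved": 104, "mean_score": 3.78, "sd": 1.91},
    "Fleet":         {"attempted": 100, "solved": 52,  "mean_score": 2.00, "sd": 2.26},
    "ITTC":          {"attempted": 87,  "solved": 33,  "mean_score": 1.41, "sd": 2.09},
}
for name, g in groups.items():
    print(f"{name:13s} solved {g['solved'] / g['attempted']:.0%}, mean score {g['mean_score']:.2f}")
# Digital Tutor teams solved problems at roughly 1.4x the Fleet rate and about 2x the ITTC rate.
```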

Similar effects hold across the board, but that is not exactly what interests me, because I want to know about question type. Indeed, what makes this study so eye-catching is that it is NOT a spaced-repetition-is-the-answer-to-life paper in disguise (yes, spaced repetition is the bomb, and I contend that MOST of what we want to accomplish in education can’t be reinforced by it, but oh hell, is it good for language acquisition!).

The program required students to employ complicated concepts and procedures, more than could be captured by a spaced-repetition program. “Exercises in each IT subarea evolve from a few minutes and a few steps to open-ended 30–40 minute problems.” (I wonder what the time-required distribution is for real-life IT problems faced by experts?) So this is really impressive! The assessment asked students and experienced Fleet techs alike to solve large, real problems of the kind that come up aboard ships, and the Digital Tutor students succeeded on that score. We should be getting really excited about this! Remember, in 16 weeks these folks were made into experts.

Well, let’s consider another possibility: what if IT system and network maintenance is a skill set that is, frankly, not that hard? You can do this for IT, but not for Captains of a ship, Admirals of a fleet, or Program Managers in DARPA. Running with this argument a little further, perhaps the abstract reasoning and conceptual problem solving in IT are related to the lower-level, spaced-repetition-friendly skills in a way that they are not for administrators, historians, and writers. The inferential leap, in other words, from the basics to expert X-ray vision of problems is smaller in IT than in other professional settings. Perhaps. And I think this argument has merit. But I also think this is one of those examples of raising the bar for what “true expertise” is, because the old bar has been reached. To me, it is totally fair to say that some IT problems do require creative thinking and a fully functional understanding of a system to solve. That the students of the Digital Tutor (and its human adjuncts) beat the experts on taking fewer unnecessary steps and on avoiding causing more problems than they fixed is strong evidence that this program opened the door to new horizons.

Where to go from here

From here I would like to learn more about how to create AI systems like this and try it out with the first chapter of AoPS Geometry. I could test this in a school context against a control group and see what happens.

Eventually, I would like to see if something like this could work for AP European History, and for research and writing. I want someone to start pushing these program strategies into the social sciences, humanities, and other soft fields, like politics (elected members of government could take an intensive course so they don’t screw everything up immediately).

Another thing I would be interested to see is a better platform for making these AI systems. Since creating something of this sort can only be done by expert programmers, content-knowledge experts can’t gather together to create their own Digital Tutors. This is a huge bottleneck. If we could put together a software suite that made building a fully operational knowledge environment even moderately easier than creating one from scratch, that could have an outsized effect on education within a few years.
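As a gesture toward what such a suite might expose to non-programmers, here is a hypothetical sketch of a declarative module format a content expert could fill in while the platform supplies the inference, instruction, and conversation machinery; every field name is my invention, and the geometry example simply reflects where I would start:

```python
# Hypothetical authoring format: the content expert declares concepts, problems, and hints,
# and the platform supplies the tutoring machinery. All field names are invented.
module = {
    "domain": "geometry.angles",
    "benchmark": 0.85,  # mastery level before the module starts escalating difficulty
    "concepts": [
        {"id": "vertical_angles", "prereqs": [],
         "explanation": "Opposite angles formed by two intersecting lines are equal."},
        {"id": "angle_chasing", "prereqs": ["vertical_angles"],
         "explanation": "Combine vertical, supplementary, and triangle-sum facts to find unknowns."},
    ],
    "problems": [
        {"concept": "angle_chasing", "difficulty": 2,
         "prompt": "Two lines intersect and one angle measures 40 degrees. Find the other three.",
         "solution_steps": ["vertical angle = 40", "each adjacent angle = 180 - 40 = 140"],
         "hints": ["Which pair of angles is vertical?", "What do adjacent angles on a line sum to?"]},
    ],
}
```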