§1 Introduction
Mainstream educational software is, by structure, a delivery device for questions. A student receives an item, selects or types an answer, receives a correctness signal, and moves on. The platform stores the final response and, in the better implementations, an aggregate of per-standard performance over time. What is not stored — what is, in fact, structurally invisible — is the act of thinking that produced the answer. This omission is not a UX shortfall. It reflects a deep commitment, inherited from psychometric tradition[14], to the response as the unit of evidence.
Cognitive science has, for fifty years, argued the opposite case. The verbal-report tradition initiated by Ericsson and Simon[15, 16] established that protocols collected during task performance are valid windows on the cognitive processes that produced the response, when appropriately elicited and analyzed. The literature on misconceptions in physics[17, 18], mathematics[19], and reading[20] has demonstrated repeatedly that two students producing the same wrong answer can be running fundamentally different cognitive programs, and that the appropriate remediation for each is different. A multiple-choice item collapses these into one bit; a free-response item, even a short one, preserves the structure of the reasoning[10, 11, 21].
Luna is an attempt to take this argument seriously at the level of software architecture. Rather than treating a curriculum as a queue of items, we treat it as a graph of thinking steps: atomic exercises, each describing one move of reasoning the student must perform, threaded into chains that the student walks. At every step, the student's response (typed, dictated, or drawn) is captured, parsed for the structure of the move attempted, written into a per-student knowledge network that we describe as a zettelkasten in Luhmann's sense[22], and made available both to the student's next thought and to the teacher's dashboard in real time.
This paper sets out the evidence base for that design. We do not present a randomized controlled trial of Luna here; that work is ongoing and will be reported separately. Instead we describe the convergence of cognitive-science findings that the design composes, and the predictions those findings make about what should happen when their mechanisms are integrated in software at the level of the thinking step.
§2 The Cognitive Foundation
§2.1 Retrieval practice and the testing effect
The single most replicated finding in the science of memory is that the act of retrieving information from memory strengthens that memory far more than restudy of the same material[1, 2, 23]. Roediger and Karpicke's 2006 series demonstrated effect sizes in the d = 0.7–1.0 range for retrieval over rereading at one-week delays, with the gap widening at longer delays[1]. Subsequent meta-analyses across 159 studies and roughly 17,000 participants confirmed the effect across domains, ages, and item formats[24]. The relevant design consequence is unambiguous: every interaction in which a student is asked to produce rather than recognize is a learning event, not merely an assessment event.
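For readers less familiar with the effect-size notation used here and in later sections, the following is the conventional definition of Cohen's d as a standardized mean difference. It is given only as a reading aid: the cited studies may report variants (for example, bias-corrected estimates), and the symbols below are generic rather than taken from any one of them.

```latex
% Cohen's d for two independent groups (e.g., retrieval vs. restudy).
% \bar{x}_1, \bar{x}_2 : group means on the delayed test
% s_1, s_2             : group standard deviations
% n_1, n_2             : group sizes
\[
  d = \frac{\bar{x}_1 - \bar{x}_2}{s_p},
  \qquad
  s_p = \sqrt{\frac{(n_1 - 1)\,s_1^{2} + (n_2 - 1)\,s_2^{2}}{n_1 + n_2 - 2}}
\]
```

On this reading, d = 0.7 means that the mean of the retrieval group lies 0.7 pooled standard deviations above the mean of the comparison group.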
§2.2 Desirable difficulty
Bjork and Bjork[3] distinguish learning (a durable change in the long-term substrate) from performance (the transient retrieval observable in the moment), and demonstrate that conditions which depress performance frequently improve learning. Soderstrom and Bjork[25] review the evidence in detail. The practical consequence is that the right level of struggle is neither the easiest path through material nor the hardest, but a calibrated level at which retrieval is effortful but generally successful — what Vygotsky earlier characterized as the zone of proximal development.
§2.3 Spacing and interleaving
Distributed practice — encountering the same material across separated sessions rather than massed together — produces large, robust gains in long-term retention[4, 26]. Interleaving — alternating between related problem types within a session — produces additional gains, particularly for transfer and discrimination[5]. Both findings are antagonistic to the unit-block organization of most textbooks and most ed-tech, and aligned with a thought-step architecture in which atoms can be sequenced flexibly across topics.
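As an illustration of what flexible sequencing across topics could look like, the sketch below interleaves per-topic queues of atoms in round-robin order. Round-robin is an assumption chosen for simplicity, not a description of Luna's scheduler, and the function name and atom identifiers are hypothetical.

```python
from itertools import zip_longest


def interleave(topic_queues):
    """Return atom ids alternating across topics (round-robin).

    topic_queues maps a topic name to an ordered list of atom ids.
    Topics that run out early are skipped in later rounds.
    """
    missing = object()  # sentinel marking an exhausted queue
    ordered = []
    for round_ in zip_longest(*topic_queues.values(), fillvalue=missing):
        ordered.extend(atom for atom in round_ if atom is not missing)
    return ordered


# Hypothetical example: two related problem types within one session.
session = interleave({
    "fractions": ["f1", "f2", "f3"],
    "ratios": ["r1", "r2"],
})
```

A blocked ordering would present all fractions atoms before any ratios atom; the interleaved ordering forces the student to discriminate between problem types at each step.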
§2.4 Generation and self-explanation
Slamecka and Graf's generation effect[27] establishes that information a learner produces is retained better than information presented to them. Chi et al.'s work on self-explanation[28] shows that students who explain worked examples to themselves, even when the explanations are partly wrong, outperform students who merely read the examples. Both effects are mechanically engaged by Luna's requirement that the student produce a thought at each atom rather than select from a menu.
§3 From Final Answer to Reasoning Trace
Multiple-choice items dominate educational assessment for two reasons: they are cheap to score, and they support strong psychometric properties under classical test theory. They also have an underexplored cost. Birenbaum and Tatsuoka[10] compared open-ended and multiple-choice forms of the same arithmetic items and found that the open-ended forms detected misconceptions undetectable in the multiple-choice forms; further, students sometimes selected correct multiple-choice answers via incorrect reasoning. Martinez[11] and Bennett[21] generalize the finding: the response format constrains the cognition that can be evidenced.
Free-response and constructed-response items, by contrast, leave a trace. Kang, McDermott, and Roediger[29] show that constructed-response testing produces larger retention benefits than multiple-choice testing, even when both are followed by corrective feedback. The verbal-protocol tradition pushes further: collected during rather than after task performance, even brief written responses contain partial-product information that allows the reasoning process to be reconstructed[15, 16].
Luna's thought-step atoms are designed to occupy this niche. Each atom requires production — a sentence, an equation, a phrase, a drawn line — and the capture is timestamped at sub-second resolution. The result is not a graded essay but a structured trace, parseable for the specific moves the atom describes (e.g., identify the inverse operation, name the assumption, state the counterfactual) and for the misconceptions in the documented taxonomy that the response is consistent with.
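A structured trace of this kind can be pictured as a small record per captured response. The sketch below is illustrative only: the class, field names, and modality labels are assumptions introduced here, not Luna's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone

MODALITIES = {"typed", "dictated", "drawn"}


@dataclass(frozen=True)
class CapturedThought:
    """One response at one atom (hypothetical field names)."""
    student_id: str
    atom_id: str
    modality: str
    content: str                 # raw text, or a transcription of speech/drawing
    captured_at: datetime        # timezone-aware; microsecond precision
    # Filled in by a later parsing stage; empty at capture time.
    moves_detected: tuple[str, ...] = ()
    misconceptions_consistent: tuple[str, ...] = ()


def capture(student_id: str, atom_id: str, modality: str, content: str) -> CapturedThought:
    """Timestamp a response at the moment of capture, before any parsing."""
    if modality not in MODALITIES:
        raise ValueError(f"unsupported modality: {modality}")
    return CapturedThought(
        student_id=student_id,
        atom_id=atom_id,
        modality=modality,
        content=content,
        captured_at=datetime.now(timezone.utc),
    )
```

Keeping the parsed fields separate from the raw content means the original response stays inspectable even if a later parse is revised.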
§4 Atomization: the Thinking-Step as Unit
Cognitive load theory[30, 31] makes a strong prediction: instruction that exceeds working-memory capacity at any moment produces little learning, however well-designed at the unit level. Sweller's formulation has direct architectural implications: each interaction must isolate one source of difficulty, allowing germane load to be allocated to the move being learned rather than wasted on extraneous integration[31].
diSessa's knowledge-in-pieces framework[17, 18] further suggests that what students bring to instruction is not a single mental model but a loose collection of phenomenological primitives: small, partly correct intuitions that must be elicited, named, and reorganized. Treating curriculum as a tree of large "skills" obscures these pieces. Treating it as a graph of thinking steps surfaces them.
Luna's atoms are therefore narrow by construction: each describes one move, addresses one piece of knowledge, and admits one canonical family of misconceptions. A typical chain — the unit a student walks in a single session — comprises five to twelve such atoms, with the student's response at atom n serving as the prompt's context at atom n+1. The chain is not a quiz; it is a scaffolded sequence of reasoning, and the schema places atoms within chains using the prerequisite structure documented in cognitive diagnostic models[32].
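The way a response at atom n feeds the prompt at atom n+1 can be sketched as a simple loop. This is a minimal illustration under stated assumptions: the `Atom` class, the `{previous}` placeholder, and the `respond` callback are hypothetical names, and prerequisite-based placement of atoms is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Atom:
    atom_id: str
    prompt_template: str  # may contain "{previous}" for the prior response


def walk_chain(
    atoms: List[Atom], respond: Callable[[str], str]
) -> List[Tuple[str, str, str]]:
    """Walk a chain, threading each response into the next prompt.

    Returns a list of (atom_id, rendered_prompt, response) tuples.
    """
    previous = ""  # the first atom has no prior response
    trace = []
    for atom in atoms:
        prompt = atom.prompt_template.format(previous=previous)
        response = respond(prompt)  # e.g. collect the student's input
        trace.append((atom.atom_id, prompt, response))
        previous = response
    return trace


# Hypothetical two-atom chain with scripted responses.
chain = [
    Atom("a1", "What operation undoes adding 3?"),
    Atom("a2", "You said: '{previous}'. Apply it to x + 3 = 7."),
]
scripted = iter(["subtracting 3", "x = 4"])
trace = walk_chain(chain, lambda prompt: next(scripted))
```

Here the second prompt is rendered with the student's own words from the first atom, which is the sense in which the chain is a scaffolded sequence rather than a list of independent items.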
§5 Misconceptions as Named Targets
The misconception literature, beginning with Smith, diSessa, and Roschelle's reconception[19] and extending across physics[33], mathematics[34], biology, and reading, identifies named, persistent error patterns that produce characteristic wrong answers across populations. The names matter. Once a misconception is named (impetus theory, tone-as-topic, equates squared with doubled), instruction can be designed specifically to dislodge it.
Luna maintains an internal taxonomy of approximately 1,200 named misconceptions across K–12, drawn from this literature and extended through teacher review. Each misconception is associated with: (a) the thinking-step atoms that elicit it, (b) the surface features that mark it in a free-response, and (c) one or more targeted re-teach sequences. When a student's response at an atom matches a misconception pattern, the platform flags it to the student's teacher and offers the targeted re-teach as a one-click intervention.
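The three associations (a) to (c) suggest a record shape along the following lines. This is a sketch under loud assumptions: the field names are hypothetical, and regular expressions stand in for surface-feature detection purely for illustration; they should not be read as the platform's parser.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple


@dataclass(frozen=True)
class Misconception:
    name: str
    atom_ids: FrozenSet[str]            # (a) atoms that elicit it
    surface_patterns: Tuple[str, ...]   # (b) regexes marking it in a response
    reteach_ids: Tuple[str, ...]        # (c) targeted re-teach sequences


def match_misconceptions(
    atom_id: str, response: str, taxonomy: Sequence[Misconception]
) -> List[Misconception]:
    """Return misconceptions the response is consistent with at this atom."""
    hits = []
    for m in taxonomy:
        if atom_id not in m.atom_ids:
            continue  # this atom is not known to elicit the misconception
        if any(re.search(p, response, flags=re.IGNORECASE) for p in m.surface_patterns):
            hits.append(m)
    return hits


# Hypothetical taxonomy entry and atom id.
taxonomy = [
    Misconception(
        name="equates squared with doubled",
        atom_ids=frozenset({"exp-01"}),
        surface_patterns=(r"\bdoubl\w*\b", r"\btwice\b"),
        reteach_ids=("reteach-exp-01",),
    )
]
flags = match_misconceptions("exp-01", "5 squared is 5 doubled, so 10", taxonomy)
```

A match is evidence that a response is consistent with a misconception, not proof that the student holds it, so the result is best treated as a flag for the teacher rather than a verdict.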
This is, in effect, a software implementation of formative assessment as described by Black and Wiliam[6]: the diagnostic function and the remedial function are placed adjacent to the moment of evidence, rather than separated by the days or weeks of a traditional assessment cycle.
§6 The Extended Mind and the Zettelkasten
Clark and Chalmers[35] argue that when an external resource is reliably available, automatically consulted, and information-processing-equivalent to an internal cognitive process, the resource constitutes a literal part of the cognitive system. Their canonical example is Otto's notebook: a memory-impaired man whose notebook plays the role that biological memory plays for others. The philosophical and empirical extensions of this thesis[36, 37] are now substantial.
Luna's persistent thought-store is designed to meet the Clark–Chalmers conditions. Every captured thought is written to a per-student note network, where it is addressable, linked, and surfaced automatically when later atoms touch the same content. The architectural model is Luhmann's zettelkasten[22]: a densely cross-referenced collection of atomic notes, with links that encode supports, contradicts, exemplifies, and decomposes relations between ideas. Luhmann credited the slip-box with his own scholarly productivity (over 70 books and 400 articles); Ahrens[38] documents the cognitive mechanics.
The educational consequence is cumulative. A student's thinking about presupposition in an October passage is linked, automatically, to the next chain in March that asks about authorial assumption. The teacher conferencing with the student in May can replay both the October chain and the March chain alongside each other. The diagnosis is no longer a snapshot; it is a trajectory.
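A per-student note network with the four link types named above could be sketched as follows. The class and method names are hypothetical, and topic tags are an assumed stand-in for however "touching the same content" is actually determined.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Iterable, List, Tuple


class Relation(Enum):
    SUPPORTS = "supports"
    CONTRADICTS = "contradicts"
    EXEMPLIFIES = "exemplifies"
    DECOMPOSES = "decomposes"


@dataclass(frozen=True)
class Note:
    note_id: str
    text: str
    topics: FrozenSet[str]


class NoteNetwork:
    """Addressable notes with typed, directed links (illustrative only)."""

    def __init__(self) -> None:
        self._notes = {}
        self._links = defaultdict(list)     # source id -> [(Relation, target id)]
        self._by_topic = defaultdict(list)  # topic -> [note id], in insertion order

    def add(self, note: Note) -> None:
        if note.note_id in self._notes:
            raise ValueError(f"duplicate note id: {note.note_id}")
        self._notes[note.note_id] = note
        for topic in note.topics:
            self._by_topic[topic].append(note.note_id)

    def link(self, source_id: str, relation: Relation, target_id: str) -> None:
        for note_id in (source_id, target_id):
            if note_id not in self._notes:
                raise KeyError(note_id)
        self._links[source_id].append((relation, target_id))

    def links_from(self, note_id: str) -> List[Tuple[Relation, str]]:
        return list(self._links.get(note_id, []))

    def surface(self, topics: Iterable[str]) -> List[Note]:
        """Earlier notes sharing any topic with the atom now being attempted."""
        seen: List[str] = []
        for topic in topics:
            for note_id in self._by_topic.get(topic, []):
                if note_id not in seen:
                    seen.append(note_id)
        return [self._notes[n] for n in seen]


# Hypothetical notes mirroring the October and March example.
net = NoteNetwork()
net.add(Note("oct-1", "The passage presupposes the reader agrees.", frozenset({"assumption"})))
net.add(Note("mar-1", "The author assumes the policy failed.", frozenset({"assumption"})))
net.link("mar-1", Relation.EXEMPLIFIES, "oct-1")
earlier = net.surface(["assumption"])
```

Because notes are returned in the order they were written, replaying `earlier` gives the trajectory rather than a snapshot.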
§7 Real-Time Formative Feedback
Black and Wiliam's 1998 review[6] remains the foundational document for formative assessment, identifying effect sizes on the order of d = 0.4–0.7 for well-designed formative practice. Hattie's meta-meta-analyses[7, 39] place high-quality feedback among the largest single interventions available to a teacher, with average effect sizes near d = 0.7. Kluger and DeNisi's earlier meta-analysis[40] adds an important caveat: feedback that orients attention to the self (praise, grade-level) is on average negative; feedback that orients to the task and the strategy is reliably positive. Shute's synthesis[41] operationalizes the latter.
Luna places task- and strategy-level feedback at the level of the individual atom and routes it to the teacher within seconds of capture. The signal is intentionally not numeric: it is the named misconception, the targeted re-teach, and a one-click action. This design choice follows the Kluger–DeNisi recommendation away from ego-engaging feedback and toward strategic feedback.
Bloom's 2-sigma problem[12] — the observation that one-to-one mastery-oriented tutoring produces achievement two standard deviations above conventional group instruction — has set the aspirational ceiling for ed-tech for forty years. VanLehn's review[13] revised the estimate downward but confirmed that step-level interaction in intelligent tutoring systems can approach the human-tutor effect when implemented carefully. The thought-step architecture is a direct attempt at the step granularity VanLehn identifies as critical.
§8 Implications for Teachers and Schools
The architecture described here makes specific predictions about how classroom workflow changes. First, the unit of teacher attention shifts from which students got which items wrong to which misconceptions are spreading in the cohort right now. Second, the re-teach loop tightens from days to minutes. Third, the conference — with student or guardian — becomes evidence-rich: the teacher can replay actual reasoning, not merely cite scores.
Standards-based grading[42] composes naturally with this design: mastery rolls up by standard from the underlying graph of atoms, and the artifact backing every mastery claim is a stored, inspectable chain. Conferring with parents shifts from "she has a B-" to "here are the fourteen standards she has demonstrated and the two she has not, with examples of her reasoning."
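The roll-up from atoms to standards can be sketched as a small aggregation. The mastery rule used here (a standard counts as demonstrated once every atom mapped to it has at least one demonstrated response) is an assumption chosen for illustration, as are the function name and identifiers; the source does not specify the rule.

```python
from collections import defaultdict


def roll_up_mastery(atom_to_standard, evidence):
    """Aggregate atom-level evidence into per-standard mastery claims.

    atom_to_standard: dict mapping atom id -> standard id.
    evidence: iterable of (atom_id, chain_id, demonstrated) tuples.
    Returns {standard: {"demonstrated": bool, "backing_chains": [chain ids]}}.
    """
    atoms_by_standard = defaultdict(set)
    for atom_id, standard in atom_to_standard.items():
        atoms_by_standard[standard].add(atom_id)

    demonstrated_atoms = set()
    chains_by_standard = defaultdict(list)
    for atom_id, chain_id, demonstrated in evidence:
        if not demonstrated or atom_id not in atom_to_standard:
            continue
        demonstrated_atoms.add(atom_id)
        standard = atom_to_standard[atom_id]
        if chain_id not in chains_by_standard[standard]:
            chains_by_standard[standard].append(chain_id)

    return {
        standard: {
            # Assumed rule: every atom under the standard has been demonstrated.
            "demonstrated": atoms <= demonstrated_atoms,
            # The stored chains a teacher could replay to back the claim.
            "backing_chains": chains_by_standard.get(standard, []),
        }
        for standard, atoms in atoms_by_standard.items()
    }


# Hypothetical mapping and evidence.
report = roll_up_mastery(
    {"a1": "S1", "a2": "S1", "a3": "S2"},
    [("a1", "c1", True), ("a2", "c2", True), ("a3", "c2", False)],
)
```

Each mastery claim in the result carries the chain identifiers that support it, which is what makes the claim inspectable in a conference.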
§9 Limitations and Open Questions
Several caveats are worth stating clearly. (i) Free-response capture depends on the student's ability to externalize thinking; modalities other than typing — voice, drawing, scribed dictation — are supported but vary in fidelity. (ii) The misconception taxonomy is a living artifact; its coverage in some content areas is denser than others. (iii) Parsing of student responses involves both rule-based and learned models; we treat parser outputs as fallible evidence and surface ambiguity to teachers rather than asserting confident diagnoses where the response is genuinely ambiguous. (iv) The platform's effect on long-run achievement, as distinct from the component effects established in the cited literature, is the subject of ongoing studies in partner schools and will be reported in a separate technical report.
§10 Conclusion
The cognitive-science literatures on retrieval, difficulty, spacing, generation, self-explanation, formative assessment, metacognition, misconceptions, verbal protocols, and extended cognition are individually mature and individually robust. They have not, in our view, been composed at the level of software architecture in the manner this paper describes. Luna's claim is that composing them at the level of the thinking step, with persistent linked storage of each captured thought, yields a platform whose formative signals are of a kind not available in the question-and-answer model that has dominated educational software since its inception. The empirical evaluation of that claim is the work of the next several years; the theoretical case is set out above.
§R References
- [1] Roediger, H. L., & Karpicke, J. D. (2006). Test-enhanced learning: Taking memory tests improves long-term retention. Psychological Science, 17(3), 249–255.
- [2] Karpicke, J. D., & Blunt, J. R. (2011). Retrieval practice produces more learning than elaborative studying with concept mapping. Science, 331(6018), 772–775.
- [3] Bjork, R. A., & Bjork, E. L. (2011). Making things hard on yourself, but in a good way: Creating desirable difficulties to enhance learning. In M. A. Gernsbacher et al. (Eds.), Psychology and the real world (pp. 56–64). Worth Publishers.
- [4] Cepeda, N. J., Pashler, H., Vul, E., Wixted, J. T., & Rohrer, D. (2006). Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin, 132(3), 354–380.
- [5] Rohrer, D., Dedrick, R. F., & Stershic, S. (2015). Interleaved practice improves mathematics learning. Journal of Educational Psychology, 107(3), 900–908.
- [6] Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Principles, Policy & Practice, 5(1), 7–74.
- [7] Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81–112.
- [8] Flavell, J. H. (1979). Metacognition and cognitive monitoring: A new area of cognitive–developmental inquiry. American Psychologist, 34(10), 906–911.
- [9] Bransford, J. D., Brown, A. L., & Cocking, R. R. (Eds.) (2000). How People Learn: Brain, Mind, Experience, and School (Expanded Edition). National Academy Press.
- [10] Birenbaum, M., & Tatsuoka, K. K. (1987). Open-ended versus multiple-choice response formats – It does make a difference for diagnostic purposes. Applied Psychological Measurement, 11(4), 385–395.
- [11] Martinez, M. E. (1999). Cognition and the question of test item format. Educational Psychologist, 34(4), 207–218.
- [12] Bloom, B. S. (1984). The 2 sigma problem: The search for methods of group instruction as effective as one-to-one tutoring. Educational Researcher, 13(6), 4–16.
- [13] VanLehn, K. (2011). The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4), 197–221.
- [14] Lord, F. M., & Novick, M. R. (1968). Statistical Theories of Mental Test Scores. Addison-Wesley.
- [15] Ericsson, K. A., & Simon, H. A. (1980). Verbal reports as data. Psychological Review, 87(3), 215–251.
- [16] Ericsson, K. A., & Simon, H. A. (1984). Protocol Analysis: Verbal Reports as Data. MIT Press.
- [17] diSessa, A. A. (1988). Knowledge in pieces. In G. Forman & P. B. Pufall (Eds.), Constructivism in the computer age (pp. 49–70). Lawrence Erlbaum.
- [18] diSessa, A. A. (1993). Toward an epistemology of physics. Cognition and Instruction, 10(2–3), 105–225.
- [19] Smith, J. P., diSessa, A. A., & Roschelle, J. (1994). Misconceptions reconceived: A constructivist analysis of knowledge in transition. Journal of the Learning Sciences, 3(2), 115–163.
- [20] Kintsch, W. (1998). Comprehension: A Paradigm for Cognition. Cambridge University Press.
- [21] Bennett, R. E. (1993). On the meanings of constructed response. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement (pp. 1–27). Lawrence Erlbaum.
- [22] Luhmann, N. (1981). Kommunikation mit Zettelkästen: Ein Erfahrungsbericht. In H. Baier, H. M. Kepplinger, & K. Reumann (Eds.), Öffentliche Meinung und sozialer Wandel (pp. 222–228). Westdeutscher Verlag.
- [23] Dunlosky, J., Rawson, K. A., Marsh, E. J., Nathan, M. J., & Willingham, D. T. (2013). Improving students' learning with effective learning techniques: Promising directions from cognitive and educational psychology. Psychological Science in the Public Interest, 14(1), 4–58.
- [24] Adesope, O. O., Trevisan, D. A., & Sundararajan, N. (2017). Rethinking the use of tests: A meta-analysis of practice testing. Review of Educational Research, 87(3), 659–701.
- [25] Soderstrom, N. C., & Bjork, R. A. (2015). Learning versus performance: An integrative review. Perspectives on Psychological Science, 10(2), 176–199.
- [26] Carpenter, S. K., Cepeda, N. J., Rohrer, D., Kang, S. H. K., & Pashler, H. (2012). Using spacing to enhance diverse forms of learning: Review of recent research and implications for instruction. Educational Psychology Review, 24(3), 369–378.
- [27] Slamecka, N. J., & Graf, P. (1978). The generation effect: Delineation of a phenomenon. Journal of Experimental Psychology: Human Learning and Memory, 4(6), 592–604.
- [28] Chi, M. T. H., Bassok, M., Lewis, M. W., Reimann, P., & Glaser, R. (1989). Self-explanations: How students study and use examples in learning to solve problems. Cognitive Science, 13(2), 145–182.
- [29] Kang, S. H. K., McDermott, K. B., & Roediger, H. L. (2007). Test format and corrective feedback modify the effect of testing on long-term retention. European Journal of Cognitive Psychology, 19(4–5), 528–558.
- [30] Sweller, J. (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2), 257–285.
- [31] Sweller, J., van Merriënboer, J. J. G., & Paas, F. G. W. C. (1998). Cognitive architecture and instructional design. Educational Psychology Review, 10(3), 251–296.
- [32] Leighton, J. P., & Gierl, M. J. (Eds.) (2007). Cognitive Diagnostic Assessment for Education: Theory and Applications. Cambridge University Press.
- [33] Halloun, I. A., & Hestenes, D. (1985). Common sense concepts about motion. American Journal of Physics, 53(11), 1056–1065.
- [34] Stafylidou, S., & Vosniadou, S. (2004). The development of students' understanding of the numerical value of fractions. Learning and Instruction, 14(5), 503–518.
- [35] Clark, A., & Chalmers, D. (1998). The extended mind. Analysis, 58(1), 7–19.
- [36] Clark, A. (2008). Supersizing the Mind: Embodiment, Action, and Cognitive Extension. Oxford University Press.
- [37] Sutton, J. (2010). Exograms and interdisciplinarity: History, the extended mind, and the civilizing process. In R. Menary (Ed.), The Extended Mind (pp. 189–225). MIT Press.
- [38] Ahrens, S. (2017). How to Take Smart Notes: One Simple Technique to Boost Writing, Learning and Thinking. Sönke Ahrens.
- [39] Hattie, J. (2009). Visible Learning: A Synthesis of Over 800 Meta-Analyses Relating to Achievement. Routledge.
- [40] Kluger, A. N., & DeNisi, A. (1996). The effects of feedback interventions on performance: A historical review, a meta-analysis, and a preliminary feedback intervention theory. Psychological Bulletin, 119(2), 254–284.
- [41] Shute, V. J. (2008). Focus on formative feedback. Review of Educational Research, 78(1), 153–189.
- [42] Guskey, T. R. (2007). Closing achievement gaps: Revisiting Benjamin S. Bloom's 'Learning for mastery'. Journal of Advanced Academics, 19(1), 8–31.