Essays 2026.05.09 Aid

What we actually know about AI in education — and what we don't yet.

Somewhere around the halfway point of 2026, global spend on AI in education crossed ten billion dollars a year. The number is hard to sit with — not because it is shocking, but because of what it stands next to.

In January of this year, the OECD released its Digital Education Outlook 2026, a landmark synthesis of the current picture of technology in classrooms across its member countries. One of its headline findings is precise and worth quoting in full: "When students rely too heavily on generative AI, metacognitive engagement tends to decline, creating a misalignment between task performance and genuine learning." The benefits, the report is careful to note, are not automatic. They arrive under particular conditions — and they appear to disappear when the conditions change.

The OECD goes further. It describes what happens in the studies it reviewed when the AI tool is removed from the student's hands entirely: results go from mixed to negative. The student who had the tool performs worse without it than the student who never had it at all. This is an unusual finding, not because it is damning (it is not, necessarily), but because it reframes the question. Most coverage of AI in education asks "does AI help?" The OECD finding shows that question is too simple. The more useful question is: is AI functioning here as a prosthesis or as a tutor? A prosthesis improves performance while the tool is in hand, and the gain disappears with the tool. A tutor builds capacity that outlasts the sessions. Those are different interventions, even if they wear the same label.
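One way to make the prosthesis-versus-tutor distinction concrete is to compare three scores: the tool group with the tool, the same group after the tool is removed, and a comparison group that never had it. The sketch below is illustrative only. The function name, the labels, and the example numbers are hypothetical, and the rubric is an assumption made for this essay, not a method taken from the OECD report.

```python
def classify_intervention(with_tool: float, after_removal: float, never_had: float) -> str:
    """Label a hypothetical study result using the essay's framing.

    All three arguments are mean scores on the same assessment.
    The rubric is an illustrative assumption, not an OECD method.
    """
    # Capacity that survives removal of the tool is the tutor pattern.
    if after_removal > never_had:
        return "tutor"
    # A gain that exists only while the tool is present is the prosthesis pattern.
    if with_tool > never_had:
        return "prosthesis"
    return "no measured benefit"


# Made-up numbers, chosen only to show the two patterns.
print(classify_intervention(with_tool=78.0, after_removal=61.0, never_had=65.0))
print(classify_intervention(with_tool=78.0, after_removal=72.0, never_had=65.0))
```

The first call mirrors the pattern the OECD describes (worse without the tool than never having had it); the second shows what a tutor-like result would look like under the same comparison.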

What Stanford found in 800 papers

In early 2026, as reported by thecause (Substack) that April, researchers at Stanford completed a systematic review of the academic literature on AI in K-12 education. They reviewed more than 800 papers. The number of studies that met the bar for strong causal evidence was twenty. Not twenty percent. Twenty papers.
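For scale, the arithmetic behind that ratio can be sketched in a few lines. Because the review covered "more than 800" papers, treating 800 as the denominator is an assumption that gives an upper bound on the share; the variable names are illustrative.

```python
papers_reviewed = 800   # the review covered "more than 800", so this is a floor
strong_causal = 20      # studies meeting the bar for strong causal evidence

# With the smallest possible denominator, this is the largest the share can be.
share_upper_bound = strong_causal / papers_reviewed
print(f"At most {share_upper_bound:.1%} of reviewed papers")
```

In other words, no more than about one paper in forty cleared the bar.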

This is not a finding that should make anyone feel bad about AI in education. Most new interventions — pharmaceutical, pedagogical, technological — have thin causal-evidence bases in their early years. The studies take time; the deployment comes first. What is striking about the 800/20 ratio in 2026 is that it sits inside a $10.6 billion spending year. The scale of investment and the scale of the evidence base are not moving at the same speed. Whether that gap is a "we haven't measured yet" problem or a "we measured and it's mixed" problem is exactly the thing we do not yet know. The Stanford review, as far as the current literature goes, is consistent with either reading.

That gap, between the spending picture and the strong-causal picture, is the paragraph almost nobody is writing in 2026. The OECD finding gets covered in one corner. The Stanford review gets covered in another. Read together, as a single account of what the field has and hasn't established, the two tell a different story.

The thing that already works

There is a useful counterpoint sitting on the same shelf. Before AI entered the classroom at scale, education research had assembled the strongest causal-evidence stack it has ever produced around a single non-technological intervention: high-impact tutoring.

The numbers from J-PAL and IES are worth stating plainly. High-impact tutoring (paid, trained, consistent tutors in small ratios of 1:1 to 1:3 at most, during the school day, at least three times a week) produces effect sizes roughly fifteen to twenty times larger than standard tutoring, equivalent to three to fifteen additional months of learning per intervention cycle. The Chicago Public Schools study, which placed AmeriCorps fellows in daily 2:1 sessions with low-performing students, found GPA gains of 0.58 grade points and attendance improvements of up to 7%.

These are strong numbers by any standard in education research. They are not ambiguous. They are not from a single study — they replicate across independent syntheses.

And then there is a finding that rarely travels alongside the headline: a 2022 study of virtual tutoring across eight school districts showed no detectable effect. The same word, "tutoring," was applied to a different set of design conditions, and the effect vanished. In-school-day timing, small ratio, paid and trained tutor, consistent relationship: remove any of those conditions and you are no longer implementing high-impact tutoring. You are implementing the label. The label does not transfer.
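The logic here is conjunctive: every condition has to hold at once. A minimal sketch of that check follows, using the conditions named in this essay. The class, field names, and example programs are hypothetical, and the thresholds are a reading of the essay's description rather than an official J-PAL or IES definition.

```python
from dataclasses import dataclass


@dataclass
class TutoringProgram:
    """Hypothetical description of a program's design conditions."""
    paid_trained_tutor: bool
    consistent_tutor: bool
    students_per_tutor: int
    during_school_day: bool
    sessions_per_week: int


def is_high_impact(program: TutoringProgram) -> bool:
    # Every condition must hold; dropping any single one fails the check.
    return (
        program.paid_trained_tutor
        and program.consistent_tutor
        and 1 <= program.students_per_tutor <= 3
        and program.during_school_day
        and program.sessions_per_week >= 3
    )


# Two made-up programs that would both be called "tutoring".
in_school = TutoringProgram(True, True, 2, True, 5)
after_hours = TutoringProgram(True, True, 2, False, 5)

print(is_high_impact(in_school))    # True
print(is_high_impact(after_hours))  # False
```

The second program differs from the first in a single field, which is enough to take it outside the design the evidence supports.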

A question the field hasn't yet consolidated

Reading the OECD finding and the Stanford 800/20 review together, alongside the high-impact tutoring evidence base, produces a question the 2026 conversation hasn't quite assembled yet.

High-impact tutoring's effect is so large partly because of what it is at a structural level: a consistent human relationship, in a real room, on a predictable schedule, calibrated to this student's current level, with an adult who notices what the student is doing and responds in real time. It is cognitively demanding in exactly the ways that produce durable learning — including metacognitive engagement, which the OECD finds declining under generative AI reliance.

The question is not: does AI in education work? The question, by 2026, is something more specific: what design conditions would make AI function more like a tutor than a prosthesis — more like high-impact tutoring than the virtual-tutoring-in-name-only that the 2022 study examined?

That question has an answer in principle. It would look something like: consistent AI-student relationship over weeks or months; responsiveness calibrated to the individual, not the cohort; sessions that require the student to do the cognitive work rather than offload it; some mechanism for checking capacity when the tool is absent, not only with it present. Some 2026 pilots are trying to build that thing. The evidence on whether they have succeeded is not yet in.

What we'd want to know by 2027

We find it genuinely interesting that the two largest reviews of this question — OECD in January, Stanford in early 2026 — appeared within weeks of each other and have not, in most coverage, been read as a single coordinated signal. They are not coordinated. They are independent institutions arriving at the same moment of scrutiny from different angles.

What they are doing together is setting the research agenda for the next two years. The evidence base on AI in K-12 will either fill in between now and 2028 (the deployment surface is now large enough that rigorous studies can actually run) or it will remain sparse against the spending picture. By 2027, either the classroom-AI studies will have begun to replicate with strong causal designs, or the field will be having a different conversation about what ten-plus billion dollars a year is actually purchasing.

We are watching that question with curiosity, not alarm. The evidence base is not thin because the intervention is failing — it is thin because the intervention is new and the rigorous study timelines are long. The thing to notice is the gap, and then to ask what kind of evidence we would need to close it. That is a different posture than either the promotional literature or the backlash takes.

The Aid question — what does it mean to genuinely augment a teacher's capacity rather than route around it — sits inside this same evidence question. A tool that gives the teacher better sight lines on what a student actually knows, and responds to what the teacher learns, looks different from a tool that delivers content directly to the student and measures whether the student can recall it. One of those is closer to the high-impact tutoring design. Whether either of them works at scale is what the next two years of studies will, or won't, tell us.

The honest position for now: we know that high-impact tutoring works, under specific design conditions. We know much less about whether the new tools will work in the same register. That makes this a good moment to pay close attention.

Sources

Wiki pages drawn from

topics/ai-in-education-evidence-base-2026 — Stanford 800/20 review, OECD headline finding, prosthesis-vs-tutor framing, spending picture.
concepts/high-impact-tutoring — effect sizes, design conditions, 2022 virtual-tutoring null result, J-PAL / IES / NCTQ syntheses.
concepts/aid — the Thewhat R&D path; teacher-augmentation framing; the question of what AI for teachers should actually mean.

External sources

OECD Digital Education Outlook 2026 — OECD, January 2026. https://www.oecd.org/en/publications/oecd-digital-education-outlook-2026_062a7394-en.html
"Here's what we know about AI and education. It's not good." — thecause (Substack), April 15, 2026. https://thecause.substack.com/p/heres-what-we-know-about-ai-and-education
"High-Quality Tutoring: An Evidence-Based Strategy to Tackle Learning Loss" — IES, 2026. https://ies.ed.gov/learn/blog/high-quality-tutoring-evidence-based-strategy-tackle-learning-loss
"Tutoring programs lead to learning improvements" — J-PAL Evidence-Effect. https://www.povertyactionlab.org/evidence-effect/tutoring-programs-lead-to-learning-improvements
"High-impact tutoring: Five ways to increase effectiveness with students" — NCTQ. https://www.nctq.org/research-insights/high-impact-tutoring-five-ways-to-increase-effectiveness-with-students/
