Beyond the Hype: How We Built AI Tools That Actually Support Learning
AI tools have made remarkable progress. They can now detect ideas in student science explanations, score responses, and provide feedback at scale (Kubsch et al., 2023; Zhai et al., 2020). But the harder question is how to use what the AI detects to actually support learning.
In classrooms, learning is complex and contextual. Students come to science classrooms with ideas shaped by prior instruction and everyday observations, ideas that don't always match what we expect them to know. For example, when 7th graders are asked how animals use energy from the sun, instead of reciting textbook answers about photosynthesis and energy transfer, they often share observations like:
"The sun keeps all of us and the animals warm. Without the sun, we will be cold and freeze to death."
Or they connect to personal experience:
"The sun helps plants grow, and they also need water. My mom's tomatoes died last year because of drought."
AI tools cannot detect and respond to the range of ideas that teachers do. When they treat this richness as noise, they can give feedback that makes students feel their lived experiences are less valued. Consider what happens when the student who shared the drought observation above receives this response:
"Your answer is incomplete. Please describe how energy from the sun transfers through the food chain."
The student just exits the dialog. The feedback ignores the student's genuine connection between their lived experience and plant survival.
Over time, students stop sharing what they actually think and instead try to guess what the scoring tools want to hear. A teacher, by contrast, might ask:
"That's a really important observation about your mom's tomatoes! How do you think they used water and sunlight to survive?"
These conversations allow teachers to build on what students already know, ask tailored follow-up questions, and guide students toward connecting their observations to scientific concepts.
I've worked with education researchers from the TELS group, middle school science teachers, and computer scientists to build AI dialogs using Natural Language Processing (NLP). These dialogs, which I'll refer to as NLP dialogs throughout this piece, are structured adaptive systems designed to detect specific student ideas and ask targeted follow-up questions, distinct from the generative AI chatbots that have recently captured public attention.
Following the Knowledge Integration Framework (Linn & Eylon, 2011), we designed NLP dialogs that ask questions to help students think, encourage students to elaborate, distinguish among alternatives, and link the most promising ones together. Our goal was to create AI tools that detect and respond to the full spectrum of student thinking, including those warmth and drought observations, grounded in locally relevant contexts. The difference in our approach? We gave teachers and students the power to tell us what counted as valuable thinking.
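To make the contrast with chatbots concrete, here is a minimal sketch of what a single turn of such a structured dialog can look like: detect ideas, choose a Knowledge Integration move, ask one targeted question. The idea labels, the toy detector, and the move-selection rules are illustrative assumptions, not our production system.

```python
# Minimal sketch of one turn in a structured NLP dialog (illustrative, not our production code).
# The idea labels, the move-selection rules, and the toy detector are assumptions.
from enum import Enum, auto

class KIMove(Enum):
    ELICIT = auto()       # draw out more of the student's own ideas
    DISTINGUISH = auto()  # help the student compare alternatives
    LINK = auto()         # help the student connect ideas together

def choose_move(detected_ideas: set[str]) -> KIMove:
    """Simplified policy: elicit when little is detected, otherwise push further."""
    if len(detected_ideas) <= 1:
        return KIMove.ELICIT
    if "energy_transfer" in detected_ideas and "cellular_respiration" not in detected_ideas:
        return KIMove.DISTINGUISH
    return KIMove.LINK

def dialog_turn(student_response: str, detect) -> str:
    """detect() stands in for a trained idea-detection model."""
    ideas = detect(student_response)
    questions = {
        KIMove.ELICIT: "Can you say more about what you were thinking?",
        KIMove.DISTINGUISH: "What is different about how plants and animals get the energy they need?",
        KIMove.LINK: "How does the energy plants capture end up in the animals that eat them?",
    }
    return questions[choose_move(ideas)]

# Toy detector for demonstration only.
print(dialog_turn("Animals get energy when they eat plants.",
                  detect=lambda text: {"energy_transfer", "food_chain"}))
```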
The Problem with Current AI in Education
Consider the student who wrote about the sun keeping animals warm, a reasonable observation grounded in lived experience. If the AI identifies this idea, what should happen next? Should the AI ask for more details about warmth? Should it help the student distinguish between different types of energy? Should it connect this observation to what they learned last week about energy and matter in food chains? The answer depends on everything else the student said, what instruction they've received so far, and where the teacher is trying to take the class next.
This is where teacher expertise becomes essential, and where it's often underutilized in AI design. Our work represents a variation of design-based research (Bell, 2004) that centers teachers in two critical design decisions throughout dialog development.
First, we worked together to define and iteratively refine what ideas matter enough for the system to detect. Second, we collaborated on deciding the guidance it provides: how the system should respond instructionally to different combinations of detected ideas. Both decisions required ongoing partnership as we learned from classroom implementation, and both turned out to be essential for creating AI tools that actually support learning.
What "Partnership-Based" Design Looks Like
Partnership Decision #1: Iterative Refinement of What to Detect
From the beginning, teachers were thought partners in designing what the AI should detect. When developing the photosynthesis dialog (Li et al., 2024), we worked together to identify 13 key ideas students commonly expressed, including scientifically normative ideas like reactants and products of photosynthesis, alongside non-normative understandings like "light energy becomes matter." We trained an NLP model on over 1,200 student responses from five public middle schools.
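For readers curious about the mechanics, here is a minimal sketch of one common way to set up this kind of multi-label idea detection. The file name, the semicolon-separated label format, and the TF-IDF plus logistic regression pipeline are assumptions for illustration; they are not a description of our actual model.

```python
# Minimal sketch of multi-label idea detection (illustrative, not our production model).
# Assumes a hypothetical CSV with a free-text "response" column and a semicolon-separated "ideas" column.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.read_csv("student_responses.csv")           # hypothetical file
texts = df["response"].tolist()
idea_lists = [ideas.split(";") for ideas in df["ideas"]]

binarizer = MultiLabelBinarizer()                    # one column per idea, e.g. "energy_transfer"
y = binarizer.fit_transform(idea_lists)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),            # simple lexical features
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),   # one detector per idea
)
model.fit(texts, y)

# Predict the set of ideas in a new response.
new_response = ["The sun keeps animals warm so they don't freeze."]
print(binarizer.inverse_transform(model.predict(new_response)))
```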
Then we tested the dialog in classrooms. Researchers analyzed student learning outcomes from the dialog log data, tracking how students' ideas changed and whether their overall understanding improved. I brought these findings to our research-practice partnership meetings and teacher workshops, paired with concrete examples of student dialog exchanges. Teachers, in turn, reviewed the actual conversations their students had with the AI during these workshops and meetings.
During one workshop, an expert teacher with 10+ years of experience pointed to several student responses that caught her eye: ideas that seemed "very vague and general but not necessarily wrong." These were the kind of broad observations that often serve as entry points when students are building understanding: students writing that "the sun helps plants survive" or drawing on the "10% rule" from prior lessons.
Through systematic review, teachers, in collaboration with researchers, identified six additional ideas our model needed to recognize. These spanned common entry points to understanding, connections to prior instruction, and personal experiences that often foreshadowed more sophisticated reasoning.
We expanded our model from 13 to 19 ideas, retraining it on 1,206 student dialogs from eight schools. The model's performance remained satisfactory (micro-F1: 0.73). The entry-point ideas teachers had flagged turned out to be among the four most frequently detected ideas across all student responses.
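A quick note on the metric: micro-F1 pools true positives, false positives, and false negatives across all idea labels before computing F1, so ideas that appear often count more toward the score than rare ones. A toy illustration, with made-up label matrices:

```python
# Toy illustration of micro-averaged F1 over multi-label idea predictions (numbers are made up).
import numpy as np
from sklearn.metrics import f1_score

# Rows = student responses, columns = idea labels (1 = idea present).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 1, 0]])

# Micro-averaging pools true/false positives and false negatives across all labels
# before computing precision, recall, and F1.
print(f1_score(y_true, y_pred, average="micro"))  # 0.8 for this toy example
```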
Partnership Decision #2: How Should the AI Respond?
Detecting ideas was only half the challenge. Once the AI identified the ideas in a student's response, we had to determine what it should say next to support that student's learning. This question turned out to require bringing together different forms of expertise.
As an education researcher, I brought knowledge from the Knowledge Integration framework (Linn & Eylon, 2011) about when students need prompts that elicit their ideas versus prompts that help them distinguish between concepts. But this theoretical knowledge had limitations. Teachers knew their specific students: which words would resonate, what might confuse them, and how to build on classroom conversations. Our co-design process evolved to leverage both types of expertise. I drafted prompts based on learning sciences principles, then refined them with teachers in one-on-one meetings to check whether they made sense in local classroom contexts.
For example, we initially designed a response for when the model detects the energy transfer idea. The response aimed to elicit ideas about the next step in the energy flow, cellular respiration:
"Nice thinking! You talked about energy transfer. Can you tell me more about how animals use energy?"
When we tested it, students interpreted "energy" broadly, writing about warmth, vitamins, running, and hunting. They were thinking about energy, just not the specific transfer through cellular respiration we were scaffolding toward. Teachers helped us see the problem: the response was too vague. We revised it to:
"Interesting idea! How do plants and animals use energy from the sun differently?"
This distinguishing response helped students articulate the key distinction: photosynthesis versus cellular respiration. The difference came down to specificity. Instead of "tell me more," we gave students precise thinking work.
Through many rounds of revision, key guidelines emerged:
- Ask one question at a time.
- Affirm what students said before pushing further.
- Keep language concise.
Simple, but they fundamentally shaped every response. The central insight from our response design work is that effective AI guidance requires both learning sciences principles and classroom-grounded expertise. Neither form of expertise alone would have been enough to create prompts that genuinely supported student thinking.
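As a rough illustration of how guidelines like these can be enforced mechanically, here is a sketch that assembles a response from an affirmation plus exactly one concise question. The checks and the word limit are assumptions for illustration, not our implementation; the question is the revised prompt from above.

```python
# Sketch of encoding the co-designed guidelines as simple checks (illustrative assumptions).

def build_response(affirmation: str, question: str, max_words: int = 25) -> str:
    """Affirm what the student said, then ask exactly one concise question."""
    assert question.count("?") == 1, "ask one question at a time"
    response = f"{affirmation} {question}"
    assert len(response.split()) <= max_words, "keep language concise"
    return response

print(build_response("Interesting idea! You talked about energy transfer.",
                     "How do plants and animals use energy from the sun differently?"))
```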
What This Means for Designing AI Tools in Education
The expanded model and co-designed guidance worked together to support student learning. Students who engaged with the dialogs showed significant learning gains, expressing more ideas in their responses and progressing from isolated observations to linked explanations connecting mechanisms across photosynthesis and cellular respiration.
The implications extend beyond our work. If you're developing AI for education, the most important decisions happen early: when defining what to detect and how to respond. Teacher expertise is most valuable here, yet this is often where teachers are least involved. Teachers' insights can address what AI alone cannot: identifying when the model treats genuine student ideas as noise, when it fails to cover the full range of student thinking, and when technical accuracy doesn't align with pedagogical value. Bring teachers in when making fundamental choices, not just for pilot testing. And be prepared to prioritize learning over technical metrics when the two conflict.
If you're a teacher considering AI tools, ask who decided what the AI detects and how it responds. Were teachers involved in those core decisions? Look for evidence of practitioner involvement in design, not just testing. The tools most likely to support your students are those designed in genuine partnership with teachers like you.
If you're a researcher, what practitioners teach you about learning and classroom contexts isn't just feedback; it's knowledge worth documenting and publishing alongside technical results.
Remember that student who wrote about warmth? Our original model would have marked it as off-topic. But teachers helped us see it as an entry point, an idea to build on, not an error to correct. That's what partnership makes possible: AI that sees learning the way teachers do.
References
- Bell, P. (2004). On the theoretical breadth of design-based research in education. Educational Psychologist, 39(4), 243-253.
- Kubsch, M., Krist, C., & Rosenberg, J. M. (2023). Distributing epistemic functions and tasks: A framework for augmenting human analytic power with machine learning in science education research. Journal of Research in Science Teaching, 60(2), 423-447.
- Li, W., Liao, Y., Steimel, K., Bradford, A., Gerard, L., & Linn, M. (2024). Teacher-informed expansion of an idea detection model for a knowledge integration assessment. In Proceedings of the Eleventh ACM Conference on Learning @ Scale (pp. 447-450).
- Linn, M. C., & Eylon, B. S. (2011). Science learning and instruction: Taking advantage of technology to promote knowledge integration. Routledge.
- Zhai, X., Yin, Y., Pellegrino, J. W., Haudek, K. C., & Shi, L. (2020). Applying machine learning in science assessment: A systematic review. Studies in Science Education, 56(1), 111-151.
