Evaluation of LLMs’ Skills in Correcting Short Responses in the K-12 Educational Setting
Do Large Language Models Achieve the Desired Level?
An empirical study evaluated LLMs’ ability to correct answers to short-answer questions in K-12 education. The results are summarized below.
Introduction
The importance of formative assessment in K-12 education is well recognized. However, the scale at which such assessments can be conducted is limited by the resources required. This study explores the ability of large language models (LLMs), particularly GPT-4, to correct open-ended responses to short-answer questions in real-world educational contexts, using a new dataset from Carousel, a quiz platform.
Methodology
Using student responses in Science and History, we evaluated how well different configurations of GPT-4 corrected answers to short-answer questions. The models were evaluated on a set of 1,710 student responses from various school levels (ages 5-16), and their grades were compared with expert human evaluations by measuring model-human agreement with Cohen’s Kappa.
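To make the agreement measure concrete, the sketch below (illustrative only, not the study’s code) computes raw agreement and Cohen’s Kappa for a small set of hypothetical binary correct/incorrect labels, using scikit-learn’s cohen_kappa_score.

```python
# Illustrative sketch, not the study's code: compare model grades with human
# grades on binary correct/incorrect labels using Cohen's Kappa.
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels: 1 = correct, 0 = incorrect, one entry per student response.
human_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
model_labels = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(human_labels, model_labels)
agreement = sum(h == m for h, m in zip(human_labels, model_labels)) / len(human_labels)
print(f"Raw agreement: {agreement:.2f}, Cohen's Kappa: {kappa:.2f}")
```

Unlike raw agreement, Cohen’s Kappa discounts the agreement that would be expected by chance, which is why it is the headline metric both for the human raters and for the models.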
Dataset
The Carousel dataset included questions categorized as easy or difficult based on their complexity and the level of abstraction required. After removing blank responses and responses that exactly matched the exemplar answers, we evaluated the remaining ambiguous responses, which required more nuanced judgment.
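As an illustration of this filtering step (the data layout and field names below are hypothetical, not taken from the Carousel dataset), one might drop blank responses and exact matches to the exemplar answer before grading:

```python
# Hypothetical filtering step: keep only responses that need a judgment call.
def needs_judgment(response: str, exemplar: str) -> bool:
    response = response.strip().lower()
    exemplar = exemplar.strip().lower()
    return bool(response) and response != exemplar  # drop blanks and exact matches

responses = [
    {"answer": "Photosynthesis", "exemplar": "Photosynthesis"},              # perfect match, excluded
    {"answer": "", "exemplar": "Photosynthesis"},                            # blank, excluded
    {"answer": "It makes food from light", "exemplar": "Photosynthesis"},    # kept for grading
]
ambiguous = [r for r in responses if needs_judgment(r["answer"], r["exemplar"])]
print(len(ambiguous))  # 1
```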
Human Assessment
Human raters, qualified as teachers of the subjects being evaluated, examined each response and determined whether it was correct. Agreement among the human raters was 87%, with a Kappa of 0.75, indicating good inter-rater reliability.
Model Results
The best model configuration (GPT-4 with a few-shot prompt) obtained a Kappa of 0.70, very close to human performance (Kappa of 0.75). The results suggest that LLMs, with minimal prompt engineering, can come very close to the performance of human raters across a variety of subjects and school levels.
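As an illustration only (the wording and examples below are assumed, not the prompt actually used in the study), a few-shot grading prompt for this task might place a handful of labelled example answers before the response to be graded:

```python
# Assumed few-shot grading prompt for a short-answer question; the question,
# examples, and wording are illustrative, not the study's actual prompt.
FEW_SHOT_PROMPT = """You are grading short answers from K-12 students.
Question: Why do we see phases of the Moon?
Exemplar answer: Because the Moon orbits the Earth, we see different amounts of its sunlit side.

Example 1
Student answer: The Moon goes around the Earth so we see different parts lit by the Sun.
Grade: CORRECT

Example 2
Student answer: Clouds cover different parts of the Moon.
Grade: INCORRECT

Now grade this response.
Student answer: {student_answer}
Grade:"""

def build_prompt(student_answer: str) -> str:
    # The filled-in prompt would then be sent to the model (e.g. GPT-4).
    return FEW_SHOT_PROMPT.format(student_answer=student_answer)

print(build_prompt("We see the lit half from different angles as the Moon orbits us."))
```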
Discussion
GPT-4’s closeness to human performance suggests that LLMs could be useful for low-stakes formative assessment tasks in K-12 education. The performance gap between GPT-3.5 and GPT-4 was substantial, with GPT-4 markedly improving recall without giving up much accuracy. Adding few-shot examples improved model performance slightly.
Conclusion
This study shows that LLMs such as GPT-4 can be effective tools for formative assessment, saving significant time for teachers without compromising assessment quality. Further research could explore the use of LLMs on a wider range of tasks and questions, as well as investigate factors that influence model performance.
Implications
The adoption of LLMs in education could revolutionize the way formative assessments are carried out, making them more frequent and efficient. However, it is crucial to further explore the limitations and potential of these models to ensure that they can be used effectively and responsibly in education.