Why Large Language Models (LLMs) Aren't Built for Grading, And What That Means for Faculty

It usually starts with good intentions and a stack of papers.

A professor facing hundreds of submissions and a closing grade window opens ChatGPT, pastes in a student's work, and asks for feedback. The response comes back fast, fluent, and reasonable-looking. It feels like a problem solved and time saved.

According to Anastasios Sidiropoulos, that feeling is the trap.

Sidiropoulos, is an associate professor of computer science at the University of Illinois Chicago, where his research centers on algorithms and theoretical computer science. He earned his Ph.D. at MIT under Piotr Indyk, one of the field's most influential algorithms researchers, and previously held a joint appointment in computer science and mathematics at The Ohio State University. Across more than twenty years he has worked both in front of classrooms and in the technical depths of how machine learning models actually behave. He's also a co-founder of Grady, an AI-assisted grading and feedback tool built specifically for higher education. That combination of working educator and algorithms researcher is what makes his warning especially compelling.

His position is blunt: a general-purpose large language model is "the wrong tool" for grading student work. Not because the technology is weak, but because of how it fails, which is to say quietly, plausibly, and in ways most faculty may never catch by spot-checking.

Can any LLM grade student work?

The intuitive appeal of using an LLM to grade is obvious. These models write well, explain clearly, and handle text and images.

"But if you take the time to actually read the feedback carefully, line by line," he says, "you will be surprised that in some cases, one out of five papers the feedback is completely garbage."

In Sidiropoulos's testing on real-world data, general-purpose LLMs incorrectly graded 10-20% of submissions in certain assignments, treating incorrect student responses as correct. More concerning, the models often produced polished explanations, making the errors difficult for instructors to detect.

Sycophancy: the model tells you what you want to hear

To explain why this happens, Sidiropoulos points to a behavior called sycophancy. In an LLM, sycophancy is the tendency to tell users what they want to hear rather than what's true. Ask an LLM for life advice and it tends to tell you you're great, you're right, your instincts are sound, regardless of whether that's true. The model is biased toward agreement and reassurance.

Now apply that same tendency to grading. When the model cannot determine what a student wrote, it rarely signals that uncertainty. Instead, it infers the most expected response based on context and patterns in its training data. For exam questions, that often means interpreting an incorrect or ambiguous answer as the correct one.

The result is a strange and specific failure. "Instead of the LLM grading what the students submitted," Sidiropoulos explains, "the LLM will grade as if the students submitted the correct answer." A student gets credit for work they didn't do or the feedback responds to an argument the student never made. The grade and feedback are confident, well-written, and wrong.

Can an AI chatbot read what a student actually wrote?

To demonstrate how unreliable LLMs can be at evaluating student work, Sidiropoulos uses a striking example.

He begins with a multiple-choice question from a test: Which are characteristics of a good scientific experiment? A student has marked two options, one of them incorrect.

Multiple-choice question 33 from a chemistry test, with the student's boxes for Repeatable results and Based on opinions marked. — What the student actually marked on question 33.

He then gives the completed test item to one of the most capable AI models available, running at its maximum reasoning settings. He does not ask the model to grade the response. He simply asks the model to "Give me a list of the student answers for all questions."

The model gets it wrong. Instead of identifying the two options the student actually chose, it reports back the correct answers to the question. In other words, the model doesn't merely misgrade the response, it misreads it. Faced with an answer that differs from what it expects, it substitutes the response it believes should be there.

An AI model's answer listing the student's responses, showing question 33 as Repeatable results, Control variables, Measurable outcomes. — Asked only to report what the student selected, the model returned the correct answers to question 33 instead, not the boxes the student actually checked.

According to Sidiropoulos, this is not an isolated failure of a single model. He reports observing similar behavior across leading models from multiple providers. The implication is serious. If an LLM cannot reliably determine what a student wrote, no amount of its eloquent, polished feedback can be trusted.

Why do LLMs hallucinate?

LLMs hallucinate because of how they store information, not because of a fixable bug. It isn't something a patch will fix, and that's the part Sidiropoulos most wants faculty to understand.

LLMs don't store the text you give them word-for-word. They convert it into an internal representation that's efficient for reasoning but lossy, meaning some detail is discarded. Models learn to fill in anything unclear with the most statistically likely possibility.

In grading, that instinct is exactly backwards. On a typical question there are many ways to be wrong and only one way to be right, and the right answer is, by definition, the most likely one. So whenever the student's actual response is even slightly unclear to the model, it defaults to completing the picture with the correct answer. The very thing that makes LLMs powerful, filling gaps with the most expected answer, is what makes them hallucinate grades.

As a researcher who studies the mathematics of these models, Sidiropoulos is careful here. Hallucinations cannot be eliminated entirely; they are a fundamental property of large language models. But university grading is not a general-purpose task. While it spans many subjects and question types, it is constrained enough that many of the specific failure modes can be identified, measured, and engineered against. The goal is not to trust a chatbot to grade student work. It is to build a specialized system, like Grady, that is designed around the known limitations of LLMs and engineered to mitigate them.

"You wouldn't put crude oil in your car"

Sidiropoulos's favorite analogy is a relatable one. "A raw LLM," he says, "is like crude oil. It's enormously valuable, but you don't pour it straight into your engine. You refine it first: into gasoline for a car, lighter fluid for a lighter, shampoo for your hair." Each use demands its own refinement.

Grading is its own refinement problem, and it's a hard one. The hallucinations aren't uniform. Handwriting, diagrams, formulas, tables, and multiple-choice each fail in their own distinctive ways. Testing whether a system works means running not ten papers but hundreds or thousands of varied cases, which is precisely why a faculty member running one or two samples into an LLM to "check if it works" is the most dangerous test of all. The good result on a clean sample is exactly what hides the failures waiting in the messier ones.

There is a narrow exception for a tightly controlled, well-defined task. Sidiropoulos points to coding assignments, where you can actually run the code to verify it works, and where models have been heavily trained. Feed an LLM one programming question at a time, with strong guardrails, and a careful instructor can get real value. But open-ended essays, handwritten work, and diagrams, the everyday substance of most courses, are where raw LLMs quietly break.

What it actually takes: amplify the human, don't replace it

This is where Grady enters, not as a faster chatbot, but as a fundamentally different approach. Sidiropoulos's guiding principle is simple: don't build an autonomous robot; build an exoskeleton. An autonomous system replaces human judgment. An exoskeleton amplifies it. In Grady's design, the faculty member remains responsible for evaluating student work, while the system helps them do that work at greater scale and with greater consistency.

Unlike a general-purpose chatbot, Grady is an agentic system that runs many different models against each other, some doing the work and others checking it. Concretely, that means Grady is not an LLM with a friendly interface bolted on. Because LLMs can self-reflect, with one model evaluating another's output, Grady is built around layers of cross-checking rather than a single pass. With that, plus some proprietary tech, the system is engineered specifically against the hallucinations that grading produces. The team fine-tuned it in collaboration with experienced faculty across a range of disciplines, so its output reflects how expert graders in a given field actually grade rather than how a general-purpose model guesses. These methods and engineering drive the error rate down to a level no single model can reach.

Because of this architecture, the team describes Grady as a purpose-built grading workspace where AI is harnessed and controlled for this specific task, with faculty in control, not a chatbot you paste work into. The point isn't to push a button and walk away. It's to let an instructor steer the system up front, review and regrade efficiently across an entire class, and sign off on the result. The human is present at the start, the middle, and the end.

On accuracy, Sidiropoulos is direct, and he frames it the way he'd frame a self-driving car: the honest question isn't "is it 100% safe," it's "is it meaningfully safer than the human alternative." Human grading, he notes, is notoriously inconsistent. We just rarely notice because no one re-reads hundreds of graded papers to check. In Grady's internal trials, he reports the system making 10-30 times fewer errors than strong human graders.

The bigger picture

With his background in both education and AI, Sidiropoulos sees the promise of systems like Grady less in automating grading than in improving teaching and learning.

For students, that means receiving timely, targeted feedback that identifies specific misconceptions and points them toward improvement. For instructors, it means gaining visibility into patterns that are often hidden in a sea of assignments. Instead of seeing only averages and final scores, faculty can identify where students are struggling, which concepts are causing confusion, and how understanding differs across a class.

At scale, that kind of insight changes the role of assessment. Grading stops being merely a way to evaluate learning and becomes a way to understand it. As Sidiropoulos puts it, the system can give instructors something close to the ability to "read the minds" of an entire class, not by replacing their judgment, but by revealing patterns that would otherwise remain invisible.

The mistake isn't using AI to support grading and feedback. The mistake is assuming that a general-purpose chatbot can be trusted to do the job on its own. The real opportunity lies in building systems that acknowledge the limits of AI, keep instructors in control, and use the technology to help teachers teach and students learn more effectively.

Anastasios Sidiropoulos is an associate professor of computer science at the University of Illinois Chicago and a co-founder of Grady. He holds a Ph.D. from MIT, is a recipient of the NSF CAREER Award, and has spent more than twenty years researching algorithms and theoretical computer science.