AI and exam marking: exploring the difficult questions of trust and accountability
AQA’s Head of Research and Development, Cesare Aloisi, gives his perspective on some of the barriers to using artificial intelligence (AI) systems for marking high-stakes assessment.
Published
Monday 6 Feb 2023
Author
Dr Cesare Aloisi
Anyone picking up a newspaper recently will almost certainly have come across headlines about how some new AI systems [1], such as OpenAI’s ChatGPT [2] or Anthropic’s Claude [3], are threatening to shake the foundations of written assessment.
The evolving conversation around AI is certainly a topic of interest for exam board researchers like me. I’ve been exploring how AI might impact on the future of marking; in a previous blog post, I set out my thinking on why AI systems could not be good markers or reviewers of marking, even when they are accurate. The main issue, I argued at the time, is the concept of ‘explainability’: AI systems cannot explain how they came up with a particular decision or mark in the same way that a human marker could.
Regarding recent developments, it’s really too early to say whether interfaces like ChatGPT or Claude have changed the game in terms of explainability. Anecdotal evidence suggests that sometimes they can provide correct explanations, and sometimes they cannot. My research colleague David West, who has looked into these questions, suggests this may be because these interfaces are predictive models of language, rather than of ideas. When an idea has been widely discussed online, there is a large volume of text for ChatGPT or Claude to draw upon, which enables them to mimic that discussion eloquently. However, there is a chance that the AI systems simply perpetuate popular myths, as they have no real-world context to draw upon beyond ‘what is talked about a lot on the internet’. As a result, their explanations might appear convincing while not being based on facts.
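To give a feel for what ‘a predictive model of language’ means, here is a deliberately tiny sketch. It is not how ChatGPT or Claude are actually built (they use vastly larger neural networks), and the corpus and function names are invented for illustration: the model simply learns which word tends to follow which, then strings words together on that basis.

```python
import random
from collections import defaultdict, Counter

# A toy "training corpus". The model below can only ever reproduce
# patterns that are frequent in text like this, whether or not the
# statements it produces are factually correct.
corpus = (
    "the exam was marked fairly . "
    "the exam was marked quickly . "
    "the marker explained the decision clearly ."
).split()

# Count how often each word follows each other word (bigram counts).
next_word_counts = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    next_word_counts[current_word][next_word] += 1

def generate(start_word: str, length: int = 10) -> str:
    """Generate text by repeatedly sampling a likely next word."""
    words = [start_word]
    for _ in range(length):
        candidates = next_word_counts.get(words[-1])
        if not candidates:
            break
        # Popular continuations win, regardless of whether they are true.
        words.append(random.choices(
            list(candidates), weights=list(candidates.values())
        )[0])
    return " ".join(words)

print(generate("the"))  # e.g. "the exam was marked quickly . the marker explained ..."
```

Because the sketch only tracks which words co-occur, it can produce fluent-sounding sentences with no notion of whether they are accurate, which is the gap between modelling language and modelling ideas.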
In a recently published journal article, I went on to argue that lack of explainability is just one of the barriers to the use of AI systems for marking high-stakes assessment; other key issues that must be explored include questions of (un)reliability and bias.
Unreliability and bias
Unreliability in the context of AI systems means that small variations in the input may result in large differences in the output. For example, current AI technologies cannot reliably distinguish between important and unimportant content in a text, so they may shift their focus towards features that should carry less weight in determining the mark. A further challenge for AI language models is their inability to understand causal relationships when these are not obvious from the grammar of a response.
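The sketch below is a crude, hypothetical ‘automated marker’, not any real marking system, and the keyword list and answers are invented. It only illustrates the general point: when a scorer leans on surface features, a small change in wording can produce a large change in the mark.

```python
# A deliberately naive, hypothetical scorer that awards marks by
# counting keyword matches. Real AI marking models are far more
# sophisticated, but the sensitivity to surface form is analogous.
KEYWORDS = {"photosynthesis", "chlorophyll", "light", "glucose", "carbon"}

def toy_score(answer: str, max_mark: int = 5) -> int:
    """Award one mark per keyword present, capped at max_mark."""
    words = {w.strip(".,").lower() for w in answer.split()}
    return min(len(KEYWORDS & words), max_mark)

answer_a = "Plants use light and chlorophyll in photosynthesis to make glucose from carbon dioxide."
answer_b = "Plants harness sunlight and green pigment to turn CO2 into sugar."  # same idea, different wording

print(toy_score(answer_a))  # 5 - the wording happens to match the keyword list
print(toy_score(answer_b))  # 0 - a small change in surface form collapses the score
```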
In terms of bias, an AI system could treat some groups of people more favourably, or discriminate against them, based on characteristics such as sex, ethnicity or religious beliefs. This is a well-known issue in the field, to the extent that there are now ‘toxic language’ tests that score AI systems on how biased they are. Bias arises because AI systems ‘learn’ and reflect back the wider societal attitudes and prejudices that already exist in the data sets they are trained on. These data sets are often so large that it is not possible to review and moderate all of their content, so some ‘toxicity’ remains. This is a difficult issue to resolve; for now it can only be mitigated through continuous monitoring and correction, which is something that the teams behind ChatGPT and Claude attempt to do.
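As a very rough illustration of what such monitoring might involve, the sketch below compares the average marks an automated system has awarded to different groups. All of the data and names are invented, and a real bias audit would use much larger samples, proper statistical tests and controls for genuine differences in attainment.

```python
from statistics import mean

# Hypothetical, invented data: marks produced by an automated system,
# paired with a demographic group label purely for the purpose of an audit.
audit_sample = [
    {"group": "A", "mark": 14}, {"group": "A", "mark": 16},
    {"group": "A", "mark": 15}, {"group": "B", "mark": 11},
    {"group": "B", "mark": 12}, {"group": "B", "mark": 10},
]

def mean_mark_by_group(sample):
    """Compare average awarded marks across groups as a crude bias signal."""
    groups = {}
    for record in sample:
        groups.setdefault(record["group"], []).append(record["mark"])
    return {group: mean(marks) for group, marks in groups.items()}

print(mean_mark_by_group(audit_sample))
# {'A': 15, 'B': 11} - a systematic gap like this would prompt further investigation
```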
The critical question is whether we can manage the reliability, bias and accountability of AI systems to at least compete with existing best practice in marking. From my personal perspective, I do think that this is likely. Addressing issues such as explainability and bias will require a shift beyond current machine-learning architectures, which are based on training deep neural networks on huge amounts of unchecked data. However, efforts are already being made to design hybrid models that take into account some aspects of how people learn.
What’s next for assessment research on AI?
Exploring accountability and trust in relation to using AI in high-stakes assessment is an important focus of my current research. In particular, I’m acutely aware of the need to consider how these technologies will alter the relationship between assessors (teachers, assessment organisations, regulators, governments) and those who are assessed (learners).
This raises difficult questions, such as: who will be responsible for the AI’s ‘behaviour’ and, in particular, for its harmful effects? For example, if a teacher were to use school-approved automated marking software that was later found to give inaccurate or useless feedback to students, leading to lower performance, who would be at fault? Would it be the teacher, for relying on the software, or the school, for approving it? Could the software developer or licensor bear some responsibility? In summary, who should be expected to quality-assure the AI system? The risk is a situation whereby ‘the system’ is to blame, but no single individual or organisation can be held accountable.
Further questions arise when it comes to the role of experts: take, for example, a situation where a human examiner or reviewer overrides an AI-generated mark because it is too high. This decision could lead to a lower overall grade, which is then appealed by the school. The next step would normally be an independent review of marking. A key question is whether that reviewer should be a person, another AI system, or both, and who should have the final say. A human marker would face different consequences depending on the severity of their error; but if the software is unable to explain itself adequately, should the software developer be involved in the appeal to account for its behaviour, or should the faulty AI system simply be retrained? And what if retraining were to have consequences for historical decisions the system had already made?
These scenarios are hypothetical, but situations like them are routine in high-stakes assessment, and further questions could arise about other aspects of accountability.
Even if future AI systems can be trained to become more ethical and principled, these unresolved fundamental questions about who or what we consider to be an expert, who or what we can trust, and who or what can be found ‘guilty’ of a bad decision will require further research and stakeholder engagement to arrive at workable solutions.
Notes
1. I use ‘AI systems’ in this context as shorthand for ‘automated decision systems based on machine learning and related approaches’.
2. ChatGPT (Generative Pre-trained Transformer) is a conversational assistant (ie a chatbot) launched by OpenAI in November 2022.
3. Claude is a conversational assistant developed by Anthropic, a company founded by former OpenAI employees.