AI and reviews of marking – how is the future looking?
Published 16 Mar 2021
AQA’s Head of Research and Development identifies the barriers to using AI/machine learning systems as a second marker or reviewer, while considering opportunities AI can offer.
Artificial Intelligence (AI) and machine learning have been hot topics in educational assessment for quite some time now. As part of a wider interest in the practical and operational impacts of machine learning on our industry, I decided to explore the area of post-results services and reviews of marking. I wanted to identify the barriers to using AI as a second marker or reviewer, while considering the opportunities AI can offer.
It's worth highlighting that I use the term 'AI' here as a shorthand for 'machine learning systems' – computer programs that analyse data and return information that may be interesting, new or original – not computers that think like people.
AI and 'explainability'
We know that good marking and a clear explanation of decision making go hand in hand. A reviewer must be able to explain how they reached a decision, but this is something that AI struggles to do.
The concept of 'explainability' is the machine learning equivalent of academic judgement – it's the thing we question when we're unsure whether a decision was correct.
We might imagine a future where we're able to train an AI 'reviewer' to distinguish between levels of performance (eg to recognise the difference between 'describe', 'discuss' or 'evaluate'). However, this is of little value if the system is unable to explain in a simple, understandable way why it determined that a candidate 'described' rather than 'evaluated'.
Take, for example, a question on the effect of urban change on the environment in a Geography paper. A human marker would be able to point at a paragraph and show that the candidate is simply listing a series of facts (eg saying that urban population increased between 1950 and 2010, there are more roads etc.) instead of engaging in a deeper discussion of the meaning and implications of those facts. AI cannot do that, and will probably be unable to for a long time: it requires knowledge about the world that we take for granted but that the AI engine simply does not have.
At present, AI systems can at best highlight a few key words in the text that they used to determine that something is a description rather than an evaluation. This is a significant barrier: what the AI engine considers to be an explanation falls well short of what we would consider acceptable. If a machine simply returns a list of words when asked to explain why it reached one decision rather than another, we cannot be confident it has really 'understood' our question.
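To make the point concrete, here is a deliberately crude sketch of the kind of keyword-driven 'explanation' just described. The cue lists and the classifier are entirely hypothetical (nothing like a real marking system): the only account this program can give of its decision is the list of cue words it happened to match.

```python
# A deliberately crude, hypothetical sketch: a keyword-based classifier
# whose only possible "explanation" is the list of cue words it matched.

# Hypothetical cue lists (illustrative only).
DESCRIPTION_CUES = ["increased", "there are", "between"]
EVALUATION_CUES = ["because", "therefore", "implies", "suggests"]

def classify(text):
    """Return (label, explanation), where the 'explanation' is merely
    the cue words found in the text."""
    lowered = text.lower()
    desc_hits = [w for w in DESCRIPTION_CUES if w in lowered]
    eval_hits = [w for w in EVALUATION_CUES if w in lowered]
    if len(eval_hits) > len(desc_hits):
        return "evaluation", eval_hits
    return "description", desc_hits

label, explanation = classify(
    "Urban population increased between 1950 and 2010 and there are more roads."
)
# label is "description"; the 'explanation' is just
# ["increased", "there are", "between"]
```

A human marker asked the same question could point at the paragraph and explain why listing facts is not evaluation; a system like this can only hand back its word list.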
While state-of-the-art AI engines regularly make the news by engaging in seemingly natural back-and-forth conversations or producing written texts that are sometimes very realistic, this is different from being able to reflect on their output in a way that suggests actual understanding.
Accuracy vs communication
We could argue that people are also black boxes and cannot fully explain their reasoning. However, it's much easier to probe a human reviewer and to come to a mutual understanding! A machine will only accept questions that it's designed to answer.
Even if an AI engine can produce marks that are in line with a human marker's, that alone does not satisfy our commitment to transparency and accountability, which depends on the ability to explain the decisions taken.
My investigation has led me to think that perhaps our approach has been back to front. So much of AI research has been about making accurate and reliable predictions: detecting faces, recognising handwriting, forecasting weather, playing games, predicting people's health or finances. Meanwhile, research into explainability has lagged behind. The vast majority of top-performing AI systems cannot communicate why one prediction was made rather than another, and this makes them lose some of their sparkle in my eyes. Without that, how can there be a two-way interaction?
There have been a few advances on this front, but we're far from real understanding. Some credit rating systems provide explanations by showing how different scenarios would have led to different decisions, but what good does it do me to know that if I were 10 years younger and had a different degree my loan would have been approved? The system has no understanding of the concept of life circumstances.
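The scenario-based explanations these systems offer can be sketched in a few lines. Everything here is invented for illustration (the approval rule, the field names, the candidate tweaks): the point is that the system searches for a change that flips the decision and reports that change, nothing more.

```python
# A toy sketch of scenario-based ('counterfactual') explanation.
# The approval rule and applicant attributes are invented for illustration.

def approve(applicant):
    # Hypothetical rule: under 50, with income of at least 30,000.
    return applicant["age"] < 50 and applicant["income"] >= 30000

def explain_rejection(applicant, candidate_changes):
    """Try each candidate change in turn; report the first one that
    flips a rejection into an approval. The 'explanation' is just the
    change itself; there is no notion of life circumstances behind it."""
    if approve(applicant):
        return None  # nothing to explain
    for field, new_value in candidate_changes.items():
        changed = {**applicant, field: new_value}
        if approve(changed):
            return f"If your {field} were {new_value}, you would be approved."
    return "No single change considered would flip the decision."

message = explain_rejection(
    {"age": 58, "income": 35000},
    {"age": 48, "income": 60000},
)
# message: "If your age were 48, you would be approved."
```

The output is exactly the kind of answer criticised above: a true statement about the model, delivered with no grasp of what being ten years younger would actually mean for the applicant.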
In summary, it looks like AI systems have a long way to go before we can entrust them with critical decisions that go beyond finding interesting patterns in the data.
About our blog
Discover more about the work our researchers are doing to help improve and develop our assessments, expertise and resources.