Why 1 in 4 OET Writing Fails Gets Overturned: AI Grading Bias and What It Means for Your Score

Jinish Rajan

Assistant Director of Nursing · OET Certified Teacher · Founder, FluencyX

The Problem With “Getting Feedback” on Your OET Letter

Most nurses preparing for OET do one of three things when they want feedback on a practice letter: they ask a tutor, paste it into ChatGPT, or submit it to an online marking service and wait several days.

Each of these approaches has a serious limitation.

Human tutors are inconsistent and slow. A tutor working from memory of the OET rubric, without a verified case note blueprint, cannot tell you definitively whether your Content choices were correct. They may have a strong instinct. They may even be right. But instinct and rubric-based scoring are not the same thing.

Generic AI tools are fast — but they are systematically biased toward the wrong outcomes for OET preparation. We will come back to why in a moment.

Online marking services are slow by design, and their accuracy depends entirely on the quality of the individual examiner marking your script. OET’s own rechecking statistics tell a clear story about how much variability exists in human marking.

The Rechecking Data: What It Tells Us About Grading Reliability

OET offers a formal rechecking process for the Writing and Speaking sub-tests. The process involves a senior examiner who was not part of the original marking reviewing the script and issuing a revised result if warranted. When a score changes, the rechecking fee is refunded in full.

The existence of a rechecking process is standard practice in high-stakes language testing. What is significant is the rate at which results change — and for which sub-test.

Rechecking is most commonly requested — and most commonly successful — for the Writing sub-test. The reason is structural: Writing involves multi-criterion judgements (six separate analytic ratings per script), clinical content decisions (did this candidate include the right information?), and register calibration (was this appropriate for the stated reader?). These are tasks where even trained examiners differ.

The linguistic research on inter-rater reliability (IRR) in clinical assessments is clear: individual raters vary. Examiner personality affects scoring tendency. Paired examiners show less variability than individual ones, but variability does not disappear. The OET rechecking system exists precisely because the difference between a C+ and a B on Writing — a single grade boundary — can determine whether a nurse is eligible to register with NMBI in Ireland, AHPRA in Australia, or the NMC in the UK.

If you are sitting on a result that is 10–20 points below a required threshold, rechecking is a legitimate strategic option. Many candidates whose result was upgraded on a recheck say they wish they had requested it sooner.

How AI Graders Get OET Writing Wrong

Automated essay scoring has been studied since the 1960s. The early promise was obvious: consistent scoring, immediate feedback, no examiner fatigue or personality effects. The reality has been more complicated.

Contemporary large language model (LLM) graders — including ChatGPT, Gemini, and other general-purpose AI assistants — show a well-documented pattern in writing assessment research: they reward surface fluency over deeper reasoning and exhibit a positivity bias, rating writing higher than human raters in ambiguous cases.

For OET Writing specifically, this creates a dangerous mismatch. Here is how:

Conciseness & Clarity (7 marks) — OET rewards efficient, actionable language. A 200-word letter that gives the reader exactly what they need is better than a 250-word letter with padding. AI tools trained on large corpora tend to reward more elaborate responses. A nurse who writes “He was seen in clinic today and found to be experiencing some degree of respiratory discomfort” gets similar AI feedback to one who writes “He reported exertional dyspnoea.” OET examiners would not score these equally.

Content (7 marks) — This is the criterion that generic AI simply cannot score. Content is assessed by cross-referencing your letter against the case notes: which details were clinically relevant? Which were distractors? What level of specificity was appropriate for the stated reader? Without access to the case note blueprint and a verified key, an AI tool is guessing. It may give you a positive score on a letter that missed three critical inclusions — because the grammar was clean and the vocabulary impressive.

Genre & Style (7 marks) — OET Writing requires clinical register: objective, neutral, professional, scope-appropriate. Generic AI feedback often defaults to “writing sounds professional” without distinguishing between a consumer-facing communication style and a clinician-to-clinician register. These are meaningfully different, and examiners notice the gap.

The AI feedback trap: what you are actually measuring

| What AI feedback measures | What OET examiners measure |
| --- | --- |
| Grammar and syntax accuracy | Language criterion (1 of 6) |
| Vocabulary range and fluency | Partly Genre & Style |
| Approximate structural organisation | Organisation & Layout |
| Overall impression / perceived quality | Holistic feeling only |
| Whether information seems plausible | Not Content: this requires the case note key |

Relying on AI feedback alone gives you an accurate signal on roughly 2–3 of the 6 criteria.

A Concrete Example of AI Scoring Failure

Consider this letter excerpt for a case where the case notes include a critical allergy (penicillin) that the candidate failed to mention:

Candidate letter (excerpt): “Mrs. Chen is a 47-year-old patient with a 3-week history of productive cough, low-grade fever, and bilateral basal crepitations on examination. Recent chest X-ray confirms consolidation in the right lower lobe, consistent with community-acquired pneumonia. I am referring her for inpatient assessment and antibiotic management.”

This letter reads well. The grammar is clean. The structure is logical. The register is clinical. A generic AI would score it highly — 6 or 7 out of 7 on most criteria.

But the OET examiner, working from the verified case note key, would note that the allergy to penicillin is clinically essential for a reader who will be prescribing antibiotic therapy. Omitting it is a Content failure. In a real clinical scenario, it could be a patient safety failure.

The AI scored it as excellent. The examiner marked it down on Content. The candidate was surprised when they received their result.

This is not a hypothetical. Variants of this scenario play out repeatedly in OET preparation when candidates rely on generic AI for feedback and are not warned about the Content criterion’s specific demands.
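To make the idea of checking a letter against a key concrete, here is a minimal sketch in Python. It is an illustration only, not how OET examiners or FluencyX actually score Content: the `ESSENTIALS` key, its item names, and the trigger phrases are all hypothetical, and simple phrase matching is far cruder than a clinical judgement. It does show why a fluent letter can still fail a key-based check.

```python
# Hypothetical answer key for the Mrs. Chen case. Each essential item maps to
# phrases that would count as mentioning it. A real key would be derived from
# the verified case notes; these entries are assumptions for illustration.
ESSENTIALS = {
    "penicillin allergy": ["penicillin", "allergic to", "allergy"],
    "presenting complaint": ["productive cough"],
    "imaging result": ["chest x-ray", "consolidation"],
    "referral request": ["referring her", "refer her"],
}


def missing_essentials(letter, essentials):
    """Return the key items that none of the trigger phrases find in the letter."""
    text = letter.lower()
    missing = []
    for item, phrases in essentials.items():
        # An item counts as covered if any one of its phrases appears.
        if not any(phrase in text for phrase in phrases):
            missing.append(item)
    return missing


letter = (
    "Mrs. Chen is a 47-year-old patient with a 3-week history of productive "
    "cough, low-grade fever, and bilateral basal crepitations on examination. "
    "Recent chest X-ray confirms consolidation in the right lower lobe, "
    "consistent with community-acquired pneumonia. I am referring her for "
    "inpatient assessment and antibiotic management."
)

print(missing_essentials(letter, ESSENTIALS))
```

Run against the excerpt above, the only item reported missing is the penicillin allergy. Nothing about the letter's grammar or register would reveal that omission, because the check depends entirely on having the key.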

What Accurate Feedback Actually Looks Like

Accurate OET Writing feedback requires two things that generic AI does not have: a verified case note blueprint, and criterion-specific evaluation aligned with OET’s analytic rubric.

FluencyX was built to solve exactly this problem. Your letter is checked against the case notes, not evaluated in isolation, to determine whether your Content choices were clinically appropriate. The feedback is broken down by all 6 criteria, so you know whether your 340 (just short of the 350 needed for a B) came from a weak Language score (grammar errors) or a weak Content score (wrong information). Those are completely different preparation problems with completely different solutions.

The distinction matters because it determines what you should do next. If your Language criterion is pulling you down, grammar practice is the answer. If your Content criterion is the problem, you need to practice reading case notes clinically — understanding which details belong in a referral versus a discharge, how to calibrate information to a specialist versus a GP, and how to exclude distractors without omitting essentials.

The Practice Cycle That Actually Moves Your Score

The most effective preparation for OET Writing follows this structure: practice letter, immediate criterion-specific feedback, identify the weakest criterion, targeted improvement, repeat.

Without criterion-specific feedback, the practice cycle stalls. You produce letters, you are told they are “good” or “need work,” and you don’t know which of the 6 criteria is your ceiling. You retake. You get the same result.

With criterion-specific feedback at speed, you can run multiple iterations in a single study session. You learn, for example, that your Organisation & Layout is strong (you naturally write thematically, not chronologically) but your Content is inconsistent (you include background history that the reader does not need). That targeted insight changes your preparation focus immediately.
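The "identify the weakest criterion" step can be sketched in a few lines of Python. This is an assumption-laden illustration, not FluencyX's method: the practice scores are invented, and the maximum of 3 marks for Purpose reflects my understanding of the published OET criteria (the article itself only states the 7-mark maxima). Scores are compared as a fraction of the marks available so that Purpose is not unfairly flagged for having a smaller scale.

```python
# Maximum marks per criterion. The 7-mark criteria follow the article; the
# 3-mark maximum for Purpose is an assumption based on the published OET rubric.
MAX_MARKS = {
    "Purpose": 3,
    "Content": 7,
    "Conciseness & Clarity": 7,
    "Genre & Style": 7,
    "Organisation & Layout": 7,
    "Language": 7,
}


def weakest_criterion(scores):
    """Return the criterion with the lowest share of its available marks."""
    return min(scores, key=lambda name: scores[name] / MAX_MARKS[name])


# Hypothetical scores for one practice letter.
practice_scores = {
    "Purpose": 3,
    "Content": 4,
    "Conciseness & Clarity": 5,
    "Genre & Style": 6,
    "Organisation & Layout": 6,
    "Language": 5,
}

print(weakest_criterion(practice_scores))
```

With these invented scores the function returns Content, which would point the next study session at case note selection rather than grammar drills.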

Get the Feedback Generic AI Can’t Give You

FluencyX scores your OET Writing against all 6 criteria — including Content, assessed against verified case note blueprints. Know exactly which criterion is limiting your score.

Written by Jinish Rajan

Assistant Director of Nursing at a leading Academic Teaching Hospital, Dublin, and Health Informatics specialist. OET Certified Teacher, MSc Cardiovascular Nursing, MSc Leadership, and software developer with 20 years of clinical experience in Ireland's healthcare system.