AI Detectors Are Biased Against Non-Native English Speakers: The 2026 Evidence

AI detectors flag non-native English writing as machine-generated at rates as high as 97.8%. The bias is structural, the research is replicated, and the people paying for it are international students and ESL applicants. Here's what the 2026 record actually shows — and what a flagged student should do.

TL;DR

AI detectors systematically misclassify non-native English writing as AI-generated — Stanford's audit found a 61.3% false-positive rate on TOEFL essays, with one detector flagging 97.8%. The bias is built into how detectors work: ESL writing has lower perplexity, which is exactly the signal that says "AI" to a classifier. The 2025–2026 research record has replicated the finding repeatedly. If you've been flagged as a non-native English speaker, document everything, cite the research, and demand a human review.

In 2023, a team led by James Zou at Stanford ran 91 TOEFL essays — written by real non-native English speakers — through seven of the most widely used GPT detectors. The detectors flagged more than half of the essays as AI-generated. One flagged 97.8%. The same detectors run against essays written by US-born eighth graders produced near-zero false-positive rates. The paper, "GPT detectors are biased against non-native English writers," published in Patterns, was the first formal demonstration that the entire category of AI text detection has a structural bias problem.

Three years later — through the 2025 academic year and into Q1 2026 — that finding has been replicated, extended, and, in some cases, weaponized in court. Yet the tools that produced those numbers are still embedded in university workflows, still scoring international students' essays, and still triggering academic misconduct hearings. This post lays out what the research actually says, why the bias is mechanical rather than fixable with a tuning patch, and what an ESL student flagged in 2026 should do about it.

The Original Stanford Finding, and What It Actually Showed

The Liang et al. study tested seven detectors — GPTZero, OpenAI's classifier, Originality.ai, Quil, Sapling, Crossplag, and ZeroGPT — on two corpora: TOEFL essays from a Chinese educational forum, and essays from US eighth graders. Across the seven detectors, the average false-positive rate on the TOEFL corpus was 61.3%. On the eighth-grade corpus, it was effectively zero.

One result inside the paper is more striking than the headline number. When the researchers used ChatGPT to "enhance" the vocabulary of the TOEFL essays — replacing simpler word choices with more native-sounding phrasing — the average false-positive rate dropped by 49.7 percentage points, from 61.3% to 11.6%. In other words: making non-native English look more native-fluent reduced the rate at which it was flagged as AI-generated. The detectors were not measuring AI-ness. They were measuring something closer to "does this writer have access to a wide active vocabulary," and treating low scores on that proxy as evidence of machine generation.

That single result reframes the entire debate. A detector that flags "writing that looks like it came from someone with limited English" as "writing that came from a machine" is not a misconduct tool. It is a fluency classifier wearing a misconduct tool's labels.

Why the Bias Is Structural, Not a Bug

To see why the detectors fail this way, it helps to look at what they actually compute. The two main signals across virtually every commercial AI detector are perplexity and burstiness.

Perplexity measures how surprising each word is given the words around it, scored against a language model. Text where every next word is highly predictable scores low on perplexity. Text where word choices are unexpected, varied, or rare scores high. Detectors treat low perplexity as a signal of LLM output, because LLMs — by training objective — generate the most likely next token.

Burstiness measures variation in perplexity across a document. Human writing tends to "burst" — a complex sentence followed by a simple one, an unusual word followed by common ones. LLM writing tends to be more uniform. Detectors treat low burstiness as a second signal of machine generation.
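The two signals can be sketched in a few lines. The toy bigram model below is a minimal stand-in for the large language models real detectors actually score against; the function names, the reference corpus, and the add-one smoothing are illustrative choices, not any vendor's implementation.

```python
import math
from collections import Counter

def bigram_model(corpus: str):
    """Build an add-one-smoothed bigram model from a reference corpus."""
    words = corpus.lower().split()
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    vocab = len(unigrams)

    def prob(prev: str, word: str) -> float:
        # Laplace smoothing: unseen word pairs get a small nonzero probability
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)

    return prob

def perplexity(text: str, prob) -> float:
    """Exponentiated average surprisal per word: lower = more predictable."""
    words = text.lower().split()
    log_sum = sum(math.log(prob(p, w)) for p, w in zip(words, words[1:]))
    return math.exp(-log_sum / max(len(words) - 1, 1))

def burstiness(sentences: list[str], prob) -> float:
    """Standard deviation of per-sentence perplexity: low = uniform text."""
    scores = [perplexity(s, prob) for s in sentences]
    mean = sum(scores) / len(scores)
    return math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))

# A tiny reference corpus stands in for the detector's training distribution.
model = bigram_model("the cat sat on the mat . the cat sat on the rug .")

predictable = perplexity("the cat sat on the mat", model)       # common bigrams
surprising = perplexity("cat quantum mat epistemology", model)  # unseen bigrams
```

A real detector scores against a multi-billion-parameter model rather than a bigram table, but the decision logic is the same shape: `surprising` scores higher than `predictable`, and a document whose every sentence scores like `predictable` looks, to the classifier, like LLM output.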

Now overlay the linguistic profile of a non-native English writer. ESL writers typically work from a smaller productive vocabulary, draw on a narrower set of syntactic patterns, and reuse transitional phrases they have learned to trust. The result, statistically, is text with lower perplexity and lower burstiness — for human reasons that have nothing to do with AI. The detector cannot tell the difference. It sees the same statistical signature it was trained to flag, and it flags.

This is why retraining or threshold-tuning does not solve the problem. The features the detector relies on are correlated with English fluency, and English fluency is correlated with native-speaker status. A detector that ignores those features cannot detect AI; a detector that uses them cannot avoid penalizing ESL writers. There is no engineering exit from this trade-off so long as the underlying signal is what it is.
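A small simulation makes the trade-off concrete. The distributions below are invented for illustration — real per-document perplexity is not Gaussian, and the means and spreads are assumptions, not measured data — but the overlap structure matches what the research describes: LLM output and careful ESL writing both score low, so every threshold choice trades one error for the other.

```python
import random

random.seed(0)
N = 10_000

# Toy simulation, not detector data: per-document perplexity modeled as
# Gaussian. The parameters are invented; only the overlap pattern (AI text
# and ESL human writing both scoring low) reflects the published findings.
llm_scores = [random.gauss(20, 4) for _ in range(N)]     # AI-generated text
esl_scores = [random.gauss(24, 5) for _ in range(N)]     # non-native human writing
native_scores = [random.gauss(40, 8) for _ in range(N)]  # native human writing

def rates(threshold: float):
    """Flag any document scoring below the perplexity threshold as 'AI'."""
    false_neg = sum(s >= threshold for s in llm_scores) / N    # AI text missed
    fp_esl = sum(s < threshold for s in esl_scores) / N        # ESL humans flagged
    fp_native = sum(s < threshold for s in native_scores) / N  # native humans flagged
    return false_neg, fp_esl, fp_native

strict = rates(26)   # tuned to catch AI: misses little, flags most ESL writing
lenient = rates(18)  # tuned to spare ESL writers: lets most AI text through
```

Under these assumed distributions, the strict threshold catches over 90% of the simulated AI text but flags roughly two-thirds of the ESL documents while barely touching native writing — the same asymmetry as the Stanford numbers. Sliding the threshold down protects ESL writers only by letting most AI text through. No threshold escapes the overlap.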

The 2025–2026 Replication Record

The Stanford finding could have been an outlier. It hasn't been. Three lines of evidence have accumulated since 2024:

Independent academic replication. A 2025 paper in the journal Information ("Evaluating the Effectiveness and Ethical Implications of AI Detection Tools in Higher Education," MDPI) re-ran detection benchmarks on multilingual student corpora and found the same disparate pattern, with non-native English samples flagged at multiples of the native-speaker rate. The paper explicitly cites the bias as an ethical reason to bar detector output from misconduct findings.

Large-scale vendor and third-party audits. Independent test runs published through 2025 and early 2026 — including a 100,000-text audit of GPTZero by Ryne AI and roundup tests by walterwrites.ai and Skywork.ai — have continued to report ESL false-positive rates in the 50–60% range against native-speaker rates in the high single digits. Pangram and Originality.ai publish lower numbers in their own marketing, but those are vendor-controlled benchmarks, not independent audits.

Litigation. A publicly reported February 2025 case at Yale involved a student suing the university over a wrongful suspension after GPTZero flagged an exam. The complaint specifically cites discrimination against non-native English speakers, drawing directly on the Liang et al. findings. Cases of this shape have proliferated through 2025 and early 2026, and the underlying disparate-impact theory has begun to attract class-action interest. We covered the broader false-positive problem in detail when the case volume started rising.

The point is not that any single 2025–26 study is definitive. It's that the disparate-impact pattern keeps replicating across study designs, datasets, and detector versions. That is what a structural finding looks like.

Who Actually Pays the Cost

The students at the sharp end of this are not abstract. International graduate applicants writing personal statements in their second language, F-1 visa holders writing their first college essay in English, immigrant academics submitting peer-reviewed manuscripts, ESL high schoolers in US public schools, applicants to UK and Australian universities from non-English-medium secondary systems — all of them produce writing whose statistical profile triggers the same false signal.

The downstream consequences are uneven and serious. A flagged paper in an undergraduate course is a stressful but contained problem. A flagged personal statement during graduate admissions is a closed door with no appeal. A flagged dissertation chapter for a non-tenured international scholar is a career risk. A flagged TOEFL or admissions essay can affect visa status. The same detector behavior — the same 60-something-percent false-positive rate — produces different harms depending on where in the academic pipeline it lands.

And the harm compounds in ways the per-paper accuracy number hides. An ESL student flagged once carries the suspicion forward. The next paper is read with a primed reviewer. The next assignment is graded with the same skepticism. Even if no formal sanction lands, the relationship with the institution shifts. International student services offices have been documenting this pattern for two years; it does not show up in detector accuracy benchmarks because detector accuracy benchmarks measure single decisions, not careers.

What Universities Have Done So Far

Institutional response has been faster in some places than others. Washington State University reportedly terminated its Turnitin AI detection contract in February 2026 after 1,485 false positives in a single semester, with international and ESL students disproportionately represented in the flagged set. UCLA, UC San Diego, Cal State LA, Vanderbilt, Yale, Johns Hopkins, Northwestern, the University of Waterloo, and Curtin University have disabled Turnitin's AI detection entirely. Most of the University of California and Big Ten systems now instruct faculty to treat detector output as a non-evidentiary conversation starter rather than grounds for misconduct.

The institutions that have moved are also the ones with international student populations large enough to have surfaced the disparate-impact pattern in their own internal data. Smaller private colleges and regional state systems — institutions with both fewer international students and less capacity to revise policy — are still running score-triggered processes, and they produce most of the litigated wrongful-finding cases. The pattern of universities that have banned AI detectors tracks closely with student demographics.

For a deeper read on what a defensible institutional response looks like, see our piece on university policies that protect students. The short version: replace probability thresholds with an evidence standard, require human review, audit outcomes by ESL status quarterly, and align policy language with the vendor's own disclaimers.

What to Do If You've Been Flagged as an ESL Student

Policy moves slowly. If you are flagged this term, here is the practical playbook.

1. Preserve every artifact of the writing process. Google Docs version history is the single best piece of evidence — it shows the document being typed, edited, and revised over hours or days. If you wrote in another tool, gather autosave backups, draft files, browser history of the research you did, notes you took, and any sources you cited. The strongest defense against "this looks AI-generated" is "here is twenty hours of edit history showing me writing it."

2. Request the specific evidence in writing. Ask for the raw detector report, the threshold the institution used, the version of the detector that flagged you, and any other evidence in the case file. Most schools will not volunteer this. They are usually obligated to provide it on request.

3. Cite the research and the vendor disclaimers. In any written response or appeal, reference the Liang et al. 2023 study, the 2025 Information journal replication, and the vendor's own published guidance — Turnitin and GPTZero both publish documentation explicitly cautioning that scores should not be the sole basis for misconduct findings, and both note reduced accuracy on non-native English. An institution that sanctions you on a score the vendor itself disclaims is on shaky ground at appeal.

4. Ask for a human review and an interview. A face-to-face conversation about your work — the sources you used, the choices you made, the things you struggled with — surfaces authorship signal that no detector can. Most students who can discuss their own writing in detail are believed when they do. Push for that conversation early, in writing, and bring the version history with you.

5. Escalate before you accept a finding. Talk to the campus ombudsperson, the dean of students, and — critically — the international student services office, which has institutional incentive to push back on disparate-impact patterns. If the university proceeds despite the documented bias and the vendor disclaimers, talk to a lawyer. The fact pattern of "ESL student, low-perplexity writing, vendor-disclaimed score, no corroborating evidence" is the exact pattern current civil rights litigation is built on.

None of these steps require admitting fault, none of them require paying for a tool, and all of them shift the burden of proof back where it belongs.

Where the Detectors Themselves Are Going

Two trajectories are visible in 2026. The first is that detector vendors are softening their public claims. Turnitin's documentation now includes ESL-specific cautions, GPTZero's enterprise materials include reliability warnings, and Originality.ai has shifted some marketing toward "AI-assisted writing detection" framing rather than binary classifications. This is the legal and reputational ground moving beneath the category.

The second is that the underlying technical problem is not getting solved. Newer detectors do not have lower ESL false-positive rates; if anything, the gap has widened as the detectors have been tuned more aggressively against state-of-the-art LLM output, which itself has become more polished and harder to distinguish from careful human writing. The trade-off is fundamental, not an artifact of any specific generation of tools.

The honest read is that AI detection, as a category applied to ESL writing, is not going to become reliable enough for academic-integrity decisions in the foreseeable future. That is a hard thing for institutions invested in the tooling to admit, but it is what the technical literature actually supports.

A Note on Humanizers and Defensive Tools

There is an uncomfortable tension in writing about this from a humanizer's blog. Tools like ours exist partly because the detection environment is broken — students who have done nothing wrong still need a way to defend authentic writing against a flawed classifier. ToHuman and AI humanizers built for educational use are part of that defensive landscape. We won't pretend otherwise.

But the larger point is the right one: ESL students should not need to run their own writing through a humanizer to avoid being flagged for cheating they did not commit. The fix is at the institutional and policy level, not at the writer's desk. The defensive tools exist because the policy fix has been slow. They should not be necessary, and the better long-term outcome is the one where they aren't.

Frequently Asked Questions

Are AI detectors biased against non-native English speakers?

Yes. The Stanford-led Liang et al. study found that seven leading detectors misclassified TOEFL essays as AI-generated at an average false-positive rate of 61.3%, with one detector flagging 97.8%. Native-speaker writing was flagged at near-zero rates in the same test. Replications through 2025 and into 2026 — including a peer-reviewed paper in Information and large-scale third-party audits — have reproduced the disparity. The bias is structural, not a tuning artifact.

Why do AI detectors flag ESL writing as AI-generated?

Detectors classify text using perplexity (how predictable each word is) and burstiness (how perplexity varies across the document). Non-native English writers tend to draw on smaller working vocabularies and reuse phrasing — patterns that produce lower perplexity and lower burstiness, the same statistical signature LLM output produces. The classifier cannot distinguish between a careful ESL writer and GPT-4 because the feature distributions overlap.

What should I do if an AI detector flagged my essay and I'm a non-native English speaker?

Preserve every artifact of the writing process (Google Docs version history is the strongest), request the specific evidence in writing, cite the published research and the vendor's own ESL caveats, ask for a human review and an interview, and escalate through the campus ombudsperson, international student services, and — if the institution proceeds despite the documented bias — legal counsel. Don't accept a finding before you've exhausted the appeals path.

Which AI detector is most accurate for ESL writers?

None of the major commercial detectors is accurate enough on non-native English to ground a misconduct decision. Pangram and Originality.ai publish lower ESL numbers in their own marketing, but independent replication is limited. GPTZero and Turnitin both ship vendor-side caveats specifically warning about reduced reliability on shorter or non-native texts. The honest answer is that the category as a whole is not fit for high-stakes academic use against ESL writers — and the institutions that have moved fastest in 2026 are the ones that recognized this in their internal data first.

Have universities responded to the ESL bias problem?

Some have. Washington State University reportedly terminated its Turnitin AI detection contract in February 2026 after 1,485 false positives in a single semester. UCLA, Vanderbilt, Yale, Curtin, Waterloo, Northwestern, and others have disabled Turnitin's AI detection entirely. Most University of California and Big Ten campuses now instruct faculty to treat detector output as a conversation starter, not evidence. Smaller and regional institutions are slower to revise policy and produce most of the wrongful-finding cases against international students.

Published April 27, 2026 by the ToHuman team.