AI Detectors Are Biased Against Non-Native English Speakers: The 2026 Evidence
Stanford measured a 53.5% false-positive rate for AI detectors on TOEFL essays by non-native English writers — 1.4× to 50× the rate on native-speaker writing. Our May 2026 GPTZero primary research replicates the same direction on informal ESL writing. The bias is structural, replicated four times across three years, and the people paying for it are international students. Here's what the record actually shows and what a flagged ESL student should do.
TL;DR
Stanford's Liang et al. study measured an average 53.5% false-positive rate across seven leading AI detectors on TOEFL essays by non-native English writers, with one detector flagging 97.8%. Native-speaker writing was flagged at near-zero rates in the same test. ToHuman's May 2026 primary research on GPTZero replicates the direction: a 16.0% false-positive rate on informal ESL writing versus 12.9% on combined native-English corpora. The bias is structural: ESL writing has lower perplexity, which is the exact signal detectors treat as "AI." The pattern has now replicated four times across three years. If you're a non-native English speaker and you've been flagged, document everything, cite the 53.5% Liang number plus the 2025–2026 replications, and demand a human review.
The bias is real, measurable, and replicated four times in three years. In 2023, a team led by James Zou at Stanford ran 91 TOEFL essays, written by real non-native English speakers, through seven of the most widely used GPT detectors. The average false-positive rate was 53.5%, with one detector flagging 97.8%. The same detectors run against essays by US-born eighth graders produced near-zero rates. The paper, "GPT detectors are biased against non-native English writers" (Patterns, 2023), was the first formal demonstration that the category had a structural bias problem. Three years later, in May 2026, we submitted 250 sentences of real ESL writing from Reddit language-learning communities to GPTZero's current production API. GPTZero flagged 16.0% of them as AI, a 1.2–1.4× lift over Wikipedia and academic prose run through the same model. Different corpus, three years on, same direction. Below is the four-study record, the mechanical reason the bias persists, and the appeal playbook for a flagged ESL student in 2026.
If you've been flagged or just want to check whether your own writing trips a detector, ToHuman is 100% free, no signup, no card — paste up to 700 characters, see if the rewrite reads like you. We exist partly because the detection environment is broken for ESL writers; the rest of this post explains why.
The Original Stanford Finding, and What It Actually Showed
The Liang et al. study tested seven detectors (GPTZero, OpenAI's classifier, Originality.ai, Quill.org, Sapling, Crossplag, and ZeroGPT) on two corpora: TOEFL essays from a Chinese educational forum, and essays from US eighth graders. Across the seven detectors, the average false-positive rate on the TOEFL corpus was 53.5%. On the eighth-grade corpus, it was effectively zero.
One result inside the paper is more striking than the headline number. When the researchers used ChatGPT to "enhance" the vocabulary of the TOEFL essays — replacing simpler word choices with more native-sounding phrasing — the average false-positive rate dropped by 49.7 percentage points, from 53.5% to 3.8%. In other words: making non-native English look more native-fluent reduced the rate at which it was flagged as AI-generated. The detectors were not measuring AI-ness. They were measuring something closer to "does this writer have access to a wide active vocabulary," and treating low scores on that proxy as evidence of machine generation.
That single result reframes the entire debate. A detector that flags "writing that looks like it came from someone with limited English" as "writing that came from a machine" is not a misconduct tool. It is a fluency classifier wearing a misconduct tool's labels.
Why the Bias Is Structural, Not a Bug
To see why the detectors fail this way, it helps to look at what they actually compute. The two main signals across virtually every commercial AI detector are perplexity and burstiness.
Perplexity measures how surprising each word is given the words that precede it, scored against a language model. Text where every next word is highly predictable scores low on perplexity. Text where word choices are unexpected, varied, or rare scores high. Detectors treat low perplexity as a signal of LLM output, because LLMs are trained to predict the next token and tend to emit high-probability continuations.
Burstiness measures variation in perplexity across a document. Human writing tends to "burst" — a complex sentence followed by a simple one, an unusual word followed by common ones. LLM writing tends to be more uniform. Detectors treat low burstiness as a second signal of machine generation.
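To make the two signals concrete, here is a minimal sketch of how they could be computed. It assumes per-token probabilities have already been obtained from some scoring language model; the numbers below are invented for illustration, the function names are hypothetical, and commercial detectors do not publish their exact formulas. Burstiness is approximated here as the standard deviation of per-sentence perplexity, which is one common proxy and not necessarily what any vendor uses.

```python
import math
import statistics

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

def burstiness(sentences):
    """Spread of per-sentence perplexity (population standard deviation)."""
    return statistics.pstdev([perplexity(s) for s in sentences])

# Invented token probabilities, one inner list per sentence.
# "uniform": every sentence is equally predictable.
uniform = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
# "varied": a predictable sentence followed by a surprising one.
varied = [[0.5, 0.5, 0.5, 0.5], [0.1, 0.1, 0.1, 0.1]]

for name, doc in [("uniform", uniform), ("varied", varied)]:
    per_sentence = [round(perplexity(s), 2) for s in doc]
    print(name, per_sentence, round(burstiness(doc), 2))
```

In this toy input the uniform document has the same perplexity in every sentence and so no burstiness, while the varied document swings between a predictable and a surprising sentence. A detector built on these signals would read the first profile as more machine-like.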
Now overlay the linguistic profile of a non-native English writer. ESL writers typically work from a smaller productive vocabulary, draw on a narrower set of syntactic patterns, and reuse transitional phrases they have learned to trust. The result, statistically, is text with lower perplexity and lower burstiness — for human reasons that have nothing to do with AI. The detector cannot tell the difference. It sees the same statistical signature it was trained to flag, and it flags.
This is why retraining or threshold-tuning does not solve the problem. The features the detector relies on are correlated with English fluency, and English fluency is correlated with native-speaker status. A detector that ignores those features cannot detect AI; a detector that uses them cannot avoid penalizing ESL writers. There is no engineering exit from this trade-off so long as the underlying signal is what it is.
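The trade-off can be seen in a toy threshold sweep. Everything in this sketch is an assumption made for illustration: the perplexity scores are invented (not measured), the three groups are hypothetical, and the rule "flag anything below the threshold" is a simplification of what real detectors do.

```python
# Hypothetical perplexity scores, invented for illustration only (not measured data).
NATIVE_HUMAN = [42, 55, 61, 48, 70, 38, 66, 52]
ESL_HUMAN = [24, 31, 19, 36, 28, 44, 22, 33]
LLM_OUTPUT = [15, 21, 18, 26, 23, 17, 29, 20]

def flag_rate(scores, threshold):
    """Share of texts flagged as AI, i.e. scoring below the perplexity threshold."""
    return sum(score < threshold for score in scores) / len(scores)

def sweep(thresholds):
    """Flag rates for each group at each candidate threshold."""
    rows = []
    for t in thresholds:
        rows.append({
            "threshold": t,
            "llm_caught": flag_rate(LLM_OUTPUT, t),
            "esl_flagged": flag_rate(ESL_HUMAN, t),
            "native_flagged": flag_rate(NATIVE_HUMAN, t),
        })
    return rows

for row in sweep([20, 25, 30, 40]):
    print(row)
```

In this invented data, the threshold of 30 that catches every LLM sample also flags half of the ESL samples and none of the native ones. Lowering the threshold to spare ESL writers lets more LLM output through. Because the ESL and LLM score ranges overlap, no threshold separates them.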
The 2025–2026 Replication Record
The Stanford finding could have been an outlier. It hasn't been. Four lines of evidence have accumulated across 2025 and 2026: three external, one our own.
Independent academic replication. A 2025 paper in the journal Information ("Evaluating the Effectiveness and Ethical Implications of AI Detection Tools in Higher Education," MDPI) re-ran detection benchmarks on multilingual student corpora and found the same disparate pattern, with non-native English samples flagged at multiples of the native-speaker rate. The paper explicitly cites the bias as an ethical reason to bar detector output from misconduct findings.
Large-scale vendor and third-party audits. Independent test runs published through 2025 and early 2026 — including a 100,000-text audit of GPTZero by Ryne AI and roundup tests by walterwrites.ai and Skywork.ai — have continued to report ESL false-positive rates in the 50–60% range against native-speaker rates in the high single digits. Pangram and Originality.ai publish lower numbers in their own marketing, but those are vendor-controlled benchmarks, not independent audits.
ToHuman's own May 2026 GPTZero primary research. We submitted 861 verified pre-LLM human sentences from four corpora (Wikipedia, PubMed 2015–2019, Wikipedia news articles, and 250 sentences of real ESL writing from r/WriteStreakEN, r/EnglishLearning, and r/IELTS) to GPTZero's v2/predict/text endpoint at model version 2026-05-11-base. The overall false-positive rate was 13.8%. The ESL corpus alone was flagged at 16.0%, against 12.9% combined on Wikipedia, PubMed, and news: a 1.2–1.4× lift in our informal-writing sample. The lift is smaller than Liang's 53.5% on formal TOEFL essays because Reddit ESL writing carries more natural grammatical variation that reads as more distinctively human; the direction is identical. The full 861-sentence per-corpus breakdown, raw data, and the top-10 falsely flagged ESL sentences are in the research post listed under Resources.
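As a quick consistency check on the ToHuman figures, the sketch below back-calculates approximate flagged counts from the rounded percentages quoted above. The counts are an inference from those published rates, not the raw data, and the variable names are ours.

```python
TOTAL_SENTENCES = 861
ESL_SENTENCES = 250
ESL_FPR = 0.160      # reported ESL false-positive rate
NATIVE_FPR = 0.129   # reported combined Wikipedia + PubMed + news rate

native_sentences = TOTAL_SENTENCES - ESL_SENTENCES

# Approximate flagged counts implied by the rounded percentages (an assumption).
esl_flagged = round(ESL_SENTENCES * ESL_FPR)
native_flagged = round(native_sentences * NATIVE_FPR)

# Recombine the two groups and compare with the reported 13.8% overall rate.
overall_fpr = (esl_flagged + native_flagged) / TOTAL_SENTENCES
lift = ESL_FPR / NATIVE_FPR

print(f"implied counts: ESL {esl_flagged}/{ESL_SENTENCES}, "
      f"native {native_flagged}/{native_sentences}")
print(f"overall FPR ~ {overall_fpr:.1%}, ESL lift vs combined ~ {lift:.2f}x")
```

The recombined overall rate lands on the reported 13.8%, and the ratio against the combined native-English corpora sits at the low end of the 1.2–1.4× range. The per-corpus rates that produce the upper end are not given in this post.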
Litigation. A publicly reported February 2025 case at Yale involved a student suing the university over a wrongful suspension after GPTZero flagged an exam. The complaint specifically cites discrimination against non-native English speakers, drawing directly on the Liang et al. findings. Cases of this shape have proliferated through 2025 and into 2026, and the underlying disparate-impact theory has begun to attract class-action interest. We covered the broader false-positive problem in detail when the case volume started rising.
The point is not that any single 2025–26 study is definitive. It's that the disparate-impact pattern keeps replicating across study designs, datasets, and detector versions. That is what a structural finding looks like.
Who Actually Pays the Cost
The students at the sharp end of this are not abstract. International graduate applicants writing personal statements in their second language, F-1 visa holders writing their first college essay in English, immigrant academics submitting peer-reviewed manuscripts, ESL high schoolers in US public schools, applicants to UK and Australian universities from non-English-medium secondary systems — all of them produce writing whose statistical profile triggers the same false signal.
The downstream consequences are uneven and serious. A flagged paper in an undergraduate course is a stressful but contained problem. A flagged personal statement during graduate admissions is a closed door with no appeal. A flagged dissertation chapter for a non-tenured international scholar is a career risk. A flagged TOEFL or admissions essay can affect visa status. The same detector behavior, with the same false-positive rate of roughly one in two, produces different harms depending on where in the academic pipeline it lands.
And the harm compounds in ways the per-paper accuracy number hides. An ESL student flagged once carries the suspicion forward. The next paper is read with a primed reviewer. The next assignment is graded with the same skepticism. Even if no formal sanction lands, the relationship with the institution shifts. International student services offices have been documenting this pattern for two years; it does not show up in detector accuracy benchmarks because detector accuracy benchmarks measure single decisions, not careers.
What Universities Have Done So Far
Institutional response has been faster in some places than others. Washington State University reportedly terminated its Turnitin AI detection contract in February 2026 after a reported 1,485 false positives in a single semester, with international and ESL students disproportionately represented in the flagged set. UCLA, UC San Diego, Cal State LA, Vanderbilt, Yale, Johns Hopkins, Northwestern, the University of Waterloo, and Curtin University have disabled Turnitin's AI detection entirely. Most of the University of California and Big Ten systems now instruct faculty to treat detector output as a non-evidentiary conversation starter rather than grounds for misconduct.
The institutions that have moved are also the ones with international student populations large enough to have surfaced the disparate-impact pattern in their own internal data. Smaller private colleges and regional state systems, which have both fewer international students and less capacity to revise policy, are still running score-triggered processes, and they produce most of the litigated wrongful-finding cases. The pattern of universities that have banned AI detectors tracks closely with student demographics.
For a deeper read on what a defensible institutional response looks like, see our piece on university policies that protect students. The short version: replace probability thresholds with an evidence standard, require human review, audit outcomes by ESL status quarterly, and align policy language with the vendor's own disclaimers.
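The quarterly audit by ESL status can be as simple as comparing flag rates between groups. The sketch below is one hypothetical way to do it: the record format, group labels, and sample counts are all invented for illustration and do not reflect any institution's data or any vendor's export format.

```python
from collections import defaultdict

def audit_flag_rates(cases):
    """Return the share of flagged submissions per group.

    `cases` is an iterable of (group, was_flagged) pairs.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for group, was_flagged in cases:
        totals[group] += 1
        flagged[group] += int(was_flagged)
    return {group: flagged[group] / totals[group] for group in totals}

def rate_ratio(rates, group, reference):
    """How many times more often `group` is flagged than `reference`."""
    if rates[reference] == 0:
        return float("inf") if rates[group] > 0 else 1.0
    return rates[group] / rates[reference]

# Hypothetical quarter of records, invented for illustration.
quarter = (
    [("esl", True)] * 6 + [("esl", False)] * 14
    + [("non_esl", True)] * 4 + [("non_esl", False)] * 36
)

rates = audit_flag_rates(quarter)
print(rates)
print(f"ESL students flagged {rate_ratio(rates, 'esl', 'non_esl'):.1f}x as often")
```

A ratio well above 1 in real outcome data would be the kind of disparate-impact signal the audit is meant to surface. What ratio should trigger a policy review is an institutional decision this sketch does not answer.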
What to Do If You've Been Flagged as an ESL Student
Policy moves slowly. If you are flagged this term, here is the practical playbook.
1. Preserve every artifact of the writing process. Google Docs version history is the single best piece of evidence — it shows the document being typed, edited, and revised over hours or days. If you wrote in another tool, gather autosave backups, draft files, browser history of the research you did, notes you took, and any sources you cited. The strongest defense against "this looks AI-generated" is "here is twenty hours of edit history showing me writing it."
2. Request the specific evidence in writing. Ask for the raw detector report, the threshold the institution used, the version of the detector that flagged you, and any other evidence in the case file. Most schools will not volunteer this. They are usually obligated to provide it on request.
3. Cite the research and the vendor disclaimers. In any written response or appeal, reference (a) Liang et al. 2023 — average 53.5% false-positive rate across seven detectors on TOEFL essays by non-native English writers, with one detector hitting 97.8%; (b) the 2025 Information journal replication on multilingual student corpora; (c) ToHuman's May 2026 GPTZero benchmark showing 16.0% false-positive rate on real ESL writing under the current production model; and (d) the vendor's own published guidance — Turnitin and GPTZero both publish documentation explicitly cautioning that scores should not be the sole basis for misconduct findings, and both note reduced accuracy on non-native English. An institution that sanctions you on a score the vendor itself disclaims is on shaky ground at appeal.
4. Ask for a human review and an interview. A face-to-face conversation about your work — the sources you used, the choices you made, the things you struggled with — surfaces authorship signal that no detector can. Most students who can discuss their own writing in detail are believed when they do. Push for that conversation early, in writing, and bring the version history with you.
5. Escalate before you accept a finding. Talk to the campus ombudsperson, the dean of students, and — critically — the international student services office, which has institutional incentive to push back on disparate-impact patterns. If the university proceeds despite the documented bias and the vendor disclaimers, talk to a lawyer. The fact pattern of "ESL student, low-perplexity writing, vendor-disclaimed score, no corroborating evidence" is the exact pattern current civil rights litigation is built on.
None of these steps require admitting fault, none of them require paying for a tool, and all of them shift the burden of proof back where it belongs. For a deeper walkthrough of the Turnitin-specific playbook — the perplexity signal mechanics, how the detector works, and what an appeal should cite — see our complete guide to Turnitin AI detection in 2026.
If you want to defensively check your own writing, paste it into ToHuman, see whether the rewrite reads like you, then submit the version you stand behind. 100% free, no signup, no card, up to 700 characters per pass. We exist because the detection environment is broken for ESL writers: the long-term fix is institutional, the practical fix is at your desk.
Where the Detectors Themselves Are Going
Two trajectories are visible in 2026. The first is that detector vendors are softening their public claims. Turnitin's documentation now includes ESL-specific cautions, GPTZero's enterprise materials include reliability warnings, and Originality.ai has shifted some marketing toward "AI-assisted writing detection" framing rather than binary classifications. This is the legal and reputational ground moving beneath the category.
The second is that the underlying technical problem is not getting solved. Newer detectors do not have lower ESL false-positive rates; if anything, the gap has widened as the detectors have been tuned more aggressively against state-of-the-art LLM output, which itself has become more polished and harder to distinguish from careful human writing. The trade-off is fundamental, not an artifact of any specific generation of tools.
The honest read is that AI detection, as a category applied to ESL writing, is not going to become reliable enough for academic-integrity decisions in the foreseeable future. That is a hard thing for institutions invested in the tooling to admit, but it is what the technical literature actually supports. For the year-by-year accuracy data, including whether benchmark improvements have actually closed the false-positive gap in real student populations, see "Are AI Detectors Getting Better in 2026?"
A Note on Humanizers and Defensive Tools
There is an uncomfortable tension in writing about this from a humanizer's blog. Tools like ours exist partly because the detection environment is broken — students who have done nothing wrong still need a way to defend authentic writing against a flawed classifier. ToHuman and AI humanizers built for educational use are part of that defensive landscape. We won't pretend otherwise.
But the larger point is the right one: ESL students should not need to run their own writing through a humanizer to avoid being flagged for cheating they did not commit. The fix is at the institutional and policy level, not at the writer's desk. The defensive tools exist because the policy fix has been slow. They should not be necessary, and the better long-term outcome is the one where they aren't.
Resources
- We Ran 861 Human Sentences Through GPTZero — 13.8% Were Flagged AI (May 2026) — our own primary research, with the 16.0% ESL sub-rate referenced above.
- Are AI Detectors Getting Better in 2026? — three-year accuracy timeline across the major detectors, including the 2023 vs 2026 comparison table.
- AI detection false positives — the 2026 data — the broader research corpus on detector accuracy across populations.
- What universities get wrong about AI detection policies in 2026 — the policy companion to this research piece.
- Universities banning AI detection in 2026 — institutional decisions, dates, and the populations that drove them.
- AI humanizers for educational use — how ToHuman thinks about the education market.
- Liang et al. 2023 — GPT detectors are biased against non-native English writers — the original Stanford study (arXiv).
Frequently Asked Questions
Are AI detectors biased against non-native English speakers?
Yes. The Stanford-led Liang et al. study found that seven leading detectors misclassified TOEFL essays as AI-generated at an average false-positive rate of 53.5%, with one detector flagging 97.8%. Native-speaker writing was flagged at near-zero rates in the same test. Replications through 2025 and our own May 2026 GPTZero primary research (16.0% FPR on informal ESL writing vs 12.9% on native-English corpora) reproduce the same direction. The bias is structural, not a tuning artifact.
Why do AI detectors flag ESL writing as AI-generated?
Detectors classify text using perplexity (how predictable each word is) and burstiness (how perplexity varies across the document). Non-native English writers tend to draw on smaller working vocabularies and reuse phrasing — patterns that produce lower perplexity and lower burstiness, the same statistical signature LLM output produces. The classifier cannot distinguish between a careful ESL writer and GPT-4 because the feature distributions overlap.
What should I do if an AI detector flagged my essay and I'm a non-native English speaker?
Preserve every artifact of the writing process (Google Docs version history is the strongest), request the specific evidence in writing, cite the published research and the vendor's own ESL caveats, ask for a human review and an interview, and escalate through the campus ombudsperson, international student services, and — if the institution proceeds despite the documented bias — legal counsel. Don't accept a finding before you've exhausted the appeals path.
Which AI detector is most accurate for ESL writers?
None of the major commercial detectors is accurate enough on non-native English to ground a misconduct decision. Pangram and Originality.ai publish lower ESL numbers in their own marketing, but independent replication is limited. GPTZero and Turnitin both ship vendor-side caveats specifically warning about reduced reliability on shorter or non-native texts. The honest answer is that the category as a whole is not fit for high-stakes academic use against ESL writers — and the institutions that have moved fastest in 2026 are the ones that recognized this in their internal data first.
Have universities responded to the ESL bias problem?
Some have. Washington State University reportedly terminated its Turnitin AI detection contract in February 2026 after a reported 1,485 false positives in a single semester. UCLA, Vanderbilt, Yale, Curtin, Waterloo, Northwestern, and others have disabled Turnitin's AI detection entirely. Most University of California and Big Ten campuses now instruct faculty to treat detector output as a conversation starter, not evidence. Smaller and regional institutions are slower to revise policy and produce most of the wrongful-finding cases against international students.
Published April 27, 2026 by the ToHuman team.