
We Ran 861 Human-Written Sentences Through GPTZero. 13.8% Were Falsely Flagged as AI.

A controlled benchmark on four corpora of verifiably pre-LLM human writing — Wikipedia, PubMed, news, and Reddit ESL. The overall false-positive rate is 13.8%. News and journalism prose, the most-published category of human writing on the web, was flagged the most often. Full per-corpus breakdown, top falsely-flagged sentences, and a downloadable CSV.


TL;DR

We submitted 861 verified human-written sentences from four pre-LLM corpora to GPTZero's v2 API. The classifier returned a verdict of "ai" or "mixed" on 119 of them — a 13.8% false-positive rate at the sentence level. News and journalism prose was flagged most often, at 19.8%. ESL learner writing was flagged at 16.0% — 1.2× the rate of native-English writing in the same dataset. Academic abstracts had the lowest false-positive rate at 10.4%. The pattern: the more polished and formal the human writing, the more often GPTZero calls it machine-generated.

Verdict

GPTZero falsely flagged 13.8% of 861 verified human-written sentences as AI-generated. The rate ranged from 10.4% on PubMed academic abstracts to 19.8% on news prose. ESL writing was flagged at 1.2× the native-English rate. At one false positive per ~7.2 sentences, a 15-sentence paragraph that contains zero AI writing has an expected two flagged sentences — enough to push a document-level score into the suspicion band.

Published AI-detector benchmarks almost always test in one direction: feed the detector AI-generated text and ask how often it catches it. That measures recall — the rate of true positives. The reverse-direction question, the one that matters when a student's essay or a journalist's draft lands in front of a detector, is the false-positive rate on writing that was never AI-generated in the first place. That number gets much less attention, partly because it requires a corpus you can prove is human, and the proof gets harder every year.

This study is a controlled measurement of that number on GPTZero, using four corpora where the human-authorship guarantee is either definitional or pre-dates commercial LLMs entirely. We collected 861 sentences across Wikipedia, PubMed, news, and Reddit ESL communities, submitted each one individually to GPTZero's v2 predict endpoint, and recorded the verdict. The full per-sentence dataset is published alongside this post.

One note up front: the original study scope was five detectors. GPTZero is the only one for which API credentials were provisioned at study time. Phase 2 — Originality.ai, Sapling, Copyleaks, and Winston AI — runs on the same 861-sentence corpus and is queued behind credential signups. We chose to publish Phase 1 now rather than block the GPTZero finding behind four pending account approvals. The cross-detector comparison will follow as an update to this post.

The Headline Number

Of 861 sentences submitted to GPTZero, the classifier returned a non-human verdict on 119 — 105 graded "ai" and 14 graded "mixed." Every API call succeeded; there were no rate-limit errors and no malformed responses. The overall false-positive rate is 13.8%, equivalent to one false positive every 7.24 sentences.

Metric | Value
Sentences submitted | 861
Valid responses | 861
API errors | 0
Verdict: human | 742 (86.2%)
Verdict: ai | 105 (12.2%)
Verdict: mixed | 14 (1.6%)
Overall false-positive rate | 13.8%
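The figures in the table follow directly from the three verdict counts. A minimal Python sketch of that arithmetic, with variable names chosen here for illustration rather than taken from the study's generate_report.py:

```python
# Verdict counts as reported in the table above.
counts = {"human": 742, "ai": 105, "mixed": 14}

total = sum(counts.values())                      # sentences with a valid response
false_positives = counts["ai"] + counts["mixed"]  # any non-human verdict on human text

fp_rate = false_positives / total
sentences_per_fp = total / false_positives

print(f"total={total}, false positives={false_positives}")
print(f"false-positive rate: {fp_rate:.1%}")
print(f"one false positive every {sentences_per_fp:.2f} sentences")
```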

13.8% is in the same band as the better-documented historical numbers. Weber-Wulff et al. (2023) reported GPTZero false-positive rates between 10% and 30% across mixed corpora. GPTZero's own published self-assessment claims 1% or lower on human writing — that number is from a curated benchmark and does not reproduce on randomly sampled real-world text. Our 13.8% sits squarely inside the independent-audit range and well above the vendor-reported rate.

Methodology

Four corpora were combined to cover the major shapes of human writing a detector will encounter in the wild — encyclopedic, academic, journalistic, and learner-grade prose.

Corpus | N | Source | Pre-LLM guarantee
Wikipedia articles | 250 | Wikipedia REST API, random sample | Wikipedia human-authorship policy
PubMed abstracts | 250 | NCBI E-utilities efetch.fcgi | Published 2015–2019 (pre-LLM)
News articles | 111 | Wikipedia articles on 2018–2019 news | Pre-2020 event coverage
ESL learner writing | 250 | r/WriteStreakEN, r/EnglishLearning, r/IELTS | Public forum posts, pre-2024
Total | 861 | |

Every sentence was trimmed to 20–50 words, matching the typical short-input scenario that produces the highest detector variance. Each was submitted individually to POST https://api.gptzero.me/v2/predict/text with the body {"document": "<sentence>"}, using GPTZero model version 2026-05-11-base (NeatVersion 4.6b) on May 20, 2026. A 0.25-second delay was inserted between requests. No rate limiting was triggered.
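The submission loop described above can be sketched roughly as follows. The endpoint, request body, and 0.25-second delay come from this post, and the x-api-key header is the one named in the reproduction notes. The Content-Type header, the function names, and the injectable send parameter are assumptions added for illustration; this is not the study's collect_dataset.py.

```python
import json
import time
import urllib.request

GPTZERO_URL = "https://api.gptzero.me/v2/predict/text"
DELAY_SECONDS = 0.25  # inter-request pause reported in the methodology


def build_request(sentence: str, api_key: str) -> urllib.request.Request:
    """Build the POST request for one sentence. Nothing is sent here."""
    body = json.dumps({"document": sentence}).encode("utf-8")
    # Content-Type is an assumption; x-api-key is the header named in the post.
    headers = {"Content-Type": "application/json", "x-api-key": api_key}
    return urllib.request.Request(GPTZERO_URL, data=body, headers=headers, method="POST")


def submit_all(sentences, api_key, send=urllib.request.urlopen, delay=DELAY_SECONDS):
    """Send sentences one at a time, pausing between calls. Returns parsed JSON bodies."""
    results = []
    for i, sentence in enumerate(sentences):
        if i:  # no pause is needed before the first request
            time.sleep(delay)
        with send(build_request(sentence, api_key)) as response:
            results.append(json.load(response))
    return results
```

Passing send as a parameter keeps the pacing logic testable without a network connection or an API key; a real run would leave it at its default.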

A sentence was coded as a false positive if GPTZero's predicted_class field returned "ai" or "mixed." We used the sentence-level generated_prob as the AI-probability score for ranking. With corpora of 111 to 250 sentences each, the per-corpus gaps should be read with care. In a two-proportion z-test, the largest gap (PubMed versus news, 10.4% versus 19.8%) gives z ≈ 2.43 and p ≈ 0.015, which is significant at p<0.05 but not at p<0.01, and none of the smaller pairwise gaps reaches p<0.05. The per-corpus deltas reported below are therefore directional rather than conclusive.

Per-Corpus False-Positive Rates

Corpus | Sentences | Falsely flagged | FP rate
PubMed abstracts (2015–2019) | 250 | 26 | 10.4%
Wikipedia (pre-2020) | 250 | 31 | 12.4%
ESL learner writing | 250 | 40 | 16.0%
News articles (2018–2019) | 111 | 22 | 19.8%
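The pairwise gaps in this table can be checked with a pooled two-proportion z-test, the test named in the methodology. A minimal sketch using only the counts above; the function name is ours, and the normal approximation is assumed to be adequate at these sample sizes.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test. Returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


# (falsely flagged, sentences) per corpus, from the table above.
corpora = {
    "PubMed": (26, 250),
    "Wikipedia": (31, 250),
    "ESL": (40, 250),
    "News": (22, 111),
}

names = list(corpora)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        z, p = two_proportion_z_test(*corpora[a], *corpora[b])
        print(f"{a} vs {b}: z={z:+.2f}, p={p:.3f}")
```

The loop prints one line per pair of corpora, so each gap in the table can be read alongside its z statistic and p-value.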

The ordering is the result. The most polished prose in the dataset (news writing about elections, disasters, and international affairs) was flagged at nearly twice the rate of academic abstracts, which sat at the bottom. Encyclopedic Wikipedia prose and ESL writing slotted between them, at 12.4% and 16.0%. The cleanest reading is that GPTZero penalises rhythmic, fact-dense, formally structured writing more than it penalises either casual conversational writing or technical jargon-heavy writing. If the goal is to detect AI, that is exactly the wrong direction: polished writing is what AI is most often used to produce, and it is also what professional humans are paid to produce.

The Unexpected Finding: News Writing Gets Flagged the Most

We expected ESL to top the false-positive list. The Stanford Liang et al. (2023) study had already shown that detectors discriminate against non-native English writing, and a 2026 ToHuman literature review confirmed the same pattern across follow-up audits. We did not expect news prose — written by professional journalists, edited by professional editors — to outscore even the ESL corpus by nearly four points.

The explanation is mechanical, not adversarial. GPTZero's classifier is trained on two main signals: perplexity (how predictable each word is given the words around it) and burstiness (how perplexity varies across the document). News writing minimises both. Wire-style prose is built around named entities, dates, attribution clauses, and short declarative sentences — all of which lower perplexity. Editorial polish enforces consistent rhythm across paragraphs, which lowers burstiness. Together they produce a token distribution that looks, to a perplexity-based classifier, indistinguishable from competent LLM output.
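To make the two signals concrete, here is a toy Python sketch of perplexity and burstiness as the paragraph above defines them. It is not GPTZero's implementation. A real detector would take per-token probabilities from a language model, whereas the probabilities below are invented purely for illustration, and using the standard deviation as the measure of variation is our assumption.

```python
import math
import statistics


def perplexity(token_probs):
    """exp of the mean negative log-probability over the tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


def burstiness(sentence_token_probs):
    """Variation in perplexity across sentences (population standard deviation)."""
    return statistics.pstdev([perplexity(s) for s in sentence_token_probs])


# Hypothetical per-token probabilities, invented for illustration only.
wire_style = [[0.5, 0.4, 0.5, 0.4], [0.4, 0.5, 0.4, 0.5]]        # evenly predictable
conversational = [[0.5, 0.4, 0.5, 0.4], [0.05, 0.2, 0.02, 0.1]]  # one surprising sentence

for name, doc in (("wire_style", wire_style), ("conversational", conversational)):
    per_sentence = [round(perplexity(s), 2) for s in doc]
    print(name, "perplexities:", per_sentence, "burstiness:", round(burstiness(doc), 2))
```

In this toy setup the evenly predictable document scores low on both measures, which is the profile the paragraph attributes to wire-style prose.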

Three of the top-ranked false positives in the dataset (AI probability 1.000) read as textbook news writing:

"Its convective structure rapidly degraded thereafter, degenerating into a remnant low-pressure area as it tracked further north along the coast." — Wikipedia article on a 2018 tropical cyclone

"The Houthi movement in Yemen claimed responsibility, tying it to events surrounding the broader conflict in the region." — Wikipedia news article on a 2019 attack

"The tournament phase involved 32 teams, of which 31 came through qualifying competitions held over the preceding two years." — Wikipedia article on a 2018 sporting event

None of these sentences contain AI markers a careful reader would notice — they contain news markers. The detector is misreading "this was written by a professional" as "this was written by a machine." That confusion has practical consequences: in 2025, multiple newsrooms reported internal flagging of human-written copy by detectors used in editorial pipelines, leading to manual review of legitimate work.

The ESL Signal

Our ESL corpus (250 sentences sourced from r/WriteStreakEN, r/EnglishLearning, and r/IELTS, communities where non-native speakers publicly practice English writing) was flagged at 16.0%. Compared with the other three corpora combined (which are predominantly native-English writing), the ESL-to-native lift comes out at 1.2×.

Group | Corpora | Combined FP rate
Native-English writing | Wikipedia + PubMed + News | 12.9%
ESL learner writing | Reddit ESL communities | 16.0%
ESL lift | | 1.2×
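The combined native-English rate is a pooled rate (total flagged over total sentences), not an average of the three per-corpus percentages. A short Python sketch of the calculation, using the per-corpus counts reported earlier in the post; the variable names are ours.

```python
# Per-corpus counts from the false-positive table earlier in the post.
flagged = {"Wikipedia": 31, "PubMed": 26, "News": 22, "ESL": 40}
sentences = {"Wikipedia": 250, "PubMed": 250, "News": 111, "ESL": 250}
native = ["Wikipedia", "PubMed", "News"]

# Pool the native-English corpora before dividing.
native_rate = sum(flagged[c] for c in native) / sum(sentences[c] for c in native)
esl_rate = flagged["ESL"] / sentences["ESL"]
lift = esl_rate / native_rate

print(f"native-English: {native_rate:.1%}")
print(f"ESL: {esl_rate:.1%}")
print(f"ESL lift: {lift:.1f}x")
```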

The lift points in the direction the literature predicts, but it is more modest than the 50-percentage-point gap Liang et al. (2023) reported on formal TOEFL essays in Patterns. Two factors plausibly explain why our gap is smaller. First, Reddit ESL posts contain more grammatical variation, idiosyncratic spelling, and conversational interjection than TOEFL essays, which gives the classifier more "human" signal to work with. Second, GPTZero has had three years and several model versions to absorb the published ESL-bias criticism; some of the narrowing may reflect targeted improvement on the most-cited test set without generalising. Our full analysis of detector bias against ESL writers tracks the longitudinal picture across that period.

What our data adds is that the bias persists even in informal, voluntary ESL writing — the kind of post a learner makes on Reddit for feedback. Three of the top-ten falsely-flagged sentences are ESL posts about Japanese language and writing habits, sentences that read as plainly human to any reader but score 1.000 on AI probability:

"Because of this, younger people often avoid using too many kanji in casual messages, even when they know the characters perfectly well."

"Honestly, I just love how the water turns a cloudy, milky color and how the tiny flecks of green tea sit on the bottom of the cup."

"I have finally reached that stage in my writing habit where, if I miss even a single day, the streak feels broken in a way that bothers me."

All three are conversational, all three are personal, and all three were maximally flagged. The classifier is not distinguishing "machine" from "human" — it is distinguishing "high-entropy native-English idiosyncrasy" from everything else.

Sentence-Level vs Document-Level

A reasonable objection is that real users submit paragraphs, not isolated sentences. Our methodology selects the worst case by design — short text is where the classifier is least confident, and many use cases involve short text (chat replies, social posts, ad copy, email drafts). Document-level scores are aggregated from sentence-level signals: a 15-sentence paragraph has, under independence, an expected 2.07 false-positive sentences (15 × 0.138). The aggregator can damp that signal, but cannot manufacture confidence from a noisy underlying classifier. Independent benchmarks tracking detector accuracy across 2023–2026 consistently find GPTZero document-level scores cross the suspicion threshold on 4–8% of human-written documents — consistent with a sentence-level noise floor in the low double digits, partially compensated by aggregation.
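The expected-count figure can be extended into a full binomial picture. The sketch below keeps the independence assumption stated above (real sentences within one paragraph are unlikely to be fully independent) and uses the study's sentence-level rate; the function and variable names are ours.

```python
from math import comb

FP_RATE = 119 / 861  # sentence-level false-positive rate from this study
N = 15               # sentences in the example paragraph


def prob_exactly(k, n=N, p=FP_RATE):
    """Binomial probability of exactly k flagged sentences out of n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


expected = N * FP_RATE
p_at_least_one = 1 - prob_exactly(0)
p_at_least_two = 1 - prob_exactly(0) - prob_exactly(1)

print(f"expected flagged sentences: {expected:.2f}")
print(f"P(at least one flagged): {p_at_least_one:.3f}")
print(f"P(at least two flagged): {p_at_least_two:.3f}")
```

This covers only the sentence-level inputs. How GPTZero's aggregator turns them into a document score is not modelled here.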

Why the Lowest Rate Is on Academic Writing

The PubMed corpus came in lowest at 10.4%, which looks counterintuitive — academic abstracts are formal and structured, the same profile that hurts news writing. The likely difference is jargon density. Medical abstracts are dense with low-frequency tokens (drug names, disease nomenclature, statistical terminology) that the underlying language model assigns high perplexity, because they are genuinely uncommon in the training distribution. High perplexity reads as "human" to GPTZero, even when the surrounding sentence structure is formulaic. Practical implication: a researcher writing about randomized controlled trials gets a 10% false-positive rate, while a journalist writing about the same trial in plain English gets close to 20%.

Limitations

Five constraints on this study, in order of importance.

Single detector. This is Phase 1 of a planned five-detector benchmark. Originality.ai, Sapling, Copyleaks, and Winston AI ran into credential-provisioning gaps and are queued behind signup approvals. The cross-detector comparison — including consensus analysis, where we ask how often multiple detectors agree on a false positive — will follow as a post update. The same 861-sentence corpus is staged for that run.

Sentence-level inputs. Most real users submit paragraphs, not isolated sentences. Sentence-level false-positive rates are higher than document-level rates because the classifier has less context to work with. This study measures the worst case for short text, which is the highest-variance regime, not the typical-paragraph regime.

ESL corpus is informal. The Stanford 2023 study used standardized TOEFL essays. Our ESL corpus is Reddit posts, which are informal, voluntary, and skew toward writers motivated enough to post for feedback. The ESL discrimination pattern we measured is consistent in direction with Liang et al., but the magnitude is not directly comparable.

Wikipedia revision risk. Wikipedia articles may have been partially edited by AI-assisted contributions after 2022, even on pre-2020 event pages. The Wikipedia community policy prohibits this, and the rate of detected AI edits is low, but the corpus is not airtight against post-hoc contamination.

Model version sensitivity. GPTZero returns model version 2026-05-11-base (NeatVersion 4.6b) on our test date. The numbers reported here will drift as the underlying model is updated. We will re-run on the same corpus quarterly to track drift; the raw CSV is published so external researchers can do the same.

What This Means If Your Writing Gets Flagged

The number above says one thing clearly: a single GPTZero verdict is not strong evidence of anything. At a 13.8% sentence-level false-positive rate, the chance that any given flagged sentence was in fact written by a human is high enough that no responsible policy should treat the score as standalone evidence, and the vendor's own documentation says the same thing in different words.
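How much weight a single flag deserves depends on two quantities this study did not measure: the detector's recall on AI text and the share of submitted sentences that are actually AI-written. The Bayes-rule sketch below combines the study's 13.8% false-positive rate with hypothetical values for both, purely to show the shape of the calculation; the recall and the AI shares are assumptions, not findings.

```python
def prob_human_given_flag(fp_rate, recall, ai_share):
    """Bayes' rule: probability that a flagged sentence was written by a human."""
    flagged_human = fp_rate * (1 - ai_share)  # human sentences flagged in error
    flagged_ai = recall * ai_share            # AI sentences flagged correctly
    return flagged_human / (flagged_human + flagged_ai)


FP_RATE = 119 / 861         # measured in this study
HYPOTHETICAL_RECALL = 0.90  # assumption for illustration, not measured here

for ai_share in (0.05, 0.10, 0.25, 0.50):  # hypothetical shares of AI-written sentences
    p = prob_human_given_flag(FP_RATE, HYPOTHETICAL_RECALL, ai_share)
    print(f"AI share {ai_share:.0%}: P(human | flagged) = {p:.2f}")
```

The lower the share of AI-written text in the pool being checked, the more likely a given flag lands on human writing.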

The defensive playbook for a writer who has been flagged is well-established at this point: preserve every artifact of the writing process (Google Docs version history is the strongest), request the specific evidence in writing, cite the published research and the vendor's own caveats, and ask for a human review with an interview. Our deeper write-up on the false-positive crisis walks through the full playbook. For institutions, the corresponding move is shorter: stop using detector scores as evidence in misconduct proceedings, and start treating them as a non-evidentiary prompt for human conversation. UCLA, Vanderbilt, Washington State, and the broader UC system have all moved that way through 2025–2026.

Phase 2: The Cross-Detector Comparison

The original study design asked the consensus question: when five major commercial detectors see the same human-written sentence, how often do they agree it is AI? That is the Phase 2 question. The four pending detectors are Originality.ai (POST /api/v1/scan/ai), Sapling (POST /api/v1/aidetect, free tier covers the corpus), Copyleaks (POST /v2/writer-detector/text), and Winston AI (POST /v2/predict). Total incremental cost to complete the five-detector study is under $30. We will update this post with the full per-detector table, the consensus analysis, and any sentences that get flagged by all five — those are the most valuable test cases for the underlying classifier failure mode.

Raw Data and Reproduction

Everything required to replicate this study is in the ToHuman research repository: the 861-sentence input corpus at marketing/reports/raw-data/5-detector-benchmark-sentences-2026-05-20.csv, the per-sentence results at 5-detector-benchmark-raw-2026-05-20.csv (joined to GPTZero's verdict, sentence-level AI probability, and document-level AI probability), the analysis code at generate_report.py, and the collection scripts (collect_dataset.py, collect_missing.py).
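For readers working from the results CSV, per-corpus false-positive rates can be recomputed with the standard library alone. The sketch below is not the study's generate_report.py. The column names (corpus, verdict) and the inline sample rows are hypothetical stand-ins, so check them against the real file header before use.

```python
import csv
import io
from collections import Counter

# Hypothetical column names and invented sample rows, for illustration only.
SAMPLE = """sentence_id,corpus,verdict
1,pubmed,human
2,pubmed,ai
3,news,mixed
4,news,human
5,esl,human
"""

NON_HUMAN = {"ai", "mixed"}  # verdicts coded as false positives in this study


def fp_rates_by_corpus(rows, corpus_col="corpus", verdict_col="verdict"):
    """Return {corpus: share of rows with a non-human verdict}."""
    totals, flagged = Counter(), Counter()
    for row in rows:
        totals[row[corpus_col]] += 1
        if row[verdict_col] in NON_HUMAN:
            flagged[row[corpus_col]] += 1
    return {corpus: flagged[corpus] / totals[corpus] for corpus in totals}


rates = fp_rates_by_corpus(csv.DictReader(io.StringIO(SAMPLE)))
for corpus, rate in rates.items():
    print(f"{corpus}: {rate:.1%}")
```

To run it on the published file, pass a csv.DictReader over the opened CSV instead of the inline sample and set the two column arguments to the real header names.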

To replicate the GPTZero numbers: get an API key (Essential tier at $10/mo covers this volume), POST each sentence to https://api.gptzero.me/v2/predict/text with body {"document": "<sentence>"} and an x-api-key header, and record the predicted_class and sentences[0].generated_prob fields. The 0.25-second inter-request delay avoided rate limiting. The dataset is released for permissive reuse — a citation back to this post is appreciated, and we are happy to merge contributed detector runs into the public results.
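Recording the two fields and applying the study's coding rule might look like the sketch below. The field names predicted_class and sentences[0].generated_prob come from this post, but the assumption that they sit under a top-level documents[0] object is ours, and the example payload is invented; verify both against a real response before relying on this.

```python
import json

NON_HUMAN = {"ai", "mixed"}  # verdicts this study codes as false positives


def parse_verdict(payload: dict) -> dict:
    """Extract the recorded fields from one response body."""
    # ASSUMPTION: results are nested under documents[0]. Check a real response.
    doc = payload["documents"][0]
    predicted_class = doc["predicted_class"]
    return {
        "predicted_class": predicted_class,
        "generated_prob": doc["sentences"][0]["generated_prob"],
        # Only meaningful because every input sentence is known to be human-written.
        "false_positive": predicted_class in NON_HUMAN,
    }


# Hypothetical payload, invented for illustration only.
example = json.loads(
    '{"documents": [{"predicted_class": "mixed",'
    ' "sentences": [{"generated_prob": 0.97}]}]}'
)
print(parse_verdict(example))
```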

A Note on What ToHuman Does

We make a free AI text humanizer. Most readers will find their way here because something they wrote got flagged by GPTZero, and the writing was their own. Humanizers exist because detectors are unreliable enough that defending authentic writing requires a defensive tool. The better long-term outcome is the one where detectors get reliable enough that humanizers stop being necessary — and the per-corpus pattern in this study is consistent with a classifier failure mode that gets worse, not better, as AI output gets more polished. The defensive tool exists for the period in between, and is free to use without an account.

Sources

  1. Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., & Zou, J. (2023). GPT detectors are biased against non-native English writers. Patterns, 4(7), 100779. https://doi.org/10.1016/j.patter.2023.100779
  2. Weber-Wulff, D. et al. (2023). Testing of detection tools for AI-generated text. International Journal for Educational Integrity, 19, 26. https://doi.org/10.1007/s40979-023-00146-z
  3. GPTZero API technical documentation. https://gptzero.me/technology
  4. GPTZero — Accuracy on human text. https://gptzero.me/news/accuracy-on-human-text/
  5. Originality.ai API documentation. https://docs.originality.ai/
  6. Sapling AI Detection API. https://sapling.ai/docs/api/aidetect
  7. Copyleaks Writer Detector API. https://api.copyleaks.com/documentation/v3/writer-detector
  8. Winston AI API documentation. https://docs.gowinston.ai/
  9. NCBI PubMed E-utilities documentation. https://www.ncbi.nlm.nih.gov/books/NBK25501/
  10. Wikipedia REST API. https://en.wikipedia.org/api/rest_v1/

Methodology footnote. 861 sentences (250 Wikipedia, 250 PubMed 2015–2019, 111 news, 250 Reddit ESL), each 20–50 words, submitted individually to GPTZero POST /v2/predict/text on 2026-05-20. Model version 2026-05-11-base / NeatVersion 4.6b. False positive = predicted_class of "ai" or "mixed." 0.25s inter-request delay. Zero API errors across the run. Two-proportion z-test on the largest per-corpus gap (PubMed versus news): z ≈ 2.43, p ≈ 0.015; none of the smaller pairwise gaps reaches p<0.05. Raw CSV linked above.

Frequently Asked Questions

What is GPTZero's false-positive rate on human-written text?

In our May 2026 study of 861 verified human-written sentences submitted individually to GPTZero's v2 API (model version 2026-05-11-base), GPTZero returned a verdict of "ai" or "mixed" on 119 sentences — a 13.8% overall false-positive rate, or roughly 1 in every 7.2 sentences. The rate varied by corpus: 10.4% on PubMed academic abstracts, 12.4% on Wikipedia, 16.0% on ESL learner writing, and 19.8% on news/journalism prose.

Why does GPTZero flag news and journalism writing as AI?

News prose minimises both perplexity (word-by-word predictability) and burstiness (variation in that predictability across the document) — the two signals GPTZero's classifier uses to flag AI. Wire-style writing is built around named entities, dates, attribution clauses, and short declarative sentences. Editorial polish enforces consistent rhythm across paragraphs. The result reads, statistically, like competent LLM output. In our corpus, news was flagged at 19.8% — the highest false-positive rate of any group tested.

Does GPTZero discriminate against non-native English writers?

Yes, though more modestly in informal writing than in the formal TOEFL essays Liang et al. (2023) tested. Our ESL corpus (Reddit communities for English learners) was flagged at 16.0%, versus 12.9% on the combined native-English corpora — a 1.2× lift. The direction matches the published literature; the magnitude is smaller because Reddit posts contain more grammatical variation and conversational interjection than standardised exam essays. The discrimination pattern persists, but informal writing partially masks it.

How was the 861-sentence dataset collected?

Four corpora were combined. Wikipedia: 250 sentences via the Wikipedia REST API on randomly sampled articles. PubMed: 250 sentences from MEDLINE abstracts published 2015–2019, before any commercial LLM existed. News: 111 sentences from Wikipedia articles describing 2018–2019 news events. ESL: 250 sentences from r/WriteStreakEN, r/EnglishLearning, and r/IELTS — public posts where learners practice English writing. Every corpus has a pre-LLM provenance guarantee or an explicit human-authorship policy. Each sentence is 20–50 words to match typical user input length.

What does a 13.8% false-positive rate mean in practice?

At the sentence level, roughly 1 in 7 human-written sentences will trigger a "this is AI" verdict. In a 15-sentence paragraph that contains zero AI-generated content, the expected number of flagged sentences is about two, enough for GPTZero's document-level score to nudge above the suspicion threshold. The rate is a worst-case sentence-level number; longer text gives the classifier more aggregate signal, and false-positive rates typically fall. The takeaway is not "GPTZero is broken on long documents." It is that the underlying classifier confuses fluent human writing with AI roughly one sentence in seven, and that error feeds into document-level scores.

What about Originality.ai, Sapling, Copyleaks, and Winston AI?

Phase 2 of this study is pending. The original plan was to test all five major detectors against the same 861-sentence corpus; as of publication, only GPTZero credentials were provisioned. The same corpus is staged and ready, and we will re-run the benchmark when the four remaining detectors' API access is in place. Estimated incremental cost is under $30 total across all four. Existing third-party audits (Weber-Wulff et al. 2023; Ryne AI's 100k-text run on GPTZero) suggest other detectors fall in a similar range, but per-detector comparison on identical text has not been independently published.

Where can I download the raw data?

The full per-sentence CSV (sentence ID, source corpus, the exact sentence text, GPTZero's verdict, the sentence-level AI probability, and the document-level AI probability) is available in the ToHuman public repository at marketing/reports/raw-data/5-detector-benchmark-raw-2026-05-20.csv. The 861-sentence input corpus is at the same path, with -sentences- in place of -raw- in the filename. Both files are released for replication, citation, or extension.


Published May 22, 2026 by the ToHuman team.
