Are AI Detectors Getting Better in 2026? We Tested 861 Human-Written Sentences. Spoiler: No.
Short answer: no. In May 2026 we ran 861 verified human-written sentences through GPTZero, then cross-checked the numbers against three years of vendor disclosures, peer-reviewed audits, and institutional decisions. The accuracy curve has not bent. Published false positive rates are still 4% to 50%+ on general writing and up to 97.8% on non-native English, the same band published in 2023. Below is the timeline, the data, and the honest read on where the category is headed.
TL;DR — The 60-second answer
No. AI detectors have not meaningfully improved between 2023 and 2026. Published false positive rates remain at 4% to 50%+ on general writing and up to 97.8% on non-native English. OpenAI shut down its own classifier in July 2023, citing low accuracy, and has not relaunched it. Turnitin still suppresses scores under 20% because reliability drops there. Vanderbilt, UCLA, Yale, and the University of Waterloo have disabled the tools. Vendors publish higher numbers; independent audits do not replicate them. The category has plateaued.
No. AI detectors are not getting better in 2026, and the new primary data confirms it. In May 2026 we ran 861 verified pre-LLM human sentences through GPTZero's latest production API (model version 2026-05-11-base). GPTZero falsely flagged 13.8% of them as AI, roughly one human sentence in seven. That sits inside the same audited range published in 2023, three years and several model versions later. OpenAI's own classifier launched with a 26% true positive rate in January 2023, was discontinued six months later for its "low rate of accuracy," and has never been replaced. Turnitin still suppresses scores under 20% because reliability collapses there. Vanderbilt, UCLA, Yale, Waterloo, and Washington State have either disabled the tools or terminated the contracts. The vendor accuracy claims have not moved. The audited numbers have not moved. The institutional confidence has moved in the wrong direction.
Want to test it yourself? ToHuman is 100% free, no signup, no card — paste up to 700 characters and we'll rewrite it past the same perplexity-and-burstiness signal these detectors lean on. Every example below was checked against tools you can use the same way.
What follows is what the May 2026 data actually shows, the three years of vendor disclosures and peer-reviewed audits behind it, and the institutional decisions that have stacked up while the accuracy curve has stayed flat. We added our own primary research to the pile because the published literature was three years old and we wanted to know whether anything had changed. It hasn't.
The Timeline: What Actually Shipped
Before grading whether the category has improved, it helps to walk back through what actually launched, what it claimed at launch, and what its independently measured performance was within twelve months.
January 2023 — OpenAI AI Text Classifier. OpenAI released its own detector built directly on the model that produces the text in the first place. The launch announcement disclosed a 26% true positive rate on AI-generated text and a 9% false positive rate on human-written text — meaning the classifier flagged the wrong author roughly one time in ten on human writing, and missed three out of every four AI-generated samples. OpenAI's own announcement described it as a tool with "many limitations" that "should not be used as a primary decision-making tool."
April 2023 — Turnitin AI Detection. Turnitin enabled AI detection across its plagiarism platform with no opt-out for participating institutions. Vendor-published accuracy was 98%. By August 2023, the Washington Post and others reported that false positives were widespread, and Turnitin had introduced a 20% score floor below which scores would be hidden because reliability dropped at the low end.
July 2023 — OpenAI shuts down its own classifier. Six months after launch, OpenAI quietly removed the AI Text Classifier, updating its own announcement page with a note: "the AI classifier is no longer available due to its low rate of accuracy. We are working to incorporate feedback and are currently researching more effective provenance techniques for text." That note has remained on the page through May 2026. No replacement has shipped.
December 2023 — Weber-Wulff et al. publish the European audit. A team led by Debora Weber-Wulff at HTW Berlin tested 14 commercial AI detection tools across English, Spanish, and German texts. The study, published in the International Journal for Educational Integrity, concluded that "none of the tools tested was accurate or reliable" and explicitly recommended against their use for academic integrity decisions.
2023-2024 — Stanford / Liang et al. The single most-cited piece of detector research, Liang et al. in Patterns (2023), tested seven widely used detectors against 91 TOEFL essays written by real non-native English speakers. The average false-positive rate across the detectors was about 61%, and 97.8% of the essays were flagged as AI-generated by at least one detector. A 2024 follow-up by the same group confirmed the bias persisted through that year's vendor updates.
2024-2025 — Institutional reversals begin. The first public reversal came earlier, when Vanderbilt disabled Turnitin's AI detector in August 2023, the first major US university to do so publicly. Yale, the University of Waterloo, Washington State University, and a growing list of others followed in 2024 and 2025. Most cited bias and false positive rates as the reason.
2026 — The category plateaus in public. Vendor accuracy numbers stayed flat. Independent audits continued to replicate the same false-positive findings. Two trends accelerated: (1) institutions stopped using the score as primary evidence, and (2) writers continued to use AI assistance at higher rates than ever, with or without disclosure.
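To make the January 2023 launch figures concrete, here is a minimal sketch that applies Bayes' rule to the disclosed 26% true positive rate and 9% false positive rate. The share of AI-written documents in the pool (`prevalence`) was not part of OpenAI's disclosure; the three values below are hypothetical and chosen only for illustration.

```python
def flag_precision(tpr: float, fpr: float, prevalence: float) -> float:
    """Probability that a flagged document really is AI-written (Bayes' rule)."""
    true_flags = tpr * prevalence            # AI-written and flagged
    false_flags = fpr * (1.0 - prevalence)   # human-written but flagged
    return true_flags / (true_flags + false_flags)


TPR = 0.26  # disclosed at launch, January 2023
FPR = 0.09  # disclosed at launch, January 2023

# ASSUMPTION: prevalence values are illustrative, not measured.
for prevalence in (0.10, 0.25, 0.50):
    precision = flag_precision(TPR, FPR, prevalence)
    print(f"prevalence={prevalence:.0%}  precision={precision:.1%}")
```

Under the assumed 10% prevalence, only about a quarter of the flags would point at AI-written text; the rest would land on human writers. The exact share depends entirely on the prevalence you assume, which is why a raw flag is weak evidence on its own.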
2023 vs 2026: The Numbers Side by Side
The cleanest test of whether detectors have improved is to put the 2023 baseline number next to the 2026 measured number for the same detector, on comparable text. Where we had access — GPTZero — we ran the test directly with 861 verified human sentences. For the rest, we use the most recent third-party audits and the vendors' own current product disclosures. Three years of public data, in one table.
| Detector | 2023 baseline (FPR on human text) | 2026 measured (FPR on human text) | Vendor claim (current) |
|---|---|---|---|
| OpenAI AI Text Classifier | 9% (vendor-disclosed at launch, Jan 2023) | Discontinued Jul 2023 for "low accuracy" — never relaunched | 26% true-positive at launch; no current claim |
| GPTZero | ~30–50% (Liang, Weber-Wulff) | 13.8% sentence-level on our 861-sentence corpus; up to 19.8% on news prose | 99% |
| Turnitin | 4–50%+ (independent audits 2023) | ~4–50%+ (no published reduction; vendor introduced 20% score floor) | 98% |
| Originality.ai | ~10–40% (audits 2023) | ~10–40% (vendor self-reports lower; no independent replication) | 99% |
| Copyleaks | ~15–50% (audits 2023) | ~15–50% (no improvement in published literature) | 99.1% |
| ZeroGPT | ~20–60% (audits 2023) | ~20–60% (still highest variance of any major detector) | 98% |
FPR = false positive rate (verified human writing flagged as AI). 2023 baselines from Liang et al. (Patterns 2023), Weber-Wulff et al. (Int. J. Educational Integrity 2023), and OpenAI's own launch disclosures. 2026 GPTZero number from ToHuman's primary research, May 2026 — full per-corpus breakdown at /blog/gptzero-false-positive-rate-861-sentences-2026. Other 2026 ranges combine current third-party audits and vendor product disclosures; see sources at the end.
Three things stand out. First, the gap between vendor claims and audited results remains wide: accuracy claims of 98% to 99.1% imply error rates of 1% to 2%, while the measured false positive rates in the same rows run from 4% to 60%. Second, the only detector withdrawn in three years is OpenAI's classifier, which is the most honest data point in the table because it's the only one not selling a SaaS subscription. Third, our own 2026 GPTZero measurement (13.8% sentence-level, peaking at 19.8% on news/journalism prose) does not contradict the 2023 literature; it sits in the lower half of the 4% to 50%+ band reported for general writing, suggesting GPTZero's recent model versions have not crossed into reliable territory either. The full 861-sentence breakdown, with per-corpus rates and the top-10 falsely flagged examples, is at /blog/gptzero-false-positive-rate-861-sentences-2026.
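For readers who want to check the arithmetic behind a false positive rate, the sketch below computes the rate and a 95% Wilson score interval for a sample of 861 human sentences. The flagged count of 119 is an assumption, back-calculated from 13.8% of 861; the article reports the percentage, not the raw count.

```python
import math


def false_positive_rate(flagged_human: int, total_human: int) -> float:
    """Share of verified human samples that the detector labeled as AI."""
    return flagged_human / total_human


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - margin, center + margin


TOTAL_HUMAN = 861
FLAGGED = 119  # ASSUMPTION: round(0.138 * 861); raw count not given in the text

fpr = false_positive_rate(FLAGGED, TOTAL_HUMAN)
low, high = wilson_interval(FLAGGED, TOTAL_HUMAN)
print(f"FPR = {fpr:.1%}, 95% interval = {low:.1%} to {high:.1%}")
```

With these assumed counts the interval runs from roughly 12% to 16%, so sampling noise alone would not pull the estimate anywhere near the 1% to 2% error rates that the vendor accuracy claims imply.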
Why the Category Plateaued
The reason isn't laziness or a lack of investment. The detection vendors have raised real money and shipped real updates over three years. The plateau is structural — it sits in what the classifiers actually measure.
Almost every commercial AI detector, including Turnitin, GPTZero, Originality.ai, and Copyleaks, scores text on two main features: perplexity (how surprising each word is given the words before it, scored against a language model) and burstiness (how much that perplexity varies across the document). LLM-generated text tends to score low on both. Some categories of human writing, such as formal academic prose, technical writing, and especially non-native English, also score low on both. The classifier cannot tell the two apart, because on these features there is little statistical difference to find. We covered this in detail in our AI detection false positives review.
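A minimal sketch of the two features, assuming you already have per-token log-probabilities from some language model. The numbers below are made up for illustration, and the exact formulas commercial detectors use are not public; per-sentence standard deviation is just one common proxy for burstiness.

```python
import math
import statistics


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the average negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def burstiness(sentences: list[list[float]]) -> float:
    """Standard deviation of per-sentence perplexity across a document."""
    return statistics.pstdev([perplexity(s) for s in sentences])


# HYPOTHETICAL log-probs: one inner list per sentence, one value per token.
uniform_doc = [[-1.0, -1.1, -0.9], [-1.0, -1.0, -1.0], [-1.1, -0.9, -1.0]]
varied_doc = [[-0.5, -0.4, -0.6], [-3.0, -2.5, -3.5], [-1.0, -1.2, -0.8]]

for name, doc in (("uniform", uniform_doc), ("varied", varied_doc)):
    all_tokens = [lp for sentence in doc for lp in sentence]
    print(f"{name}: perplexity={perplexity(all_tokens):.2f} "
          f"burstiness={burstiness(doc):.2f}")
```

The uniform document scores low on both features. That is the profile a detector associates with machine text, and nothing in the computation knows whether a model or a careful human produced it.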
Adding more training data does not fix this, because the issue isn't undertrained models. It's that the signal the detectors are looking for is shared between LLM output and certain categories of human writing. As LLMs have become better at producing fluent text — which has happened, dramatically, between GPT-3.5 and GPT-5 — the signal has gotten weaker, not stronger. The frontier models in 2026 produce output with more variation than 2023 models did, which means the gap between machine-generated and human-generated perplexity has narrowed. Detectors are trying to read a smaller and smaller signal.
This is the core reason OpenAI gave when shutting down its own classifier and the reason the company has pivoted toward watermarking and provenance research instead. If you can mark the text at generation time, you don't need to detect it after the fact. As of May 2026, no major LLM provider has shipped a watermarking standard at scale.
What the Institutions Did
One way to evaluate whether AI detectors are getting better is to ask the people who pay for them. The pattern over three years is one of cautious adoption, surprised disappointment, and quiet retraction.
The list of universities that have publicly disabled or stopped using AI detection tools includes Vanderbilt (August 2023, the first big public reversal), Yale, the University of Waterloo, the University of Pittsburgh, Northwestern, the University of Texas at Austin, Washington State University, and a long tail of regional institutions. UCLA's Center for Education Innovation publicly recommends against detector-based misconduct cases. The Modern Language Association and the Conference on College Composition and Communication issued joint guidance in 2024 cautioning against detector evidence in misconduct proceedings. We covered the institutional landscape in our review of university AI detection policies.
Turnitin still markets the feature, and many institutions still subscribe, but the trend among the most international and most research-active universities is to disable it or treat the score as inadmissible. That is not a vote of confidence in a category that's improving.
What OpenAI, Anthropic, and Google Have Said
The model labs have been comparatively candid about the limits of detection.
OpenAI's notice on the discontinued classifier page remains live: "the AI classifier is no longer available due to its low rate of accuracy." Sam Altman has stated in multiple public appearances that reliable AI text detection is not technically feasible at scale and that the company's research effort has shifted to provenance and watermarking. OpenAI has reportedly developed a watermarking method internally but has not deployed it, citing concerns about user equity and the ease with which adversaries could strip the watermark.
Anthropic has not released a detector and has stated in its academic policy guidance that it does not endorse any third-party detector for high-stakes decisions about Claude-generated text. Google's published research on detection emphasizes the same provenance-over-classification framing.
This is unusual. The three labs that produce the bulk of the LLM-generated text in the world have collectively concluded that downstream classification is not the right approach, while the detection vendors continue to claim 98-99% accuracy. They cannot both be right.
What Has Improved (Honestly)
It is worth being precise about what has improved between 2023 and 2026, even though the headline answer is "not detector accuracy."
Documentation has improved. Turnitin's own product documentation in 2026 is more cautious than the marketing copy in 2023. The 20% score floor, the explicit "should not be sole basis for misconduct" disclaimer, and the published note about reduced accuracy on short submissions are all real concessions. Vendors now disclose limitations they previously hid.
Institutional process has improved. Most major universities have updated AI-use policies that are far more nuanced than 2023 versions. Many now require human review, allow students to disclose AI use without penalty, and explicitly forbid using a detector score as the sole basis for an academic integrity finding.
The legal record has gotten longer. A series of student-led lawsuits in 2024 and 2025 have established a baseline expectation that schools cannot proceed on a detector score alone. The settlements, while not establishing binding precedent, have made institutions more cautious.
None of these improvements come from the detectors getting better at the underlying classification problem. They come from the surrounding institutions and policies catching up to the limits of the tools.
What Would Actually Work
If "the detectors are not getting better" is the diagnosis, the natural next question is what would. The honest answer comes in three layers, in order of how soon they'd actually move the needle.
What would work at the model layer: cryptographic watermarking, adopted at the same time by OpenAI, Anthropic, and Google. If the largest models embedded a statistically detectable but invisible signal in the tokens they generate, "is this AI" would become a verification question, not a classification question. Verification is easy and reliable in a way classification never has been. OpenAI has reportedly built such a watermarking method internally and has not deployed it; Anthropic and Google have published research on the approach but not shipped a standard either. As of May 2026, no major LLM provider has deployed watermarking at scale. Without coordinated adoption a watermark does little, because adversaries can strip a single-vendor watermark by routing the text through any other model. This is the only intervention that would meaningfully change the accuracy picture, and the timeline on it is years, not months.
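To show why verification is a simpler problem than classification, here is a toy sketch of a "green list" watermark check in the style described in the academic literature. It is not the method any lab has built or deployed. The key, the 50-word vocabulary, the `GAMMA` fraction, and the hard "always pick a green token" generator are all assumptions for illustration; published schemes nudge token probabilities softly instead.

```python
import hashlib
import math
import random

GAMMA = 0.5                # ASSUMED fraction of tokens that count as "green"
SECRET_KEY = b"demo-key"   # HYPOTHETICAL key shared by generator and verifier
VOCAB = [f"w{i}" for i in range(50)]  # toy vocabulary


def is_green(prev_token: str, token: str) -> bool:
    """Keyed hash of (previous token, token) decides green-list membership."""
    payload = SECRET_KEY + prev_token.encode() + b"\x00" + token.encode()
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA


def watermark_z_score(tokens: list[str]) -> float:
    """How far the green-token count sits above what chance would give."""
    n = len(tokens) - 1
    green = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


def generate(n_tokens: int, watermark: bool, rng: random.Random) -> list[str]:
    """Toy generator: random tokens, optionally restricted to green ones."""
    tokens = [rng.choice(VOCAB)]
    while len(tokens) < n_tokens:
        candidates = rng.sample(VOCAB, k=len(VOCAB))
        if watermark:
            # Prefer a green token; fall back to the first candidate if none is.
            pick = next((c for c in candidates if is_green(tokens[-1], c)),
                        candidates[0])
        else:
            pick = candidates[0]
        tokens.append(pick)
    return tokens


rng = random.Random(0)
marked = generate(201, watermark=True, rng=rng)
plain = generate(201, watermark=False, rng=rng)
print(f"watermarked z = {watermark_z_score(marked):.1f}")
print(f"unmarked z    = {watermark_z_score(plain):.1f}")
```

The verifier needs only the key and the text, and the answer is a significance test with a known null distribution. The sketch also shows the weakness named above: re-generating the text with a model that does not share the key removes the signal.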
What would work at the institutional layer: replacing detector scores with evidence-based reviews. The universities that have moved fastest (Vanderbilt, UCLA, Yale, Waterloo, Washington State) all converged on the same pattern: stop treating a detector score as evidence, require human review of process artifacts (version history, drafts, oral defense), and audit outcomes by demographic group every term. This works because it stops asking the detector a question it cannot answer. It is also the only intervention that can deploy in a single academic term without waiting for the model providers to coordinate.
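The per-group audit can be as simple as the sketch below. The record layout `(group, flagged, confirmed_ai)` and the group labels are hypothetical; a real audit would use whatever fields an institution's case system already tracks.

```python
from collections import defaultdict


def fpr_by_group(cases):
    """False positive rate per group, from (group, flagged, confirmed_ai) records."""
    flagged_human = defaultdict(int)
    total_human = defaultdict(int)
    for group, flagged, confirmed_ai in cases:
        if confirmed_ai:
            continue  # FPR only counts work confirmed as human-written
        total_human[group] += 1
        flagged_human[group] += int(flagged)
    return {g: flagged_human[g] / total_human[g] for g in total_human}


# HYPOTHETICAL term data, for illustration only.
term_cases = [
    ("group_a", True, False), ("group_a", False, False),
    ("group_a", False, False), ("group_a", False, False),
    ("group_b", True, False), ("group_b", True, False),
    ("group_b", False, False), ("group_b", False, False),
    ("group_b", True, True),   # confirmed AI use: excluded from FPR
]

for group, rate in sorted(fpr_by_group(term_cases).items()):
    print(f"{group}: {rate:.0%} of human-written submissions flagged")
```

A persistent gap between groups is the signal to look for; the absolute rates matter less than whether one population is flagged disproportionately.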
What would work at the writer layer: making the perplexity-and-burstiness signal less brittle for your own writing. This is the layer most people have agency over today. The same edits that make text harder to flag also tend to make it better text: vary sentence lengths, cut formulaic transitions, add specifics the model wouldn't reach for, preserve version history so the writing process is verifiable. A humanizer like ToHuman does this at scale by reintroducing perplexity and burstiness variation; manual editing does it the slow way. Neither fixes the underlying classification problem — they route around it. Try it free, no signup, paste up to 700 characters, see if the rewrite reads like you. (If you want the methodology, the 861-sentence GPTZero study shows exactly which sentence types the detector breaks on — those are the patterns to vary.)
None of the three layers is a silver bullet on its own. The watermarking layer is the structural fix; the institutional layer is the protective fix; the writer layer is the practical fix. The first is years away. The second is happening unevenly. The third is the only one you can ship by Sunday.
What This Means If You Write With AI
If you write with AI assistance — which, in 2026, is most of us — the practical implications of "detectors haven't improved" are mixed.
On the positive side, detector evidence is far less likely to result in a misconduct finding than it was in 2023. Institutions know the limits, courts have started to push back, and most policies now require corroborating evidence. If you wrote your work and got flagged, the fact pattern of "low-perplexity writing, vendor-disclaimed score, no corroborating evidence" is the exact pattern current academic-integrity appeals are winning on. Our step-by-step Turnitin guide covers what to do if it happens.
On the negative side, detectors are still in front of millions of student papers, hiring screens, freelance assignments, and university admissions essays. The scoring is unreliable in a measurable, reproducible way, but the score still appears, and the human reading it doesn't always understand the limits. That asymmetry is what makes the category persistently harmful even if the headline accuracy hasn't moved. (For context on just how broken this is at a systemic level: when five research teams measure AI content in Google's top results using five different detectors, they get answers ranging from 5% to 87%. We break down why in our analysis of AI content in Google's top 10.)
Practical mitigations, ordered by how much they cost you in time:
- Disclose AI use where permitted. Most modern policies allow it. Disclosure removes the false-positive risk entirely because the question of whether AI was involved becomes moot.
- Vary your prose manually. Mix sentence lengths. Cut formulaic transitions. Add specifics. The same edits that produce better writing also break the perplexity-and-burstiness signal that detectors lean on.
- Preserve writing artifacts. Google Docs version history, draft files, browser research history — anything that shows the writing process is the strongest defense if you are flagged.
- Use a humanizer where appropriate. An AI humanizer rewrites text to introduce the variation a detector is looking for. This is the highest-leverage step against the perplexity signal but doesn't change what your work says or who wrote the underlying ideas. ToHuman is 100% free, no signup, no card — handles up to 700 characters per pass if you want to test it.
None of these are guarantees against a noisy classifier. They are reasonable steps to take given that the tools are unlikely to get better in the near term. If you write in non-native English, the bias is structural and well-documented — see our review of detector bias against ESL writers for the evidence and the appeals playbook.
Where the Category Goes Next
The honest forecast for 2026-2027 is that the category continues to plateau on accuracy and slowly contracts on usage. The institutional momentum is clearly away from score-based misconduct decisions. The technical literature has not produced a credible path toward higher accuracy without watermarking. The model labs have publicly conceded that downstream detection is not the right approach.
Two things would change this picture. The first is a watermarking standard adopted at scale by OpenAI, Anthropic, and Google. That would shift the problem from classification (which is hard) to verification (which is easy if the watermark is in the text). As of May 2026, no such standard has shipped. The second is a detector that solves the perplexity-and-burstiness problem with a fundamentally different signal. We have not seen one.
Until either of those things happens, "are AI detectors getting better?" is a question the data answers with a clear no. Better-marketed, perhaps. Better-documented, in some cases. But not better at the thing they exist to do. Our May 2026 GPTZero run, three years after the Liang study and on a corpus we assembled ourselves, gave us a sentence-level false-positive rate that sits inside the same band the 2023 literature reported. That is the most direct read we have on whether the category has improved in three years: the same kind of human-written input, a fresh model version, and the same failure mode.
Frequently Asked Questions
Are AI detectors getting better in 2026?
No, not in any way the published evidence supports. False positive rates have held steady at 4-50%+ on general populations and up to 97.8% on non-native English. OpenAI shut down its own classifier in July 2023 and has not relaunched it. Major universities continue to disable detection tools. The category has plateaued.
Why did OpenAI shut down its AI text classifier?
OpenAI launched the AI Text Classifier in January 2023 and discontinued it on July 20, 2023, citing "low rate of accuracy." At launch it correctly identified only 26% of AI-generated text and produced a 9% false positive rate on human writing. OpenAI has not released a replacement and has shifted research toward watermarking instead.
What is the false positive rate of AI detectors in 2026?
Our May 2026 primary research clocked GPTZero at 13.8% sentence-level on 861 verified pre-LLM human sentences (peaking at 19.8% on news/journalism prose, 16.0% on ESL writing) — inside the same band published in 2023. Across the broader literature, the false-positive rate runs from about 4% to 50%+ on general populations and as high as 97.8% on non-native English (Liang et al.). Three years of model updates have not moved the published number out of the 2023 range.
Which AI detector is most accurate in 2026?
None that the peer-reviewed literature endorses for high-stakes use. Vendor numbers (98-99%) are not replicated by independent audits. The right question is whether a detector should be used as primary evidence at all, and the institutional answer is increasingly no.
What should I do if I write with AI assistance?
Read your institution's or platform's policy first. Disclose where permitted, since disclosure removes false-positive risk entirely. If you want to reduce flag risk, vary your prose manually, preserve writing artifacts, and consider a humanizer that targets the perplexity-and-burstiness signal detectors actually measure.
Sources
- OpenAI — New AI classifier for indicating AI-written text (Jan 2023, with July 2023 discontinuation note)
- Liang et al., "GPT detectors are biased against non-native English writers," Patterns (2023)
- Weber-Wulff et al., "Testing of detection tools for AI-generated text," International Journal for Educational Integrity (2023)
- Washington Post — He used AI to cheat. Now he's accused of cheating differently. (Aug 2023)
- Vanderbilt University — Why we're disabling Turnitin's AI detector (Aug 2023)
- Yale Poorvu Center — AI Guidance for Teaching
- University of Waterloo — Turnitin update and removal of AI detection
- NBC News — College students turn to AI to beat AI cheating detectors
- The Serials Librarian — AI Detection Unfairly Accuses Scholars of AI Plagiarism (2024)
- Anthropic — Academic policy guidance
- Turnitin — AI Writing Detection product documentation
- ToHuman primary research — 861 human-written sentences through GPTZero, May 2026 (13.8% false-positive rate)
Methodology: For the 2026 GPTZero row in the comparison table, we submitted 861 verified human-written sentences (Wikipedia, PubMed 2015–2019, Wikipedia news articles about 2018–2019 events, and Reddit ESL writing) to GPTZero's v2/predict/text endpoint at model version 2026-05-11-base on May 20, 2026 — full methodology and raw data linked above. For the rest, we compiled vendor-published accuracy claims directly from each detector's product documentation as of May 2026, then cross-referenced with peer-reviewed audits (Liang 2023, Weber-Wulff 2023), institutional disclosures, and press coverage from 2023–2026. Where audit ranges varied across studies, we report the published spread. False positive rates on ESL writing are drawn primarily from the Liang et al. dataset and confirmed by 2024–2025 follow-up literature and our own May 2026 ESL corpus run. Vendor accuracy numbers are self-reported and have not been independently replicated. Last updated May 25, 2026.
Published April 30, 2026 by the ToHuman team.