A recent study by the Columbia Journalism Review’s Tow Center for Digital Journalism documents serious accuracy problems in generative AI models used for news searches. The study assessed eight AI-driven search tools with live search capabilities and found that they provided incorrect answers to more than 60% of queries about news sources.

Researchers Klaudia Jaźwińska and Aisvarya Chandrasekar noted that roughly one in four Americans now use AI models as alternatives to traditional search engines. Given that reach, the error rate documented in the study raises serious concerns about the reliability of information sourced from these systems.

Performance varied considerably across platforms. Perplexity delivered incorrect information 37% of the time, while ChatGPT Search erred on 67% of queries (134 out of 200). Grok 3 fared worst, answering incorrectly 94% of the time.

To evaluate the models, the researchers fed direct excerpts from real news articles to each AI tool and asked it to identify the article’s headline, original publisher, publication date, and URL. In total, the study ran 1,600 queries across the eight generative search platforms.
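That setup is easy to picture as an evaluation harness. The sketch below is a simplified illustration, not CJR’s actual tooling: the study graded responses on a finer scale (including declines to answer), while this version only checks the four attribution fields, and query_model is a hypothetical stand-in for whatever interface drives a given tool.

```python
from dataclasses import dataclass

@dataclass
class GroundTruth:
    headline: str
    publisher: str
    date: str
    url: str

def grade(answer: dict, truth: GroundTruth) -> str:
    """Compare the four attribution fields against ground truth."""
    fields = ["headline", "publisher", "date", "url"]
    correct = sum(
        answer.get(f, "").strip().lower() == getattr(truth, f).lower()
        for f in fields
    )
    if correct == len(fields):
        return "correct"
    return "partially incorrect" if correct else "completely incorrect"

def run_eval(cases, query_model):
    """cases: list of (excerpt, GroundTruth) pairs.
    query_model: callable mapping an excerpt to a dict of the four fields."""
    tally = {"correct": 0, "partially incorrect": 0, "completely incorrect": 0}
    for excerpt, truth in cases:
        tally[grade(query_model(excerpt), truth)] += 1
    return tally
```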

One disturbing trend: rather than declining to respond when faced with uncertainty or insufficient information, the models frequently produced confabulations, plausible-sounding but incorrect answers. This behavior was consistent across all models tested, pointing to a systemic issue rather than isolated failures.

In a counterintuitive twist, premium versions of these tools often produced incorrect responses at higher rates than the free ones. Perplexity Pro ($20/month) and Grok 3’s premium service ($40/month) both generated more inaccuracies than their free counterparts. Although the paid models answered more prompts correctly in absolute terms, their reluctance to decline uncertain queries drove up their overall error rates.
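To see why, consider an invented illustration (the numbers are hypothetical, not from the study): a cautious model that declines 80 of 200 queries and answers 70 correctly is wrong on only 50, a 25% error rate, while a model that attempts all 200 and gets 110 right is wrong on 90, a 45% error rate, despite producing more correct answers.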

Citation practices and publisher control were another sore point. The CJR team found evidence that certain AI tools disregarded Robot Exclusion Protocol (robots.txt) settings, which publishers use to tell crawlers which pages they may access. For example, Perplexity’s free version correctly identified all ten excerpts drawn from paywalled National Geographic content, despite the publisher explicitly disallowing such access.
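For context, the Robot Exclusion Protocol is purely advisory: a publisher serves a plain-text robots.txt file, and well-behaved crawlers are expected to consult it before fetching pages, but nothing technically enforces compliance. A minimal Python sketch of that check (the domain and user-agent string here are illustrative, not drawn from the study):

```python
import urllib.robotparser

# A publisher's robots.txt might disallow a specific AI crawler like this:
#
#   User-agent: PerplexityBot
#   Disallow: /
#
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# A compliant crawler checks before fetching. Because the protocol is
# advisory, a crawler that skips this check faces no technical barrier.
print(rp.can_fetch("PerplexityBot", "https://example.com/some-article"))
```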

Another significant concern was the tools’ tendency to cite syndicated versions of content, such as copies on Yahoo News, rather than linking to the original publishers. In some cases, publishers held formal licensing agreements with AI companies and still saw their original content passed over.

URL fabrication also emerged as a critical issue: more than half of the citations from Google’s Gemini and Grok 3 led to fabricated or broken URLs that resolved to error pages. In one test of 200 citations from Grok 3, 154 produced broken links.
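Broken citations of this kind are straightforward to detect programmatically. Below is a minimal sketch of the sort of link check such a test implies; the function and its error-page heuristic are assumptions, not the study’s actual methodology.

```python
import requests

def count_broken(urls: list[str], timeout: float = 10.0) -> int:
    """Count URLs that fail to resolve or return an error status."""
    broken = 0
    for url in urls:
        try:
            # HEAD is cheaper than GET; fall back if the server rejects it.
            resp = requests.head(url, timeout=timeout, allow_redirects=True)
            if resp.status_code == 405:
                resp = requests.get(url, timeout=timeout, allow_redirects=True)
            if resp.status_code >= 400:
                broken += 1
        except requests.RequestException:
            broken += 1  # DNS failures, timeouts, etc. count as broken
    return broken
```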

This leaves publishers in a difficult position. If they block AI crawlers, they risk losing attribution altogether; if they allow them, they invite extensive reuse of their content without traffic flowing back to their own websites. Mark Howard, Time magazine’s chief operating officer, expressed concerns about ensuring transparency and control over how the magazine’s content appears in AI-generated searches. He nonetheless remains optimistic about future improvements, stating, “Today is the worst that the product will ever be,” pointing to substantial investments aimed at advancing these tools.

Interestingly, Howard also suggested that users bear some responsibility for their reliance on free AI tools, implying that expecting these systems to provide entirely accurate information without skepticism is misguided. OpenAI and Microsoft have acknowledged CJR’s findings but did not address the specific issues. OpenAI reaffirmed its commitment to supporting publishers by facilitating increased traffic through clear links, summaries, and attribution, while Microsoft stated it adheres to Robot Exclusion Protocols and publisher guidelines.

This latest report builds on prior findings published by the Tow Center in November 2024, which also identified accuracy problems, particularly in how ChatGPT managed news-related content. For more in-depth insights, visit the Columbia Journalism Review’s website.