Perplexity Faces Accusations of Unethical Website Scraping

Perplexity AI, a San Francisco-based startup valued at $3 billion, is under scrutiny for allegedly scraping websites that explicitly prohibited AI data collection, according to a Cloudflare report published on August 4. The AI-powered search engine, backed by investors like Jeff Bezos’ family fund and Nvidia, is accused of ignoring the Robots Exclusion Protocol, a web standard that uses robots.txt files to instruct crawlers on which pages to avoid. Cloudflare’s findings reveal Perplexity used stealth tactics to bypass these restrictions, raising ethical and legal questions about its data practices.
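The Robots Exclusion Protocol is purely advisory: a site publishes a plain-text robots.txt file, and well-behaved crawlers are expected to fetch it and honor its rules before requesting pages. A minimal sketch of how a compliant crawler would check such a policy, using Python's standard `urllib.robotparser` and a hypothetical robots.txt that blocks PerplexityBot (the domain and rules here are illustrative, not taken from any real site):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block Perplexity's declared crawler, allow others.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())
parser.modified()  # mark the rules as loaded so can_fetch() consults them

# The declared crawler is blocked from every path...
print(parser.can_fetch("PerplexityBot", "https://example.com/article"))

# ...but a client presenting a generic browser user agent is not.
print(parser.can_fetch("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
                       "https://example.com/article"))
```

Because nothing enforces these rules, a crawler that simply ignores the file, or identifies itself under a different user agent, passes straight through; that gap is exactly what Cloudflare alleges Perplexity exploited.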

Cloudflare observed Perplexity’s bots crawling tens of thousands of domains, generating millions of daily requests, even when websites had blocked its known crawler, PerplexityBot. The company allegedly altered its user agents to mimic Google Chrome on macOS and rotated IP addresses and autonomous system numbers (ASNs) to obscure its identity. “This activity was observed across tens of thousands of domains,” Cloudflare’s researchers noted; they identified the crawler using machine learning and network signals. One unpublicized IP address, 44.221.181.252, reportedly accessed Condé Nast properties, including WIRED, at least 822 times over three months, per WIRED’s analysis.
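The kind of user-agent spoofing Cloudflare describes is trivial because the User-Agent string is a self-declared HTTP header, not a verified identity. A hypothetical sketch using Python's standard `urllib.request` illustrates this (example.com is a placeholder, and the UA string below is a generic Chrome-on-macOS value, not one attributed to Perplexity):

```python
import urllib.request

# An illustrative User-Agent string imitating Google Chrome on macOS.
CHROME_UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
             "AppleWebKit/537.36 (KHTML, like Gecko) "
             "Chrome/124.0.0.0 Safari/537.36")

# Any client can claim any identity: the header is set by the requester,
# and the receiving server has no built-in way to verify it.
req = urllib.request.Request("https://example.com/",
                             headers={"User-Agent": CHROME_UA})
print(req.get_header("User-agent"))  # the server would see a "browser"
```

Because the header can say anything, defenses fall back on network-level signals such as IP ranges, ASNs, and behavioral fingerprints, which is how Cloudflare says it attributed the traffic.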

Perplexity’s spokesperson, Jesse Dwyer, dismissed Cloudflare’s report as a “sales pitch,” asserting in an email to TechCrunch that screenshots showed “no content was accessed.” Dwyer further claimed the bot identified by Cloudflare “isn’t even ours.” However, WIRED and developer Robb Knight confirmed Perplexity’s scraper accessed restricted pages by monitoring server logs after prompting its chatbot with specific URLs. This follows earlier accusations from Forbes, which claimed Perplexity plagiarized its content in AI-generated summaries, and a Condé Nast cease-and-desist letter for similar violations.

The controversy highlights broader concerns about AI companies’ data practices. While robots.txt is not legally binding, Amazon Web Services, which hosts Perplexity’s crawler, requires compliance with its terms prohibiting abusive activities. “AWS’s terms of service prohibit abusive and illegal activities,” said spokesperson Patrick Neighorn, noting ongoing investigations into Perplexity’s practices. Perplexity’s CEO, Aravind Srinivas, previously told Fast Company that a third-party crawler, not disclosed due to a nondisclosure agreement, was responsible for some scraping, calling the issue “complicated.”

Key issues raised by the accusations include:

  • Ignoring robots.txt: Perplexity allegedly bypassed website restrictions to access blocked content.
  • Stealth tactics: changing user agents and rotating IP addresses to evade detection.
  • Legal risks: potential copyright violations, as alleged in lawsuits from Dow Jones and NYP Holdings.

As publishers like the BBC and The New York Times threaten legal action, Perplexity’s practices underscore tensions between AI innovation and content ownership. The company has launched a revenue-sharing program with publishers, but critics argue it fails to address the core issue of unauthorized scraping.

Author

  • Connor Walsh

    Connor Walsh is a passionate tech analyst with a sharp eye for emerging technologies, AI developments, and gadget innovation. With over a decade of hands-on experience in the tech industry, Connor blends technical knowledge with an engaging writing style to decode the digital world for everyday readers. When he’s not testing the latest apps or reviewing smart devices, he’s exploring the future of tech with bold predictions and honest insights.
