The emergence of the web data infrastructure layer for AI

AI is booming. New use cases are emerging each day. To capitalize on the technology’s potential, enterprises require data at scale. In many cases, though, the relevant information is blocked or unstructured, which limits its use by AI models.

To understand this challenge, consider the foundation of the web itself. The web was not designed for the automated discovery and retrieval that new AI applications demand. Overcoming this inherent design constraint requires infrastructure.

The next frontier in AI may depend on a new web data infrastructure layer that can enable models to discover and map this ever-expanding digital realm. This layer must be able to navigate hundreds of millions of existing web domains and billions of new URLs created each week, delivering real-time information and overcoming technical barriers.

“The data suggests there’s far more data out there,” says Or Lenchner, CEO of Bright Data, a web data collection platform. “Think of the universe: It’s out there, but you don’t know what you don’t know.”

Enabling access to fresh, relevant, and trustworthy data

While early AI breakthroughs were driven by scaling training data and model size, organizations are now encountering a fundamental bottleneck: They need to keep pace with the dynamic, unstructured, and constantly evolving nature of web data in order to ground outputs in current and verifiable information. AI performance increasingly depends not just on model architecture but on a system’s compute, networking, retrieval, and data engineering capabilities—that is, the system’s ability to quickly and reliably retrieve data that is fresh, relevant, and trustworthy.

Traditional model training relies on snapshots of information collected at a particular point in time. Training AI on such static data is no longer sufficient. To track fluctuations such as competitor pricing, consumer sentiment, and market trends, companies need a constant feed of new information, pulling data in real time along with relevant context. Their infrastructure must therefore be able to handle millions of simultaneous interactions across websites that vary by geography, language, format, and access rules.

“If it can’t retrieve real-time information, it lacks context,” Lenchner says. “In a business setting, that’s not acceptable anymore. Stale answers lead to bad decisions and disappointed consumers.”

Speed is not merely a matter of convenience; it’s a matter of necessity. Today’s organizations operate in environments where prices, inventory, markets, security threats, and customer behavior change continuously. Delayed data retrieval can reduce the usefulness of an otherwise sophisticated model.

Using live, high-quality web data can also reduce AI hallucinations because the model has a more relevant knowledge base. This builds user trust. In one survey, 56% of AI practitioners said businesses need access to real-time web data to improve trust in AI outputs. To keep the model efficient, the retrieved information must also be pared down to what is relevant to the task at hand.

Despite the introduction of retrieval-augmented generation (RAG), where models pull in external data at the moment of a query, many AI systems still struggle to deliver outputs that are current, contextually relevant, and trustworthy in operational settings. According to Gartner, 60% of AI projects that are not supported by AI-ready data—accurate, structured, organized, and contextualized—will be abandoned by the end of the year.

This is because large-scale retrieval alone does not solve the problem. As Lenchner puts it, “You need to retrieve data at scale, but also in real time. Latency becomes an issue because of the end user who is waiting for the output.”
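To make the retrieval step concrete, here is a minimal sketch of retrieval-augmented generation with a freshness cutoff. It is an illustration only, not any vendor's implementation: the `Document` class, the `retrieve_fresh` and `build_prompt` functions, the keyword-overlap ranking, the example URLs, and the one-hour age limit are all assumptions made for this example. A production system would more likely use embedding-based search and a live model call.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Document:
    """A retrieved web snippet (hypothetical structure for this sketch)."""
    url: str
    text: str
    fetched_at: datetime


def retrieve_fresh(query, documents, now, max_age, top_k=2):
    """Return up to top_k documents no older than max_age, ranked by keyword overlap."""
    query_terms = set(query.lower().split())
    scored = []
    for doc in documents:
        if now - doc.fetched_at > max_age:
            continue  # skip stale snapshots
        overlap = len(query_terms & set(doc.text.lower().split()))
        if overlap > 0:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_prompt(query, context_docs):
    """Assemble the text that would be sent to a model at query time."""
    context = "\n".join(f"- {d.text} (source: {d.url})" for d in context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Example data; every value here is made up for illustration.
now = datetime(2026, 1, 15, 12, 0, tzinfo=timezone.utc)
docs = [
    Document("https://example.com/a", "widget price is 19 dollars", now - timedelta(minutes=5)),
    Document("https://example.com/b", "widget price is 25 dollars", now - timedelta(days=30)),
    Document("https://example.com/c", "shipping policy update", now - timedelta(minutes=1)),
]
fresh = retrieve_fresh("current widget price", docs, now, max_age=timedelta(hours=1))
prompt = build_prompt("current widget price", fresh)
```

In this example, the month-old price is filtered out before the prompt is built, so only the recent figure reaches the model. The time spent in the retrieval step is also added to the wait the end user experiences, which is the latency concern Lenchner raises.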

Accessing fresh, AI-ready data at scale introduces technical and structural challenges. In practice, many enterprise systems combine public web retrieval with APIs, licensed datasets, and proprietary internal data in their AI applications. Integrating these fragmented sources into a timely and usable knowledge layer requires specialized capabilities. Some research has found that 97% of AI organizations depend on real-time web data infrastructure, but 90% feel boxed in by various restrictions. Companies are increasingly developing technical approaches to navigate these constraints.

Lenchner draws this metaphor: “Think of the trained model as intelligence and relevant data as knowledge. A powerful intelligence layer sitting on top of a hollow knowledge layer is like a genius who knows nothing—useless in practice. Intelligence and knowledge have to come together.”

The promise of new infrastructure

A new layer of web data infrastructure can address this developing need for stronger AI inputs by enabling discovery of data, real-time access, and tailoring to a specific context. As Lenchner describes it, “It’s all about collecting data at scale, super-low latency, without being blocked.”

Rather than relying on increased computing power, this type of platform emulates human browsing behavior to access available content and transform raw code into structured data feeds. It can work with websites that traditional scraping tools often cannot handle, such as those that rely heavily on JavaScript or deploy aggressive antibot software.
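The step of turning raw page code into a structured feed can be sketched in a few lines. This is a simplified illustration using only Python's standard library: the HTML snippet, the class names `product`, `product-name`, and `product-price`, and the `ProductParser` and `parse_price` helpers are hypothetical. It covers static HTML only and does not attempt the JavaScript rendering or browsing emulation described above.

```python
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collect name and price text from a hypothetical product listing."""

    # Maps the assumed CSS class names to output field names.
    FIELDS = {"product-name": "name", "product-price": "price"}

    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None  # field waiting for its text content

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "product":
            self.records.append({})  # start a new record
        elif css_class in self.FIELDS and self.records:
            self._field = self.FIELDS[css_class]

    def handle_data(self, data):
        if self._field and data.strip():
            self.records[-1][self._field] = data.strip()
            self._field = None


def parse_price(raw):
    """Convert a string such as '$1,299.50' to a float."""
    return float(raw.replace("$", "").replace(",", ""))


# Made-up markup standing in for a fetched page.
RAW_HTML = """
<div class="product"><span class="product-name">Blue Widget</span>
<span class="product-price">$1,299.50</span></div>
<div class="product"><span class="product-name">Red Widget</span>
<span class="product-price">$15.00</span></div>
"""

parser = ProductParser()
parser.feed(RAW_HTML)
feed = [
    {"name": r["name"], "price_usd": parse_price(r["price"])}
    for r in parser.records
]
```

The resulting `feed` is a list of dictionaries with typed fields, a form that downstream systems can consume far more easily than raw markup.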

As Lenchner explains, “It’s basically having infrastructure that can mimic a web user with identifying information—IP address, location, and 1,000 more parameters. And at scale. Think of doing that 80 billion times a day for millions of websites. And every single time, you are looking exactly as the website expects you to look.”

Of course, continuous retrieval introduces new data governance challenges. To address them, platforms can enforce strict compliance protocols aligned with global privacy frameworks, such as the EU’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). They can also be limited to openly accessible, public information, avoiding paywalls or private logins. Any networks used can be vetted and consent-based, and incentives can be provided to owners of IP addresses. In this way, systems can be designed to comply with tightening regulation.

Such complex capabilities do not come easily. “When this is critical infrastructure for a company,” Lenchner says, “doing it in-house becomes a full-time engineering problem that competes with the actual AI work.” Addressing this complexity requires organizations to commit significant resources, leading many to seek specialized platforms built for data retrieval, orchestration, and observability.

Infrastructure for the real world

Real-time data retrieval is changing what AI systems can do inside organizations. For example, a retail company can use public information to enable a dynamic pricing engine, and global brands can track trademark infringements.

As the ecosystem matures, organizations that invest in this emerging data infrastructure layer will be better positioned to build AI systems that are more responsive, reliable, and aligned with real-world conditions—AI systems that can continuously adapt using current web data. Over time, the distinction between AI models and the infrastructure that feeds them may even begin to disappear.

As Lenchner says, “The world is changing. And everything that is happening in the world is being uploaded to the public web. The amount of new data that is being generated is growing and accelerating.”

To learn more from Bright Data, read the Data for AI 2026 report.

This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff. It was researched, designed, and written by human writers, editors, analysts, and illustrators. This includes the writing of surveys and collection of data for surveys. AI tools that may have been used were limited to secondary production processes that passed thorough human review.