Leading AI researchers caution that training systems on internet data may be hitting its limits, raising concerns about the future of data-driven business models across the digital economy.

Warnings from former OpenAI chief scientist Ilya Sutskever about training data constraints, as reported by Reuters, have rattled technology markets. Speaking at the NeurIPS conference, Sutskever emphasized the need for innovative approaches, such as AI-generated data and enhanced reasoning capabilities, to advance artificial intelligence. He predicted that future AI systems will possess human-like reasoning abilities, making their behavior less predictable and necessitating a shift in AI development strategies.

But other experts argue current methods still have room to run, leaving companies to navigate competing visions of how to value and deploy AI systems that power everything from fraud detection to inventory management.

“Internet data is running out, and AI companies are feeling the pressure,” Arunkumar Thirunagalingam, senior manager of data and technical operations at the McKesson Corporation, told PYMNTS. “For years, they relied on scraping huge amounts of online content to train their systems. That worked for a while, but now the easy data is drying up.

“This shift is putting the spotlight on companies with unique data sources, like healthcare records or logistics information. It is no longer about how much data you can grab; it is about having the right kind of data.”

Coming Data Drought?

AI systems rely on vast amounts of data from the internet to train and improve. However, the pool of high-quality, diverse data is finite, and researchers may be nearing the limits of what’s available. As models grow larger and demand more input, the risk of recycling similar information increases, leading to diminishing returns. Additionally, much of the internet’s content is noisy or repetitive, reducing its usefulness for cutting-edge training.
This scarcity challenges researchers to seek alternatives, such as creating synthetic data, leveraging specialized datasets, or developing models that rely less on raw data and more on advanced reasoning capabilities.

With less internet data to scrape, companies are getting creative, Thirunagalingam said. They are turning to real-world sources like IoT devices and sensors to collect fresh information. Crowdsourcing platforms are paying people to share their unique insights, creating even more options.

“This shift is already making waves in farming, where AI uses real-time data to improve crop yields, and in urban planning, where city sensors help design smarter infrastructure,” he added. “Companies that once sat on overlooked datasets are now finding new ways to monetize them, from partnerships to licensing deals. What seemed unimportant before is now a goldmine, sparking fresh ideas and business models.”

Komninos Chatzipapas, founder of HeraHaven AI, acknowledged that the industry is running into a data wall.

“The biggest AI companies have basically already scraped everything on the internet,” he told PYMNTS. “Also, a lot of the new internet content being published is itself AI-generated (which cannot be used for training as it will reinforce the existing biases these AI models have), and more and more publishers are blocking scraping bots like GPTBot from crawling their sites via their robots.txt.”

AI’s Data Crisis: Publishers to the Rescue

For pre-training AI models, Chatzipapas said, the data wall primarily affects unstructured training data, such as news articles and forum discussions. Pre-training is the initial phase of AI model development, in which the model learns general language patterns and knowledge from vast amounts of text data before being fine-tuned for specific tasks.

“There is still work to be done on creating great structured data for training AI models,” he added.
Such structured data could include, for example, highly complex math and science problems solved step by step so that the AI model can learn to reason, he said.

One solution to the data drought is emerging through deals with academic publishers, who are offering their scholarly articles in exchange for millions of dollars. Microsoft’s recent $10 million deal with Taylor & Francis opened the floodgates for AI companies to tap into academic publishers’ vast research archives.

The post AI Training Debate Raises Stakes for Digital Economy appeared first on PYMNTS.com.