Nvidia has been caught up in a controversy over its data scraping practices amid efforts to scale up its AI models. Internal communications obtained by The Intercept revealed a massive data-scraping project, internally codenamed “Cosmos,” used to train the company’s commercial AI products on video data. The project marked a new high-water mark in data collection: between 20 and 30 virtual machines on Amazon Web Services downloaded the equivalent of 80 years’ worth of video daily. In just over a month, Nvidia downloaded more than 30 million URLs, drawing on sources ranging from popular platforms such as YouTube and Netflix to specialized databases like MovieNet, internal video game footage libraries, and GitHub video datasets.
Nvidia employees took several steps to sidestep potential legal obstacles. They frequently downloaded datasets marked for academic or non-commercial use, attempting to stay within a gray area of copyright law, and they routed YouTube downloads through Google’s cloud service to avoid directly violating YouTube’s terms of service.
The scale of Nvidia’s data scraping raises both legal and ethical questions. Copyright law traditionally protects creative works, and using such content to train AI without explicit permission may constitute infringement. Lawyers note that while some uses might qualify as fair use, particularly when the data is substantially transformed or used for non-commercial research, the line blurs when the end goal is a commercial AI product.
The practice is also ethically contested because it harvests publicly available content without consent. AI researchers stress the importance of respecting intellectual property rights and of transparent data acquisition processes. Indiscriminate scraping could erode trust between content creators and AI developers, ultimately fueling louder calls for stricter regulation.
Content creators whose work may have been swept up in Nvidia’s scraping are troubled that it happened without their consent. Most creators invest significant time and money in producing a video; seeing it repurposed for commercial AI with zero compensation is problematic at best.
Legal experts point out that existing copyright laws offer limited protection here because they were never designed to address the issues AI now raises. Current legal frameworks have been outpaced by the rapid advance of AI technology and need updated regulations to meet the new challenges of data acquisition for AI training.
The Nvidia data-scraping case illustrates the legal and ethical tangles involved in AI development. Vast volumes of data are undeniably needed to train AI models, but how that data is acquired is coming under increasing scrutiny. Innovation and ethical responsibility sit on a fine line, and companies must tread carefully to avoid legal pitfalls and preserve public confidence.