Tech Giants Train AI on YouTube Content Without Permission
Apple, Salesforce, Anthropic, and other major technology companies used video transcripts from thousands of YouTube creators to train their AI models without the creators’ consent, a practice that may violate YouTube’s terms of service, according to reporting from Proof News and Wired.
These companies relied on “The Pile,” a dataset assembled by the nonprofit EleutherAI. The Pile was created to give smaller companies and independent researchers, who lack Big Tech’s resources, access to a large training dataset, though larger companies have used it as well. It draws on a variety of sources, such as books, Wikipedia articles, and captions scraped from 173,536 YouTube videos across more than 48,000 channels, including content from popular creators like MrBeast, PewDiePie, and Marques Brownlee, as well as mainstream media channels.
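For context, The Pile is distributed as plain jsonlines files in which each record carries a text field plus a meta field naming the component it came from. A minimal sketch of how a consumer might count the YouTube caption documents in one shard (the shard path below is a placeholder, and the component label follows the naming used in the Pile paper):

```python
import json

# Placeholder path to a local Pile shard; the real dataset ships as
# large jsonlines shards, but this filename is an assumption.
PILE_SHARD = "pile/train/00.jsonl"

count = 0
with open(PILE_SHARD, encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Each Pile record labels its source component in "meta".
        # "YoutubeSubtitles" is the caption component's documented name.
        if record.get("meta", {}).get("pile_set_name") == "YoutubeSubtitles":
            count += 1

print(f"YouTube caption documents in this shard: {count}")
```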
Marques Brownlee’s Response
Marques Brownlee acknowledged the complexity of the situation, noting that while Apple used the dataset, it did not collect the data itself. He emphasized that this issue would continue to evolve over time.
Impact on Various Channels
The dataset includes videos from numerous mainstream and online media brands, such as those produced by Ars Technica and other Condé Nast brands like Wired and The New Yorker. Ironically, one of the videos included was an Ars Technica-produced short film jokingly claiming to be written by AI.
The Challenge of AI Content Proliferation
As AI-generated content becomes more common, assembling datasets free of AI-produced material will become increasingly difficult. Although The Pile itself is not new, its widespread use by tech companies has drawn multiple lawsuits from intellectual property owners. Defendants, including OpenAI, argue that this kind of data scraping falls under fair use, but the cases have yet to be resolved in court.
Specific Findings by Proof News
Proof News uncovered detailed information about the use of YouTube captions and created a tool for searching The Pile for individual videos or channels. This work highlights the extensive nature of data collection and how little control content creators have over their work once it’s on the web.
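Proof News has not published its search tool’s internals, but a lookup like theirs can be approximated conceptually with a keyword scan over the same jsonlines records. The sketch below is purely illustrative, not their implementation:

```python
import json
import sys

def find_mentions(shard_path: str, query: str, limit: int = 5) -> None:
    """Print a short excerpt from each YouTube caption record in a
    Pile shard that mentions `query` (case-insensitive)."""
    shown = 0
    with open(shard_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record.get("meta", {}).get("pile_set_name") != "YoutubeSubtitles":
                continue
            text = record["text"]
            pos = text.lower().find(query.lower())
            if pos >= 0:
                # Show a little context around the match.
                print(text[max(0, pos - 40):pos + 80].replace("\n", " "))
                shown += 1
                if shown >= limit:
                    return

if __name__ == "__main__":
    # Usage: python search_pile.py pile/train/00.jsonl "SciShow"
    find_mentions(sys.argv[1], sys.argv[2])
```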
Potential Uses and Ethical Concerns
It’s not always clear whether the data was used to generate content that competes directly with creators’ work. Apple, for instance, might have used the dataset for research or to improve text autocomplete features on its devices.
Reactions from Content Creators
Content creators expressed surprise and frustration upon learning their work had been used without permission. David Pakman, host of The David Pakman Show, pointed to the effort and resources that go into his content and criticized the unauthorized use of his work. Julia Walsh, CEO of Complexly, the production company behind SciShow and other educational content from Hank and John Green, likewise objected to the use of their educational videos without consent.
Legal and Ethical Questions
The scraping of YouTube content may violate YouTube’s terms, which prohibit accessing videos through automated means. EleutherAI founder Sid Black has said he used a script that downloads captions from YouTube’s API in the same way a web browser does.
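Black’s exact script is not public, but the general approach is easy to reproduce with off-the-shelf tooling. As a rough illustration, the third-party youtube-transcript-api package (an assumption for this sketch, not necessarily what EleutherAI used; the static call shown matches the package’s pre-1.0 releases) fetches the same caption data a browser requests when subtitles are turned on:

```python
# pip install youtube-transcript-api
from youtube_transcript_api import YouTubeTranscriptApi

def captions_as_text(video_id: str) -> str:
    """Fetch a video's caption track and flatten it to plain text,
    roughly the form caption data takes inside The Pile."""
    # get_transcript returns a list of {"text", "start", "duration"} dicts.
    snippets = YouTubeTranscriptApi.get_transcript(video_id)
    return " ".join(snippet["text"] for snippet in snippets)

if __name__ == "__main__":
    print(captions_as_text("dQw4w9WgXcQ"))  # any public video with captions
```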
Responses from Companies and Google
Anthropic, one of the companies that used The Pile, argued that doing so did not violate YouTube’s terms, since those terms govern direct use of the platform rather than use of a third-party dataset. A Google spokesperson said Google has taken measures to prevent unauthorized scraping but did not provide specifics.
Ongoing Controversy
This is not the first time AI and tech companies have faced criticism for training models on YouTube videos without permission. OpenAI, for example, is widely believed to have used YouTube data to train its models, although not all of the allegations have been confirmed. In an interview with The Verge’s Nilay Patel, Google CEO Sundar Pichai suggested that using YouTube videos to train an AI model like OpenAI’s Sora would violate YouTube’s terms. Whether scraping captions through the API falls under the same prohibition, however, is a murkier question.