INTEL DOSSIER

Copyrighted Works in Training Data and IP Infringement:

Lessons from a generative AI (GAI) copyright infringement case (US)

 

The rise of generative AI software has revolutionized creative industries, enabling developers to build tools that produce art, text, and music at the click of a button. Yet these advances have sparked contentious legal debates about copyright, particularly when copyrighted works are used to train AI models. Andersen v. Stability AI Ltd. has brought these issues to the forefront and begun to clarify key legal boundaries for developers and artists alike. Although many issues have not yet been definitively decided, the rulings so far indicate that developers should be aware of the legal risks involving training data and copyright infringement. Here, I focus on two questions raised by the rulings: the use of protected works in training data, and what counts as a protected work.

 

The Ruling in Andersen v. Stability AI Ltd.
 

On October 30, 2023, the U.S. District Court for the Northern District of California issued a pivotal ruling in the Andersen case. The plaintiffs, visual artists Sarah Andersen, Kelly McKernan, and Karla Ortiz, accused Stability AI, Midjourney, and DeviantArt of copyright infringement. They alleged that their artworks were included in the datasets used to train generative AI software without consent.
 

Here, the court held that registration is a precondition for artists to bring copyright infringement claims. Under 17 U.S.C. § 411(a), a civil action for infringement of a U.S. work generally cannot be filed until the work has been registered with the U.S. Copyright Office; the exclusive rights themselves are set out in 17 U.S.C. § 106. Claims based on unregistered works were therefore dismissed. The decision underscores that, although copyright arises when a work is created, registration is needed to enforce it in court, including in AI-related disputes.
 

Moreover, in a later ruling on the artists' first amended complaint (August 2024), the court found that the claim for induced copyright infringement against the developer of the text-to-image generator was sufficiently pled. In this putative class action, the artists allege that the developer used their works as training images so that the generator could produce output images in the style of those works. The complaint alleges that the developer helped to train and develop AI software that was trained on the artists' works, and therefore knew that the software used or invoked those works in its operation. It further alleges that the developer actively induced others to download the software by distributing it through popular coding websites and by making and selling its own generator that included the software.
 

The court addressed several nuanced aspects of copyright law, including direct and induced infringement claims. Notably, while the plaintiffs successfully pled claims for induced infringement, the court dismissed certain direct infringement claims against the platform operators. These rulings emphasized the complexities of proving substantial similarity between training data and AI-generated outputs.
 

The rulings show that the court recognizes the rights of copyright owners whose works are registered, and that using protected works in training data or a library creates real legal risk and the prospect of viable claims. What, then, should developers do to mitigate these risks? It helps to set a policy for protecting IP rights and to build protective measures and procedures in from the design phase. Consider all relevant laws, regulations, and rights of others, and plan from the beginning how issues will be resolved.

 

The lists that follow are not exhaustive; other issues, such as privacy, monitoring, management, and operations, also need to be considered. They are, however, a good starting point for thinking about how to manage training data.

 

Legal Challenges in Using Copyrighted Works for AI Training
 

Using copyrighted materials to train AI presents a myriad of legal challenges:
 

  1. Exclusive Rights Under Copyright Law:  Section 106 grants copyright owners exclusive rights to reproduce, distribute, and create derivative works. AI developers who scrape copyrighted content for training may infringe these rights unless they obtain proper authorization.

  2. Ownership and Derivative Works:  The question of whether AI-generated outputs constitute derivative works of copyrighted materials remains unresolved. Courts have generally required substantial similarity between the original work and the derivative to establish infringement. In the Andersen case, the court found that not all AI-generated outputs plausibly relied on copyrighted data.

  3. Fair Use Defense:  Many developers argue that training AI models constitutes fair use, especially if the use is transformative or for educational purposes. However, courts have yet to establish consistent guidelines for applying the fair use doctrine in the context of AI training.

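  4. Digital Millennium Copyright Act (DMCA) Compliance:  The DMCA imposes additional obligations, particularly regarding copyright management information (CMI). Section 1202 (17 U.S.C. § 1202) prohibits providing false CMI, or intentionally removing or altering CMI, where doing so would induce, enable, facilitate, or conceal infringement. Developers should therefore take care not to strip CMI, such as author names, titles, and copyright notices, from works during data collection and preprocessing, as doing so could lead to DMCA liability.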

  5. Government Guidelines:  The U.S. Copyright Office has launched an initiative examining the copyright law and policy issues raised by AI, including the scope of copyright in AI-generated works and the use of copyrighted materials in AI training. (https://www.copyright.gov/ai/)

 

Practical Steps for Software Developers to Mitigate Liability
 

Given the legal uncertainties, software developers must take proactive measures to reduce the risk of copyright liability:
 

  1. Conduct a Comprehensive Data Audit
    Before using datasets, developers should ensure that all materials are either in the public domain, licensed, or otherwise cleared for use. Tools that trace the origins of dataset components can be invaluable.

  2. Secure Explicit Licenses
    When using copyrighted materials, seek explicit licenses from copyright holders. Licensing agreements should clearly outline the scope of permissible use, including whether training AI models is covered.

  3. Implement Robust Documentation Practices
    Maintain detailed records of the datasets used, including sources and licenses. Transparent documentation can demonstrate good-faith efforts to comply with copyright law.

  4. Consider Fair Use Limitations
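    If relying on fair use, developers should conduct a risk analysis based on the four statutory factors in 17 U.S.C. § 107: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the market for the original. Courts are more likely to find fair use when the use is transformative and non-commercial.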

  5. Engage in Collaborative Solutions
    Collaborate with artists, industry groups, and policymakers to develop ethical guidelines for AI training. Voluntary frameworks can help establish best practices and reduce the likelihood of legal disputes.

  6. Incorporate Technical Safeguards
    Use filters to exclude copyrighted materials from datasets or implement tools that prevent the generation of outputs closely resembling specific copyrighted works.

  7. Monitor Evolving Jurisprudence
    The legal landscape surrounding AI and copyright is still evolving. Developers should regularly review court decisions and adapt their practices to align with emerging legal standards.
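
Steps 1, 3, and 6 above (data audit, documentation, and technical safeguards) can be combined in a simple pre-training check. The following is a minimal sketch in Python, not a definitive implementation and not a substitute for legal review. Every name in it is a hypothetical assumption introduced for illustration: the DatasetItem fields, the ALLOWED_LICENSES tags, and the audit and provenance_manifest functions. It also assumes that a license tag was recorded for each item at collection time; which licenses actually permit AI training is a legal question the code cannot answer.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical allow-list: license tags this project has decided it may train on.
# The tags themselves are illustrative assumptions, not legal categories.
ALLOWED_LICENSES = {"public-domain", "cc0", "licensed-for-training"}


@dataclass(frozen=True)
class DatasetItem:
    item_id: str
    source_url: str
    license: str        # license tag recorded when the item was collected
    rights_holder: str  # empty string when unknown


def audit(items):
    """Split items into (cleared, flagged) using the allow-list."""
    cleared, flagged = [], []
    for item in items:
        if item.license.lower() in ALLOWED_LICENSES:
            cleared.append(item)
        else:
            flagged.append(item)
    return cleared, flagged


def provenance_manifest(cleared, flagged):
    """Return a JSON record of what was kept, what was excluded, and why."""
    record = {
        "cleared": [asdict(i) for i in cleared],
        "excluded": [
            {**asdict(i), "reason": f"license '{i.license}' not on allow-list"}
            for i in flagged
        ],
    }
    return json.dumps(record, indent=2, sort_keys=True)


# Example usage with made-up items.
items = [
    DatasetItem("img-001", "https://example.org/a.png", "CC0", ""),
    DatasetItem("img-002", "https://example.org/b.png", "all-rights-reserved", "Jane Artist"),
    DatasetItem("img-003", "https://example.org/c.png", "unknown", ""),
]
cleared, flagged = audit(items)
print(provenance_manifest(cleared, flagged))
```

Items with an unknown or unlisted license are excluded by default rather than included, so gaps in the collection records surface as flagged entries instead of silently entering the training set. The manifest can be stored alongside the dataset as part of the documentation described in step 3.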

 

Conclusion
 

The Andersen v. Stability AI ruling underscores the critical importance of compliance with copyright registration requirements and highlights the legal challenges associated with using copyrighted works in AI training. While the case offers some clarity, many questions remain unanswered, particularly regarding fair use and derivative works.
 

To navigate these complexities, software developers must adopt a proactive approach that combines legal compliance, ethical considerations, and technical safeguards. By doing so, they can minimize liability risks while fostering innovation in a legally responsible manner. As courts continue to address these issues, staying informed and adaptable will be key to thriving in the rapidly evolving AI landscape.

 


Written by: Jeemyung Hong, 08 December 2024