The transparency of training data is worth supporting, but it cannot solve copyright-related issues.
The development of artificial intelligence is fundamentally changing data processing and the information society. Various AI applications have brought many benefits, but their imprudent use has also raised well-documented problems. This has prompted EU legislators to act. The Artificial Intelligence Act (AI Act) has been in preparation for several years.
The regulatory proposal first took concrete form on April 21, 2021, when the Commission presented its original proposal for regulating artificial intelligence. Preparation then continued in the usual way, with the Council of the EU, under successive rotating presidencies, advancing the work by drafting compromise proposals.
While the regulation was still under preparation, however, the world changed significantly in the autumn of 2022. ChatGPT gathered 100 million users within a month of its launch, and large language models and generative artificial intelligence became commonplace at an unprecedented speed (Tuomi 2023). The AI Act was never originally intended to regulate generative artificial intelligence specifically.
The Parliament acted hastily
The EU Parliament approved its negotiating position on the draft regulation on June 14, 2023. At that stage, the Parliament was the only body in the legislative process that still had time to react to the technological developments. It proposed amendments to the draft aimed specifically at regulating the foundation models that underlie generative artificial intelligence. At the same time, copyright regulation found its way into the AI Act.
Although the Parliament’s motives are understandable, the result is hasty and problematic in many respects. The Parliament’s negotiating position appears to include provisions shaped by the strong lobbying of right-holder organisations and, to some extent, by an “artificial intelligence panic”. These provisions are foreign to copyright doctrine and potentially harmful to the information society. The key issue concerns the amendment to Article 28b, specifically point (c), which would obligate:
“The providers of generative artificial intelligence to document and make publicly available a sufficiently detailed summary of the use of training data protected under copyright law.” (AI Act, Article 28b, point (c))
The European Parliament’s proposed amendment would require the provider of a machine learning system to list the copyrighted works used to train the model. The idea is understandable but unfeasible.
Why the proposal doesn’t work – practical problems
First, consider the vast amount of information required to train generative artificial intelligence. Large language models are trained on massive datasets containing billions of images and texts, and a single file may contain hundreds of works.
“In the training of GPT-3.5, approximately 300 billion words were used, mostly gathered from the internet. For GPT-4, perhaps 1,000 billion. CommonCrawl contains 240 billion pages of text, and 3–5 billion more are added monthly. Large language models are built on these masses of data.” — Chief Scientist Ilkka Tuomi to the Education and Culture Committee, September 2023
Now, with a stroke of the pen, the EU Parliament is proposing the establishment of vast private copyright registries. Until now, no such requirement has existed in Europe or anywhere else in the world.
“In reality, this requirement would be impossible to meet, as it would essentially amount to disclosing a summary of all the content available on the internet.” — Computer & Communications Industry Association
The question arises: what would, in practice, constitute a “sufficiently detailed summary” of what are likely billions of works above the threshold of copyright protection included in the training data?
Secondly, works found on the internet often lack comprehensive copyright and authorship information. Determining whether a specific piece of training data is protected by copyright frequently requires a legal judgment based on incomplete information. Given the masses of data used in generative AI training, such determinations would have to be made automatically. Yet the task can be challenging even for a copyright lawyer in an individual case, and experts are unanimous that it is impossible to build an application capable of making such determinations reliably. AI is good at many things, but it cannot and should not act as a copyright judge.
The proposed amendment is in the wrong place at the wrong time
In legal terms, the proposed amendment would require disclosing what copyrighted material an artificial intelligence has accessed during its training. While one might initially think it wise to facilitate the exercise of right-holders’ prohibition rights, the Parliament’s proposal is not worth supporting from this perspective either.
In 2023, it remains legally unclear how courts will treat the use of copyrighted works as training data. When the EU last amended the Copyright Directive, generative AI applications such as ChatGPT were not yet in common use. As a result of the directive, the national legislature amended the Copyright Act in March 2023 to allow text and data mining:
“Anyone with lawful access to a work may make copies of it for the purpose of text and data mining and may keep copies exclusively for that purpose unless the author has expressly and appropriately reserved this right.” (Copyright Act, Section 13b)
Although right-holder organisations unanimously insist that training involves reproduction (a highly relevant act in copyright terms), the matter has yet to be conclusively decided. Several legal cases on the subject are pending in different EU countries. None of the disputes has yet reached the European Court of Justice, and obtaining a resolution will take time.
The Artificial Intelligence Act should not address copyright-related issues
For centuries, copyright has prohibited copying and distribution. Viewing, learning, absorbing influences, and imitating styles, on the other hand, have so far been permitted. The question of what constitutes viewing and what constitutes copying strikes at the heart of copyright’s fundamental principles. Any guidelines here must be prepared with extreme care and consideration.
The Artificial Intelligence Act, an entirely separate instrument, should steer clear of interpreting the Copyright Directive. The authority to issue interpretations of the Copyright Directive lies with the Court of Justice of the European Union; alternatively, EU legislators could make changes by amending the Copyright Directive itself. Moreover, the Copyright Directive left implementation to the legislators of the member states, whereas the AI Act now threatens to introduce copyright matters through directly applicable legislation.
Collecting data on the use of copyrighted material is a clear first step towards collecting licensing fees or other financial compensation. The issue is therefore also economically significant and will likely have unpredictable effects on the development of AI applications across the entire EU internal market.
Promoting innovation and copyright should be carefully balanced
In Finland, the Parliamentary Committee for Education and Culture supported the EU Parliament’s proposal after consulting primarily right-holder organisations. The Grand Committee, by contrast, heard a broader group of experts and paid attention to whether the obligation proposed by the European Parliament, to publish a summary of the material used to train general-purpose AI systems, is feasible in practice. The Grand Committee’s observation that the proposal’s relationship to existing copyright legislation should be clarified in further preparation is also correct. Solutions should not be built solely on the wishes of right-holder organisations, however understandable those wishes are from their perspective.
Transparency in training data is good and commendable. However, the means chosen to achieve it should not be an infeasible model in which the AI provider must be able to distinguish copyrighted material from other material. Ultimately, such obligations undermine the regulation’s key goal of promoting innovation and violate copyright doctrine.
On the latter point, one must be wary of recent developments in which commercial interests have chipped away at a copyright system that has evolved over centuries. Legislative unpredictability does not reverse technological progress; instead, it can significantly affect how the European digital market develops. One example is the doctrinally unsound “link tax,” which extended publishers’ rights to cover headlines. It is probably no coincidence that the social media service formerly known as Twitter removed headlines from links to several news websites in 2023. This left not only ordinary users but also journalists disappointed.
“By the way, this makes most of the posts quite cryptic and lacking context. It’s harder to understand what on earth is being commented on when you see, for example, just Purra’s picture – not the headline of the HS article and what this is related to. And it’s quite problematic in many ways.” https://t.co/Fxj95c94gk— Elina Lappalainen (@ElinaLappalaine) October 5, 2023
Immediate action is needed to fix the AI Act
The final content of the regulation is still open. Member states, including Finland, can still prevent the mistakes made in the Parliament’s hastily drafted article on the special obligations of generative AI providers.
The key is to understand that even well-intentioned rules are meaningless if they are impossible to follow in practice. The problem of asymmetric information cannot be solved with an obligation that cannot be complied with. Such regulation serves no one and, from the perspective of regulatory theory, is of poor quality.
The centuries-old idea of copyright regulation, the exchange between creators, the public, and society that underlies it, and the case law supporting it risk being overshadowed by the short-sighted regulation of a few services, pushed through by lobbyists.