June Edition 2023

Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on the interaction between computers and humans through language. NLP algorithms learn by being trained on large volumes of text data, which enables them to understand, interpret, and generate human language in a meaningful and useful way. Vast databases are necessary because human language is complex and variable: the larger and more diverse the database, the better the algorithm can understand and generate human language. This learning process results in a "trained model", a separate file where the relevant information is stored. However, creating these databases often involves copying large amounts of text from various sources, so the creation process itself can potentially infringe copyright.
The opinion of the Ministry of Justice suggests that the creation of ML databases or datasets could potentially be considered "fair use" under the Copyright Law, falling under the categories of "self-learning" and "research". This interpretation aligns with the spirit of the law, as ML is essentially a form of inductive self-learning. The only difference between human learning and ML is the technical process of learning, which should not be a barrier to the application of "fair use".
The Ministry's opinion further discusses potential market failures and prohibitive transaction costs that could arise in AI enterprises due to copyright issues. Creating an effective dataset would require negotiating with each copyright owner, a process that could be time-consuming, costly, and practically impossible. Delays imposed by any single rightsholder could completely frustrate the entire project, given the competitive constraints and ambitious milestones common in entrepreneurial ventures.
The Ministry's opinion suggests shielding from liability the creation of ML datasets that include vast and diverse copyrighted works, since, arguably, each individual work in such a dataset carries relatively immaterial weight. The result of this approach is an ex-ante statement declaring that the creation of datasets for ML, in most cases, falls under the fair use doctrine. An ex-ante statement might seem unusual, as fair use decisions are typically made retroactively, after the unauthorized use of copyrighted content, but it could be a necessary one given the unique challenges posed by ML. While the Ministry's opinion may mark the direction of the Ministry's