Solana Larsen’s scientific contributions


Publications (2)


Figure 1: Tiers of openness of datasets for LLM training.
Figure 3: Flowchart illustrating the process of analyzing copyright office records to identify books that entered the public domain because their copyright was not renewed. The two input nodes are labeled in blue, and the output nodes are classified as clear exclusions (red), requiring more investigation (purple), or believed to be in the public domain (green). Each node is labeled with the number of works at that stage.
Towards Best Practices for Open Datasets for LLM Training
  • Preprint
  • File available

January 2025 · 49 Reads

Stefan Baack · Stella Biderman · Kasia Odrozek · [...] · Thomas Wolf

Many AI companies are training their large language models (LLMs) on data without the permission of the copyright owners. The permissibility of doing so varies by jurisdiction: in jurisdictions such as the EU and Japan, this is allowed under certain restrictions, while in the United States the legal landscape is more ambiguous. Regardless of the legal status, concerns from creative producers have led to several high-profile copyright lawsuits, and the threat of litigation is commonly cited as a reason for the recent trend, among both corporate and public interest actors, towards minimizing the information shared about training datasets. This trend harms the broader ecosystem by hindering transparency, accountability, and innovation, denying researchers, auditors, and impacted individuals access to the information needed to understand AI models. While this could be mitigated by training language models on open access and public domain data, at the time of writing there are no such models trained at a meaningful scale, owing to the substantial technical and sociological challenges of assembling the necessary corpus. These challenges include incomplete and unreliable metadata, the cost and complexity of digitizing physical records, and the diverse set of legal and technical skills required to ensure relevance and responsibility in a quickly changing landscape. Building towards a future where AI systems can be trained on openly licensed data that is responsibly curated and governed requires collaboration across legal, technical, and policy domains, along with investments in metadata standards, digitization, and fostering a culture of openness.


Citations (1)


... from the Berkman Klein Center for Internet & Society at Harvard University and another by Hagendorff (2019). Other recent reports include the general framework developed in Ricks et al. (2020) and faith-based frameworks including Moore et al. (2019) for the Ethics and Religious Liberty Commission of the Southern Baptist Convention or the Catholic Church in the Rome Call for AI Ethics (2020), signed by IBM and Microsoft. These frameworks address different overlapping principles, some of which are relevant to medical ethics as described below. Fjeld et al. (2020), for example, classify 35 different ethical frameworks in the context of AI into eight themes: Privacy, Accountability, Safety and Security, Transparency and Explainability, Fairness and Non-discrimination, Human Control of Technology, Professional Responsibility, and Promotion of Human Values. ...

Reference:

AAAS (2021). Artificial Intelligence and COVID-19: Applications and Impact Assessment. Report prepared by Ilana Harrus and Jessica Wyndham under the auspices of the AAAS Scientific Responsibility, Human Rights and Law Program.
Creating Trustworthy AI: A Mozilla white paper on challenges and opportunities in the AI era.