Harvard and Google to Release AI Training Dataset of 1 Million Public-Domain Books

Harvard University is preparing to release a dataset of approximately 1 million public-domain books that anyone can use to train large language models and other AI tools. The collection spans multiple genres, languages, and authors, including works by Dickens, Dante, and Shakespeare. Because the books are old enough to be out of copyright, the release aims to make AI training resources accessible beyond big tech firms.

The books come from Google’s long-running Google Books scanning project, so Google will play a role in distributing the dataset. The release timeline and method remain unclear, but the dataset is positioned as a significant resource for AI research and development.

Harvard first announced its Institutional Data Initiative (IDI) in March 2024. The initiative is intended to act as a “trusted conduit for legal data for AI.” Its formal launch is backed by funding from Microsoft and OpenAI.

Greg Leppert, the IDI’s executive director, explained that the dataset is designed to “level the playing field” by giving access to smaller research labs, startups, and other entities looking to train large language models (LLMs).

Further details about the dataset and its release process are expected soon. Stay updated by following developments from Harvard’s Institutional Data Initiative and Google.