Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft

December 11, 2024

Harvard University announced Thursday it’s releasing a high-quality dataset of nearly one million public-domain books that could be used by anyone to train large language models and other AI tools. The dataset was created by Harvard’s newly formed Institutional Data Initiative with funding from both Microsoft and OpenAI. It contains books scanned as part of the Google Books project that are no longer protected by copyright.

Around five times the size of the notorious Books3 dataset that was used to train AI models like Meta’s Llama, the Institutional Data Initiative’s database spans genres, decades, and languages, with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks

→ Continue reading at WIRED
