
Meet RedPajama, a project for open source LLMs
DukeRem, 23 April 2023
  Foundation models such as #GPT-4 have revolutionized the #AI landscape, yet most of these cutting-edge models are either closed commercial offerings or only partially open. Today, we're thrilled to announce a groundbreaking initiative to bridge this gap: #RedPajama, a project designed to create a suite of fully open-source models. The project's first milestone, reproducing #LLaMA's training dataset of over 1.2 trillion #tokens, has been completed.
  As commercial #APIs continue to restrict research, customization, and usage with sensitive data, the AI community is witnessing a shift toward open-source projects. This movement, reminiscent of Linux's impact on the software world, has paved the way for semi-open models like LLaMA, #Alpaca, #Vicuna, and #Koala, as well as fully open models such as Pythia, OpenChatKit, Open Assistant, and Dolly.
  RedPajama aspires to solidify the open-source movement and foster innovation by offering a leading, fully open-source language model. Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and the MILA Québec AI Institute are collaborating on this ambitious endeavour. RedPajama comprises three key components:
  1. High-quality pre-training data with broad coverage
  2. Base models trained at scale on this data
  3. Instruction tuning data and models to enhance usability and safety
This announcement marks the release of the first component: the pre-training data. RedPajama's starting point is LLaMA, which pairs a training dataset of 1.2 trillion tokens with a 7-billion-parameter model capable of running on a wide array of GPUs. However, LLaMA and its derivatives are restricted to non-commercial research purposes. RedPajama's objective is a fully open-source reproduction of LLaMA, opening the door to commercial applications and transparent research pipelines.
The RedPajama base dataset is now available for download via Hugging Face (see the sketch at the end of this article). Comprising ~5TB unzipped on disk and ~3TB compressed, it includes seven data slices: CommonCrawl, C4, GitHub, arXiv, Books, Wikipedia, and StackExchange. Each slice undergoes meticulous pre-processing and filtering, and the quality filters are available on GitHub, so anyone can reproduce RedPajama-Data-1T. In partnership with the Meerkat project, a Meerkat dashboard and embeddings for exploring the GitHub subset of the corpus have also been released; the dashboard is available for installation and direct use from GitHub.
Looking forward, RedPajama's next steps are training a robust base model and instruction tuning. The INCITE program, supported by the Oak Ridge Leadership Computing Facility (OLCF), will help train a full suite of models, the first of which are set to be released in the coming weeks. Building on OpenChatKit's hundreds of thousands of high-quality natural user instructions, instruction-tuned versions of the RedPajama models will also be made available, unlocking their full potential.
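For readers who want to inspect the corpus directly, here is a minimal sketch of pulling one slice with the Hugging Face datasets library. The dataset id ("togethercomputer/RedPajama-Data-1T") and the subset name ("arxiv") are assumptions based on the Hugging Face release at the time of writing, and the "text" field name is likewise an assumption about the per-record schema. Because the full corpus is ~3TB compressed, the sketch streams records rather than downloading everything up front.

```python
# Minimal sketch: stream one slice of RedPajama-Data-1T.
# Assumptions: dataset id "togethercomputer/RedPajama-Data-1T",
# subset name "arxiv", and a "text" field per record, as published
# on Hugging Face at release time.
from datasets import load_dataset

# streaming=True avoids downloading the multi-terabyte corpus up front
dataset = load_dataset(
    "togethercomputer/RedPajama-Data-1T",
    "arxiv",            # one of the seven data slices
    split="train",
    streaming=True,
)

# Peek at a few records without materializing the whole slice
for i, record in enumerate(dataset):
    print(record["text"][:200])  # first 200 characters of each document
    if i >= 2:
        break
```

Swapping "arxiv" for another subset name selects a different slice, and dropping streaming=True would instead download and cache the data locally, which is only advisable with several terabytes of free disk.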