Mayank Mishra, Taishi Nakamura, Simone Tedeschi, Yekun Chai, Jason T Stillerman, Tanmay Laud, Felix Friedrich, Prateek Yadav, Minh Chien Vu, Terry Yue Zhuo, Diganta Misra, Dung Nguyen, Nam Pham, Ben Bogin, Xuan-Son Vu, Marzena Karpinska, Arnav Varma Dantuluri, Wojciech Kusa, Tommaso Furlanello, Niklas Muennighoff, Suhas Pai, Tosin Adewumi, Veronika Laippala, Xiaozhe Yao, Adalberto Junior, Alpay Ariyak, Aleksandr Drozd, Jordan Clive, Kshitij Gupta, Liangyu Chen, Peter Szemraj, Qi Sun, Ken Tsui, Noah Persaud, Nour Fahmy, Tianlong Chen, Mohit Bansal, Arnav Dantuluri, Nicolò Monti, Tai Dang, Ziyang Luo, Tien-Tung Bui, Matthew Blumberg, Erik Orth, Ray Tam, Rio Yokota, Robin Graham, TeH_Venom, KoboldHenk, Yu Hou, Yuchen Lu, Victor May, Huu Nguyen, Sampo Pyysalo (equal contribution)

On Jan 24, 2024, Ontocord.AI and the MDEL open-source community quietly released a preview version of our model, Aurora-M. Aurora-M is an open-source 15.5B-parameter model with multilingual and coding capabilities. In this blog we further describe our efforts to create smarter and more lawful AI for everyone. Aurora-M is an extended pretrained version of the StarCoderPlus model, trained on an additional 435B tokens, bringing the total training to approximately 2T tokens.

Aurora-M is proficient at coding, has strong multilingual performance, is familiar with a range of specialized domains, and is safe by design. It was trained on Japanese, English, Vietnamese, Hindi, and Finnish language data. Domain knowledge in the datasets includes chemical SMILES formulae, financial data, legal contracts, political debates, climate change data, ABC music notation, coding, math, and many other domains.

To our knowledge, Aurora-M is the first open-source model to be red-teamed according to the requirements of the Biden-Harris Executive Order, and we have also tried to align it to general safety standards. Our contribution is a methodology, and a model, that retains much of its English and coding abilities while adding state-of-the-art (SOTA) or near-SOTA results in multilingual settings. The model is also red-teamed for modern AI laws while retaining helpfulness, without, we believe, exaggerated safety.

We trained the model on the LUMI supercomputer, using compute resources generously provided by CSC - IT Center for Science, Finland. We thank them and all the participants of the Aurora-M and MDEL efforts. We would also like to thank the wonderful BigCode team for releasing and developing the StarCoder models in the open.

We release three different models:

We use about 1.5TB of text data from The Stack, RefinedWeb, RedPajama-1, and the Pile, along with specific datasets created as part of the MDEL efforts. These datasets contain text in Japanese, English, Vietnamese, Hindi, and Finnish. The data was cleaned using standard methods similar to those used for CulturaX and CulturaY. We also used the RedPajama fastText filter, which was trained on Wikipedia-linked articles, to filter out documents that look unlike such reference text (a minimal sketch of this kind of filter appears at the end of this section). In addition, we created fastText filters for Japanese, Finnish, Vietnamese, and Hindi from linked Wikipedia articles, but found these less effective because Wikipedia articles in those languages are much more limited. We therefore tried to find other reference text to serve as "good" text in those languages, with varying results, which we will explain in our dataset card. In particular, we could not find a satisfactory "good" source of text for Finnish, so we only applied standard data cleaning to Finnish.

We also mixed in publicly available instruction tuning datasets, including the OIG dataset, OpenAssistant, Pseudo-Code Instructions, Gorilla, and others, in two stages. In the first stage, we used lower-quality but more generic instructions; in the second stage, we used higher-quality instructions, chat data such as UltraChat, and a safety instruction dataset we created ourselves, the Biden-Harris Redteam dataset. In both stages we also used pretraining datasets such as Common Crawl and Wikipedia, as is common when pretraining models.

Here we list the instruction tuning datasets used during the first pretraining stage:

In the second stage, we use the following instruction tuning datasets (some are repeated from stage 1 of training):

Below is our reading of the red-teaming requirements of the Executive Order (2023, October 30, The White House) on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. We focus specifically on Sections 3(d) and 3(k):

The term "AI red teaming" means a structured testing effort to find flaws and vulnerabilities in an AI system, often in a controlled environment and in collaboration with developers of AI. Artificial Intelligence red teaming is most often performed by dedicated "red teams" that adopt adversarial methods to identify flaws and vulnerabilities, such as harmful or discriminatory outputs from an AI system, unforeseen or undesirable system behaviors, limitations, or potential risks associated with the misuse of the system.

The term "dual-use foundation model" means an AI model that is trained on broad data; generally uses self-supervision; contains at least tens of billions of parameters; is applicable across a wide range of contexts; and that exhibits, or could be easily modified to exhibit, high levels of performance at tasks that pose a serious risk to security, national economic security, national public health or safety, or any combination of those matters, such as by:

Models meet this definition even if they are provided to end users with technical safeguards that attempt to prevent users from taking advantage of the relevant unsafe capabilities. So, broadly, the Executive Order defines AI red teaming as testing for flaws and vulnerabilities, including:

While LLMs are very powerful, they are prone to generating toxic, harmful, and even dangerous content. They can also produce biased outputs and false information.
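As referenced above, here is a minimal sketch of a Wikipedia-reference fastText quality filter. It is illustrative only: the file name, hyperparameters, and threshold are assumptions, not the exact values used in the Aurora-M pipeline.

```python
import fasttext

# train.txt holds one document per line, prefixed with
# "__label__good " (Wikipedia-linked reference text) or
# "__label__bad " (random web text).
model = fasttext.train_supervised(
    input="train.txt",
    lr=0.1,
    epoch=5,
    wordNgrams=2,
)

def looks_like_reference_text(text: str, threshold: float = 0.5) -> bool:
    """Keep a document only if the classifier scores it as 'good' text."""
    # fastText predicts on a single line, so strip newlines first.
    labels, probs = model.predict(text.replace("\n", " "))
    return labels[0] == "__label__good" and probs[0] >= threshold
```

In practice such a classifier is trained once and then applied to every document in the web crawl, keeping only those scored above the threshold.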
Although users must wield LLMs responsibly, considering the potential consequences of the content they generate, developers bear the responsibility to design LLMs with care, emphasizing ethical guidelines and protecting them against attacks that could bypass safety protocols and undermine their guiding principles. Motivated by this, and by the latest AI regulations, we constructed a large dataset of instruction-response pairs to enhance the safety and robustness of our model. Specifically, our effort focused on the following main areas of concern under the Biden-Harris US Executive Order on AI:

Our generated dataset serves to mitigate these specific issues outlined in the order.

Aurora-M was trained on the LUMI supercomputer, using 32 nodes, each equipped with four AMD MI250X GPUs, for 74 days including server downtime. It should also be noted that LUMI runs on 100% hydro-powered energy, and its waste heat is used to heat hundreds of households in the city of Kajaani.

Because FlashAttention kernels were unavailable for AMD GPUs at the time of training, we had to use a plain PyTorch implementation of attention, which restricted us to a 2k context length and made our training less efficient (a sketch of such a fallback appears at the end of this section). We used a custom fork of Megatron-LM that is compatible with both NVIDIA and AMD GPUs.

As mentioned in the previous section, we use a two-stage curriculum. In the first stage, we took the massive pretraining corpora of the five languages and mixed in the lower-quality instruction tuning datasets mentioned in the previous section. We trained the model for 90k steps on this mixture.

For the second stage, we used the higher-quality instruction datasets mentioned in the previous section, along with the train split of the Biden-Harris Redteam dataset, intermixed with oversampled Wikipedia, subsampled English data, oversampled Python code, and markdown, in order to steer the model towards producing well-formatted text. In this stage we also removed text with a large proportion of symbols and numbers (a simple character-ratio heuristic for this is also sketched below). We trained the model for 14k steps in the second stage.

We see a steeper decline in training loss after 90k steps, which could be attributed to the much cleaner instruction tuning datasets; we leave this to further investigation. Please find the full WandB training report here.

Below is a plot of different language and code evaluations, aggregated over a wide variety of tasks. We significantly outperform the StarCoder models on a variety of language tasks while remaining comparable on coding tasks. We also outperform the Finnish GPT, Llama-2, and StarCoder models on Finnish, Vietnamese, and Hindi. For brevity, we omit the details of the exact evaluations in this blog.
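As referenced above, here is a minimal sketch of a plain PyTorch causal-attention fallback of the kind used when fused FlashAttention kernels are unavailable. This is an illustration, not the exact code from our Megatron-LM fork: it materializes the full seq_len × seq_len score matrix, so activation memory grows quadratically with context length, which is why training was capped at 2k tokens.

```python
import math
import torch

def naive_causal_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    seq_len = q.size(-2)
    # Full (seq_len x seq_len) score matrix -- the quadratic-memory step
    # that fused kernels like FlashAttention avoid.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # Causal mask: each position may only attend to itself and the past.
    mask = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device),
        diagonal=1,
    )
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```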
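And here is a hypothetical version of the stage-2 cleanup that drops documents dominated by symbols and digits, expressed as a character-ratio filter. The 0.25 threshold and the exact character classes are illustrative assumptions, not the values used for Aurora-M.

```python
def symbol_digit_ratio(text: str) -> float:
    """Fraction of characters that are digits or non-alphanumeric symbols."""
    if not text:
        return 1.0
    noisy = sum(
        1 for c in text
        if c.isdigit() or not (c.isalnum() or c.isspace())
    )
    return noisy / len(text)

def keep_document(text: str, max_ratio: float = 0.25) -> bool:
    # Drop documents where symbols and numbers dominate the text.
    return symbol_digit_ratio(text) <= max_ratio
```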

We will release a technical report describing our thorough evaluations and more details about the model and its limitations.

Aurora-M is an open-source effort that includes volunteers from academia and industry, aimed at promoting linguistic fairness and lawful, performant AI research. We began this journey in Spring 2023, and our work should NOT be confused with the AuroraGPT project.

As part of Ontocord.AI's commitment to enabling open science and equal access to AI knowledge, we support projects like Aurora-M. Ontocord.AI prioritizes lawfulness and data quality, and utilizes data filtering, synthetic data, and safety instructions for AI development. We are honored to be able to apply some of these techniques in our Aurora-M work. Please reach out to us if you have questions: engage@ontocord.ai.