Toward 1 Trillion Parameters

Microsoft upgrades its DeepSpeed optimization library.

Published Sep 16, 2020

An open source library could spawn trillion-parameter neural networks and help small-time developers build big-league models.

What’s new: Microsoft upgraded DeepSpeed, a library that accelerates the PyTorch deep learning framework. The revision makes it possible to train models five times larger than the framework previously allowed, using relatively few processors, the company said.

How it works: Microsoft debuted DeepSpeed in February, when it used the library to help train the 17-billion-parameter language model Turing-NLG. The new version includes four updates:

Three techniques enhance parallelism to use processor resources more efficiently: Data parallelism splits data into smaller batches, model parallelism partitions individual layers, and pipeline parallelism groups layers into stages. Batches, layers, and stages are assigned to so-called worker subroutines for training, making it easier to train extremely large models.
ZeRO-Offload efficiently juggles resources available from both conventional processors and graphics chips. The key to this subsystem is the ability to store optimizer states and gradients in CPU memory rather than GPU memory. In tests, a single Nvidia V100 was able to train models with 13 billion parameters without running out of memory, an order of magnitude larger than PyTorch alone can handle.
Sparse Attention uses sparse kernels to process input sequences up to an order of magnitude longer than standard attention allows. In tests, the library enabled BERT and BERT Large models to process such sequences between 1.5 and 3 times faster.
1-bit Adam improves upon the existing Adam optimization method by reducing the volume of communications required. Models that used 1-bit Adam trained 3.5 times faster than those trained using Adam.
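
To make the idea behind 1-bit Adam concrete, here is a minimal sketch of 1-bit compression with error feedback: each value is sent as a single sign bit plus one shared scale, and the rounding error is carried into the next round so it is not lost. This illustrates the general technique under stated assumptions and is not DeepSpeed's implementation. The function names `one_bit_compress` and `decompress` are hypothetical, and using the mean absolute value as the scale is an assumption made for simplicity.

```python
def one_bit_compress(values, error):
    """Compress a vector to sign bits plus one shared scale, with error feedback.

    Hypothetical helper for illustration only; not a DeepSpeed API.
    """
    if not values:
        return [], 0.0, []
    # Add back the residual that earlier rounds failed to transmit.
    corrected = [v + e for v, e in zip(values, error)]
    # One float for the whole vector (assumption: mean absolute value).
    scale = sum(abs(c) for c in corrected) / len(corrected)
    # One bit per element: True for non-negative, False for negative.
    signs = [c >= 0 for c in corrected]
    # What the receiver will reconstruct from (signs, scale).
    decoded = [scale if s else -scale for s in signs]
    # Residual to carry into the next round.
    new_error = [c - d for c, d in zip(corrected, decoded)]
    return signs, scale, new_error


def decompress(signs, scale):
    """Rebuild the approximate vector on the receiving worker."""
    return [scale if s else -scale for s in signs]
```

A worker would call `one_bit_compress(update, error)` each step, transmit `signs` and `scale`, and keep `new_error` locally for the next step. Sending one bit per value instead of a 16- or 32-bit float is where the reduction in communication volume comes from.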

Results: Combining these improvements, DeepSpeed can train a trillion-parameter language model using 800 Nvidia V100 graphics cards, Microsoft said. Without DeepSpeed, the same task would require 4,000 Nvidia A100s, each up to 2.5 times faster than a V100, running for 100 days.
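
The memory figures in this article can be sanity-checked with rough arithmetic. The sketch below assumes mixed-precision training with Adam, which needs about 16 bytes of model state per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 weights, momentum, and variance). It ignores activations and temporary buffers, and it assumes the 32 GB variant of the V100, which the article does not specify. The function names are hypothetical.

```python
# Bytes of model state per parameter for mixed-precision Adam training.
FP16_WEIGHT_BYTES = 2
FP16_GRAD_BYTES = 2
FP32_OPTIMIZER_BYTES = 12  # fp32 weights + momentum + variance, 4 bytes each
GB = 10**9
V100_GB = 32  # assumption: 32 GB V100 cards


def model_state_gb(num_params):
    """Total model-state memory in GB (activations and buffers excluded)."""
    per_param = FP16_WEIGHT_BYTES + FP16_GRAD_BYTES + FP32_OPTIMIZER_BYTES
    return num_params * per_param / GB


def gpu_resident_gb_with_offload(num_params):
    """GPU memory for model state if gradients and optimizer states live on
    the CPU. Assumption: only the fp16 weights remain on the GPU."""
    return num_params * FP16_WEIGHT_BYTES / GB


# Trillion-parameter model versus the combined memory of 800 cards.
trillion_gb = model_state_gb(10**12)
cluster_gb = 800 * V100_GB
fits_in_cluster = trillion_gb <= cluster_gb

# 13-billion-parameter model on a single card with offloading.
offload_gb = gpu_resident_gb_with_offload(13 * 10**9)
fits_on_one_gpu = offload_gb <= V100_GB
```

Under these assumptions, a trillion parameters need roughly 16,000 GB of model state against 25,600 GB across 800 cards, and a 13-billion-parameter model leaves about 26 GB of weights on a single GPU once gradients and optimizer states move to CPU memory.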

Behind the news: Deep learning is spurring demand for computing power that threatens to put the technology out of many organizations’ reach.

A 2018 OpenAI analysis found the amount of computation needed to train large neural networks doubled every three and a half months.
A 2019 study from the University of Massachusetts found that high training costs may keep universities and startups from innovating.
Semiconductor manufacturing giant Applied Materials estimated that AI’s thirst for processing power could consume 15 percent of electricity worldwide by 2025.

Why it matters: AI giants like Microsoft, OpenAI, and Google use enormous amounts of processing firepower to push the state of the art. Smaller organizations could benefit from technology that helps them contribute as well. Moreover, the planet could use a break from AI’s voracious appetite for electricity.

We’re thinking: GPT-3 showed that we haven’t hit the limit of model and dataset size as drivers of performance. Innovations like this are important to continue making those drivers more broadly accessible.
