Bigger, Faster Transformers

Increasing parameters without slowing down transformers

Published Feb 24, 2021

Performance in language tasks rises with the size of the model — yet, as a model’s parameter count rises, so does the time it takes to render output. New work pumps up the number of parameters without slowing down the network.

What’s new: William Fedus, Barret Zoph, and Noam Shazeer at Google Brain developed the Switch Transformer, a large-scale architecture (the authors built a version comprising 1.6 trillion parameters) that’s nearly as fast as a much smaller model.

Key insight: The approach known as mixture-of-experts uses only a subset of a model’s parameters per input example. Like mixture-of-experts, Switch Transformer chooses which of many alternative fully connected layers would best process a given input, so adding layers adds parameters without adding much computation per input.

How it works: The authors trained Switch Transformer to predict words that had been removed at random from a large text dataset scraped from the web. The dataset was preprocessed to remove offensive language, placeholder text, and other issues.
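The training objective described above can be sketched as follows. This is a minimal illustration, not the authors’ preprocessing code: the `<mask>` placeholder, the `mask_tokens` helper, and the 15 percent masking rate are assumptions chosen for the example.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, chosen for illustration


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide tokens at random and record what the model should predict.

    Returns the corrupted sequence and a dict mapping each masked
    position to the original token at that position.
    """
    rng = random.Random(seed)  # seeded so the example is repeatable
    corrupted, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted.append(MASK)
            targets[position] = token
        else:
            corrupted.append(token)
    return corrupted, targets


sentence = "the quick brown fox jumps over the lazy dog".split()
corrupted, targets = mask_tokens(sentence, mask_rate=0.3, seed=1)
```

During training, the model would see `corrupted` as input and be scored on how well it recovers the words stored in `targets`.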

A typical transformer extracts a representation from each input token, such as a word, and then uses self-attention to compare the representations before passing them to a fully connected layer. Switch Transformer replaces that single fully connected layer with a set of fully connected layers (their number is a hyperparameter) and sends each token to just one of them.
A softmax layer calculates the probability that any particular fully connected layer is best for processing a given token. The token then goes to the highest-probability layer, which processes it in the usual manner.
The fully connected layers process tokens in parallel. The authors added a loss term to encourage the layers to be equally active. On a hardware chip, a separate processor core handles each layer, so this loss encourages equal distribution of the load across cores.
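The routing steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors’ implementation: the helper names, the tiny random weights, the scaling of each output by its router probability, and the `alpha` coefficient of 0.01 are all choices made for the example. The balancing term multiplies, for each layer, the fraction of tokens routed to it by the average probability the router assigns to it.

```python
import math
import random


def softmax(logits):
    """Convert router scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights, for illustration only."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def expert_forward(expert, token):
    """One expert: a two-layer fully connected network with a ReLU."""
    w_in, w_out = expert
    hidden = [max(0.0, h) for h in matvec(w_in, token)]
    return matvec(w_out, hidden)


def switch_layer(tokens, router_weights, experts):
    """Send each token to the single expert the router rates most probable."""
    outputs, choices, router_probs = [], [], []
    for token in tokens:
        probs = softmax(matvec(router_weights, token))
        best = max(range(len(experts)), key=lambda i: probs[i])
        # Assumption: scale the expert output by its router probability,
        # which gives the router a training signal.
        expert_out = expert_forward(experts[best], token)
        outputs.append([probs[best] * y for y in expert_out])
        choices.append(best)
        router_probs.append(probs)
    return outputs, choices, router_probs


def load_balance_loss(choices, router_probs, num_experts, alpha=0.01):
    """Penalize piling tokens and router probability onto the same experts.

    With perfectly uniform routing the value equals alpha.
    """
    n = len(choices)
    total = 0.0
    for i in range(num_experts):
        fraction_routed = sum(1 for c in choices if c == i) / n
        mean_prob = sum(p[i] for p in router_probs) / n
        total += fraction_routed * mean_prob
    return alpha * num_experts * total
```

With four tokens spread evenly over four layers and uniform router probabilities, `load_balance_loss` returns `alpha` (0.01). If all four tokens go to one layer with probability 1, it returns 0.04, so the skewed assignment is penalized.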

Results: The authors compared Switch Transformer (7.4 billion parameters) to T5 (223 million parameters), a variant similar to the original transformer that was trained on the same dataset. They measured negative log perplexity, a score that rises as the model’s uncertainty falls (higher is better). The new model achieved -1.561 negative log perplexity compared to T5’s -1.731. With 33 times as many parameters, Switch Transformer ran at roughly 63 percent of T5’s speed: It executed 1,000 predictions per second compared to T5’s 1,600. It also beat a mixture-of-experts transformer, presumably of roughly the same size, on both counts.
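The ratios behind these figures are easy to check. The short sketch below uses only the numbers quoted above. The conversion from negative log perplexity to perplexity assumes a natural logarithm, which the summary does not state.

```python
import math

# Figures quoted in the results above
switch_params = 7.4e9
t5_params = 223e6
switch_speed = 1_000  # predictions per second
t5_speed = 1_600

param_ratio = switch_params / t5_params  # about 33 times as many parameters
speed_ratio = switch_speed / t5_speed    # 0.625, or 62.5 percent of T5's speed

# Assumption: "negative log perplexity" uses the natural log, so
# perplexity = exp(-score). Lower perplexity means less uncertainty.
switch_perplexity = math.exp(1.561)  # about 4.8
t5_perplexity = math.exp(1.731)      # about 5.6
```

Under that assumption, the higher (less negative) score corresponds to the lower perplexity, which is why higher is better.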

Why it matters: In deep learning, bigger is better — but so is a manageable computation budget.

We’re thinking: Transformers come in an increasing variety of flavors. We hope this summary helps you remember which is switch.
