As we approach the end of the year, many of us consider setting goals for next year. I wrote about setting learning goals in a previous letter. In this one, I’d like to share a framework that I’ve found useful: process goals versus outcome goals.
A process goal is one that calls for regular engagement in an activity; for example, deciding to spend at least N hours weekly studying deep learning, exercising three times a week, or applying for a certain number of jobs. An outcome goal is one that stipulates a particular result. For example, by next year, you might want to complete your university degree, reach a specific weight, get a certain job or — I hope you’ll do this if you haven’t already! — finish the Deep Learning Specialization.
When people think about setting goals, most gravitate toward outcome goals. But they have a downside: They’re often not fully within your control, and setbacks due to bad luck can be demoralizing. In contrast, process goals are more fully within your control and can lead more reliably toward the outcome you want.
Learning is a lifelong process. Though it can have a profound impact, often it takes time to get there. Thus, when it comes to learning, I usually set process goals in addition to outcome goals. Process goals for learning can help you keep improving day after day and week after week, which will serve you better than a burst of activity in which you try to cram everything you need to know.
When you set New Year resolutions, I hope you’ll consider both outcome goals and process goals. In particular, process goals that help you to…
Ring in the New
We leave behind a year in which AI showed notable progress in research as well as growing momentum in areas such as healthcare, logistics, and manufacturing. Yet it also showed its power to do harm, notably its ability to perpetuate bias and spread misinformation. We reviewed these events in our winter holiday and Halloween special issues. The coming year holds great potential to bring AI’s benefits to more people while ameliorating flaws that can lead to bad outcomes. In this issue, AI leaders from academia and industry share their highest hopes for 2022.
Clean Up Web Datasets
From language to vision models, deep neural networks are marked by improved performance, higher efficiency, and better generalization. Yet these systems are also marked by perpetuation of bias and injustice, inaccurate and stereotypical representation of groups, lack of explainability, and brittleness. I am optimistic that we will move slowly toward building more equitable AI, thanks to critical scholars who have been calling for caution and foresight. I hope we can adopt measures that mitigate these impacts as a routine part of building and deploying AI models.
The field does not lack optimism. In fact, everywhere you look, you find overenthusiasm, overpromising, overselling, and exaggeration of what AI models are capable of doing. Mainstream media outlets aren’t the only parties guilty of making unsustainable claims, overselling capabilities, and using misleading language; AI researchers themselves do it, too.
Language models, for example, are given human-like attributes such as “awareness” and “understanding” of language, when in fact models that generate text simply predict the next word in a sequence based on the previous words, with no understanding of underlying meaning. We won't be able to foresee the impact our models have on the lives of real people if we don't see the models themselves clearly. Acknowledging their limitations is the first step toward addressing the potential harms they are likely to cause.
What is more concerning is the disregard for work that examines datasets. As models get bigger and bigger, so do datasets. Models with a trillion parameters require massive training and testing datasets, often sourced from the web. Without the active work of auditing, carefully curating, and improving such datasets, data sourced from the web is like toxic waste. Web-sourced data plays a critical role in the success of models, yet critical examination of large-scale datasets is underfunded and underappreciated. Past work highlighting such issues is marginalized and undervalued. Scholars such as Deborah Raji, Timnit Gebru, and Joy Buolamwini have been at the forefront of doing the dirty and tiresome work and cleaning up the mess. Their insights should be applied at the core of model development. Otherwise, we stand to build models that reflect the lowest common denominators of human expression: cruelty, bigotry, hostility, and deceit.
My own work has highlighted troubling content — from misogynistic and racial slurs to malignant stereotypical representations of groups — found in large-scale image datasets such as TinyImages and ImageNet. One of the most distressing things I have ever had to do as a researcher was to sift through LAION-400M, the largest open-access multimodal dataset to date. Each time I queried the dataset with a term that was remotely related to Black women, it produced explicit and dehumanizing images from pornographic websites.
Such work needs appropriate allocations of time, talent, funding and resources. Moreover, it requires support for the people who must do this work. It causes deep, draining emotional and psychological trauma. The researchers who do this work — especially people of color who are often in precarious positions — deserve pay commensurate to their contribution as well as access to counseling to help them cope with the experience of sifting through what can be horrifying, degrading material.
The nascent work in this area so far (and the acknowledgement, however limited, that it has received) fills me with hope for the coming year. Instead of blind faith in models and overoptimism about AI, let’s pause and appreciate the people who are doing the dirty background work to make datasets, and therefore models, more accurate, just, and equitable. Then, let’s move forward, with due caution, toward a future in which the technology we build serves the people who suffer disproportionately negative impacts; in the words of Pratyusha Kalluri, toward technology that shifts power from the most to the least powerful.
My highest hope for AI in 2022 is that this difficult and valuable work — and those who do such work, especially Black women — will become part and parcel of mainstream AI research. These scholars are inspiring the next generation of responsible and equitable AI. Their work is reason not for defeatism or skepticism but for hope and cautious optimism.
Abeba Birhane is a cognitive science PhD researcher at the Complex Software Lab in the school of computer science at University College Dublin.
Train Robots in the Real World
Robots are tremendously useful machines, and I would like to see them applied to every task where they can do some good. Yet we don’t have enough programmers for all this hardware and all these tasks. To be useful, robots need to be intelligent enough to learn from experience in the real world and communicate what they’ve learned for the benefit of other robots. I hope that the coming year will see great progress in this area.
Unlike many typical machine learning systems, robots need to be highly reliable. If you’re using a face detection system to find pictures of friends in an image library, it’s not much of a problem if the system fails to find a particular face or finds an incorrect one. But mistakes can be very costly when physical systems interact with the real world. Consider a warehouse robot that surveys shelves full of items, identifies the ones that a customer has paid for, grasps them, and puts them in a box. (Never mind an autonomous car that could cause a crash if it makes a mistake!) Whether this robot classifies objects accurately isn’t a matter of life and death, but even if its classification accuracy is 99.9 percent, one in 1,000 customers will receive the wrong item.
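To make the arithmetic concrete, here is a minimal Python sketch. It assumes every pick is an independent event with the same accuracy, which is a simplification, and the five-item order is a hypothetical value rather than a figure from the text.

```python
def expected_wrong_items(accuracy: float, picks: int) -> float:
    """Expected number of misidentified items across `picks` independent picks."""
    return (1.0 - accuracy) * picks


def prob_order_has_error(accuracy: float, items_per_order: int) -> float:
    """Chance that at least one item in an order is wrong (independence assumed)."""
    return 1.0 - accuracy ** items_per_order


if __name__ == "__main__":
    accuracy = 0.999  # the 99.9 percent figure from the text
    print(expected_wrong_items(accuracy, 1_000))
    # Hypothetical order size: errors compound as orders grow.
    print(prob_order_has_error(accuracy, 5))
```

Under these assumptions, a single-item order goes wrong about once in 1,000 times, and the risk grows roughly in proportion to the number of items per order.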
After decades of programming robots to act according to rules tailored for specific situations, roboticists now embrace machine learning as the most promising path to building machines that achieve human performance in tasks like the warehouse robot’s pick-and-place. Deep learning provides excellent visual perception including object recognition and semantic segmentation. Meanwhile, reinforcement learning offers a way to learn virtually any task. Together, these techniques offer the most promising path to harnessing robots everywhere they would be useful in the real world.
What’s missing from this recipe? The real world itself. We train visual systems on standardized datasets, and we train robot behaviors in simulated environments. Even when we don’t use simulations, we keep robots cooped up in labs. There are good reasons to do this: Benchmark datasets represent a useful data distribution for training and testing on particular tasks, simulations allow dirt-cheap virtual robots to undertake immense numbers of learning trials in relatively little time, and keeping robots in the lab protects them — and nearby humans — from potentially costly damage.
But it is becoming clear that neither datasets nor simulations are sufficient. Benchmark tasks are more tightly defined than many real-world applications, and simulations and labs are far simpler than real-world environments. Progress will come more rapidly as we get better at training physical robots in the real world.
To do this, we can’t treat robots as solitary learners that bumble their way through novel situations one at a time. They need to be part of a class, so they can inform one another. This fleet-learning concept can unite thousands of robots, all learning on their own and from one another by sharing their perceptions, actions, and outcomes. We don’t yet know how to accomplish this, but important work in lifelong learning and incremental learning provides a foundation for robots to gain real-world experience quickly and cost-effectively. Then they can sharpen their knowledge in simulations and take what they learn in simulations back to the real world in a loop that takes advantage of the strengths of each environment.
In the coming year, I hope that roboticists will shut down their sims, buy physical robots, take them out of the lab, and start training them on practical tasks in real-world settings. Let’s try this for a year and see how far we get!
Wolfram Burgard is a professor at the University of Freiburg, where he heads the Autonomous Intelligent Systems research lab.
Learning From the Ground Up
Things are really starting to get going in the field of AI. After many years (decades?!) of focusing on algorithms, the AI community is finally ready to accept the central role of data and the high-capacity models that are capable of taking advantage of this data. But when people talk about “AI,” they often mean very different things, from practical applications (such as self-driving cars, medical image analysis, robo-lawyers, image/video editing) to models of human cognition and consciousness. Therefore, it might be useful to distinguish two broad types of AI efforts: semantic or top-down AI versus ecological or bottom-up AI.
The goal of top-down AI is to match or exceed human performance on a specific human task like image labeling, driving, or text generation. The tasks are defined either by explicit labels (supervised learning), a set of rules (e.g., rules of the road), or a corpus of human-produced artifacts (for instance, GPT-3 is trained on human-written texts using human-invented words). Thus, top-down AI is necessarily subjective and anthropocentric. It is the type of AI where we have seen the most advances to date.
Bottom-up AI, on the other hand, aims to ignore humans, their tasks and their labels. Its only goal is to predict the surrounding world given sensory inputs (passive and active). Because the world is continuously changing, this goal will never be reached. But the hope is that, along the way, a general, task-agnostic model of the world will emerge. Self-supervised learning on raw sensory data, various generative models such as GANs, and intrinsic motivation approaches (e.g., curiosity) are all attempts at bottom-up AI.
While top-down AI is currently king in industry as well as academia, its focus on imitating humans (via labels and tasks) points to its main limitation. It is like an undergraduate student who didn’t attend lectures all semester but still gets an A by cramming for the final exam — its knowledge is of a superficial nature. Real understanding must be built up slowly and patiently, from the raw sensory inputs upward. This is already starting to happen, and I hope the progress of bottom-up AI will continue in 2022.
As a teenager in the 1980s USSR, I spent a lot of time hanging out with young physicists (as one does) talking about computers. One of them gave a definition of artificial intelligence that I still find the most compelling: “AI is not when a computer can write poetry. AI is when a computer will want to write poetry.” By this definition, AI may be a tall order, but if we want to bring it closer, I suspect we will need to start from the bottom up.
Happy 2022! Bottoms up!
Alexei Efros is a professor of computer science at UC Berkeley.
AI That Adapts to Changing Conditions
Until recently, big data processing has been dominated by batch systems like MapReduce and Spark, which allow us to periodically process a large amount of data very efficiently. As a result, most of today’s machine learning workload is done in batches. For example, a model might generate predictions once a day and be updated with new training data once a month.
While batch-first machine learning still works for many companies, this paradigm often leads to suboptimal model performance and lost business opportunities. In the coming year, I hope that more companies will deploy models that can generate predictions in real time and update more frequently to adapt to changing environments.
Consider an ecommerce website where half the visitors are new users or existing users who aren’t logged in. Because these visitors are new, there are no recommendations personalized to them until the next batch of predictions is computed. By then, it’s likely that many of these visitors will have left without making a purchase because they didn’t find anything relevant to them.
In the last couple of years, technically progressive companies have moved toward real-time machine learning. The first level is online prediction. These companies use streaming technologies like Kafka and Kinesis to capture and process a visitor’s activities on their sites (often called behavioral data) in real time. This enables them to extract online features and combine them with batch features to generate predictions tailored to a specific visitor based on their activities. Companies that have switched to online prediction, including Coveo, eBay, Faire, Stripe, and Netflix, have seen more accurate predictions, which lead to higher conversion rates, higher retention rates, and eventually higher revenue. Online inference also enables sophisticated evaluation techniques like contextual bandits that can determine the best-performing model using much less data than traditional A/B testing.
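The idea of combining online and batch features can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory event list stands in for a real stream (no Kafka or Kinesis API is used), and the feature names, weights, and visitor IDs are all hypothetical.

```python
from collections import defaultdict
from typing import Dict

# Hypothetical batch features, e.g. recomputed nightly for known users.
BATCH_FEATURES: Dict[str, Dict[str, float]] = {
    "user_1": {"purchases_last_90d": 4.0},
}
# New or logged-out visitors have no batch history.
DEFAULT_BATCH: Dict[str, float] = {"purchases_last_90d": 0.0}

# Hypothetical weights of a simple linear scoring model.
WEIGHTS: Dict[str, float] = {
    "purchases_last_90d": 0.5,
    "session_views": 0.1,
    "session_cart_adds": 1.0,
}


class OnlineFeatures:
    """Running per-visitor counters, updated as each event arrives."""

    def __init__(self) -> None:
        self._views: Dict[str, int] = defaultdict(int)
        self._cart_adds: Dict[str, int] = defaultdict(int)

    def update(self, visitor_id: str, event_type: str) -> None:
        if event_type == "view":
            self._views[visitor_id] += 1
        elif event_type == "add_to_cart":
            self._cart_adds[visitor_id] += 1

    def get(self, visitor_id: str) -> Dict[str, float]:
        return {
            "session_views": float(self._views[visitor_id]),
            "session_cart_adds": float(self._cart_adds[visitor_id]),
        }


def predict(visitor_id: str, online: OnlineFeatures) -> float:
    """Score a visitor from batch features merged with fresh online features."""
    features = {
        **BATCH_FEATURES.get(visitor_id, DEFAULT_BATCH),
        **online.get(visitor_id),
    }
    return sum(WEIGHTS[name] * value for name, value in features.items())


if __name__ == "__main__":
    store = OnlineFeatures()
    # Stand-in for a stream of behavioral events from an anonymous visitor.
    for visitor, event in [("anon_42", "view"), ("anon_42", "add_to_cart")]:
        store.update(visitor, event)
        print(visitor, event, predict(visitor, store))
```

In this sketch, an anonymous visitor's score changes after every event instead of waiting for the next batch run.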
The next level of real-time machine learning is continual learning. While machine learning practitioners understand that data distributions shift continually and models go stale in production, the vast majority of models used in production today can’t adapt to shifting data distributions. The more the distribution shifts, the worse the model’s performance. Frequent retraining can help combat this, but the holy grail is to automatically and continually update the model with new data whenever it shows signs of going stale.
Continual learning not only helps improve performance, but it can also reduce training costs. When you retrain your model once a month, you may need to train it from scratch on a lot of data. However, with continual learning, you may only need to fine-tune it with a much smaller amount of new data.
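A toy version of this loop might look like the following Python sketch. It is not any particular company's system: the one-parameter model, the staleness threshold, the learning rate, and the data are all hypothetical, chosen only to show a model being fine-tuned on a small window of recent data once its error drifts too high.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def mse(w: float, data: List[Sample]) -> float:
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def fine_tune(w: float, data: List[Sample], lr: float = 0.01, steps: int = 50) -> float:
    """A few gradient-descent steps starting from the current weight."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def maybe_update(w: float, recent: List[Sample], threshold: float) -> Tuple[float, bool]:
    """Fine-tune on recent data only if the model looks stale on it."""
    if mse(w, recent) > threshold:
        return fine_tune(w, recent), True
    return w, False


if __name__ == "__main__":
    w = 2.0  # fitted earlier, when the data followed y = 2x
    recent = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]  # distribution has shifted
    w, updated = maybe_update(w, recent, threshold=0.1)
    print(updated, w)
```

The update touches only four recent samples and starts from the existing weight, which is the cost advantage over retraining from scratch described above.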
A handful of companies, including Alibaba, ByteDance, and Tencent, have used continual learning successfully. However, it requires heavy infrastructure investment and a mental shift. Therefore, it still meets with a lot of resistance, and I don’t expect many companies to embrace it for at least a few years.
In 2022, I expect a lot more companies to move toward online prediction, thanks to increasingly mature streaming technologies and a growing number of success stories. And the same underlying streaming infrastructure can be leveraged for real-time model analytics.
Chip Huyen works on a startup that helps companies move toward real-time machine learning. She teaches Machine Learning Systems Design at Stanford University.
Language Models That Reason
I believe that natural language processing in 2022 will re-embrace symbolic reasoning, harmonizing it with the statistical operation of modern neural networks. Let me explain what I mean by this.
AI has been undergoing a natural language revolution for the past half decade, and this will continue into 2022 and well beyond. Fueling the revolution are so-called large language models (sometimes called foundation models), huge neural networks pretrained on gigantic corpora that encode rich information about not only language but also the world as described by language. Models such as GPT-3 (OpenAI), Jurassic-1 (AI21 Labs), Megatron-Turing NLG (Microsoft-Nvidia), and Wu-Dao 2.0 (Beijing Academy of Artificial Intelligence), to name some of the largest ones, perform impressively well on a variety of natural language tasks from translation to paraphrasing. These models dominate academic leaderboards and are finding their way into compelling commercial applications.
For all the justified excitement around large language models, they have significant shortcomings. Perhaps most notably, they don’t exhibit true understanding of any kind. They are, at heart, statistical behemoths that can guess sentence completions or missing words surprisingly well, but they don’t understand (nor can they explain) these guesses, and when the guesses are wrong — which is often — they can be downright ridiculous.
Take arithmetic. GPT-3 and Jurassic-1 can perform one- and two-digit addition well. This is impressive, as these general-purpose models were not trained with this task in mind. But ask them to add 1,123 to 5,813 and they spit out nonsense. And why would they not? None of us learned addition merely by observing examples; we were taught the underlying principles.
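For contrast, the "underlying principles" of addition fit in a short symbolic procedure. The Python sketch below is the schoolbook column method with carries; it illustrates rule-based computation only and is not a neuro-symbolic model or anything proposed in the text. The function name is hypothetical.

```python
def add_by_columns(a: str, b: str) -> str:
    """Add two non-negative decimal numerals digit by digit, carrying as needed."""
    a, b = a.replace(",", ""), b.replace(",", "")
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)  # pad the shorter numeral with zeros

    carry = 0
    digits = []
    # Work from the rightmost column to the leftmost.
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry
        digits.append(str(total % 10))
        carry = total // 10
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))


if __name__ == "__main__":
    print(add_by_columns("1,123", "5,813"))
```

Because the procedure applies the same rule to every column, it works for numerals of any length, which is the property that guessing from examples lacks.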
What’s missing is reasoning, and math is just an example. We reason about time, space, causality, knowledge and belief, and so on via symbols that carry meaning and inference on those symbols. These abstract symbols and reasoning don’t emerge from the statistics encoded in the weights of a trained neural network.
The new holy grail is to inject this sort of semantic, symbolic reasoning into the statistical operation of the neural machinery. My co-founders and I started AI21 Labs with this mission, and we’re not alone. So-called neuro-symbolic models are the focus of much recent (and some less-recent) research. I expect that 2022 will see significant advances in this area.
The result will be models that can perform tasks such as mathematical, relational, and temporal reasoning reliably. No less important, since the models will have access to symbolic reasoning, they will be able to explain their answers in a way that we can understand. This robustness and explainability will help move natural language processing from the current era of statistical pattern recognition into an era of trustworthy, understandable AI. This is not only intellectually exciting, but it also unlocks practical applications in domains in which trustworthiness is essential, such as finance, law, and medicine.
The year 2022 likely will not mark the end of the quest for such models, but I believe that it may be recognized as a pivotal year in this quest.
Yoav Shoham is a co-founder of AI21 Labs and professor emeritus of computer science at Stanford University.
Foundation Models for Vision
Large models pretrained on immense quantities of text have been proven to provide strong foundations for solving specialized language tasks. My biggest hope for AI in 2022 is to see the same thing happen in computer vision: foundation models pretrained on exabytes of unlabeled video. Such models, after fine-tuning, are likely to achieve strong performance and provide label efficiency and robustness for a wide range of vision problems.
Foundation models like GPT-3 by OpenAI and Gopher by DeepMind have shown a powerful ability to generalize in numerous natural language processing tasks, and vision models pretrained jointly on images and text, such as CLIP by OpenAI, Florence by Microsoft, and FLAVA by Facebook, have achieved state-of-the-art results on several vision-and-language understanding tasks. Given the large amount of video readily available, I think the most promising next step is to investigate how to take advantage of unlabeled video to train large-scale vision models that generalize well to challenging real-world scenarios.
Why video? Unlike static images, videos capture dynamic visual scenes with temporal and audio signals. Neighboring frames serve as a form of natural data augmentation, providing various object (pose, appearance), camera (geometry), and scene (illumination, object placements) configurations. They also capture the chronological order of actions and events critical for temporal reasoning. In these ways, the time dimension provides critical information that can improve the robustness of computer vision systems. Furthermore, the audio track in video can contain both natural sounds and spoken language that can be transcribed into text. These multimodal (sound and text) signals provide complementary information that can aid learning visual representations.
Learning from large amounts of unlabeled video poses unique challenges that must be addressed by both fundamental AI research and strong engineering efforts:
- What architectures are most appropriate to process multimodal signals from video? Can they be handcrafted, or should we search out optimal architectures that capture the inductive biases of multimodal data more effectively?
- What are the most effective ways to use temporal and multimodal information in an unsupervised or self-supervised manner?
- How should we deal with noise such as compression artifacts, visual effects added after recording, abrupt scene changes, and misalignment between imagery, soundtrack, and transcribed audio?
- How can we design challenging video tasks that measure progress in a conclusive manner? Existing video benchmarks contain human actions that are short-term (e.g., run, push, pull) and some are easily recognized from a single frame (e.g., playing a guitar). This makes it difficult to draw conclusive insights. What kinds of tasks would be compelling and comprehensive for video understanding?
- Video processing is notoriously resource heavy. How can we develop compute- and memory-efficient video models to speed up large-scale distributed training?
These are exciting research and engineering challenges for which I hope to see significant advances in 2022.
Yale Song is a researcher at Microsoft Research in Redmond, where he works on large-scale problems in computer vision and artificial intelligence.
Advance AI for Good
There’s a reason why artificial intelligence is sometimes referred to as “software 2.0”: It represents the most significant technological advance in decades. Like any groundbreaking invention, it raises concerns about the future, and much of the media focus is on the threats it brings. And yet, at no point in human history has a single technology offered so many potential benefits to humanity. AI is a tool whose goodness depends on how we use it.
In 2022, I hope that the general public gains a greater appreciation of the benefits that AI brings to their lives. There are misconceptions and fears stemming from cases where AI has been used intrusively, and biased systems may have an unfairly adverse impact on some groups of people in areas like law enforcement, finance, insurance, and healthcare. Nonetheless, learning algorithms have shown potential in fighting Covid-19, detecting wildfires before they rage out of control, and anticipating catastrophic failure of things like buildings and airplanes.
AI — deep learning models especially — can be a powerful instrument for social good. Computers never tire. They learn from more data than a human can absorb in a lifetime and enable people to accomplish some tasks much faster and with fewer errors. Applying these capabilities to problems like food production, healthcare, and climate change could bring unprecedented progress.
Modern daily life requires AI as well. Social media platforms couldn’t exist without automated moderation models that root out toxicity and hate, and these problems threaten to escalate to a new level as human interactions move to virtual reality. Billions of people use the largest of these networks. If a major social media company were to moderate its content manually, it would need to hire a million people. Moderation is not scalable without machine learning models.
Another hope I have for 2022 is that AI gains newcomers with engineering backgrounds rather than machine learning or data science. You shouldn’t need an advanced degree to build AI, and as the technology matures and requires less coding it will become easier to create ML models without knowing their internal workings.
Increased access to the field is actually a key to realizing the broad social benefits of AI. A more diverse AI workforce would build less-biased systems. Equitable models are paramount as society automates more and more. Banks use AI models to determine who gets a mortgage, and employers use them to determine who gets a job interview. Machine learning models have a strong influence on society, and while the wrong ones can cause harm, the right ones can effect a cycle of positive change.
Matt Zeiler is the founder and CEO of Clarifai, an AI platform that helps enterprises transform unstructured image, video, text, and audio data into actionable insights.