Data-Centric AI Development, Part 2: A Critical Shift in Perspective

Dear friends,

Earlier today, I spoke at a DeepLearning.AI event about MLOps, a field that aims to make building and deploying machine learning models more systematic. AI system development will move faster if we can shift from being model-centric to being data-centric. You can watch a video of the event here.

Unlike traditional software, which is powered by code, AI systems are built using both code (including models and algorithms) and data:
AI systems = Code + Data

When a system isn’t performing well, many teams instinctively try to improve the Code. But for many practical applications, it’s more effective to focus instead on improving the Data.

Progress in machine learning has been driven for decades by efforts to improve performance on benchmark datasets, in which researchers hold the Data fixed while improving the Code. But for many applications — especially ones where the dataset size is modest (<10,000 examples) — teams will make faster progress by focusing instead on making sure the dataset is good:

  • Is the definition of y given x clear and unambiguous? For example, do different data labelers draw bounding boxes consistently? Do speech transcriptionists label ambiguous audio consistently, for instance, writing “um, yes please” rather than “um … yes please”?
  • Does the input distribution x sufficiently cover the important cases?
  • Does the data incorporate timely feedback from the production system, so we can track concept and data drift?
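
To make the first two questions concrete, here is a minimal sketch of how a team might measure them. It is an illustration under stated assumptions, not a prescribed method: the function names (`agreement_rate`, `inconsistent_examples`, `coverage`), the clip ids, and the slice names are all hypothetical, and exact string matching stands in for whatever notion of "same label" fits your task.

```python
from collections import Counter
from itertools import combinations


def agreement_rate(labels_by_example):
    """Fraction of labeler pairs that gave identical labels, pooled over all examples."""
    agree = 0
    total = 0
    for labels in labels_by_example.values():
        for a, b in combinations(labels, 2):
            agree += int(a == b)
            total += 1
    # With no pairs to compare, report perfect agreement by convention.
    return agree / total if total else 1.0


def inconsistent_examples(labels_by_example):
    """Sorted ids of examples whose labelers did not all give the same label."""
    return sorted(
        ex_id for ex_id, labels in labels_by_example.items()
        if len(set(labels)) > 1
    )


def coverage(slices, min_count):
    """Count examples per slice and list slices with fewer than min_count examples."""
    counts = Counter(slices)
    rare = sorted(s for s, n in counts.items() if n < min_count)
    return counts, rare


if __name__ == "__main__":
    # Hypothetical data: two transcriptionists per audio clip.
    transcripts = {
        "clip_1": ["um, yes please", "um, yes please"],
        "clip_2": ["um, yes please", "um ... yes please"],
        "clip_3": ["no thanks", "no thanks"],
    }
    print(agreement_rate(transcripts))
    print(inconsistent_examples(transcripts))

    # Hypothetical recording conditions, one per training example.
    counts, rare = coverage(["quiet", "quiet", "car noise", "quiet"], min_count=2)
    print(dict(counts), rare)
```

Examples flagged by `inconsistent_examples` are good candidates for revisiting the labeling instructions, and slices flagged as rare suggest where collecting more data may help most.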

It’s a common joke that 80 percent of machine learning is actually data cleaning, as though that were a lesser task. My view is that if 80 percent of our work is data preparation, then ensuring data quality is the important work of a machine learning team.

Rather than counting on engineers to chance upon the best way to improve a dataset, I hope we can develop MLOps tools that help make building AI systems, including building high-quality datasets, more repeatable and systematic. MLOps is a nascent field, and different people define it differently. But I think the most important organizing principle of MLOps teams and tools should be to ensure the consistent and high-quality flow of data throughout all stages of a project. This will help many projects go more smoothly.

I have much more to say on this topic, so check out my talk here. Thanks to my team at Landing AI for helping to crystallize these thoughts.

Keep learning!

Andrew
