How to Think About the Privacy of Cloud-Based AI

How private is your data on cloud-based AI platforms? Here's a framework for evaluating risks.

Dear friends,

The rise of cloud-hosted AI software has brought much discussion about the privacy implications of using it. But I find that users, including both consumers and developers building on such software, don’t always have a sophisticated framework for evaluating how software providers store, use, and share their data. For example, does a company’s promise “not to train on customer data” mean your data is private? 

Here is a framework for thinking about different levels of privacy on cloud platforms, from less to more:

  • No Guarantees: The company provides no guarantees that your data will be kept private. For example, an AI company might train on your data and use the resulting models in ways that leak it. Many startups start here but add privacy guarantees later when customers demand them.
  • No Outside Exposure: The company does not expose your data to outsiders. A company can meet this standard by not training on your data and by not posting your data online. Many large startups, including some providers of large language models (LLMs), currently operate at this level. 
  • Limited Access: In addition to safeguards against data leakage, no humans (including employees, contractors, and vendors of the company) will look at your data unless they are compelled via a reasonable process (such as a subpoena or court order, or if the data is flagged by a safety filter). Many large cloud companies effectively offer this level of privacy, whether or not their terms of service explicitly say so. 
  • No Access: The company cannot access your data no matter what. For example, data may be stored on the customer’s premises, so the company doesn’t have access to it. If I run an LLM on my private laptop, no company can access my prompts or LLM output. Alternatively, if data is used by a SaaS system, it might be encrypted before it leaves the customer’s facility, so the provider doesn’t have access to an unencrypted version. For example, when you use an end-to-end encrypted messaging app such as Signal or WhatsApp, the company cannot see the contents of your messages (though it may see “envelope” information such as sender and recipient identities and the time and size of the message).
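
To make the No Access pattern concrete, here is a minimal sketch of client-side encryption, assuming Python's cryptography package; the upload_to_saas function is a hypothetical stand-in for any provider's upload API. The key never leaves the customer's side, so the provider stores only ciphertext it cannot read.

```python
# Minimal sketch of a "No Access" arrangement: data is encrypted on the
# customer's side before upload, so the SaaS provider handles only ciphertext.
# Assumes the `cryptography` package; upload_to_saas is a hypothetical stand-in.
from cryptography.fernet import Fernet

def upload_to_saas(blob: bytes) -> None:
    """Hypothetical provider upload API; it only ever receives ciphertext."""
    print(f"Uploading {len(blob)} opaque bytes to the provider")

# The key is generated and kept on the customer's premises, never shared.
key = Fernet.generate_key()
cipher = Fernet(key)

record = b"Customer support transcript: ..."
ciphertext = cipher.encrypt(record)   # encrypted before leaving the customer's facility
upload_to_saas(ciphertext)            # the provider stores data it cannot decrypt

# Only the key holder (the customer) can recover the plaintext.
assert cipher.decrypt(ciphertext) == record
```

End-to-end encrypted messengers apply the same principle: encryption and decryption happen on users' devices, and the service in the middle handles only opaque bytes plus the envelope metadata noted above.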

These levels may seem clear, but there are many variations within a given level. For instance, a promise not to train on your data can mean different things to different companies. Some forms of generative AI, particularly image generators, can replicate their training data, so training a generative AI algorithm on customer data may run some risk of leaking it. On the other hand, tuning a handful of an algorithm’s hyperparameters (such as learning rate) to customer data, while technically part of the training process, is very unlikely to result in any direct data leakage. So how the data is used in training will affect the risk of leakage.

Similarly, the Limited Access level has its complexities. If a company offers this level of privacy, it’s good to understand exactly under what circumstances its employees may look at your data. And if they might look at your data, there are shades of gray in terms of how private the data remains. For example, if a limited group of employees in a secure environment can see only short snippets that have been disassociated from your company ID, that’s far more protective of your privacy than if a large number of employees can freely browse your data. 

In outlining levels of privacy, I am not addressing the question of security. To trust a company to deliver a promised level of privacy is also to trust that its IT infrastructure is secure enough to keep that promise. 

Over the past decade, cloud-hosted SaaS software has gained considerable traction. But some customers insist on running on-prem solutions within their own data centers. One reason is that many SaaS providers offer only No Guarantees or No Outside Exposure, while many customers’ data is sensitive enough to require at least Limited Access. 

I think it would be useful for our industry to have a more sophisticated way to talk about privacy and help users understand what guarantees providers do and do not deliver. 

As privacy becomes a global concern, regulators are stepping in, adding further complexity for tech businesses. For example, if one jurisdiction changes the definition of a child from someone under 13 to anyone under 18, that might require changes to how you store data of individuals ages 13 to 18; but who has time to keep track of such changes?

I've been delighted to see that here, AI can help. Daphne Li, CEO of Commonsense Privacy (disclosure: a portfolio company of AI Fund), is using large language models to help companies systematically evaluate, and potentially improve, their privacy policies as well as keep track of global regulatory changes. In the matter of privacy, as in other areas, I hope that the title of my TED AI talk — “AI Isn’t the Problem, It’s the Solution” — will prove to be true.

Keep learning!

Andrew

P.S. Check out our new short course with Amazon Web Services on “Serverless LLM Apps With Amazon Bedrock,” taught by Mike Chambers. A serverless architecture enables you to deploy applications quickly without setting up and managing the compute servers they run on, which is often a full-time job in itself. In this course, you’ll learn how to implement serverless deployment by building event-driven systems. We illustrate this approach via an application that automatically detects incoming customer inquiries, transcribes them with automatic speech recognition, summarizes them with an LLM using Amazon Bedrock, and runs serverless with AWS Lambda. We invite you to enroll here.
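
For a rough sense of the summarization step described above, here is a minimal sketch of an AWS Lambda handler that calls an LLM through Amazon Bedrock, assuming boto3 and the Amazon Titan text model’s request and response format; the model ID, the event’s transcript field, and the prompt are illustrative assumptions, and the course’s actual code may differ.

```python
# Minimal sketch, not the course's exact code: a Lambda handler that summarizes
# an already-transcribed customer inquiry via Amazon Bedrock.
# Assumes boto3 and the Amazon Titan text model's request/response format;
# the "transcript" event field is a hypothetical output of an upstream ASR step.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def lambda_handler(event, context):
    transcript = event["transcript"]  # text produced by the speech-recognition step
    request_body = json.dumps({
        "inputText": f"Summarize this customer inquiry:\n\n{transcript}",
        "textGenerationConfig": {"maxTokenCount": 256, "temperature": 0.2},
    })
    response = bedrock.invoke_model(
        modelId="amazon.titan-text-express-v1",  # illustrative choice of model
        body=request_body,
    )
    result = json.loads(response["body"].read())
    summary = result["results"][0]["outputText"]
    return {"statusCode": 200, "body": json.dumps({"summary": summary})}
```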
