
How can we build AIs with user data while respecting personal privacy?

Warrick Cooke
25 Mar 2022

In a world where customer data is a major source of value, privacy concerns could limit innovation.

Most companies want to be data-driven. From healthcare to automotive to consumer products, they all want to collect data on the people using their products so they can launch personalized apps and gain customer insights. Whether it’s data on patient lifestyles, driving habits, or skincare regimes, the information is valuable.

However, the user’s data belongs to the user. They are not obliged to part with it. Sometimes they will do so in return for a clear benefit or freebie. But users increasingly opt out when they can, especially when it concerns things like health metrics that they may not want on a company server. They will also hold companies to account if they don’t take proper care of their data.

There is another challenge emerging. Data is valuable because it can be used to build the artificial intelligence (AI) models at the heart of personalized customer apps. However, these models can be reverse engineered to identify the private data used to train them, even if that data is anonymized. In one well-known example, Netflix made an anonymized data set available for a data science competition, and researchers showed how private records could be re-identified by combining them with public data from IMDb, the online database of film and television information.[1]

Creating AI tools that respect user privacy

Imagine a wearable device that monitors your health metrics and gives you personalized health advice. Such a device would collect data about your state of health (e.g., heart rate, steps taken) and other parameters relevant to maintaining optimal health, such as temperature, humidity, and weather. This data would feed a model trained to spot markers of health concerns and recommend solutions.

One way to maintain privacy is to store and process all the data on the device. This requires significant computing power, but edge computing now makes it possible to run sophisticated models at the wearable-device level. Processing at the edge also means the company never receives the user data, so there are no privacy concerns for the user.

Still, some data, such as weather information, needs to be requested from an external source. This could disclose personally identifiable data to that source, since a weather request reveals the user’s location.

A disclosure like this highlights how hard it is for the user to avoid sharing personal data. One solution is to make many requests, covering many locations, via a proxy server. The device knows which location is genuine, so it discards the responses it doesn’t need, while the weather provider has no idea which request is the real one, who made it, or why.
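To make the decoy idea concrete, here is a minimal Python sketch. The weather service is simulated locally and the function and field names (`simulated_weather_service`, `fetch_weather_privately`, `temp_c`) are invented for illustration; in a real deployment each query would travel through the proxy so the provider sees neither the requester nor which location is genuine.

```python
import random

def simulated_weather_service(lat: float, lon: float) -> dict:
    """Stand-in for an external weather API (hypothetical)."""
    return {"lat": lat, "lon": lon, "temp_c": round(random.uniform(-5, 30), 1)}

def fetch_weather_privately(true_lat: float, true_lon: float, n_decoys: int = 9) -> dict:
    """Query the weather for the real location hidden among decoy locations.

    Only the device knows which query is genuine; the service (and the proxy)
    see just a batch of plausible locations.
    """
    queries = [(true_lat, true_lon)]
    queries += [(random.uniform(-60, 70), random.uniform(-180, 180)) for _ in range(n_decoys)]
    random.shuffle(queries)  # so the genuine query isn't always first in the batch

    # In practice these requests would go out through the proxy server.
    responses = {loc: simulated_weather_service(*loc) for loc in queries}

    # The device discards the decoy responses and keeps only its own.
    return responses[(true_lat, true_lon)]

if __name__ == "__main__":
    print(fetch_weather_privately(51.5072, -0.1276))  # e.g., London
```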

How can we model data without compromising privacy?

The proxy server idea above is a good solution if your primary goal is to provide users with a useful AI tool. But what if you want to collect their data?

Say you are studying arthritis. You want to dig into your wearables’ health data to pull out records of users with arthritis so you can review the link between lifestyle and changes in health metrics over time. Or, more prosaically, you may want to know how the device is being used so you can help users maximize its value.

If you take the data off the device and upload it to your company’s cloud for processing, you run into privacy issues.

Private data is encrypted when stored or transmitted, which doesn’t cause too many worries. However, it needs to be decrypted to train models. This step creates the possibility that the user’s identity is revealed to the people working on the model, and it opens a window of opportunity for the data to be stolen while it is decrypted. And even after training, reverse engineering the model can make it possible to identify the user.

Decryption requires user consent, which may not be forthcoming. People worry about their data being hacked, and not everyone likes the idea of strangers looking at their data, even those beyond reproach, such as data scientists.

The solution is to adopt techniques that allow anonymous data to be combined into larger models without anyone seeing anything that could be used to identify an individual.

Three techniques that deliver personal data privacy

One relatively simple defense against reverse engineering is to insert fake records into the training data. The model can be designed to compensate for this noise, while anyone attacking the model is unable to tell real users from fake ones.
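As a rough illustration of the idea, the sketch below mixes synthetic records into a set of real ones. The record fields and helper names are hypothetical; the point is that released data (or a model trained on it) does not reveal which records are genuine, while the analyst, who knows the proportion of fakes, can still correct aggregate statistics.

```python
import random

def make_fake_record() -> dict:
    """Generate a plausible but entirely synthetic user record (invented fields)."""
    return {
        "avg_heart_rate": random.randint(55, 95),
        "daily_steps": random.randint(2000, 15000),
        "has_arthritis": random.random() < 0.2,
        "is_real": False,  # stripped before the data leaves the trusted environment
    }

def add_fake_records(real_records: list[dict], fake_fraction: float = 0.5) -> list[dict]:
    """Mix a known proportion of fake records into the real data set."""
    fakes = [make_fake_record() for _ in range(int(len(real_records) * fake_fraction))]
    mixed = real_records + fakes
    random.shuffle(mixed)
    return mixed

if __name__ == "__main__":
    real = [{"avg_heart_rate": random.randint(60, 90),
             "daily_steps": random.randint(3000, 12000),
             "has_arthritis": random.random() < 0.15,
             "is_real": True} for _ in range(1000)]
    protected = add_fake_records(real)
    print(len(protected), "records,", sum(not r["is_real"] for r in protected), "of them fake")
```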

There are more sophisticated techniques that provide greater end-to-end data privacy. One is differential privacy, which applies random changes to data at the point of collection (i.e., on the device) before the data is transmitted. So the model – or anyone who steals the data – has no idea whether any individual record is accurate. But because we know the level of randomness, and hence the probability that a piece of data is wrong, we can reconstruct an accurate group-level picture that is reliably predictive of user behavior.
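One simple on-device flavor of this is randomized response. The sketch below assumes a single yes/no attribute and a fixed truth probability (both invented for illustration): no individual report can be trusted, yet the group-level rate is recoverable because the noise probabilities are known.

```python
import random

P_TRUTH = 0.75  # probability a device reports its true value (assumed for this sketch)

def randomize(true_value: bool) -> bool:
    """Run on the device: report the truth with probability P_TRUTH, otherwise a coin flip."""
    if random.random() < P_TRUTH:
        return true_value
    return random.random() < 0.5

def estimate_true_rate(reported: list[bool]) -> float:
    """Run on the server: invert the known noise to recover the group-level rate."""
    observed = sum(reported) / len(reported)
    # observed = P_TRUTH * true_rate + (1 - P_TRUTH) * 0.5
    return (observed - (1 - P_TRUTH) * 0.5) / P_TRUTH

if __name__ == "__main__":
    true_values = [random.random() < 0.3 for _ in range(100_000)]  # 30% truly have the condition
    reported = [randomize(v) for v in true_values]
    print(round(estimate_true_rate(reported), 3))  # close to 0.30, though no single report is reliable
```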

Homomorphic encryption is another option that is starting to be used. This technique allows data to be processed while it remains encrypted, so models can be built without anyone ever seeing the underlying records. For example, it makes it possible to find the people with arthritis in the wearables data set, run calculations on their data, and create a useful model based on group-level insights without decrypting any personal records.
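The snippet below illustrates the core property using the open-source python-paillier library (installed as `phe`). Paillier is only additively homomorphic, so this is a simplification of the richer schemes mentioned here, and the metric is invented for illustration: the server sums ciphertexts it cannot read, and only the aggregate is ever decrypted.

```python
from phe import paillier

# Key generation happens on the data owner's side.
public_key, private_key = paillier.generate_paillier_keypair()

# Each device encrypts its weekly "painful joint days" count (hypothetical metric)
# before uploading; the server never sees the raw values.
reported_values = [3, 0, 5, 2, 4]
ciphertexts = [public_key.encrypt(v) for v in reported_values]

# The server adds the ciphertexts together without decrypting anything.
encrypted_total = sum(ciphertexts[1:], ciphertexts[0])

# Only the holder of the private key can decrypt, and only the aggregate result.
total = private_key.decrypt(encrypted_total)
print(total / len(reported_values))  # group-level average: 2.8
```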

The mathematics of homomorphic encryption dates back to the 1970s, but only recently has computing power made it usable in practical applications. Even now, those applications are largely limited to well-funded organizations that can throw significant computing power at the problem. However, it is gaining interest and is likely to become an important tool for building complex AIs without compromising privacy.

Building a privacy-preserving app

For makers of privacy-conscious applications and devices, these options are best considered at the design stage. It is hard to layer stringent privacy requirements on top of a fully formed app.

The early design stage should encompass exploring the available data, identifying the insights it can yield, and deciding what additional data would be beneficial to acquire, such as location or weather data. If computation is to be partially or wholly handled on the device, its technical capabilities and constraints must be considered. And it is crucial to explore data privacy techniques that ensure the user cannot be identified. Privacy needs serious consideration, in the context of a complete understanding of the data being processed.

As models become more complex and hackers become more sophisticated, privacy needs to be built into AIs from the very start.

Capgemini Engineering helps customers design and monetize intelligent services from connected products while ensuring personal data remains private and secure. To discuss your intelligent product and data privacy challenges, contact: engineering@capgemini.com

Author: Warrick COOKE, Consultant, Hybrid Intelligence, Capgemini Engineering

Warrick is an experienced strategic technology consultant who helps clients apply advanced and emerging technologies to solve their business problems. He has worked with scientific and R&D companies across multiple domains, including the pharmaceutical and energy sectors.

[1] Bruce Schneier, “Why ‘Anonymous’ Data Sometimes Isn’t,” Wired, Dec 12, 2007. https://www.wired.com/2007/12/why-anonymous-data-sometimes-isnt/