AI promises to transform, and in many cases already has transformed, entire industries, from civic planning and health care to cybersecurity. But privacy remains an unsolved challenge. Spotlighting the issue, Microsoft two years ago quietly removed a dataset containing more than 10 million images of people after it came to light that some of the subjects weren't aware they'd been included.
One proposed, if partial, solution to AI's privacy problem is differential privacy. The technique injects a small amount of statistical noise into data before it's fed to an AI system, making it difficult to extract the original records from the system. Someone looking at a differentially private AI system's predictions can't tell whether a particular person's information was used to develop it.
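To make the idea concrete, here is a minimal sketch of the Laplace mechanism, the textbook building block behind differential privacy. The function, dataset, and epsilon value below are purely illustrative and aren't drawn from any vendor's implementation.

```python
# A minimal sketch of the Laplace mechanism for a counting query.
# Names and parameters are illustrative, not any specific library's API.
import numpy as np

def private_count(records, predicate, epsilon):
    """Release a count with (epsilon)-differential privacy.

    A counting query has sensitivity 1: adding or removing one person
    changes the true count by at most 1, so Laplace noise with scale
    1/epsilon is enough to hide any individual's contribution.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: count people over 60 without revealing whether any one
# individual is in the dataset (smaller epsilon = more noise, more privacy).
ages = [34, 71, 58, 66, 45, 80, 62]
print(private_count(ages, lambda age: age > 60, epsilon=0.5))
```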
Expanding differential privacy
Google’s announcement coincides with both the one-year anniversary of its collaboration with OpenMined and Data Privacy Day, which commemorates the January 1981 signing of Convention 108, the first legally binding international treaty on data protection. Google open-sourced its differential privacy library, which the company claims is used in core products like Google Maps, in September 2019; an experimental module for testing the privacy of AI models arrived later.
“In 2019, we launched our open-sourced version of our foundational differential privacy library in C++, Java and Go. Our goal was to be transparent, and allow researchers to inspect our code. We received a tremendous amount of interest from developers who wanted to use the library in their own applications, including startups like Arkhn, which enabled different hospitals to learn from medical data in a privacy-preserving way, and developers in Australia that have accelerated scientific discovery through provably private data,” Google differential privacy product lead Miguel Guevara wrote in a blog post. “Since then, we have been working on various projects and new ways to make differential privacy more accessible and usable.”
Growing support
Google is among several tech giants that have released differential privacy tools for AI in recent years. In May 2020, Microsoft debuted SmartNoise, which was developed in collaboration with researchers at Harvard. Not to be outdone, Meta (formerly Facebook) recently open-sourced a PyTorch library for differential privacy dubbed Opacus.
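For a sense of what these tools look like in practice, below is a minimal sketch of differentially private training with Opacus, assuming Opacus 1.x and a toy model and dataset; the architecture, hyperparameters, and privacy budget are placeholder values, not recommendations from Meta.

```python
# A minimal sketch of DP-SGD training with Opacus (assumes Opacus >= 1.0).
# The model, data, and hyperparameters below are toy stand-ins.
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy dataset and model.
features = torch.randn(1024, 20)
labels = torch.randint(0, 2, (1024,))
data_loader = DataLoader(TensorDataset(features, labels), batch_size=64)

model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = optim.SGD(model.parameters(), lr=0.05)
criterion = nn.CrossEntropyLoss()

# PrivacyEngine wraps the model, optimizer, and loader so that each step
# clips per-sample gradients and adds calibrated noise (DP-SGD).
privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.0,  # scale of the added noise (illustrative)
    max_grad_norm=1.0,     # per-sample gradient clipping bound (illustrative)
)

for epoch in range(3):
    for x, y in data_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

# Report how much privacy budget (epsilon) the training run has spent.
print("epsilon spent:", privacy_engine.get_epsilon(delta=1e-5))
```

The noise multiplier and clipping norm control the privacy-utility trade-off: more noise and tighter clipping yield a smaller epsilon but typically lower model accuracy.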
Studies underline the urgent need for techniques to conceal private data in the datasets used to train AI systems. Researchers have shown that even anonymized X-ray datasets can reveal patient identities, for example. And large language models like OpenAI’s GPT-3 have been shown to leak names, phone numbers, addresses, and other details from their training data when fed certain prompts.