Societies, economies and services of all kinds rely more and more on the massive processing of data, whether for decision-making or simply to improve their products. This means that every time we perform an action in the digital world – and progressively also in the physical one – it is recorded somewhere, processed and cross-referenced with other databases of varied origin.
Along the way, user privacy is at stake, and the tool that stands out most for preserving it is so-called "differential privacy". This technique applies the statistical and mathematical machinery needed to guarantee, in a robust way, that individuals cannot be identified in the data – which in many cases may be sensitive or critical – while still allowing the broad trends derived from it to be exploited.
This approach to data processing proposed by differential privacy – whose origins lie partly in the publications of Cynthia Dwork, a researcher at Microsoft – is being implemented by technology giants such as Google (which has been committed to it since before it even went by that name, starting some five years ago in Chrome), Apple and Uber. The ultimate goal: to accumulate and process ever more data about you, of all kinds, without even being able to determine which specific data points actually tie it to you.
Google recently released part of the libraries it uses internally for this purpose, so that any company or organization handling large amounts of data can continue to do so, but with certain guarantees at the privacy level and without having to program everything from scratch. We spoke with Miguel Guevara, product manager in the privacy and data protection division at Google, who gives Hipertextual some of the keys to this new free-software initiative.
Data processing armored by statistics
Protecting a database from revealing sensitive details takes more than replacing part of the data with encoded strings (hashing the most sensitive fields, such as names), and the Netflix case is a clear precedent. In 2007, when the platform was starting to stream video on demand, it offered a one-million-dollar prize – to improve its recommendation system – to whoever managed to improve the performance of its algorithm by at least 10%.
To that end, it published a database with 100 million ratings from 500,000 users, with some elements hashed so that they were not directly identifiable. To its surprise, these data were partially and easily de-anonymized when cross-referenced with IMDb ratings. A pair of researchers from the University of Texas soon obtained the details of users active on both platforms, "uncovering their apparent political preferences and other potentially sensitive information," as the abstract of their publication reads.
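The failure mode behind that episode can be illustrated with a minimal sketch (all names, titles and ratings below are invented, and this is not the researchers' actual method): hashing the user column does nothing to stop someone who already knows a handful of a person's publicly visible ratings from singling out that person's row in the release.

```python
# Toy linkage attack: names are hashed, but the rating pattern itself
# acts as a fingerprint that can be joined against a public dataset.
# All data here is invented for illustration.
import hashlib

# "Anonymized" release: names replaced by hashes, ratings kept intact.
released = [
    {"user": hashlib.sha256(b"alice").hexdigest(),
     "ratings": {"Movie A": 5, "Movie B": 1, "Movie C": 4}},
    {"user": hashlib.sha256(b"bob").hexdigest(),
     "ratings": {"Movie A": 2, "Movie B": 5, "Movie C": 2}},
]

# Public profile on another site, where the person rates under a real name.
public_profile = {"name": "alice", "ratings": {"Movie A": 5, "Movie B": 1}}

def matches(released_row, profile):
    """A row matches if every publicly known rating agrees with the release."""
    return all(released_row["ratings"].get(movie) == score
               for movie, score in profile["ratings"].items())

reidentified = [row for row in released if matches(row, public_profile)]
print(len(reidentified))  # 1 -- the hash did not prevent re-identification
```

Even two known ratings narrow the candidates down to a single row here; with the sparse, high-dimensional rating histories of a real catalog, the fingerprint only gets sharper.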
Of course, this risk grows as the databases we form part of accumulate more and more entries at different levels, allowing that anonymization to be dismantled using contextual information that can in many cases be obtained relatively easily, and is sometimes even publicly available.
Privacy, defined robustly – and mathematically
Differential privacy plugs this hole. It "allows you to learn aggregate statistics about a population," Guevara explains, "and at the same time prevents, in a very systematic way, an observer from obtaining information about a specific user." It does so, essentially, by adding more statistical noise to the answer the more specific the question we ask of the database: something like an adaptation of Heisenberg's uncertainty principle from physics, applied by social imperative to data science.
If we want very specific data about a very small subset of subjects in the sample, the noise will be larger relative to the smaller sample, and the results will therefore tend to become progressively more useless in practice. "At that point, the noise you are introducing is so great that the results become garbage," says Miguel Guevara. Managing large databases under differential privacy schemes thus becomes, a priori, quite reassuring.
If you ask for too much detail under the standards of differential privacy, "the noise you are introducing is so great that the results become garbage," says Guevara
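The idea can be sketched with the classic Laplace mechanism – a toy illustration, not Google's library or its API; the function name and parameters are invented. The noise scale depends only on the query's sensitivity and the privacy parameter epsilon, so the same noise that barely moves an answer about a million people completely drowns an answer about three.

```python
# Minimal sketch of the Laplace mechanism behind differential privacy.
import random

def noisy_count(true_count, epsilon, sensitivity=1.0):
    """Return the count plus Laplace noise of scale sensitivity/epsilon.

    One person can change a count by at most 1 (its sensitivity), so
    noise of scale 1/epsilon is enough to hide any single individual's
    contribution. The difference of two i.i.d. exponential samples is
    exactly a Laplace sample, which avoids hand-rolling the inverse CDF.
    """
    scale = sensitivity / epsilon
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

# A broad question over a million users barely moves...
print(noisy_count(1_000_000, epsilon=0.1))
# ...while the same noise drowns a question about 3 users.
print(noisy_count(3, epsilon=0.1))
```

With epsilon = 0.1 the noise has scale 10: invisible against a count of a million, but larger than the entire true answer when only three people are involved – which is the "results become garbage" regime Guevara describes.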
In any case, the use of differential privacy in a given project does not shield the specific information of the individuals who appear in it – at least not per se. There are several ways to apply it, and it is in the so-called global model where "what the data controller can do is place a layer between the database and whoever is accessing that information, and that layer uses differential privacy; that is what we are open-sourcing," according to the Google product manager, who argues that the technique "is very flexible."
This approach allows companies to work under the differential privacy model while always keeping control of the data they work on, relying on the figure of a supervisor of that data. Miguel Guevara notes that "it gives the data controller the possibility of making a very rational decision about the risk they want to incur in sharing that data."
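The global model Guevara describes can be sketched roughly like this – a hypothetical toy, not the open-sourced library: raw records never leave the controller, every query passes through a differentially private layer, and a total privacy budget caps how much any one analyst can extract. All class and parameter names here are invented.

```python
# Hypothetical sketch of the "global model": a DP layer between the raw
# database and whoever queries it, enforcing a total privacy budget.
import random

class PrivateQueryLayer:
    def __init__(self, records, total_epsilon):
        self._records = records        # raw data, never exposed directly
        self._budget = total_epsilon   # total privacy budget for this analyst

    def count(self, predicate, epsilon):
        """Answer a counting query with Laplace noise, spending budget."""
        if epsilon > self._budget:
            raise RuntimeError("privacy budget exhausted")
        self._budget -= epsilon
        true = sum(1 for r in self._records if predicate(r))
        # Laplace(0, 1/epsilon) as a difference of two exponentials.
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return true + noise

layer = PrivateQueryLayer([{"age": a} for a in range(100)], total_epsilon=1.0)
print(layer.count(lambda r: r["age"] >= 65, epsilon=0.5))  # noisy, roughly 35
```

The budget accounting is the "very rational decision about risk" in code form: the controller chooses the total epsilon up front, and once queries have spent it, the layer simply stops answering.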
Privacy or fairness
Guevara explains how, according to recent studies, "you cannot have, in the context of machine learning, both fairness and privacy": "Imagine a Quechua group in the highlands of Peru that also wants to use Google's predictive keyboard. If we want to train a model for them, we need some type of information about those databases. But if we train it with differential privacy, we will end up producing a model that does not work for very small populations."
And, he says, "the debate is very recent," but the technique already allows adjusting to the needs of each environment: "the parameters of differential privacy allow you, if you wish, to protect the presence or absence of groups." As an example, "any kind of minority you can imagine." Among them, ethnic groups are especially vulnerable, such as "Muslims in a country where there are not many."
Open and collaborative initiative
With TensorFlow, the Mountain View giant already provides one of the most widely used library sets in data science, including in the field of privacy and encryption. With this new contribution, Google again expects broad adoption: "there are very few libraries in this field, and especially libraries that operate at scale" like the one it is publishing now, which "we also use internally in our services," says Guevara. "It took us a long time to develop this library – about two years – to make it robust enough. My hope is that organizations that lack those resources, or the time, can use it to get more value from the data they have without compromising the privacy of their users."
"Organizations that do not have such resources, or do not have the time, can use it to get more value from the data they have without compromising the privacy of their users"
And it is not a one-way process: it is also reciprocal with the community, from which they hope to receive feedback at several levels, feedback that could even strengthen the privacy of their products. "We are very inspired by the cryptographic field. In 'crypto', to prove that an encryption algorithm is secure, what people have done is release it to the community, so that the community starts attacking it and finds out whether or not there are flaws." "We hope it comes from organizations, civil society, governments and researchers. This first stage of the library is very focused on people with software skills, or data scientists. Any kind of feedback is welcome."
Thanks to this technique, barely a few years old, projects can be developed that preserve the privacy of those who appear in them without having to reinvent the wheel in each implementation – something that giants like Apple or Uber can afford, but perhaps not other, smaller companies.
At Google, they expect this type of library to be considered by any kind of project that handles a significant volume of data. Regarding its size, Miguel says that "any company or organization that manages data from more than one hundred individuals can benefit from this library": "social scientists, economists," or perhaps also those looking to detect "consumption patterns of a sensitive nature."
The debate that may arise for a company is whether to make an extra effort in exchange for access to a smaller amount of data, or less granular data. Asked whether the incentive – which can be ethical, but also preventive – is enough, Guevara argues that "the same doubt you have now, we had internally." "We have found that people get used to using data that is not so exact, the result of differential privacy. It can be a slow process, because it changes the perspective on how we understand data today. It means accepting that the data we obtain will carry some noise, and that some of it will be completely suppressed, but it is important to remember that the large population-level trends in a database remain fully intact, and the statistical rigor is still there."
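The suppression Guevara mentions can be illustrated with a hypothetical thresholded histogram (all names and parameters below are invented): after noise is added, partitions whose noisy count falls below a threshold are dropped entirely, so large trends survive essentially unchanged while tiny groups disappear from the output.

```python
# Toy sketch of noisy-count-then-threshold suppression: small partitions
# are removed so their very presence cannot be inferred from the release.
import random

def private_histogram(counts, epsilon, threshold):
    """Add Laplace(0, 1/epsilon) noise to each count, drop small partitions."""
    out = {}
    for key, true in counts.items():
        # Laplace sample as a difference of two exponentials.
        noisy = true + random.expovariate(epsilon) - random.expovariate(epsilon)
        if noisy >= threshold:        # suppress unreliable small partitions
            out[key] = noisy
    return out

raw = {"popular_product": 12_000, "niche_product": 4}
print(private_histogram(raw, epsilon=0.5, threshold=50))
# The large trend survives; the tiny partition is dropped.
```

This is the trade-off in miniature: the 12,000-unit trend comes back within a handful of units of the truth, while the four-person partition – exactly the kind of detail that could expose someone – never appears in the result at all.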
"As people begin to have this intuition about how it works, this intuition can give them greater assurance about how their data is being used."
Given the growing collection of data by companies, and asked about a possible change of perception in the eyes of users, the Google product lead draws a parallel with what happened with data encryption. "Thirty years ago, the idea of encryption was very strange to most people. I think it still is, but as a community we have reached a level of intuitive understanding that allows people, more or less, to feel secure when they know their information is encrypted. I hope something similar happens with differential privacy: that as people begin to develop this intuition about how it works, it gives them greater security about how their data is being used."