De-identification is the process used to prevent a person’s identity from being connected with information. It is commonly used in human subject research to protect the privacy of research participants. Common strategies for de-identifying datasets include deleting or masking personal identifiers, such as name and social security number, and suppressing or generalizing quasi-identifiers, such as date of birth and zip code. The reverse process of defeating de-identification to identify individuals is known as re-identification. Several successful re-identification attempts[1][2][3][4] have cast doubt on the effectiveness of de-identification in protecting individuals' privacy. However, a systematic review of the evidence found that published re-identification attacks were performed on data sets that had not been properly de-identified according to recognized standards.

De-identification is one of the main approaches to data privacy protection. It is commonly used in the fields of communications, multimedia, biometrics, big data, cloud computing, data mining, the internet, social networks and audio–video surveillance.[5]

The United States President's Council of Advisors on Science and Technology and others have deemed de-identification "somewhat useful as an added safeguard" but not "a useful basis for policy" as "it is not robust against near‐term future re‐identification methods".[6]

Example

A survey, such as a census, is conducted to collect information about a group of people. To encourage participation and to protect respondents' privacy, the researchers design the survey so that, when the results are published, no participant's individual response can be matched with any of the published data.

An online shopping website wants to learn its users' preferences and shopping habits, so it decides to retrieve customer data from its database and analyze it. The personal data, including personal identifiers, was collected directly when customers created their accounts. Before analyzing the records, the website needs to pre-process the data with de-identification techniques to avoid violating its customers' privacy.

Anonymization and de-identification

Anonymization refers to irreversibly severing a data set from the identity of the data contributor in a study to prevent any future re-identification, even by the study organizers under any condition.[7][8] De-identification also severs a data set from the identity of the data contributor, but may include preserving identifying information that could only be re-linked by a trusted party in certain situations.[7][8][9] There is debate in the technology community about whether data that can be re-linked, even by a trusted party, should ever be considered de-identified.
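
The distinction can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the function names `deidentify` and `anonymize`, the `name` and `diagnosis` fields, and the sample records. It assumes the name is the only identifier in the data and ignores quasi-identifiers, so it does not by itself guarantee that re-identification is impossible.

```python
import secrets


def deidentify(records):
    """Replace each name with a random code and return the coded records
    together with a linkage table that a trusted party could hold."""
    linkage = {}  # code -> original name (the re-linking key)
    codes = {}    # name -> code, so a repeated name gets the same code
    coded = []
    for rec in records:
        name = rec["name"]
        if name not in codes:
            code = secrets.token_hex(8)
            codes[name] = code
            linkage[code] = name
        out = {k: v for k, v in rec.items() if k != "name"}
        out["code"] = codes[name]
        coded.append(out)
    return coded, linkage


def anonymize(records):
    """Drop the name and keep no linkage table, so this output alone
    gives nobody a way to map a row back to a name."""
    return [{k: v for k, v in rec.items() if k != "name"} for rec in records]


# Illustrative sample data (not from any real study).
records = [
    {"name": "Alice Example", "diagnosis": "asthma"},
    {"name": "Bob Example", "diagnosis": "diabetes"},
]

coded, linkage = deidentify(records)   # re-linkable by whoever holds `linkage`
anonymous = anonymize(records)         # no linkage table exists
```

In this sketch, whoever holds `linkage` can look up `linkage[coded[0]["code"]]` to recover the contributor's name, which is the re-linking by a trusted party described above; the output of `anonymize` has no such table.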

Techniques

The common strategies of de-identification are masking personal identifiers and generalizing quasi-identifiers. Pseudonymization is the main technique used to mask personal identifiers in data records, and k-anonymization is usually adopted for generalizing quasi-identifiers.

Pseudonymization

Pseudonymization replaces real names with a temporary ID; it deletes or masks the personal identifiers so that individuals cannot be directly identified. This method makes it possible to track an individual's record over time even as the record is updated. However, it cannot prevent an individual from being identified if a specific combination of attributes in the data record indirectly identifies that individual.[10]
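
A minimal Python sketch of the idea follows. It uses a keyed hash (HMAC-SHA256) to derive the ID, which is one possible approach and not necessarily the one described in the cited source. The key `SECRET_KEY`, the helper names `pseudonym` and `pseudonymize`, the `pid` field and the sample records are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be generated randomly and kept private.
SECRET_KEY = b"replace-with-a-randomly-generated-secret"


def pseudonym(name: str, key: bytes = SECRET_KEY) -> str:
    """Derive a stable pseudonym from a name using HMAC-SHA256."""
    digest = hmac.new(key, name.encode("utf-8"), hashlib.sha256).hexdigest()
    return "P-" + digest[:12]


def pseudonymize(record: dict) -> dict:
    """Return a copy of the record with the name replaced by a pseudonym."""
    out = {k: v for k, v in record.items() if k != "name"}
    out["pid"] = pseudonym(record["name"])
    return out


# Two records for the same (fictional) person, created at different times.
visit_1 = {"name": "Alice Example", "zip": "02139", "birth_year": 1980, "visit": "2015-01-10"}
visit_2 = {"name": "Alice Example", "zip": "02139", "birth_year": 1980, "visit": "2016-03-22"}

p1, p2 = pseudonymize(visit_1), pseudonymize(visit_2)
```

Both records receive the same `pid`, so they can still be linked over time without the name. The `zip` and `birth_year` values pass through unchanged, which is the weakness noted above: such attribute combinations may still single out an individual.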

k-anonymization

k-anonymization defines attributes that could indirectly point to an individual's identity as quasi-identifiers (QIs) and processes the data so that at least k individuals share the same combination of QI values.[10] The QI values are handled according to specific standards. For example, k-anonymization replaces some original values in the records with ranges and keeps other values unchanged. The new combinations of QI values prevent individuals from being identified while avoiding destruction of the data records.
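
The following Python sketch shows one way such a generalization and the resulting k-anonymity property could be checked. The generalization rules (10-year age ranges, three-digit ZIP prefixes), the helper names `generalize` and `is_k_anonymous`, and the sample data are assumptions made for illustration; real k-anonymization tools choose generalizations more carefully.

```python
from collections import Counter


def generalize(record):
    """Coarsen the quasi-identifiers: age -> 10-year range, ZIP -> 3-digit prefix."""
    low = (record["age"] // 10) * 10
    return {
        "age": f"{low}-{low + 9}",
        "zip": record["zip"][:3] + "**",
        "condition": record["condition"],  # not a QI here, kept unchanged
    }


def is_k_anonymous(records, qi, k):
    """True if every combination of QI values occurs in at least k records."""
    counts = Counter(tuple(r[a] for a in qi) for r in records)
    return all(c >= k for c in counts.values())


# Illustrative sample data (not from any real data set).
raw = [
    {"age": 34, "zip": "02139", "condition": "asthma"},
    {"age": 37, "zip": "02141", "condition": "flu"},
    {"age": 52, "zip": "02446", "condition": "diabetes"},
    {"age": 58, "zip": "02445", "condition": "asthma"},
]
qi = ("age", "zip")
released = [generalize(r) for r in raw]

print(is_k_anonymous(raw, qi, 2))       # False: every QI combination is unique
print(is_k_anonymous(released, qi, 2))  # True: each combination appears twice
```

In the raw data each (age, zip) pair is unique, so any record can be singled out by its QIs. After generalization the records fall into two groups of two, ("30-39", "021**") and ("50-59", "024**"), so the released table is 2-anonymous with respect to these QIs.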

Applications

Research into de-identification is driven mostly by the need to protect health information.[11] Some libraries have adopted methods used in the healthcare industry to preserve their readers' privacy.[11]

In big data, de-identification is widely adopted by individuals and organizations.[6] With the development of social media, e-commerce and big data, de-identification is required and used for data privacy when users' personal data is collected for analysis by companies or third-party organizations. Social network sites collect and store their users' data to analyze user behavior, and they adopt de-identification to protect their users' privacy. Online shopping websites adopt this method as well.

Limits

De-identification laws in the United States of America

See also

References

  1. ^ Sweeney, L. (2000). "Simple Demographics Often Identify People Uniquely". Data Privacy Working Paper. 3.
  2. ^ de Montjoye, Yves-Alexandre; Hidalgo, César A.; Verleysen, Michel; Blondel, Vincent D. (2013-03-25). "Unique in the Crowd: The privacy bounds of human mobility". Scientific Reports. 3: 1376. doi:10.1038/srep01376. ISSN 2045-2322. PMC 3607247. PMID 23524645.
  3. ^ Montjoye, Yves-Alexandre de; Radaelli, Laura; Singh, Vivek Kumar; Pentland, Alex “Sandy” (2015-01-30). "Unique in the shopping mall: On the reidentifiability of credit card metadata". Science. 347 (6221): 536–539. doi:10.1126/science.1256297. hdl:1721.1/96321. ISSN 0036-8075. PMID 25635097. S2CID 206559189.
  4. ^ Narayanan, A. (2006). "How to break anonymity of the netflix prize dataset". arXiv:cs/0610105.
  5. ^ Ribaric, Slobodan; Ariyaeeinia, Aladdin; Pavesic, Nikola (2016-09-01). "De-identification for privacy protection in multimedia content: A survey". Signal Processing: Image Communication. 47: 131–151. doi:10.1016/j.image.2016.05.020.
  6. ^ a b PCAST. "Report to the President - Big Data and Privacy: A technological perspective" (PDF). Retrieved 28 March 2016.
  7. ^ a b Godard, B. A.; Schmidtke, J. R.; Cassiman, J. J.; Aymé, S. G. N. (2003). "Data storage and DNA banking for biomedical research: Informed consent, confidentiality, quality issues, ownership, return of benefits. A professional perspective". European Journal of Human Genetics. 11: S88–122. doi:10.1038/sj.ejhg.5201114. PMID 14718939. S2CID 20453472.
  8. ^ a b Fullerton, S. M.; Anderson, N. R.; Guzauskas, G.; Freeman, D.; Fryer-Edwards, K. (2010). "Meeting the Governance Challenges of Next-Generation Biorepository Research". Science Translational Medicine. 2 (15): 15cm3. doi:10.1126/scitranslmed.3000361. PMC 3038212. PMID 20371468.
  9. ^ McMurry, AJ; Gilbert, CA; Reis, BY; Chueh, HC; Kohane, IS; Mandl, KD (2007). "A self-scaling, distributed information architecture for public health, research, and clinical care". J Am Med Inform Assoc. 14 (4): 527–33. doi:10.1197/jamia.M2371. PMC 2244902. PMID 17460129.
  10. ^ a b Ito K, Kogure J, Shimoyama T, Tsuda H. (2016). "De-identification and Encryption Technologies to Protect Personal Information". Fujitsu Scientific & Technical Journal. 28–36.
  11. ^ a b Nicholson, S.; Smith, C. A. (2006). "Using lessons from health care to protect the privacy of library users: Guidelines for the de-identification of library data based on HIPAA". Proceedings of the American Society for Information Science and Technology. 42: n/a. doi:10.1002/meet.1450420106.