A multi-billion dollar start-up was recently reported to have used an actual customer’s network environment for sales demonstrations. To make matters worse, the practice went on for years without the permission or knowledge of the customer, which happened to be a medical facility, and it had the potential to violate the Health Insurance Portability and Accountability Act of 1996 (HIPAA). It is understandable for a company to want to demonstrate its products or services in a lifelike manner, but data privacy and customer confidentiality are legal and regulatory obligations. There are ways, however, to demonstrate products and services using data that is close to production while protecting your customer’s data, complying with your own company’s legal and regulatory obligations, and still producing a quality demo.
First, let us take a quick look at some of the reasons why maintaining the confidentiality of customer data is so important. Beyond ethical and contractual reasons, regulatory regimes and frameworks around the globe require the protection of personal data, such as HIPAA, Japan’s Act on the Protection of Personal Information, the OECD Guidelines, the EU General Data Protection Regulation, and the APEC Privacy Framework. In addition to legal and regulatory obligations, customers have become more ‘privacy aware’ in recent years, paying closer attention to what data is collected, how it is used, who it is shared with, whether it is sold or rented, and how it is eventually destroyed. One step to minimize privacy risk and exposure is to de-identify or anonymize the data and set up a demo environment.
Anonymizing or de-identifying data prevents an observer from identifying, whether directly or by aggregation and/or inference, the person to whom the data relates (i.e., the data subject). Properly anonymized data is no longer Personally Identifiable Information (PII), because it is not possible to identify any individual data subject from it. De-identification, on the other hand, replaces PII with pseudonyms or alternative identifiers, leaving only authorized users with the ability to re-identify the data subjects.
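The pseudonym-based de-identification described above can be sketched in Python. This is a minimal illustration, not a prescribed method: it assumes a keyed hash (HMAC-SHA256) as the way pseudonyms are derived, and the names `SECRET_KEY`, `pseudonymize`, and `build_lookup`, as well as the sample records, are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret held only by authorized users. In practice it would
# come from a secrets manager, not from source code.
SECRET_KEY = b"replace-with-a-long-random-secret"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable pseudonym for an identifier using a keyed hash."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return "subj-" + digest[:12]


def build_lookup(identifiers, key: bytes = SECRET_KEY) -> dict:
    """Map pseudonym -> original value; only authorized users should hold this."""
    return {pseudonymize(i, key): i for i in identifiers}


# Made-up sample records standing in for production data.
records = [
    {"name": "Alice Example", "visits": 3},
    {"name": "Bob Example", "visits": 1},
]

# The demo data set carries pseudonyms instead of names.
demo_records = [
    {"subject": pseudonymize(r["name"]), "visits": r["visits"]} for r in records
]

# The lookup table is what allows authorized re-identification.
lookup = build_lookup(r["name"] for r in records)
```

The demo data set never contains the original names, while anyone holding both the key and the lookup table can map a pseudonym back to its data subject. Without the key, an observer cannot recompute the pseudonyms from guessed names.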
With a few simple steps, a company can anonymize or de-identify data to protect its customers, data subjects, and itself. For example, in Excel, a team can combine formulas such as RIGHT(), REPT(), and LEN() to mask a social security number so that only the last 4 digits are shown. A macro can also be written to overwrite the original data for more complete anonymization. VBA code examples are published on the internet and easily accessible to programmers. Generalization is another way to accomplish this. An example of generalization is taking specific data, such as household incomes of $175,234, $64,502 and $32,324, and turning them into ranges: “more than $150,000”, “between $50,000 and $100,000” and “less than $35,000”. Other techniques include:
- Data swapping: swap values across records in the table so that the original data locations and linkages are randomized.
- Randomization: use a mathematical formula to mix the data with random numbers or values.
- Perturbation or Noise: add random values and mismatched data to overwrite and obscure the original data.
- Redaction: suppressing or removing identifying data fields from the data set.
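The masking, generalization, and listed techniques can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a vetted anonymization tool: all function names and sample rows are hypothetical, the income brackets are made-up contiguous ranges rather than the ones quoted above, and `random` is used only for demonstration.

```python
import random


def mask_ssn(ssn: str) -> str:
    """Show only the last four characters, in the spirit of REPT/LEN/RIGHT."""
    return "*" * max(len(ssn) - 4, 0) + ssn[-4:]


def generalize_income(income: float) -> str:
    """Generalization: replace an exact income with a bracket (assumed ranges)."""
    if income > 150_000:
        return "more than $150,000"
    if income >= 50_000:
        return "$50,000 to $150,000"
    return "less than $50,000"


def swap_column(rows, column, rng):
    """Data swapping: shuffle one column across rows to break linkages."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


def add_noise(value: float, spread: float, rng) -> float:
    """Perturbation: add a random offset in the range [-spread, +spread]."""
    return value + rng.uniform(-spread, spread)


def redact(rows, fields):
    """Redaction: drop identifying fields from every row."""
    return [{k: v for k, v in row.items() if k not in fields} for row in rows]


# Made-up sample rows standing in for production data.
rows = [
    {"name": "A. Example", "ssn": "123-45-6789", "income": 175_234, "zip": "94000"},
    {"name": "B. Example", "ssn": "987-65-4321", "income": 64_502, "zip": "10000"},
    {"name": "C. Example", "ssn": "555-12-3456", "income": 32_324, "zip": "60000"},
]

rng = random.Random(42)  # seeded only so the demo is repeatable

demo_rows = redact(rows, {"name"})
demo_rows = swap_column(demo_rows, "zip", rng)
demo_rows = [
    {**row, "ssn": mask_ssn(row["ssn"]), "income": generalize_income(row["income"])}
    for row in demo_rows
]
```

Each function works on a copy of the rows, so the original data is left untouched and the demo data set can be regenerated or discarded at any time.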
Anonymization or de-identification can provide your sales teams with data that is robust enough to give customers a realistic view of the offering in operation while fulfilling your legal and regulatory obligations and meeting customer expectations of privacy and confidentiality. The National Institute of Standards and Technology (NIST) also has a robust guideline on de-identification: http://nvlpubs.nist.gov/nistpubs/ir/2015/NIST.IR.8053.pdf
To learn more about Cisco’s Data Protection Program, and how we view it as part of our DNA, visit our website: http://www.cisco.com/c/en/us/about/trust-transparency-center/data-protection.html
Totally agree Greg. Here in Oz there were privacy concerns over the compulsory census. We were assured that once data was gathered it would be anonymised, with any PII taken out. Data then linked by identifiers.
The last 4 digits of an SSN are the sensitive ones. The others are mostly guessable if you know where and when somebody was born, which is why the 4 digits often get used as a password. Like any password, you shouldn’t be storing it – at most you should be storing a salted hash (e.g. combining the customer’s other information and some long secret that belongs to your database) that you calculate each time and compare, and even that’s pretty weak for a 4-digit password.
Great point, thanks for the comment.
AGREED !