The SADA Engineering Blog

SADA is a dedicated Google Cloud Premier Partner focused on delivering innovative cloud technologies and tools, combined with expert engineers and exceptional customer experience.


Cloud DLP de-identification techniques and their use cases


By Mohita Agarwal, SADA Data Engineer

Cloud Data Loss Prevention

Cloud Data Loss Prevention (Cloud DLP) is a fully managed service meticulously designed to uncover, categorize, and safeguard sensitive data, regardless of its location, be it within databases, textual content, or even images. This service not only offers comprehensive visibility but also empowers organizations to classify sensitive data thoroughly. Ultimately, Cloud DLP plays a pivotal role in diminishing data-related risks by examining and altering structured and unstructured data, employing obfuscation and de-identification techniques such as masking and tokenization. Moreover, Cloud DLP can be instrumental in executing re-identification risk analyses, thereby enriching the comprehension of data privacy risks.

Re-identification risk analysis involves scrutinizing data to identify properties that could elevate the chances of subjects being recognized. For instance, consider a marketing dataset that includes demographic attributes like age, job title, and zip code. On the surface, these demographics may not appear identifying, but certain combinations of age, job title, and zip code might uniquely correspond to a small cohort of individuals or even a single individual, thereby heightening the risk of re-identifying that person.
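This kind of risk can be checked directly by counting how many records share each combination of quasi-identifiers. A minimal sketch in Python, using illustrative sample data (the field values are made up for the example):

```python
from collections import Counter

# Quasi-identifiers: (age, job_title, zip_code). Illustrative sample records.
rows = [
    (34, "engineer", "94103"),
    (34, "engineer", "94103"),
    (52, "teacher",  "10001"),  # unique combination: re-identification risk
    (29, "nurse",    "60601"),  # unique combination: re-identification risk
]

counts = Counter(rows)

# Records whose quasi-identifier combination appears only once (k = 1)
# could plausibly be traced back to a single individual.
risky = [row for row in rows if counts[row] == 1]
print(risky)
```

In real analyses this counting generalizes to k-anonymity style metrics: the smaller the group sharing a combination, the higher the re-identification risk.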

Various de-identification techniques and their use cases

Masking

This de-identification technique masks a string either fully or partially by replacing a given number of characters with a specified fixed character. Additionally, you can specify if you need to ignore any characters from masking.
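As a rough sketch of the idea, the function below loosely mirrors the knobs the technique describes: a masking character, the number of characters to mask, and characters to ignore. The function and parameter names are illustrative, not Cloud DLP's API:

```python
def mask(value: str, masking_char: str = "*",
         number_to_mask: int = 0, chars_to_ignore: str = "") -> str:
    """Mask `value` with `masking_char`.

    number_to_mask = 0 masks every character; a positive value masks only
    that many. Characters listed in `chars_to_ignore` pass through unmasked.
    """
    out, masked = [], 0
    for ch in value:
        if ch in chars_to_ignore:
            out.append(ch)                  # keep separators as-is
        elif number_to_mask and masked >= number_to_mask:
            out.append(ch)                  # quota reached, keep the rest
        else:
            out.append(masking_char)
            masked += 1
    return "".join(out)

print(mask("4111-1111-1111-1111", chars_to_ignore="-"))  # ****-****-****-****
print(mask("4111111111111111", number_to_mask=12))       # ************1111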

Use cases for data masking

Reducing breach impact: Data masking is a data security method used to anonymize an organization’s sensitive data. In the unfortunate event of a data leak or breach, this method ensures that the exposed data cannot be utilized to identify an individual or for illicit purposes such as fraudulent transactions.

Creating safe test data: Teams frequently require functional datasets for testing purposes. Masked data preserves the required integrity for testing without jeopardizing real user data. Automated data masking facilitates the rapid and secure provisioning of masked test data to non-production environments, enabling teams to fulfill this need efficiently.

Internal access control: Data masking serves as a crucial tool for anonymizing sensitive data, ensuring that individuals within an organization who lack the authorization to access specific information are unable to do so. In cases where a member of your organization requires access to data for legitimate purposes but lacks authorization to view, for instance, credit card details, masking the last four digits of the card number provides a protective measure, preventing exposure to those sensitive elements.

Speeding up development: Manual data cleanup is a tedious, time-intensive process that can impede production speed. Automated data masking seamlessly integrates with current systems, eliminating the need for manual interventions. This streamlines the development and testing processes, enabling teams to work with authentic data efficiently, without causing disruptions.

Protecting data in transit: Once data has been masked, it remains masked in transit and in the cloud. Should that data be breached, it cannot be used to identify individuals or for fraudulent purposes.

Protecting client privacy: Privacy is important to almost everyone, which is why organizations need to take measures to ensure confidential client data and personally identifiable information (PII) is protected, particularly in scenarios where an organization shares data with third parties.

Compliance with data privacy laws: Data privacy regulations, such as the GDPR, HIPAA, GLBA, and PIPEDA, aim to protect the security and confidentiality of Personally Identifiable Information (PII). Implementing data masking within the development life cycle removes identifiers, including the 18 identifiers listed in the HIPAA Safe Harbor method, before data is shared within an organization. This approach aligns with the Privacy by Design principle mandated by Article 25 of the GDPR.

Redaction

Redaction protects a value by removing it completely.

Replacement

Replacement substitutes each input value with a value specified by the user or with a default value.
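A toy illustration of both transformations over a simple record; the field names and the surrogate string are made up for the example:

```python
def redact(record: dict, field: str) -> dict:
    """Redaction: drop the sensitive field entirely."""
    return {k: v for k, v in record.items() if k != field}

def replace(record: dict, field: str, surrogate: str = "[REDACTED]") -> dict:
    """Replacement: substitute the value with a fixed surrogate."""
    return {**record, field: surrogate} if field in record else dict(record)

row = {"name": "Jane Doe", "ssn": "123-45-6789"}
print(redact(row, "ssn"))   # {'name': 'Jane Doe'}
print(replace(row, "ssn"))  # {'name': 'Jane Doe', 'ssn': '[REDACTED]'}
```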

Use cases for redaction/replacement

Financial services: Financial institutions grapple with the challenge of managing an overwhelming volume of confidential and sensitive customer data. To navigate this, they frequently extract pertinent information from the extensive datasets they handle. Examples of data redaction or replacement in financial services include concealing or substituting items like credit/debit card numbers, bank account numbers, mobile phone numbers, and more.

Pharmaceutical and life sciences: Healthcare institutions often spend a significant amount of time on patient-related paperwork. The ability to quickly redact or replace sensitive information frees up staff to focus on patient care.

Law enforcement: Agencies can employ data redaction or replacement to maintain databases, comply with criminal and victim identification requirements, and save valuable time.

Transportation: The transportation industry deals with extensive documentation, ranging from invoices to toll tax receipts and more. Data redaction or replacement expedites processes in this sector.

Government: Government entities store a wide array of sensitive data, necessitating the implementation of comprehensive safety protocols to prevent any data breaches. Data redaction and replacement represent essential components of these procedures, aiding in the safeguarding of sensitive information and ensuring compliance with audit requirements.

IT & operations: Even a minor breach can have severe repercussions, potentially grinding an entire organization’s operations to a standstill. Data redaction and replacement provide IT professionals with the right tools to effectively redact sensitive data, enhancing their workflow efficiency and bolstering security measures.

Pseudonymization with secure hash

This de-identification technique replaces an input value with a secure one-way hash generated using a data encryption key. Keyed cryptographic hashing creates a cryptographic hash of an input string using a secret key, as in HMAC, and is generally considered more secure than using a hash function alone. It is commonly used to de-identify unique identifiers that behave like primary keys in very large datasets.

An input value is replaced with a hash computed over it using Hash-based Message Authentication Code (HMAC) with Secure Hash Algorithm (SHA)-256 and a cryptographic key. The output of the transformation is always the same length and cannot be reversed to recover the original value.
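Python's standard library can sketch this directly. The key and identifier below are placeholders; in practice the key would come from a key-management service:

```python
import hmac
import hashlib

def pseudonymize(value: str, key: bytes) -> str:
    """Keyed one-way hash (HMAC-SHA-256). The same key and input always
    produce the same token, so joins across datasets still work, but the
    token cannot be reversed without the key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

key = b"example-secret-key"   # placeholder; use a managed key in practice
token = pseudonymize("patient-4711", key)
print(len(token))  # 64 hex characters, fixed length regardless of input
```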

Use cases for pseudonymization with secure hash

Patient record comparison: Because records are hashed with a secure keyed pseudonymization function, identical entries produce identical pseudonyms and can be matched across datasets, while re-identifying individual entries without access to the key remains extremely difficult.

Medical research institutions: Pseudonymization is ideally suited here to the task of detecting correlations and statistical patterns in symptoms and medications, without ever revealing the true value of the underlying symptom or medication.

Cybersecurity: Fraud detection systems, such as credit card transaction monitoring and dark web monitoring, access financial and personal information, so it is crucial to protect this data with pseudonymization. In many real-world applications, data is sanitized using cryptographic hashes before further processing. Pseudonymization is increasingly important both as a security technique and as a way to implement data minimization: it facilitates personal data processing while providing strong safeguards for data protection.

Pseudonymization by GDPR: GDPR mandates the use of appropriate technical and organizational measures to safeguard personal data, with pseudonymization often being the method of choice to maintain data utility.

Pseudonymization with format-preserving token

This technique replaces an input value with a “token,” or surrogate value, of the same character set and length using format-preserving encryption (FPE). Preserving the format helps ensure compatibility with legacy systems that have restricted schema or format requirements.
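As an illustration of the idea, here is a toy format-preserving transformation for a 16-digit credit card number: a keyed Feistel network over decimal digits, the general construction behind standards such as NIST FF1. This is a sketch only; production systems should rely on a vetted FPE implementation, such as Cloud DLP's built-in FPE transformation, rather than hand-rolled code.

```python
import hmac
import hashlib

def _round_value(key: bytes, rnd: int, data: str, width: int) -> int:
    """Pseudorandom round function derived from HMAC-SHA-256."""
    mac = hmac.new(key, f"{rnd}:{data}".encode(), hashlib.sha256).hexdigest()
    return int(mac, 16) % (10 ** width)

def fpe_encrypt(key: bytes, digits: str, rounds: int = 10) -> str:
    """Toy Feistel-based FPE over a decimal string (rounds must be even)."""
    half = len(digits) // 2
    left, right = digits[:half], digits[half:]
    for r in range(rounds):
        f = _round_value(key, r, right, len(left))
        left, right = right, str((int(left) + f) % (10 ** len(left))).zfill(len(left))
    return left + right

def fpe_decrypt(key: bytes, digits: str, rounds: int = 10) -> str:
    """Invert fpe_encrypt by running the Feistel rounds in reverse."""
    half = len(digits) // 2
    left, right = digits[:half], digits[half:]
    for r in reversed(range(rounds)):
        f = _round_value(key, r, left, len(right))
        left, right = str((int(right) - f) % (10 ** len(right))).zfill(len(right)), left
    return left + right

key = b"demo-key"  # placeholder; use a managed key in practice
token = fpe_encrypt(key, "4111111111111111")
print(token.isdigit(), len(token))                     # still 16 decimal digits
print(fpe_decrypt(key, token) == "4111111111111111")   # True
```

The token keeps the original character set and length, so schemas that expect a 16-digit numeric field continue to validate.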

Use cases for pseudonymization with format-preserving token

Payment processing: Within the realm of payment processing, safeguarding sensitive information like credit card numbers and bank account details is paramount. Pseudonymization is employed to obscure these financial data points. Format-preserving tokenization emerges as a valuable solution, as it replaces such information with tokens that retain the structure of the original data. This approach enables businesses to uphold payment workflows and maintain compatibility with their existing systems while enhancing data security.

Healthcare data: Healthcare organizations often deal with sensitive patient data, including medical record numbers, insurance IDs, and diagnosis codes. Format-preserving tokenization can help protect patient privacy while still enabling efficient data exchange and integration between various medical systems.

Identification numbers: Many industries use identification numbers, such as employee IDs, customer IDs, and membership IDs. Format-preserving tokenization can be applied to these numbers to maintain consistent identification while safeguarding individuals’ privacy.

Membership and loyalty programs: Companies that provide loyalty programs or membership services may find it necessary to pseudonymize customer IDs or account numbers. Through the utilization of format-preserving tokens, they can maintain the capacity to deliver personalized services while simultaneously mitigating the risk associated with potential customer data exposure.

Software testing and development: During software testing and development, actual production data is often used to simulate real-world scenarios. However, exposing sensitive customer information in these environments can be risky. Format-preserving tokenization allows developers to work with realistic data without compromising privacy.

Research and analytics: Researchers and analysts often require access to large datasets for analysis and research purposes. By pseudonymizing sensitive data using format-preserving tokens, organizations can share datasets without revealing the actual personal information, making it safer to collaborate and share data.

E-commerce and order processing: E-commerce platforms collect various types of personal data, such as shipping addresses and phone numbers. Format-preserving tokenization can help protect this information while ensuring seamless order processing and customer communication.

Legal and compliance: Legal documents and contracts might contain sensitive information that needs to be pseudonymized before being shared with third parties. Format-preserving tokenization can help comply with data protection regulations while retaining the original document structure.

Customer support and CRM: Customer support systems often store customer information for reference. By using format-preserving tokens, businesses can provide efficient support without exposing sensitive details.

Data migration and sharing: When sharing data between systems or migrating data to new platforms, format-preserving tokenization can help maintain data integrity and relationships without exposing actual PII.

While format-preserving tokenization can provide an additional layer of privacy and security, it’s essential to consider the overall data protection strategy, including encryption, access controls, and user consent, to ensure comprehensive data privacy and compliance with relevant regulations.

Generalization bucketing

This technique masks input values by replacing them with “buckets,” or ranges, within which the input value falls. For example, specific ages can be bucketed into age ranges, or distinct values into ranges like “low,” “medium,” or “high.”
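A minimal sketch of bucketing as a lookup over labeled ranges; the bucket boundaries and labels below are illustrative:

```python
def bucketize(value, buckets):
    """Replace a numeric value with the label of the bucket it falls in.

    `buckets` is a list of (lower_inclusive, upper_exclusive, label) tuples.
    Values outside every range fall through to a catch-all label.
    """
    for lo, hi, label in buckets:
        if lo <= value < hi:
            return label
    return "other"

AGE_BUCKETS = [(0, 20, "0-19"), (20, 40, "20-39"), (40, 65, "40-64"), (65, 130, "65+")]
print(bucketize(34, AGE_BUCKETS))  # 20-39
print(bucketize(70, AGE_BUCKETS))  # 65+
```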

Use cases for generalization bucketing

Note: The following use cases interpret “bucketing” more broadly, as the term is also used in machine learning and data science.

Data preprocessing for supervised learning: In machine learning, data preprocessing is a crucial step. Generalization bucketing could refer to a technique where continuous or numerical features are discretized into buckets or bins. This can help with generalization, especially when dealing with noisy data, by reducing the impact of outliers and making the model more robust.

Categorical variable aggregation: When working with categorical variables, especially when they have a large number of categories, it’s common to aggregate or group similar categories together. This can help prevent overfitting and improve generalization by reducing the dimensionality of the input space.

Feature engineering: Feature engineering involves creating new features from existing ones to improve model performance. Generalization bucketing might relate to a technique where certain features are transformed into more general or abstract representations that capture the underlying patterns in the data.

Ensemble methods: Ensemble methods combine the predictions of multiple models to achieve better performance. Generalization bucketing could be used in the context of creating diverse subsets of data or models to improve the ensemble’s ability to generalize to new, unseen data.

Data augmentation: In tasks like image classification, data augmentation involves applying transformations to the training data, such as rotations, flips, and translations, to increase the diversity of the training set and improve generalization. Generalization bucketing might refer to a systematic way of applying these transformations to different “buckets” of data.

Date shifting

This technique shifts dates by a random number of days per user or entity. This helps obfuscate actual dates while still preserving the sequence and duration of a series of events or transactions.

Date shifting is a form of data anonymization or pseudonymization: it protects the privacy and confidentiality of sensitive information by altering dates while maintaining the overall characteristics and patterns of the data. It is often employed when data must be shared or analyzed without revealing specific dates.
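One common construction, sketched below, derives a deterministic offset from a secret and the entity identifier, so every date for the same entity shifts by the same amount and intervals between that entity's events are preserved. The names and the hash-based derivation are illustrative, not Cloud DLP's internal scheme:

```python
import hashlib
from datetime import date, timedelta

def shift_date(d: date, entity_id: str, secret: bytes, max_days: int = 100) -> date:
    """Shift `d` by a deterministic per-entity offset in [-max_days, +max_days]."""
    digest = hashlib.sha256(secret + entity_id.encode()).digest()
    offset = int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days
    return d + timedelta(days=offset)

secret = b"example-secret"  # placeholder; keep this out of source control
admitted   = shift_date(date(2023, 3, 1),  "patient-42", secret)
discharged = shift_date(date(2023, 3, 11), "patient-42", secret)
print((discharged - admitted).days)  # 10: length of stay is preserved
```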

Time extraction

This technique extracts or preserves a portion of Date, Timestamp, and TimeOfDay values, such as keeping only the year of a birth date. More broadly, time extraction refers to identifying and extracting temporal information (dates, times, durations, and other time-related expressions) from data, and it has a wide range of use cases across various industries.
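In its narrower, value-level sense, time extraction can be sketched as keeping a single component of a timestamp; the part names below map to Python's `datetime` attributes:

```python
from datetime import datetime

def extract_part(ts: datetime, part: str):
    """Preserve only one component of a timestamp, e.g. 'year' or 'hour',
    discarding the rest of the date/time information."""
    return getattr(ts, part)

admission = datetime(2023, 5, 17, 14, 32, 5)
print(extract_part(admission, "year"))  # 2023
print(extract_part(admission, "hour"))  # 14
```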

Use cases for date shifting/time extraction

Medical research and healthcare data: In medical research, patient data often needs to be shared with researchers while protecting patient privacy. Date shifting/time extraction can be used to de-identify data, making it difficult to trace specific medical events back to individuals.

Financial data: Financial institutions may need to share transaction data for analysis purposes, but they must ensure that personal and transaction details remain private. Date shifting/time extraction can be applied to transaction dates to obscure the exact timing of transactions.

Human resources and payroll: Companies might need to provide HR or payroll data for analysis without revealing specific dates of employee actions like hiring, promotions, or terminations. Date shifting/time extraction can be used to protect employee privacy.

Epidemiological studies: In epidemiology, analyzing the spread of diseases requires access to patient data. Date shifting/time extraction can help prevent the identification of specific patients while still allowing researchers to observe trends and patterns.

Judicial data: Law enforcement agencies may need to share crime data for research purposes, but they need to safeguard sensitive information. Date shifting/time extraction can be applied to crime incident dates to protect the privacy of victims and suspects.

Educational research: Educational institutions might share student data for educational research purposes. Date shifting/time extraction can help protect student privacy while still allowing researchers to analyze academic trends.

Market research: Companies conducting market research might need to share data about consumer behavior while ensuring individual privacy. Date shifting/time extraction can be used to obscure the timing of purchases and actions.

Social sciences research: Sociological or psychological studies might require data about events like social interactions or emotional experiences. Date shifting/time extraction can help protect participants’ identities.

Transportation and logistics: When analyzing transportation routes, delivery schedules, or traffic patterns, date shifting/time extraction can be applied to protect sensitive information while maintaining the overall flow of the data.

Government statistics: Government agencies might need to share statistical data without revealing exact dates of events like census responses or economic indicators. Date shifting/time extraction can help protect citizen privacy.

About Mohita Agarwal

As a Data Engineer with around five years of experience, Mohita has been at the forefront of transforming data landscapes by designing and implementing robust data pipelines. Her expertise lies in orchestrating seamless migrations from on-prem systems to Google Cloud Platform (GCP), enabling organizations to harness the power of cloud-based data solutions. She has worked with SQL, Python, and cloud-native technologies, particularly excelling in leveraging Google Cloud tools such as BigQuery, Apache Airflow, Cloud Composer, Cloud Functions, Cloud SQL, GCS, and Pub/Sub.

About SADA

At SADA, we climb every mountain, clear every hurdle, and turn the improbable into possible — over and over again. Simply put, we propel your organization forward. It’s not enough to migrate to the cloud, it’s what you do once you’re there. Accelerating application development. Advancing productivity and collaboration. Using your data as a competitive edge. When it comes to Google Cloud, we’re not an add-on, we’re a must-have, driving the business performance of our clients with its power. Beyond our expertise and experience, what sets us apart is our people. It’s the spirit that carried us from scrappy origins as one of the Google Cloud launch partners to an award-winning global partner year after year. With a client list that spans healthcare, media and entertainment, retail, manufacturing, public sector and digital natives — we simply get the job done, every step of the way. Visit SADA.com to learn more.

If you’re interested in becoming a part of the SADA team, please visit our careers page.



