GDPR CSV Data Masker

What is a GDPR CSV Data Masker?

This tool scans CSV files for personally identifiable information (PII) — emails, phone numbers, names, SSNs, IP addresses, credit card numbers, and dates of birth — and replaces them with safe placeholder values. It is designed for developers and data engineers who need to share datasets, write tests, or populate staging environments without exposing real personal data that falls under GDPR, CCPA, HIPAA, or similar data protection regulations. All processing is done entirely in your browser; no data is ever uploaded to a server.
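
The scanning step can be pictured as matching the values in each column against patterns. The sketch below is an illustration only, not this tool's actual detection logic: the three patterns, the `detect_pii_columns` name, and the 80% threshold are all assumptions chosen for the example.

```python
import csv
import io
import re

# Hypothetical, deliberately simple patterns; real detectors need many more cases.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "ipv4": re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$"),
}


def detect_pii_columns(csv_text: str, threshold: float = 0.8) -> dict[str, str]:
    """Return {column name: PII type} for columns whose non-empty values mostly match a pattern."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    found: dict[str, str] = {}
    for column in reader.fieldnames or []:
        values = [row[column] for row in rows if row[column]]
        if not values:
            continue
        for kind, pattern in PATTERNS.items():
            hits = sum(1 for value in values if pattern.match(value.strip()))
            if hits / len(values) >= threshold:
                found[column] = kind
                break
    return found


sample = (
    "id,contact,last_login_ip\n"
    "1,alice@example.com,192.0.2.10\n"
    "2,bob@example.org,192.0.2.11\n"
)
print(detect_pii_columns(sample))  # {'contact': 'email', 'last_login_ip': 'ipv4'}
```

Matching on values rather than header names means a column called "contact" is still caught, at the cost of occasional false positives on columns that merely look like PII.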

Mask styles explained

Redact replaces the entire value with [REDACTED], making it completely uninformative. Partial keeps the first character and masks the rest (e.g. a***@e*** for emails), which is useful when debugging requires recognising which record is which. Fake substitutes a plausible but fictional value — a random email address, phone number, or name — so the dataset still looks realistic for UI screenshots or demo environments. Hash (MD5) replaces each value with the first 12 characters of its MD5 digest, which is deterministic: the same input always produces the same output, making it useful for joining anonymised tables across exports.
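
The four styles can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's own code: the function names are hypothetical, the partial mask always uses three asterisks, and the fake generator only covers emails (using the reserved example.com domain).

```python
import hashlib
import random


def redact(value: str) -> str:
    """Redact: replace the whole value with a fixed placeholder."""
    return "[REDACTED]"


def partial(value: str) -> str:
    """Partial: keep the first character of each part and mask the rest."""
    def keep_first(part: str) -> str:
        return part[:1] + "***" if part else part

    if "@" in value:
        local, _, domain = value.partition("@")
        return f"{keep_first(local)}@{keep_first(domain)}"
    return keep_first(value)


def fake_email(rng: random.Random) -> str:
    """Fake: generate a plausible but fictional address (assumed format)."""
    return f"user{rng.randrange(10_000):04d}@example.com"


def hash_md5(value: str) -> str:
    """Hash: first 12 hex characters of the MD5 digest; deterministic."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:12]


email = "alice@example.com"
print(redact(email))    # [REDACTED]
print(partial(email))   # a***@e***
print(fake_email(random.Random()))
print(hash_md5(email) == hash_md5(email))  # True
```

Only the hash style is a pure function of its input; redact and fake discard the original value, so two occurrences of the same email cannot be linked afterwards.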

Frequently Asked Questions

What does GDPR say about personal data in test environments?

Article 5(1)(b) of GDPR requires that personal data be collected for specified, explicit, and legitimate purposes and not processed in a manner incompatible with those purposes. Using production personal data in development or staging environments is usually hard to justify as compatible with the original collection purpose. The EDPB (European Data Protection Board) explicitly recommends pseudonymisation or anonymisation before using data in non-production systems, and GDPR itself names pseudonymisation as a safeguard in Articles 25 and 32. Breaching the Article 5 principles can result in fines of up to €20 million or 4% of total worldwide annual turnover, whichever is higher, under Article 83(5).

What types of data count as PII under GDPR?

GDPR defines personal data broadly as any information relating to an identified or identifiable natural person. This includes obvious fields like name, email, phone number, home address, and date of birth, but also less obvious identifiers such as IP addresses, cookie IDs, device fingerprints, location data, and any combination of attributes that could single out an individual. This tool currently detects the most commonly exported PII types: email, phone, name, SSN/national ID, IP address, credit card number, and date of birth.

What is data masking and how does it differ from encryption?

Data masking replaces real values with placeholders or with structurally similar but fictional values. Unlike encryption, masking has no key that recovers the original value, which makes it suitable for sharing datasets outside secure environments. Encryption protects data in transit and at rest, but anyone holding the correct key can recover the original, so encrypted data is still considered personal data under GDPR. Masking, when done correctly and applied to every column that could identify someone, can de-identify a dataset to the point where it falls outside GDPR's scope.

What is the difference between anonymisation and pseudonymisation?

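Anonymisation irreversibly removes the ability to identify an individual; truly anonymised data falls outside GDPR's scope (Recital 26). Pseudonymisation replaces direct identifiers with artificial ones (like a hash or UUID), but the data can still be attributed to a person using additional information, such as a mapping table that exists somewhere. GDPR defines pseudonymisation in Article 4(5) and mentions it as a recommended safeguard (Recital 28); it reduces risk, but pseudonymised data is still considered personal data. The "hash" mask style in this tool produces pseudonymous output: the digest is unsalted, so anyone who can guess a candidate email or phone number can hash it and check for a match, even when the original data is not stored alongside the masked file. The "redact" and "fake" styles remove the original value from the masked fields entirely, but the dataset as a whole is only anonymised if the remaining columns cannot be combined to single out an individual.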
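
A short sketch shows why a hashed value is best treated as a pseudonym rather than as anonymous data. It assumes the 12-character truncated MD5 described for the hash style; the `hash_md5` helper and the candidate list are hypothetical. Anyone holding a list of plausible values can recompute the tokens and look up a match.

```python
import hashlib


def hash_md5(value: str) -> str:
    """First 12 hex characters of the MD5 digest (assumed to mirror the hash style)."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:12]


# A token as it might appear in a masked export.
masked_token = hash_md5("alice@example.com")

# Hypothetical list of addresses an outside party already knows or can guess.
candidates = ["bob@example.com", "alice@example.com", "carol@example.com"]

# Recompute the token for every candidate and look up the masked one.
lookup = {hash_md5(candidate): candidate for candidate in candidates}
print(lookup.get(masked_token))  # alice@example.com
```

Values drawn from a small or guessable space, such as phone numbers, dates of birth, and IP addresses, are the easiest to recover this way.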

When should I use hash masking instead of redact or fake?

Use hash masking when you need referential integrity across tables — for example, if a users.csv and an orders.csv both contain the same email address, hashing will produce the same 12-character string in both files, so joins still work in the anonymised dataset. Use redact when you want maximum protection and do not need to recognise individual records at all. Use fake data when the masked dataset will be shown in UI screenshots, demos, or fed into a system that validates format (e.g. an email field that rejects [REDACTED]).
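
The join scenario above can be sketched as follows. The file contents, column names, and the `mask_column` helper are made up for illustration, and `hash_md5` is assumed to mirror the 12-character truncated MD5 of the hash style.

```python
import csv
import hashlib
import io


def hash_md5(value: str) -> str:
    """First 12 hex characters of the MD5 digest."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:12]


def mask_column(csv_text: str, column: str) -> list[dict[str, str]]:
    """Parse CSV text and replace one column with its hashed token."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row[column] = hash_md5(row[column])
        rows.append(row)
    return rows


users_csv = "user_id,email\n1,alice@example.com\n2,bob@example.com\n"
orders_csv = (
    "order_id,email,total\n"
    "100,alice@example.com,19.99\n"
    "101,alice@example.com,5.00\n"
    "102,bob@example.com,42.50\n"
)

users = mask_column(users_csv, "email")
orders = mask_column(orders_csv, "email")

# Join orders to users on the hashed email; no real address is needed.
user_by_token = {user["email"]: user["user_id"] for user in users}
joined = [(order["order_id"], user_by_token[order["email"]]) for order in orders]
print(joined)  # [('100', '1'), ('101', '1'), ('102', '2')]
```

The join works only because both files were masked with the same deterministic function; redacted or randomly faked emails would not line up across the two exports.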