What Does Data Masking Mean?
Data Masking is the replacement of existing
sensitive information in test or development databases with
information that looks real but is of no use to anyone who might wish
to misuse it. In general, the users of the test, development or
training databases do not need to see the actual information as long
as what they are looking at looks real and is consistent.
Data Masking Techniques
- Substitution
- Shuffling
- Redaction / Null
- Number and Date Variance
- Blurring
- Masking Out Data
- Table Internal Synchronization
- Cross Schema Synchronization
- Selective Masking: Ability to Apply a WHERE Clause
- User Defined SQL Commands
Substitution
This technique consists of randomly replacing the contents of a column of data with information that looks similar but is completely unrelated to the real details. For example, the surnames in a customer database could be sanitized by replacing the real last names with surnames drawn from a largish random list.
Substitution is very effective in terms of preserving the look and feel of the existing data. The downside is that a largish store of substitutable information must be available for each column to be substituted. For example, to sanitize surnames by substitution, a list of random last names must be available. Then to sanitize telephone numbers, a list of phone numbers must be available. Frequently, the ability to generate known invalid data (credit card numbers that will pass the checksum tests but never work) is a nice-to-have feature.
Substitution data can sometimes be very hard to find in large quantities - however any data masking software should contain datasets of commonly required items. When evaluating data masking software, the size, scope and variety of the datasets should be considered. Another useful feature to look for is the ability to build your own custom datasets and add them for use in the masking rules.
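The substitution technique described above can be sketched in a few lines of Python. The surname dataset and the `substitute_surnames` helper are hypothetical stand-ins; a real masking tool would draw from much larger built-in datasets so that substitutions rarely repeat.

```python
import random

# Hypothetical stand-in for a masking tool's surname dataset.
SURNAME_DATASET = ["Wilson", "Patel", "Nakamura", "Okafor", "Silva", "Muller"]

def substitute_surnames(rows, rng=None):
    """Replace each row's surname with a random entry from the dataset,
    leaving all other columns untouched."""
    rng = rng or random.Random()
    return [{**row, "last_name": rng.choice(SURNAME_DATASET)} for row in rows]

customers = [
    {"first_name": "Robert", "last_name": "Smith"},
    {"first_name": "Jane", "last_name": "Doe"},
]
masked = substitute_surnames(customers, rng=random.Random(42))
```

Because the replacement values are real-looking surnames, the masked data keeps the look and feel of the original while carrying no trace of the true names.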
Shuffling
Shuffling is similar to substitution except that the substitution data is derived from the column itself. Essentially the data in a column is randomly moved between rows until there is no longer any reasonable correlation with the remaining information in the row.
There is a certain danger in the shuffling technique. It does not prevent people from asking questions like “I wonder if so-and-so is on the supplier list?” In other words, the original data is still present and meaningful questions can sometimes still be asked of it. Another consideration is the algorithm used to shuffle the data. If the shuffling method can be determined, then the data can easily be un-shuffled. For example, if the shuffle algorithm simply ran down the table swapping the column data between every pair of rows, it would not take much work for an interested party to revert things to their un-shuffled state.
Shuffling is rarely effective when used on small amounts of data. For example, if there are only 5 rows in a table it probably will not be too difficult to figure out which of the shuffled data really belongs to which row. On the other hand, if a column of numeric data is shuffled, the sum and average of the column still work out to the same amount. This can sometimes be useful.
Shuffle rules are best used on large tables and leave the look and feel of the data intact. They are fast, but great care must be taken to use a sophisticated algorithm to randomize the shuffling of the rows.
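A minimal sketch of a shuffle rule follows. It uses Python's `random.SystemRandom` (a cryptographic randomness source) rather than a simple deterministic swap, in line with the caution above that a predictable shuffling algorithm can be reversed. The row layout is an assumption for illustration.

```python
import random

def shuffle_column(rows, column, rng=None):
    """Randomly redistribute one column's values across all rows,
    leaving the other columns in place."""
    rng = rng or random.SystemRandom()  # unpredictable source; trivial swaps are reversible
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: v} for row, v in zip(rows, values)]

salaries = [{"id": i, "salary": s} for i, s in enumerate([50000, 62000, 71000, 48000, 90000])]
masked = shuffle_column(salaries, "salary")

# The same values remain in the column, so sums and averages are unchanged.
total_before = sum(r["salary"] for r in salaries)
total_after = sum(r["salary"] for r in masked)
```

Note how the aggregate properties mentioned above (sum, average) survive the shuffle, which is sometimes exactly what test users need.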
Redaction / Null
Data redaction is the destruction of sensitive data, such as any personally identifiable information (PII). PII can be used on its own or with other information to identify or locate a single person, or to identify an individual in context. Enabling redaction allows you to transform PII into a pattern that does not contain any identifiable information. For example, you could replace all Social Security numbers (SSN) like 123-45-6789 with an unintelligible pattern like XXX-XX-XXXX, or replace only part of the SSN (XXX-XX-6789).
Although encryption techniques are available to protect Hadoop data, the underlying problem with using encryption is that an admin who has complete access to the cluster also has access to unencrypted sensitive user data. Even users with appropriate ACLs on the data could have access to logs and queries where sensitive data might have leaked.
Data redaction provides compliance with industry regulations such as PCI and HIPAA, which require that access to PII be restricted to only those users whose jobs require such access. PII or other sensitive data must not be available through any other channels to users like cluster administrators or data analysts. However, if you already have permission to access PII through queries, the query results will not be redacted. Redaction applies only to incidental leaks of data: queries and query results must not show up in clear text in logs, configuration files, UIs, or other unprotected areas.
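The SSN examples above can be sketched with a simple regular expression. This is a minimal illustration, not any particular product's redaction engine; the pattern assumes the common NNN-NN-NNNN formatting.

```python
import re

# Matches SSNs formatted as NNN-NN-NNNN.
SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")

def redact_ssn(text, keep_last4=False):
    """Replace SSNs with XXX-XX-XXXX, or optionally keep the last four digits."""
    if keep_last4:
        return SSN_RE.sub(lambda m: f"XXX-XX-{m.group(3)}", text)
    return SSN_RE.sub("XXX-XX-XXXX", text)

log_line = "user lookup failed for ssn 123-45-6789"
fully_redacted = redact_ssn(log_line)
partly_redacted = redact_ssn(log_line, keep_last4=True)
```

Applying such a filter to log lines and query output is one way to catch the incidental leaks described above.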
Number and Date Variance
The Number Variance technique is useful on numeric or date data. Simply put, the algorithm involves modifying each number or date value in a column by some random percentage of its real value.
This technique has the nice advantage of providing a reasonable disguise for the data while still keeping the range and distribution of values in the column within existing limits. For example, a column of salary details might have a random variance of ±10% placed on it. Some values would be higher, some lower, but all would remain close to their original range. Date fields are also a good candidate for variance techniques. Birth dates, for example, could be varied within an arbitrary range of ±120 days, which effectively disguises the personally identifiable information while still preserving the distribution.
The variance technique can prevent attempts to identify true records using known date values, and it guards against the exposure of sensitive numeric or date data.
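A sketch of the two variance rules discussed above, the ±10% salary variance and the ±120-day birth-date variance, might look like this. The helper names are illustrative, not from any specific tool.

```python
import random
from datetime import date, timedelta

def vary_number(value, pct=0.10, rng=None):
    """Shift a numeric value by a random percentage in [-pct, +pct]."""
    rng = rng or random.Random()
    return value * (1 + rng.uniform(-pct, pct))

def vary_date(value, max_days=120, rng=None):
    """Shift a date by a random number of days in [-max_days, +max_days]."""
    rng = rng or random.Random()
    return value + timedelta(days=rng.randint(-max_days, max_days))

masked_salary = vary_number(55000, pct=0.10)          # stays within 49500..60500
masked_birth = vary_date(date(1980, 6, 15), max_days=120)
```

Because each value is perturbed by only a bounded amount, column-level statistics such as the overall range and rough distribution are preserved, which is the main selling point of the technique.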
Blurring
Alters the existing value randomly within a defined range.
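One common reading of blurring, assumed here, is that the offset is drawn from a fixed absolute range rather than a percentage of the value (as in the variance technique above). A minimal sketch under that assumption:

```python
import random

def blur(value, spread=500, rng=None):
    """Blurring sketch: add a random offset drawn from a fixed range [-spread, +spread]."""
    rng = rng or random.Random()
    return value + rng.uniform(-spread, spread)

blurred = blur(55000, spread=500)  # stays within 54500..55500
```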
Masking Out Data
The generic term for this process is data anonymization; it means replacing certain fields with a mask character (such as an X). This effectively disguises the data content while preserving the same formatting on front end screens and reports. For example, a column of credit card numbers might look like:
4346 6454 0020 5379
4493 9238 7315 5787
4297 8296 7496 8724
and after the masking operation the information would appear as:
4346 XXXX XXXX 5379
4493 XXXX XXXX 5787
4297 XXXX XXXX 8724
The masking characters effectively remove much of the sensitive content from the record while still preserving the look and feel. Take care to ensure that enough of the data is masked to preserve security. It would not be hard to regenerate the original credit card number from a masking operation such as: 4297 8296 7496 87XX since the numbers are generated with a specific and well known checksum algorithm. Also care must be taken not to mask out potentially required information.
A masking operation such as XXXX XXXX XXXX 5379 would strip the card issuer details from the credit card number. This may, or may not, be desirable.
If the data is in a specific, invariable format, then Masking Out is a powerful and fast option. If numerous special cases must be dealt with then masking can be slow, extremely complex to administer and can potentially leave some data items inappropriately masked.
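For a fixed-format field like a 16-digit card number, the masking-out rule reduces to a few lines. This sketch keeps the first four and last four digits, matching the examples above; `mask_card` is an illustrative helper, not a library function.

```python
def mask_card(number, mask_char="X"):
    """Mask the middle eight digits of a 16-digit card number,
    preserving the four-digit grouping."""
    digits = number.replace(" ", "")
    masked = digits[:4] + mask_char * 8 + digits[12:]
    return " ".join(masked[i:i + 4] for i in range(0, 16, 4))

masked = mask_card("4346 6454 0020 5379")
```

As noted above, how much to keep is a judgment call: retaining the first four digits preserves the card-issuer prefix, while retaining too many digits makes the full number recoverable via the checksum algorithm.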
Table Internal Synchronization
Sometimes the same data appears in multiple rows within the same table. For example, the name Robert Smith might appear in the FIRST_NAME and LAST_NAME columns of multiple rows.
In other words, some of the data items are de-normalized because of repetitions in multiple rows. If the name Robert Smith changes to Albert Wilson after masking, then the same Robert Smith referenced in other rows must also change to Albert Wilson in a consistent manner. This requirement is necessary to preserve the relationships between the data rows and is called Table-Internal Synchronization.
A Table-Internal Synchronization operation will update columns in groups of rows within a table to contain identical values. This means that every occurrence of Robert Smith in the table will contain Albert Wilson. Good data anonymization software should provide support for this requirement.
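Table-internal synchronization can be implemented by memoizing substitutions: the first time an original value tuple is seen, a replacement is chosen, and every later occurrence reuses it. The data layout and name lists below are illustrative assumptions.

```python
import random

def synchronized_substitute(rows, columns, pick, rng=None):
    """Substitute values consistently: the same original tuple always maps
    to the same masked tuple (table-internal synchronization)."""
    rng = rng or random.Random()
    mapping = {}
    out = []
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key not in mapping:
            mapping[key] = pick(rng)          # choose a replacement once per distinct tuple
        out.append({**row, **dict(zip(columns, mapping[key]))})
    return out

FIRST = ["Albert", "Maria", "Chen"]
LAST = ["Wilson", "Garcia", "Okafor"]
pick_name = lambda rng: (rng.choice(FIRST), rng.choice(LAST))

people = [
    {"id": 1, "FIRST_NAME": "Robert", "LAST_NAME": "Smith"},
    {"id": 2, "FIRST_NAME": "Jane",   "LAST_NAME": "Doe"},
    {"id": 3, "FIRST_NAME": "Robert", "LAST_NAME": "Smith"},
]
masked = synchronized_substitute(people, ["FIRST_NAME", "LAST_NAME"], pick_name)
```

Rows 1 and 3 still agree after masking, so any logic that depends on the repeated name continues to work.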
Cross Schema Synchronization
Many databases contain de-normalized data in tables which are located in multiple schemas. If this data is related, a Table-To-Table Synchronization operation may be required after the data masking operations have concluded. The analysis phase conducted before the construction of the masking routines should pay attention to this requirement and the masking software should be able to support it if required.
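One way to sketch a table-to-table synchronization step is with two SQLite databases standing in for two schemas: after the main table is masked, the de-normalized copy is updated from it by key. The schema and table names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, last_name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Wilson')")  # already masked

# A second in-memory database plays the role of a second schema.
conn.execute("ATTACH DATABASE ':memory:' AS billing")
conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, last_name TEXT)")
conn.execute("INSERT INTO billing.invoices VALUES (1, 'Smith')")  # stale copy

# Table-to-table synchronization: propagate the masked value across schemas by key.
conn.execute("""
    UPDATE billing.invoices
    SET last_name = (SELECT c.last_name FROM customers c WHERE c.id = customer_id)
""")
synced = conn.execute("SELECT last_name FROM billing.invoices").fetchone()[0]
```

The analysis phase mentioned above is what identifies which cross-schema copies need an update statement like this one.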
Selective Masking: Ability to Apply a WHERE Clause
It is essential to be able to use specific criteria to choose the rows on which the masking operations are performed. In effect, this means that it must be possible to apply a Where Clause to a set of data and have the masking operations apply only to that subset of the table.
As an example of a Where Clause requirement, consider a masking operation on a column containing first names. These names are gender specific and the end users of the database may well require Male and Female names to be present in the appropriate rows after the masking operations complete. Two rules, each with a Where Clause based on the gender column will be required here. There is a potential trap here – note the discussion entitled Where Clause Skips in the Data Masking Issues section of this document.
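The gender-specific example above can be sketched as two rules, each restricted by its own WHERE clause. The schema and name lists are illustrative, and SQLite stands in for the target database.

```python
import sqlite3
import random

MALE = ["Albert", "Brian", "Carlos"]
FEMALE = ["Alice", "Beth", "Carla"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, first_name TEXT, gender TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)",
                 [(1, "Robert", "M"), (2, "Jane", "F"), (3, "Tom", "M")])

rng = random.Random(0)
# One masking rule per gender, each scoped by a WHERE clause on the gender column.
for gender, names in (("M", MALE), ("F", FEMALE)):
    ids = conn.execute("SELECT id FROM people WHERE gender = ?", (gender,)).fetchall()
    for (row_id,) in ids:
        conn.execute("UPDATE people SET first_name = ? WHERE id = ?",
                     (rng.choice(names), row_id))

rows = conn.execute("SELECT first_name, gender FROM people ORDER BY id").fetchall()
```

After both rules run, male rows carry male names and female rows carry female names, which is exactly the consistency end users expect.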
User Defined SQL Commands
It is very helpful, essential in many circumstances, to be able to run user defined SQL statements. Such statements could be used, for example, to create an index which speeds up the operation of other rules or to assist with a complex synchronization operation.
Note that many databases use different internal mechanisms for the creation and execution of a block of statements (i.e. a procedure) than they do for simple SQL statements. It is usually important that the solution chosen be able to support both block SQL constructs as well as simple insert, update and delete statements.
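Python's sqlite3 module illustrates the single-statement versus statement-block distinction described above: `execute` handles one statement at a time, while `executescript` runs a block of several. The schema here is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, last_name TEXT)")

# A simple user-defined statement, e.g. an index to speed up later masking rules.
conn.execute("CREATE INDEX idx_last_name ON customers (last_name)")

# A block of statements needs the separate block-execution mechanism.
conn.executescript("""
    INSERT INTO customers VALUES (1, 'Smith');
    UPDATE customers SET last_name = 'Wilson' WHERE id = 1;
    DELETE FROM customers WHERE last_name = 'Doe';
""")
result = conn.execute("SELECT last_name FROM customers").fetchone()[0]
```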
