Substitution
This technique consists of randomly replacing the contents of a column of data with information that looks similar but is completely unrelated to the real details. For example, the surnames in a customer database could be sanitized by replacing the real last names with surnames drawn at random from a large list.
Substitution is very effective in terms of preserving the look and feel of the existing data. The downside is that a large store of substitutable information must be available for each column to be substituted. For example, to sanitize surnames by substitution, a list of random last names must be available; to sanitize telephone numbers, a list of phone numbers must be available. The ability to generate known invalid data (for example, credit card numbers that pass the checksum tests but will never work) is frequently a useful feature.
Substitution data can sometimes be very hard to find in large quantities; however, any data masking software should contain datasets of commonly required items. When evaluating data masking software, the size, scope and variety of the datasets should be considered. Another useful feature to look for is the ability to build custom datasets and add them for use in the masking rules.
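As a minimal sketch of the technique in Python: the dataset below is a small illustrative sample rather than the large store a real masking tool would ship with, and the row structure and column name are assumptions for the example.

    import random

    # Illustrative replacement dataset; a real tool would provide a much larger one.
    SURNAME_DATASET = ["Baker", "Chandler", "Fletcher", "Mercer", "Porter", "Sawyer"]

    def substitute_surnames(rows, seed=None):
        """Replace each row's 'last_name' with a random entry from the dataset."""
        rng = random.Random(seed)
        for row in rows:
            row["last_name"] = rng.choice(SURNAME_DATASET)
        return rows

    customers = [{"first_name": "Alice", "last_name": "Smith"},
                 {"first_name": "Bob", "last_name": "Jones"}]
    print(substitute_surnames(customers, seed=42))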
Shuffling
Shuffling is similar to substitution except that the substitution data is derived from the column itself. Essentially, the data in a column is randomly moved between rows until there is no longer any meaningful correlation with the remaining information in each row.
There is a certain danger in the shuffling technique: it does not prevent people from asking questions like "I wonder if so-and-so is on the supplier list?" In other words, the original data is still present, and meaningful questions can sometimes still be asked of it. Another consideration is the algorithm used to shuffle the data. If the shuffling method can be determined, then the data can easily be un-shuffled. For example, if the shuffle algorithm simply ran down the table swapping the column data between each pair of rows, it would not take much work for an interested party to revert things to their un-shuffled state.
Shuffling is rarely effective when used on small amounts of data. For example, if there are only 5 rows in a table, it probably will not be too difficult to figure out which of the shuffled data really belongs to which row. On the other hand, if a column of numeric data is shuffled, the sum and average of the column still work out to the same amounts, which can sometimes be useful.
Shuffle rules are best used on large tables and leave the look and feel of the data intact. They are fast, but great care must be taken to use a sophisticated algorithm to randomize the shuffling of the rows.
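To illustrate the difference, the Python sketch below shuffles one column with a Fisher-Yates shuffle driven by the operating system's entropy source, rather than a predictable pairwise swap; the table and column names are illustrative.

    import random

    def shuffle_column(rows, column):
        """Randomly redistribute one column's values among the rows using a
        Fisher-Yates shuffle seeded from OS entropy, so the permutation
        cannot easily be reproduced and reversed."""
        values = [row[column] for row in rows]
        random.SystemRandom().shuffle(values)  # unpredictable, unlike a fixed swap
        for row, value in zip(rows, values):
            row[column] = value
        return rows

    suppliers = [{"id": i, "name": n}
                 for i, n in enumerate(["Acme", "Globex", "Initech", "Umbrella"])]
    shuffle_column(suppliers, "name")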
Redaction / Null
Data redaction is the destruction of sensitive data, such as any personally identifiable information (PII). PII can be used on its own or with other information to identify or locate a single person, or to identify an individual in context. Enabling redaction allows you to transform PII into a pattern that does not contain any identifiable information. For example, you could replace all Social Security numbers (SSNs) like 123-45-6789 with an unintelligible pattern like XXX-XX-XXXX, or replace only part of the SSN (XXX-XX-6789).
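A minimal sketch of both full and partial SSN redaction in Python, using a regular expression (the pattern and sample text are illustrative):

    import re

    SSN_PATTERN = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")

    def redact_full(text):
        """Replace every SSN with an unintelligible fixed pattern."""
        return SSN_PATTERN.sub("XXX-XX-XXXX", text)

    def redact_partial(text):
        """Mask only the first two groups, keeping the last four digits."""
        return SSN_PATTERN.sub(r"XXX-XX-\3", text)

    print(redact_full("SSN on file: 123-45-6789"))     # SSN on file: XXX-XX-XXXX
    print(redact_partial("SSN on file: 123-45-6789"))  # SSN on file: XXX-XX-6789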
Although encryption techniques are available to protect Hadoop data, the underlying problem with relying on encryption is that an administrator who has complete access to the cluster also has access to unencrypted sensitive user data. Even users with appropriate ACLs on the data could have access to logs and queries into which sensitive data might have leaked.
Data redaction supports compliance with industry regulations such as PCI and HIPAA, which require that access to PII be restricted to only those users whose jobs require such access. PII or other sensitive data must not be available through any other channels to users such as cluster administrators or data analysts. However, if a user already has permission to access PII through queries, the query results will not be redacted; redaction applies only to incidental leaks of data. Queries and query results must not show up in clear text in logs, configuration files, UIs, or other unprotected areas.
Number and Date Variance
The Number Variance technique is useful on numeric or date data. Simply put, the algorithm involves modifying each number or date value in a column by some random percentage of its real value.
This technique has the nice advantage of providing a reasonable disguise for the data while still keeping the range and distribution of values in the column within existing limits. For example, a column of salary details might have a random variance of ±10% placed on it. Some values would be higher, some lower, but all would remain close to their original range. Date fields are also a good candidate for variance techniques. Birth dates, for example, could be varied within an arbitrary range of ±120 days, which effectively disguises the personally identifiable information while still preserving the distribution.
The variance technique can prevent attempts to discover true records using known date values, and can prevent the exposure of sensitive numeric or date data.
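A sketch of both variance rules in Python, using the ±10% and ±120 day figures from the examples above (the sample values are illustrative):

    import random
    from datetime import date, timedelta

    rng = random.Random()

    def vary_number(value, pct=0.10):
        """Shift a numeric value by a random percentage of itself, here +/-10%."""
        return value * (1 + rng.uniform(-pct, pct))

    def vary_date(value, days=120):
        """Shift a date by a random number of days, here within +/-120 days."""
        return value + timedelta(days=rng.randint(-days, days))

    print(vary_number(55000))            # a salary near, but not equal to, 55000
    print(vary_date(date(1980, 6, 15)))  # a birth date within ~4 months of the original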
Blurring
Blurring alters the existing value randomly within a defined range.
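Blurring is close in spirit to the variance rule; the short Python sketch below assumes the defined range is an absolute window around the original value rather than a percentage of it.

    import random

    def blur(value, spread=100):
        """Replace a value with a random value from a fixed window around it,
        here +/-100 in absolute terms rather than a percentage."""
        return value + random.uniform(-spread, spread)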
Masking Out Data
The generic term for this process is data anonymization: certain fields are replaced with a mask character (such as an X). This effectively disguises the data content while preserving the same formatting on front-end screens and reports. For example, a column of credit card numbers might look like:
4346 6454 0020 5379
4493 9238 7315 5787
4297 8296 7496 8724
and after the masking operation the information would appear as:
4346 XXXX XXXX 5379
4493 XXXX XXXX 5787
4297 XXXX XXXX 8724
The masking characters effectively remove much of the sensitive content from the record while still preserving the look and feel. Take care to ensure that enough of the data is masked to preserve security. It would not be hard to regenerate the original credit card number from a masking operation such as 4297 8296 7496 87XX, since the numbers are generated with a specific and well-known checksum algorithm. Also, care must be taken not to mask out potentially required information.
A masking operation such as XXXX XXXX XXXX 5379 would strip the card issuer details from the credit card number. This may, or may not, be desirable.
If the data is in a specific, invariable format, then Masking Out is a powerful and fast option. If numerous special cases must be dealt with then masking can be slow, extremely complex to administer and can potentially leave some data items inappropriately masked.
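A minimal sketch of masking out in Python, keeping the first and last groups of a card number as in the example above; the fixed four-group, space-separated format is an assumption of this sketch.

    def mask_card(number):
        """Mask the middle eight digits of a 16-digit card number written as
        four space-separated groups, preserving the original format."""
        groups = number.split(" ")
        return " ".join([groups[0], "XXXX", "XXXX", groups[3]])

    for card in ["4346 6454 0020 5379", "4493 9238 7315 5787"]:
        print(mask_card(card))  # 4346 XXXX XXXX 5379, 4493 XXXX XXXX 5787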
Table Internal Synchronization
Sometimes the same data appears in multiple rows within the same table. For example, the name Robert Smith might appear in the FIRST_NAME and LAST_NAME columns of multiple rows.
In other words, some of the data items are de-normalized because of repetitions in multiple rows. If the name Robert Smith changes to Albert Wilson after masking, then the same Robert Smith referenced in other rows must also change to Albert Wilson in a consistent manner. This requirement is necessary to preserve the relationships between the data rows and is called Table-Internal Synchronization.
A Table-Internal Synchronization operation will update columns in groups of rows within a table to contain identical values. This means that every occurrence of Robert Smith in the table will become Albert Wilson. Good data anonymization software should provide support for this requirement.
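One common way to meet this requirement is a consistent lookup: the first time a value is seen it is masked, and the result is cached so every later occurrence maps to the same replacement. A minimal Python sketch; the function name, dataset and row structure are illustrative.

    import random

    REPLACEMENTS = [("Albert", "Wilson"), ("Clara", "Hughes"), ("David", "Mercer")]
    _mapping = {}
    _rng = random.Random()

    def mask_name_consistently(first, last):
        """Return the same masked name for every occurrence of the same
        original name, so de-normalized rows stay synchronized."""
        key = (first, last)
        if key not in _mapping:
            _mapping[key] = _rng.choice(REPLACEMENTS)
        return _mapping[key]

    print(mask_name_consistently("Robert", "Smith"))  # e.g. ('Albert', 'Wilson')
    print(mask_name_consistently("Robert", "Smith"))  # same result every time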
Cross Schema Synchronization
Many databases contain de-normalized data in tables which are located in multiple schemas. If this data is related, a Table-To-Table Synchronization operation may be required after the data masking operations have concluded. The analysis phase conducted before the construction of the masking routines should pay attention to this requirement and the masking software should be able to support it if required.
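The same consistent-lookup idea can extend across schemas: as long as all the related tables are masked through one shared mapping, their de-normalized values stay synchronized. A sketch, reusing the illustrative mask_name_consistently function from the Table-Internal Synchronization example, with hypothetical table names:

    # Both schemas are masked through the same shared mapping, so the
    # de-normalized name stays identical in hr.employees and sales.contacts.
    hr_employees = [{"first": "Robert", "last": "Smith"}]
    sales_contacts = [{"first": "Robert", "last": "Smith"}]

    for table in (hr_employees, sales_contacts):
        for row in table:
            row["first"], row["last"] = mask_name_consistently(row["first"], row["last"])

    assert hr_employees[0] == sales_contacts[0]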
Selective Masking: Ability to Apply a WHERE Clause
It is essential to be able to use specific criteria to choose the rows on which the masking operations are performed. In effect, this means that it must be possible to apply a Where Clause to a set of data and have the masking operations apply only to that subset of the table.
As an example of a Where Clause requirement, consider a masking operation on a column containing first names. These names are gender specific, and the end users of the database may well require male and female names to be present in the appropriate rows after the masking operations complete. Two rules, each with a Where Clause based on the gender column, will be required here, as sketched below. There is a potential trap here: note the discussion entitled Where Clause Skips in the Data Masking Issues section of this document.
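A sketch of the two gender-specific rules in Python, expressing each Where Clause as a simple row predicate; the name datasets and column names are illustrative.

    import random

    MALE_NAMES = ["James", "Oliver", "Samuel"]
    FEMALE_NAMES = ["Emma", "Olivia", "Sophia"]
    rng = random.Random()

    def apply_rule(rows, where, dataset):
        """Substitute first names only on the rows matched by the predicate,
        the equivalent of a WHERE clause on the masking rule."""
        for row in rows:
            if where(row):
                row["first_name"] = rng.choice(dataset)

    people = [{"first_name": "Robert", "gender": "M"},
              {"first_name": "Alice", "gender": "F"}]
    apply_rule(people, lambda r: r["gender"] == "M", MALE_NAMES)    # rule 1: WHERE gender = 'M'
    apply_rule(people, lambda r: r["gender"] == "F", FEMALE_NAMES)  # rule 2: WHERE gender = 'F'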
User Defined SQL Commands
It is very helpful, and in many circumstances essential, to be able to run user-defined SQL statements. Such statements could be used, for example, to create an index that speeds up the operation of other rules or to assist with a complex synchronization operation.
Note that many databases use different internal mechanisms for the creation and execution of a block of statements (i.e. a procedure) than they do for simple SQL statements. It is usually important that the chosen solution support both block SQL constructs and simple insert, update and delete statements.
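SQLite's Python driver illustrates the distinction: execute() handles a single simple statement, while executescript() is needed for a block of statements. A minimal sketch; the table and index names are illustrative.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, last_name TEXT)")

    # Simple, single SQL statement: create an index to speed up later rules.
    conn.execute("CREATE INDEX idx_last_name ON customers (last_name)")

    # Block of statements (a script), which needs the separate block mechanism.
    conn.executescript("""
        INSERT INTO customers VALUES (1, 'Smith');
        UPDATE customers SET last_name = 'Wilson' WHERE id = 1;
    """)
    conn.commit()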