Database performance is heavily influenced by data volume and cardinality (that is, how unique the data is).
It’s interesting to watch the look on someone’s face when this concept clicks. There is the look of understanding from a developer when she can extrapolate why SQL that ran in seconds against a small, local, dockerized database takes minutes in a much larger QA environment. Then there is the look of concern from a manager, the “Uh … this may not be good,” when I explain that the performance load testing database doesn’t come close to production in either volume or cardinality.
This problem started with a discussion with a manager about exactly that kind of data mismatch. When I compared our performance testing environment to production, I noticed significant differences in both volume and cardinality. The variance was significant enough that I questioned whether the results of our performance tests were valid. Sure, the new release passed testing … but did it really?
It turns out the root of my problem was Personally Identifiable Information (PII). This is data that could directly link to a person on its own (e.g., a Social Security number) or in combination with less direct data (e.g., street address and ZIP code). In my case, the production database held significantly more data with higher cardinality, while the performance load testing database had less volume and a lot more duplication. To add insult to injury, some of the most critical tables used by the application were the tables that looked the least like production.
A decision was made early in the project not to refresh a lower-level environment with production-like data, because there was no easy way to sanitize PII after it was pumped into that environment. The artificially generated PII initially seeded into the performance load testing database did look like production at the time, but a variance between the two crept in as time passed and the data volume and variety in production increased.
I needed to solve that original problem, but I wasn’t sure how. I knew there were tools you could purchase for this exact issue, but I wondered if there was a homegrown solution.
I thought a new process built around a Jenkins pipeline could take one of a few approaches.
Programmatic Data Obfuscation
I could team up with a developer to write code that reads a string, rearranges those same characters randomly, and then updates the row with the newly generated value.
| Pros | Cons |
|---|---|
| Physically alters data. | Cardinality could artificially become more unique. |
| | Can’t alter en masse; the process changes data one row at a time. |
| | Small values could randomly rearrange to the same value. |
| | Lots of logic and coding involved to handle special characters (discovered in testing). |
Db2 Data Masking
I could utilize the data masking feature included with Db2’s Row and Column Access Control (RCAC). This would not physically alter data but would display a predetermined masked result based on a user’s access level. For example, a manager would see the complete value of a client’s social security number (123-45-6789) while customer service sees only the last four digits of that same social security number (XXX-XX-6789). If you are interested in more detail on data masking, you can reference the Row and Column Access Control (Column Masking) blog I wrote describing this process and how to implement it.
| Pros | Cons |
|---|---|
| No performance impact: the optimizer uses the actual data to make data access decisions but displays masked data to the end user. | Data is not physically altered and could be accessed with hacked credentials. |
| Quick to implement; not a long-running process. | |
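As a sketch of what that RCAC setup can look like, a column mask on a Social Security number might be defined as below. The table, column, and group names here are hypothetical, not the ones from our environment:

```sql
-- Hypothetical example: members of the MANAGER group see the full SSN,
-- everyone else sees only the last four digits.
CREATE MASK SSN_MASK ON CLIENT
   FOR COLUMN SSN RETURN
      CASE WHEN VERIFY_GROUP_FOR_USER(SESSION_USER, 'MANAGER') = 1
           THEN SSN
           ELSE 'XXX-XX-' || SUBSTR(SSN, 8, 4)
      END
   ENABLE;

-- The mask only takes effect once column access control is activated.
ALTER TABLE CLIENT ACTIVATE COLUMN ACCESS CONTROL;
```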
Db2 One-Way Hashing
My teammate, Brian Cocherell, had the creative idea of using the native Db2 hashing function as a possible solution to this problem. The function would physically alter PII data with a hashed result while maintaining the cardinality of that column within the table. What made this option attractive is that a unique hashed result would be generated each time the function was called, but the same hashed result would be repeated for any source data already duplicated within the table. For example, a FIRST_NAME of “Michael” generates the same hashed value of “ABCD1234” every time it runs through the function (many times, but not every time, as we discovered in testing).
| Pros | Cons |
|---|---|
| Data is physically altered. | Not as quick or simple to implement as data masking. |
| | Small values could hash to the same value (situational; discovered in testing). |
| | Can’t alter en masse; the process changes data one row at a time. |
There were two things to keep in mind as I assessed the technical feasibility of getting one of these options to work. First, I was trying to follow direction from my boss that “physically altering the data is preferred over data masking, and data masking is preferred over nothing.”
Second, I was very concerned that a process might update indexed data with an obfuscated value that improves cardinality. If that process improves the cardinality of the data too much, we could artificially improve execution times in cases where performance would normally be slower.
To expand on this, let’s say that a table has an indexed column (LAST_NAME) which developed lower cardinality over time. Originally, the million-row table contained an indexed column containing 95% unique values and 5% duplicate values. That is, 95% of the rows are the only row containing that value, while 5% contain values that other rows also contain. Years later, we find that same table has grown to 10 million rows and that same indexed column now contains 65% unique values and 35% duplicate values.
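If you want to check for this kind of drift yourself, a quick way to approximate a column’s uniqueness is to compare distinct values against total rows. The table and column names here are illustrative:

```sql
-- Approximate how unique an indexed column is.
-- 100.0 forces decimal arithmetic so the percentage isn't truncated to 0.
SELECT COUNT(*)                                       AS TOTAL_ROWS,
       COUNT(DISTINCT LAST_NAME)                      AS DISTINCT_VALUES,
       100.0 * COUNT(DISTINCT LAST_NAME) / COUNT(*)   AS PCT_UNIQUE
FROM   CUSTOMER;
```

Running this against both production and the load testing environment makes the variance between the two easy to quantify.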
The Db2 optimizer may make different data access decisions as the indexed column becomes less unique over time. Initially, the index on LAST_NAME may have been incredibly efficient, the optimizer’s go-to index for common SQL coming from the application. As time goes on and more duplication is introduced, the Db2 optimizer may decide that the index on PHONE_NUMBER looks a little more attractive than the index on LAST_NAME. If duplication gets bad enough on LAST_NAME, the optimizer may decide a table scan is a better alternative than using the index at all.
It’s now years later and the LAST_NAME index is not often used by Db2’s optimizer. To obfuscate PII data, we introduce a process that reads a value, rearranges the same characters, and produces a new result. Each time the process encounters the highly duplicated last name of “Smith” a new obfuscated value is produced. The first time our process runs into “Smith” it may rearrange to “imtsh”, the second time the value is rearranged into “htims”, and on the third attempt we rearrange it into “mtihs”. What was once a single duplicated value is now three unique values. If this process turns common last names into unique obfuscated values enough times, that original index on LAST_NAME may look more attractive to Db2’s optimizer than the PHONE_NUMBER index it adopted over time.
Deep Dive of Db2 One-Way Hashing
Before we could get into any serious testing of our three options, I needed to see if the concept of one-way hashing worked as we theorized. There were many questions I needed answers to. How would I call the function? What did I need to input, and what would the output look like? Was the output something I could use? Would hashing really preserve cardinality? Those were just a few of my open questions.
Using the HASH scalar function was simple enough. I was able to quickly generate a hash value that I thought could be used or manipulated to fit the specific length of a column.
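For reference, a minimal call looks something like this. The second argument selects the algorithm (in Db2, 0 is MD5, 1 is SHA-1, 2 is SHA-256):

```sql
-- HASH returns a VARBINARY value; algorithm 2 requests SHA-256.
-- Any literal or column value can be hashed the same way.
SELECT HASH('Michael', 2)
FROM   SYSIBM.SYSDUMMY1;
```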
Awesome! Now I just needed to figure out how to manipulate that value to fit my needs, such as an alphabetic name, an alphanumeric street address, or a numeric ZIP code.
That is where I hit my first problem. I learned that the value produced is VARBINARY, which cannot be manipulated like a STRING or INT type. To solve this, I needed the TRANSLATE scalar function to convert our hashed value into a hexadecimal value. At that point I could manipulate the result as needed to generate a value of a certain size that was alphabetic, numeric, or alphanumeric. In fact, I was able to combine variations of this method to concatenate a numeric hash, a space, and an alphabetic hash and turn something like “123 Sesame Street” into “968 AFD CABFD”. You can see an example of this code below.
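The exact code from our process isn’t reproduced here, but a sketch of the technique might look like the following: HEX to get a workable string from the VARBINARY hash, then TRANSLATE and REPLACE to strip the result down to numeric-only or alphabetic-only characters. Column names, the trim lengths, and the choice of SHA-256 are illustrative assumptions:

```sql
-- Convert the VARBINARY hash to a hex string, then derive numeric-only
-- and alphabetic-only fragments from it. Hex output only contains
-- 0-9 and A-F, so "alphabetic" here really means the letters A-F.
SELECT
  -- Numeric fragment, e.g. for a street number: blank out A-F, drop blanks, trim.
  LEFT(REPLACE(TRANSLATE(HEX(HASH(STREET_ADDRESS, 2)),
                         '      ', 'ABCDEF'), ' ', ''), 3)  AS NUM_PART,
  -- Alphabetic fragment, e.g. for a street name: blank out 0-9, drop blanks, trim.
  LEFT(REPLACE(TRANSLATE(HEX(HASH(STREET_ADDRESS, 2)),
                         '          ', '0123456789'), ' ', ''), 10) AS ALPHA_PART
FROM CUSTOMER;
```

Concatenating fragments like these (numeric, a literal space, alphabetic) is how a value like “123 Sesame Street” becomes something shaped like “968 AFD CABFD”.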
Findings from this initial local test were interesting, and they flushed out a few “gotchas” to watch for in the larger-scale test.
As I mentioned, VARBINARY forced us to cast the value as hexadecimal if we wanted to trim the result to a specific length. As an example, we would want to trim the hashed 65 character result to 10 alphabetic characters for FIRST_NAME or 6 numeric characters for ZIP_CODE. We also discovered that the desire to trim the result to something more manageable introduced a new problem.
Hexadecimal only uses the letters A-F (instead of A-Z) and the digits 0-9, which limits the possible number of combinations for an obfuscated value. When you limit your hashed result further and produce only numeric or alphabetic values, the possibilities shrink even more. Your combinations become very limited if you trim to a small alphabetic value like a two-character state abbreviation. We learned these limitations can artificially introduce duplication into your data.
For example, let’s say I generate a value for my nickname “Mike” which produces a hexadecimal value of “2A3B4F”. Later in the process the name “Luke” produces “6A7B9F”. At first this looks perfect because we generated unique results. However, when I force these to alphabetic values by removing the numeric characters, both names generate the same result of “ABF”. Unfortunately, this introduced data duplication where it may not normally exist, and we lost the possibility of maintaining perfect cardinality for that column of data. The question now became “do we unexpectedly influence performance because we introduce duplication? If so, by how much?”
The last thing to investigate was whether case affected the generated hexadecimal value. I was able to confirm it does. The word “Mike” and the word “MIKE” generated two separate obfuscated values, each unique, but each repeated whenever the same original value was found in another row.
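This is easy to see for yourself with two literals (again assuming SHA-256 as the algorithm):

```sql
-- Same letters, different case, different hash values.
SELECT HEX(HASH('Mike', 2)) AS MIXED_CASE,
       HEX(HASH('MIKE', 2)) AS UPPER_CASE
FROM   SYSIBM.SYSDUMMY1;
```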
Over a period of a few months, we were able to arrange three separate tests, each exploring one of the three options. We copied a 720G OLTP production database into a secure load testing environment where we could run a 30-minute test of each option. The load test consisted of four phases: queries against multiple history tables (read heavy), pulling order details (read heavy), and two distinct phases that simulated placing an order (read/write). Three separate tables, ranging from 750K to 1.1M rows, contained 29 fields that needed to be scrubbed of PII. Two of the three tables had PII data that was part of an index, such as first name, last name, zip code, and/or e-mail.
It should be noted that we were viewing results from the perspective of a normal performance load test. This means we looked at metrics from the application perspective and not from the database perspective. We focused on how many application requests completed, how long they took to execute, etc.
In hindsight, I am kicking myself for not gathering more database-related metrics during testing. I was hyper-focused on whether we affected the performance tests as we ran them, and I didn’t consider that I would later want to review the database metrics behind a throughput change to discover why a certain result happened.
It also takes time to make the data changes for most of these options, so the run time of the process implementing those changes was a factor we needed to consider.
| Metric | Programmatic Obfuscation | Db2 Data Masking | Db2 Hashing |
|---|---|---|---|
| Process Run Time | 2 hours 43 minutes | Less than 5 minutes | 2 hours 55 minutes |
| Column Cardinality Change | Range: 17%-89% | No change | Range: 0%-36% |
| Query Throughput | Phase 1: 2%; Phase 2: 0.5%; Phase 3: 10%; Phase 4: 9% | Phase 1: 1%; Phase 2: 2%; Phase 3: 8%; Phase 4: 8% | Phase 1: 1%; Phase 2: 0%; Phase 3: 0%; Phase 4: 0% |
| Avg Execution Time | Phase 1: 2%; Phase 2: 1%; Phase 3: 9%; Phase 4: 20% | Phase 1: 1%; Phase 2: 1%; Phase 3: 18%; Phase 4: 11% | Phase 1: 1%; Phase 2: 0%; Phase 3: 0%; Phase 4: 0% |
We learned a lot from these tests, so let me break down some of the findings.
Programmatic Data Obfuscation
Although the process ran slightly faster than hashing, there was an impact on the data that translated to an artificial improvement in speed. That lift in speed would not normally occur with untouched production data. This proved my initial concern was valid: performance testing results could lead us to believe a new deployment of SQL would perform better than it normally would.
We discovered that FIRST_NAME and LAST_NAME had a dramatic change in cardinality, with a column of data becoming 80%-90% more unique. The cardinality of other columns was also impacted heavily, with an average change of 30%-60% (leaning closer to 60%). I suspect the process’s impact on these indexed columns is what caused the artificial lift in speed.
We also discovered that there were many special characters within the source data that needed to be accounted for in our programming. This was a thorn in our side as we tested this method. Special characters often caused the obfuscation process to break abnormally, forcing us to add more logic to the code, reset, and try again.
Db2 Data Masking
When I read the result of this test and saw the improvement in performance my first thought was “I incorrectly calculated this result”. When I realized the results were correct, I immediately ordered a second test thinking something was off with the tool or baseline test. I was surprised to learn the second test produced similar results.
Let me be clear, this should not happen. Masking should not, and really cannot, impact performance because the original value of the data, not the masked value, is used in the optimizer’s decision-making process. But I proved that wrong, twice.
This resulted in me reaching out to a few people at IBM, who confirmed that my understanding was correct: there should be no change. So I sought out a team member who understood the application side of things better than I did. Together we came up with a theory on what happened.
Keep in mind what I mentioned just before showing the test results. I was monitoring throughput at an application level, not at the database level. I was not monitoring locking, or waits, or other internal Db2 metrics. I was monitoring how much and how fast the application processed, because that is how load testing would normally be measured in our shop. Our theory was that the application was responding unexpectedly to the repetitive data it received as a result of masking.
The applications were receiving, and making a lot of decisions based on, data that was redacted. All PII data was returned to the application as the same literal word “REDACTED”.
What is in the database:
123 Sesame Street
Atlanta, GA 30024
What the application sees:
REDACTED, REDACTED, REDACTED
The testing phases that improved in speed were not the phases that were read heavy. The phases that showed performance improvement were the phases where the application was reading information, making a decision based on what was collected, and writing information back to the database. Our theory was that the application was working with some level of cache or buffer that reused the same decision-making process after observing the same “REDACTED” result over and over. This is only a theory, but one that the application folks also thought was possible.
Ideally, we would redo the test and look at internal Db2 metrics to see if there was truly an increase in throughput from the database side (not the application side). However, we decided against further testing when we discovered a viable option where we could physically alter data without a performance impact. Remember one of my primary goals … “physically altering data is preferred over data masking”. We had just proved a viable way to do this via Db2 hashing, so masking was no longer a priority.
Db2 One-Way Hashing
I was surprised at this result. Small-scale proof-of-concept testing led me to believe performance would get worse because column cardinality would drop as duplication was injected into updated rows. I believed an increase in duplication would lead to a less effective index scan, which would ultimately lead to a hit in performance. However, we never saw that performance hit during our testing. In fact, of the three options, this method seemed to impact performance the least.
A state abbreviation (of only 2 characters) was the hardest hit with a 36% drop in cardinality. FIRST_NAME and LAST_NAME seemed to have the next highest impact with a 2%-18% change depending on the table. Interestingly, the other columns did not change that much at all with an average of <1% change.
Even with the percentage of duplicate values increasing for a column, performance seemed to stay the same. The unexpected lack of a negative performance impact may be situational to our specific database design, and this result may not be the same in other environments. Either way, this proved one-way hashing could be a viable option for our needs in this specific situation.
Our original goal asked if we could find a way to physically obfuscate data in a way that would not accidentally influence performance results. Did we meet that goal?
Yes, we did find a viable option to physically obfuscate data, but it comes with a huge caveat. Data masking aside (as data isn’t physically altered), I believe this could be done with a technique that uses Db2 one-way data hashing in specific situations. We were able to prove this as a viable option within our database architecture: a 720G OLTP database where a process scrubbed PII data in a small number of tables with a basic index on name, zip code, or e-mail. That is a specific, fairly small, test case.
Can you obfuscate data this way? Yes. Should you? It is way too early to tell, and it is only something you can answer after deep-diving the effect on your specific environment. There are many more questions to ask before this could really be considered a true option for obfuscating data.
How would the results of this test have been affected by a database that is much larger in scale? Would the design of a data warehouse or analytics workload change the sensitivity to cardinality? What about effects due to a change in data distribution? We were able to show a lower impact on column cardinality, but what is the impact on data distribution when you are limited to the alphabetic characters A-F? And if there is an impact on distribution, do we really care?
Hopefully, this blog acts as a springboard for you as the reader. Learn from my data, my mistakes, and my approach to see if this is something you can adopt as another tool in your database administration toolbox.
As for me, the story is still being written. I have taken what I have learned and have spent the last three months rolling it into a much larger data refresh and PII scrubbing process that is currently in the last phases of testing. This process is encapsulated in over 1300 lines of bash code called by a multi-step Jenkins Pipeline that is way beyond anything I thought I could ever develop. But man, this process will be something to behold when this thing is finished and is finally deployed in a few weeks. All that discovery, development, and production of something great because my friend and coworker said “Huh. Hey Mike, have you ever thought of one way hashing as a solution to this problem …”.