In a DB2 DPF (Database Partitioning Feature) database, choosing the right distribution keys is an important decision that DBAs have to make. An incorrect distribution key (also called a hash key) can skew the data distribution of one or more tables across the database partitions. This can result in a subset of database partitions becoming the bottleneck for query processing. In this blog post, I will share some insight into this topic with a couple of examples.
What Is a Distribution Key?
A distribution key is a column (or set of columns) that dictates the database partition on which DB2 physically stores a row of a table. A distribution key therefore applies to each (hash-partitioned) table, not to the entire database. For example, table1 could use (col1, col2) as its hash key while table2 uses (col3, col4). A table's distribution key is defined in the CREATE TABLE statement with the DISTRIBUTE BY clause. My earlier post has details on creating tables on more than one partition in a DB2 DPF database.
How Hashing Works
When a new row is inserted into a hash-partitioned table, DB2 applies its hashing algorithm to the row's distribution key values. The output of the hashing algorithm is a number from 0 to 32767 (0 to 4095 prior to DB2 9.7). This number is an index into the distribution map, an array whose entries hold the database partition number on which the row is to be stored.
For example, if the hashing algorithm returns a value of 5, and entry 5 of the distribution map holds partition number 3, the row is stored on partition #3.
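Conceptually, the map lookup works like the sketch below. This is a hypothetical Python illustration: DB2's actual hashing algorithm is internal, so the hash output is taken as given, and the round-robin map layout and the three-partition group are assumptions for the example.

```python
# Hypothetical sketch of DB2's distribution-map lookup. The real hashing
# algorithm is internal to DB2; here we take its output (0..32767) as given.
MAP_SIZE = 32768                 # 4096 entries prior to DB2 9.7
PARTITIONS = [1, 2, 3]           # assumed partitions in the partition group

# Assume the distribution map assigns partitions round-robin.
distribution_map = [PARTITIONS[i % len(PARTITIONS)] for i in range(MAP_SIZE)]

def target_partition(hash_value: int) -> int:
    """Look up the database partition for a given hash output."""
    return distribution_map[hash_value]

# A hash output of 5 lands on entry 5 of the map, which holds partition 3.
print(target_partition(5))  # -> 3
```

Because the map has many more entries than there are partitions, the evenness of the resulting data placement depends entirely on how evenly the distribution key values hash across the map entries.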
To Hash or Not
When a data modeler requests a new table, as a DBA, I ask for the expected cardinality (row count) of that table. I usually create a table across multiple database partitions when its cardinality exceeds 100,000 rows. This is not a hard-and-fast rule and is mostly at the DBA's discretion. The idea is to hash large tables so as to divide the work among database partitions.
Here is an example of creating and loading a table in a multi-partition database. I am choosing not to hash partition this table to make a point.
CREATE TABLE DB2INST1.TRANSACTION_NO_HASH (
    FACT_TRANS_DETAIL_GK BIGINT NOT NULL,
    ACCT_NUM_TEXT VARCHAR(20) NOT NULL,
    TRANS_POST_DT DATE NOT NULL,
    TRANS_PROD_CD CHAR(3) NOT NULL,
    TRANS_POST_SEQ_NUM SMALLINT NOT NULL,
    TRANS_EFF_DT DATE NOT NULL,
    DEBIT_CREDIT_IND CHAR(1) NOT NULL,
    TRANS_AMT DECIMAL(15,2) NOT NULL,
    TRANSACTION_DESC VARCHAR(47),
    RUNNING_BAL_AMT DECIMAL(13,2)
);
DB20000I  The SQL command completed successfully.

-- DDL Statements for Primary Key on Table DB2INST1.TRANSACTION_NO_HASH
ALTER TABLE DB2INST1.TRANSACTION_NO_HASH
    ADD CONSTRAINT PK_TRANSACTION_NO_HASH
    PRIMARY KEY (ACCT_NUM_TEXT, TRANS_POST_DT, TRANS_POST_SEQ_NUM);
DB20000I  The SQL command completed successfully.
After inserting 2M rows and running runstats (with distribution on all columns and detailed indexes), below is the table cardinality information from the system catalogs.
db2 "select card from syscat.tables where tabname = 'TRANSACTION_NO_HASH' AND TABSCHEMA = 'DB2INST1' with ur"

CARD
--------------------
             2000000

  1 record(s) selected.
The query below shows that all 2M rows are physically stored on partition number 0 (as this is a non-hashed table).
db2 "SELECT COUNT(*) AS ROW_COUNT, DBPARTITIONNUM(RUNNING_BAL_AMT) AS PARTITION_NUM FROM DB2INST1.TRANSACTION_NO_HASH GROUP BY DBPARTITIONNUM(RUNNING_BAL_AMT) ORDER BY 2 WITH UR"

ROW_COUNT   PARTITION_NUM
----------- -------------
    2000000             0

  1 record(s) selected.
In the above SQL, I chose the column RUNNING_BAL_AMT as the argument to the DBPARTITIONNUM function. However, any valid column name in the table could be used.
Choosing a Distribution Key – Co-location vs. Even Data Distribution
One of the most important goals of choosing the right distribution key is to reduce the time required to process a query. If the data to be processed is suitably partitioned, parallel execution is possible without excessive overhead and bottlenecks.
In a DPF database (a massively parallel, shared-nothing architecture), multiple database partitions process data in parallel, so a query's run time is determined by the processing time on the slowest database partition.
When choosing a distribution key for a table, the available strategies are:
- Co-locating frequently joined tables
- Ensuring even distribution of a table's data
Co-Locating Frequently Joined Tables
Co-location minimizes data transfer overhead. For a given query, co-location ensures that matching rows of two tables always reside on the same database partition. This strategy has prerequisites (for example, the tables must be in the same database partition group and joined on their distribution keys) that are out of scope for the present article.
This approach can be beneficial when joining a fact table with a dimension table. However, since a fact table holds far more data (a much larger row count) than a dimension table, co-location can quickly cause data skew on the fact table and, in turn, processing delays on a particular partition. Also, co-location tends to be application/workload specific: a single new query that differs substantially from the existing workload can render the distribution strategy non-optimal.
Ensuring Even Distribution of a Table's Data
This strategy aims to minimize data skew by distributing data evenly across the disks of multiple nodes. Because DPF is a shared-nothing architecture (each database partition has its own dedicated disk), all database partitions on which the table is created participate in every query. This means more parallelism, although there can be data transfer between partitions. In my experience, this strategy scales well with data growth and is not application/workload specific: as long as high-quality distribution keys are selected, there will be minimal data skew.
So, how do you ensure even distribution of a table's data? The key is to choose high-cardinality column(s) as the distribution key.
Cardinality of Distribution Key
The column(s) chosen for the distribution key should have a high combined cardinality to ensure the most even possible distribution of data across partitions. Before picking a distribution key, it is important to get a feel for column cardinality. This can be done by loading a representative sample of data.
For the table created above, the SQL below gives us a pretty good idea of which columns have high cardinality compared to others.
db2 "SELECT colname, colcard FROM syscat.columns WHERE tabname = 'TRANSACTION_NO_HASH' AND tabschema = 'DB2INST1' ORDER BY colcard DESC with ur"

COLNAME                      COLCARD
---------------------------- --------------------
FACT_TRANS_DETAIL_GK                      1833320
RUNNING_BAL_AMT                            541113
TRANSACTION_DESC                           149504
TRANS_AMT                                   74752
ACCT_NUM_TEXT                               58000
TRANS_EFF_DT                                  368
TRANS_POST_DT                                 316
TRANS_POST_SEQ_NUM                            102
TRANS_PROD_CD                                   2
DEBIT_CREDIT_IND                                2

  10 record(s) selected.
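To see why cardinality matters so much, here is an illustrative simulation in plain Python (an assumption-laden sketch: md5 stands in for DB2's internal hash, and the row count and key shapes are invented). A key with only 316 distinct values, like TRANS_POST_DT above, spreads rows far less evenly over 12 partitions than a unique key like FACT_TRANS_DETAIL_GK.

```python
import hashlib
from collections import Counter

N_PARTITIONS = 12
N_ROWS = 100_000

def partition(key: str) -> int:
    # md5 stands in for DB2's internal hashing algorithm (an assumption)
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % N_PARTITIONS

# Low-cardinality key: only 316 distinct values (like TRANS_POST_DT above)
low_card = Counter(partition(f"date-{i % 316}") for i in range(N_ROWS))
# High-cardinality key: every value unique (like FACT_TRANS_DETAIL_GK)
high_card = Counter(partition(f"gk-{i}") for i in range(N_ROWS))

def spread(counts: Counter) -> int:
    """Difference in row count between the fullest and emptiest partition."""
    return max(counts.values()) - min(counts.values())

print("low-cardinality spread: ", spread(low_card))
print("high-cardinality spread:", spread(high_card))
```

With a low-cardinality key, each distinct value drags hundreds of rows with it to whichever partition it hashes to, so the imbalance is amplified; a unique key smooths out at the row level.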
Hashing on Low-Cardinality Columns
Now, let us create a new table with a similar structure and deliberately choose the low-cardinality column TRANS_POST_DT as its distribution key, knowing this is a bad choice: such a low-cardinality distribution key will skew the data distribution across database partitions. In this example, I created the tablespaces (DDL not shown for brevity) TEST_TBSP_HASH and TEST_TBSP_HASH_IX on 12 partitions.
CREATE TABLE DB2INST1.TRANSACTION_HASH1 (
    FACT_TRANS_DETAIL_GK BIGINT NOT NULL,
    ACCT_NUM_TEXT VARCHAR(20) NOT NULL,
    TRANS_POST_DT DATE NOT NULL,
    TRANS_PROD_CD CHAR(3) NOT NULL,
    TRANS_POST_SEQ_NUM SMALLINT NOT NULL,
    TRANS_EFF_DT DATE NOT NULL,
    DEBIT_CREDIT_IND CHAR(1) NOT NULL,
    TRANS_AMT DECIMAL(15,2) NOT NULL,
    TRANSACTION_DESC VARCHAR(47),
    RUNNING_BAL_AMT DECIMAL(13,2)
)
DISTRIBUTE BY HASH (TRANS_POST_DT)
IN TEST_TBSP_HASH INDEX IN TEST_TBSP_HASH_IX;
DB20000I  The SQL command completed successfully.

ALTER TABLE DB2INST1.TRANSACTION_HASH1
    ADD CONSTRAINT PK_TRANSACTION_HASH1
    PRIMARY KEY (ACCT_NUM_TEXT, TRANS_POST_DT, TRANS_POST_SEQ_NUM);
DB20000I  The SQL command completed successfully.
An important point to note is that a table hashed across multiple partitions logically exists only once in DB2's catalog; the table's data, however, is spread across multiple database partitions.
After the same 2M rows were copied into the above table and runstats was run, below is the information from syscat.tables:
db2 "select card from syscat.tables where tabname = 'TRANSACTION_HASH1' AND TABSCHEMA = 'DB2INST1' with ur"

CARD
--------------------
             2303808

  1 record(s) selected.
However, only 2M rows were loaded into the table TRANSACTION_HASH1 as is evident from the SQL below:
db2 "select count(*) from DB2INST1.TRANSACTION_HASH1"

1
-----------
    2000000

  1 record(s) selected.
DB2's system catalogs report the row count (card) for this table as 2.30M, a whopping 15% off from the correct value. This is because, by default, runstats collects statistics on a single database partition (the first partition on which the table exists) and extrapolates to the others by multiplying by the number of partitions: 191,984 × 12 = 2,303,808.
db2 "SELECT COUNT(*) AS ROW_COUNT, DBPARTITIONNUM(TRANS_POST_DT) AS PARTITION_NUM FROM DB2INST1.TRANSACTION_HASH1 GROUP BY DBPARTITIONNUM(TRANS_POST_DT) ORDER BY 2 WITH UR"

ROW_COUNT   PARTITION_NUM
----------- -------------
     191984             1
     188357             2
     227031             3
     182864             4
     113382             5
      40560             6
     213304             7
     201435             8
     127661             9
     104527            10
     194245            11
     214650            12

  12 record(s) selected.
As previously discussed, I chose the column TRANS_POST_DT for the function DBPARTITIONNUM. However, any column could be chosen as long as it is a valid column name in that table.
We can quickly calculate the % skew on each partition. For example, for partition #1, the skew is 15.19%. Here is the math:

Theoretical average number of rows per partition = 2,000,000 / 12 = 166,666
Observed number of rows on partition 1 = 191,984
% skew for partition #1 = ((191,984 - 166,666) / 166,666) * 100 = 15.19%
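The same arithmetic can be wrapped in a small helper (a sketch; skew here is simply the percent deviation from a perfectly even split):

```python
def skew_pct(observed: int, total: int, n_partitions: int) -> float:
    """Percent deviation of a partition's row count from an even split."""
    expected = total / n_partitions        # 2,000,000 / 12 ≈ 166,666
    return abs(observed - expected) / expected * 100

# Partition #1 from the output above:
print(round(skew_pct(191_984, 2_000_000, 12), 2))  # -> 15.19
```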
To avoid reinventing the wheel and to save time, I rely on a stored procedure from IBM developerWorks to do this math for me.
$ db2 "call estimate_existing_data_skew ('DB2INST1','TRANSACTION_HASH1',100)"

Return Status = 0

DATA SKEW ESTIMATION REPORT FOR: DB2INST1.TRANSACTION_HASH1
This report is based on the existing partitioning keys
Accuracy is based on 100% sample of data
---------------------------------------------------------------------------------------
DB2INST1.TRANSACTION_HASH1
Estimated total number of records in the table:   : 2,000,000
Estimated average number of records per partition : 166,666
Row count at partition 1  : 191,984 (Skew: 15.19%)
Row count at partition 2  : 188,357 (Skew: 13.01%)
Row count at partition 3  : 227,031 (Skew: 36.21%)
Row count at partition 4  : 182,864 (Skew: 9.71%)
Row count at partition 5  : 113,382 (Skew: 31.97%)
Row count at partition 6  : 40,560 (Skew: 75.66%)
Row count at partition 7  : 213,304 (Skew: 27.98%)
Row count at partition 8  : 201,435 (Skew: 20.86%)
Row count at partition 9  : 127,661 (Skew: 23.40%)
Row count at partition 10 : 104,527 (Skew: 37.28%)
Row count at partition 11 : 194,245 (Skew: 16.54%)
Row count at partition 12 : 214,650 (Skew: 28.79%)
Number of partitions: 12 (1,2,3,4,5,6,7,8,9,10,11,12)
As you will notice, the data skew % on the first partition (15.19%) matches the % difference between the actual row count and the 'card' value in syscat.tables. This reinforces that DB2 picked partition #1 for runstats (as expected, since it is the first database partition on which the table exists) and that catalog information in a DB2 DPF database is heavily influenced by the choice of distribution keys.
Hashing on High-Cardinality Columns
Let us take a look at one more example, in which the distribution key is the primary key. Hashing on the primary key is typically the best way to minimize data skew across partitions. However, if the primary key is a timestamp or otherwise time-based, it is best to avoid using it as the distribution key: if most queries target a single date, such a key causes processing skew, with one partition doing most of the work. A high-cardinality surrogate key (usually a system-generated unique column) can also be a good choice. However, surrogate keys are rarely used in join operations, so when such a key is the distribution key, no co-location is possible. Personally, I prefer to hash on the primary key as long as it does not contain too many columns (usually four or fewer) to limit the overhead of applying the hashing algorithm.
An important point to remember is that these skew analyses are being done on a fairly large data set (2M rows).
For 2M rows and distribution key = primary key (ACCT_NUM_TEXT, TRANS_POST_DT, TRANS_POST_SEQ_NUM), below is the skew data:
Estimated total number of records in the table:   : 2,000,000
Estimated average number of records per partition : 166,666
Row count at partition 1  : 166,075 (Skew: 0.35%)
Row count at partition 2  : 168,052 (Skew: 0.83%)
Row count at partition 3  : 167,450 (Skew: 0.47%)
Row count at partition 4  : 165,207 (Skew: 0.87%)
Row count at partition 5  : 166,456 (Skew: 0.12%)
Row count at partition 6  : 167,996 (Skew: 0.79%)
Row count at partition 7  : 167,483 (Skew: 0.49%)
Row count at partition 8  : 164,753 (Skew: 1.14%)
Row count at partition 9  : 165,649 (Skew: 0.61%)
Row count at partition 10 : 167,659 (Skew: 0.59%)
Row count at partition 11 : 167,357 (Skew: 0.41%)
Row count at partition 12 : 165,863 (Skew: 0.48%)
The skew % looks totally different when the analysis is done on a fairly small data set.
For only 120 rows (a multiple of the number of database partitions) and distribution key = primary key (ACCT_NUM_TEXT, TRANS_POST_DT, TRANS_POST_SEQ_NUM, TRANS_PROD_CD), below is the skew data:
Estimated total number of records in the table:   : 120
Estimated average number of records per partition : 10
Row count at partition 1  : 106 (Skew: 960.00%)
Row count at partition 2  : 0 (Skew: 100.00%)
Row count at partition 3  : 0 (Skew: 100.00%)
Row count at partition 4  : 14 (Skew: 40.00%)
Row count at partition 5  : 0 (Skew: 100.00%)
Row count at partition 6  : 0 (Skew: 100.00%)
Row count at partition 7  : 0 (Skew: 100.00%)
Row count at partition 8  : 0 (Skew: 100.00%)
Row count at partition 9  : 0 (Skew: 100.00%)
Row count at partition 10 : 0 (Skew: 100.00%)
Row count at partition 11 : 0 (Skew: 100.00%)
Row count at partition 12 : 0 (Skew: 100.00%)
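The effect of sample size can be illustrated with a small simulation (again plain Python, with md5 standing in for DB2's internal hash; the row counts are arbitrary assumptions): even a perfectly unique key looks badly skewed when only 120 rows are hashed into 12 partitions, while the same key distributes almost perfectly at scale.

```python
import hashlib
from collections import Counter

N_PARTITIONS = 12

def partition(key: str) -> int:
    # md5 stands in for DB2's internal hashing algorithm (an assumption)
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % N_PARTITIONS

def max_skew_pct(n_rows: int) -> float:
    """Worst-case % deviation from an even split, over all partitions."""
    counts = Counter(partition(f"row-{i}") for i in range(n_rows))
    expected = n_rows / N_PARTITIONS
    return max(abs(counts.get(p, 0) - expected) / expected * 100
               for p in range(N_PARTITIONS))

print("max skew, 120 rows:    ", round(max_skew_pct(120), 1))
print("max skew, 120,000 rows:", round(max_skew_pct(120_000), 1))
```

This is why the 2M-row analysis above is trustworthy while the 120-row one is not: random hashing variance shrinks relative to the partition average as the row count grows.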
How to find the Distribution Key for an Existing Table
Here is a SQL statement that finds the distribution key of an existing table. Just substitute the table name with the one you are interested in.
db2 "select tabschema, tabname, colname, partkeyseq from syscat.columns where tabname like 'TRANSACTION%' and partkeyseq != 0 order by partkeyseq with ur"

TABSCHEMA   TABNAME             COLNAME              PARTKEYSEQ
----------- ------------------- -------------------- ----------
DB2INST1    TRANSACTION_HASH1   ACCT_NUM_TEXT                 1
DB2INST1    TRANSACTION_HASH1   TRANS_POST_DT                 2
DB2INST1    TRANSACTION_HASH1   TRANS_POST_SEQ_NUM            3

  3 record(s) selected.
Here are a few important takeaways:
- A distribution key can consist of one or more columns.
- Avoid low-cardinality columns such as 'state' or Boolean-style indicators (e.g., IS_ACTIVE).
- The choice of distribution key directly determines data skew across database partitions.
- A table's data skew plays a big role in the accuracy of DB2's catalog data for that table.
- For correct interpretation of results, data skew analysis should be done on a data set that is representative of the full data to be loaded.
- DB2's statistics can be misleading in multi-partition environments because the row count is extrapolated from the single partition on which runstats ran.
Pavan’s work experience includes database consulting engagements at Fortune Finance / Health care / Insurance clients. He currently works full time as a Sr. Database Engineer and is an IBM Certified Advanced Database Administrator – DB2 10.1 for Linux, UNIX and Windows.
In his current role, he manages and supports a large multi-terabyte Enterprise Data Warehouse (EDW) on DB2 LUW DPF (Database Partitioning Feature) and Highly Available (24/7) OLTP (DB2 / HADR) databases.
He was selected into “Top 10 User Speakers” at the 2014 IDUG Technical Conference in Phoenix, AZ.
His LinkedIn profile is: www.linkedin.com/in/pavankristipati/. He can be reached at pavan @db2talk.com (no space before @)
He is a regular blogger at www.db2talk.com.