I read an article from Craig Mullins on “Working with variable data”. The rest of this post is going to assume you’ve read it, so click through and then come back.
I’ve seen performance issues caused by VARCHAR before, and Craig makes some interesting points. VARCHARs are actually one of the big reasons to REORG frequently: when an update makes a VARCHAR value longer and the row no longer fits on its page, DB2 moves the row and leaves a pointer behind, so every later access to that row costs an extra page read until a REORG cleans it up. His post is geared toward the mainframe, but much of it applies on DB2 for LUW too.
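If you’re on DB2 for LUW 9.7 or later, you can get a rough sense of how much of this row-relocation activity is happening with the MON_GET_TABLE monitoring function. This is just a sketch, assuming you have the privileges for the MON functions; you may want to add your own schema filter:

-- Tables with the most overflow activity since database activation.
-- High and growing OVERFLOW_ACCESSES suggests updates are relocating
-- rows and the table is a REORG candidate.
SELECT  substr(tabschema,1,18) as tabschema,
        substr(tabname,1,30) as tabname,
        overflow_accesses,
        overflow_creates
FROM    TABLE(MON_GET_TABLE('','',-2)) as t
ORDER BY overflow_accesses desc
FETCH FIRST 10 ROWS ONLY
WITH UR;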
To expand on what he’s talking about, here’s a query that will help you find columns in your database that are VARCHARs with a length of less than 30. You may want to look at columns on a variety of criteria, not just ones with a short length, but this makes for an easy example for this post.
select  substr(tabschema,1,18) as tabschema,
        substr(tabname,1,30) as tabname,
        substr(colname,1,30) as colname,
        length
from    syscat.columns
where   typename='VARCHAR'
        and length < 30
        and tabschema not in ('SYSIBM', 'SYSIBMADM', 'SYSCAT')
with ur;
The output from this query looks something like this:
TABSCHEMA          TABNAME                        COLNAME                        LENGTH
------------------ ------------------------------ ------------------------------ -----------
XXXXXXXX           ADDRESS                        BILLINGCODE                             17
XXXXXXXX           BUADDR                         BACODE                                  17
XXXXXXXX           BUADDR                         BASTATE                                 20
XXXXXXXX           BUADDR                         BAZIPC                                  20
XXXXXXXX           BUSPROF                        EMPLOYEEID                              20
XXXXXXXX           BUSPROF                        ALTERNATEID                             20
XXXXXXXX           BUYORG                         BUSTATE                                 20
XXXXXXXX           BUYORG                         BUZIPC                                  20
XXXXXXXX           COUNTRY                        CALLINGCODE                             16
XXXXXXXX           EMLBOUNCED                     STATUSCODE                              10
Now in a WebSphere Commerce database, there’s not a lot I can do about most of these – though I’ll do a similar analysis for custom tables, which by Commerce standards should all start with ‘X’, and are thus easy to isolate with an additional predicate in this query, as shown below.
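That variation would look something like this, assuming your customizations actually follow the ‘X’ naming convention:

select  substr(tabschema,1,18) as tabschema,
        substr(tabname,1,30) as tabname,
        substr(colname,1,30) as colname,
        length
from    syscat.columns
where   typename='VARCHAR'
        and length < 30
        and tabname like 'X%'
        and tabschema not in ('SYSIBM', 'SYSIBMADM', 'SYSCAT')
with ur;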
Assuming you have tables you may be able to change, you’ll want to go into further detail on them like Craig recommends, using the query from his post. Something like this:
SELECT  10*COUNT(*) as "SPACE_USED_AS_CHAR(10)",
        SUM(2+LENGTH(COUNTRYCODE)) as "SPACE_USED_AS_VARCHAR(10)",
        10*COUNT(*)-SUM(2+LENGTH(COUNTRYCODE)) as "TOTAL_SPACE_SAVED_BY_VARCHAR",
        10 as "AVG_SPACE_AS_CHAR(10)",
        AVG(2+LENGTH(COUNTRYCODE)) as "AVG_SPACE_AS_VARCHAR(10)",
        10-AVG(2+LENGTH(COUNTRYCODE)) as "AVG_SPACE_SAVED_BY_VARCHAR"
FROM    XSTORESKU
WITH UR;
SPACE_USED_AS_CHAR(10) SPACE_USED_AS_VARCHAR(10) TOTAL_SPACE_SAVED_BY_VARCHAR AVG_SPACE_AS_CHAR(10) AVG_SPACE_AS_VARCHAR(10) AVG_SPACE_SAVED_BY_VARCHAR
---------------------- ------------------------- ---------------------------- --------------------- ------------------------ --------------------------
              20222270                   8088908                     12133362                    10                        4                          6

  1 record(s) selected.
Note that you’ll want to change the ’10’s in the above query to the defined length of the VARCHAR column you’re working with. In this case, if I were to define this column as CHAR(10) instead of VARCHAR(10), I would use about 11.5 MB more space. Small potatoes. It’s also not a column likely to change much once the record is inserted (for a US-based business, countries don’t often change, and knowing this client, I suspect there are only two possible two-character values). If, on the other hand, it were a column that was likely to change, there could be significant overhead associated with moving the record, so it might be worth going for the CHAR. You don’t want to impact query performance just to save a minuscule 11.5 MB of space.
Note that the change from CHAR(10) to VARCHAR(10) causes no essential change in functionality – the same lengths of data can be stored.
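If you did decide to convert a column, newer versions of DB2 for LUW can alter the datatype in place, though the table ends up in reorg-pending state afterward. A minimal sketch against the post’s example table, assuming none of the stored values would be truncated and that your release supports this form of ALTER:

-- Change the column type in place (LUW 9.7+); this kind of alter
-- places the table in reorg-pending state.
ALTER TABLE WSCOMUSR.XSTORESKU ALTER COLUMN COUNTRYCODE SET DATA TYPE CHAR(10);

-- A classic (offline) REORG is required before normal access resumes.
REORG TABLE WSCOMUSR.XSTORESKU;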
In my specific example, maybe the better question is whether we could use a shorter length. Is there really a case where we would need all 10 characters, when most of my rows are only 2 characters? Let’s look at the data in the same format as above with that change.
The query would be:
SELECT  2*COUNT(*) as "SPACE_USED_AS_CHAR(2)",
        SUM(2+LENGTH(COUNTRYCODE)) as "SPACE_USED_AS_VARCHAR(10)",
        2*COUNT(*)-SUM(2+LENGTH(COUNTRYCODE)) as "TOTAL_SPACE_SAVED_BY_VARCHAR",
        2 as "AVG_SPACE_AS_CHAR(2)",
        AVG(2+LENGTH(COUNTRYCODE)) as "AVG_SPACE_AS_VARCHAR(10)",
        2-AVG(2+LENGTH(COUNTRYCODE)) as "AVG_SPACE_SAVED_BY_VARCHAR"
FROM    WSCOMUSR.XSTORESKU
WITH UR;
Note that I’ve changed some of those 10’s to 2’s to reflect the length change I’m considering.
SPACE_USED_AS_CHAR(2) SPACE_USED_AS_VARCHAR(10) TOTAL_SPACE_SAVED_BY_VARCHAR AVG_SPACE_AS_CHAR(2) AVG_SPACE_AS_VARCHAR(10) AVG_SPACE_SAVED_BY_VARCHAR
--------------------- ------------------------- ---------------------------- -------------------- ------------------------ --------------------------
              4044454                   8088908                     -4044454                    2                        4                         -2

  1 record(s) selected.
I can see that I could save just under 4 MB by changing this column to a CHAR(2). I’m quite sure it’s not worth it here, but there are hundreds of little decisions like this that add up when you’re building custom tables, so it’s certainly something to keep in mind. If I did decide to shrink a column, I’d first verify that no existing values would be truncated, as in the check below.
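A quick sanity check for that might look like this:

-- Longest value currently stored, plus a count of rows that would
-- not fit in the proposed CHAR(2) definition.
SELECT  max(length(countrycode)) as max_len,
        count(case when length(countrycode) > 2 then 1 end) as rows_too_long
FROM    WSCOMUSR.XSTORESKU
WITH UR;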
I do review tables that my developers write, and one of the things I look for is VARCHAR(1), VARCHAR(2), and VARCHAR(3) – since the length overhead on each value cancels out most or all of the space a VARCHAR could save, these almost never make sense.
Thanks for the great post.
Changing the datatype from VARCHAR to CHAR for a given column may result in:

a) data mismatches if the same column is being used in a local or join predicate, and
b) differences in returned values, since a VARCHAR is returned without padding whereas a CHAR comes back padded with trailing blanks.

A good developer would handle both of these with a TRIM function, but it would be especially difficult to review all of the affected code if it is already in production.
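For instance, stripping the padding at retrieval time keeps the application seeing the same values either way. A small sketch against the post’s example column, assuming it had been converted to CHAR(10):

-- RTRIM removes the trailing blanks a CHAR(10) value is returned with,
-- so the application sees 'US' rather than 'US' plus eight blanks.
SELECT rtrim(countrycode) as countrycode
FROM   WSCOMUSR.XSTORESKU
WITH UR;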