VARCHAR: Friend or Foe

Posted by

I read an article from Craig Mullins on “Working with variable data“. The rest of this post is going to assume you’ve read it, so click through and then come back.

I’ve seen performance issues caused by VARCHAR before, and Craig makes some interesting points. VARCHARs are actually one of the big reasons to Reorg frequently. His post is geared toward the mainframe, but much of it applies on DB2 for LUW too.

To expand on what he’s talking about, here’s a query that will help you find columns in your database that are VARCHARS with a length of less than 30. You obviously may want to look at columns on a variety of criteria, not just  ones with a short length, but this makes for an easy example for this post.

select  substr(tabschema,1,18) as tabschema,
        substr(tabname,1,30) as tabname,
        substr(colname,1,30) as colname,
        length
    from syscat.columns
    where       typename='VARCHAR'
                and length < 30
                and tabschema not in ('SYSIBM', 'SYSIBMADM', 'SYSCAT')
    with ur;

The output from this query looks something like this:

TABSCHEMA          TABNAME                        COLNAME                        LENGTH
------------------ ------------------------------ ------------------------------ -----------
XXXXXXXX           ADDRESS                        BILLINGCODE                             17
XXXXXXXX           BUADDR                         BACODE                                  17
XXXXXXXX           BUADDR                         BASTATE                                 20
XXXXXXXX           BUADDR                         BAZIPC                                  20
XXXXXXXX           BUSPROF                        EMPLOYEEID                              20
XXXXXXXX           BUSPROF                        ALTERNATEID                             20
XXXXXXXX           BUYORG                         BUSTATE                                 20
XXXXXXXX           BUYORG                         BUZIPC                                  20
XXXXXXXX           COUNTRY                        CALLINGCODE                             16
XXXXXXXX           EMLBOUNCED                     STATUSCODE                              10

Now in a WebSphere Commerce database, there’s not a lot I can do about most of these – though I’ll do a similar analysis for custom tables, which by the Commerce standards should all start with ‘X’, and are thus easy to filter out with a where clause in this query.

Assuming you have tables you may be able to change, you’ll want to go into further detail on them like Craig recommends – something like this – using the query from Craig’s post:

SELECT  10*COUNT(*) as "SPACE_USED_AS_CHAR(10)",
            SUM(2+LENGTH(COUNTRYCODE)) as "SPACE_USED_AS_VARCHAR(10)",
            10*COUNT(*)-SUM(2+LENGTH(COUNTRYCODE)) as "TOTAL_SPACE_SAVED_BY_VARCHAR",
            10 as "AVG_SPACE_AS_CHAR(10)",
            AVG(2+LENGTH(COUNTRYCODE)) "AVG_SPACE_AS_VARCHAR(10)",
            10-AVG(2+LENGTH(COUNTRYCODE)) as "AVG_SPACE_SAVED_BY_VARCHAR"
    FROM XSTORESKU
    WITH UR;
SPACE_USED_AS_CHAR(10) SPACE_USED_AS_VARCHAR(10) TOTAL_SPACE_SAVED_BY_VARCHAR AVG_SPACE_AS_CHAR(10) AVG_SPACE_AS_VARCHAR(10) AVG_SPACE_SAVED_BY_VARCHAR
---------------------- ------------------------- ---------------------------- --------------------- ------------------------ --------------------------
              20222270                   8088908                     12133362                    10                        4                          6

  1 record(s) selected.

Note, that you want to change the ’10’s in the above query to whatever length of column the VARCHAR you’re working with is. In this case, if I were to define this column as CHAR(10) instead of VARCHAR(10), I would lose about about 11.5 MB. Small potatoes. In this case, it’s not a likely column to change much once the record is inserted (for a US-based business, Countries don’t often change, and knowing this client, I suspect there are only two possible two-character values). If, on the other hand, it were a column that was likely to change, there could be significant overhead associated with moving the record, so it might be worth going for the CHAR. You don’t want to impact query perfromance for just a miniscule 11.5 MB of space.

Note that the change from CHAR(10) to VARCHAR(10) causes no essential change in functionality – the same lengths of data can be stored.

In my specific example, maybe the better question would be can we do a shorter length – is there really a case where we would need all 10 characters, when most of my rows are only 2 characters? Let’s look at the data in the same format as above with that change:

The query would be:

SELECT  2*COUNT(*) as "SPACE_USED_AS_CHAR(2)",
            SUM(2+LENGTH(COUNTRYCODE)) as "SPACE_USED_AS_VARCHAR(10)",
            2*COUNT(*)-SUM(2+LENGTH(COUNTRYCODE)) as "TOTAL_SPACE_SAVED_BY_VARCHAR",
            2 as "AVG_SPACE_AS_CHAR(10)",
            AVG(2+LENGTH(COUNTRYCODE)) "AVG_SPACE_AS_VARCHAR(10)",
            2-AVG(2+LENGTH(COUNTRYCODE)) as "AVG_SPACE_SAVED_BY_VARCHAR"
    FROM    WSCOMUSR.XSTORESKU
    WITH UR;

Note that I’ve changed some of those 10’s to 2’s to reflect the length change I’m considering.

SPACE_USED_AS_CHAR(2) SPACE_USED_AS_VARCHAR(10) TOTAL_SPACE_SAVED_BY_VARCHAR AVG_SPACE_AS_CHAR(10) AVG_SPACE_AS_VARCHAR(10) AVG_SPACE_SAVED_BY_VARCHAR
--------------------- ------------------------- ---------------------------- --------------------- ------------------------ --------------------------
              4044454                   8088908                     -4044454                     2                        4                         -2

  1 record(s) selected.

I can see that I could save just under 4 MB by changing this column to a CHAR(2). I’m quite sure that it’s not worth it here, but there are hundreds of little decisions like this that add up when you’re building custom tables, so it’s sure something to keep in mind.

I do review tables that my developers write, and one of the things I look for is VARCHAR(1), VARCHAR(2), and VARCHAR(3) – since these almost never make sense.

Ember is always curious and thrives on change. She has built internationally recognized expertise in IBM Db2, and is now pivoting to focus on learning MySQL. Ember shares both posts about her core skill set and her journey learning MySQL. Ember lives in Denver and work from home

One comment

  1. Thanks for the great post.

    Changing the datatype from varchar to char for a given column may result in

    a) data mismatch if the same column is being used in local or join predicate.
    b) varchar returns value without padding where as char returns back with padding

    Good developer would handle these two things with trim function. Esp it would be difficult to review the entire code if it is already in production.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.