One of my items to check before go-live is to see that data pruning is in place. Yes, before a website my company supports even goes live, I have at least made the attempt to start pruning the data that hasn’t been collected yet. I saw this video today: http://smarterquestions.org/2012/02/video-marc-hebert-of-estuate-on-cleaning-data-cholesterol-from-the-enterprise/, and thought I’d write a short post inspired by it. Go check it out and see what you think.
In e-commerce databases, there are some tables in which the data quickly becomes obsolete. I’ve got one in many of my WebSphere Commerce databases in which the data is obsolete after 10 minutes(CACHEIVL). It literally has no imaginable purpose after that time. There are others where records with certain values for certain columns that are obsolete after days or weeks (STAGLOG, Guest users in USERS/MEMBER, abandoned shopping carts in ORDERS/ORDERITEMS, etc).
Would anyone deny that all other things being equal, a smaller database performs faster than a larger database?
I like how the video equates all the extra data to cholesterol – it builds up in your database systems and slows down all the ‘good’ data that you actually need and want to access.
In e-commerce databases it frequently takes a committee made up of a DBA, developers, a project manager or two, and someone from the business all to decide on the right data retention guidelines. In general terms, you must make decisions on how long to keep things like abandoned shopping carts (both for registered and guest users), users who have not logged in for months or years, and some application specific data like session data and cache clearing data if your application stores it in the database.
Keeping your e-commerce database from getting ‘clogged up’ with all that unused data is not the only reason to prune data. The other reason is how long it takes to delete data. While inserts to MEMBER/USERS tables in WebSphere databases are generally nice and fast, deletes are not. I’ve had systems where it took me 0.4 seconds to delete a single member. Granted this was on older hardware and DB2 8, but still – that means I can only delete 150 users per minute. AND I’ve seen them added at a rate of 800 per minute or more on medium-sized implementations. Yes, indeed, I can add users faster than I can delete them. There are some database and application design issues in there, of course – If not set up right, WebSphere Commerce can create guest users when they are not needed (see information on the isGeneric method like this tech note: http://www-01.ibm.com/support/docview.wss?uid=swg21316161) (note: IBM has removed this technote, and I cannot find it elsewhere). That’s an education issue for developers. And the level of RI in WebSphere databases is pretty high – over 300 tables depend either directly or indirectly on the MEMBER/USERS tables, and all foreign keys cascade on delete – that’s one reason why deletions can be so slow.
So whether you’re using an application with a vended tool like DBCLEAN for WebSphere Commerce, or you’ve got a custom application you have to write the scripts for yourself, don’t forget to implement data pruning early in the application/database’s life to keep the ‘data cholesterol’ from slowing things down.
For posts on a few specific pruning areas in WebSphere Commerce, see my <a href=https://datageek.blog/en/?s=data+pruning”>posts on Data Pruning/DBClean.