One of my items to check before go-live is to see that data pruning is in place. Yes, before a website my company supports even goes live, I have at least made the attempt to start pruning the data that hasn’t been collected yet. I saw this video today: http://smarterquestions.org/2012/02/video-marc-hebert-of-estuate-on-cleaning-data-cholesterol-from-the-enterprise/, and thought I’d write a short post inspired by it. Go check it out and see what you think.
In e-commerce databases, there are some tables in which the data quickly becomes obsolete. I’ve got one in many of my WebSphere Commerce databases in which the data is obsolete after 10 minutes (CACHEIVL). It literally has no imaginable purpose after that time. There are others where records with certain values in certain columns become obsolete after days or weeks (STAGLOG, guest users in USERS/MEMBER, abandoned shopping carts in ORDERS/ORDERITEMS, etc.).
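The idea behind pruning a table like that is simple: anything older than its useful window just gets deleted. Here’s a minimal sketch of time-based pruning using Python and SQLite as a stand-in for DB2 – the table and column names (cacheivl, inserttime) are only illustrative, not the real Commerce schema, so check your own schema before adapting anything like this:

```python
import sqlite3
from datetime import datetime, timedelta

# Illustrative stand-in for a cache-invalidation table; the real schema
# in your database will differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cacheivl (template TEXT, inserttime TEXT)")

now = datetime(2012, 3, 1, 12, 0, 0)
rows = [
    ("old_entry", (now - timedelta(minutes=30)).isoformat()),   # obsolete
    ("fresh_entry", (now - timedelta(minutes=2)).isoformat()),  # still useful
]
conn.executemany("INSERT INTO cacheivl VALUES (?, ?)", rows)

# Anything older than the 10-minute window has no further purpose.
cutoff = (now - timedelta(minutes=10)).isoformat()
deleted = conn.execute(
    "DELETE FROM cacheivl WHERE inserttime < ?", (cutoff,)
).rowcount
conn.commit()
print(deleted)  # 1 row pruned; the fresh entry survives
```

The point isn’t the SQL – it’s that a script this trivial, scheduled early in the database’s life, is all it takes to keep a table like this from growing forever.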
Would anyone deny that all other things being equal, a smaller database performs faster than a larger database?
I like how the video equates all the extra data to cholesterol – it builds up in your database systems and slows down all the ‘good’ data that you actually need and want to access.
In e-commerce databases, it frequently takes a committee – a DBA, developers, a project manager or two, and someone from the business – to decide on the right data retention guidelines. In general terms, you must decide how long to keep things like abandoned shopping carts (for both registered and guest users), users who have not logged in for months or years, and application-specific data like session data and cache-clearing data, if your application stores those in the database.
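Whatever that committee decides, it helps to write the policy down as data rather than bury it in scripts. A sketch of what that might look like – the table names are real WebSphere Commerce ones, but every retention window here is made up for illustration; yours will differ:

```python
from datetime import timedelta

# Hypothetical retention policy -- the windows below are invented examples,
# not recommendations. Your committee decides the real numbers.
RETENTION = {
    "CACHEIVL": timedelta(minutes=10),       # cache-invalidation entries
    "STAGLOG": timedelta(days=30),           # processed staging records
    "ORDERS_ABANDONED": timedelta(days=90),  # abandoned carts, registered + guest
    "USERS_GUEST": timedelta(days=14),       # guest users with no order activity
}

def is_prunable(table, age):
    """Return True if a row of the given age has outlived its retention window."""
    return age > RETENTION[table]

print(is_prunable("STAGLOG", timedelta(days=45)))  # True
```

Keeping the policy in one place like this also makes it easy to review when the business changes its mind about a retention period.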
Keeping your e-commerce database from getting ‘clogged up’ with all that unused data is not the only reason to prune. The other reason is how long it takes to delete data. While inserts into the MEMBER/USERS tables in WebSphere Commerce databases are generally nice and fast, deletes are not. I’ve had systems where it took 0.4 seconds to delete a single member. Granted, this was on older hardware and DB2 8, but still – that means I can only delete 150 users per minute, and I’ve seen them added at a rate of 800 per minute or more on medium-sized implementations. Yes, indeed, I can add users faster than I can delete them. There are some database and application design issues in there, of course – if not set up right, WebSphere Commerce can create guest users when they are not needed (see information on the isGeneric method in this tech note: http://www-01.ibm.com/support/docview.wss?uid=swg21316161) (note: IBM has removed this technote, and I cannot find it elsewhere). That’s an education issue for developers. And the level of referential integrity in WebSphere Commerce databases is high – over 300 tables depend either directly or indirectly on the MEMBER/USERS tables, and all foreign keys cascade on delete. That’s one reason why deletions can be so slow.
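When every delete cascades through hundreds of dependent tables, the one thing you don’t want is a single massive DELETE holding locks and filling transaction logs. The usual trick is to delete in small, committed batches. A toy sketch of that pattern, again using SQLite as a stand-in for DB2 – the two-table schema and the registertype values are invented stand-ins, not the real 300-plus-table Commerce RI:

```python
import sqlite3

# Toy schema: one dependent table cascading from member, standing in for
# the hundreds of dependents in a real Commerce database.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE member (member_id INTEGER PRIMARY KEY, registertype TEXT)")
conn.execute("""CREATE TABLE address (
    address_id INTEGER PRIMARY KEY,
    member_id INTEGER REFERENCES member(member_id) ON DELETE CASCADE)""")

# 1000 members, half of them guests ('G'), each with one address.
conn.executemany("INSERT INTO member VALUES (?, ?)",
                 [(i, "G" if i % 2 else "R") for i in range(1, 1001)])
conn.executemany("INSERT INTO address VALUES (?, ?)",
                 [(i, i) for i in range(1, 1001)])

BATCH = 100
total = 0
while True:
    # Delete guest members a batch at a time, committing between batches
    # so locks are released and the log stays small.
    cur = conn.execute(
        """DELETE FROM member WHERE member_id IN (
               SELECT member_id FROM member WHERE registertype = 'G' LIMIT ?)""",
        (BATCH,))
    conn.commit()
    total += cur.rowcount
    if cur.rowcount < BATCH:
        break

print(total)  # 500 guest members deleted; their addresses cascaded away
```

Batching doesn’t make each individual cascade any cheaper, but it keeps the pruning job from blocking the live workload while it grinds through the backlog.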
So whether you’re using an application with a vended tool like DBCLEAN for WebSphere Commerce, or you’ve got a custom application you have to write the scripts for yourself, don’t forget to implement data pruning early in the application/database’s life to keep the ‘data cholesterol’ from slowing things down.
For posts on a few specific pruning areas in WebSphere Commerce, see my <a href="https://datageek.blog/en/?s=data+pruning">posts on Data Pruning/DBClean</a>.
Fortunately you stressed this point and I was able to keep my focus on it; otherwise it could have slipped through the cracks. Thank you! I recommend getting it on the project plan so it will get implemented and tested. It will save you more than headaches later.
Quick question, because I haven’t found much on the subject: what do you suggest for archiving WCS data like Orders, Orderitems, Address, etc.? We have long-term reporting needs and I have a script that archives data, but I am wondering if there is a better way. Thanks!
WebSphere Commerce doesn’t have any tools for archiving data. The options tend to be the same as with any DB2 database – custom scripts (using load from cursor or whatever) or replication. Last week on the DB2 Night show, Elder was presenting on an interesting method where you could declare a cursor on a remote database without even setting up federation. (http://www.dbisoftware.com/blog/db2nightshow.php?id=339)
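For the custom-script route, the core of the job is an insert-select into an archive table followed by a delete from the live table, ideally in one transaction – the plain-SQL analogue of what a load-from-cursor does on DB2. A hedged sketch, with an invented orders schema and cutoff (not the real Commerce ORDERS table):

```python
import sqlite3

# Invented schema and status codes for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (orders_id INTEGER PRIMARY KEY, status TEXT, placed TEXT)")
conn.execute("CREATE TABLE orders_archive (orders_id INTEGER PRIMARY KEY, status TEXT, placed TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "S", "2010-06-01"),  # shipped long ago -> archive
    (2, "S", "2012-02-20"),  # recent -> keep live
    (3, "P", "2010-05-01"),  # still pending -> keep live regardless of age
])

cutoff = "2011-01-01"
with conn:  # one transaction: the copy and the delete succeed or fail together
    conn.execute("""INSERT INTO orders_archive
                    SELECT * FROM orders WHERE status = 'S' AND placed < ?""",
                 (cutoff,))
    conn.execute("DELETE FROM orders WHERE status = 'S' AND placed < ?",
                 (cutoff,))

print(conn.execute("SELECT COUNT(*) FROM orders_archive").fetchone()[0])  # 1
```

The same shape works whether the archive table lives in the same database or a separate reporting one; only the plumbing for moving the rows changes.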
IBM will support reads on the standby in an HADR pair – and charge you an extra licensing cost to boot – but that only solves reporting needs, not actual archiving. I have seen replication used, with even Commerce itself accessing the reporting/ODS database to retrieve things like older orders. I’m amazed at how rare such setups are, unfortunately. I’m certainly a fan of archiving, even if you’re just moving data to different tables in the same database, because it addresses some of the same issues as actually deleting the data does.
Interesting that this just showed up – a DB2 Tech Talk about data archiving (from the DB2 perspective, not the Commerce perspective): https://www.ibm.com/developerworks/mydeveloperworks/blogs/SQLTips4DB2LUW/entry/db2_tech_talk_on_march_29_about_data_acrhiving1?lang=en&utm_source=twitterfeed&utm_medium=twitter
THANK YOU for passing that along!! The timing is perfect.
Nice post! I can see the lack of a pruning process in most projects.