This is one of those blog entries that I know will be obsolete, will show my own ignorance, and will make me cringe in five years, but I have to write it anyway. I know that I am biased. I come from a relational database background, and am working hard to open my mind to the new data techniques that seem so popular these days.
This is also a new direction for me to come at blogging. I am used to blogging from a place of expertise, filling in the gaps with research and experimentation, but I am by no means an expert in data science. I learn so much on every topic I look into.
I come at data science from the perspective of a traditional database administrator for a relational database. I think that many DBAs have a hard time understanding why data science is even needed as a new approach. We’ve been involved with projects solving business or science problems with data for decades. However, the traditional DBA comes at it from an old-school organization of roles and the software development life cycle.
The Role of the DBA
I’m not surprised that there are a variety of definitions of data science and of a data scientist. While it is much more opaque to those outside of the database administration specialty, there are a large number of different types of DBAs and roles that DBAs play. Largely, database administrators are in control of Relational Databases, and we tend to think that the relational database model offers a lot of advantages. This may be a prejudice that I continue to hold as I explore and learn more about data science. I still tend to think there are a significant number of data science problems that could be solved just by writing the right query.
Relational databases are highly dependent on a well-defined schema (or organization of data) up front. This can be seen either as a disadvantage or as an advantage. On the negative side, it requires work up front to find the right schema for the data and to make the data conform to that schema. There may be a lot of data cleaning, often done through ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) cycles. Part of the work of a data scientist is the proper cleaning of data and getting it into the right format anyway. A significant chunk of the work of data science seems to be in similar data cleaning cycles, though they may be done later rather than when the data is initially placed into a system.
DBA roles are usually easily divided into two worlds, though many DBAs straddle the line and play in both worlds. The DBA’s universe is split into:
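To make the ETL idea concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made up for illustration: the `orders` table, its columns, the sample rows, and the cleaning rules (trim and lowercase names, skip rows whose amount is not a number). It is one possible shape for such a cycle, not a recommended pipeline.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; column names and values are invented for this sketch.
RAW_CSV = """customer,amount,order_date
 Alice ,10.50,2020-01-05
bob,n/a,2020-01-06
ALICE,4.50,2020-01-07
"""


def extract(text):
    """Extract: parse the CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize customer names and drop rows with a non-numeric amount."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # assumed rule: skip rows we cannot interpret
        clean.append((row["customer"].strip().lower(), amount, row["order_date"]))
    return clean


def load(conn, rows):
    """Load: create the schema up front, then insert the cleaned rows."""
    conn.execute(
        "CREATE TABLE orders ("
        "customer TEXT NOT NULL, amount REAL NOT NULL, order_date TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

# Once the data conforms to the schema, the analysis is a single query.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

In an ELT variant the raw rows would be loaded first and the cleaning would happen afterward, typically in SQL inside the database.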
- Physical or Systems DBA Tasks – The systems DBA focuses on things that involve the RDBMS software and the underlying hardware. These are the DBAs who install, upgrade, and patch the RDBMS. Often they’re the ones who lay out the database on disk and understand the underlying I/O sub-systems. They help manage and tune memory consumption. This role is usually heavily involved in planning for high availability or disaster recovery, and is likely to be the one responsible for backups and restores. Occasionally, the systems DBA is called in to help identify poorly performing SQL and alter the SQL or the database structure for higher performance. If there is no DBA involved, this role is most likely to be taken on by systems administrators for better or for worse. System administrators may not understand the nuances of database recovery or the many options for high availability and disaster recovery at the database level.
- Logical or Development DBA Tasks – The logical DBA is more likely to be the one who is helping to turn the logical layout of a database into actual statements to create and alter objects. The logical DBA often writes SQL, and should be involved in code reviews and tuning of SQL. Some logical DBAs may actually be writing code. This part of the DBA role is most likely to be taken on by developers themselves either entirely or in part. Involving a trained DBA instead has great benefits for performance and ensuring that concurrency is at a maximum.
I expect I’ll understand more nuances like these about the many roles in data science as I learn more.
The way I’m defining data science for myself at this point is bringing technology, analysis/statistics, and domain expertise together to solve problems with data. That is an over-simplification, and there is a strong factor of the scientific method involved as well – an aspect that doesn’t always find its way into practical old-school IT. The current practice of data science on the technology side is heavy on the use of the Python and R languages. Combining data science with big data techniques brings in tools like Hadoop and Spark. From the DBA’s perspective it is interesting seeing the prevalence of the use of a simple CSV format for much data, something that DBAs tend to look down on. I’m still surprised at the back-flips that business users make Microsoft Excel do for tasks that lend themselves much more naturally to a relational database.
When it comes to relational databases in data science, I see PostgreSQL mentioned a lot, along with MySQL and even MS SQL Server. It is interesting to me that large proprietary database management systems like the one I’ve spent 17 years building a career on are dismissed or seen as sources to pull data out of without being the real partners in data manipulation that they could be.
It is also interesting to me to see the ways that Machine Learning and AI have been co-opted to mean techniques using statistical models to work with data. It feels to me like this is so much shallower than I would have taken them to mean. The computer isn’t the one learning – the data scientist is, as they apply different statistical models and refine the statistical models they were already using.
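As a small illustration of what "a statistical model" can mean at its simplest, here is a sketch of fitting a straight line by ordinary least squares in plain Python. The data points are invented, and `fit_line` is a hypothetical helper written for this post rather than a function from any library.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized; the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented sample data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The "learning" here is nothing more than estimating two numbers from the data. Choosing this model over another, and deciding whether the fit is any good, is still up to the person running it.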
It seems to me that there are three main roles in data science, all of which I need to understand better.
- Technology expert or data engineer – The variety of skills represented in this role seems to cover much of the range of skills in the modern IT organization, from systems administrators to DBAs and programmers. Sometimes the bulk of these skills may be in one person, or they may be fulfilled by multiple domain experts.
- Statistical modeling expert – This person is a genius at math and understands the models being used from that perspective. They may also have an expertise in presenting data in a visual way.
- Business domain expert – This role is critical for asking the right questions and understanding what the answers might mean, and serves as the critical communications bridge between mathematicians/engineers and business consumers of the analyses generated. They may also play a role in helping to present data in a meaningful way.
There may be other roles that I’m missing here, and nuances of the roles as well.
I’m new to data science and absorbing all I can. I have created a page on data science learning resources that I’ll be updating as I go to share resources that I find.
Please comment below to share what I’m missing here and what I can understand better.