Big Data is a loosely defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools (Wikipedia). This broad definition often leads one to believe that the key to addressing Big Data is to quickly collect large volumes of data from varying source systems in an effort to extract value from that mass of data.
Many organizations have tried to deal with the Big Data challenge by employing legacy tools that were never really designed to deal, end to end, with these types of challenges. To properly address today’s Big Data challenges, new tools, new techniques, and new approaches must be employed. Legacy tools and techniques for collecting and organizing data focus on but one aspect of the Big Data problem: volume. Complexity was addressed with modeling techniques and querying standards applied to data that was primarily in structured form. However, Big Data problems reach further than volume; data characteristics such as disparity, velocity, and (often most important) the complexity of unstructured data must also be addressed.
Disparate data problems are often dealt with by normalizing data that is found in different systems and in different formats into a common data model. The issue this often creates is a loss of data fidelity, as the normalized data has been altered from its original format. Additionally, this process is very time consuming and often extremely expensive. Some organizations choose not to normalize the data into a common model and prefer instead to find a way to model and store each different data format. This creates the added challenge of finding a means to run analytics against these different data formats.
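To make the fidelity trade-off concrete, here is a minimal sketch of the normalization step described above. Everything in it is an assumption made for illustration: the two source records, their field names, the three-field common model, and the helper functions are all hypothetical and do not come from any particular system.

```python
# Hypothetical records for the same employee held in two source systems.
hr_record = {
    "emp_id": "E-1001",
    "full_name": "Ada Lovelace",
    "hire_date": "03/15/2019",   # US-style date string
    "badge_color": "blue",       # has no slot in the common model
}
payroll_record = {
    "EmployeeNumber": 1001,
    "First": "Ada",
    "Last": "Lovelace",
    "Start": "2019-03-15",       # ISO-style date string
    "pay_grade": "G7",           # has no slot in the common model
}


def normalize_hr(rec):
    """Map a hypothetical HR record onto the assumed common model."""
    month, day, year = rec["hire_date"].split("/")
    return {
        "employee_id": int(rec["emp_id"].split("-")[1]),
        "name": rec["full_name"],
        "start_date": f"{year}-{month}-{day}",
    }


def normalize_payroll(rec):
    """Map a hypothetical payroll record onto the assumed common model."""
    return {
        "employee_id": int(rec["EmployeeNumber"]),
        "name": f"{rec['First']} {rec['Last']}",
        "start_date": rec["Start"],
    }


def dropped_fields(rec, mapped_fields):
    """List the source fields that did not survive normalization."""
    return sorted(set(rec) - set(mapped_fields))


hr_lost = dropped_fields(hr_record, ["emp_id", "full_name", "hire_date"])
payroll_lost = dropped_fields(
    payroll_record, ["EmployeeNumber", "First", "Last", "Start"]
)
```

In this sketch both records normalize to the same common-model row, which is what makes cross-system analytics possible. The cost is that `badge_color`, `pay_grade`, and the original identifier and date formats no longer appear anywhere in the normalized data.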
The velocity issue centers on the sheer rate at which new data is being loaded into existing systems, and at which new systems are coming online with data that must be factored into the overall Big Data picture. This is truly a large-scale issue in today’s economy: mergers and acquisitions bring in new systems from new entities, and siloed departments decide, in a bit of a vacuum, to employ new tools for HR, payroll, finance, procurement, etc.
Data complexity appears to be at the center of most of the Big Data problems facing today’s larger organizations. To truly address the Big Data challenge, one must also find a means to collect, organize, and derive value from external data (such as social media, third-party data, and internet data) as well as internal data sources (such as unstructured content and transaction data, among others). This seems to create the most confusion for many organizations, and it is the area where most need guidance.
So, what is really the solution for dealing with the Big Data issue?
To deal effectively with the Big Data issue, an organization must look to, and also beyond, transactional, structured data. We must look to the fastest growing area of available data: unstructured data (the most prevalent being the high volume of social media from blogs, business and personal information sharing sites, etc.). Massive amounts of data are not new to many organizations, and the concept of Big Data, while a new “term of art”, is neither groundbreaking nor recent. Massive volumes of data have been a reality for some time. The real Big Data issue relates to the unstructured content being generated in massive volumes. Far too many organizations and their advisors are looking to data warehousing approaches to solve issues with complex structured, unstructured, and semi-structured data, and they are falling short.
Those who will be truly successful in dealing with the Big Data problem will need a deep understanding of unstructured and semi-structured data and of the technologies, processes, and data governance that must be employed to deal with it (without setting aside the need to incorporate structured data into the equation).
Big Data is real, and dealing with all aspects of the overall problem (disparity, velocity, and complexity) is the only true means to get ahead of it and derive true value from it.