In the first blog post of this series, I discussed how common practices from the “old” world of Enterprise Content Management (ECM) can be applied in the “new” Big Data world. Applying these practices creates a more governable Big Data environment.
In the previous blog post I discussed information life cycles and how retention schemes help control data growth by purging data that is no longer of value. This post covers two much-discussed topics in the Big Data world: privacy and pre-processing.
Some Big Data experts state that it is all about security and governance. I agree. The question is: what data are you allowed to collect? In a pure Big Data scenario, you ingest raw data without really caring about its precise content; extraction and analysis only start when you want to analyze the data to gain insights. But legislation, certainly the rules on Personally Identifiable Information (PII), restricts the data you may store. You may only store personal information when it will be used for a well-defined and restricted purpose. European law also states that the data owners must give permission to use this data, and only employees who have to work with the data are allowed to access it. In this case, you really are obliged to know the contents of the data you ingest into your Big Data system. Furthermore, you have to keep track of that data, because the owners are entitled to ask what personal data is stored, and in the near future they can even ask you to erase their data. The penalties for failure to comply with the European Union requirements, such as data breach reporting, can run up to 5% of annual gross revenue. Can your Big Data environment cater to these requirements?
In the ECM world, all data to be stored has to be classified. Classification can be done by adding metadata to the data files, describing, among other things, their contents, or by placing the data in a file or folder hierarchy. The purpose of classification is to describe the contents of data, in most cases documents. This helps in finding the documents and controlling the data. When documents and data contain PII, you have to explicitly classify the data as personal. You can then impose classification schemes that limit access to PII to small, well-defined groups of users, and audit the use of the data to spot possible data breaches early. So I think it is essential to really know what you are ingesting into your Big Data system: it might contain data that is restricted. Governance and compliance may influence who can access the data, for what purpose, and when to erase data that is no longer of use.
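To make this concrete, here is a minimal sketch of such a classification scheme in Python. All names (`DataClassification`, `can_access`, the groups) are illustrative assumptions, not a real product API; the point is only that an explicit PII flag plus an allowed-groups list lets you enforce and audit access:

```python
from dataclasses import dataclass, field

# Hypothetical classification record attached to an ingested data set.
@dataclass
class DataClassification:
    contains_pii: bool
    purpose: str                         # the well-defined purpose of use
    allowed_groups: set = field(default_factory=set)

def can_access(classification: DataClassification, user_groups: set) -> bool:
    """Allow access to PII only for users in an explicitly permitted group."""
    if not classification.contains_pii:
        return True
    return bool(classification.allowed_groups & user_groups)

record = DataClassification(contains_pii=True,
                            purpose="billing",
                            allowed_groups={"finance"})
print(can_access(record, {"finance"}))    # True
print(can_access(record, {"marketing"}))  # False
```

In a real environment the same check would sit in the storage layer's authorization hook, and every decision would be written to an audit log.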
Anonymizing the personal data can be a strategy to make the data less personal, so that fewer data management rules apply. This can be quite complex: again, you need to know what you are ingesting into the Big Data environment, and it requires pre-processing of the data. I will discuss pre-processing next.
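One common technique here is pseudonymization: replacing a direct identifier with a keyed hash so records can still be joined without exposing the identifier. The sketch below is an assumption about how this could look, not a complete anonymization solution (a keyed hash alone does not remove all re-identification risk, and the key must be managed securely):

```python
import hashlib
import hmac

# Illustrative only: the key must come from a managed secret store in practice.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(value: str) -> str:
    """Replace an identifier with a keyed SHA-256 hash (stable, non-reversible)."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"customer_id": "C-1001", "email": "jane@example.com", "amount": 99.50}
anonymized = {
    "customer_id": pseudonymize(record["customer_id"]),
    # Direct identifiers that are not needed for analysis are simply dropped.
    "amount": record["amount"],
}
```

Because the hash is deterministic, analyses that group or join on `customer_id` still work, while the raw identifier never enters the Big Data store.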
Pre-processing has always been part of ECM solutions. There it is called document capture and consists of document conversion (mostly scanning), classification, adding metadata, and text recognition. In its Big Data architecture, IBM uses the “pre-process raw data” pattern, which processes the data before storing it (back) in the big data storage systems. The steps performed during document capture return, in their own specific form, in this pattern: structuring data into a known format, extracting data, adding meaning. Among others, the pattern has the following features:
- Document and text classification
- Feature extraction
- Image and text segmentation
These features are tuned for unstructured data and should be familiar to any ECM expert. However, it is advisable to perform the same steps for structured data as well. The pattern allows for this:
“Before the data can be analyzed, it has to be in a format that can be used for entity resolution or for querying required data. Such pre-processed data can be stored in a storage system. Although pre-processing is often thought of as trivial, it can be very complex and time-consuming.”
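The three steps named above (structuring into a known format, feature extraction, classification) can be sketched in a few lines of Python. The rules and labels are deliberately naive assumptions; a real pipeline would use proper text-analytics components:

```python
import re

# Naive e-mail matcher, standing in for real PII detection.
PII_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def preprocess(raw: str) -> dict:
    """Pre-process one raw document before it is written to big data storage."""
    text = " ".join(raw.split())              # structure into a known format
    features = {                              # feature extraction
        "length": len(text),
        "word_count": len(text.split()),
        "emails": PII_PATTERN.findall(text),
    }
    label = "pii" if features["emails"] else "generic"   # classification
    return {"text": text, "features": features, "class": label}

doc = preprocess("Contact   jane@example.com \n for details.")
```

Even this trivial version shows why pre-processing is not as trivial as it looks: the detection rules, the feature set, and the classification scheme all have to be designed, tested, and maintained.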
Big Data solutions are mostly dominated by Hadoop systems and technologies based on MapReduce, which provide out-of-the-box distributed storage and processing. From an ECM point of view, however, these technologies should be enhanced with reformatting and with retrieving context and meaning.
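To show how such enhancement fits the MapReduce model, here is a plain-Python sketch (no Hadoop involved) in which the map step attaches a classification to each document and the reduce step aggregates per class. The stand-in classifier and function names are my own illustrative assumptions:

```python
from collections import defaultdict

def map_phase(doc: str):
    """Map step: emit a (classification, 1) pair per document."""
    label = "pii" if "@" in doc else "generic"   # trivial stand-in classifier
    yield (label, 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts per classification key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

docs = ["invoice for jane@example.com", "server log entry", "weekly report"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(pairs))  # {'pii': 1, 'generic': 2}
```

The same shape carries over to a real Hadoop job: the ECM-style enrichment (classification, extraction) lives in the mapper, so context and meaning are added while the data is being processed at scale.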
In the next blog post, I’ll discuss the last two topics: preservation of data and versioning of data or data sets.