
Automatic metadata: why, actually?

Reinoud Kaasschieter
2019-03-26

As long as I have been in the field of Enterprise Content Management (ECM), there have been discussions about the metadata of documents. Assigning metadata is the means to search for and find documents. It is an important tool for starting a case or a process with a document. But the discussion always revolves around which and how much metadata is needed, who should record it and when it should be done.

Before we can determine by whom and when metadata should be recorded, we will first dive deeper into the metadata phenomenon. The term metadata means nothing other than data about documents. The Rotterdam city archive distinguishes three groups of metadata that can be assigned to documents. Other classifications are possible, but for convenience I use the archive’s. They recognize:

  1. Descriptive metadata (identification, interpretation, authentication, finding);
  2. Administrative or management metadata (authorization, logistic data, ownership, formal origin, accountability of management activities); and
  3. Technical metadata (software, hardware, storage format).

The first group mainly consists of data about the original context of the documents; the second group consists of data for the archive system; and the third group consists of technical data about the system with which the information is created and managed.

The last group of metadata, the technical, generally presents few problems: within ECM systems it is filled in automatically by the system itself. The first two groups are often discussed, not only among experts but also among users of ECM systems. With users, the topic of conversation is often the usefulness and necessity of metadata.

Along with the associated question of how much input is necessary. More specifically: users wonder how practical the metadata is and how much work it takes to add the correct metadata to the documents. And whether they themselves see the usefulness of it in their daily work. The discussion does not focus on the usefulness of particular metadata, but rather on the amount of work it takes to fill in this metadata correctly and completely.

I don’t want to describe here which metadata is necessary or not. That is highly dependent on the organization, the standards that have been chosen and the processes the documents go through. What I do want to go into is the amount of metadata that is used and that must be filled in by employees.

Knowledge workers

Traditionally, incoming documents are processed by a mailroom. Here the first assignment of metadata values takes place. The mailroom employees ensure that the descriptive and administrative metadata are filled in. However, in some cases the mailroom employee has insufficient knowledge to describe documents fully. A knowledge worker, for example the first handler of the document, must then supplement this information.

Nowadays documents often arrive directly at the knowledge workers. For example, in the case of e-mails that are sent directly to employees. Or employees prepare documents themselves, which must also be described with metadata.

Administrative metadata can also involve large numbers of data fields. With this metadata, among other things, the use of documents is recorded. In the paper age, these data were kept in the archive folder or on a special form (the minute). Now they are held as metadata. Metadata models for government archives, for example, prescribe dozens of fields that all have to be filled in. Some can be filled in automatically because they are system fields. But others must be filled in manually.

The discussion about the usefulness and necessity of metadata is particularly important for knowledge workers when they are asked to provide documents with all sorts of metadata. It is not the primary task of these employees to describe documents with metadata. Describing documents is seen as an extra burden, especially when employees have to add data that are not directly relevant to their own work. Let me give some examples:

  • An insurer asked the medical team of insurance doctors to provide documents with five mandatory and ten optional attributes. This was initially seen as the minimum necessary. But a few years later it turned out that the optional fields were barely filled in or used. The five mandatory fields were enough to work with: the optional fields were removed from the system as unnecessary.
  • At a ministry, the project managers found out that the practitioners did not feel like filling out all kinds of metadata fields. They started working around the system by continuing to use network drives. When the system was rebuilt, it was therefore decided to reduce the metadata fields to a single one: the name of the document.
  • An ECM project at another ministry failed because the users were simply unwilling to fill in all kinds of archive-related document characteristics: too much work. Meanwhile, the archive staff lacked the knowledge and time to do it themselves. The flood of documents, and the delays in processing them, were simply too great.

In general, it can be said that users will fill in at most five data fields when they upload a new document into a system. Fewer is better; more leads to a greater chance of acceptance problems.

Automatic filling of metadata

When a lot of metadata is needed for whatever reason and it is not feasible to enter it manually, what then? We can decide to drastically reduce the number of data fields, but we can also choose to fill fields automatically. The technology is there, but sometimes confidence in the operation of such systems is lacking. ‘Auto-classification’ is marketed as the only method to fill metadata, because the flood of documents and other unstructured data is said to be too large to process manually. But how many documents actually have to be processed every day? Perhaps not that many.

If it is not possible, for whatever reason, to get metadata filled in manually, automatic filling is perhaps the only solution. Because sooner or later these data are needed: to retrieve documents, to reconstruct their use, to manage them, and to delete them in a controlled manner in order to keep the size of the archive under control.

It is also important to realize that errors are made when entering data manually. It remains human work, of course. People can make accidental mistakes, such as typing errors, reading errors and the like. But errors can also occur because users don’t feel like doing everything carefully: they fill in just anything, do not take the time to understand the document content, feel put under pressure, and so on.

An old way to avoid mistakes is to have the same metadata entered multiple times. The system compares the results and flags differences. But that does mean doubling the effort of the employees, so I have seen little use of this method. Moreover, since careful entry is required in any case, does it pay to reduce the risk of incorrect metadata in this way? And how do you measure that?
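
To give an idea of how little such a double-entry check involves technically, here is a minimal sketch in Python (not anyone’s actual implementation): two independent entries of the same metadata are compared as simple key-value pairs and any mismatching fields are flagged for review. The field names are made up for illustration.

def compare_entries(first, second):
    """Return the names of fields whose values differ between the two entries."""
    fields = set(first) | set(second)
    return sorted(field for field in fields if first.get(field) != second.get(field))

entry_a = {"document_type": "invoice", "order_number": "2019-0042", "sender": "ACME BV"}
entry_b = {"document_type": "invoice", "order_number": "2019-0042", "sender": "ACME B.V."}

differences = compare_entries(entry_a, entry_b)
if differences:
    print("Fields to review:", differences)  # prints: Fields to review: ['sender']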

The goal of automatic classification is not to eliminate the many small human errors from field contents; errors can also occur during machine recognition of text. The main purpose of automatic classification is to process the large bulk of documents, with their large number of attributes, automatically. The human effort to do all this manually can simply be too great. Machines can do the work that otherwise would not get done. Automatic classification can provide documents with those characteristics that are useful but cannot be filled in manually.

To be able to automatically determine the metadata of documents, different methods are possible. Some are described below, in order of increasing technical complexity.

Form recognition

Form recognition is the automatic reading of forms. To do this, the form must first be recognized, after which the data fields can be read out on the basis of the format that the machine has recognized and knows. Incidentally, this applies to both electronic and paper documents. With paper documents, of course, the form must first be digitized, for example by scanning. Then the text fields can be converted.

The text fields can then be used to fill the metadata of the document. Sometimes this can be done immediately; sometimes a check or correction must take place before the values can be used as metadata. Form recognition is (almost) standard in document input systems. These systems are self-learning, i.e. they learn to recognize the different forms themselves using a set of sample forms.
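
As an illustration of the idea (not of any particular product), here is a minimal Python sketch: once a scanned form has been OCR’d to plain text and its template has been recognized, the labelled fields can be read out with simple patterns. The labels and field names are purely hypothetical.

import re

# Field labels as they appear on the (OCR'd) form; names are illustrative.
FIELD_PATTERNS = {
    "order_number": re.compile(r"Order number:\s*(\S+)"),
    "customer": re.compile(r"Customer name:\s*(.+)"),
    "order_date": re.compile(r"Order date:\s*(\d{2}-\d{2}-\d{4})"),
}

def extract_form_fields(ocr_text):
    """Read out the known fields of a recognized form from its OCR text."""
    metadata = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            metadata[field] = match.group(1).strip()
    return metadata

sample = """Order form
Order number: ORD-2019-0042
Customer name: ACME BV
Order date: 26-03-2019"""
print(extract_form_fields(sample))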

Links with external systems

Often, the metadata of documents can be looked up in other systems. For example, order data, such as the order number, can be retrieved from an order system and stored as metadata with the document. In this way, the order data is captured at the moment the order document arrives. This can be important if the external systems do not keep a history of order data and, for example, only ‘know’ current orders. Document archives are usually not the source system for all kinds of data. We assume that the source system contains the ‘truth’: the ‘single source of truth’. That is why we can simply copy the values from source systems into our document archive. After all, the documents with their metadata are stored unchangeably in an archive.
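
A minimal Python sketch of such an enrichment step. The in-memory dictionary stands in for a query to the real order system, and all field names are invented for illustration.

# The dictionary stands in for a lookup in the order system (the source system).
ORDER_SYSTEM = {
    "ORD-2019-0042": {"customer": "ACME BV", "order_date": "2019-03-26"},
}

def enrich_metadata(document_metadata):
    """Copy order data from the source system into the document's metadata."""
    order = ORDER_SYSTEM.get(document_metadata.get("order_number"))
    if order:
        # The archive keeps its own copy: the order system may later forget
        # closed orders, but the archived metadata remains unchanged.
        document_metadata["customer"] = order["customer"]
        document_metadata["order_date"] = order["order_date"]
    return document_metadata

document = {"document_type": "order confirmation", "order_number": "ORD-2019-0042"}
print(enrich_metadata(document))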

Use in case systems

Metadata concerning the use of documents, which often forms part of the set of metadata for archiving, can be determined automatically in several ways. When it is strictly about who has seen or edited which document, the logging of the document management system is a good source of information. By capturing the log data in the metadata, this usage history can be retained.

If the context of use really has to be recorded, it is usually necessary to support the handling of documents with information technology. This can be a case system, but also process control (‘workflow’), a CRM system and the like. These systems then contain the context of the use: the case or the process. The context describes who used the documents, where and when, for example in which processes. The context can also be a file: all documents in the file then belong to the same context. Manually recording context data with documents is often seen as too laborious by employees. By letting employees work within the process, the context is known and can be added to the documents automatically, thereby taking that ‘burden’ away from the employee.
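
The principle can be sketched in a few lines of Python: when a document is registered while the employee is working in a case, the case system already knows the context and stamps it onto the document’s metadata. The case structure and field names are hypothetical.

from datetime import date

def add_document_to_case(case, document, user):
    """Register a document in a case and copy the case context into its metadata."""
    document.update(
        case_number=case["case_number"],
        process=case["process"],
        handled_by=user,
        registered_on=date.today().isoformat(),
    )
    case.setdefault("documents", []).append(document)
    return document

complaint_case = {"case_number": "CASE-2019-117", "process": "complaint handling"}
incoming_mail = {"title": "customer e-mail", "document_type": "e-mail"}
print(add_document_to_case(complaint_case, incoming_mail, user="j.jansen"))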

Text analysis

Forms have a structure. This structure is expressed in the format of the form, for example as a form layout or as an XML file. This structure makes it possible to derive metadata from the document. But what do you do if you only have plain text and you need to assign metadata values based on reading and interpretation?

Mailroom staff or knowledge workers read a text and interpret it. They interpret the text on the basis of its content, but also on the basis of previously acquired knowledge about the subject of the text. This interpretation enables them to determine the correct metadata for a document.

At this moment computers are starting to learn how to interpret texts. This enables them to automatically provide large quantities of documents with metadata.

Classifying on the basis of text analysis can be done at different levels. The easiest way is to recognize text elements on the basis of formal criteria. For example, an IBAN can be recognized by the format of the bank account number. Or a document reference can be found because it immediately follows the text ‘Our reference’.
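
A minimal Python sketch of recognition on formal criteria. The IBAN pattern only checks the general shape of the number (country code, check digits, alphanumeric remainder), not the check-digit arithmetic, and the label ‘Our reference’ is just an example.

import re

# A simplified IBAN pattern: country code, two check digits, alphanumeric remainder.
IBAN_PATTERN = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b")
REFERENCE_PATTERN = re.compile(r"Our reference:\s*(\S+)")

def extract_formal_fields(text):
    """Pick out metadata values that can be recognized by their formal shape."""
    metadata = {}
    iban = IBAN_PATTERN.search(text)
    if iban:
        metadata["iban"] = iban.group(0)
    reference = REFERENCE_PATTERN.search(text)
    if reference:
        metadata["our_reference"] = reference.group(1)
    return metadata

letter = "Our reference: KZ-2019-881\nPlease transfer the amount to NL91ABNA0417164300."
print(extract_formal_fields(letter))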

But it can also be more advanced. In the very near future it will be possible to interpret the content of a text so that its meaning can be determined automatically, and to classify on the basis of that interpretation. The outcome of such an interpretation could be: ‘This e-mail is, with a probability of 90 percent, a complaint about product XYZ’. Based on this outcome, a complaints procedure about the product can be started. The result can in turn be used to classify the e-mail document as a ‘product complaint’, and so on.
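
By way of illustration, a toy sketch of such a probabilistic classifier, here built with the scikit-learn library. The handful of training sentences and the labels are invented; a real system would be trained on a large set of historical documents and would need careful evaluation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny, made-up training set; a real system needs far more historical examples.
training_texts = [
    "The product XYZ stopped working after two days, I want my money back",
    "Product XYZ arrived broken, this is unacceptable",
    "Please send me a quote for 100 units of product XYZ",
    "I would like to change the delivery address of my order",
]
training_labels = ["product complaint", "product complaint", "sales inquiry", "order change"]

classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(training_texts, training_labels)

email = "Product XYZ broke within a week, I am very disappointed"
probabilities = dict(zip(classifier.classes_, classifier.predict_proba([email])[0]))
label = max(probabilities, key=probabilities.get)
print(f"This e-mail is about a '{label}' with probability {probabilities[label]:.0%}")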

Does a machine do this qualitatively better than a human? We are not that far yet. But, as written at the beginning, these tools are especially useful when large amounts of documents and metadata need to be processed. It is not so much about outperforming people; it is about relieving people by having computers do the work: having computers fill in metadata that is useful and necessary, but for which no people can be found.

With the advent of advanced tools for classifying documents automatically, it has become possible to process large quantities of documents and metadata. This makes the discussion about the usefulness and necessity of metadata easier. If classification can be done automatically, metadata can be put to better use. The acceptance of ECM systems also improves, because the need for manual work can be greatly reduced. And with that, the quality of the metadata improves too. Automatic classification technology has matured. It is time to look at how automatic classification can help improve ECM systems.

This opinion was written in collaboration with John Christiaanse, senior consultant in the field of ECM and information lifecycle governance at Capgemini.