The PULSAR principles of AI-ready data

James Hinchliffe
24 Aug 2022

Does FAIR go far enough to provide AI-readiness? Not quite – but it’s a great start. How can we build on a FAIR data foundation to be truly ready to make good use of AI?

For many R&D organizations, the desire to do new things with old data sees excitement about the potential of artificial intelligence and machine learning collide with the reality of legacy data that isn't fit for this new purpose. That collision is often the lightbulb moment where the idea of data management takes off.
In this article we will explain the six PULSAR principles of AI-ready data and show how FAIR data brings you closer to true AI-readiness.

P is for Provenanced
These days, many rich, public data sets are available (such as UniProt [1], ChEMBL [2] and Open PHACTS [3] in the life sciences) that organizations are using to enrich internal data and tackle research problems on a much bigger scale. When machine learning feeds into that work, ensuring that model predictions are reproducible is critical, and that requires a robust provenance chain showing what data was used to inform a model, where it came from and how it was generated.
The authors of FAIR anticipated this and accounted for provenance explicitly within the reusability principles, which state that data should be associated with information about its origin and processing. Truly FAIR data automatically covers the 'provenanced' principle – that's a good start!
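The provenance chain described above can be sketched in code. This is a minimal, hypothetical record loosely inspired by the W3C PROV model; the field names are illustrative, not a formal standard.

```python
def make_provenance_record(dataset_id, source, generated_by, derived_from=None):
    """Attach enough provenance metadata to a data set that a model
    audit can trace what was used, where it came from and how it was made."""
    return {
        "dataset": dataset_id,
        "source": source,                    # where the data came from
        "generated_by": generated_by,        # the process that produced it
        "derived_from": derived_from or [],  # upstream data sets, if any
    }

record = make_provenance_record(
    dataset_id="targets-2022-08",
    source="https://www.uniprot.org/",
    generated_by="quarterly-etl-v3",
    derived_from=["uniprot-release-2022_03"],
)
print(record["derived_from"])  # the chain an auditor would follow upstream
```

In practice such a record would live in the data set's metadata, so any model trained on it can point back to its inputs.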

U is for Unbiased
There are many well-known stories of biased AI systems with terrible consequences for real people. Usually, an AI system is biased because it was trained on biased data, often data containing hidden skews that were not obvious upfront.
Detecting bias in data is challenging, and FAIR does not have all the answers. But through findability, you can make your search for appropriate input data broad, and through accessibility, you can be more confident that you’ve obtained everything available. Then your data profile is less likely to have blind spots – and FAIR will have helped you to avoid one of AI’s biggest mistakes.
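One simple blind-spot check, among many possible, is to profile how classes are represented in a candidate training set. This toy sketch (invented labels, stdlib only) flags the kind of skew that can hide in legacy data:

```python
from collections import Counter

def class_balance(labels):
    """Return each label's share of the data set: a quick blind-spot check."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# A toy example: assay outcomes heavily skewed towards one class.
labels = ["active"] * 90 + ["inactive"] * 10
shares = class_balance(labels)
print(shares)  # {'active': 0.9, 'inactive': 0.1}
```

A profile like this doesn't prove data is unbiased, but a broad, FAIR-enabled search for inputs gives checks like it far fewer blind spots to miss.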

L is for Legal
Do you, and your AI, have the legal right to use a given data set? For example, it's generally fine to collect personal data provided you tell the people you collect it from what you'll do with it ('transparent processing'). But AI projects often make secondary use of data, beyond its original research purpose. Are you covered by the original terms of consent?
One of FAIR’s reusability principles specifically states that human- and machine-readable conditions of reuse should be included in metadata. So, while the machine-readable aspect is probably still a work in progress, at least AI system owners should be able to take an informed view on the appropriateness of truly FAIR data they consume.
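Once reuse conditions live in metadata, a pipeline can gate on them automatically. The sketch below is hypothetical: the metadata fields, licence allow-list and purpose tags are invented for illustration, not drawn from any standard.

```python
# Licences this (hypothetical) organization's policy permits for training.
ALLOWED_LICENSES = {"CC-BY-4.0", "CC0-1.0"}

def reuse_permitted(metadata, purpose):
    """Check machine-readable reuse conditions before data enters a pipeline."""
    licence = metadata.get("license")
    purposes = metadata.get("permitted_purposes", [])
    return licence in ALLOWED_LICENSES and purpose in purposes

dataset_meta = {
    "license": "CC-BY-4.0",
    "permitted_purposes": ["original-study", "secondary-research"],
}
print(reuse_permitted(dataset_meta, "secondary-research"))  # True
print(reuse_permitted(dataset_meta, "marketing"))           # False
```

A real system would use a proper rights vocabulary rather than ad hoc strings, but the shape of the check is the same.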

S is for Standardized
Everyone appreciates that standardization reduces problematic data variability and, while standardization may not enforce all quality aspects, it does prompt data practitioners to consider quality. Of course, some AI projects specifically act on unstructured data, e.g. when processing natural language. Here, standardization of the outputs, rather than the inputs, is the key, for example when concluding that two scientific papers are discussing the same disease even if they refer to it using different nomenclature.
Standardization is baked into FAIR’s interoperability principles, which recommend standardization of the way we physically store data (e.g. as triples in a triple store or tables in a relational database), the data exchange format (e.g. using OWL or JSON-LD) and the actual meaning of the data (e.g. using a public or industry data standard).
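To make the "triples" idea concrete: every statement becomes a uniform (subject, predicate, object) shape, so data from different sources can be stored and queried the same way. The identifiers below are invented for the sketch; a real system would use a triple store and standard vocabularies.

```python
# Statements from different domains, all in one uniform triple shape.
triples = {
    ("compound:42", "tested_against", "target:EGFR"),
    ("target:EGFR", "associated_with", "disease:NSCLC"),
    ("compound:42", "has_ic50_nm", "12.5"),
}

def objects(subject, predicate):
    """All objects asserted for a given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects("compound:42", "tested_against"))  # {'target:EGFR'}
```

Because every fact has the same shape, joining a new source is just adding triples, not redesigning a schema.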

A is for Activated
Activated data is ready to use – for example, the data sets you’re going to feed to your AI system are either joined together or ready to be joined. Big data and AI often generate novel insights from the combination of historically siloed data types – for example, chemistry and biology data in a search for new medicines – but connecting data sets from multiple domains can be surprisingly complicated.
FAIR’s interoperability principle is the key here. With interoperable data, close attention should have been paid already to those key joining points on the edges of data sets and data models, building in interdisciplinary interoperability from the start.
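The pay-off of activated data is that combining siloed domains becomes a mechanical join on a shared, well-modelled key. This toy example (invented compounds, targets and values) joins chemistry records to biology records on a target identifier:

```python
# Chemistry records: compounds and their measured potency per target.
chemistry = [
    {"compound": "CPD-1", "target_id": "T001", "ic50_nm": 15.0},
    {"compound": "CPD-2", "target_id": "T002", "ic50_nm": 230.0},
]

# Biology records keyed by the same target identifier.
biology = {
    "T001": {"gene": "EGFR", "pathway": "RTK signalling"},
    "T002": {"gene": "BRAF", "pathway": "MAPK signalling"},
}

# With interoperable data, the cross-domain join is a one-liner.
joined = [{**row, **biology[row["target_id"]]} for row in chemistry]
print(joined[0])  # chemistry and biology in one record
```

The hard work is agreeing the shared key and its semantics upfront; once that is done, the join itself is trivial.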

R is for Readable
…and, of course, machine-readable. Interoperability is the main FAIR principle relevant to machine-readability, and while this is partly obvious, it’s not just about physical data formats; truly reusable data should codify the context in which it was generated so that the machine draws the right conclusions. This is usually the biggest challenge in FAIRification work, especially in specialist areas that lack pre-existing data standards or rely heavily on written descriptive text. Providing a long-term, robust solution often means developing new data capture systems and processes that properly codify tacit knowledge that otherwise would be left in explanatory paragraphs, research plans, published papers or sometimes not even written down at all.
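Codifying tacit context means replacing descriptive prose with explicit fields a machine can act on. The before/after below is a made-up illustration of what a new capture system might record instead of a free-text note:

```python
# Before: context locked in prose a machine cannot reliably use.
free_text_note = "IC50 measured at 25C, duplicate runs, DMSO vehicle"

# After: the same context as explicit, machine-readable fields
# (schema invented for illustration).
structured_record = {
    "measurement": "IC50",
    "temperature_celsius": 25,
    "replicates": 2,
    "vehicle": "DMSO",
}

# A machine can now test comparability without parsing prose.
same_conditions = structured_record["temperature_celsius"] == 25
print(same_conditions)  # True
```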

To be truly AI-ready, your data should satisfy the PULSAR principles – and applying the FAIR principles as a first step means a lot of the work is already done. Indeed, “the ultimate goal of FAIR is to optimize the reuse of data” [4]. The end of FAIR is the beginning of AI-readiness.
Capgemini’s many years of experience with FAIR and data management will help you truly embrace becoming a data-driven R&D organization.