For many R&D organizations, the desire to do new things with old data means that excitement about the potential of artificial intelligence and machine learning collides with the reality of legacy data that isn’t fit for this new purpose. This is often the lightbulb moment where the idea of data management takes off.

In this article we will explain the six PULSAR principles of AI-ready data and show how FAIR data brings you closer to true AI-readiness.

P is for Provenanced

These days, many rich, public data sets are available (like UniProt [1], ChEMBL [2] and Open PHACTS [3] in life sciences) that organizations are using to enrich internal data and tackle research problems on a much bigger scale. When machine learning feeds into that work, ensuring that model predictions are reproducible is critical. It requires a robust provenance chain showing what data was used to inform a model, where it came from and how it was generated.

The authors of FAIR anticipated this and accounted for provenance explicitly within the reusability principles, which state that data should be associated with information about its origin and processing. Truly FAIR data automatically covers the ‘provenanced’ principle – that’s a good start!

U is for Unbiased

There are many well-known stories about biased AI systems causing terrible consequences for real people. Usually, AI systems are biased because they were trained on biased data – often data that contained hidden biases that were not obvious upfront.

Detecting bias in data is challenging, and FAIR does not have all the answers. But through findability, you can make your search for appropriate input data broad, and through accessibility, you can be more confident that you’ve obtained everything available. Your data profile is then less likely to have blind spots, and FAIR will have helped you to avoid one of AI’s biggest mistakes.

L is for Legal

Do you, and your AI, have the legal right to use a given data set? For example, it is fine to collect personal data provided you tell the people you collect it from what you will do with it (‘transparent processing’). But AI projects often make secondary use of data, beyond its original research purpose. Are you covered by the original terms of consent?

One of FAIR’s reusability principles specifically states that human- and machine-readable conditions of reuse should be included in metadata. So, while the machine-readable aspect is probably still a work in progress, AI system owners should at least be able to take an informed view on the appropriateness of the truly FAIR data they consume.

S is for Standardized

Everyone appreciates that standardization reduces problematic data variability and, while standardization may not enforce all quality aspects, it does prompt data practitioners to consider quality. Of course, some AI projects specifically act on unstructured data, e.g. when processing natural language. Here, standardization of the outputs, rather than the inputs, is the key, for example when concluding that two scientific papers are discussing the same disease even if they refer to it using different nomenclature.

Standardization is baked into FAIR’s interoperability principles, which recommend standardization of the way we physically store data (e.g. as triples in a triple store or tables in a relational database), the data exchange format (e.g. using OWL or JSON-LD) and the actual meaning of the data (e.g. using a public or industry data standard).

A is for Activated

Activated data is ready to use – for example, the data sets you’re going to feed to your AI system are either joined together or ready to be joined. Big data and AI often generate novel insights from the combination of historically siloed data types (for example, chemistry and biology data in a search for new medicines), but connecting data sets from multiple domains can be surprisingly complicated.

FAIR’s interoperability principle is the key here. With interoperable data, close attention should already have been paid to the key joining points on the edges of data sets and data models, building in interdisciplinary interoperability from the start.

R is for Readable

…and, of course, machine-readable. Interoperability is the main FAIR principle relevant to machine-readability. While this is partly obvious, it’s not just about physical data formats; truly reusable data should codify the context in which it was generated so that the machine draws the right conclusions. This is usually the biggest challenge in FAIRification work, especially in specialist areas that lack pre-existing data standards or rely heavily on written descriptive text. Providing a long-term, robust solution often means developing new data capture systems and processes that properly codify tacit knowledge that would otherwise be left in explanatory paragraphs, research plans or published papers, or sometimes not written down at all.

Conclusion

To be truly AI-ready, your data should satisfy the PULSAR principles, and applying the FAIR principles as a first step means a lot of the work is already done. Indeed, “the ultimate goal of FAIR is to optimize the reuse of data” [4]. The end of FAIR is the beginning of AI-readiness.

Capgemini’s many years of experience with FAIR and data management will help you truly embrace becoming a data-driven R&D organization.

_________________

[1] https://www.uniprot.org/
[2] https://www.ebi.ac.uk/chembl/
[3] http://www.openphacts.org/
[4] https://www.go-fair.org/fair-principles/
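
As a rough illustration of the provenance, legal and standardization points discussed in the article, the sketch below checks whether a data set's metadata carries the information those principles ask for. It is only a minimal example under stated assumptions: the `DatasetMetadata` record, its field names, the `readiness_gaps` helper and the sample values are all hypothetical and are not taken from the FAIR principles or any published schema.

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical metadata record; field names are illustrative assumptions,
# not part of FAIR or any published metadata standard.
@dataclass
class DatasetMetadata:
    name: str
    source_url: Optional[str] = None      # provenance: where the data came from
    generated_by: Optional[str] = None    # provenance: how it was generated
    reuse_license: Optional[str] = None   # legal: conditions of reuse
    data_standard: Optional[str] = None   # standardized: vocabulary or schema used


def readiness_gaps(meta: DatasetMetadata) -> List[str]:
    """Return the principles for which the required metadata is missing."""
    gaps = []
    # Provenance needs both the origin and the processing information.
    if not (meta.source_url and meta.generated_by):
        gaps.append("provenanced")
    if not meta.reuse_license:
        gaps.append("legal")
    if not meta.data_standard:
        gaps.append("standardized")
    return gaps


# Made-up sample records for demonstration only.
complete = DatasetMetadata(
    name="example-assay-results",
    source_url="https://example.org/datasets/assay-results",
    generated_by="plate-reader export, pipeline v1",
    reuse_license="CC-BY-4.0",
    data_standard="internal assay ontology",
)
partial = DatasetMetadata(name="legacy-spreadsheet", source_url="file-share")

print(readiness_gaps(complete))  # []
print(readiness_gaps(partial))   # ['provenanced', 'legal', 'standardized']
```

A check like this only confirms that the metadata is present; it says nothing about whether the license actually permits the intended secondary use, or whether the data are unbiased, which still requires human judgment.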
James has been working with the R&D departments of global companies to release the potential locked away in their data since 2003. He provides clients with strategic advice on their priority IT programmes, consulting with them to develop realistic and achievable roadmaps to deliver real benefit from applying next-generation analytics methods to their data.