In my previous blog, I announced the launch of Capgemini’s UK Graph Guild and introduced the notion of basic graphs and a number of simple use-cases. In this blog, I want to delve deeper by looking at special types of graphs called Directed Acyclic Graphs (DAGs) and their applications.
What are Directed Acyclic Graphs (or DAGs)?
To understand what is meant by a “directed acyclic graph”, it is best to break-up the phrase into its component parts
i) A “graph” is a structure comprising of:
- A set of nodes; and
- A set of edges, where an edge connects one node to another. An edge between two nodes only exists if there is a relationship (of some sort) between them.
For example, a graph might look something like this:
ii) A “directed graph” is one in which the edges have a direction, represented by adding an arrow to the edge to show the direction of the relationship. Directed graphs are very important as they can be used to represent asymmetrical relationships, as opposed to symmetrical relationships that exist in undirected graphs.
For example, a directed graph can be used to represent a particular website, where sites or web-pages are nodes and the edges are the hyperlinks from one page to another.
iii) A “cyclic graph” is a graph in which it is possible to find at least one node, and by following a sequence of directed edges eventually return to the same node. The graph in Figure 3 is cyclic, as we can start at Node B, go to nodes C, D and E and return to B.
An “acyclic graph” is a graph in which it is not possible to find at least one cyclic path. By reversing the direction of the relationship between nodes B and E in Figure 3 we obtain an acyclic graph as it is not possible to start at node B (or any other node for that matter) and return to it by following a directed sequence of edges.
Putting these concepts together, a DAG is a graph in which all the edges are directed, such that it is impossible to find a node and follow a sequence of edges that eventually loops back to the same node. Phew, that’s a mouth-full.
Key applications of DAGs
A key property of DAGs is that they have what is known as a “topological ordering”, which means that the nodes of a DAG can be put into a linear sequence with the nodes given an “ordering”, specifically nodes at the beginning of the sequence have a “lower value” than nodes at the end of the sequence. For example, a topological ordering of the DAG from Figure 4 could look like this:
This topological ordering property, as well as other key properties, make DAGs very efficient at a number of tasks (such as finding the shortest path from one node to another) and are the reason DAGs have a wide range of use-cases. In fact, DAGs are far more ubiquitous than you might imagine, being not only a critical data structure in mathematics, statistics and data science, but they appear almost everywhere and in some unexpected places. A number of use-cases deserve specific mention:
- DAGs are used in project management to plan, design, and implement complex projects or tasks. For example, DAGs are used in popular projects such as Apache Airflow (a workflow management system originally developed by Airbnb) and in Apache Spark. For instance, in Spark, DAGs are used to represent a chain of Resilient Distributed Dataset (RDD) dependencies:
- A promising application of DAGs is in the development of faster and cheaper distributed ledgers. Despite the hype, distributed ledgers such as “blockchain” have failed to be widely adopted, due in large part to their poor scalability, low speed and high transaction costs. For example, the Bitcoin blockchain (which uses a linear sequence of blocks) only manages to process 4 to 7 transactions per second, which is simply not viable for wide scale adoption.
However, a DAG based distributed ledger (due to its graph structure) can process hundreds of thousands of transactions per second, and do so with far lower transaction costs. This could enable use cases such as P2P energy trading to become entirely viable, a feat that has so far eluded current blockchains.
- DAGs can be used to identify confounding and sources of bias, which is particularly important in medical and clinical studies. For example, consider a clinical study with the objective of identifying the relationships between “screen time” and “childhood obesity”. It might seem reasonable to hypothesize that more screen time may lead to an increased risk of childhood obesity:
However, higher screen time probably doesn’t cause obesity directly – it is more likely there is an intermediate process (being a reduction in physical activity) that is responsible for weight gain.
However, there are factors (called confounders) that influence both the amount of screen time and the risk of obesity. Indeed the authors of this study identified low parental education to be such an influence.
It is the use of DAGs that allow the easy identification of confounders and spurious relationships – without it erroneous conclusions would easily be reached, such as falsely attributing screen time as the sole factor that causes obesity.
I hope I’ve shown that DAGs appear in a range of situations, and given a glimpse of their properties that makes them so very powerful. If you would like to know more about Capgemini’s UK Graph Guild, or if you want to know how we can help you apply graphs to your data, please get in touch.
Calum Chalmers is a senior data scientist in the Insights & Data practice, with over 20 years’ analytical and machine learning experience. He first fell in love with graph theory when studying mathematics at Glasgow University and at the University of Warwick.