Causal Discovery : Does The Cockerel Crowing Cause The Sun To Rise?

Republished By Plato

10 Lines of Python code to automate causal discovery that you have got to see

The focus of my recent research has been causal inference, driven by the increasing number of requests I get from customers to move beyond machine learning predictions and answer “what-if?” questions that drive impact and outcomes.

One of the things that intrigued me initially was the question “How are causal diagrams constructed?” In many of the online examples they seem to spring into life fully formed, with no explanation of how the nodes and links were identified.

This led me to read “The Book of Why” by Judea Pearl and Dana Mackenzie (several times!) and one view expressed is that you cannot reverse engineer a causal diagram, also known as a Directed Acyclic Graph (DAG), from the data alone.

Data can reveal correlation but not causation, so how is it possible to “discover” causation?

Note: all the data sets in this article are licensed for public use, please see the references section at the end of the article for all sources and references.

Consider the value counts from the following binary data set …

Note: the sun rise / cockerel data set is synthetic data created by the author, please see the references section at the end of the article for the full source and reference details.

We can see that the cockerel crows when the sun rises in 90.25% of observations, but how do we know from the data alone whether the sun rising causes the cockerel to crow or the cockerel crowing causes the sun to rise?

The number of instances in which the cockerel crows while the sun isn’t rising and the number in which the sun rises but the cockerel stays silent are very similar (50,000 vs. 47,500), so the causal answer cannot be found by comparing the relative volumes of data.
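A minimal sketch of why the counts cannot settle the direction, assuming pandas is available. The counts below are made-up placeholders rather than the article's data set, and the column names `SunRise` and `CockerelCrows` are assumptions. The point is only that correlation is symmetric: it gives the same number whichever variable we treat as the cause.

```python
import pandas as pd

# Illustrative counts only (NOT the article's data): each tuple is
# (SunRise, CockerelCrows, number of observations).
counts = [(1, 1, 90), (1, 0, 10), (0, 1, 10), (0, 0, 90)]

# Expand the counts into one row per observation.
rows = [
    {"SunRise": sun, "CockerelCrows": crow}
    for sun, crow, n in counts
    for _ in range(n)
]
df = pd.DataFrame(rows)

# Value counts of each (SunRise, CockerelCrows) combination.
print(df.value_counts())

# Pearson correlation is symmetric: swapping "cause" and "effect"
# produces exactly the same value, so it carries no direction.
corr_sun_crow = df["SunRise"].corr(df["CockerelCrows"])
corr_crow_sun = df["CockerelCrows"].corr(df["SunRise"])
print(corr_sun_crow, corr_crow_sun)
```

Whatever the real counts are, the two printed correlations will always match, which is why the arrow has to come from somewhere other than the table itself.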

One approach could be to look at the temporal aspect. If the sun rises consistently before the cockerel crows, that will be a good indicator of causality, but what if our cockerel is an early riser?

The answer is to consult the domain experts. If we could assemble a team with an astronomer, a physicist and a poultry farmer, they would conclude that the sun is causing the cockerel to crow and not the other way around!

Undeterred by suns rising and cockerels crowing I set off in search of something that was already out there that might be able to perform the magic of causal discovery based on the data.

There is an algorithm called “NOTEARS” that does claim to be able to implement causal discovery, so let’s give it a try and see what it can do …

Note: the smallpox data set is synthetic data created by the author, please see the references section at the end of the article for the full source and reference details.

This dataset is modelling the cause-and-effect relationships in smallpox vaccinations. Let’s run the NOTEARS algorithm against it and see what it can do …

[('Reaction?', 'Vaccination?'),
('Smallpox?', 'Reaction?'),
('Death?', 'Reaction?'),
('Death?', 'Smallpox?')]

If we were to visualise the results produced by NOTEARS it would look like this …

It doesn’t look very good, does it? According to NOTEARS, death is causing smallpox. The temporal aspect tells us that smallpox comes first, so it cannot be caused by death. Nor can smallpox cause the reaction (the vaccine causes the reaction), and the reaction certainly cannot cause the vaccination.

Using the knowledge of domain experts, we can easily establish that the vaccination has a causal effect on whether the patient has a reaction and on whether they go on to develop the disease, and that the reaction and smallpox both have a causal link with death …
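To make the comparison concrete, the sketch below sets the NOTEARS edge list printed earlier against the expert relationships just described, treating each tuple as (cause, effect). It uses only plain Python sets, and the variable names are illustrative.

```python
# Edges as (cause, effect) pairs, copied from the NOTEARS output above.
notears_edges = {
    ("Reaction?", "Vaccination?"),
    ("Smallpox?", "Reaction?"),
    ("Death?", "Reaction?"),
    ("Death?", "Smallpox?"),
}

# Edges described by the domain experts in the text.
expert_edges = {
    ("Vaccination?", "Reaction?"),
    ("Vaccination?", "Smallpox?"),
    ("Reaction?", "Death?"),
    ("Smallpox?", "Death?"),
}

# Edges NOTEARS got right, direction included.
correct = notears_edges & expert_edges
# Edges NOTEARS found but pointed the wrong way.
reversed_edges = {(a, b) for a, b in notears_edges if (b, a) in expert_edges}
# Edges NOTEARS found that the experts do not recognise in either direction.
spurious = notears_edges - correct - reversed_edges
# Expert edges NOTEARS missed in both directions.
missing = {
    (a, b)
    for a, b in expert_edges
    if (a, b) not in notears_edges and (b, a) not in notears_edges
}

print("correct: ", sorted(correct))
print("reversed:", sorted(reversed_edges))
print("spurious:", sorted(spurious))
print("missing: ", sorted(missing))
```

On this example none of the four discovered edges has the direction the experts would give it: three are reversed, one is spurious, and the vaccination to smallpox link is not found at all.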

So, we have established that even in a very simple causal model the NOTEARS causal discovery algorithm does not produce correct results. Also, a reader of one of my previous articles pointed out that NOTEARS does not work in Python 3.9 and as I am about to upgrade this is another big problem.

If you want to read more about the unsuitability of NOTEARS for causality here is an excellent academic paper — https://arxiv.org/pdf/2104.05441.pdf (Marcus Kaiser and Maksim Sipos).

Before we give up on fully automated causal discovery, let’s take a look at an alternative to NOTEARS, the “Lasso” algorithm …

Note: the census income / graduate earnings data set is licensed for public use, please see the references section at the end of the article for the full source and reference details.

Oh dear, Lasso is a disaster! It has predicted that everything is causing everything else and that the nodes cause themselves as well!

My final attempts were to try the GES, GIES and LiNGAM algorithms, but these all require R libraries. I do not use R, and even if I did manage to get the configuration right, I would never be able to create portable code that other data scientists could use.

The causal discovery libraries and algorithms that I tried did not work on these examples, and this has reinforced the view expressed in “The Book of Why”, i.e. that causality cannot be reverse engineered from data alone.

This conclusion led to me developing my own approach …

Cards on the table, I have not written a causal discovery algorithm. Rather my algorithm implements correlative discovery with a twist (no pun intended!).

When I had given up on the causal discovery libraries I still needed a way to visualise causation as a starting point for the conversation with domain experts.

I reasoned that I could calculate the correlations between the features / nodes easily and that this would at least be a good starting point.

I developed my thinking as follows: in causal discovery models we usually identify the “effect”, i.e. the data feature we are interested in (like the “target” in machine learning predictions). In the smallpox example this is “Death?”, and in the graduate earnings example it is the earnings feature “greaterThan50k”.

Therefore, if there are correlations between any feature and the “effect”, the causal direction must be from the other features to the effect, as the effect is at the “end-of-the-line”.

My next step was to develop a recursive algorithm that can be visualised as follows …

Duplicates and bi-directional links must be forcibly excluded. I also wanted to be able to explicitly include or exclude connections (edges) and to explicitly exclude features (nodes).

These are the 10 lines of recursive Python code that I came up with to implement causal discovery
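As a rough illustration of the approach described above, and not the original 10-line listing, a recursive correlative discovery might be sketched like this. It assumes pandas and numeric features. The function name `discover_edges`, its parameters and the tiny data set are all hypothetical.

```python
import pandas as pd


def discover_edges(corr, effect, threshold=0.3, depth=1, exclude=(), edges=None):
    """Recursively link features whose |correlation| with `effect` meets the threshold.

    Hypothetical sketch: arrows point towards the node being explained, so only
    links into the original effect have a defensible direction.
    """
    edges = [] if edges is None else edges
    if depth == 0:
        return edges
    found = []
    for feature in corr.columns:
        if feature == effect or feature in exclude:
            continue
        strength = abs(corr.loc[feature, effect])
        if not strength >= threshold:  # also skips NaN correlations
            continue
        # Skip duplicates and bi-directional links.
        if (feature, effect) in edges or (effect, feature) in edges:
            continue
        edges.append((feature, effect))
        found.append(feature)
    # Recurse once the whole layer is linked; the current effect is added to
    # the exclusions so nothing deeper can point out of it.
    for feature in found:
        discover_edges(corr, feature, threshold, depth - 1, (*exclude, effect), edges)
    return edges


# Tiny made-up data set, purely for illustration.
df = pd.DataFrame({
    "Effect": [0, 0, 0, 0, 1, 1, 1, 1],
    "A":      [0, 0, 0, 1, 1, 1, 1, 1],
    "B":      [0, 1, 0, 1, 0, 1, 0, 1],
})
corr = df.corr()

one_layer = discover_edges(corr, "Effect", threshold=0.3, depth=1)
two_layers = discover_edges(corr, "Effect", threshold=0.2, depth=2)
print(one_layer)
print(two_layers)
```

The sketch does not check the result for cycles, and the arrows below the first layer simply reflect the order in which links were found, so they are a starting point for discussion rather than causal claims.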

To show how the solution works I have selected some data on hotel booking cancellations.

Let’s start by reading in the data and taking a peek at the correlations …

Note: the hotel bookings data set is licensed for public use, please see the references section at the end of the article for the full source and reference details.

The first run of the discovery algorithm involves keeping the iterations down to 1 so we are just looking at the correlations (which should be causal) with the “effect” i.e. whether the hotel booking is cancelled or not …

OK, that’s not a bad start, let’s increase the number of iterations / recursions / layers to 3, tweak the correlation threshold a bit and see what we get …

OK, not too bad either, but it is a bit “busy” so the next step is to exclude some of the nodes we suspect might be causing some of the noise (note: in a real-world business case we would be talking to our domain experts at this stage).

This is starting to look better. Remember that my algorithm is confident that links to “IsCancelled” are causal because it is the “effect”, hence nothing comes after it.

The other layers of the tree are just correlative though; the directions of the arrows were assigned based simply on the order in which the algorithm found them.

Working with the domain experts (or my own experience of hotel bookings in this instance!) I noticed the following:

The link from “DifferentRoomAssigned” to “LeadTime” is in the wrong direction because a long booking lead time increases the probability that the room booking will change, and not vice-versa.
The correlation between “BookingChanges” and “DifferentRoomAssigned” is below the threshold, but it may be significant causally, so it needs to be included.

The next attempt instructs the algorithm to make these corrections …
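A minimal sketch of how such corrections might be applied to an edge list, in plain Python. The helper `apply_corrections` is hypothetical, the starting edges are an illustrative subset rather than the algorithm's actual output, and the direction chosen for the forced “BookingChanges” link is an assumption, since the text only says the link should be included.

```python
def apply_corrections(edges, reverse=(), include=()):
    """Return a new edge list with some edges flipped and some forced in.

    Hypothetical helper: `reverse` lists (cause, effect) pairs to flip and
    `include` lists pairs to add even if they fell below the threshold.
    """
    corrected = []
    for cause, effect in edges:
        if (cause, effect) in reverse:
            cause, effect = effect, cause
        if (cause, effect) not in corrected:
            corrected.append((cause, effect))
    for edge in include:
        # Do not add a forced edge that already exists in either direction.
        if edge not in corrected and edge[::-1] not in corrected:
            corrected.append(edge)
    return corrected


# Illustrative subset of discovered edges as (cause, effect) pairs.
discovered = [
    ("LeadTime", "IsCanceled"),
    ("DifferentRoomAssigned", "IsCanceled"),
    ("DifferentRoomAssigned", "LeadTime"),
]

corrected = apply_corrections(
    discovered,
    reverse=[("DifferentRoomAssigned", "LeadTime")],
    include=[("BookingChanges", "DifferentRoomAssigned")],  # direction assumed
)
print(corrected)
```

Keeping the corrections as explicit arguments means each decision made with the domain experts is recorded alongside the call rather than buried in the data.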

An “unobserved confounder” is a factor that we believe is impacting on our causal model by providing a link between some of the nodes, but that has not been measured and hence has not been included in the diagram.

My experience of going to hotels and my knowledge of the hotels dataset made me suspect that there is an “unobserved confounder” affecting the hotels data.

There are two hotels in the data: a “city” hotel and a “resort” hotel. This led me to hypothesise that the unobserved confounder in this causal model is the “reason for the hotel stay”.

My hypothesis is that holiday-makers primarily stay at the resort hotel and business people and city-breakers primarily stay at the city hotel.

Furthermore I have hypothesised that these two groups have different behaviours and that this is the “unobserved confounder” (unobserved because the data does not capture the “reason for the stay”).

NOTEARS and other similar algorithms cannot account for “unobserved confounders” but the algorithm I have developed can take them into account by having them explicitly included as follows …

The result of the final iteration is a plausible, readable and understandable visualisation for the causality in the hotel data that I would be confident to explore and refine with the domain experts. These are the key points …

There is an “unobserved confounder” which is the primary reason for the stay (holiday vs. business / city break).
The “Country” has a causal effect on the unobserved confounder — those travelling from some countries are more likely to be on holiday.
The same applies to “TotalGuests”. Large parties are more likely to be holidaying, singles to be on business trips and doubles on city breaks.
The unobserved “reason for stay” has a causal effect on “LeadTime”, “HotelType” and “DepositType”.
“LeadTime”, “DifferentRoomAssigned”, “HotelType”, “DepositType” and “Reason for Stay” (U) all have a causal effect on “IsCanceled”.
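The key points above can be written down as an explicit edge list, with the unobserved confounder added by hand as just another node. The sketch below does that and checks that the result is still a valid DAG. It assumes Python 3.9 or later for the standard-library `graphlib` module, and the node label `ReasonForStay (U)` is an assumption.

```python
from graphlib import CycleError, TopologicalSorter

U = "ReasonForStay (U)"  # unobserved confounder; the label is an assumption

# (cause, effect) pairs taken from the key points listed above.
edges = [
    ("Country", U),
    ("TotalGuests", U),
    (U, "LeadTime"),
    (U, "HotelType"),
    (U, "DepositType"),
    ("LeadTime", "IsCanceled"),
    ("DifferentRoomAssigned", "IsCanceled"),
    ("HotelType", "IsCanceled"),
    ("DepositType", "IsCanceled"),
    (U, "IsCanceled"),
]

# TopologicalSorter expects a mapping of node -> set of predecessors (causes).
predecessors = {}
for cause, effect in edges:
    predecessors.setdefault(cause, set())
    predecessors.setdefault(effect, set()).add(cause)

# A topological order exists only if the graph has no cycles.
try:
    order = list(TopologicalSorter(predecessors).static_order())
    is_dag = True
except CycleError:
    order, is_dag = [], False

print(is_dag)
print(order)
```

Because every other node is an ancestor of “IsCanceled”, it always comes last in the ordering, which matches the idea that the effect sits at the end of the line.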

Causal Discovery is the process of going beyond correlative links to discovering causal links (i.e. on which end of a correlative line does the arrow belong?). This is not possible, or at least prohibitively difficult, to automate with algorithms like NOTEARS because the answers do not exist in the data alone.

However, there is still a critical need to visualise causation so that it can be explored with domain experts in order to build a causal diagram (also referred to as a Directed Acyclic Graph or DAG).

In this article an algorithm based on augmenting correlative discovery has been proposed using just 10 lines of Python code that can be used with domain experts to repeatedly refine the sensitivity (threshold) and to correct, add or remove the links (arrows) and exclude irrelevant nodes.

In future articles these causal diagrams will be used to build a full-blown causal machine learning model that can answer complex “what if?” type questions from customers and business leaders.

Sun Rising / Cockerel Crowing Data

Smallpox Data

Graduate Earnings / Census Income Data

Hotel Bookings Data

Causal Discovery : Does the Cockerel Crowing Cause the Sun to Rise? Republished from Source https://towardsdatascience.com/causal-discovery-does-the-cockerel-crowing-cause-the-sun-to-rise-f4308453ecfa?source=rss—-7f60cf5620c9—4 via https://towardsdatascience.com/feed

Time Stamp: November 1, 2022 / November 2, 2022