Introduction to Probability and an application in Tableau

In business we tend to focus a lot on statistics — mostly summary statistics that describe a dataset. What’s the average? What’s the minimum? What’s the maximum?

Looking at the Tableau forums and on Twitter, there are a lot of questions about how to calculate a particular statistic: How do I calculate a confidence interval? How do I calculate an error bar? How do I fit a linear model or trend?

While all these applications are important, what might get lost is the foundation: probability. Almost every statistical model is based on probability, specifically conditional probability.

In this post, we’ll walk you through conditional probability and how to understand and use it in the context of Tableau.

What is probability?

The best way to understand probability is by flipping a fair coin. There are two possible outcomes: heads or tails. What’s the probability of the coin landing heads? You may know intuitively that it’s 50%. But how does that work out in math? When every outcome is equally likely, we can define probability as:

Probability = (# of ways it can happen) / (total number of outcomes)

In the coin-flip example, there is one way to get heads out of two possible outcomes, so the probability of heads is P(H) = ½ = 50%.

More formally, the set of outcomes we are interested in is called an event, and the set of all possible outcomes is called the sample space.

Terminology

Sample Space: The sample space is the set of all possible outcomes.

Examples: 
Rolling a six-sided die: {1,2,3,4,5,6}
The customer segments in Sample – Superstore: {Consumer, Corporate, Home Office}

Event: An event is a subset of the sample space.

Examples:
Roll a die and the outcome is even: {2, 4, 6}
Pick one of the customer segments in Sample – Superstore at random and it is Corporate: {Corporate}
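To make the formula concrete, here is a minimal Python sketch that treats the sample space and the event as sets and counts outcomes. It assumes every outcome is equally likely, and the helper name `probability` is just something made up for this illustration.

```python
from fractions import Fraction

def probability(event, sample_space):
    """P(event) when every outcome in the sample space is equally likely."""
    event = set(event) & set(sample_space)  # ignore outcomes outside the sample space
    return Fraction(len(event), len(sample_space))

die = {1, 2, 3, 4, 5, 6}   # sample space for one roll of a six-sided die
even = {2, 4, 6}           # event: the roll is even
coin = {"H", "T"}          # sample space for one coin flip

p_even = probability(even, die)      # 3 of 6 outcomes
p_heads = probability({"H"}, coin)   # 1 of 2 outcomes
print(p_even, p_heads)
```

Using `Fraction` keeps the results exact, so both probabilities come out as 1/2 rather than a rounded decimal.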

Union, Intersection and Mutually Exclusive Events

A union is another way of saying OR. How many customers buy Furniture OR Office Supplies?

An intersection is another way of saying AND. How many customers buy Furniture AND Office Supplies?

Mutually exclusive events (also called disjoint events) are two or more events that can’t happen at the same time. The easiest example is the toss of a coin: it can land on heads or on tails, but it can’t be both.
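If it helps to see these three ideas as set operations, here is a small Python sketch. The customer names are invented for illustration and are not taken from the Superstore data.

```python
# Hypothetical customers, invented for illustration (not real Superstore data).
furniture_buyers = {"Ann", "Bob", "Cara"}
office_supply_buyers = {"Bob", "Cara", "Dev", "Eli"}

either = furniture_buyers | office_supply_buyers   # union: Furniture OR Office Supplies
both = furniture_buyers & office_supply_buyers     # intersection: Furniture AND Office Supplies

heads, tails = {"H"}, {"T"}
mutually_exclusive = heads.isdisjoint(tails)       # True when the events share no outcome

print(len(either), len(both), mutually_exclusive)
```

Five customers bought from at least one category, two bought from both, and heads and tails share no outcome, which is exactly what mutually exclusive means.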

Independent and Dependent Events

Independent Events

Two events are said to be independent if the outcome of one doesn’t affect the probability of the other.

A few examples:

1.) Getting a heads after tossing a fair coin AND getting a 3 after throwing a die.

2.) Choosing a marble from a jar AND landing on heads after tossing a fair coin.

3.) It raining today AND my garbage being collected today. (Garbage will be collected regardless.)

4.) And for my #SportsVizSunday brothers, the chances of your favorite team winning today AND you wearing your lucky jersey.

(Independence is one of the key assumptions behind linear regression, but we won’t get into that now…)
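One way to check independence numerically is the product rule: for independent events A and B, P(A and B) = P(A) × P(B). The post doesn't state this rule explicitly, so treat the sketch below as a supplementary check for the coin-and-die example, done by listing all 12 equally likely outcomes. The helper `prob` is a made-up name for this illustration.

```python
from fractions import Fraction
from itertools import product

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]
sample_space = list(product(coin, die))  # 12 equally likely (coin, die) pairs

def prob(condition):
    """Share of outcomes in the sample space that satisfy the condition."""
    hits = [outcome for outcome in sample_space if condition(outcome)]
    return Fraction(len(hits), len(sample_space))

p_heads = prob(lambda o: o[0] == "H")                # 6 of 12 outcomes
p_three = prob(lambda o: o[1] == 3)                  # 2 of 12 outcomes
p_both = prob(lambda o: o[0] == "H" and o[1] == 3)   # 1 of 12 outcomes

print(p_both == p_heads * p_three)
```

The comparison prints True: 1/12 equals 1/2 × 1/6, so knowing the coin landed heads tells us nothing about the die.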

Dependent Events

Two events are said to be dependent if the outcome of one event affects the probability of the other. A good example of this would be drawing 2 cards from a deck of 52 playing cards.

If on your first draw you had an ace and you put that aside, the probability of drawing an ace on the second draw has changed because you drew an ace the first time. What does that calculation look like?

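There are 4 aces in a deck of 52 cards.

On your first draw, the probability of getting an ace is given by:

P(Ace on 1st draw) = 4 / 52 = 1/13 ≈ 7.7%

If we don’t return this card to the deck, 3 aces remain among 51 cards, so the probability of drawing an ace on the second pick is given by:

P(Ace on 2nd draw, given an ace on the 1st) = 3 / 51 = 1/17 ≈ 5.9%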

As you can clearly see, the above two probabilities are different, so we say that the two events are dependent. The likelihood of the second event depends on what happens in the first event.
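To check the arithmetic for the two draws, here is a short Python sketch using exact fractions. It also adds the case where the first ace is put back into the deck, which the example above doesn't cover, purely for contrast.

```python
from fractions import Fraction

aces, cards = 4, 52

p_first_ace = Fraction(aces, cards)                        # 4/52
p_second_ace_given_first = Fraction(aces - 1, cards - 1)   # 3/51, first ace set aside
p_second_ace_if_replaced = Fraction(aces, cards)           # 4/52, first ace put back

print(p_first_ace, p_second_ace_given_first, p_second_ace_if_replaced)
```

Without replacement the probability drops from 1/13 to 1/17, so the draws are dependent. With replacement it stays at 1/13, and the two draws would be independent.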

Conditional Probability

We have already defined dependent and independent events and seen how probability of one event relates to the probability of the other event.

Having those concepts in mind, we can now look at conditional probability.

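Conditional probability makes dependence precise: it is the probability of an event given that some other event has already occurred.

Conditional probability is denoted by the following:

P(A | B) = P(A and B) / P(B)

This is read as “the probability of A given B” and is defined whenever P(B) is greater than 0.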
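As a small numeric sketch of the notation P(A | B) = P(A and B) / P(B), take one roll of a fair die with A = “the roll is a 2” and B = “the roll is even”. The event names and the helper `p` are choices made for this illustration only.

```python
from fractions import Fraction

die = {1, 2, 3, 4, 5, 6}   # sample space, all outcomes equally likely
A = {2}                    # event A: the roll is a 2
B = {2, 4, 6}              # event B: the roll is even

def p(event):
    """Probability of an event on one roll of a fair die."""
    return Fraction(len(event & die), len(die))

p_a = p(A)                      # 1/6 before we know anything
p_a_given_b = p(A & B) / p(B)   # (1/6) / (3/6)

print(p_a, p_a_given_b)
```

Learning that the roll is even raises the probability of a 2 from 1/6 to 1/3, because only three outcomes are still possible.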

So why is all this important? And how does it apply to Tableau?

Bringing it back to Tableau

Believe it or not, the Superstore dataset is full of probabilities and of independent and dependent events. Let’s explore some. Open up Tableau and the Superstore dataset.

Example 1

Drag Customer Name to Rows and drag Profit to Text (the Abc mark). Sort by Profit. Let’s assume that the customers don’t know each other. Does the profit amount for Tamara Chand impact the profit amount for Hunter Lopez? If they don’t know each other, the answer is no, and these would be considered independent events.

Example 2

In retail we often want to know which products are most likely to be bought with other products, usually to decide which products to cross-sell together. This is an application of conditional probability called market basket analysis.

To keep it simple, in our Superstore dataset we want to find the probability that a product from Office Supplies appears in a random Furniture order.

1.) First we need a second copy of Orders. Click on Data Source and drag a second Orders sheet to the pane. Self-join Orders on Order ID = Order ID (Orders1).

2.) Next we need to figure out how many distinct orders there are for each category. Create a new calculated field called “Distinct Orders by Category” using a FIXED LOD calculation:

{fixed [Category]: COUNTD([Order ID])}

3.) Next we need to find the probability (as a percentage) that each category appears in the same order as every other category.
Create a new calculated field called “Percentage of Orders” with the following calculation:

COUNTD([Order ID])/min([Distinct Orders by Category])

4.) Now create the viz. Drag Category (under the product hierarchy) to Rows and Category (Orders1) to Columns.

5.) Drag the Percentage of Orders calculation to the Text (Abc) mark. You now have a matrix of categories showing the probability that each category appears in an order that contains another category.

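Read this matrix from left to right. The distinct order counts are calculated along the rows, so each row is the category we condition on: Office Supplies appears in 54% of all Furniture orders. Conversely, Furniture appears in only 25% of all Office Supplies orders. In probability terms, the probability of Office Supplies given Furniture is 54%, while the probability of Furniture given Office Supplies is only 25%.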

Whether you realized it or not, we just calculated a conditional probability on categories in the Superstore dataset.
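If you want to sanity-check the logic outside Tableau, the same matrix can be sketched in a few lines of plain Python. The rows below are invented for illustration and are not the Superstore data, so the numbers will not match the 54% and 25% above. The function name `p_other_given` is also made up for this sketch.

```python
from collections import defaultdict

# Hypothetical (order_id, category) rows, invented for illustration only.
rows = [
    (1, "Furniture"), (1, "Office Supplies"),
    (2, "Furniture"),
    (3, "Office Supplies"), (3, "Technology"),
    (4, "Office Supplies"),
    (5, "Furniture"), (5, "Office Supplies"), (5, "Technology"),
]

# Group categories by order; this plays the role of the self join on Order ID.
categories_in_order = defaultdict(set)
for order_id, category in rows:
    categories_in_order[order_id].add(category)

orders_with = defaultdict(int)        # distinct orders per category (the FIXED LOD calc)
orders_with_both = defaultdict(int)   # distinct orders containing both categories of a pair
for cats in categories_in_order.values():
    for given in cats:
        orders_with[given] += 1
        for other in cats:
            orders_with_both[(given, other)] += 1

def p_other_given(other, given):
    """P(order contains `other` | order contains `given`)."""
    return orders_with_both[(given, other)] / orders_with[given]

p_os_given_furniture = p_other_given("Office Supplies", "Furniture")
p_furniture_given_os = p_other_given("Furniture", "Office Supplies")
print(p_os_given_furniture, p_furniture_given_os)
```

In this toy data, 2 of the 3 Furniture orders contain Office Supplies, while 2 of the 4 Office Supplies orders contain Furniture. As in the Tableau matrix, the two directions give different answers because the denominator changes with the category we condition on.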

In closing

I hope you learned a little about probability in this post and how to apply it in Tableau. In the next post we will build upon these foundations and discuss Bayes’ theorem.

That’s it for now!

-Paul
