Creating Custom Regions using PostGIS and Tableau

Welcome back, friends!

With the release of Tableau 2019.2 and the native support of PostGIS, I’ve been working on how to incorporate this new functionality into my day-to-day work.

This is similar to a problem that came up at work: How do you create custom outlines of regions based on store locations? I know this might seem trivial, but you could have situations where stores even in the same zip code are in different regions.

So how do you solve this? PostGIS to the rescue!

Getting started

If you aren’t familiar with PostGIS or how to get started with it, go back and read my blog post on it here.

I pulled down a list of Starbucks locations from POI Factory, and ran a simple clustering algorithm to simulate regions. You can download the Excel file here if you want to follow along.

First I’m going to upload these into PostGIS. I created a table with the following structure:

postgis_regions_01

Then I clicked on the table, selected Import/Export, and imported my data (after saving the file as a CSV from Excel).

Next, let's create our geometry column. I have a tendency to import latitude and longitude as separate columns and then run a few SQL statements:

alter table starbucks add column geom geometry(Point, 4326);
update starbucks set geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326);
create index starbucks_geom_idx on starbucks using gist (geom);

What this does in a few words:

1.) Adds a geometry column.

2.) Populates the new geometry column with the lat long fields.

3.) Creates a spatial index on the new geometry column.
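
If you want to sanity-check the load, a couple of quick queries against the table above will confirm that every row got a geometry and that the SRID is what we expect:

select count(*) from starbucks where geom is null;  -- should be 0
select ST_SRID(geom) from starbucks limit 1;        -- should be 4326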

Visualizing this in Tableau, we see something like this:

postgis_regions_02

Creating the custom regions

To create the custom regions, we're going to use Voronoi polygons. From Wikipedia: a Voronoi diagram is a partitioning of a plane into regions based on distance to points in a specific subset of the plane.

In plain English, each point gets a region that extends outward until it reaches the region of another point. Luckily, PostGIS has a function for exactly this: ST_VoronoiPolygons.

First we’re going to find the centroids for each region:

SELECT region,
       (ST_Dump(ST_Centroid(ST_Collect(geom)))).geom::Geometry(Point, 4326) AS geom
FROM starbucks
GROUP BY 1;

postgis_regions_03

Now, build a Voronoi diagram to get the actual dividing edges between the region centroids:

SELECT (ST_Dump(ST_VoronoiPolygons(ST_Collect(geom)))).geom::Geometry(Polygon, 4326) AS geom
FROM (SELECT region,
             (ST_Dump(ST_Centroid(ST_Collect(geom)))).geom::Geometry(Point, 4326) AS geom
      FROM starbucks
      GROUP BY 1) region_centers;

postgis_regions_04

As you can see, you get some weird artifacts that cover a lot of the ocean. Let's intersect the result with the outline of the United States. I'm also going to materialize the Voronoi polygons into a table with a spatial index to speed up performance; joining on ST_Intersects (rather than a cross join) lets the intersection query actually use that index:

CREATE TABLE starbucks_voronoi AS
SELECT (ST_Dump(ST_VoronoiPolygons(ST_Collect(centroid_geom)))).geom::Geometry(Polygon, 4326) AS geom
FROM (SELECT region,
             (ST_Dump(ST_Centroid(ST_Collect(geom)))).geom::Geometry(Point, 4326) AS centroid_geom
      FROM starbucks
      GROUP BY 1) region_centers;

CREATE INDEX starbucks_voronoi_gix
ON starbucks_voronoi USING GIST (geom);

SELECT (ST_Dump(ST_Intersection(a.geom, b.geom))).geom::Geometry(Polygon, 4326) AS geom
FROM tl_2018_us_state a
INNER JOIN starbucks_voronoi b
ON ST_Intersects(a.geom, b.geom);

You should get something that looks like this:

postgis_regions_05

Bringing it all together

Let’s tie our region numbers back to the new custom regions we created:

postgis_regions_06
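
If you'd rather attach the region numbers in SQL instead of in Tableau, here's a sketch that reuses the region-centroid subquery from earlier. It works because each Voronoi cell contains exactly one of the seed centroids:

select c.region, v.geom
from starbucks_voronoi v
inner join (select region,
                   (ST_Dump(ST_Centroid(ST_Collect(geom)))).geom::Geometry(Point, 4326) as geom
            from starbucks
            group by 1) c
on ST_Within(c.geom, v.geom);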

The results aren't the prettiest, since not every “sub” region is populated with a Starbucks location.

Now adding our Starbucks locations back in:

postgis_regions_08

postgis_regions_07

 

That’s it. I’ll work on cleaning this up (including getting rid of the state outlines), but that gives you an idea of how to do it.

Happy mapping!

-Paul


Common Geospatial Tasks using PostGIS

Back in January, I gave a quick introduction on how to get started with PostGIS.

In this post, I'm going to cover simple geospatial and geoprocessing tasks in PostGIS. With native support coming in Tableau 2019.2, I thought this would be a great opportunity to show off the power of PostGIS.

A little background

When I first got into geospatial work 10+ years ago, practically the only software on the market was ESRI's ArcMap. In the last few years, with the influx of open-source software, I have transitioned about 90% of my processes away from ESRI to either QGIS or a database system.

When you deal with large datasets, you start to realize that ArcMap has performance issues with spatial joins and other geoprocessing tasks.

Getting Started

To get started, let’s look at some national park location data. I found a dataset on Kaggle that has the necessary location data.

First, you need to set up the table structure in PostGIS. I named the table national_parks and here’s how I set up the table:

postgis_2_02

Next, we have to load the data. Under Tables, find your national_parks table, right-click on it, and follow the prompts. After you import the data, the geom field will be empty.

To fill that geometry column in, we just have to run a simple update statement.

update national_parks set geom = ST_SetSRID(ST_MakePoint("Longitude", "Latitude"), 4326);

Now we have data in our database and we can start doing some basic PostGIS calculations.

Find places within x Distance

This is one of the most common calculations in spatial analysis: which places are within x distance of a given location? With PostGIS, it's just a simple query. Let's find the national parks within 1,000 miles of Pittsburgh.

SELECT a."Park_Name",
       a."Latitude",
       a."Longitude",
       ST_DistanceSphere('SRID=4326;POINT(-80.0505401 40.431478)'::geometry, geom) / 1609.344 AS st_distancesphere_in_miles
FROM public.national_parks a
WHERE ST_DistanceSphere('SRID=4326;POINT(-80.0505401 40.431478)'::geometry, geom) / 1609.344 <= 1000
ORDER BY st_distancesphere_in_miles;

The key part of this query is the ST_DistanceSphere function, which calculates the distance in meters between two points; dividing by 1609.344 converts meters to miles.
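
One caveat: calling ST_DistanceSphere in the WHERE clause means scanning every row. If the table were large, a sketch like the following, using ST_DWithin on a geography cast (which can take advantage of a geography index), would be the faster route:

SELECT a."Park_Name"
FROM public.national_parks a
WHERE ST_DWithin(a.geom::geography,
                 'SRID=4326;POINT(-80.0505401 40.431478)'::geography,
                 1000 * 1609.344);  -- 1,000 miles expressed in meters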

Point in Polygon

Let's say you want to know which state each national park falls in. In the previous post, I showed how to import a shapefile; now we are going to use a spatial join to find which state each park's point falls in.

SELECT a."Park_Name", a."State", b.stusps, b.name
FROM public.national_parks a
INNER JOIN public.tl_2018_us_state b
ON ST_Within(a.geom, b.geom);

The ST_Within function is the important part of this query: when a park's point falls within a state's polygon, the join matches the park to that state; otherwise, the row is dropped.
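
As a quick extension, the same spatial join can summarize parks per state:

SELECT b.stusps, count(*) AS park_count
FROM public.national_parks a
INNER JOIN public.tl_2018_us_state b
ON ST_Within(a.geom, b.geom)
GROUP BY b.stusps
ORDER BY park_count DESC;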

Generate Random Points

Another cool thing you can do with PostGIS is to create random points within a state, county, block group, etc. The function is fairly simple:

SELECT stusps, name, geom,
       (ST_Dump(ST_GeneratePoints(geom, 1000))).geom::Geometry(Point, 4326) AS geom_gen_points
FROM tl_2018_us_state
WHERE stusps = 'PA';

What does this look like? (A little sneak peek at the Tableau 2019.2 beta)

postgis_2_03

postgis_2_04

Or throwing them together on a dual-axis:

postgis_2_05

Why would you want to do this? A few reasons. Let's say you wanted to create a dot density map, like Sarah Battersby did back in May 2018.

Or let's say you have a 3-mile ring around a point and you want to find the demographic attributes of that area. You could generate points within the ring, dump them, join each one to its respective block group, and then compute the weighted demographics of that 3-mile ring.
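
Here's a rough sketch of that last idea. Note that the block_groups table and its geoid column are hypothetical placeholders, not part of the datasets used above:

-- Generate 1,000 points inside a 3-mile ring and weight block groups
-- by the share of points that land in each one.
WITH ring AS (
  SELECT ST_Buffer('SRID=4326;POINT(-80.0505401 40.431478)'::geography,
                   3 * 1609.344)::geometry AS geom
),
pts AS (
  SELECT (ST_Dump(ST_GeneratePoints(geom, 1000))).geom AS geom
  FROM ring
)
SELECT bg.geoid,
       count(*) / 1000.0 AS weight  -- share of the ring in this block group
FROM pts
INNER JOIN block_groups bg
ON ST_Within(pts.geom, bg.geom)
GROUP BY bg.geoid;

Multiply each block group's demographic attributes by its weight and sum, and you have the weighted demographics of the ring.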

In conclusion

This just scratches the surface of what PostGIS can do. There are countless more use cases, such as convex and concave hulls, intersections, and length calculations. Maybe I'll get into those next time.
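
For a quick taste before then, a convex hull around all of the parks is a one-liner against the same national_parks table:

SELECT ST_ConvexHull(ST_Collect(geom)) AS geom
FROM public.national_parks;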

-Paul

Getting Started with analyzing NFL play-by-play data using nflscrapR

Hello again!

We’re going to talk about analyzing NFL play by play data using the R package nflscrapR.

It was not long ago that NFL play-by-play data was hard to find, and if you did find it, it took a lot of cleanup before you could even analyze it. Now, thanks to Maksim Horowitz, Ron Yurko, and Sam Ventura, analyzing NFL data is easier than ever.

Getting started

If you don’t have R, you need to first download and install it. I prefer the Microsoft Open version of R mainly because of the multi-threaded math libraries.

Once you've downloaded that, you need an IDE (integrated development environment). I prefer RStudio.

Once you have those downloaded and installed, open up RStudio.

To install nflscrapR, you first need to install devtools:

install.packages("devtools")
devtools::install_github(repo = "maksimhorowitz/nflscrapR")

Once that's installed, you can start pulling down some data. Let's look at Super Bowl 53. The scrape_season_play_by_play function will take a while, so you might want to walk away (or do something else) while it runs.

library(nflscrapR)
library(sqldf)

pbp_2018 <- scrape_season_play_by_play(2018, type = "post")

sb_53 <- sqldf("select * from pbp_2018 where game_id = '2019020300'")

# Get rid of null or missing win probability rows
sb_53 <- sqldf("select * from sb_53 where home_wp is not null and away_wp is not null")

At this point, you could visualize it in R using base graphics (ugh) or ggplot2, but I'm going to export it and throw it into Tableau.

To export sb_53, you can either use the write.csv command or, since Tableau supports statistical file formats, use R's RData format.

save(sb_53, file = "sb_53.RData")

This will save it to your default working directory (mine is Documents), but you can specify a different location by adding the path in front of the file name.

Visualizing in Tableau

Open up Tableau and click on Statistical file and choose your sb_53.RData file. Click on Sheet 1 to get started.

Let’s look at some win probability data. We need to clean it up a little.

First, make Game Seconds Remaining a Dimension.

1.) Drag Game Seconds Remaining to Columns and make it Continuous.

2.) Drag Home Wp to Rows. Right click -> Measure -> Maximum.

3.) If it’s not already a line under marks, make it a line.

4.) Right click on the Game Seconds Remaining axis and click Reversed.

You should see something like this:

nflscrapR_01

5.) Drag Away Wp to Rows. Right click -> Measure -> Maximum.

6.) Right click on Away Wp in the Rows shelf and click Dual Axis. Then right click on the Away Wp axis and choose Synchronize Axis.

nflscrapR_02

The rest is just formatting.

nflscrapR_03

Here’s a link to the workbook on Tableau Public if you want to see how I formatted it.

This is just scratching the surface with what’s possible with nflscrapR, but this will get you started.

Reach out with any questions!

-Paul

 

What is Expected Value?

Welcome back friends!

Today we are going to talk about another probability and statistics concept called Expected Value.

Expected value is what you think it is: the return you can expect given the data or knowledge you already have.

Investors use expected value all the time. They make decisions to buy/sell a stock based on what the expected value is for that stock.

Without thinking about it, we intuitively use expected value when we make everyday decisions. We factor in the benefits and risks of a decision; if it has a positive expected value, we usually favor it, and otherwise we are against it.

For example, when we take on a new project (or a blog post, in this case), we view the expected value in terms of personal development and other career benefits as higher than the cost in terms of time and/or sanity.

Likewise, anyone who reads a lot knows that most books they choose will have minimal impact on them, while a few books will change their lives and be of tremendous value.

Looking at the required time and money as a cost, reading books has a positive expected value.

Back to math

Expected value tells us what we think the long-term average will be after adding many more trials or records.

For example, if we flip a quarter 10 times, the result probably won't be 50/50. If you flip it 100 times, it still probably won't be 50/50, but it will be closer. The more you flip the coin, the closer you get to the 50/50 expected value of coin flipping.

Another great example of this is rolling a die. As we increase the sample size, we see that the probability of rolling any value approaches 1/6.

Notation

I had to throw in some notation here for completeness. For discrete (non-continuous) data, it's pretty easy: the expected value is just the weighted average of the possible values, weighted by their respective probabilities.

μ = E(X) = ∑[x·P(x)]

where

μ = mean

E(X) = expected value

x = an outcome

P(x) = probability of that outcome
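
For example, the expected value of a single roll of a fair die is:

E(X) = 1·(1/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) + 5·(1/6) + 6·(1/6) = 21/6 = 3.5

Note that 3.5 isn't a value you can actually roll: the expected value is a long-run average, not a guaranteed outcome.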

Here's a longer worked example. A company makes electronic gadgets. One out of every 50 gadgets is faulty, but the company doesn't know which ones are faulty until a buyer complains. Suppose the company makes a $3 profit on the sale of any working gadget but loses $80 for every faulty gadget because it has to repair the unit.

E(X) = (49/50)·3 + (1/50)·(-80)
     = 147/50 - 80/50
     = 67/50
     = 1.34

The expected value is $1.34 on every gadget made, and since it's positive, we can expect the company to make a profit.

Finding the expected value of a continuous variable (like one from a normal distribution) is a little more complex and involves calculus. We'll get into that later.

That’s all for now!

-Paul

10X Your Life

Never reduce a target. Instead, increase actions. When you start rethinking your targets, making up excuses, and letting yourself off the hook, you are giving up on your dreams! — Grant Cardone

A book I just finished reading is The 10X Rule: The Only Difference Between Success and Failure by Grant Cardone.


The premise is simple: success is your duty and obligation, and the only way to get there is to take massive amounts of action. By nature, he contends, we don't set our goals high enough, because we are taught early in life to set obtainable goals. Where does that get us? We typically set goals where we are comfortable, and that doesn't help us grow.

I found Cardone’s advice to be both simple and life-changing with actionable measures that could be applied to all facets of life.

I became so fired up and motivated as I listened to The 10X Rule that I started setting goals, multiplying them, better utilizing my time, dreaming bigger, and achieving more.

Key Concepts of the 10X Rule

Because I highly recommend this book for anyone looking to change their mindset and increase their success and position in life, I will only provide a summary.

Why?

In order to feel the full impact, you should read The 10X Rule for yourself. Here is a sneak peek:

Set bigger goals

The majority of us get stuck at normal levels of output because we don’t set high enough goals. We set easily attainable goals, reach them, yet we still feel unsatisfied. What if we multiplied our goals by 10X? For example, if you planned to produce one new original viz or learn one new technique in Tableau, you instead set a goal of 10 new vizzes or new techniques. Set bigger goals — your marriage goals, your workout goals, your Tableau goals and more. Setting higher targets for yourself will yield greater results.

Take massive action

Goals require follow through. Create 10X the vizzes you planned to produce, write 10X the blog posts you planned to write. Set high enough goals then take massive action to fulfill your true potential. Don’t let setbacks stop you, view them only as obstacles that you will overcome.

Don’t be average and change your thinking

In order to achieve greatness — it could be a better marriage, running a marathon, becoming a Zen master — you need to believe that 10X is possible. Your thoughts and actions are the real reason you are where you are right now. Success is not something that happens to you; it's something that happens because of you.

Don’t keep limiting yourself, know that 10Xing your life is possible and go after your goals!!

Application in my life

Like I said earlier, I started applying this at work and in my life in general. One of the great chapters in the book covers the 32 traits that separate successful people from unsuccessful ones.

Many of those I took to heart, such as “Have a Can-Do Attitude,” “Love Challenges,” and “Be Uncomfortable.” But the one I liked the most was “Commit first and figure out the details later.”

Recently, I was approached with a problem at work. I had no clue how I was going to solve it or how I was going to get the data to solve it. But I committed. I didn't care. I eventually figured out the solution, and hopefully it leads to a great business opportunity.

In conclusion, I loved Cardone's direct style, and this book really hit home. I highly recommend it.

Cardone believes that we are all capable of bigger thinking, setting higher targets, taking massive action, and realizing our full potential. No matter our background, where we started, or where we came from, we all have the ability and the choice to believe that 10Xing our life is possible. Although you may face adversity, Cardone encourages you to keep going. Success is not only possible, it's also your duty, obligation, and responsibility.

-Paul

Introduction to Probability and an application in Tableau

In business we tend to focus a lot on statistics — mostly summary statistics that describe a dataset. What’s the average? What’s the minimum? What’s the maximum?

Looking at the Tableau forums and on Twitter, there are a lot of questions about how to calculate x statistic. Or how do I calculate a confidence interval? How do I calculate an error bar? How do I fit a linear model/trend?

While all these applications are important, what might get lost is the foundation: probability. Almost every statistical model is based on probability, specifically conditional probability.

In this post, we'll walk through conditional probability and how to use and understand it in the context of Tableau.

What is probability?

The best way to understand probability is by flipping a fair coin. There are two possible outcomes: heads or tails. What's the probability of the coin landing heads? You may know intuitively that it's 50%. But how does that work out in math? We can define probability as

Probability = (# of ways it can happen) / (total number of outcomes)

In this coin-flipping example, we see that the probability of heads is P(H) = 1/2 = 50%.

In a more formal definition, the “# of ways it can happen” is called an event, and the “total number of outcomes” is called the sample space.

Terminology

Sample Space: The sample space is the set of all possible outcomes.

Examples: 
Rolling a six-sided die: {1,2,3,4,5,6}
The customer segments in Sample – Superstore {Consumer, Corporate, Home Office}

Event: An event is a subset of the sample space.

Examples:
Roll a die: the outcome is even {2, 4, 6}.
Pick one of the customer segments in Sample – Superstore at random {Corporate}

Union, Intersection and Mutually Exclusive Events

A union is another way of saying OR. How many customers buy Furniture OR Office Supplies?


An intersection is another way of saying AND. How many customers buy Furniture AND Office Supplies?


Mutually exclusive events (also called disjoint events) are events that can't happen at the same time. The easiest example is the toss of a coin: it can land on heads or tails, but it can't be both.
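
A handy consequence: for mutually exclusive events, the probabilities simply add.

P(heads OR tails) = P(heads) + P(tails) = 1/2 + 1/2 = 1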

Independent and Dependent Events

Independent Events

Two events are said to be independent of each other if the outcome of one doesn't affect the outcome of the other.

A few examples:

1.) Getting a heads after tossing a fair coin AND getting a 3 after throwing a die.

2.) Choosing a marble from a jar AND landing on heads after tossing a fair coin.

3.) The probability of it raining today AND the probability of my garbage being collected today. (Garbage will be collected regardless)

4.) And for my #SportsVizSunday brothers, the chances of your favorite team winning today AND you wearing your lucky jersey.

(Independence is one of the key foundations of linear regression, but we won't get into that now...)
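
For independent events, the probabilities multiply: P(A AND B) = P(A) · P(B). Using example 1 above:

P(heads AND a 3) = 1/2 · 1/6 = 1/12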

Dependent Events

Two events are said to be dependent if the probability of one event affects the likelihood of another event. A good example of this would be drawing 2 cards from a deck of 52 playing cards.

If on your first draw you had an ace and you put that aside, the probability of drawing an ace on the second draw has changed because you drew an ace the first time. What does that calculation look like?

There are 4 aces in a deck of 52 cards.

On your first draw, the probability of getting an ace is given by:

P(ace on 1st draw) = 4/52 = 1/13 ≈ 7.7%

If we don't return this card to the deck, the probability of drawing an ace on the second pick is given by:

P(ace on 2nd draw | ace on 1st draw) = 3/51 = 1/17 ≈ 5.9%

As you can clearly see, the above two probabilities are different, so we say that the two events are dependent. The likelihood of the second event depends on what happens in the first event.

Conditional Probability

We have already defined dependent and independent events and seen how probability of one event relates to the probability of the other event.

Having those concepts in mind, we can now look at conditional probability.

Conditional probability further defines the dependence of events by looking at the probability of an event given that some other event has already occurred.

Conditional probability is written P(B|A), read as “the probability of B given A,” and is defined as:

P(B|A) = P(A and B) / P(A)

So why is all this important? And how does it apply to Tableau?

Bringing it back to Tableau

Believe it or not, the Superstore dataset is full of probabilities and independent and dependent events. Let's explore some. Open up Tableau and the Superstore dataset.

Example 1

Drag Customer Name to Rows and drag Profit to Text (the Abc field), then sort by profit. Let's assume that the customers don't know each other. Does the profit amount for Tamara Chand impact the profit amount for Hunter Lopez? If they don't know each other, the answer is no, and these would be considered independent events.

probability_capture1

Example 2

In retail, we often want to know which products are most likely to be bought with other products, usually to determine which products we can cross-sell. This is a form of conditional probability called market basket analysis.

To keep it simple with our Superstore dataset, we are trying to find the probability of an Office Supplies product appearing in a random Furniture order.

1.) First, we need a second copy of Orders. Click on the data source tab, drag a second Orders sheet onto the pane, and create a self-join on Orders where Order ID = Order ID1. We need to figure out how many distinct orders there are for each category.

probability_capture2

 

2.) Create a new calculated field called “Distinct Orders by Category” using a FIXED LOD calculation: {FIXED [Category] : COUNTD([Order ID])}

probability_capture3

3.) Next, we need to find the probability (the percentage) that each category appears with every other category. Create a new calculated field called “Percentage of Orders” with the following calculation:

COUNTD([Order ID]) / MIN([Distinct Orders by Category])
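
A quick note on the MIN() here: the FIXED LOD returns the same denominator for every row in a category, so wrapping it in MIN() (MAX() would work just as well) simply collapses that repeated value into a single aggregate so it can be divided into COUNTD.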

probability_capture4

4.) Now create the viz. Drag Category (under the product hierarchy) to Rows and Category (Orders1) to Columns.

5.) Drag the Percentage of Orders calculation to Text (the Abc mark). You now have a matrix of categories and the probability of each category appearing in an order with another.

probability_capture5

Read the matrix from left to right: the distinct order counts run along the rows, so Office Supplies appears in 54% of all Furniture orders while, conversely, Furniture appears in only 25% of all Office Supplies orders.

Whether you realized it or not, we just calculated a conditional probability on categories in the Superstore dataset.

In closing

I hope you learned a little about probability in this post and how to apply it in Tableau. In the next post, we will build on these foundations and discuss Bayes' theorem.

That’s it for now!

-Paul

Where Should I Put My Filters? – Part 2

Welcome back! It’s been a while since Part 1 of this series. But after getting distracted with some set actions and PostGIS, I’m finally looping back around to filters.

Special thanks to Kevin Flerlage for his help and guidance on this post!

The last time I discussed filters, we were talking about putting filters on the top, left and right of your dashboard:

filters_screenshot2

But what if you want to hide them? What if they are taking up valuable real estate or are simply an eye sore? Well, there’s a solution.

There are two ways to solve this problem. The first is to use collapsible containers. The second is to use dashboard actions.

Collapsible containers, made famous by Robert Rouse, are the harder of the two solutions, but I'm still going to write a blog post in the future that covers them. Since dashboard buttons just came out in Tableau 2018.3, we're going to talk about that method instead.

A slight introduction

We're going to build something in Tableau called a hamburger menu. These are mostly found on websites and mobile apps, and they get the name because the icon of three stacked lines resembles a hamburger.

Usually, the icon hides a menu or navigation pane.

Getting started

Let’s get back to our Superstore dashboard. You can download it here if you want to follow along.

1.) Take the starter and duplicate it.

2.) On the first dashboard, remove the container at the top that holds all the filters, along with the filters logo. Some things may shift a little, so you might have to move things around to your liking. You should be left with something like this:

dashboard actions_1

3.) Find a hamburger icon image and save it somewhere on your computer.

4.) Take a Button object from the Objects menu on the left side of the dashboard pane and drag it to the left of the title.

5.) Select the down triangle, and select Edit.

6.) Under Navigate to, select your second dashboard with filters.

7.) Under Image, find your hamburger icon. If you want to put a tooltip on your hamburger, you can do that too. When you are done, click OK.

Second dashboard

On the second dashboard, we will do a similar process.

1.) Find an X icon image and save it somewhere on your computer.

2.) Take a Button object from the Objects menu on the left side of the dashboard pane and drag it to the left of the filter icon. (You can really place it anywhere you want.)

3.) Select the down triangle, and select Edit.

4.) Under Navigate to, select your first dashboard.

5.) Under Image, find your X icon. If you want to put a tooltip on your X, you can do that too. When you are done, click OK.

That’s it. Here’s the workbook for you to download and see what else I did.

The pro: This is much easier to do than collapsing and hiding containers.

The con: You have to maintain 2 dashboards.

That’s all for now!

Until next time,

Paul