Common Geospatial Tasks using PostGIS

Back in January, I gave a quick introduction on how to get started with PostGIS

In this post, I’m going to talk about how to do simple geospatial or geoprocessing tasks in PostGIS. With the upcoming native support coming in Tableau, I thought this would be a great opportunity to show the power and opportunity of PostGIS.

A little background

When I first got into geospatial 10+ years ago, the only software on the market was ESRI’s ArcMap. In the last few years, with the influx of open-source software, I have transitioned about 90% of my processes away from ESRI and to either QGIS or a database system.

When you are dealing with large datasets, you start to realize that ArcMap has performance issues when it comes to spatial joins and other geoprocessing tasks.

Getting Started

To get started, let’s look at some national park location data. I found a dataset on Kaggle that has the necessary location data.

First, you need to set up the table structure in PostGIS. I named the table national_parks and here’s how I set up the table:

postgis_2_02

Next, we have to load the data. Under tables, find your national_parks table and right click on it and follow the prompts. You after you import the data, the geom field will be empty.

To fill that geometry column in, we just have to run a simple update statement.

update national_parks set geom=st_SetSrid(st_MakePoint(“Longitude”, “Latitude”), 4326);

Now we have data in our database and we can start doing some basic PostGIS calculations.

Find places within x Distance

This is one of the most common calculations when it comes to analyzing data. We always want to know which places are within x location. With PostGIS this is pretty easy, it’s just a simple query. Let’s find the National parks within 1000 miles of Pittsburgh.

SELECT a.”Park_Name”,
a.”Latitude”,
a.”Longitude”,
ST_DistanceSphere(‘SRID=4326;POINT(-80.0505401 40.431478)’::geometry,geom)/1609.344 as st_distanceSphere_in_miles
from public.national_parks a
where ST_DistanceSphere(‘SRID=4326;POINT(-80.0505401 40.431478)’::geometry,geom)/1609.344 <= 1000
order by st_distanceSphere_in_miles

The key part of this query is the ST_DistanceSphere function. This function calculates the distance in meters between two points. The 1609.344 is to convert meters to miles.

Point in Polygon

Let’s say that you are questioning which state each national park falls in. From the previous post, I showed how to import a shapefile, now we are going to do a special join to find which state the centroid falls in.

SELECT a.”Park_Name”, a.”State”, b.stusps, b.name
from public.national_parks a
inner join public.tl_2018_us_state b
on ST_WITHIN(a.geom, b.geom)

The ST_WITHIN function is the important part of this query. ST_WITHIN says when the point is within the state, then join the point with the state it falls in, else don’t.

Generate Random Points

Another cool thing you can do with PostGIS is to create random points within a state, county, block group, etc. The function is fairly simple:

SELECT stusps, name, geom, (ST_Dump(ST_GeneratePoints(geom, 1000))).geom::Geometry(point, 4326) AS geom_gen_points
FROM tl_2018_us_state
where stusps = ‘PA’

What does this look like? (A little sneak peak to the Tableau 2019.2 Beta)

postgis_2_03postgis_2_04

Or throwing them together on a dual-axis:

postgis_2_05

Why would you want to do this? A few reasons. Let’s say you wanted to create a dot density map, like Sarah Battersby did back in May 2018:

Or let’s say you have a 3 mile ring around a point, and you want to find the demographic attributes of that area. You could easily generate points, dump them, join them to the respective block group, etc, and then easily find the weighted demographics of that 3 mile ring.

In conclusion

This just scratches the surface of all the opportunity PostGIS. There are countless more use cases of PostGIS such as convex and concave hulls, and intersects and length. Maybe I’ll get into those next time.

-Paul

Advertisements

Getting Started with analyzing NFL play-by-play data using nflscrapR

Hello again!

We’re going to talk about analyzing NFL play by play data using the R package nflscrapR.

It was not long ago that NFL play-by-play data was hard to find, and if you did find it, it took a lot of clean up before you could even analyze it. Now, thanks to
Maksim Horowitz, Ron Yurko and Sam Ventura, analyzing NFL data is easier than ever.

Getting started

If you don’t have R, you need to first download and install it. I prefer the Microsoft Open version of R mainly because of the multi-threaded math libraries.

Once you download that you need an IDE (Integrated development environment). I prefer RStudio.

Once you have those downloaded and installed, open up RStudio.

To install nflscrapR, you need to install devtools before you are able to install the nflscrapR library.

install.packages("devtools")
devtools::install_github(repo = "maksimhorowitz/nflscrapR")

Once that’s installed, you can start pulling down some data. Let’s look at Super Bowl 53. The scrape_season_play_by_play will take a while. You might want to walk away for a while (or do something else) while it loads.

library(nflscrapR)

library(sqldf)

pbp_2018 <- scrape_season_play_by_play(2018, type = “post”)

sb_53 <- sqldf(“select * from pbp_2018 where game_id = ‘2019020300’”)

#Get rid of null or missing win probability rows

sb_53 <- sqldf(“select * from sb_53 where home_wp is not null and away_wp is not null”)

At this point, you can either visualize it in R using base graphics (ugh) or using ggplot2. But I’m going to export it and throw it into Tableau.

To export out the sb_53, you can either use the write.csv command, or since Tableau supports statistical file formats, we can use R’s RData format.

save(sb_53, file = “sb_53.RData”)

This will save it to your default directory (mine is Documents), but you can specify a directory by adding the path in front of your file name.

Visualizing in Tableau

Open up Tableau and click on Statistical file and choose your sb_53.RData file. Click on Sheet 1 to get started.

Let’s look at some win probability data. We need to clean it up a little.

First, make Game Seconds Remaining a Dimension.

1.) Drag Game Seconds Remaining to Columns and make it Continuous.

2.) Drag Home Wp to Rows. Right click -> Measure -> Maximum.

3.) If it’s not already a line under marks, make it a line.

4.) Right click on the Game Seconds Remaining axis and click Reversed.

You should see something like this:

nflscrapR_01

5.) Drag Away Wp to Rows. Right click -> Measure -> Maximum.

6.) Right click on Away Wp in your rows field, click Dual Axis. Then Right click on the Away Wp and synchronize the axes.

nflscrapR_02

The rest is just formatting.

nflscrapR_03

Here’s a link to the workbook on Tableau Public if you want to see how I formatted it.

This is just scratching the surface with what’s possible with nflscrapR, but this will get you started.

Reach out with any questions!

-Paul

 

What is Expected Value?

Welcome back friends!

Today we are going to talk about another probability and statistics concept called Expected Value.

Expected value is what you think it is: the return you can expect given the data or knowledge you already have.

Investors use expected value all the time. They make decisions to buy/sell a stock based on what the expected value is for that stock.

Without thinking about it, we intuitively use expected value when we make everyday decisions. We factor in the benefits and risks of a decision and if it has a positive expected value, we usually are in favor of that decision, else we are against it.

For example, when we take on a new project or (a blog post in this case), we view the expected value in terms of personal development and other career benefits as higher than the cost in terms of time and/or sanity.

Likewise, anyone who reads a lot knows that most books they choose will have minimal impact on them, while a few books will change their lives and be of tremendous value.

Looking at the required time and money as a cost, reading books has a positive expected value.

Back to math

Expected value informs us what we think “the long-term” average will be after adding many more trials or records.

For example, if we flip a quarter 10 times, it’s probably not going to be 50/50. If you flip it 100 times, it’s still probably not going to be 50/50, but closer. But the more and more you flip the coin, the closer you will get to the 50/50 expected value of coin flipping.

Another great example of this is rolling a die. As we increase the sample size, we see that the probability of rolling any value approaches 1/6.

Notation

I had to throw in some notation here for completeness. For discrete data (non-continuous), it’s pretty easy. It’s just the weighted average of the possible values and their respective probabilities.

μ = E(X) = ∑[x·P(x)]

where

μ = mean

E(X) = expected value

x = an outcome

P(x) = probability of that outcome

A company makes electronic gadgets. One out of every 50 gadgets is faulty, but the company doesn’t know which ones are faulty until a buyer complains. Suppose the company makes a $3 profit on the sale of any working gadget, but suffers a loss of $80 for every faulty gadget because they have to repair the unit.

E(X) = 49/50 • 3 + 1/50 • (-80)

= 147/50 – 80/50

= 67/50

= 1.34

The expected value is $1.34 on every gadget made, and since its positive, we can expect the company to make a profit.

Finding the expected value of a continuous variable – like one from a normal distribution – is a little more complex involving calculus. We’ll get into that later.

That’s all for now!

-Paul

10X Your Life

Never reduce a target. Instead, increase actions. When you start rethinking your targets, making up excuses, and letting yourself off the hook, you are giving up on your dreams! — Grant Cardone

A book I just finished reading is the 10X Rule: The Only Difference Between Success and Failure by Grant Cardone.

41wkgwtxo7l._sx329_bo1204203200_

The premise is simple: Success is your duty and obligation and the only way to get there is to take massive amounts of action. By nature, he contends, we don’t set our goals high enough, because we are taught early in life to set obtainable goals. Where does that get us? We typically set goals where we are comfortable, and that doesn’t help us grow.

I found Cardone’s advice to be both simple and life-changing with actionable measures that could be applied to all facets of life.

I became so fired up and motivated as I listened to The 10X Rule, that I started setting goals, multiplying them, better utilizing my time, dreaming bigger and achieving more.

Key Concepts of the 10X Rule

Because I highly recommend this book for anyone looking to change their mindset and increase success, and their position in life, I will only provide a summary.

Why?

In order to feel the full impact, you should read The 10X Rule for yourself. Here is a sneak peek:

Set bigger goals

The majority of us get stuck at normal levels of output because we don’t set high enough goals. We set easily attainable goals, reach them, yet we still feel unsatisfied. What if we multiplied our goals by 10X? For example, if you planned to produce one new original viz or learn one new technique in Tableau, you instead set a goal of 10 new vizzes or new techniques. Set bigger goals — your marriage goals, your workout goals, your Tableau goals and more. Setting higher targets for yourself will yield greater results.

Take massive action

Goals require follow through. Create 10X the vizzes you planned to produce, write 10X the blog posts you planned to write. Set high enough goals then take massive action to fulfill your true potential. Don’t let setbacks stop you, view them only as obstacles that you will overcome.

Don’t be average and change your thinking

In order to achieve greatness — it could be a better marriage, running a marathon, becoming a Zen Master — you need to believe that 10X is possible. Your thoughts and actions are the real reason you are where you are right now. Success is not something that doesn’t happen to you, it’s something that happens because of you.

Don’t keep limiting yourself, know that 10Xing your life is possible and go after your goals!!

Application in my life

Like I said earlier, I started applying this at work and in my life in general. One of the great chapters in his book is one on the 32 differences or traits between successful and unsuccessful people.

Many of those I took to heart, such as “having a can do attitude”, “Love Challenges” and “Be Uncomfortable.” But the one I liked the most was “Commit first and figure out the details later.”

Recently, there was was a problem I was approached with at work. I had no clue on how I was going to figure it out or how I was going to get the data to figure it out. But I committed. I didn’t care. I eventually figured it the solution and hopefully it leads to a great business opportunity.

In conclusion, I loved Cardone’s direct style and this really hit home. I highly recommend this book.

Cardone believes that we are all capable of bigger thinking, setting higher targets, taking massive action and realizing our full potential. No matter our background, where we started or where we came from, we all have the ability and choice to believe that 10Xing our life is possible. Although you may face adversity,  Cardone encourages you to keep going. Success is not only possible, it’s also your duty, obligation and responsibility.

-Paul

Introduction to Probability and an application in Tableau

In business we tend to focus a lot on statistics — mostly summary statistics that describe a dataset. What’s the average? What’s the minimum? What’s the maximum?

Looking at the Tableau forums and on Twitter, there are a lot of questions about how to calculate x statistic. Or how do I calculate a confidence interval? How do I calculate an error bar? How do I fit a linear model/trend?

While all these applications are important, what might get lost is the foundation: probability. Almost every statistical model is based on probability, specifically conditional probability.

In this post, we’ll walk you through conditional probability and how to use and understand them in the context of Tableau.

What is probability?

The best way to understand probability is by flipping a fair-sided coin. There are two possible outcomes: heads or tails. What’s the probability of the coin landing heads? You may know intuitively that it’s 50%. But how does that work out in math? We can define probability as

Probability = (# of ways it can happen) / (total number of outcomes)

In this flipping of the coin example, we see that the Probability of Heads P(H) = ½ = 50%.

In a more formal definition, the # of ways it can happen is called an event and total number of outcomes is called the sample space.

Terminology

Sample Space: The sample space is the set of all possible outcomes.

Examples: 
Rolling a six-sided die: {1,2,3,4,5,6}
The customer segments in Sample – Superstore {Consumer, Corporate, Home Office}

Event: An event is a subset of the sample space.

Examples:
Roll a die: the outcome is even {2, 4, 6}.
Pick one of the customer segments in Sample – Superstore at random {Corporate}

Union, Intersection and Mutually Exclusive Events

union is another way of saying OR. How many customers buy Furniture OR Office Supplies?

sgpfw

An intersection is another way of saying AND. How many customers buy Furniture AND Office Supplies?

sgpfw-1

Mutually exclusive events (also called disjoint events) is when two or more events can’t happen at the same time. Easiest explanation of this is the toss of a coin. It can either be heads or tails but it can’t be both.

Independent and Dependent Events

Independent Events

Two events are said to be independent of each other if their outcomes don’t affect each other.

A few examples:

1.) Getting a heads after tossing a fair coin AND getting a 3 after throwing a die.

2.) Choosing a marble from a jar AND landing on heads after tossing a fair coin.

3.) The probability of it raining today AND the probability of my garbage being collected today. (Garbage will be collected regardless)

4.) And for my #SportsVizSunday brothers, the chances of your favorite team winning today AND you wearing your lucky jersey.

(Independent events is one of the key foundations of linear regression, but we won’t get into that now…)

Dependent Events

Two events are said to be dependent if the probability of one event affects the likelihood of another event. A good example of this would be drawing 2 cards from a deck of 52 playing cards.

If on your first draw you had an ace and you put that aside, the probability of drawing an ace on the second draw has changed because you drew an ace the first time. What does that calculation look like?

There are 4 Aces in a deck of 52 cards

On your first draw, the probability of getting an ace is given by:

If we don’t return this card into the deck, the probability of drawing an ace on the second pick is given by

As you can clearly see, the above two probabilities are different, so we say that the two events are dependent. The likelihood of the second event depends on what happens in the first event.

Conditional Probability

We have already defined dependent and independent events and seen how probability of one event relates to the probability of the other event.

Having those concepts in mind, we can now look at conditional probability.

Conditional probability deals with further defining dependence of events by looking at probability of an event given that some other event first occurs.

Conditional probability is denoted by the following:

So why is all this important? And how does it apply to Tableau?

Bringing it back to Tableau

Believe it or not, the superstore dataset is full of probabilities and independent and dependent events. Let’s explore some. Open up Tableau and the superstore dataset.

Example 1

Drag customer to rows and drag Profit to the Abc field. Do a sort by profit. Let’s assume that the customers don’t know each other. Does the profit amount for Tamara Chand impact the profit amount for Hunter Lopez? If they don’t know each other, the answer is no, and these would be considered independent events.

probability_capture1

Example 2

A lot of times in retail we want to know which products are most likely bought with other products. A lot of times this is to determine which products we can cross-sell together. This is a form of conditional probability called market basket analysis.

To keep it simple, in our Superstore dataset, we are trying to find what is the probability of a product from office supplies appearing in a random furniture order?
First we need to get a second copy of orders. Click on data source and do a drag a second orders sheet to the pane. Do a self join on orders where OrderID = OrderID1
We need to figure out how many distinct orders there are for each category.

probability_capture2

 

2.) Create a new calculated field called “Distinct Orders by Category” and we are going to use a Fixed LOD calc. : {fixed [Category]: COUNTD([Order ID])}

probability_capture3

3.) Next we need to find the probability or percentage that each category appears with every other category.
Create new calculated field called “Percentage of Orders” with the following calculation:

COUNTD([Order ID])/min([Distinct Orders by Category])

probability_capture4

4.) Now create the viz. Drag Category (under the product hierarchy) to Rows and Category (Orders1) to Columns.

5.) Drag the Percentage of Orders calculation to the Abc or Text mark. And now you have a matrix of categories and the probability of it appearing in another order.

probability_capture5

How you read this is from left to right. The count distinct transactions are along the rows, so you can see that Office Supplies appear in 54% of all Furniture orders. Conversely, Furniture only appears in 25% of all Office Supply orders.

Whether you realized it or not, we just calculated a conditional probability on categories in the Superstore dataset.

In closing

I hope you learned a little about probability in this post and how to apply them in Tableau. On the next post we will build upon these foundations and discuss Bayes’ theorem.

That’s it for now!

-Paul

Where Should I Put My Filters? – Part 2

Welcome back! It’s been a while since Part 1 of this series. But after getting distracted with some set actions and PostGIS, I’m finally looping back around to filters.

Special thanks to Kevin Flerlage for his help and guidance on this post!

The last time I discussed filters, we were talking about putting filters on the top, left and right of your dashboard:

filters_screenshot2

But what if you want to hide them? What if they are taking up valuable real estate or are simply an eye sore? Well, there’s a solution.

There are two ways to solve this problem. The first is to use collapsible containers. The second is to use dashboard actions.

Collapsible containers, made famous by Robert Rouse, is the harder of the two solutions, but I’m still going to write a blog post in the future that covers it. Since dashboard actions just came out in Tableau 2018.3, we’re going to talk about that method instead.

A slight introduction

We’re going to building something in Tableau that is called a Hamburger menu. These are mostly found on websites and mobile apps. The reason they are called hamburger menus is because the image resembles a hamburger:

Usually, they are hiding some sort of hidden menu or navigation. For example:

Getting started

Let’s get back to our Superstore dashboard. You can download it here if you want to follow along.

1.) Take the starter and duplicate it.

2.) On the first dashboard, remove the container at the top for the all the filters and the filters logo. Some things may shift a little, so you might have to move some things around to make it to your liking. You should be left with something like this:

dashboard actions_1

3.) Take the hamburger icon () and save it to somewhere on your computer.

4.) Take a button from the objects menu on the left side of the dashboard menu, and drag a button on the left side of the title.

5.) Select the down triangle, and select Edit.

6.) Under Navigate to, select your second dashboard with filters.

7.) Under image, find your hamburger menu. If you want to put a tooltip on your hamburger, you can do that to. When you are done, click OK.

Second dashboard

On the second dashboard, we will do a similar process.

1.) Take the X icon () and save it somewhere on your computer.

2.) Take a button from the objects menu on the left side of the dashboard menu, and drag a button on the left side of the filter icon. (You can really place it anywhere you want).

3.) Select the down triangle, and select Edit.

4.) Under Navigate to, select your first dashboard.

5.) Under image, find your hamburger menu. If you want to put a tooltip on your X, you can do that to. When you are done, click OK.

That’s it. Here’s the workbook for you to download and see what else I did.

The pro: This is much easier to do than collapsing and hiding containers.

The con: You have to maintain 2 dashboards.

That’s all for now!

Until next time,

Paul

Using Set Actions to Drill Down from State to Zip Codes

Welcome back, friends! I was inspired by Bethany Lyons’ (@tablyze) presentation at #TC18, and how she built a 3-level map drilldown for Median Price for Properties sold in 2017-2018.

In her presentation, she didn’t really go into how it was built, but it was really a summary of the different use cases. Luckily, Craig Dewar has a pretty cool video on YouTube, walking you through the steps to recreate this. I give full credit to Craig.

The What?

If you don’t fully understand what we’re building, here’s a working gif:

ezgif.com-resize

In full disclosure, because some zip codes don’t fit into counties nicely, we need to do a little cleanup — AND because of this county boundaries aren’t going to look the same as we traditionally know them.

Getting Started

For this example, we need two files: a file of zip codes, and a state FIPS file.

1.) For the zip codes file, we are going to use the most recent file provided by the HUD USPS ZIP Code Crosswalk. For zips that cross over multiple counties, it has a ratio of residents, businesses, other and total. What I did was sort by tot_ratio descending (so you get the primary zip/county relationship) in Excel and then “Remove Duplicates.” You can either do that or just download the file here.

2.) In the zip/county, we don’t have state name. So we need to bring that in. You can find this file with a simple Google search, but it’s also on data.world here.

Into Tableau

Open up Tableau, and load your zip/county Excel file and your text file into a new datasource. The first two digits of the county FIPS code is the state FIPS code, so we’re going to  connect the files on a custom join calculation:

int(left([County],2)) = int([State Fips])

drilldown_map_9

 

On to sets!

1.) Once you are on your sheet, rename the field “County” to County FIPS (You don’t really have to do this, but it keeps my mind straight).

2.) Create a Hierarchy of State Name, County FIPS, and Zip, and rename it Location Hierarchy.

3.) Create a set of State Name, call it “Selected States” and pick a few.

4.) Next, we need to calculate how many states we have in our entire dataset. We need to put it in a level of detail calc to make sure we are counting over our dataset. Create a calculated field called “Number of States” with the following calculation:

{countd([State Name])}

5.) To work out the logic if we selected all the states or just some of them, create a “Selected State” calculated field:

if [Selected States] then [State Name] end

6.) Create another calculated field where we count the selected states, call it “Number of Selected States”:

{countd([Selected State])}

7.) Create a “Show Counties” calculated field. This will be a boolean field (T/F), telling us when to show county instead of state.

[Number of Selected States]<[Number of States]

8.) Create a “Conditional County” calculated field. This will show the County FIPS, if Show Counties is true:

if [Show Counties] then [County FIPS] end

9.) Drag your Location hierarchy onto the Details mark, and it should populate the map. You can also drag your “Selected States” set onto color so you can see easily which ones are in or out of the set (for testing purposes). The mark type will be Automatic (a circle). That is fine for now.

10.) Create a new Set from of Conditional County called “Selected Conditional Counties”, and select a few (for testing purposes).

11.) Create a new calculated field called “Show Zip”. Essentially, you are making sure it is not null and it’s selected.

{max([Selected Conditional Counties])}
and
{countd([Selected Conditional Counties])}>1

12.) Create a new calculated field called “Filter Set”.

if [Show Zip] then [Selected Conditional Counties]
ELSEIF [Show Counties] then [Selected States]
else TRUE END

13.) Drag Filter Set to filters, and select True. You should see the few counties you selected in your “Selected Conditional Counties” set.

14.) Create a new calculated field called Geographic Mark:

if [Show Zip] then [Zip]
elseif [Show Counties] then [County FIPS]
else [State Name]
end

This is where the magic happens, and why the hierarchy we did in step 2 is really important.

15.) On your Geographic Mark calculated field, click on the Abc field to the left of it,  click Geographic Role > Create From > Zip.

drilldown_map_10

16.) Do the same process for State Name and County FIPS.

17.) Drag Conditional County, and Geographic Mark to the Detail mark type.

18.) Go into your Selected Conditional Counties and Select All. Do the same for Selected States, so you can see the whole country.

19.) Drag “Number of Records” to the Color mark type — and magically all of your circles turn to states!

Creating the set actions

1.) The first set action we will create is on Selected States.

Click on Worksheet > Actions > Add Actions > Change Set Values.

Rename it to Show Counties. Run Action on Select. Select your set, and Add all Values to Set.

drilldown_map_11

(This is the opposite action I usually do, but like anything in Tableau, there are multiple ways to solve the same problem.)

2.) Do the same for Selected Conditional Counties.

Now if you click on any state, you should have the drill down capability!

Here’s a link to the workbook on Tableau Public.

Happy vizzing (and mapping)!

-Paul