How to Use Tableau Prep and R to scrape Starbucks locations

Hello Friends!

To be honest, I held out on using Tableau Prep for a long time. There are a few reasons for this, but with the release of Tableau Prep 2019.3 and its R and Python integration, I'm starting to come around.

In this blog post, I will show you how to scrape Starbucks locations off their store locator to visualize and analyze in Tableau.

Getting Started

To get started, we need to look at the Starbucks store locator on their website and use the developer tools in Chrome or Firefox.

Open up one of these browsers, and navigate to their store locator: Starbucks Store Locator. You should see something like this:

starbucks_scrape_01

If you click on the info for each location, you will see the Store Name, address, hours, amenities, etc.

starbucks_scrape_02

So how do we get this info without a manual copy/paste into a spreadsheet? Source code to the rescue!

1.) In Chrome, click the three vertical dots at the top right, then click More Tools and Developer Tools.

starbucks_scrape_03

2.) Now, this next part takes a little trial and error and isn't really straightforward. Some store locators are driven by web services you can tap into; others just embed the data in their page source. Starbucks does the latter. In the Elements panel of DevTools, I do a search (Ctrl + F) for the store I'm looking for. In this case, I'm looking for the “Pittsburgh Marriott City Ctr Lobby” to see if I can find something I can work with.

On the 5th hit, I find what I’m looking for: STRUCTURED JSON!!

starbucks_scrape_04

All we need to do is scrape this, parse the JSON, and we have our data.

Bringing in R/Building the R function

If you don’t have R, there are numerous blogs on how to get it installed and up and going. For Tableau Prep, you need to download/install a library called Rserve, which allows Tableau to talk to R. That is all spelled out here.
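If you're starting from scratch, the one-time setup looks roughly like this (these are the standard CRAN package names; treat it as a sketch, not a prescribed install order):

install.packages(c("Rserve", "rvest", "jsonlite", "stringr", "tidyverse", "sqldf")) ## one-time install of everything this post uses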

In R, we’re going to rely on a couple libraries:

library(rvest) ## Easy web scraper
library(jsonlite) ## parse the json into a dataframe
library(stringr) ## some string functions
library(tidyverse)
library(dplyr)
library(sqldf) ## writing sql instead of source R to query data frames
options(sqldf.driver = "SQLite")

Once we have these loaded, let's write some code and get some data!

## Pull the URL and html into R
starbucks_baseurl <- "https://www.starbucks.com/store-locator?place=15219"

page_html <- read_html(starbucks_baseurl)

json <- page_html %>%
    html_nodes("body > script:nth-child(5)") %>%
    html_text %>%
    str_split("window.__INTL_MESSAGES")

What we are doing here:

1.) Defining a new variable called json.

2.) Finding the html node for our json.

3.) Converting it to text.

4.) Stripping out the data we don’t need (str_split). If you look, you will see that there are multiple JSON definitions under that script tag, and we only need the first one. The second one starts with window.__INTL_MESSAGES, so we are splitting the string on that.
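To see what that split leaves us with, here's a toy example (the JSON blobs are made up, but the shape of the script text matches what Starbucks embeds): str_split returns a list of character vectors, which is why the code later indexes with [[1]][1] to grab the first piece.

library(stringr)

## Toy script text with two JSON definitions, like the Starbucks script tag
txt <- 'window.__BOOTSTRAP = {"a":1};window.__INTL_MESSAGES = {"b":2};'
parts <- str_split(txt, "window.__INTL_MESSAGES")
parts[[1]][1]  ## 'window.__BOOTSTRAP = {"a":1};' -- the piece we keep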

We need to clean up our JSON a little more before we can parse it.

json_clean <- substr(json[[1]][1], 1, nchar(json[[1]][1]) - 1) ## removes the semi-colon at the end of the JSON
json_clean <- json_clean %>%
  str_remove("window.__BOOTSTRAP = ") %>% ## removes the "window.__BOOTSTRAP = " prefix so we have clean JSON
  trimws() ## removes whitespace at the beginning and end of our JSON

Cool, now we are ready to parse some JSON!

df_json <- fromJSON(json_clean, simplifyDataFrame = TRUE)

#Extract out the store locations
df_json2 <- df_json$storeLocator$locationState$locations

#Build a data frame with the fields we need
df_json3 <- data.frame(name = as.character(df_json2$name)
                       ,brand = as.character(df_json2$brandName)
                       ,latitude = df_json2$coordinates$latitude
                       ,longitude = df_json2$coordinates$longitude
                       ,storeNumber = df_json2$storeNumber
                       ,address = df_json2$address$streetAddressLine1
                       ,city = df_json2$address$city
                       ,state = df_json2$address$countrySubdivisionCode
                       ,zip = df_json2$address$postalCode
                       ,open_status = df_json2$open
)

But what if we want multiple zip codes?

At this point, we are ready to wrap this into a function and go into Tableau Prep. But I want to loop through a list of zip codes and build a comprehensive list. Easy! Let's just wrap a for loop around the whole thing and append each result to a running data frame. In the end, we get something like this:

starbucks_scrape_tst <- function(df) {

  library(rvest)
  library(dplyr)
  library(sqldf) ## writing sql instead of source R to query data frames
  options(sqldf.driver = "SQLite")
  library(tidyverse)
  library(stringr)
  library(jsonlite)

  df_json4 <- data.frame()

  for(i in seq_len(nrow(df))){ ## loop over every row, not just the row count
    starbucks_baseurl <- paste0("https://www.starbucks.com/store-locator?place=", df$zip_code[i])

    page_html <- read_html(starbucks_baseurl)

    json <- page_html %>%
      html_nodes("body > script:nth-child(5)") %>%
      html_text %>%
      str_split("window.__INTL_MESSAGES")

    json_clean <- substr(json[[1]][1], 1, nchar(json[[1]][1]) - 1)
    json_clean <- json_clean %>%
      str_remove("window.__BOOTSTRAP = ") %>%
      trimws()

    df_json <- fromJSON(json_clean, simplifyDataFrame = TRUE)

    #Extract out the store locations
    df_json2 <- df_json$storeLocator$locationState$locations

    df_json3 <- data.frame(name = as.character(df_json2$name)
                           ,brand = as.character(df_json2$brandName)
                           ,latitude = df_json2$coordinates$latitude
                           ,longitude = df_json2$coordinates$longitude
                           ,storeNumber = df_json2$storeNumber
                           ,address = df_json2$address$streetAddressLine1
                           ,city = df_json2$address$city
                           ,state = df_json2$address$countrySubdivisionCode
                           ,zip = df_json2$address$postalCode
                           ,open_status = df_json2$open
    )

    df_json4 <- rbind(df_json4, df_json3)
  }

  return(df_json4) ## return after the loop so every zip code's results are included
}

For Tableau Prep, we also need a little helper function. We need to tell Tableau what our data frame is made of, and the data type of each column.

getOutputSchema <- function() {

  return(data.frame(
    name = prep_string(),
    brand = prep_string(),
    latitude = prep_decimal(),
    longitude = prep_decimal(),
    storeNumber = prep_string(),
    address = prep_string(),
    city = prep_string(),
    state = prep_string(),
    zip = prep_string(),
    open_status = prep_bool()
  ))
}

Into Tableau Prep, finally!

Let’s open up Tableau Prep and pull our data.

First, we need to find a list of zip codes. There are a few out there, but I found this one.

Next, to keep things simple, I filtered to a list of only Pittsburgh zip codes, and renamed the field zip to zip_code.

I then added an aggregate step to get a unique list of zip codes, since zip codes cross county lines in some cases.
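For reference, the same dedupe the aggregate step performs can be done in R with the sqldf library we loaded earlier. The zip codes below are toy data standing in for the real zip-code file:

library(sqldf)
options(sqldf.driver = "SQLite")

## Duplicate rows, like a zip code that crosses county lines
zips <- data.frame(zip_code = c("15219", "15219", "15222"), stringsAsFactors = FALSE)

## DISTINCT collapses the duplicates, just like the Prep aggregate step
unique_zips <- sqldf("SELECT DISTINCT zip_code FROM zips")
nrow(unique_zips)  ## 2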

starbucks_scrape_05

I then added a script step off the aggregate step. Up comes this dialog:

starbucks_scrape_06

We need to connect to Rserve first. So go back into R and run

library(Rserve)
Rserve()

This will launch our R server.

Click on “Connect to Rserve Server” and this dialog should appear. If your port is empty, the default port for Rserve is 6311.

starbucks_scrape_07

Load your R file (I called mine starbucks_scrape.R) and tell Tableau Prep which function to run (mine is called starbucks_scrape_tst).

I then set up an output step to a hyper file, and opened it up in desktop:

starbucks_scrape_08

And finally, we have all the Starbucks locations for the Pittsburgh area.

Reach out with any questions!

Happy scraping!

-Paul


Full R-code

getOutputSchema <- function() { 

  return (data.frame (
    name = prep_string (),
    brand = prep_string(),
    latitude = prep_decimal (),
    longitude = prep_decimal (),
    storeNumber = prep_string(),
    address = prep_string(),
    city = prep_string(),
    state = prep_string(),
    zip = prep_string(),
    open_status = prep_bool()
  ));
}

starbucks_scrape_tst <- function(df) {

  library(rvest)
  library(dplyr)
  library(sqldf) ## writing sql instead of source R to query data frames
  options(sqldf.driver = "SQLite")
  library(tidyverse)
  library(stringr)
  library(jsonlite)

  df_json4 <- data.frame()

  for(i in seq_len(nrow(df))){ ## loop over every row, not just the row count
    starbucks_baseurl <- paste0("https://www.starbucks.com/store-locator?place=", df$zip_code[i])

    page_html <- read_html(starbucks_baseurl)

    json <- page_html %>%
      html_nodes("body > script:nth-child(5)") %>%
      html_text %>%
      str_split("window.__INTL_MESSAGES")

    json_clean <- substr(json[[1]][1], 1, nchar(json[[1]][1]) - 1)
    json_clean <- json_clean %>%
      str_remove("window.__BOOTSTRAP = ") %>%
      trimws()

    df_json <- fromJSON(json_clean, simplifyDataFrame = TRUE)

    #Extract out the store locations
    df_json2 <- df_json$storeLocator$locationState$locations

    df_json3 <- data.frame(name = as.character(df_json2$name)
                           ,brand = as.character(df_json2$brandName)
                           ,latitude = df_json2$coordinates$latitude
                           ,longitude = df_json2$coordinates$longitude
                           ,storeNumber = df_json2$storeNumber
                           ,address = df_json2$address$streetAddressLine1
                           ,city = df_json2$address$city
                           ,state = df_json2$address$countrySubdivisionCode
                           ,zip = df_json2$address$postalCode
                           ,open_status = df_json2$open
    )

    df_json4 <- rbind(df_json4, df_json3)
  }

  return(df_json4) ## return after the loop so every zip code's results are included
}
