Wednesday, June 11, 2014

Visualizing Bus Stops with rCharts

I wanted to create a quick visualization of Bloomington, IL bus stops. The data is published as PDFs, spread across multiple files. Before any mapping can occur, those files have to be downloaded and parsed to get the bus stop locations and times.

First, I need to get a list of all of the files. This was complicated a little by the fact that the URL for the buses didn't play nice with some of the usual HTML tools in R (RCurl). Happily, the httr package was the solution. First get the HTML dump, then look for a table with the id fsvItemsTable and work your way down the tree to get the hrefs for all of the files. I imagine there's a way to avoid the grep at the end of this snippet, but it works, so I stopped...
EDIT: The url to get the list of pdf files appears to show different content to different users. I've uploaded all of the pdf files to the git repo, so if you wish to execute the code and this snippet doesn't work, you can skip it and move to the next part...
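
Here's a minimal sketch of that scraping step, assuming the httr and XML packages are installed. The function names (extract_pdf_links, fetch_pdf_links) are made up for illustration, and since the page seems to serve different content to different users, an empty result is a real possibility.

```r
# Sketch only: helper names are hypothetical, not from the original post.

# Pull the hrefs out of the table with id "fsvItemsTable" and keep the PDFs.
extract_pdf_links <- function(html) {
  doc <- XML::htmlParse(html, asText = TRUE)
  hrefs <- XML::xpathSApply(
    doc, "//table[@id='fsvItemsTable']//a[@href]",
    XML::xmlGetAttr, "href"
  )
  hrefs <- unlist(hrefs, use.names = FALSE)
  # No matching table (or no links in it) gives an empty result.
  if (is.null(hrefs)) return(character(0))
  # The grep: keep only links that end in .pdf.
  grep("\\.pdf$", hrefs, value = TRUE, ignore.case = TRUE)
}

# Fetch the page with httr, then hand the HTML text to the parser above.
fetch_pdf_links <- function(url) {
  resp <- httr::GET(url)
  httr::stop_for_status(resp)
  extract_pdf_links(httr::content(resp, as = "text", encoding = "UTF-8"))
}
```

Keeping the parsing separate from the download makes it easy to test against a saved copy of the page when the live URL misbehaves.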


Next, use this list of links to download the files. Again, the normal way of doing things, download.file(), failed, but downloader::download() did work.
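
A rough sketch of the download loop, assuming the links are absolute URLs whose last path segment is the file name. The helper names and the "data" directory are my own choices for illustration.

```r
# Sketch only: helper names and the default directory are hypothetical.

# Work out where each PDF should land on disk.
pdf_destinations <- function(links, dest_dir = "data") {
  file.path(dest_dir, basename(links))
}

# Download every link with downloader::download(), which passes its extra
# arguments on to download.file().
download_pdfs <- function(links, dest_dir = "data") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- pdf_destinations(links, dest_dir)
  for (i in seq_along(links)) {
    # mode = "wb" keeps the binary PDF intact on Windows.
    downloader::download(links[i], destfile = dest[i], mode = "wb")
  }
  invisible(dest)
}
```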


Now that we have a directory filled with PDF files, what do we do with it? Well, there's a function called readPDF() in the tm package that can read the text out of a PDF file. Using code ripped straight from Stack Overflow, it was pretty easy to get the data.
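
Something along these lines should work as a sketch. It assumes the tm package with its "xpdf" engine, which in turn needs the pdftotext command-line tool on your PATH; the helper names are hypothetical.

```r
# Sketch only: helper names are hypothetical.

# Split text into lines and drop the blank ones.
to_lines <- function(text) {
  lines <- unlist(strsplit(text, "\n", fixed = TRUE), use.names = FALSE)
  lines[nzchar(trimws(lines))]
}

# Read one PDF into a character vector, one element per line of text.
read_pdf_lines <- function(path) {
  # readPDF() returns a reader function. "-layout" asks pdftotext to
  # preserve the column layout of the table (assumes pdftotext is installed).
  reader <- tm::readPDF(engine = "xpdf", control = list(text = "-layout"))
  doc <- reader(elem = list(uri = path), language = "en", id = basename(path))
  to_lines(NLP::content(doc))
}

# Example (not run): read every PDF in the data directory.
# pdfs <- list.files("data", pattern = "\\.pdf$", full.names = TRUE)
# all_lines <- lapply(pdfs, read_pdf_lines)
```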


This leaves you with a single string for each row of data in the PDF table. A little grepping will split each string into separate columns in a data frame. See the full code linked at the bottom of the page for these details.
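
To give a flavor of the grepping, here's a sketch in base R. The row layout is an assumption on my part (a time like "7:05 AM" followed by the stop description); the real PDFs may well be laid out differently, so treat the pattern as a placeholder.

```r
# Sketch only: the "time then stop name" row layout is an assumed format.
parse_stop_rows <- function(rows) {
  rows <- trimws(rows)
  # Group 1: the time. Group 2: everything after it (the stop description).
  pattern <- "^([0-9]{1,2}:[0-9]{2} ?[AaPp][Mm]) +(.+)$"
  m <- regmatches(rows, regexec(pattern, rows))
  # Rows that don't match (headers, footers) come back empty; skip them.
  ok <- lengths(m) == 3
  data.frame(
    time = vapply(m[ok], `[`, character(1), 2),
    stop = vapply(m[ok], `[`, character(1), 3),
    stringsAsFactors = FALSE
  )
}

# Example with made-up rows:
# parse_stop_rows(c("  7:05 AM   Main St & Oak St ", "Route 1 Morning"))
```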

Now we must geocode the bus stop locations so we can plot them on a map. For this, the ggmap package has a simple function called geocode().
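
A sketch of the geocoding step, assuming a data frame with a stop column (my own column name). Appending the city and state to each query is my addition to help the geocoder, and recent versions of ggmap need a Google API key registered with register_google() before geocode() will return anything.

```r
# Sketch only: the `stop` column name and the helpers are hypothetical.

# Add the city and state so the geocoder doesn't wander off to another town.
stop_to_query <- function(stop, city = "Bloomington, IL") {
  paste0(stop, ", ", city)
}

# Look up each stop and attach the coordinates to the data frame.
geocode_stops <- function(stops) {
  # Recent ggmap versions require ggmap::register_google(key = ...) first.
  coords <- ggmap::geocode(stop_to_query(stops$stop), source = "google")
  stops$lon <- coords$lon
  stops$lat <- coords$lat
  stops
}
```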

At the end of all of this, we finally have a data set to map. Here's what it looks like...


...well, not exactly. To use the toGeoJSON() function in the rCharts package, the df must be transformed into a list. Also, I add in a color for each route so we can tell them apart on the map, and format the text for the tooltip for each point.
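
Roughly, the transformation could look like the sketch below. The column names (route, stop, time, lat, lon), the fillColor and popup fields, and the rainbow() palette are all assumptions for illustration.

```r
# Sketch only: column names and the palette are hypothetical choices.
stops_to_list <- function(stops) {
  # One color per route, so the routes can be told apart on the map.
  routes <- unique(as.character(stops$route))
  palette <- stats::setNames(grDevices::rainbow(length(routes)), routes)
  stops$fillColor <- unname(palette[as.character(stops$route)])

  # Tooltip text for each point.
  stops$popup <- paste0(
    "<b>", stops$stop, "</b><br>Route: ", stops$route,
    "<br>Time: ", stops$time
  )

  # One list element per row, each holding that row's fields by name.
  lapply(seq_len(nrow(stops)), function(i) as.list(stops[i, , drop = FALSE]))
}
```

The result is a plain list of named lists, one per bus stop, which is the shape toGeoJSON() is meant to take.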


Again, in keeping with using other people's code, I reused some code that Ramnath Vaidyanathan wrote for the Foodborne Chicago map a while back to create a leaflet map of the bus stops. He is the author of the rCharts package, is super helpful via Twitter and GitHub with random issues, and is doing a tutorial at useR! 2014 in LA. I can't wait to meet him... The last part of this code snippet creates a GitHub gist out of the map. I had some trouble using it on my network, so I just used the .save() method to create an HTML file and copy-pasted it as a gist.
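
The map itself might be sketched like this, loosely following the pattern of the Foodborne Chicago example. It assumes rCharts is installed from GitHub, that each list element has lat, lon, popup and fillColor fields (hypothetical names), and that toGeoJSON() accepts lat and lon arguments naming those fields, as I understand it from the rCharts source.

```r
# Sketch only: helper names and field names are hypothetical.

# Center the map on the average stop location, as c(lat, lon).
map_center <- function(stop_list) {
  c(
    mean(vapply(stop_list, function(s) s$lat, numeric(1))),
    mean(vapply(stop_list, function(s) s$lon, numeric(1)))
  )
}

make_stop_map <- function(stop_list, zoom = 13) {
  map <- rCharts::Leaflet$new()
  map$setView(map_center(stop_list), zoom)
  map$geoJson(
    # Assumed signature: lat/lon name the coordinate fields in each element.
    rCharts::toGeoJSON(stop_list, lat = "lat", lon = "lon"),
    # JavaScript callbacks: bind the tooltip and draw a colored circle.
    onEachFeature = "#! function(feature, layer) {
      layer.bindPopup(feature.properties.popup)
    } !#",
    pointToLayer = "#! function(feature, latlng) {
      return L.circleMarker(latlng, {
        radius: 4,
        fillColor: feature.properties.fillColor || 'red',
        color: '#000',
        weight: 1,
        fillOpacity: 0.8
      })
    } !#"
  )
  map
}

# Example (not run): write a standalone HTML file.
# make_stop_map(stop_list)$save("bus_stops.html", cdn = TRUE)
```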


And here's the result. There's still some work to be done on the geocoding end of things. As you can see if you click on a dot on the map, the location doesn't always line up with where the map tooltip says it should be.

All of the code can be found on github.

4 comments:

  1. Thanks for sharing! I didn't know that one can read PDF files that easily into R.
    I tried to execute your code, but unfortunately the first xpathSApply statement returned NULL. Am I doing something wrong?

    1. I can't seem to figure out why that url seems to deliver different content to different users. I've uploaded all of the pdf files to the git repo, so if you clone it, you'll have them all and can execute the code from the pdf reading point on... https://github.com/corynissen/bloomington-bus-stops/tree/master/data

    2. Thanks for looking into this.

  2. Works on my machine, but I tried a machine on EC2 and get NULL also. It looks like the link (http://www.district87.org/pages/Bloomington_School_District_87/Parents_and_Students/Bus_Routes/Bloomington_High_School) is returning different results depending on your location or something. I'll keep digging...

