First, I need to get a list of all of the files. This was complicated a little by the fact that the URL for the bus routes didn't play nice with some of the usual R tools for fetching HTML (RCurl). Luckily, the httr package was the solution. First get the HTML dump, then look for a table with the id fsvItemsTable and make your way down the tree to get the hrefs for all of the files. I imagine there's a way to avoid the grep at the end of this snippet, but it works, so I stopped...
EDIT: The url to get the list of pdf files appears to show different content to different users. I've uploaded all of the pdf files to the git repo, so if you wish to execute the code and this snippet doesn't work, you can skip it and move to the next part...
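Here's a minimal sketch of the scraping step described above, not the original snippet. It assumes the httr and XML packages are installed; the URL is the one quoted in the comments below, and the helper names `extract_pdf_links()` and `fetch_pdf_links()` are made up for illustration.

```r
# Sketch only: helper names are hypothetical, the URL is the one quoted in the comments.
bus_url <- "http://www.district87.org/pages/Bloomington_School_District_87/Parents_and_Students/Bus_Routes/Bloomington_High_School"

# Pull every href out of the table with id "fsvItemsTable", then keep only the PDFs.
extract_pdf_links <- function(html_text) {
  doc <- XML::htmlParse(html_text, asText = TRUE)
  hrefs <- XML::xpathSApply(doc, "//table[@id='fsvItemsTable']//a[@href]",
                            XML::xmlGetAttr, "href")
  hrefs <- unname(unlist(hrefs))
  # No matching table (as some readers saw) means no links at all
  if (is.null(hrefs)) return(character(0))
  hrefs[grepl("\\.pdf$", hrefs, ignore.case = TRUE)]
}

# Fetch the page with httr and hand the HTML text to the parser above.
fetch_pdf_links <- function(url = bus_url) {
  resp <- httr::GET(url)
  httr::stop_for_status(resp)
  extract_pdf_links(httr::content(resp, as = "text"))
}
```

Calling `fetch_pdf_links()` should give a character vector of links, or an empty vector if the page serves you different content. An XPath such as `//table[@id='fsvItemsTable']//a[contains(@href, '.pdf')]` might let you drop the `grepl()` step, though I haven't checked it against the live page.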
Next, use this list of links to download the files. Again, the normal way of doing things, download.file(), failed, but downloader::download() did work.
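A rough sketch of that download loop, assuming the downloader package is installed. The `pdf_links` argument, the `download_pdfs()` name, and the "data" directory are assumptions for illustration, not the original code.

```r
# Sketch only: `pdf_links` is assumed to be a character vector of full PDF URLs.
download_pdfs <- function(pdf_links, dest_dir = "data") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest_files <- file.path(dest_dir, basename(pdf_links))
  for (i in seq_along(pdf_links)) {
    # Extra arguments are passed on to download.file(); mode = "wb" keeps
    # the binary PDF intact on Windows
    downloader::download(pdf_links[i], destfile = dest_files[i], mode = "wb")
  }
  invisible(dest_files)
}
```

The function returns the local file paths invisibly, so they can be fed straight into the PDF reading step.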
Now that we have a directory filled with pdf files, what do we do with it? Well, there's a function called readPDF() in the tm package that can be used to read the data in a pdf file. And using code ripped straight from Stack Overflow, it was pretty easy to get the data.
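For reference, the usual readPDF() pattern looks something like the sketch below. It assumes the tm package with the "xpdf" engine, which needs the pdftotext and pdfinfo command line tools on your PATH; the `read_pdf_lines()` wrapper is a made-up name, and this is not necessarily the exact Stack Overflow code used here.

```r
# Sketch only: assumes tm is installed and pdftotext/pdfinfo are available.
read_pdf_lines <- function(pdf_file) {
  # readPDF() returns a reader function; "-layout" asks pdftotext to keep
  # the column alignment of the table in the text output
  reader <- tm::readPDF(engine = "xpdf", control = list(text = "-layout"))
  doc <- reader(elem = list(uri = pdf_file), language = "en",
                id = basename(pdf_file))
  # The document content is a character vector, one element per line of text
  NLP::content(doc)
}
```

Something like `lapply(list.files("data", pattern = "\\.pdf$", full.names = TRUE), read_pdf_lines)` would then read the whole directory, assuming the files live in "data".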
This leaves you with a single string for each row of data in the pdf table. A little grepping will split the data into separate columns in a data frame. See the full code linked at the bottom of the page for these details.
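To give a flavor of that splitting step, here's a small base R sketch. The sample rows and the column names (time, stop, route) are invented for illustration, since the real PDF layout isn't shown here; the actual code on GitHub may work quite differently.

```r
# Hypothetical rows: made up to mimic layout-preserved text from a PDF table.
rows <- c("  6:45 AM    Main St & Oak St     12",
          "  6:52 AM    Elm St & 5th Ave     12")

# Columns are assumed to be separated by runs of two or more spaces.
split_row <- function(line) strsplit(trimws(line), " {2,}")[[1]]

parts <- do.call(rbind, lapply(rows, split_row))
stops <- data.frame(time = parts[, 1], stop = parts[, 2], route = parts[, 3],
                    stringsAsFactors = FALSE)
```

Splitting on two or more spaces keeps single spaces inside a value like "Main St & Oak St" intact, which is why it's a handy first attempt for layout-preserved text.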
Now we must geocode the bus stop locations so we can plot them on a map. For this, the ggmap package has a simple function called geocode().
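A hedged sketch of the geocoding call, assuming a `stops` data frame with a `stop` column of street intersections (both names are assumptions). Appending the city is my own addition to help disambiguate addresses, and recent ggmap versions need an API key registered with `ggmap::register_google()` before `geocode()` will work.

```r
# Sketch only: column names and the city suffix are assumptions.
build_addresses <- function(stop, city = "Bloomington, IL") {
  paste(stop, city, sep = ", ")
}

geocode_stops <- function(stops, city = "Bloomington, IL") {
  addresses <- build_addresses(stops$stop, city)
  # Recent ggmap versions require ggmap::register_google(key = ...) first
  coords <- ggmap::geocode(addresses)
  # geocode() returns lon and lat columns, one row per address
  cbind(stops, coords)
}
```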
At the end of all of this, we finally have a data set to map. Here's what it looks like...
...well, not exactly. To use the toGeoJSON() function in the rCharts package, the df must be transformed into a list. Also, I add in a color for each route so we can tell them apart on the map, and format the text for the tooltip for each point.
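Roughly, that transformation could look like the base R sketch below. The data frame, its column names, the coordinates, and the `fillColor` and `popup` field names are all made up for illustration.

```r
# Hypothetical data: column names and coordinates are illustrative only.
stops <- data.frame(
  route = c("12", "12", "7"),
  stop  = c("Main St & Oak St", "Elm St & 5th Ave", "Lee St & Locust St"),
  time  = c("6:45 AM", "6:52 AM", "7:01 AM"),
  lat   = c(40.48, 40.47, 40.49),
  lon   = c(-88.99, -88.98, -89.00),
  stringsAsFactors = FALSE
)

# One color per route, looked up by route name.
routes <- unique(stops$route)
route_colors <- setNames(grDevices::rainbow(length(routes)), routes)
stops$fillColor <- unname(route_colors[stops$route])

# Tooltip text for each point.
stops$popup <- sprintf("Route %s<br>%s<br>%s", stops$route, stops$stop, stops$time)

# Turn the data frame into a list with one named list per row.
stops_list <- lapply(seq_len(nrow(stops)), function(i) as.list(stops[i, ]))
```

Each element of `stops_list` now carries its own coordinates, color, and popup text, which is the per-point shape a GeoJSON conversion expects.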
Again, in keeping with using other people's code, I reused some code that Ramnath Vaidyanathan wrote for the Foodborne Chicago map a while back to create a leaflet map of the bus stops. He is the author of the rCharts package, is super helpful via Twitter and GitHub with random issues, and is doing a tutorial at useR! 2014 in LA. I can't wait to meet him... The last part of this code snippet creates a GitHub gist out of it. I had some trouble using it on my network, so I just used the .save() method to create an html file and copy-pasted it as a gist.
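A rough sketch of what such an rCharts leaflet map can look like, not Ramnath's actual code. It assumes rCharts is installed from GitHub and exports toGeoJSON() as described above, and that `stops_list` is a list of per-stop lists with `lat`, `lon`, `fillColor`, and `popup` fields (all assumed names). The map center and zoom are illustrative values too.

```r
# Sketch only: field names, center, and zoom are assumptions.
build_stop_map <- function(stops_list, center = c(40.48, -88.99), zoom = 13) {
  map <- rCharts::Leaflet$new()
  map$setView(center, zoom)
  map$geoJson(
    rCharts::toGeoJSON(stops_list, lat = "lat", lon = "lon"),
    # JavaScript wrapped in #! ... !# is passed through to the page as code
    onEachFeature = "#! function(feature, layer) {
      layer.bindPopup(feature.properties.popup);
    } !#",
    pointToLayer = "#! function(feature, latlng) {
      return L.circleMarker(latlng, {
        radius: 4,
        fillColor: feature.properties.fillColor,
        color: '#000',
        weight: 1,
        fillOpacity: 0.8
      });
    } !#"
  )
  map
}
```

With that in place, `build_stop_map(stops_list)$save("bus_stops.html")` would write a standalone html file, where "bus_stops.html" is just an example file name.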
And here's the result. There's still some work to be done on the geocoding end of things. As you can see if you click on a dot on the map, the location doesn't always line up with where the map tooltip says it should be.
All of the code can be found on github.
Thanks for sharing! I didn't know that one can read PDF files that easily into R.
I tried to execute your code, but unfortunately the first xpathSApply statement returned NULL. Am I doing something wrong?
I can't seem to figure out why that url seems to deliver different content to different users. I've uploaded all of the pdf files to the git repo, so if you clone it, you'll have them all and can execute the code from the pdf reading point on... https://github.com/corynissen/bloomington-bus-stops/tree/master/data
Thanks for looking into this.
Works on my machine, but I tried a machine on EC2 and get NULL also. It looks like the link (http://www.district87.org/pages/Bloomington_School_District_87/Parents_and_Students/Bus_Routes/Bloomington_High_School) is returning different results depending on your location or something. I'll keep digging...