Thursday, January 8, 2015

Using rvest to Scrape an HTML Table

I recently needed to scrape a table from Wikipedia. Normally I'd probably cut and paste it into a spreadsheet, but I figured I'd give Hadley Wickham's rvest package a go.

The first thing I needed to do was browse to the desired page and locate the table. In this case, it's a table of US state populations from Wikipedia. Rvest needs to know which table I want, so (using the Chrome web browser) I right-clicked and chose "Inspect Element". This splits the page horizontally. As you hover over page elements in the HTML on the bottom, the corresponding sections of the web page are highlighted on the top.

Hovering over the blue highlighted line will cause the table on top to be colored blue; this is the element we want. I clicked on this line and chose "Copy XPath", and then we can move to R.

The first step is to install rvest from CRAN.

install.packages("rvest")

Then it's pretty simple to pull the table into a data frame. Paste that XPath into the appropriate spot below.

library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
  html() %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/table[1]') %>%
  html_table()
population <- population[[1]]

head(population)
##   Rank in\nthe Fifty\nStates,\n2014
## 1                           !000001
## 2                           !000002
## 3                           !000003
## 4                           !000004
## 5                           !000005
## 6                           !000006
##   Rank in all\nstates\n& terri-\ntories,\n2010 State or territory
## 1                                      !000001         California
## 2                                      !000002              Texas
## 3                                      !000004            Florida
## 4                                      !000003           New York
## 5                                      !000005           Illinois
## 6                                      !000006       Pennsylvania
##   Population estimate for\nJuly 1, 2014 Census population,\nApril 1, 2010
## 1                            38,802,500                        37,253,956
## 2                            26,956,958                        25,145,561
## 3                            19,893,297                        18,801,310
## 4                            19,746,227                        19,378,102
## 5                            12,880,580                        12,830,632
## 6                            12,787,209                        12,702,379
##   Census population,\nApril 1, 2000 Seats inU.S. House,\n2013–2023
## 1                        33,871,648                        !000053
## 2                        20,851,820                        !000036
## 3                        15,982,378                        !000027
## 4                        18,976,457                        !000027
## 5                        12,419,293                        !000018
## 6                        12,281,054                        !000018
##   Presi-\ndential\nElectors\n2012–\n2020
## 1                                !000055
## 2                                !000038
## 3                                !000029
## 4                                !000029
## 5                                !000020
## 6                                !000020
##   2014 Estimated pop.\nper\nHouse seat
## 1                              732,123
## 2                              748,804
## 3                              736,789
## 4                              731,342
## 5                              715,588
## 6                              710,401
##   2010 Census pop.\nper\nHouse\nseat[4] 2000 Census pop.\nper\nHouse\nseat
## 1                               702,905                            639,088
## 2                               698,487                            651,619
## 3                               696,345                            639,295
## 4                               717,707                            654,361
## 5                               712,813                            653,647
## 6                               705,688                            646,371
##   Percent\nof total\nU.S. pop.,\n2014[5]
## 1                                 12.17%
## 2                                  8.45%
## 3                                  6.24%
## 4                                  6.19%
## 5                                  4.04%
## 6                                  4.01%
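(As commenters note below, html() was later deprecated in favor of read_html(), so on a newer version of rvest the pipeline needs one substitution. A minimal self-contained sketch, parsing an inline HTML snippet instead of the live Wikipedia page so there's no network call:)

```r
library("rvest")

# read_html() replaces the now-deprecated html(); it accepts an HTML
# string directly, which keeps this example self-contained
page <- read_html("<table>
  <tr><th>State</th><th>Population</th></tr>
  <tr><td>California</td><td>38,802,500</td></tr>
  <tr><td>Texas</td><td>26,956,958</td></tr>
</table>")

tbl <- page %>%
  html_nodes("table") %>%   # same node selection step as above
  html_table()
tbl <- tbl[[1]]             # html_table() returns a list of data frames
```

For the Wikipedia table itself, the only change to the original code is swapping read_html() in for html().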

There's some work to be done on column names, but this is a pretty pain-free way to scrape a table. As usual, a big shout out to Hadley Wickham for making this so easy for us.
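(For instance, the embedded newlines in the headers and the comma-formatted numbers can be cleaned up with a couple of lines of base R. A sketch using a small stand-in data frame that mimics the scraped output above:)

```r
# Stand-in for the scraped data frame, with the same messy headers
population <- data.frame(
  a = c("California", "Texas"),
  b = c("38,802,500", "26,956,958"),
  stringsAsFactors = FALSE
)
names(population) <- c("State or territory",
                       "Population estimate for\nJuly 1, 2014")

# Collapse the embedded newlines in the column names
names(population) <- gsub("\n", " ", names(population))

# Strip the thousands separators and convert to numeric
population[[2]] <- as.numeric(gsub(",", "", population[[2]]))
```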

8 comments:

  1. rvest is a nice framework for many folks. Thanks for sharing. This particular task can also be handled very easily with the XML package via: library(XML); readHTMLTable(url, which=1)

    1. Can confirm it works... df <- XML::readHTMLTable(url, which=1, stringsAsFactors=F)
      Yeah, rvest is probably overkill for this task, but the XML package has defeated me in the past, so I try to keep my distance.

  2. I am trying desperately to find a way of scraping review data from Tripadvisor.

    http://notesofdabbler.github.io/201408_hotelReview/scrapeTripAdvisor.html looked very promising, but I am having difficulty following the code and, with the help of others, got as far as retrieving some of the data. However, certain 'properties', i.e. ID number and review title, worked for only 9 out of 10 of the reviews on a page.

    I am not an R boffin, so I was glad to see that Hadley Wickham's code looked a lot more straightforward using rvest (https://github.com/hadley/rvest/blob/37006b94dea4035f7d949f29b6449d0278884969/demo/tripadvisor.R). I am still battling to get anything out of it. I was wondering if you could advise me on:

    Error: could not find function "%>%"
    Error: could not find function "html_nodes"

    I assume this means that I have not successfully installed rvest?

    > library(rvest)
    Error in get(method, envir = home) :
    lazy-load database 'C:/Users/DNLCA_000/Documents/R/win-library/3.2/rvest/R/rvest.rdb' is corrupt
    In addition: Warning messages:
    1: In .registerS3method(fin[i, 1], fin[i, 2], fin[i, 3], fin[i, 4], :
    restarting interrupted promise evaluation
    2: In get(method, envir = home) :
    restarting interrupted promise evaluation
    3: In get(method, envir = home) : internal error -3 in R_decompress1
    Error: package or namespace load failed for ‘rvest’

    I would truly appreciate some advice on how I could proceed from here as I am not sure where to look for help in this matter.

    1. I'm not sure why your installation of rvest on Windows didn't work. Maybe try using devtools: install.packages("devtools"); library("devtools"); install_github("hadley/rvest")

  3. A minor typo, I think, install.packageS ?

  4. Hi Corry,

    Thanks for this wonderful blog. I am a novice trying to learn R and web scraping. I have basic knowledge of R, like installing packages and data frames. I tried to follow the steps using the code you have provided, but unfortunately, when I reach the line html_table() and hit enter, I get the following error:

    Error in UseMethod("html_table") :
    no applicable method for 'html_table' applied to an object of class "xml_nodeset"
    In addition: Warning message:
    'html' is deprecated.
    Use 'read_html' instead.
    See help("Deprecated")

    Will you be able to help me figure out what mistake I am making?

    Regards
    Mihir

    1. """Error in UseMethod("html_table")"""
      """""""""""""""""""""""""""""""""""""""

      "html() %>%" - old method, no good, fukaka.
      "read_html() %>%" - good method.
      I think rvest updated.

  5. Hey!! Great post.. love the rvest for data mining!!
    Hope you can help me here with a problem. So far, I can scrape almost anything from this website I'm working with, except the price. I have tried everything from 'h4' to '//*[@id="plpContent"]/div[1]/a/h4' to '.priceOffer', 'before', etc.
    Please, I would be so happy if you can help me with this one!!!
    This is the link: https://www.carulla.com/browse?Ntt=gelatina&No=0&Nrpp=80
    Thanks and congrats!!! Hope you have a Happy New Year!

