I recently needed to scrape a table from Wikipedia. Normally I'd cut and paste it into a spreadsheet, but I figured I'd give Hadley Wickham's rvest package a go.
The first thing I needed to do was browse to the desired page and locate the table. In this case, it's a table of US state populations from Wikipedia. rvest needs to know which table I want, so (using the Chrome web browser) I right-clicked and chose "Inspect Element". This splits the page horizontally. As you hover over page elements in the HTML on the bottom, sections of the web page are highlighted on the top.
Hovering over the blue highlighted line causes the table on top to be colored blue. This is the element we want. I clicked on that line, chose "Copy XPath", and then moved to R.
The first step is to install rvest from CRAN.
install.packages("rvest")
Then it's pretty simple to pull the table into a data frame. Paste that XPath into the appropriate spot below.
library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
  html() %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/table[1]') %>%
  html_table()
population <- population[[1]]
head(population)
## Rank in\nthe Fifty\nStates,\n2014
## 1 !000001
## 2 !000002
## 3 !000003
## 4 !000004
## 5 !000005
## 6 !000006
## Rank in all\nstates\n& terri-\ntories,\n2010 State or territory
## 1 !000001 California
## 2 !000002 Texas
## 3 !000004 Florida
## 4 !000003 New York
## 5 !000005 Illinois
## 6 !000006 Pennsylvania
## Population estimate for\nJuly 1, 2014 Census population,\nApril 1, 2010
## 1 38,802,500 37,253,956
## 2 26,956,958 25,145,561
## 3 19,893,297 18,801,310
## 4 19,746,227 19,378,102
## 5 12,880,580 12,830,632
## 6 12,787,209 12,702,379
## Census population,\nApril 1, 2000 Seats inU.S. House,\n2013–2023
## 1 33,871,648 !000053
## 2 20,851,820 !000036
## 3 15,982,378 !000027
## 4 18,976,457 !000027
## 5 12,419,293 !000018
## 6 12,281,054 !000018
## Presi-\ndential\nElectors\n2012–\n2020
## 1 !000055
## 2 !000038
## 3 !000029
## 4 !000029
## 5 !000020
## 6 !000020
## 2014 Estimated pop.\nper\nHouse seat
## 1 732,123
## 2 748,804
## 3 736,789
## 4 731,342
## 5 715,588
## 6 710,401
## 2010 Census pop.\nper\nHouse\nseat[4] 2000 Census pop.\nper\nHouse\nseat
## 1 702,905 639,088
## 2 698,487 651,619
## 3 696,345 639,295
## 4 717,707 654,361
## 5 712,813 653,647
## 6 705,688 646,371
## Percent\nof total\nU.S. pop.,\n2014[5]
## 1 12.17%
## 2 8.45%
## 3 6.24%
## 4 6.19%
## 5 4.04%
## 6 4.01%
There's some work to be done on the column names, but this is a pretty pain-free way to scrape a table. As usual, a big shout-out to Hadley Wickham for making this so easy for us.
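A lightweight way to do that cleanup is to collapse the embedded newlines in the headers with `gsub()` and strip the commas from the numeric columns. A minimal sketch, using a small stand-in data frame whose column names mimic the messy headers printed above (the real scraped names will differ slightly):

```r
# Stand-in data frame; its names mimic the messy headers printed above
population <- data.frame(rank = 1:2, pop = c("38,802,500", "26,956,958"))
names(population) <- c("Rank in\nthe Fifty\nStates,\n2014",
                       "Population estimate for\nJuly 1, 2014")

# Collapse runs of whitespace (including embedded newlines) to single spaces
names(population) <- gsub("\\s+", " ", names(population))

# The population columns come through as character with commas;
# strip the commas and convert to numeric
population[[2]] <- as.numeric(gsub(",", "", population[[2]]))

names(population)
## [1] "Rank in the Fifty States, 2014"
## [2] "Population estimate for July 1, 2014"
```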
rvest is a nice framework for many folks. Thanks for sharing. This particular task can also be handled very easily with the XML package via: library(XML); readHTMLTable(url, which=1)
Can confirm it works... df <- XML::readHTMLTable(url, which=1, stringsAsFactors=F)
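Spelled out as a complete snippet, the XML-package route from the comments above would look like this (same URL as in the post; `readHTMLTable` fetches and parses in one call):

```r
library(XML)

url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"

# which = 1 grabs the first table on the page; stringsAsFactors = FALSE
# keeps the columns as character so they can be cleaned up afterwards
df <- readHTMLTable(url, which = 1, stringsAsFactors = FALSE)
head(df)
```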
Yeah, rvest is probably overkill for this task, but the XML package has defeated me in the past, so I try to keep my distance.
I am trying desperately to find a way of scraping review data from Tripadvisor.
http://notesofdabbler.github.io/201408_hotelReview/scrapeTripAdvisor.html looked very promising, but I am having difficulty following the code and, with the help of others, got as far as retrieving some of the data. However, certain 'properties', i.e. ID number and review title, worked for only 9 out of 10 of the reviews on a page.
I am not an R boffin, so I was glad to see that Hadley Wickham's code looked a lot more straightforward using rvest (https://github.com/hadley/rvest/blob/37006b94dea4035f7d949f29b6449d0278884969/demo/tripadvisor.R). I am still battling to get anything out of it. I was wondering if you could advise me on these errors:
Error: could not find function "%>%"
Error: could not find function "html_nodes"
I assume this means that I have not successfully installed rvest?
> library(rvest)
Error in get(method, envir = home) :
lazy-load database 'C:/Users/DNLCA_000/Documents/R/win-library/3.2/rvest/R/rvest.rdb' is corrupt
In addition: Warning messages:
1: In .registerS3method(fin[i, 1], fin[i, 2], fin[i, 3], fin[i, 4], :
restarting interrupted promise evaluation
2: In get(method, envir = home) :
restarting interrupted promise evaluation
3: In get(method, envir = home) : internal error -3 in R_decompress1
Error: package or namespace load failed for ‘rvest’
I would truly appreciate some advice on how I could proceed from here as I am not sure where to look for help in this matter.
I'm not sure why your installation of rvest on Windows didn't work. Maybe try using devtools: install.packages("devtools"); library("devtools"); install_github("hadley/rvest")
A minor typo, I think, install.packageS ?
Hi Corry,
Thanks for this wonderful blog. I am a novice trying to learn R and web scraping. I have basic knowledge of R, like installing packages and data frames, etc. I tried to follow the steps using the code that you have provided, and unfortunately, when I reach the line html_table() and hit enter, I get the following error:
Error in UseMethod("html_table") :
no applicable method for 'html_table' applied to an object of class "xml_nodeset"
In addition: Warning message:
'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")
Will you be able to help me figure out what mistake I am making?
Regards
Mihir
"""Error in UseMethod("html_table")"""
Delete"""""""""""""""""""""""""""""""""""""""
"html() %>%" - old method, no good, fukaka.
"read_html() %>%" - good method.
I think rvest updated.
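For anyone hitting the same error, here is a sketch of the post's pipeline updated for newer versions of rvest, with read_html() in place of the deprecated html(). The XPath is unchanged from the post, though Wikipedia's page structure may have shifted since it was written:

```r
library(rvest)

url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"

# read_html() replaces the deprecated html(); the rest of the
# pipeline is the same as in the original post
population <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/table[1]') %>%
  html_table()

population <- population[[1]]
head(population)
```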
Hey!! Great post.. love rvest for data mining!!
Hope you can help me here with a problem. So far, I can scrape mostly anything from this website I'm working with, except the price. I have tried everything from 'h4' to '//*[@id="plpContent"]/div[1]/a/h4' to '.priceOffer', 'before', etc.
Please, I would be so happy if you can help me with this one!!!
This is the link: https://www.carulla.com/browse?Ntt=gelatina&No=0&Nrpp=80
Thanks and congrats!!! Hope you have a Happy New Year!