Scraping a table using the rvest package
I’m totally new to web scraping and I’m exploring the capabilities of the rvest library in R.
I’m trying to scrape a table on wellbeing in Italian provinces from the following website:
install.packages('rvest')
library('rvest')
url <- 'http://www.ilsole24ore.com/speciali/qvita_2017_dati/home.shtml'
webpage <- read_html(url)
but I’m unable to identify the XPath of the table.
Asked by Luca De Benedictis on 27 December 2017. Answers: 1
Even with the following, you have quite a bit of work to do. The HTML is in terrible shape.
library(rvest)
library(stringi)
library(tidyverse)
read_html("http://www.ilsole24ore.com/speciali/qvita_2017_dati/home.shtml") %>% # get the main site
  html_node(xpath=".//script[contains(., 'goToDefaultPage')]") %>% # find the <script> block that dynamically loads the page
  html_text() %>%
  stri_match_first_regex("goToDefaultPage\\('(.*)'\\)") %>% # extract the page link
  .[,2] %>%
  sprintf("http://www.ilsole24ore.com/speciali/qvita_2017_dati/%s", .) %>% # prepend the URL prefix
  read_html() -> actual_page # get the dynamic page
tab <- html_nodes(actual_page, xpath=".//table")[[2]] # find the actual data table
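To see what the regex step in that pipeline does without hitting the site, here is a small offline sketch. The `script_text` string and the file name `pagina_esempio.shtml` are made up for illustration; the real `<script>` block will look different.

```r
library(stringi)

# Hypothetical script text, standing in for what html_text() returns
script_text <- "function init() { goToDefaultPage('pagina_esempio.shtml'); }"

# The result is a character matrix: column 1 is the whole match,
# column 2 is the first capture group (the page name)
m <- stri_match_first_regex(script_text, "goToDefaultPage\\('(.*)'\\)")
page <- m[, 2]

# Prepend the same URL prefix the pipeline uses
full_url <- sprintf("http://www.ilsole24ore.com/speciali/qvita_2017_dati/%s", page)
print(full_url)
```

That is why the pipeline needs the `.[,2]` step: it picks the capture group rather than the full match.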
Once you do that, you have an HTML <table>. It's in terrible, awful, pathetic shape, and that site should really be ashamed of how it abuses HTML.
Go ahead and try html_table(). It's so bad it breaks httr.
We need to attack it by row, and we'll need a helper function so the R code doesn't look horrible:
`%|0%` <- function(x, y) { if (length(x) == 0) y else x }
^^ will help us fill in NULL-like content with a blank "".
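A quick way to convince yourself the helper behaves as described, using plain R values instead of scraped nodes. Here `character(0)` stands in for what `html_text()` gives back when the XPath matches nothing in a row; the sample values are only for illustration.

```r
# Same helper as above: fall back to y when x has length zero
`%|0%` <- function(x, y) { if (length(x) == 0) y else x }

missing_cell <- character(0) %|0% ""   # nothing matched, so use the blank
null_cell    <- NULL %|0% ""           # NULL also has length zero
present_cell <- "Belluno" %|0% ""      # a real value passes through untouched

print(list(missing_cell, null_cell, present_cell))
```

Without the fallback, a zero-length column in one row would make it impossible to bind that row into a data frame with the others.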
Now we go row by row, extracting the <td> values we need. This does not get all of them, since I don't need this data, and it needs cleaning, as we'll see in a bit:
html_nodes(tab, "tr") %>%
  map_df(~{
    list(
      posizione = html_text(html_nodes(.x, xpath=".//td[2]"), trim=TRUE) %|0% "",
      diff_pos = html_text(html_nodes(.x, xpath=".//td[5]"), trim=TRUE) %|0% "",
      provincia = html_text(html_nodes(.x, xpath=".//td[8]"), trim=TRUE) %|0% "",
      punti = html_text(html_nodes(.x, xpath=".//td[11]"), trim=TRUE) %|0% "",
      box1 = html_text(html_nodes(.x, xpath=".//td[14]"), trim=TRUE) %|0% "",
      box2 = html_text(html_nodes(.x, xpath=".//td[17]"), trim=TRUE) %|0% "",
      box3 = html_text(html_nodes(.x, xpath=".//td[20]"), trim=TRUE) %|0% ""
    )
  })
## # A tibble: 113 x 7
## posizione diff_pos provincia punti box1 box2 box3
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Lavoro e Innovazione Giustizia e Sicurezza
## 2 Diff. pos.
## 3 1 3 Belluno 583
## 4 2 -1 Aosta 578 9 63 22
## 5 3 2 Sondrio 574 4 75 1
## 6 4 3 Bolzano 572 2 4 7
## 7 5 -2 Trento 567 8 11 15
## 8 6 4 Trieste 563 6 10 2
## 9 7 9 Verbano-Cusio-Ossola 548 18 73 25
## 10 8 -6 Milano 544 1 2 10
## # ... with 103 more rows
As you can see, it misses some things and has some junk in the header, but you're further along than you were before.
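One possible way to do that cleanup, as a rough sketch rather than a finished solution. The `scraped` data frame below is a hand-typed stand-in built from a few rows of the printed output; which columns the header junk actually lands in is a guess, and treating the scores as whole numbers is an assumption based on what is printed. The same base R subsetting should also work on the tibble returned by `map_df()`.

```r
# Stand-in for the scraped result (placement of the header junk is assumed)
scraped <- data.frame(
  posizione = c("", "", "1", "2", "3"),
  diff_pos  = c("", "Diff. pos.", "3", "-1", "2"),
  provincia = c("Lavoro e Innovazione", "", "Belluno", "Aosta", "Sondrio"),
  punti     = c("", "", "583", "578", "574"),
  box1      = c("", "", "", "9", "4"),
  stringsAsFactors = FALSE
)

# Keep only rows whose position is a whole number; this drops the header junk
cleaned <- scraped[grepl("^[0-9]+$", scraped$posizione), ]

# Convert the numeric-looking columns, turning blanks into NA first
num_cols <- setdiff(names(cleaned), "provincia")
cleaned[num_cols] <- lapply(cleaned[num_cols], function(x) {
  x[x == ""] <- NA
  as.integer(x)
})
rownames(cleaned) <- NULL
print(cleaned)
```

Blank cells, like the ones in the Belluno row, end up as NA, so you can still tell what the scrape missed.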
Answered by hrbrmstr on 27 December 2017