What if I want to scrape the computational notes on my own website? I need a for-loop to iterate over each page, and I’d like to end up with a data set containing the following for each page:
title
the text content.
First, I need to scrape the links to direct my for-loop where to go.
library(rvest)
own_links_page <- read_html('https://christopherdishop.netlify.com/computational_notes/')
own_links <- html_nodes(own_links_page, '.title a') %>%
html_attr('href')
head(own_links)
## [1] "/computational_notes/methods_others/"
## [2] "/computational_notes/julia_cheatsheet/"
## [3] "/computational_notes/skeleton/"
## [4] "/computational_notes/sims_rccp/"
## [5] "/computational_notes/regression_composite/"
## [6] "/computational_notes/ma_compute/"
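To see what the '.title a' selector and html_attr('href') pull out without touching the live site, here is a small offline sketch. The HTML snippet is made up for illustration; it only assumes that each post title is a link nested inside an element with class title, which is what the selector above implies.

```r
library(rvest)

# Hypothetical stand-in for the links page; the real page's markup may differ.
toy_page <- read_html(paste0(
  '<h2 class="title"><a href="/computational_notes/methods_others/">Methods</a></h2>',
  '<h2 class="title"><a href="/computational_notes/julia_cheatsheet/">Julia</a></h2>',
  '<p><a href="/about/">About</a></p>'
))

# Keep only anchors nested inside .title elements, then pull their href attribute.
toy_links <- html_nodes(toy_page, '.title a') %>%
  html_attr('href')

toy_links
```

The link in the plain paragraph is skipped because it does not sit inside a .title element.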
Now I have a vector of links directing me to relevant pages. Normally, the next step would be to use read_html() and specify one value within own_links to tell it where to go. That won’t work here because the links stored within own_links are relative, and read_html() needs a complete URL (given a relative path, it looks for a local file instead).
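So, in the next step I will use the jump_to() function from Hadley’s rvest package, which can handle relative paths. (In rvest 1.0.0 and later, jump_to() and html_session() are renamed session_jump_to() and session().)
Start on the link homepage, then use the first value stored within own_links to navigate to the first web page I want to scrape.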
# jump_to() requires a html_session rather than read_html page
own_links_page <- html_session('https://christopherdishop.netlify.com/computational_notes/')
scrape_page1 <- own_links_page %>%
jump_to(own_links[1])
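jump_to() resolves the relative link against the session's current URL. The same resolution can be sketched by hand with xml2::url_absolute(), which would also let read_html() be used directly. This is an alternative shown for illustration, not what the rest of this post uses; the two links are copied from the output above so the snippet runs without a network connection.

```r
library(xml2)

base_url <- 'https://christopherdishop.netlify.com/computational_notes/'

# Two of the relative links scraped earlier, hard-coded for the example.
relative_links <- c(
  '/computational_notes/methods_others/',
  '/computational_notes/julia_cheatsheet/'
)

# Resolve each relative path against the base URL to get a complete address.
absolute_links <- url_absolute(relative_links, base_url)

absolute_links
```

Each element of absolute_links is a full address that could be passed to read_html().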
Then, scrape the first paragraph of that post (html_node() returns only the first match for the selector).
html_node(scrape_page1, '.content.container p') %>% html_text
## [1] "I recently created my first slide deck using Garrick Aden-Buie’s awesome template called “Gentle Ggplot2.” You can find the presentation here and the source code on my GitHub page."
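Because html_node() stops at the first match, only the opening paragraph comes back. If the whole post were wanted, one option is html_nodes() plus paste(). The sketch below uses a made-up HTML fragment (the class names mirror the selector above, but the real page structure is an assumption).

```r
library(rvest)

# Hypothetical post body with two paragraphs.
toy_post <- read_html(paste0(
  '<div class="content container">',
  '<p>First paragraph.</p>',
  '<p>Second paragraph.</p>',
  '</div>'
))

# html_node(): first matching paragraph only.
first_only <- html_node(toy_post, '.content.container p') %>%
  html_text()

# html_nodes(): every matching paragraph, collapsed into one string.
all_text <- html_nodes(toy_post, '.content.container p') %>%
  html_text() %>%
  paste(collapse = ' ')

first_only
all_text
```

Keeping only the first paragraph, as this post does, gives a short summary per page; collapsing all paragraphs would give the full text for later analysis.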
Cool, everything works for a single page. Now iterate over every page and store the results in a data frame.
library(tidyverse)
# pre-allocate one row per link; stringsAsFactors = FALSE keeps both columns as character
df3 <- data.frame(
  page_title = rep('store', length(own_links)),
  text = rep('store', length(own_links)),
  stringsAsFactors = FALSE
)
for (i in seq_along(own_links)) {
  # pause between requests to go easy on the server
  Sys.sleep(0.5)
  navigate_page <- own_links_page %>%
    jump_to(own_links[i])
  # build the title from the URL slug
  title <- gsub('/computational_notes/', '', own_links[i])
  title <- gsub('/', '', title)
  # first paragraph of the post
  text <- html_node(navigate_page, '.content.container p') %>% html_text()
  df3[i, 'page_title'] <- title
  df3[i, 'text'] <- text
}
head(df3)
## page_title
## 1 methods_others
## 2 julia_cheatsheet
## 3 skeleton
## 4 sims_rccp
## 5 regression_composite
## 6 ma_compute
## text
## 1 I recently created my first slide deck using Garrick Aden-Buie’s awesome template called “Gentle Ggplot2.” You can find the presentation here and the source code on my GitHub page.
## 2 I started using Julia for my computational models and recently created a cheat sheet to house all of my common commands.
## 3 I recently created a course skeleton for a research methods or statistics course. The website allows you to incorporate Rmarkdown and dynamic documents to better demonstrate interactive coding.
## 4 Simulating dynamic processes is slow in R. Using the Rcpp function, we can incorporate C++ code to improve performance.
## 5 One way to think about regression is as a tool that takes a set of predictors and creates a weighted, linear composite that maximally correlates with the response variable. It finds a way to combine multiple predictors into a single thing, using regression weights, and the weights are chosen such that, once the single composite is formed, it maximally correlates with the outcome.
## 6 I’m taking comps pretty soon so this is my summary document regarding meta-analysis.
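The two gsub() calls in the loop strip the fixed prefix and then the remaining slashes. A possible shortcut, assuming every link ends in a single slug as the ones above do, is base R's basename(), which drops trailing slashes and returns the last path component. The links are hard-coded here so the comparison runs offline.

```r
# Two of the scraped links, copied from the earlier output.
example_links <- c(
  '/computational_notes/methods_others/',
  '/computational_notes/julia_cheatsheet/'
)

# Approach used in the loop: remove the prefix, then any leftover slashes.
slugs_gsub <- gsub('/', '', gsub('/computational_notes/', '', example_links))

# Alternative: take the last component of each path.
slugs_basename <- basename(example_links)

identical(slugs_gsub, slugs_basename)
```

basename() does not depend on the '/computational_notes/' prefix, so it would keep working if the notes moved to a different folder.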
Awesome. How about the sentiment across the various posts?
library(sentimentr)
sent_df <- sentiment(df3$text)
library(ggplot2)
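# sentiment() returns one row per sentence, so look up each row's title by element_id
sent_df$title <- df3$page_title[sent_df$element_id]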
sent_df <- sent_df %>%
mutate(color = ifelse(sentiment < 0, 'negative', 'positive'))
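sentiment() scores each sentence separately, so a post with two sentences contributes two rows. If a single score per post is preferred, sentimentr also provides sentiment_by(), which averages within each text. A small sketch with made-up text (the sentences are invented for illustration):

```r
library(sentimentr)

# Hypothetical posts: the first has two sentences, the second has one.
toy_text <- c(
  'I love this template. It is awesome.',
  'Simulating dynamic processes is slow.'
)

# Split into sentences once, then reuse the result for both calls.
toy_sentences <- get_sentences(toy_text)

by_sentence <- sentiment(toy_sentences)     # one row per sentence
by_post     <- sentiment_by(toy_sentences)  # one row per original text

nrow(by_sentence)
nrow(by_post)
```

In by_post the score lives in a column called ave_sentiment rather than sentiment, so any plotting code would need to refer to that name.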
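Sentiment by post in order of date posted. ggplot sorts a character x-axis alphabetically, so the titles are converted to a factor that keeps the order in which they were scraped.
ggplot(sent_df, aes(x = factor(title, levels = unique(title)), y = sentiment, fill = color)) +
geom_bar(stat = 'identity') +
labs(x = 'title') +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
theme(legend.title = element_blank())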
Sentiment by post arranged by strength of sentiment.
sent_df %>%
arrange(sentiment) %>%
mutate(ordered_titles = row_number()) %>%
ggplot(aes(x = ordered_titles, y = sentiment, fill = color)) +
geom_bar(stat = 'identity') +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
theme(legend.title = element_blank())
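The plot above puts row numbers on the x-axis, so the post names are lost. One way to keep them, sketched here on invented numbers (the titles and sentiment values below are hypothetical, not scraped results), is to let reorder() sort the titles by their sentiment inside aes().

```r
library(ggplot2)

# Hypothetical scores for three posts.
toy_sent <- data.frame(
  title = c('post_a', 'post_b', 'post_c'),
  sentiment = c(0.25, -0.10, 0.05),
  stringsAsFactors = FALSE
)
toy_sent$color <- ifelse(toy_sent$sentiment < 0, 'negative', 'positive')

# reorder() turns title into a factor whose levels follow increasing sentiment.
p <- ggplot(toy_sent, aes(x = reorder(title, sentiment), y = sentiment, fill = color)) +
  geom_bar(stat = 'identity') +
  labs(x = 'title') +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme(legend.title = element_blank())

p
```

With more than one row per title, reorder() sorts by the mean sentiment of each title by default.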