Example 3 - Scrape My Website

What if I want to scrape the computational notes on my own website? I need a for-loop to iterate over each page, and I’d like to end up with a data set containing the following for each page:

title
the text content.

First, I need to scrape the links that tell my for-loop where to go.

library(rvest)
own_links_page <- read_html('https://christopherdishop.netlify.com/computational_notes/')
# grab every link inside a '.title' element and keep its href
own_links <- own_links_page %>%
  html_nodes('.title a') %>%
  html_attr('href')

head(own_links)

## [1] "/computational_notes/methods_others/"      
## [2] "/computational_notes/julia_cheatsheet/"    
## [3] "/computational_notes/skeleton/"            
## [4] "/computational_notes/sims_rccp/"           
## [5] "/computational_notes/regression_composite/"
## [6] "/computational_notes/ma_compute/"

Now I have a vector of links pointing to the relevant pages. Normally, the next step would be to pass one value from own_links to read_html() to tell it where to go. That won’t work here because the links stored in own_links are relative, and read_html() needs a full URL; given a relative path, it has no base to resolve it against.

So, in the next step I will use Hadley’s jump_to() function, which resolves relative paths against the current session.

I start a session on the page that lists the links, then use the first value stored in own_links to navigate to the first web page I want to scrape.

# jump_to() requires an html_session rather than a read_html() page
# (rvest 1.0.0 and later rename these to session() and session_jump_to())

own_links_page <- html_session('https://christopherdishop.netlify.com/computational_notes/')

scrape_page1 <- own_links_page %>%
  jump_to(own_links[1])
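
If a session isn’t needed, another option is to turn the relative links into absolute ones and hand those straight to read_html(). Below is a minimal sketch, assuming the xml2 package (which rvest builds on) and two of the links printed above; the names base_url, relative_links, and absolute_links are made up for illustration.

```r
library(xml2)

# base page the relative links were scraped from
base_url <- 'https://christopherdishop.netlify.com/computational_notes/'

# two of the relative links shown earlier
relative_links <- c('/computational_notes/methods_others/',
                    '/computational_notes/julia_cheatsheet/')

# url_absolute() resolves each relative path against the base URL
absolute_links <- url_absolute(relative_links, base_url)
absolute_links
```

Each element of absolute_links is a full URL, so it could be passed to read_html() directly.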

Then, scrape the text within that post. Note that html_node() returns only the first match, so this grabs the first paragraph.

html_node(scrape_page1, '.content.container p') %>% html_text()

## [1] "I recently created my first slide deck using Garrick Aden-Buie’s awesome template called “Gentle Ggplot2.” You can find the presentation here and the source code on my GitHub page."
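
To see why only one paragraph comes back, here is a small sketch on a toy page. The HTML string is hypothetical and only mimics the '.content.container p' structure; toy_page, first_par, and all_text are illustrative names.

```r
library(rvest)

# a tiny stand-in page (hypothetical content) with two paragraphs
toy_page <- read_html(
  "<div class='content container'><p>First paragraph.</p><p>Second paragraph.</p></div>"
)

# html_node() returns only the first match
first_par <- html_node(toy_page, '.content.container p') %>% html_text()

# html_nodes() returns every match; paste() collapses them into one string
all_text <- html_nodes(toy_page, '.content.container p') %>%
  html_text() %>%
  paste(collapse = ' ')

first_par
all_text
```

If the whole post were wanted rather than its opening paragraph, the html_nodes() version is one way to get it.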

Cool, everything works for a single page. Now iterate over every page and store the results in a data frame.

library(tidyverse)

# pre-allocate one row per link; stringsAsFactors = FALSE keeps both columns as character
df3 <- data.frame(
  page_title = rep(NA_character_, length(own_links)),
  text = rep(NA_character_, length(own_links)),
  stringsAsFactors = FALSE
)

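for(i in seq_along(own_links)){
  # pause between requests so the loop doesn't hammer the server
  Sys.sleep(0.5)
  
  navigate_page <- own_links_page %>%
    jump_to(own_links[i])
  
  # strip the directory and slashes from the link to get a short page title
  title <- gsub('/computational_notes/', '', own_links[i])
  title <- gsub('/', '', title)
  # first paragraph of the post
  text <- html_node(navigate_page, '.content.container p') %>% html_text()
  
  df3[i, 'page_title'] <- title
  df3[i, 'text'] <- text
  
}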

head(df3)

##             page_title
## 1       methods_others
## 2     julia_cheatsheet
## 3             skeleton
## 4            sims_rccp
## 5 regression_composite
## 6           ma_compute
##                                                                                                                                                                                                                                                                                                                                                                                             text
## 1                                                                                                                                                                                                           I recently created my first slide deck using Garrick Aden-Buie’s awesome template called “Gentle Ggplot2.” You can find the presentation here and the source code on my GitHub page.
## 2                                                                                                                                                                                                                                                                       I started using Julia for my computational models and recently created a cheat sheet to house all of my common commands.
## 3                                                                                                                                                                                              I recently created a course skeleton for a research methods or statistics course. The website allows you to incorporate Rmarkdown and dynamic documents to better demonstrate interactive coding.
## 4                                                                                                                                                                                                                                                                        Simulating dynamic processes is slow in R. Using the Rcpp function, we can incorporate C++ code to improve performance.
## 5 One way to think about regression is as a tool that takes a set of predictors and creates a weighted, linear composite that maximally correlates with the response variable. It finds a way to combine multiple predictors into a single thing, using regression weights, and the weights are chosen such that, once the single composite is formed, it maximally correlates with the outcome.
## 6                                                                                                                                                                                                                                                                                                           I’m taking comps pretty soon so this is my summary document regarding meta-analysis.
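
The two gsub() calls in the loop can also be written as a single vectorised step. This is only a sketch of an alternative, not what the loop above does: base R’s basename() drops trailing slashes and keeps the last path component. The links vector below repeats two of the scraped values, and page_titles is an illustrative name.

```r
# two of the relative links scraped earlier
links <- c('/computational_notes/methods_others/',
           '/computational_notes/julia_cheatsheet/')

# basename() removes the trailing slash, then everything up to the last separator
page_titles <- basename(links)
page_titles
```

Because basename() works on the whole vector at once, the titles could be filled in before the loop starts.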

Awesome. How about the sentiment across the various posts?

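library(sentimentr)
# sentiment() returns one row per sentence, so its rows would not line up with df3;
# sentiment_by() averages the sentence scores to one row per post
sent_df <- sentiment_by(get_sentences(df3$text))

library(ggplot2)
sent_df$title <- df3$page_title
sent_df <- sent_df %>%
  mutate(sentiment = ave_sentiment,
         color = ifelse(sentiment < 0, 'negative', 'positive'))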

Sentiment by post in order of date posted.

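# a character x-axis is sorted alphabetically, so fix the factor levels
# to keep the posts in the order they were scraped
ggplot(sent_df, aes(x = factor(title, levels = unique(title)), y = sentiment, fill = color)) + 
  geom_bar(stat = 'identity') + 
  labs(x = 'title') + 
  theme_classic() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  theme(legend.title = element_blank())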

Sentiment by post arranged by strength of sentiment.

sent_df %>%
  arrange(sentiment) %>%
  mutate(ordered_titles = row_number()) %>%
  ggplot(aes(x = ordered_titles, y = sentiment, fill = color)) + 
  geom_bar(stat = 'identity') + 
  theme_classic() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  theme(legend.title = element_blank())
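
The plot above puts row numbers on the x-axis, so the post titles are lost. One way to keep them is reorder(), which sorts the factor levels of title by sentiment. This is a minimal sketch on made-up scores; toy_df and its values are hypothetical stand-ins for sent_df.

```r
library(ggplot2)

# hypothetical scores standing in for sent_df
toy_df <- data.frame(
  title = c('post_a', 'post_b', 'post_c'),
  sentiment = c(0.30, -0.10, 0.05),
  stringsAsFactors = FALSE
)
toy_df$color <- ifelse(toy_df$sentiment < 0, 'negative', 'positive')

# reorder() sorts the levels of title by sentiment, so bars run from low to high
toy_df$title <- reorder(toy_df$title, toy_df$sentiment)

p <- ggplot(toy_df, aes(x = title, y = sentiment, fill = color)) +
  geom_bar(stat = 'identity') +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        legend.title = element_blank())
p
```

The same reorder(title, sentiment) call could be dropped into aes() for sent_df in place of ordered_titles.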