A thesaurus is not a species of dinosaur

RSelenium and email R Script

November 23, 2019

Let’s see how to scrape a page using RSelenium and schedule the script as a task using Windows Task Scheduler. This is useful for periodic status updates.

Since I’m using Chrome for this, download ChromeDriver and install it. This is essential for RSelenium.

Basically, RSelenium communicates with ChromeDriver, which in turn controls the Chrome browser instance.

library(sendmailR)
library(dplyr)
library(RSelenium)
library(xtable)

I’m using sendmailR for emails and xtable for formatting HTML tables in the email body. Install these libraries from CRAN.

setwd("C://driver for chrome")
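# Replaces xtable's default sanitizer so that <br> tags in cell text are not escaped
line_break_function <- function(x){
  # fixed = TRUE matches the literal text "$<$br$>$" instead of treating "$" as a regex anchor
  gsub("$<$br$>$","<br>",x, fixed = TRUE)
}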


options(xtable.sanitize.text.function = line_break_function)

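This is a hack I wrote so that line breaks appear properly in xtable’s HTML table output. Setting this option replaces xtable’s default sanitize function, which would otherwise escape the <br> tags.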
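To see what the hack changes, here is a minimal sketch comparing xtable's default HTML output with the output when cell text is passed through unchanged. The data frame `demo` and its contents are made up for illustration. `print.results = FALSE` makes `print` return the HTML as a string instead of writing it to the console, and `sanitize.text.function = NULL` asks for the default escaping even if the option above is already set.

```r
library(xtable)

# Hypothetical one-row table whose cell contains an HTML line break
demo <- data.frame(jobs = "1 ) TypeA - first job,<br>2 ) TypeB - second job",
                   stringsAsFactors = FALSE)

# Default behaviour: xtable escapes "<" and ">" in cell text
escaped <- print(xtable(demo), type = "html",
                 sanitize.text.function = NULL,
                 print.results = FALSE)

# Pass cell text through unchanged so the <br> tag survives
raw <- print(xtable(demo), type = "html",
             sanitize.text.function = function(x) x,
             print.results = FALSE)

grepl("&lt;br&gt;", escaped, fixed = TRUE)  # is the tag escaped by default?
grepl("<br>", raw, fixed = TRUE)            # is the tag intact without sanitizing?
```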

#rD <- rsDriver(browser = "chrome",chromever = "76.0.3809.25",port =5767L)
#rD <- rsDriver(browser = "chrome")
rD <- RSelenium::rsDriver(browser = "chrome",
                     chromever =
                       system2(command = "wmic",
                               args = 'datafile where name="C:\\\\Program Files (x86)\\\\Google\\\\Chrome\\\\Application\\\\chrome.exe" get Version /value',
                               stdout = TRUE,
                               stderr = TRUE) %>%
                       stringr::str_extract(pattern = "(?<=Version=)\\d+\\.\\d+\\.\\d+\\.") %>%
                       magrittr::extract(!is.na(.)) %>%
                       stringr::str_replace_all(pattern = "\\.", replacement = "\\\\.") %>%
                       paste0("^",  .) %>%
                       stringr::str_subset(string = binman::list_versions(appname = "chromedriver") %>%
                                             dplyr::last()) %>%
                       as.numeric_version() %>%
                       max() %>%
                       as.character())

remDr <- rD[['client']]

Here we initialise the rsDriver. Note the commented-out lines. Those work, but they stop working whenever the Chrome version on your system changes. So I used a fix from StackOverflow that reads the installed Chrome version and picks the matching chromedriver version.
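The idea behind that pipeline can be shown with plain strings: take the first three components of the installed Chrome version and pick the highest available chromedriver version that starts with them. The version numbers below are made up for illustration; in the real code they come from `wmic` and `binman::list_versions()`.

```r
# Hypothetical inputs standing in for the wmic and binman output
chrome_version    <- "76.0.3809.132"
available_drivers <- c("75.0.3770.140", "76.0.3809.68",
                       "76.0.3809.126", "77.0.3865.40")

# Keep "major.minor.build." of the installed Chrome version
prefix <- regmatches(chrome_version,
                     regexpr("^[0-9]+\\.[0-9]+\\.[0-9]+\\.", chrome_version))

# Drivers sharing that prefix, then the highest of them.
# numeric_version compares component by component, so 126 ranks above 68.
candidates <- available_drivers[startsWith(available_drivers, prefix)]
chromever  <- as.character(max(numeric_version(candidates)))

chromever  # "76.0.3809.126"
```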

remDr$navigate("")

Here we navigate to a sample HTML file with a table. I’m using a table to demonstrate data aggregation and functions.

name   Job_no   Description   Job type   Date issued
Jill   60       Do this 1     TypeA      22/12/2018
Jill   80       Do this 3     TypeB      22/10/2018
Jill   70       Do this 5     TypeB      24/10/2018
Jill   50       Do this 7     TypeD      22/04/2018
Eve    10       Do this 2     TypeA      2/08/2018
Eve    900      Do this 22    TypeA      12/10/2018
Eve    900      Do this 22    TypeC      22/08/2018

As you can see in the table above, we have some columns that we want to aggregate.

remDr$navigate("https:// wherever there is data")

Sys.sleep(120)

Above, we have defined where the browser should point to. At this point, a Chrome instance should be up and loading the site we defined. Adding some sleep time is necessary to let the site load.
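A fixed two-minute sleep works, but a polling loop can return as soon as the table appears. The sketch below is one way to do that. `wait_for_element` is a hypothetical helper, not part of RSelenium, and it relies on `findElements` returning an empty list when nothing matches.

```r
# Hypothetical helper: poll until an element matching the selector exists,
# or give up after `timeout` seconds. Returns TRUE if found, FALSE otherwise.
wait_for_element <- function(remDr, using, value, timeout = 120, poll = 2) {
  deadline <- Sys.time() + timeout
  repeat {
    found <- remDr$findElements(using = using, value = value)
    if (length(found) > 0) return(TRUE)       # element is present
    if (Sys.time() >= deadline) return(FALSE) # waited long enough
    Sys.sleep(poll)                           # pause before the next attempt
  }
}

# Usage with the driver from this post (needs a running browser session):
# if (!wait_for_element(remDr, "id", "T301444200")) stop("table did not load")
```

The timeout keeps the old two-minute limit as an upper bound, so a slow page behaves as before while a fast one no longer waits the full time.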

headerlist <- c("Name","Job_no","Description" ,"Job type","Date issued"  )

webElem5<- remDr$findElement('xpath', '//*[@id="T301444200"]')
pp <- webElem5$findChildElements(using = 'tag name','tr')

We have defined headers for data aggregation. Some sites have table ids for every table. Some just have one table, which you can target with “.table”. If the table has an id, finding it can be as easy as right-clicking on the table and inspecting it. I have used an xpath selector here, but you can use others too. If the table is deeply nested, try expanding the elements one by one until you find the id of the table.

Basically, you want to target the table in any manner possible. The findElement method searches the DOM tree using the given selector.

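# Drop the first <tr> (assumed to be the header row) and parse the rest
ee <- lapply(tail(pp, -1), function(rr) {
  ff <- rr$findChildElements('tag name', 'td')
  # Flag cells whose CSS width is 0px; these are rendered invisible
  er <- lapply(ff, function(ty) {
    width <- as.integer(strsplit(ty$getElementValueOfCssProperty("width")[[1]], "px")[[1]])
    if (width == 0) NA else "visible"
  })
  eee2 <- ff[!is.na(er)]  # keep only the visible cells
  # Extract the text of each visible cell
  hu <- lapply(eee2, function(jj) {
    jj$getElementText()
  })
  # Build a one-row data frame for this table row
  ok <- data.frame(t(unlist(lapply(hu, "[[", 1))))
  colnames(ok) <- headerlist
  ok
})

# Stack the one-row data frames into a single data frame
cv <- bind_rows(ee)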

In the above code, we also handle table columns that are rendered invisible because their width is 0, and then convert the table to a data frame.

Let’s see how to summarise the data.
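The same filtering and stacking can be tried without a browser. In this sketch, plain lists stand in for the scraped rows: each holds the cell text and the CSS width of every cell. The `rows` data, including the zero-width "hidden" cell, is made up for illustration.

```r
library(dplyr)

headerlist <- c("Name", "Job_no", "Description", "Job type", "Date issued")

# Hypothetical stand-ins for scraped rows: cell text plus each cell's CSS width.
# The fourth cell has width 0px, so it should be dropped.
rows <- list(
  list(text  = c("Jill", "60", "Do this 1", "hidden", "TypeA", "22/12/2018"),
       width = c("80px", "40px", "120px", "0px", "60px", "90px")),
  list(text  = c("Eve", "10", "Do this 2", "hidden", "TypeA", "2/08/2018"),
       width = c("80px", "40px", "120px", "0px", "60px", "90px"))
)

parsed <- lapply(rows, function(row) {
  visible <- as.integer(sub("px", "", row$width)) != 0  # TRUE for rendered cells
  ok <- data.frame(t(row$text[visible]), stringsAsFactors = FALSE)
  colnames(ok) <- headerlist
  ok
})

# One row per table row, one column per visible cell
cv_demo <- bind_rows(parsed)
cv_demo
```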

# Filter before renaming, using the column names defined in headerlist
t1 <- cv %>%
filter(Name %in% c("Jill","Eve")) %>%
group_by(Name) %>%
count() %>%
setNames(c("Name",format(Sys.time(), format = "%d/%m/%Y ")))

In the above code block, we use dplyr’s count to find the number of jobs against each person and show it with today’s date
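As a quick check of what the count produces, here is a sketch that types in the sample table from earlier by hand and counts jobs per person. The name `sample_jobs` is made up for this example. The same call works for any other column, such as `Job type`.

```r
library(dplyr)

# The sample table from earlier in the post, typed in by hand
sample_jobs <- data.frame(
  Name          = c("Jill", "Jill", "Jill", "Jill", "Eve", "Eve", "Eve"),
  Job_no        = c(60, 80, 70, 50, 10, 900, 900),
  Description   = c("Do this 1", "Do this 3", "Do this 5", "Do this 7",
                    "Do this 2", "Do this 22", "Do this 22"),
  `Job type`    = c("TypeA", "TypeB", "TypeB", "TypeD", "TypeA", "TypeA", "TypeC"),
  `Date issued` = c("22/12/2018", "22/10/2018", "24/10/2018", "22/04/2018",
                    "2/08/2018", "12/10/2018", "22/08/2018"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Number of jobs per person: Eve has 3, Jill has 4
jobs_per_person <- sample_jobs %>% count(Name)

# Number of jobs per job type: TypeA 3, TypeB 2, TypeC 1, TypeD 1
jobs_per_type <- sample_jobs %>% count(`Job type`)

jobs_per_person
jobs_per_type
```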

# Status is assumed to be a column of cv; the sample table above does not include one
t2 <- cv %>%
filter(Name %in% c("Jill","Eve")) %>%
group_by(Status) %>%
count() %>%
setNames(c("Status",format(Sys.time(), format = "%d/%m/%Y ")))

In the above code block, we use dplyr’s count to find the number of jobs grouped by status and show it with today’s date

# Summary and Date are assumed to be columns of cv; the sample table above does not include them
t3 <- cv %>%
filter(Name %in% c("Jill","Eve")) %>%
group_by(Name) %>%
summarise(`work:working days pending` =
          paste(1:length(`Description`),')',`Job type`,"-",`Summary`,
                "-",
                paste((round(as.numeric(as.character(difftime(Sys.time(),
                                                              as.POSIXct(`Date`, format = "%m/%d/%Y %I:%M:%S %p"), units = "days")))) -
                         floor(round(as.numeric(as.character(difftime(Sys.time(),
                                                                      as.POSIXct(`Date`, format = "%m/%d/%Y %I:%M:%S %p"), units = "days"))))/7)*2)       
                      
                      ,"days")
, collapse = ",<br>"))

In the above code block, we use dplyr’s summarise to list the jobs against each person and how many working days each one has been pending.
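The working-days arithmetic buried in that block is easier to follow in isolation. This sketch pulls it into a hypothetical helper, `working_days_pending`, and applies it to two fixed dates. Note that it is an approximation: it subtracts two days for every full week elapsed rather than checking which days were actually weekends.

```r
# Approximate working days between two timestamps: elapsed calendar days
# minus two days for every full week elapsed (the same rule as in t3).
working_days_pending <- function(issued, now = Sys.time()) {
  days <- round(as.numeric(difftime(now, issued, units = "days")))
  days - floor(days / 7) * 2
}

# Fixed example dates in the dd/mm/yyyy format of the sample table
issued <- as.POSIXct("22/12/2018", format = "%d/%m/%Y", tz = "UTC")
now    <- as.POSIXct("05/01/2019", format = "%d/%m/%Y", tz = "UTC")

working_days_pending(issued, now)  # 14 calendar days - 2 * 2 = 10
```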

subject <- paste0("Status at ",format(Sys.time(), format = "%H:%M %p")," IST")
from <- "ddddd@mmmmm.com"
to_user <- "ffff@mmmmm.com"

sendmail_options(smtpServer="yyy.zzzz.com")
body = mime_part(paste('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0
Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
                   <html xmlns="http://www.w3.org/1999/xhtml">
                   <head>
                   <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
                   <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
                   </head>
                   <body>', print(xtable(t1), type = 'html'),'<br>',print(xtable(t3), type = 'html'),'<br>',
                   print(xtable(t2), type = 'html'),
                   '</body>
                   </html>'))
body[["headers"]][["Content-Type"]] <- "text/html"

bodyWithAttachment <- list(body)
sendmail(from=from,to=to_user,subject=subject,msg=bodyWithAttachment)
remDr$close()

The above code block builds the HTML email body, sends the email and closes the browser.
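To run the script on a schedule, as mentioned at the start, one option is to register it with Windows Task Scheduler through its `schtasks` command. The sketch below only builds the arguments. The helper `build_schtasks_args`, the task name, the script path and the start time are all placeholders, and it assumes `Rscript.exe` is on the PATH and the script path contains no spaces.

```r
# Hypothetical helper that builds the arguments for:
# schtasks /Create /SC DAILY /TN <name> /TR <command> /ST <HH:MM>
build_schtasks_args <- function(task_name, script_path, start_time = "09:00") {
  c("/Create",
    "/SC", "DAILY",                           # run once a day
    "/TN", shQuote(task_name, type = "cmd"),  # task name
    "/TR", shQuote(paste("Rscript.exe", script_path), type = "cmd"),  # command to run
    "/ST", start_time)                        # start time, 24-hour HH:MM
}

task_args <- build_schtasks_args("StatusEmail", "C:/scripts/status_email.R")
paste(task_args, collapse = " ")

# On Windows, the task could then be registered with:
# system2("schtasks", task_args)
```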



Written by Narayanan Iyer, who lives and works in Mumbai. A full-time R and Shiny enthusiast, he spends way too much time on HN. Fluent in R, Shiny, Docker, Python, JavaScript and C#, and 6 other human languages.

You can contact him at narayanan iyer 22 at gmail dot com