Rselenium and email R Script

November 23, 2019

Let’s see how to scrape a page using Rselenium and schedule a task using windows Task scheduler. This is useful for periodic status updates.

Since I’m using Chrome for this, download the chrome driver and install it. This is essential for Rselenium.

Basically, Rselenium communicates with the Chrome driver and controls the Chrome Browser instance.

library(sendmailR)
library(dplyr)
library(RSelenium)
library(xtable)

I’m using sendmailR for emails and xtable for formatting HTML tables in the email body. Install these libraries from CRAN.

setwd("C://driver for chrome")
line_break_function <- function(x){

 gsub("$<$br$>$","<br>",x)
}


options(xtable.sanitize.text.function = line_break_function)

This above is a hack that I wrote for line breaks to appear properly in xtable HTML table output.

&#35;rD <- rsDriver(browser = "chrome",chromever = "76.0.3809.25",port =5767L)
&#35;rD <- rsDriver(browser = "chrome")
rD <- RSelenium::rsDriver(browser = "chrome",
                     chromever =
                       system2(command = "wmic",
                               args = 'datafile where name="C:\\\\Program Files (x86)\\\\Google\\\\Chrome\\\\Application\\\\chrome.exe" get Version /value',
                               stdout = TRUE,
                               stderr = TRUE) %>%
                       stringr::str_extract(pattern = "(?<=Version=)\\d+\\.\\d+\\.\\d+\\.") %>%
                       magrittr::extract(!is.na(.)) %>%
                       stringr::str_replace_all(pattern = "\\.", replacement = "\\\\.") %>%
                       paste0("^",  .) %>%
                       stringr::str_subset(string = binman::list_versions(appname = "chromedriver") %>%
                                             dplyr::last()) %>%
                       as.numeric_version() %>%
                       max() %>%
                       as.character())

remDr <- rD[['client']]

Here we initialise the rsDriver.Note the commented parts. Those work but randomly stop working when the Chrome version on your system changes.So I got a solution for this from StackOverflow to fix.

remDr$navigate("")

Here we navigate to a sample html file with a table.I’m using a table to demonstrate data aggregation and functions

name	Job_no	Description	Job type	Date issued
Jill	60	Do this 1	TypeA	22/12/2018
Jill	80	Do this 3	TypeB	22/10/2018
Jill	70	Do this 5	TypeB	24/10/2018
Jill	50	Do this 7	TypeD	22/04/2018
Eve	10	Do this 2	TypeA	2/08/2018
Eve	900	Do this 22	TypeA	12/10/2018
Eve	900	Do this 22	TypeC	22/08/2018

As you can see in the table above, we have some columns which we want to aggregate

remDr$navigate("https:// wherever there is data")

Sys.sleep(120)

Above , we have defined where the browser should point to . At this point , a Chrome instance should be up and load site defined. Adding some Sleep time is necessasry to let the site load.

headerlist <- c("Name","Job_no","Description" ,"Job type","Date issued"  )

webElem5<- remDr$findElement('xpath', '//*[@id="T301444200"]')
pp <- webElem5$findChildElements(using = 'tag name','tr')

We have defined headers for data aggregation. Some sites have table ids for every table .Some just have one table. You can target that with “.table” If you have a table id, to find it,it could be as easy as right-clicking on the table and inspecting it . I have used the xpath definition.You can use others too. If it is too nested inside, try expanding all the elements successively and find the id of the table

Basically you want to target the table in any manner possible.The findelement method parses the DOM tree using the target

ee <-lapply(tail(pp,-1), function(rr){

ff <- rr$findChildElements('tag name','td')
er <- lapply(ff, function(ty) {
    if(as.integer(strsplit(ty$getElementValueOfCssProperty("width")[[1]],"px")[[1]]) == 0)
{ return(NA)}
{return("visible")}
 
})
eee2 <- ff[!is.na(er)]  
hu <- lapply(eee2, function(jj){

jj$getElementText()
})

ok <-  data.frame(t(unlist(lapply(hu,"[[",1))))
colnames(ok) <- headerlist
ok

})

cv <- bind_rows(ee)

In the above code, we also handle table columns which are rendered invisible due to column widthe being equal to 0 and then convert the table to a dataframe.

Lets see how to summarise the data

t1 <- cv %>%
group_by(name) %>%
count() %>%
setNames(c("name",format(Sys.time(), format = "%d/%m/%Y "))) %>%
filter(Assignee %in% c("Jill,"Eve"))

In the above code block, we use dplyr’s count to find the number of jobs against each person and show it with today’s date

t2 <- cv %>%
filter(Assignee %in% c("Jill,"Eve")) %>%
group_by(Status) %>%
count() %>%
setNames(c("Status",format(Sys.time(), format = "%d/%m/%Y ")))

In the above code block, we use dplyr’s count to find the number of jobs grouped by status and show it with today’s date

t3 <- cv %>%
filter(Assignee %in% c("Jill,"Eve")) %>%
group_by(Assignee) %>%
summarise(`work:working days pending` =
          paste(1:length(`Description`),')',`Job type`,"-",`Summary`,
                "-",
                paste((round(as.numeric(as.character(difftime(Sys.time(),
                                                              as.POSIXct(`Date`, format = "%m/%d/%Y %I:%M:%S %p"), units = "days")))) -
                         floor(round(as.numeric(as.character(difftime(Sys.time(),
                                                                      as.POSIXct(`Date`, format = "%m/%d/%Y %I:%M:%S %p"), units = "days"))))/7)*2)       
                      
                      ,"days")
, collapse = ",<br>"))

In the above code block, we use dplyr’s summarise to find the to find the number of jobs against each person and how many days has it been pending

subject <- paste0("Status at ",format(Sys.time(), format = "%H:%M %p")," IST")
from <- "ddddd@mmmmm.com"
to_user <- "ffff@mmmmm.com"

sendmail_options(smtpServer="yyy.zzzz.com")
body = mime_part(paste('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0
Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
                   <html xmlns="http://www.w3.org/1999/xhtml">
                   <head>
                   <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
                   <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
                   </head>
                   <body>', print(xtable(t1), type = 'html'),'<br>',print(xtable(t3), type = 'html'),'<br>',
                   print(xtable(t2), type = 'html'),
                   '</body>
                   </html>'))
body[["headers"]][["Content-Type"]] <- "text/html"

bodyWithAttachment <- list(body)
sendmail(from=from,to=to_user,subject=subject,msg=bodyWithAttachment)
remDr$close()

The above code block initialises and sends the email