Rselenium and email R Script
November 23, 2019
Let’s see how to scrape a page using Rselenium and schedule a task using windows Task scheduler. This is useful for periodic status updates.
Since I’m using Chrome for this, download the chrome driver and install it. This is essential for Rselenium.
Basically, Rselenium communicates with the Chrome driver and controls the Chrome Browser instance.
library(sendmailR)
library(dplyr)
library(RSelenium)
library(xtable)
I’m using sendmailR for emails and xtable for formatting HTML tables in the email body. Install these libraries from CRAN.
setwd("C://driver for chrome")
line_break_function <- function(x){
gsub("$<$br$>$","<br>",x)
}
options(xtable.sanitize.text.function = line_break_function)
This above is a hack that I wrote for line breaks to appear properly in xtable HTML table output.
#rD <- rsDriver(browser = "chrome",chromever = "76.0.3809.25",port =5767L)
#rD <- rsDriver(browser = "chrome")
rD <- RSelenium::rsDriver(browser = "chrome",
chromever =
system2(command = "wmic",
args = 'datafile where name="C:\\\\Program Files (x86)\\\\Google\\\\Chrome\\\\Application\\\\chrome.exe" get Version /value',
stdout = TRUE,
stderr = TRUE) %>%
stringr::str_extract(pattern = "(?<=Version=)\\d+\\.\\d+\\.\\d+\\.") %>%
magrittr::extract(!is.na(.)) %>%
stringr::str_replace_all(pattern = "\\.", replacement = "\\\\.") %>%
paste0("^", .) %>%
stringr::str_subset(string = binman::list_versions(appname = "chromedriver") %>%
dplyr::last()) %>%
as.numeric_version() %>%
max() %>%
as.character())
remDr <- rD[['client']]
Here we initialise the rsDriver.Note the commented parts. Those work but randomly stop working when the Chrome version on your system changes.So I got a solution for this from StackOverflow to fix.
remDr$navigate("")
Here we navigate to a sample html file with a table.I’m using a table to demonstrate data aggregation and functions
name | Job_no | Description | Job type | Date issued | |
---|---|---|---|---|---|
Jill | 60 | Do this 1 | TypeA | 22/12/2018 | |
Jill | 80 | Do this 3 | TypeB | 22/10/2018 | |
Jill | 70 | Do this 5 | TypeB | 24/10/2018 | |
Jill | 50 | Do this 7 | TypeD | 22/04/2018 | |
Eve | 10 | Do this 2 | TypeA | 2/08/2018 | |
Eve | 900 | Do this 22 | TypeA | 12/10/2018 | |
Eve | 900 | Do this 22 | TypeC | 22/08/2018 |
As you can see in the table above, we have some columns which we want to aggregate
remDr$navigate("https:// wherever there is data")
Sys.sleep(120)
Above , we have defined where the browser should point to . At this point , a Chrome instance should be up and load site defined. Adding some Sleep time is necessasry to let the site load.
headerlist <- c("Name","Job_no","Description" ,"Job type","Date issued" )
webElem5<- remDr$findElement('xpath', '//*[@id="T301444200"]')
pp <- webElem5$findChildElements(using = 'tag name','tr')
We have defined headers for data aggregation. Some sites have table ids for every table .Some just have one table. You can target that with “.table” If you have a table id, to find it,it could be as easy as right-clicking on the table and inspecting it . I have used the xpath definition.You can use others too. If it is too nested inside, try expanding all the elements successively and find the id of the table
Basically you want to target the table in any manner possible.The findelement method parses the DOM tree using the target
ee <-lapply(tail(pp,-1), function(rr){
ff <- rr$findChildElements('tag name','td')
er <- lapply(ff, function(ty) {
if(as.integer(strsplit(ty$getElementValueOfCssProperty("width")[[1]],"px")[[1]]) == 0)
{ return(NA)}
{return("visible")}
})
eee2 <- ff[!is.na(er)]
hu <- lapply(eee2, function(jj){
jj$getElementText()
})
ok <- data.frame(t(unlist(lapply(hu,"[[",1))))
colnames(ok) <- headerlist
ok
})
cv <- bind_rows(ee)
In the above code, we also handle table columns which are rendered invisible due to column widthe being equal to 0 and then convert the table to a dataframe.
Lets see how to summarise the data
t1 <- cv %>%
group_by(name) %>%
count() %>%
setNames(c("name",format(Sys.time(), format = "%d/%m/%Y "))) %>%
filter(Assignee %in% c("Jill,"Eve"))
In the above code block, we use dplyr’s count to find the number of jobs against each person and show it with today’s date
t2 <- cv %>%
filter(Assignee %in% c("Jill,"Eve")) %>%
group_by(Status) %>%
count() %>%
setNames(c("Status",format(Sys.time(), format = "%d/%m/%Y ")))
In the above code block, we use dplyr’s count to find the number of jobs grouped by status and show it with today’s date
t3 <- cv %>%
filter(Assignee %in% c("Jill,"Eve")) %>%
group_by(Assignee) %>%
summarise(`work:working days pending` =
paste(1:length(`Description`),')',`Job type`,"-",`Summary`,
"-",
paste((round(as.numeric(as.character(difftime(Sys.time(),
as.POSIXct(`Date`, format = "%m/%d/%Y %I:%M:%S %p"), units = "days")))) -
floor(round(as.numeric(as.character(difftime(Sys.time(),
as.POSIXct(`Date`, format = "%m/%d/%Y %I:%M:%S %p"), units = "days"))))/7)*2)
,"days")
, collapse = ",<br>"))
In the above code block, we use dplyr’s summarise to find the to find the number of jobs against each person and how many days has it been pending
subject <- paste0("Status at ",format(Sys.time(), format = "%H:%M %p")," IST")
from <- "ddddd@mmmmm.com"
to_user <- "ffff@mmmmm.com"
sendmail_options(smtpServer="yyy.zzzz.com")
body = mime_part(paste('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0
Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
</head>
<body>', print(xtable(t1), type = 'html'),'<br>',print(xtable(t3), type = 'html'),'<br>',
print(xtable(t2), type = 'html'),
'</body>
</html>'))
body[["headers"]][["Content-Type"]] <- "text/html"
bodyWithAttachment <- list(body)
sendmail(from=from,to=to_user,subject=subject,msg=bodyWithAttachment)
remDr$close()
The above code block initialises and sends the email