Using R Markdown & knitr

In the previous two posts, there were several plots and I indicated that I used R to scrape html tables from the Bureau of Labor Statistics website. This post provides a quick tutorial on how I obtained and plotted the data using R (& various packages).

I will first setup the R environment and load some of the packages that will be used. The good thing in using R Markdown is creating code in chunks for clarity and ease of debugging. Table 2 & Table 3 of the Economic News Release contain the relevant general labor force data including employment/unemployment rates for various groups. Note that Monthly data change in those tables. At the time of this writing, the tables had data for Aug 2016 - Aug 2017 period. I extracted and plotted unemployment rate data by race/ethnicity and by gender.

Code chunks in R Markdown begin & end with ```

{r setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE, invisible=TRUE, warning=FALSE)

#load libraries
pkg<-c("XML", "RCurl", "rlist", "curl", "stringr", "gapminder", 
	"ggplot2", "xml2", "dplyr", "zoo", "readxl") 

lapply(pkg, require, character.only = TRUE)
#This code will extract the data.
EconNews <- as.data.frame(readHTMLTable(
	getURL("https://www.bls.gov/news.release/empsit.t02.htm"),
        stringsAsFactors=FALSE, trim = TRUE))

#see what the first few lines of the data look like. We also need to assign appropriate columns and rows later.
head(EconNews[,c(1,5:10)])

#looking at the first few rows of first column & columns 5-10. columns 2-4 are unadjusted data.

Some more data cleaning & adjustment: remove “,” from the data so we can conveniently convert from one to another form of data types.

EconNews[] <- lapply(EconNews, gsub, pattern=',', replacement='')

#Make columns for the table and assign those to columns 2-10 (the first column is variable names)

colName <- c("Aug2016", "Jul2017", "Aug2017", "Aug2016a", "Apr2017a", 
		  "May2017a", "Jun2017a", "Jul2017a", "Aug2017a")
			
#the suffix 'a' indicates that column's data is seasonally adjusted.
names(EconNews)[2:10] <- colName

Essentially, the code for this is the same as above but need to point to the appropriate url & name the data frame.

EconNewsLat <- as.data.frame(readHTMLTable(
	getURL("https://www.bls.gov/news.release/empsit.t03.htm"), 
	stringsAsFactors=FALSE, trim = TRUE))

EconNewsLat[]<-lapply(EconNewsLat, gsub, pattern=',', replacement='')

names(EconNewsLat)[2:10] <- colName

Names of the first column & first row of “EconNews” & “EcoNewsLat” are different, since we can’t bind rows when column names are different, rename them with the same column name or drop the names. I will remove them (really assigning a blank).

colnames(EconNewsLat)[1] <- ""
colnames(EconNews)[1] <- ""

Use dplyr’s bind_rows function to append the two tables & call the giant table “EconNewsAll”

EconNewsAll <- dplyr::bind_rows(EconNews,EconNewsLat)

#Have a look at the first few lines of the relevant columns.
head(EconNewsAll[,c(1,5:10)])

Now I can subset data by race, gender and age. I also need to rename the rows.

#grab sub-table for all White group as "W1data"
W1data <- EconNewsAll[2:9,2:10] 

#Rename rows
rownames(W1data)<- c("Civil_noninst_pop","civil_labor_force", 
		"Particip_rate", "Employed", "Emp_pop_ratio", 
		"Unemployed","Unemp_rate", "Not_in_labor_force")

#transpose table for analysis & plotting
W1data <- as.data.frame(t(W1data))

Add indicator variables for ‘Race’, ‘Qtrs’ which is really ‘month’, & ‘Class’. The new class variable indicates age & gender in case we want to look at plots by age, gender and race…

#again with the dplyr package 
W1data <- W1data %>% 
    mutate(Qtrs = as.numeric(c(seq(1:9))),
           Race = as.factor(c(rep(1,9))),
           Class = as.factor(c(rep(1,9))))

#convert variables to numeric
W1data[,1:8] <- lapply(W1data[,1:8], 
			function(x) as.numeric(as.character(x)))

All other data manipulation activities for the other race/ethnic groups similar to the one for the White group above, so I will not elaborate further. Just remember to change data frame names, select/assign appropriate rows & classes.

I have obtained sub-tables for Black/African American as ‘B1data’, for Hispanic/Latino as ‘L1data’, & for Asian as ‘A1data’. So, I will combine (append) the four tables with the following code to obtain one table for all race/ethnic groups:

WBAL1data <- dplyr:: bind_rows(W1data, B1data, A1data, L1data)

WBAL1data[,10:11] <- lapply(WBAL1data[,10:11], 
			function(x) as.factor(as.character(x)))

Time to plot the data!

#Plot for Race/ethnic groups using ggplot
#Note I am plotting subset of data as indicated by the selected rows below
All_plot <- ggplot(data=WBAL1data[c(4:9,13:18,22:27,31:36),], 
    aes(x=Qtrs, y=Unemp_rate, shape=Race, color=Race, group=Race ))+
  geom_point() + 
  geom_line() + 
  scale_x_discrete(limits =c(4:9), 
	labels=c("Aug 2016", "Apr 2017", "May 2017", "Jun 2017",
					"Jul 2017", "Aug 2017"))+ 
					
  scale_color_discrete(breaks=c(1:4), 
	labels=c("All White", "Black", "Asian", "Latino/Hispanic")) +
	
  scale_shape_discrete(breaks=c(1:4), 
	labels=c("All White", "Black", "Asian", "Latino/Hispanic")) +
	
  theme(axis.line = element_line(color="blue", linetype = "solid"), 
			   plot.title = element_text(hjust = .5)) +  
			   
  ggtitle("Adjusted Unemployment Rates by Race\n 
					(Aug 2016 - Aug 2017)") +
  
  theme(legend.title = element_blank(), 
		legend.text = element_text(size = 12), 
		legend.background = element_rect(fill="NA", 
				size=.5, linetype = "solid"))
#reveal the plot
All_plot

The above plot was for 6 time points within a year, now I will look at 10 years’ monthly data.

Time Series Data

A Time Series data of Unemployment Rates by Race/ethnicity (2007 - 2017) are available in excel format for various groups in the BLS website here. I saved each of the seasonally adjusted unemployment rates data files in my local drive. Next, I imported the datasets in R, processed and plotted as below.

WhiteAll_unemp <- as.data.frame(
	read_excel("yourPath/fileName.xlsx", skip=12))  
#'skip' first 12 rows because they are notes

#I will assign original data frame to a new data frame & drop the first column
WhiteAll_unempl <- WhiteAll_unemp[,-1]

# (re-assign) make the first column from the previous dataset to be row names like:
rownames(WhiteAll_unempl) <- WhiteAll_unemp[,1]

#create & convert to time series. 
#You will appreciate this step if you look at the data before and after the following code is executed.
 
WhiteAll_unemp2 <- ts(as.vector(t(as.matrix(WhiteAll_unempl))), 
		start = c(2007, 1), end=c(2017, 12), frequency=12)

The above four lines of code are applied to the tables for Black, Hispanic & Asian groups below (just specify appropriate path/fileName & change data frame names).

BlackAll_unemp <- as.data.frame(
	read_excel("yourPath/fileName.xlsx", skip=12))

BlackAll_unempl <- BlackAll_unemp[,-1]

rownames(BlackAll_unempl) <- BlackAll_unemp[,1]

BlackAll_unemp2 <- ts(as.vector(t(as.matrix(BlackAll_unempl))), 
           start = c(2007, 1), end=c(2017, 12), frequency=12)
HispanicAll_unemp<-as.data.frame(
		read_excel("yourPath/fileName.xlsx", skip=12))

HispanicAll_unempl <- HispanicAll_unemp[,-1]

rownames(HispanicAll_unempl) <- HispanicAll_unemp[,1]

HispanicAll_unemp2 <- ts(as.vector(t(as.matrix(HispanicAll_unempl))), 
            start = c(2007, 1), end=c(2017, 12), frequency=12)

AsianAll_unemp<-as.data.frame(
		read_excel("yourPath/fileName.xlsx", skip=12))

AsianAll_unempl <- AsianAll_unemp[,-1]

rownames(AsianAll_unempl) <- AsianAll_unemp[,1]

AsianAll_unemp2 <- ts(as.vector(t(as.matrix(AsianAll_unempl))), 
		start = c(2007, 1), end=c(2017, 12), frequency=12)

Plotting using R-package zoo…

#Use the zoo library for simple quick plot.
#data are in long form, so column bind the 4 data frames.

plot.zoo(cbind(WhiteAll_unemp2, BlackAll_unemp2, 
	HispanicAll_unemp2, AsianAll_unemp2), 
	plot.type = "single", xy.labels = TRUE,
	xlab="Year \n Data source: https://www.bls.gov/cps", 
	ylab = "Unemployment_rate",
	col = c("red", "blue", "black", "green"), 
	main = "Adjusted Seasonal Unemployment Rates by 
		Race/Ethnicity\n (2007 - 2017)")

legend("topright", inset = c(0,0), y.intersp = 1, 
	legend = c("White", "Black", "Hispanic/Latino", "Asian"),
	lty=1, bty = "n", col = c("red", "blue", "black", "green"), 
	axis(1, 2007:2017), cex = 1)

grid (NULL,NULL, lty = 6, col = "cornsilk3")
abline(h=3:17, v=2007:2017, col="cornsilk2", lty=3)

Alternatively, we can use ggplot, but first we need few more steps to prepare the data.

#prepare data in a format ggplot2 likes
ts_All_race <- (cbind(WhiteAll_unemp2, BlackAll_unemp2, 
			HispanicAll_unemp2, AsianAll_unemp2))

df_all_race <- fortify.zoo(ts_All_race, melt=TRUE, 
			yearmon_trans(format="%b%Y", n=12))
		
df_all_race <- rename(df_all_race, Year=Index, 
			unemployment_rate=Value, Race=Series)

Plot the time series (one can reduce some of the specifications below)

ggplot(data=df_all_race, aes(x=Year, y=unemployment_rate, 
		group=Race, shape=Race, color=Race)) +
		
geom_line() +
geom_point() +
scale_x_discrete(limit=c(2007:2017)) +
scale_y_discrete(limits=c(2:17)) +
  
scale_color_discrete(breaks=c("WhiteAll_unemp2", "BlackAll_unemp2", 
		"HispanicAll_unemp2", "AsianAll_unemp2"), 
		labels=c("White", "Black", "Latino", "Asian")) +
  
scale_shape_discrete(breaks=c("WhiteAll_unemp2", "BlackAll_unemp2", 
		"HispanicAll_unemp2", "AsianAll_unemp2"), 
		labels=c("White", "Black", "Latino", "Asian")) +
  
ggtitle("Adjusted Seasonal Unemployment Rates by Race/Ethnicity 
		(for 16 & older)\n(2007 - 2017)") + 

xlab("Year \n Date source: 
		https://www.bls.gov/webapps/legacy/cpsatab2.htm 
		https://www.bls.gov/webapps/legacy/cpsatab3.htm") +

theme(axis.line = element_line(color="blue", linetype = "solid"), 
			plot.title = element_text(hjust = .5), 
			legend.title = element_blank())

End.