Data Analytics with R and SQL Server

First Happy New Year and best wishes for 2015!

Yes, Data Analytics with R and SQL Server … As Data and Business Intelligence Architects I personally think we just can’t dismiss or push aside the Data Science “landscape”. Hence that’s why I registered and currently more than half way through the Data Science Specialization on Coursera from the Johns Hopkins University. Even though I have been working with R for a while I wanted to get some kind of recognition, attestation for my skills and also just because I can 😉

So the following outlines how to connect to SQL Server with R and briefly highlights some (out of so many more to explore) R functions…

[Required]
Here is what you first need to install to get started:
R
http://www.r-project.org/

RStudio Desktop
http://www.rstudio.com/products/rstudio/

[Optional – but required for this demo!]
For the purpose of this demo I am using the following:
The Adventure Works DW 2014 Database which can be downloaded here https://msftdbprodsamples.codeplex.com/. The demo (and scripts) will also work with the Adventure Works DW 2012 database.

If you decided to use and restore the Adventure Works DW database, you will need to create the following SQL Server View which will serve as our data frame (data table):

CREATE VIEW [dbo].[vw_FactSales]
AS
SELECT	p.EnglishProductName AS 'Product',
        pc.EnglishProductCategoryName AS 'ProductCategory', 
        ps.EnglishProductSubcategoryName AS 'ProductSubCategory',  
		do.FullDateAlternateKey AS 'DateOrder',
		dd.FullDateAlternateKey AS 'DateDue',
		ds.FullDateAlternateKey AS 'DateShip',
		dc.BirthDate, 
		dc.MaritalStatus, 
		dc.Gender,
		dc.YearlyIncome,
		dc.TotalChildren,
		dc.NumberChildrenAtHome,
		dc.EnglishEducation AS 'Education',
		dc.EnglishOccupation AS 'Occupation',
		dc.HouseOwnerFlag,
		dc.NumberCarsOwned,
		dc.DateFirstPurchase,
		dc.CommuteDistance,
		g.City,
        g.StateProvinceName AS 'StateProvince',
        g.EnglishCountryRegionName AS 'CountryRegion',
		g.PostalCode,
		pr.EnglishPromotionName AS 'Promotion',
		pr.EnglishPromotionType AS 'PromotionType',
		pr.EnglishPromotionCategory AS 'PromotionCategory',
        cu.CurrencyName AS 'Currency',
		st.SalesTerritoryRegion, 
		st.SalesTerritoryCountry,
		st.SalesTerritoryGroup,
		fs.OrderQuantity, 
		fs.UnitPrice, 
		fs.ExtendedAmount, 
		fs.UnitPriceDiscountPct, 
		fs.DiscountAmount, 
		fs.ProductStandardCost, 
		fs.TotalProductCost, 
		fs.SalesAmount, 
		fs.TaxAmt, 
		fs.Freight
FROM    dbo.FactInternetSales fs JOIN
 		dbo.DimProduct p ON fs.ProductKey = p.ProductKey JOIN
        dbo.DimProductSubcategory ps ON p.ProductSubcategoryKey = ps.ProductSubcategoryKey JOIN
        dbo.DimProductCategory pc ON ps.ProductCategoryKey = pc.ProductCategoryKey JOIN       
		dbo.DimDate do ON fs.OrderDateKey = do.Datekey JOIN
		dbo.DimDate dd ON fs.DueDateKey = dd.DateKey JOIN
		dbo.DimDate ds ON fs.DueDateKey = ds.DateKey JOIN
		dbo.DimCustomer dc ON fs.CustomerKey = dc.CustomerKey JOIN
		dbo.DimGeography g ON dc.GeographyKey = g.GeographyKey JOIN
		dbo.DimPromotion pr ON fs.PromotionKey = pr.PromotionKey JOIN
        dbo.DimCurrency cu ON fs.CurrencyKey = cu.CurrencyKey JOIN
		dbo.DimSalesTerritory st ON fs.SalesTerritoryKey = st.SalesTerritoryKey 
GO

You will also need to create a SQL Server Login named RUser with db_datareader role membership to the Adventure Works DW database

R_00

Next we need create a new ODBC System DSN (I opted for 64-bit) click Add

R_01

Select SQL Server Native Client 11.0

R_02

Name it AdventureWorksDW

R_03

Provide the SQL Server Login information

R_04

Change the default database

R_05

Next

R_06

Test connectivity

R_07

Your done!

R_08

Open RStudio and let’s start exploring. (You will have a different IDE – I am using the ‘Idle Fingers’ theme. To change yours go to Tools – Global Options)

R10

First we need to install a specific package named RODBC which provides ODBC Database Access and will permit us to connect to SQL Server. For more information on the RODBC package follow this link -> http://cran.r-project.org/web/packages/RODBC/index.html

Simply type the following at the prompt in the Console window and press enter.

install.packages("RODBC")

R11

Now we need to load the RODBC package, create a connection “con” to the ODBC DSN “AdventureWorksDW” we created earlier, query the database and put the results into a data frame “df_adventureworks” and close the connection.

library(RODBC)
con <- odbcConnect("AdventureWorksDW", uid="RUser", pwd="RUser12345")
df_adventureworks <- as.data.frame(sqlQuery(con, "select * from vw_FactSales"), stringsAsFactors = FALSE)
close (con)
R12

You should see in your Environment window pane the following data frame df_adventureworks with 60398 obs. of 39 variables. Meaning 60398 rows of data with 39 attributes (columns)

R13

The most useful multipurpose function in R is summary(X) where X can be one of any number of objects, including datasets, variables, and linear models… just to name a few! The summary function has different outputs depending on what kind of object it takes as an argument. Besides being widely applicable, this method is valuable because it often provides exactly what is needed in terms of summary statistics.

Let’s give it a try - type the following command in the Console window and press enter

summary(df_adventureworks)

Take time to look at the generated output (very interesting)

R14

Now let’s look at two other functions colMeans and aggregate

Type the following commands in the Console window and press enter

colMeans(df_adventureworks["SalesAmount"])
aggregate(SalesAmount ~ ProductCategory + CountryRegion, data=df_adventureworks, FUN=sum)

colMeans returns the mean for the specified columns, in this case for SalesAmount the mean is 486.0869
aggregate splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.
Those of you who are familiar with SQL Server will notice that this function “aggregate(SalesAmount ~ ProductCategory + CountryRegion, data=df_adventureworks, FUN=sum)” is somewhat similar to GROUP BY and thus the following T-SQL will return the same results.

SELECT	ProductCategory,
		CountryRegion,
		SUM(SalesAmount) AS 'SalesAmount'
FROM	dbo.vw_FactSales
GROUP BY
		ProductCategory,
		CountryRegion
ORDER BY
		CountryRegion, 
		ProductCategory
R15

On to graphics, you can create several basic graph types like density plots, dot plots, bar charts, line charts, pie charts, boxplots and scatter plots in R, I recommend you look at the following link -> http://www.statmethods.net/graphs/index.html to get you started.

But to create advanced graphics in R one needs to install the ggplot2 package.
ggplot2 is an implementation of the grammar of graphics in R. Official documentation can be found at the following link -> http://docs.ggplot2.org/current/

In order to install and use the ggplot2 library, type the following commands in the Console window

install.packages("ggplot2", dependencies = TRUE)
library(ggplot2)

Let’s create a bar plot with the count of sold ProductCategory [Accessories, Bikes, Clothing] by CountryRegion
Type the following command in the console and press enter

ggplot (df_adventureworks, aes(CountryRegion, fill=ProductCategory)) + geom_bar()

The graph will appear in the Plot window and should look similar to the following

rgraph00

This one will create 2 separate bar plots [MaritalStatus] one for Married and other for Single with count by Occupation

ggplot(df_adventureworks, aes(Occupation) ) +  geom_histogram(color = "white") + facet_grid(MaritalStatus ~ .)
rgraph01

Before we wrap up here is the entire R script used for this demo

# Install RODBC package
install.packages("RODBC")

# Load RODBC package
library(RODBC)

# Create connection "con" to SQL Server and database using ODBC DSN
con <- odbcConnect("AdventureWorksDW", uid="RUser", pwd="RUser12345")

# Query the database and put the results into the data frame "df_adventureworks"
df_adventureworks <- as.data.frame(sqlQuery(con, "select * from vw_FactSales"), stringsAsFactors = FALSE)
close (con)

# Return summary statistics about the "df_adventureworks" data frame
summary(df_adventureworks)

# Return the mean for the "SalesAmount" column
colMeans(df_adventureworks["SalesAmount"])

# Return aggregate sum data for "SalesAmount" grouping it by "ProductCategory" and "CountryRegion"
aggregate(SalesAmount ~ ProductCategory + CountryRegion, data=df_adventureworks, FUN=sum)

# Install ggplot2 package and dependencies
install.packages("ggplot2", dependencies = TRUE)

# Load ggplot2 package
library(ggplot2)

# Plot with count (number) of sold "ProductCategory" [Accessories, Bikes, Clothing] by "CountryRegion"
ggplot (df_adventureworks, aes(CountryRegion, fill=ProductCategory)) + geom_bar()

# Plots 2 separate bar plots "MaritalStatus" one for Married and other for Single with count by "Occupation" 
ggplot(df_adventureworks, aes(Occupation) ) +  geom_histogram(color = "white") + facet_grid(MaritalStatus ~ .)

To conclude, this is certainly not the best dataset (may want to look at an earlier post for better ones) for performing advanced analytics, but serves great purpose for demonstrating how to connect to SQL Server with R and introduce some basic R functions…

If you are interested to learn more and further explore about R, I strongly recommend you start with the following resources

comments powered by Disqus