Data Science

A guide to library resources about Data Analytics

Introduction

  • R is a programming language for data analysis

Base R provides the core functionality; much of R's strength comes from its add-on packages

  • RStudio is an integrated development environment (IDE) for R

Do all your coding in RStudio. RStudio is not R itself, but it makes R easier to use

 

Engel, C. A. (2019, February 4). R and Rstudio. GitHub. https://cengel.github.io/R-intro/backgroud.html.

Creating an "R Project"

  • Start RStudio
  • Under the File menu at top left, click on New Project, choose New Directory, then New Project
  • Enter r-intro as the directory (or folder) name and create the project as a subdirectory of your desktop folder: ~/Desktop
  • Click on Create project
  • Under the Files tab on the bottom right of the screen, click on New Folder and create a folder named data within your newly created working directory (e.g., ~/Desktop/r-intro/data)
  • On the main menu, go to File > New File > R Script (or use the shortcut Ctrl/Cmd + Shift + N) to open a new file
  • Save the empty script as r-intro-script.R in your working directory.
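
The folder and script from the steps above can also be created from the R console. The following is a minimal sketch, not part of the original guide. It builds the layout under a temporary directory (the name project_dir is illustrative) so that it is safe to run anywhere. Inside an actual RStudio project, the project folder is already the working directory, so dir.create("data") alone would likely be enough.

```r
# Illustrative location: a temporary folder standing in for ~/Desktop/r-intro
project_dir <- file.path(tempdir(), "r-intro")
dir.create(project_dir, recursive = TRUE, showWarnings = FALSE)

# Create the data subfolder and an empty script file
dir.create(file.path(project_dir, "data"), showWarnings = FALSE)
script_path <- file.path(project_dir, "r-intro-script.R")
if (!file.exists(script_path)) file.create(script_path)

# List the contents of the project folder
list.files(project_dir)
```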

Your working directory should now look like the following figure:

Importing Data

  • Install packages with install.packages("pkgname") before you load them

install.packages("readr")
install.packages("readxl")

  • Load the installed packages

library(readr)
library(readxl)
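
Packages only need to be installed once per machine, while library() is called in every new session. One common pattern, sketched here as an assumption rather than as part of the original guide, is to install only the packages that are missing. The helper name missing_packages is hypothetical, and the interactive() guard is an illustrative choice that keeps the sketch from downloading anything when run as a batch script.

```r
# Return the packages from `pkgs` that are not yet installed
# (hypothetical helper, not part of any package)
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("readr", "readxl"))

# Install only what is missing, and only in an interactive session
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```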
   

  • Importing a saved data file into R with RStudio import tools

File menu or Environment pane -> Import Dataset -> From Text, From Excel, ... -> Enter a URL or file path

  • Importing a saved data file into R with coding

# Import a .csv data file

hrs <- read.csv("C:/HRS.csv", header = TRUE, sep = ",")

# Look at the first 6 cases to check the data

head(hrs)

# Look at the last 6 cases to check the data

tail(hrs)
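
To try the import steps without the HRS file, the following sketch writes a small made-up data set to a temporary .csv file and reads it back. The values, the object name hrs_demo, and the column names id, DV, and IV are invented for illustration and are not taken from the actual data.

```r
# Made-up stand-in for HRS.csv, written to a temporary file
demo <- data.frame(
  id = 1:8,
  DV = c(31.0, 29.5, 28.2, 27.9, 26.4, 25.1, 24.8, 23.6),
  IV = c(8, 10, 12, 12, 14, 16, 16, 18)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(demo, csv_path, row.names = FALSE)

# Read the file back in, exactly as with a real .csv file
hrs_demo <- read.csv(csv_path, header = TRUE, sep = ",")

head(hrs_demo)         # First 6 rows
tail(hrs_demo, n = 3)  # Last 3 rows; n controls how many are shown
str(hrs_demo)          # Column names and types
```

Checking str() after an import is a quick way to confirm that numeric columns were not read in as text.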

Creating a New Data Frame and Exploring Data with Graphics

  • Creating a data frame for independent variable (IV) & dependent variable (DV)

hrs.new <- data.frame(DV = hrs$DV, IV = hrs$IV) # Name the columns so they can be referenced as DV and IV

attach(hrs.new) # Add the data frame to R's search path so DV and IV can be used by name
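
Column names matter here: data.frame(hrs$DV, hrs$IV) without explicit names would label the columns hrs.DV and hrs.IV, so a later reference to DV would fail. The sketch below uses made-up values (hrs_demo is a hypothetical stand-in for the imported data) and shows two ways to reach the columns without attach(), which some users prefer because attached names can be masked by other objects.

```r
# Made-up stand-in for the imported data
hrs_demo <- data.frame(
  DV = c(31.0, 29.5, 28.2, 27.9, 26.4, 25.1, 24.8, 23.6),
  IV = c(8, 10, 12, 12, 14, 16, 16, 18)
)

# Build the new data frame with explicit column names
hrs.new <- data.frame(DV = hrs_demo$DV, IV = hrs_demo$IV)
names(hrs.new)

# Option 1: refer to a column with $
summary(hrs.new$DV)

# Option 2: evaluate an expression inside the data frame with with()
mean_dv <- with(hrs.new, mean(DV))
mean_dv
```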

  • Density plot and Q-Q plot to check normality of DV

plot(density(DV), main = "Density Plot: DV", xlab = "DV")

qqnorm(DV, main = "Population Dist. of DV")
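
A Q-Q plot is easier to read with a reference line. The following sketch uses simulated values rather than the real DV (the name DV_sim and the mean and standard deviation are arbitrary assumptions) and adds qqline() from base R.

```r
set.seed(42)                              # Make the simulated values reproducible
DV_sim <- rnorm(200, mean = 27, sd = 4)   # Made-up values, not real data

# Density plot: a roughly bell-shaped curve is consistent with normality
plot(density(DV_sim), main = "Density Plot: simulated DV", xlab = "DV")

# Q-Q plot with a reference line; points close to the line suggest normality
qqnorm(DV_sim, main = "Normal Q-Q Plot: simulated DV")
qqline(DV_sim)
```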

  • Box plot to detect outliers of DV

boxplot(DV, ylab = "DV")
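
The points drawn beyond the whiskers can also be listed as numbers. A minimal sketch with made-up values (DV_demo is a hypothetical vector) uses boxplot.stats() from base R, which by default flags values more than 1.5 times the interquartile range beyond the hinges.

```r
# Made-up values with one unusually large observation
DV_demo <- c(22, 24, 25, 26, 27, 28, 29, 30, 55)

boxplot(DV_demo, ylab = "DV")

# Values that the box plot draws as individual points
outliers <- boxplot.stats(DV_demo)$out
outliers
```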

  • Scatter plot to visualize the linear relationship between DV and IV

plot(IV, DV, xlab = "IV", ylab = "DV")
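
To judge how linear the relationship looks, a fitted straight line and the correlation coefficient can be added to the scatter plot. This is a sketch with made-up values (IV_demo and DV_demo are hypothetical) and is only one possible way to summarize the relationship.

```r
# Made-up values for illustration
IV_demo <- c(8, 10, 12, 12, 14, 16, 16, 18)
DV_demo <- c(31.0, 29.5, 28.2, 27.9, 26.4, 25.1, 24.8, 23.6)

plot(IV_demo, DV_demo, xlab = "IV", ylab = "DV")

# Overlay the least-squares line
fit <- lm(DV_demo ~ IV_demo)
abline(fit)

# Pearson correlation between IV and DV
r <- cor(IV_demo, DV_demo)
r
```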