
R Programming: Learn R programming as a Data Scientist

Gitanjali Mule


This is the second blog in the Data Science Specialization series, which follows the Data Science Specialization by Johns Hopkins University. The introduction to the series is here. Reading the series from the start will help you understand the process better. If you are only interested in R programming, you can start from here.

This is R Programming Part 1. In this blog I cover the basics of R programming: the nuts and bolts of the language. The blog introduces each concept and includes R code snippets to help you understand it better.

Learning R programming is a big undertaking that cannot be covered in a single blog, so I have divided it into several parts that you can absorb one at a time. When you are on a learning path, taking in concepts in smaller pieces gives you time to understand them. I have observed that people often try to learn everything in one sitting; they think they have understood it all, but when it comes time to apply it they cannot. To learn a programming language, first understand the theory and concepts, then practice them in an IDE. I recommend reading this blog with RStudio open in another window and trying out each concept in the editor.

You can use R Markdown or R scripts to practice the coding. I am also linking the GitHub repo for this blog here.

The Basics

From here we start exploring R as a programming language, beginning with its building blocks: R data types and some basic operations on them. Everything we manipulate or encounter in R is called an object, and objects can be of different kinds. This idea will become clearer as you proceed in the R world; for now, just keep in mind that everything in R is an object.

Data Types

R objects and Attributes.

R has five atomic classes of objects, as follows:

• character

• numeric (real numbers)

• integer

• complex

• logical (TRUE/FALSE)

The most basic object in R is the vector.

A vector can only contain objects of the same class. For example, a numeric vector can only hold numbers; it cannot contain character or logical values.

The one exception is the list, which is represented as a vector but can contain objects of different classes (indeed, that's usually why we use them).

An empty vector can be created with the vector() function:

x <- vector()

A vector has three basic properties:

• class of vector

• type of vector

• length of vector

class(x)
## [1] "logical"
typeof(x)
## [1] "logical"
length(x)
## [1] 0

Numbers

• Numbers in R are generally treated as numeric objects (i.e. double precision real numbers).

• If you explicitly want an integer, you need to specify the L suffix.

• For example, entering 1 gives you a numeric object, but entering 1L explicitly gives you an integer.

• There is also a special number Inf which represents infinity; e.g. 1/0 is Inf. Inf can be used in ordinary calculations; e.g. 1/Inf is 0.

• The value NaN represents an undefined value ("Not a Number"); e.g. 0/0 is NaN.
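
The points above can be tried out with a short sketch like the following. The variable names are only for illustration.

```r
# A plain number is stored as a double; the L suffix requests an integer.
a <- 1
b <- 1L
class(a)            # "numeric"
class(b)            # "integer"

# Inf behaves like a number in ordinary arithmetic.
pos_inf <- 1 / 0    # Inf
one_over_inf <- 1 / Inf   # 0

# 0/0 has no defined value, so R returns NaN.
undefined <- 0 / 0
is.nan(undefined)   # TRUE
```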

Attributes:

R objects can have attributes. Not every R object has them, but most do. Common types of attributes are as follows:

• names, dimnames

• dimensions (e.g. matrices, arrays)

• class

• length

• other user-defined attributes/metadata

Attributes of an object can be accessed using the attributes() function.
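
As a minimal sketch of how attributes() behaves, the example below starts with a plain vector and then attaches names and a user-defined attribute. The attribute name "note" and its value are made up for this illustration.

```r
# A plain vector has no attributes at first.
v <- 1:6
attributes(v)          # NULL

# Add names and a user-defined attribute (the name "note" is hypothetical).
names(v) <- c("a", "b", "c", "d", "e", "f")
attr(v, "note") <- "example metadata"

# attributes() now returns a list that includes both of them.
attrs <- attributes(v)
names(attrs)
```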

Data Types: Vectors and Lists

Vectors can be created using the c() function, which concatenates things together. For example, I can create a vector called x as follows:

x <- c(0.1,0.2) ## numeric
x
## [1] 0.1 0.2
x <- c(TRUE,TRUE,FALSE) ## logical
x
## [1] TRUE TRUE FALSE
x <- c(T,T,F,F) ## logical
x
## [1] TRUE TRUE FALSE FALSE
x <- c("a","b") ## character
x
## [1] "a" "b"
x <- 1:10 ## integer
x
## [1] 1 2 3 4 5 6 7 8 9 10
x <- c(1+0i,5-10i) ## complex
x
## [1] 1+ 0i 5-10i

A vector can also be created using the vector() function as follows:

x <- vector("numeric",length = 10)
x
## [1] 0 0 0 0 0 0 0 0 0 0

The most common way to create a vector is with the c() function.

What happens if we mix objects

We learned that all members of a vector must be of the same class, but what happens if we mix classes in a vector? Let's figure it out. In most cases R will not throw an error; instead it will perform coercion. Examples are shown below.

In this example I passed a character object and a numeric object to form a vector x.

x <- c("a",3)
x
## [1] "a" "3"
typeof(x)
## [1] "character"

If we look at the type of x, it shows that the vector is of character type. What happened here is that R coerced the objects to the same type by saving the numeric object as a character. Any number can be written as a character string, but "a" cannot be written as a number, so R stored 3 as "3".
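
The same rule applies to other mixtures: R converts everything to the most flexible type present, in the order logical, integer, double, complex, character. A short sketch to try in the console:

```r
# Mixing types in c(): the result takes the most flexible type present.
typeof(c(TRUE, 2L))      # "integer"   (TRUE becomes 1L)
typeof(c(1L, 2.5))       # "double"
typeof(c(2.5, 1+0i))     # "complex"
typeof(c(TRUE, "a"))     # "character" (TRUE becomes "TRUE")
```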

Coercion can become a problem when we read a large dataset. If a column contains mixed classes, its numbers can be stored as characters, and R will start throwing errors when we try to perform mathematical operations on that column.

So in the data science process, whenever we read a dataset into R, confirming the type of each column is important. Before you start any analysis, confirm that the classes of the objects are what you intended them to be.
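
One way to do this check is sketched below. The data frame `sales` and its columns are hypothetical; in practice the data would come from a file.

```r
# Hypothetical data: the "amount" column contains one non-numeric entry,
# so the whole column is stored as character.
sales <- data.frame(
  id = 1:3,
  amount = c("10", "20", "n/a"),
  stringsAsFactors = FALSE
)

# Report the class of every column before doing any analysis.
col_classes <- sapply(sales, class)
col_classes

# str() gives a compact overview of the same information.
str(sales)
```

Here `amount` is reported as character rather than numeric, which is the signal to clean the column before doing arithmetic on it.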

Explicit Coercion

Above we saw that R implicitly coerces objects so that everything in a vector has the same type. Sometimes, though, we want to coerce a vector explicitly. The applications of explicit coercion will follow in coming blogs, but keeping it in mind now will help you use it later.

So how do we explicitly coerce vectors?

Here I am creating a vector holding the sequence from 0 to 6.

x <- 0:6
class(x)
## [1] "integer"

Here the class of x is integer. If I want to coerce the vector to the numeric class, I can do that by calling a function named as. followed by the class I intend to convert to.

x <- as.numeric(x)
x
## [1] 0 1 2 3 4 5 6
class(x)
## [1] "numeric"
x <- as.logical(x)
x
## [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
class(x)
## [1] "logical"

When I converted the numeric vector to logical, it became a sequence of TRUE and FALSE values. In R, 0 is treated as FALSE and any nonzero number as TRUE.

x <- 0:6
x <- as.character(x)
x
## [1] "0" "1" "2" "3" "4" "5" "6"
class(x)
## [1] "character"

These were examples where the coercion was sensible. But what happens when we try to coerce characters into numbers?

Nonsensical coercion results in NAs

x <- c("a","b")
x
## [1] "a" "b"
x <- as.numeric(x)
## Warning: NAs introduced by coercion
x
## [1] NA NA
class(x)
## [1] "numeric"

We can see that R did convert x into a numeric vector, but it turned the elements into NAs and issued the warning "NAs introduced by coercion".
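
When only some values fail to convert, it helps to find out which ones. The following is a small sketch of one way to do that; the input values are made up for illustration.

```r
raw <- c("1", "2", "three", "4")   # hypothetical input values

# suppressWarnings() hides the "NAs introduced by coercion" warning.
converted <- suppressWarnings(as.numeric(raw))

# Positions where the conversion failed (NA after coercion but not before).
failed <- which(is.na(converted) & !is.na(raw))
raw[failed]   # "three"
```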

Lists

Lists are a special type of vector that can contain elements of different classes. Lists are a very important data type in R and you should get to know them well. They become very handy when we want to carry different types of data together.

A list can be created using the list() function. Below, x is a list of different types of objects: numeric, character, logical, and complex. When I print x, it is displayed a little differently from a vector. The elements of the list are indexed with double brackets. The first element is a vector containing 1. The second element, at index 2, is a vector containing 2.8. The third element, at index 3, is the character "a", and so on.

x <- list(1,2.8,"a",TRUE, 1+4i)
x
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2.8
##
## [[3]]
## [1] "a"
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] 1+4i
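
To see the difference from an ordinary vector, this short sketch checks the class of each list element and then builds the same values with c() instead.

```r
x <- list(1, 2.8, "a", TRUE, 1+4i)

# Each element keeps its own class inside the list.
element_classes <- sapply(x, class)
element_classes
# "numeric" "numeric" "character" "logical" "complex"

# Putting the same values in c() coerces them all to character instead.
coerced <- c(1, 2.8, "a", TRUE, 1+4i)
class(coerced)   # "character"
```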

Matrices

Matrices are special types of vectors. They are not a separate class of objects; they are vectors that have an attribute called dim (dimension).

The dimension attribute is itself an integer vector of length 2 (nrow, ncol).

A matrix can be created using the matrix() function. This is not the only way to create a matrix, but it is one of them.

x <- matrix(nrow = 2, ncol = 3)
x
##      [,1] [,2] [,3]
## [1,]   NA   NA   NA
## [2,]   NA   NA   NA
dim(x)
## [1] 2 3

The first number is the number of rows and the second is the number of columns.

attributes(x)
## $dim
## [1] 2 3

Matrices are constructed column-wise, so entries can be thought of as starting in the "upper left" corner and running down the columns.

For example, I am creating a matrix from the numbers 1 to 6 with 2 rows and 3 columns. Construction starts at the 1st row of the 1st column, so 1 is assigned to the upper left corner, 2 is assigned to the 2nd row of the 1st column, and so on.

m <- matrix(1:6, nrow = 2, ncol = 3)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
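
If you want the values to run across the rows instead, matrix() accepts a byrow argument. This is not covered in the text above, so treat the following as a small side sketch:

```r
# Default: values fill the matrix down each column.
by_col <- matrix(1:6, nrow = 2, ncol = 3)

# byrow = TRUE fills across each row instead.
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by_row
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
```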

Matrices can also be created directly from vectors by adding a dimension attribute, as follows:

m <- 1:10
m
## [1] 1 2 3 4 5 6 7 8 9 10
dim(m) <- c(2,5)
m
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10

Here I am telling R to take the vector m and set its dimension attribute to 2 and 5, which turns it into a matrix with 2 rows and 5 columns.

cbind-ing and rbind-ing

Another way to create a matrix is by column-binding or row-binding. These functions take a number of vectors and bind them together, either as columns or as rows, to form a matrix.

x <- c(1,2,3)
y <- c(10,11,12)
cbind(x,y)
##      x  y
## [1,] 1 10
## [2,] 2 11
## [3,] 3 12
rbind(x,y)
##   [,1] [,2] [,3]
## x    1    2    3
## y   10   11   12

Data Types: Factors

Factors are used to represent categorical data. Factors can be unordered or ordered.

• Factors are treated specially by modeling functions like lm() and glm() (these fit linear and generalized linear models).

• Using factors with labels is better than using integers because factors are self-describing: having a variable with the values "Male" and "Female" is better than a variable with the values 1 and 2.

Here a factor is created using the factor() function and the c() function:

x <- factor(c("yes","no","no","yes","yes"))

To see which levels the factor has, we can call the levels() function.

levels(x)
## [1] "no" "yes"

If we simply type x, the levels are printed along with the values:

x
## [1] yes no no yes yes
## Levels: no yes

If I want to know the frequency of each label, I can call the table() function on x and it will print how many "yes" and "no" values the factor contains.

table(x)
## x
##  no yes
##   2   3

So in the factor above there are 2 "no" values and 3 "yes" values.

The order of the levels can be set using the levels argument to factor(). This can be important in linear modeling because the first level is used as the baseline level. It is especially important when the levels have a natural order. The example below shows how to set the order of the levels:

x <- factor(c("yes","no","no","yes"), levels = c("yes","no"))

Notice that levels = is passed to factor() as the second argument, after c().

Because we stated the levels explicitly, "yes" is the first level, followed by "no".

Compare the output of levels() below with the output we got from x earlier.

levels(x)
## [1] "yes" "no"
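
For data with a natural order, factor() also accepts ordered = TRUE. The sketch below uses made-up labels to show what that changes.

```r
# Hypothetical responses with a natural order.
sizes <- factor(
  c("low", "high", "medium", "low"),
  levels = c("low", "medium", "high"),
  ordered = TRUE
)
levels(sizes)          # "low" "medium" "high"
sizes[1] < sizes[2]    # TRUE, because "low" comes before "high"

# Underneath, a factor stores integer codes that index into the levels.
codes <- as.integer(sizes)
codes                  # 1 3 2 1
```

Comparisons such as < only work on ordered factors; on an unordered factor they return NA with a warning.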

Data Types: Missing Values

Missing values are denoted by NA, or by NaN for undefined mathematical operations.

is.na() is used to test whether objects are NA.

is.nan() is used to test for NaN.

NA values also have a class, so there are integer NA, character NA, etc.

A NaN value is also NA, but the converse is not true: is.na() returns TRUE for NaN, while is.nan() returns FALSE for NA.

x <- c(1,2,NA,10,3)
x
## [1] 1 2 NA 10 3
is.na(x)
## [1] FALSE FALSE TRUE FALSE FALSE
is.nan(x)
## [1] FALSE FALSE FALSE FALSE FALSE
x <- c(2,3,NaN,10,NA)
x
## [1] 2 3 NaN 10 NA
is.na(x)
## [1] FALSE FALSE TRUE FALSE TRUE
is.nan(x)
## [1] FALSE FALSE TRUE FALSE FALSE

Compare the outputs of is.na() and is.nan() above.
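
The statement that NA values have a class can be checked directly. This is a minimal sketch using the typed NA constants that base R provides, plus the na.rm argument of sum() as one common way to deal with missing values.

```r
class(NA)              # "logical"
class(NA_integer_)     # "integer"
class(NA_real_)        # "numeric"
class(NA_character_)   # "character"

# NA takes on the type of the vector it sits in.
typeof(c(1L, NA))      # "integer"
typeof(c("a", NA))     # "character"

# Arithmetic with NA gives NA; na.rm = TRUE skips the missing values.
x <- c(1, 2, NA, 10, 3)
total_with_na <- sum(x)                    # NA
total_without_na <- sum(x, na.rm = TRUE)   # 16
```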

Data Types: Data Frame

Data frames are used to store tabular data.

• They are represented as a special type of list where every element of the list has to have the same length.

• Each element of the list can be thought of as a column, and the length of each element of the list is the number of rows.

• Unlike matrices, data frames can store different classes of objects in each column (just like lists); matrices must have every element be the same class.

• Data frames also have a special attribute called row.names.

• Data frames are usually created by calling read.table() or read.csv().

• They can be converted to a matrix by calling data.matrix().

Creating data frames

x <- data.frame(foo = 1:6, bar = c(T,T,F,F,F,T))

Here foo and bar are the names of the columns, and each column has 6 elements. As you can see, the columns have different data types: foo holds integer data while bar holds logical data (T is short for TRUE and F is short for FALSE).

x
##   foo   bar
## 1   1  TRUE
## 2   2  TRUE
## 3   3 FALSE
## 4   4 FALSE
## 5   5 FALSE
## 6   6  TRUE

The number of rows is:

nrow(x)
## [1] 6

The number of columns is:

ncol(x)
## [1] 2
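
A few of the properties listed earlier (per-column classes, row names, conversion with data.matrix()) can be explored with a sketch like this one, which rebuilds the same data frame so it runs on its own.

```r
x <- data.frame(foo = 1:6, bar = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE))

# Each column keeps its own class.
col_classes <- sapply(x, class)
col_classes            # foo: "integer", bar: "logical"

# Row names are created automatically when none are supplied.
row.names(x)

# data.matrix() converts every column to numbers, so TRUE/FALSE become 1/0.
m <- data.matrix(x)
dim(m)                 # 6 2
m[, "bar"]
```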

Data Types: The Names Attribute

R objects can also have names, which is very useful for writing readable code and self-describing objects.

Here are a few examples of the names() function.

By default there are no names:

x <- 1:4
names(x)
## NULL
names(x) <- c("a","bcd","ds","xyz")
x
##   a bcd  ds xyz
##   1   2   3   4
names(x)
## [1] "a" "bcd" "ds" "xyz"
y <- data.frame(a = 1:4, b = c("a","b","c","d"))
names(y)
## [1] "a" "b"
y
##   a b
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
names(y) <- c("first_column", "second_column")
y
##   first_column second_column
## 1            1             a
## 2            2             b
## 3            3             c
## 4            4             d

Matrices can have names as well:

m <- matrix(1:4,nrow = 2,ncol=2)
m
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
dimnames(m) <- list(c("a","b"),c("c","d"))
m
##   c d
## a 1 3
## b 2 4

Data Types Summary:


• atomic classes: numeric, logical, character, integer, complex

• vectors, lists

• factors

• missing values

• data frames

• names

What’s Next

We are done with the nuts and bolts of R programming. In the next part I will introduce how to read tabular data, how to interact with the outside world from R, subsetting, and more. I will paste the link to my next blog down here. Until then, you can practice what we have learned and discover what you can do with R.

I would highly appreciate a shoutout. Thank you very much for reading my blogs and motivating me to do more.

Keep Learning.

Connect with me on 💻 Twitter and LinkedIn. Send me a message to let me know you found me through my blog. I would love to connect with you.


