R Tutorial

Topic 1: R Basics

What is R?

R is a free software environment and programming language for statistical computing and graphics, created by Ross Ihaka and Robert Gentleman. R is freely distributed under the terms of the GNU General Public License; its development and distribution are carried out by a group of statisticians known as the R Development Core Team. For more information, visit https://www.r-project.org

• Go to https://www.r-project.org
• Select a CRAN mirror location. For worldwide access you can choose “0-Cloud”.
• R is now installed. I suggest that you also install an integrated development environment (IDE); RStudio is strongly recommended.

• Open a new tab in your browser and go to RStudio
• RStudio is now installed on your computer. Enjoy it :)

Note: You need to install R to be able to use RStudio, but you do not need to run R separately. Opening RStudio is enough.

How does R work?

R is an interpreted language, which means that commands typed at the console are executed directly, without a separate compilation step as in languages such as C or C++. Let’s look at some examples:

10 + 5
2 - 4
n <- 10
n
print("I'm an R developer, WoW!")
print("I love finance as well as programming in R")

Exercise: Now it is your turn to write your own code and run it.

• two plus two (2+2)
• three times five (3*5)
• six over six (6/6)

Commenting out in R

To write a comment in R, put the “#” (hash) character in front of the text you’d like to mark as a comment. Comments are not executed; they are there for information only.

Example: please run the below code to see the result

# a = 100 this is a comment and will not be executed, so it has no impact on my program
# b = 50 this is just another comment. Again, no impact on the program
n <- 5 # only the text after the hash is ignored; the code before it is executed
# so, 5 will be assigned to n
n # just type n in the console to see the value assigned to n

Is the result surprising to you? It should not be. Lines 1, 2 and 4 start with #, so they are ignored entirely; on lines 3 and 5 only the text after # is ignored.

Exercise: Time to comment out the lines

Instructions

• Run the code below without changing anything
• Comment out the first two lines by putting a hash (#) in front of them and run the code again
x <- 100
y <- "Comment me"
x
y

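Did you get an error message after performing the second instruction? You should, because the first two lines are now comments and are ignored, so nothing is assigned to the variables x and y. (If x and y still exist in your session from the first run, R will find those earlier values and no error appears.)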

We showed how to comment in an R program. Sometimes you need to comment out many lines. There is a very handy way to do it. Just highlight the lines of interest and hit Ctrl + Shift + C which will automatically comment out the lines you have highlighted.

Exercise: Commenting lines in RStudio

Instructions

• Copy the code below into an R script in RStudio
• Highlight the lines
• Press Ctrl + Shift + C
EURUSD <- 1.15
GBPUSD <- 1.35
GOOGL <- 1050
sum_of_all <- EURUSD + GBPUSD + GOOGL

Note: Just as you comment lines with Ctrl + Shift + C (on Windows and Linux) or Cmd + Shift + C (on Mac), you can uncomment them by highlighting the code and pressing the same shortcut again.

Exercise: Now highlight the lines you commented in the previous exercise and uncomment them. You will see that the # characters are automatically removed from the front of the highlighted code block.

Creating and deleting an object

An object can be created with the assignment operator, written as “<-” or (in some cases) “=”. I suggest always using <- to be consistent. The name of an object must start with a letter (A-Z or a-z) and can include letters, digits (0-9), underscores (_) and dots (.).

Example

n <- 10
smtg_123 <- 1999.9999
MSFT <- 84
GBPUSD <- 1.35
string_object <- "This is my string object"
logical_object <- TRUE

Exercise: Variable declarations and assignment

Instructions

• Define a variable named GOOGL and assign Google’s stock price today to it
• Define another variable named AMAZN and assign Amazon’s stock price today to it
• Print out the GOOGL and AMAZN variables that you have just created. (To print them, just type GOOGL and AMAZN)

It is also possible to define more than one object on a single line. To do that, separate the assignments with semicolons (;).

Example:

GOOGL <- 1200; AMZN <- 1160; AAPL <- 169

In addition to the assignment operator (<-), objects can be created with the assign() function.

assign("x", 5)
assign("I", "Mehmet")

Note: Always keep in mind that R is case sensitive, so the variable “apple” is not the same as the variable “Apple”; similarly, r is not the same as R.

Example: Run the code to see the results.

"Apple" == "apple"
"r" == "R"
n <- 15
N <- 133
n
N

See the difference: n stores 15 while N is storing 133 and n is not overwritten by 133.

What if an object already exists? Let’s say we have a variable t with the value 5. If we then assign 10 to t, the old value is overwritten and t becomes 10.

Example: Run the code to see the result

t <- 5 # the value of t is 5
t
t <- 10 # the value of t is now 10
t

We can also write expressions without assigning their values to a variable. In this case the result is displayed on the console, but nothing is stored, so nothing new appears in the global environment.

Example: Run the code

20 + 5
(10 - 3) * 7
1999 + 1

When objects are defined and assigned values, they are stored in active memory and are visible in the “global environment”, which is shown at the top right corner of RStudio.

Exercise: Creating objects

Instructions

• Create objects; x, y, z and assign “3” to “x”, “5” to “y” and “I am not numeric” to “z”.
• Print out x, y and z

Checking the objects in active memory/global environment

To see the objects in the global environment, use the ls() function. There are several ways of listing objects:

• ls()
• objects()
• ls(pattern = "character")
• ls.str()

The collection of objects currently stored is called the workspace.

Deleting the objects in active memory/global environment

To delete objects in global environment you need to use the rm() function. (rm stands for remove)

• rm()
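
As a small illustration of ls() and rm(), the sketch below creates two objects, lists them and removes one. The object names demo_price and demo_ticker are made up for this example and do not appear elsewhere in the tutorial.

```r
# Create two example objects (hypothetical names)
demo_price <- 1050
demo_ticker <- "GOOGL"

# List only the objects whose names contain "demo"
ls(pattern = "demo")

# Remove one object, then check which ones still exist
rm(demo_price)
exists("demo_price")   # FALSE: the object was removed
exists("demo_ticker")  # TRUE: this one is still in memory
```

To remove several objects at once, pass them all to rm(), for example rm(a, b).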

Getting help in R

R provides very useful documentation on how to use functions. There are two common ways to get help on a function: ?function_name and help(“function_name”) (or help(function_name)).

Example: Getting help with mean function. Run the code to see the output
?mean
help("mean")
help(mean)

They all will give the same result. Within RStudio it will display the help page for the function mean() (arithmetic mean). In the help page you will see the function details such as arguments, examples, references etc.

To get very high level help with R you can use help.start()
help.start()

Using the help function is very important for beginners, especially for looking up function arguments.

Topic 2: Data Types

R supports various data types, but the most basic types to get started with are:

• numeric: numerical values like 2017, 15, 3.5, 9.2, 11.7
• character: string values such as “I love R”, “a”, “k”, written within quotation marks
• logical: boolean values, which can take two values: TRUE and FALSE

Example: Variables with different types of data types

num_var <- 12.69
char_var <- "I am a character, not numerical"
logical_val <- TRUE

Exercise: Create your own variables for each data type that are shown in the example above

How to check the data type of a variable?

The data type of a variable can be checked with the mode() or class() function.
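
The two functions usually agree, but not always. The short sketch below compares them on a decimal number and on an integer; the integer literal 5L and the variable names are assumptions introduced only for this illustration.

```r
# A decimal number: mode and class agree
dec_val <- 12.69
mode(dec_val)   # "numeric"
class(dec_val)  # "numeric"

# An integer (the L suffix marks an integer literal): mode and class differ
int_val <- 5L
mode(int_val)   # "numeric"
class(int_val)  # "integer"
```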

Exercise: Complete the code by adding required argument of mode function for a logical and a character value.

mode(10) # mode of a numerical value; 10
mode(...) # write your code where ...
mode(...) # write your code where ...

Topic 3: Data Structures

Main data structures in R to learn at this stage are;

• Vector
• Factor
• Matrix
• Data frame
• List
• ts (time series)

Let’s get started with vectors

Vector

A vector is a one-dimensional array that can hold numeric, logical or character values, but a single vector cannot hold more than one data type. In R you can create a vector in two ways: with the combine function c() or with the vector() function. The vector() function is particularly handy in “for loops”, which you will learn about in future tutorials.

• c(): does not require you to specify the mode or length up front
• vector(): takes two arguments, mode and length, which you specify up front

Example: Creating vectors with c() function

num_vect <- c(1, 2, 10, 20)
char_vect <- c("AMZN", "GOOGL", "AAPL")
log_vect <- c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)

Vectors may contain NAs as well

num_vect <- c(1, 2, 10, 20, NA)
char_vect <- c("AMZN", "GOOGL", "AAPL", NA)
log_vect <- c(TRUE, FALSE, NA, NA, NA)

Example: Creating vectors with vector() function

num_vect <- vector(mode = "numeric", length = 10) # this will create a numeric vector with length 10
# Printing out num_vect
num_vect
##  [1] 0 0 0 0 0 0 0 0 0 0
log_vect <- vector(mode = "logical", length = 10)
# Printing out log_vect
log_vect
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
char_vect <- vector(mode = "character", length = 10)
# Printing out char_vect
char_vect
##  [1] "" "" "" "" "" "" "" "" "" ""

Exercise: Creating Vectors

Instructions

• Complete the code for your_log_vect such that it holds 3 values: FALSE, TRUE, TRUE
• Complete the code for your_char_vect such that it holds the days of the week: “Mon”, “Tue”, …
• Complete the code for log_vect such that it consists of logical values with a length of 6. For logical values, set the mode argument to “logical”
null_vector <- c()
# Create your vectors with c()
your_num_vect <- c(190, 200, 150, 10)
your_log_vect <- c()
your_char_vect <-

# Create your vectors with vector()
num_vect <- vector(mode = "numeric", length = 3)
log_vect <- vector()
char_vect <-

Here again we can use the assign() function with vectors.

Example: Creating a vector with assign() function
assign("x", c(1,3,5,7,9)) # vector c(1,3,5,7,9) is assigned to x

Arithmetic operations with vectors

Arithmetic operations on vectors are performed element-wise. For example, when we add a number to a vector of 5 elements, the number is added to every element.

Example:
my_vec <- c(1, 5, 7, 10)
my_var <- 2
my_new_vec <- my_vec + my_var
my_new_vec
my_another_var <- my_vec * 10
my_another_var
my_one_another_var <- my_vec / 100
my_one_another_var

In arithmetic with logical vectors, the elements are coerced to numeric values, FALSE becoming 0 and TRUE becoming 1. This is not a universal rule, however; there are situations in which this is not the case.
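
A brief sketch of this coercion, using a made-up logical vector log_vec_demo:

```r
# TRUE is treated as 1 and FALSE as 0 in arithmetic
log_vec_demo <- c(TRUE, FALSE, TRUE, TRUE)

plus_one <- log_vec_demo + 1
plus_one    # 2 1 2 2

# Summing a logical vector therefore counts the TRUE values
total_true <- sum(log_vec_demo)
total_true  # 3
```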

Generating regular sequences

• :
• seq()

The most commonly used methods for generating number sequences are the colon (:) operator and the seq() function. To generate a regular sequence from 1 to 15, for example, we write 1:15 with the colon operator or seq(1, 15) with the seq() function. It is also possible to create a backward sequence: 15:1 starts at 15 and ends at 1.

The seq() function is a more general method for generating sequences and has five main arguments: from, to, by, length.out and along.with.

help(seq)

Let’s now generate a sequence from 1 to 15, increasing by 1, using the seq() function.

seq(1, 15)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

In this code we did not specify the by argument because it is 1 by default. Similarly, the length.out and along.with arguments are not specified since their default value is NULL.

In the above example we did not use the argument names of the seq() function, but they can be given in named form as well.

Caution: Argument order matters in positional form, but not in named form.
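
To make the caution concrete, the sketch below builds the same sequence twice: once with positional arguments and once with named arguments in a shuffled order. The variable names are chosen only for this illustration.

```r
# Positional form: the order must be from, to, by
seq_positional <- seq(1, 10, 2)

# Named form: the same arguments in a different order
seq_named <- seq(by = 2, to = 10, from = 1)

seq_positional                        # 1 3 5 7 9
identical(seq_positional, seq_named)  # TRUE
```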

Example: Creating a sequence from 1 to 10 increasing by 1

my_seq <- seq(from = 1, to = 10)
my_seq
##  [1]  1  2  3  4  5  6  7  8  9 10
Exercise: Now generate a sequence from 10 to 100 increasing by 5
seq.by.5 <- seq(from = ..., to = ...)
seq.by.5
seq(from = ..., to = ..., by = 5)

Exercise: Generate a sequence with length of 20 starting from 10 increasing by 10

seq.by.10 <- seq(from = ..., by = 10)
seq.by.10
# Don't forget to specify length.out argument
seq(length.out = ...)

Some useful functions

rep() replicates values the specified number of times

?rep
Exercise: run the code and check how many times 5 is replicated
replicated_values <- rep(5)
replicated_values

As seen from the result, it is replicated only one time because the default value for the times argument is 1.

Exercise: Fill in the ... to replicate the value 7 five times
replicated_five_times <- rep(7,...)
# do not forget to specify times argument
rep(times = ...)

paste() takes an arbitrary number of arguments and concatenates them element by element into character strings. By default the arguments are separated in the result by a single blank character.

Example: Just run the code
values <- paste(c("a", "b", "c"), 1:10, sep = "")
values
Example: just run the code
# paste with different value for sep argument
values <- paste(c("a", "b", "c"), 1:10, sep = "-")
values
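
In the two examples above the vectors have different lengths (3 and 10), so paste() reuses the shorter one from its start until the longer one is exhausted. A smaller sketch of the default separator and of this recycling, with made-up inputs:

```r
# Default separator is a single space
with_space <- paste("R", "Studio")
with_space  # "R Studio"

# An empty separator glues the pieces together
no_space <- paste("R", "Studio", sep = "")
no_space    # "RStudio"

# The shorter vector c("a", "b") is recycled to match 1:4
short_labels <- paste(c("a", "b"), 1:4, sep = "_")
short_labels  # "a_1" "b_2" "a_3" "b_4"
```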

Naming vector elements

In order to name the elements of a vector we use names() function.

Example: The “google_vec” represents the stock price details for Google as of March 21, 2018. The details are “Open”, “High”, “Low”, “Close” and “Volume”. Let’s run the code and see the resulting vector
# google stock price details as of March 21, 2018
google_vec <- c(1092.74, 1106.30, 1085.15, 1090.88, 1878873)
Now, let’s name the elements with the following details; “Open”, “High”, “Low”, “Close” and “Volume”. Run the code to see the details
names(google_vec) <- c("Open", "High", "Low", "Close", "Volume")
An alternative way of naming a vector is to store the names in a variable and assign that variable. Let’s now create a variable called price_details containing the names and use it to name the google vector
price_details <- c("Open", "High", "Low", "Close", "Volume")
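
The step that applies price_details is not shown above. Presumably it is a names() assignment like the earlier one, so a minimal sketch under that assumption is:

```r
# google stock price details as of March 21, 2018
google_vec <- c(1092.74, 1106.30, 1085.15, 1090.88, 1878873)

# store the names in a variable, then use that variable to name the vector
price_details <- c("Open", "High", "Low", "Close", "Volume")
names(google_vec) <- price_details
google_vec
```

Keeping the names in one variable is convenient when several vectors share the same labels.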

Now it is your turn to create a vector and assign names to it.

Exercise:

• Create a vector with the following elements: 1586.45, 1590.00, 1563.17, 1581.86, 4750771 and assign it to the variable amzn_vec
• Name the elements Open, High, Low, Close and Volume in the same way as I did above

amzn_vec <-

Subsetting vectors

Understanding subsetting (element selection) is very important. It is a building block for the more complex data structures we will see later on.

Vectors can be subsetted or specific elements can be selected in four different ways. We need to use square brackets, [ ], to select the elements of a vector

• A logical vector: my_vec[TRUE]
• A vector of positive integers: my_vec[c(1,3,5)]
• A vector of negative integers: my_vec[-c(2,4,6)]
• A character vector of names: my_vec[c("GOOGL", "AMZN", "DELL")]

Let’s re-create the google_vec and name with price details.

google_vec <- c(1092.74, 1106.30, 1085.15, 1090.88, 1878873)
names(google_vec) <- c("Open", "High", "Low", "Close", "Volume")
##       Open       High        Low      Close     Volume
##    1092.74    1106.30    1085.15    1090.88 1878873.00
We will now subset/select elements Open and Close price from google_vec with four methods mentioned above.
google_vec <- c(1092.74, 1106.30, 1085.15, 1090.88, 1878873)

# subsetting with logical vector

Caution: Only TRUEs will be selected

google_vec <- c(1092.74, 1106.30, 1085.15, 1090.88, 1878873)

# subsetting with vector of positive quantity

Caution: Only the first and fourth elements will be selected

# subsetting with vector of negative quantity

Caution: With negative subsetting, none of the elements indicated after the - sign will be selected. Negative subsetting is generally used for the elements that we DO NOT want to select.

google_vec <- c(1092.74, 1106.30, 1085.15, 1090.88, 1878873)

# subsetting with vector of character string

Caution: Only indicated element names will be selected. When the elements have names this method is very convenient.
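
The subsetting calls themselves are not shown in the chunks above. One way to write them is sketched below; this is one possible version, not necessarily the original code, and the result names (by_logical and so on) are introduced only for this illustration. All four select the Open and Close prices.

```r
google_vec <- c(1092.74, 1106.30, 1085.15, 1090.88, 1878873)
names(google_vec) <- c("Open", "High", "Low", "Close", "Volume")

# 1. logical vector: TRUE marks the elements to keep
by_logical <- google_vec[c(TRUE, FALSE, FALSE, TRUE, FALSE)]

# 2. positive integers: positions of the elements to keep
by_positive <- google_vec[c(1, 4)]

# 3. negative integers: positions of the elements to drop
by_negative <- google_vec[-c(2, 3, 5)]

# 4. character vector: names of the elements to keep
by_name <- google_vec[c("Open", "Close")]

by_name
```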

More examples:

google_vec <- c(1092.74, 1106.30, 1085.15, 1090.88, 1878873)

# selecting only close price of google_vec
#or

# selecting first three elements of google_vec

# select everything except first element

# select only last element
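
The commands for these further examples are missing above. A sketch of what they could look like follows, assuming google_vec is named as before; the result variable names are hypothetical.

```r
google_vec <- c(1092.74, 1106.30, 1085.15, 1090.88, 1878873)
names(google_vec) <- c("Open", "High", "Low", "Close", "Volume")

# selecting only the close price, by position or by name
close_by_position <- google_vec[4]
close_by_name <- google_vec["Close"]

# selecting the first three elements
first_three <- google_vec[1:3]

# selecting everything except the first element
all_but_first <- google_vec[-1]

# selecting only the last element, whatever the length of the vector
last_element <- google_vec[length(google_vec)]
```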

Modifying the elements of a vector

In order to modify the specific element/s of a vector we often use the subsetting methods I have shown previously.

Let’s use the same google_vec vector of Google stock prices, but suppose the close price is wrong and needs to be corrected. The correct close price should be 1090.88
google_vec <- c(1092.74, 1106.30, 1085.15, 100000, 1878873)
names(google_vec) <- c("Open", "High", "Low", "Close", "Volume")

To modify elements we use the same subsetting rules and assign a new value (a number, a logical value, a string, etc.).

Syntax: my_vec[elements_to_select] <- new_value

Let’s now correct the close price in google_vec

google_vec <- c(1092.74, 1106.30, 1085.15, 100000, 1878873)
names(google_vec) <- c("Open", "High", "Low", "Close", "Volume")
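
The correcting assignment is not shown in the chunk above. Following the syntax just given, a minimal sketch would be:

```r
google_vec <- c(1092.74, 1106.30, 1085.15, 100000, 1878873)
names(google_vec) <- c("Open", "High", "Low", "Close", "Volume")

# select the wrong element by name and overwrite it
google_vec["Close"] <- 1090.88

# selecting by position would work as well: google_vec[4] <- 1090.88
google_vec
```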

Factor

A “factor” is used to store categorical data. An example of a categorical variable is a bond rating: AAA, BBB, BBB-, etc. In R, a factor is created with the factor() function.

Suppose, for example, we have 20 bonds whose ratings are stored in a character vector, bond.ratings.

bond.ratings <- c("AAA", "AAA", "AA+", "BBB", "BBB-", "BB+", "B+",
"AA+", "AAA", "BBB", "BBB", "BB-", "BB+", "B+",
"AA+", "BBB", "BB+", "BB-", "B+", "B+")
bond.ratings
##  [1] "AAA"  "AAA"  "AA+"  "BBB"  "BBB-" "BB+"  "B+"   "AA+"  "AAA"  "BBB"
## [11] "BBB"  "BB-"  "BB+"  "B+"   "AA+"  "BBB"  "BB+"  "BB-"  "B+"   "B+"
In the below example we create a factor from bond.ratings and assign the result to bond.ratings.factor
bond.ratings <- c("AAA", "AAA", "AA+", "BBB", "BBB-", "BB+", "B+",
"AA+", "AAA", "BBB", "BBB", "BB-", "BB+", "B+",
"AA+", "BBB", "BB+", "BB-", "B+", "B+")
bond.ratings.factor <- factor(bond.ratings)
bond.ratings.factor

To find out the levels of a factor the function levels() can be used.

bond.ratings <- c("AAA", "AAA", "AA+", "BBB", "BBB-", "BB+", "B+",
"AA+", "AAA", "BBB", "BBB", "BB-", "BB+", "B+",
"AA+", "BBB", "BB+", "BB-", "B+", "B+")
bond.ratings.factor <- factor(bond.ratings)
levels(bond.ratings.factor)
## [1] "AA+"  "AAA"  "B+"   "BB-"  "BB+"  "BBB"  "BBB-"

To continue our example, suppose we have the default rates of the same bonds and they are stored in another vector, bond.default.

bond.default <- c(0.05, 0.07, 0.10, 0.05, 0.2, 0.35, 0.48, 0.5,
0.10, 0.25, 0.05, 0.80, 0.10, 0.03, 0.38, 0.10,
0.05, 0.6, 0.45, 0.55)
bond.default
##  [1] 0.05 0.07 0.10 0.05 0.20 0.35 0.48 0.50 0.10 0.25 0.05 0.80 0.10 0.03
## [15] 0.38 0.10 0.05 0.60 0.45 0.55

We can, for example, calculate the sample mean and standard deviation for each category using a special function, tapply(). (We will learn about such special functions later on.) To see the mean and standard deviation of each level/category, run the code chunk below.

bond.ratings <- c("AAA", "AAA", "AA+", "BBB", "BBB-", "BB+", "B+",
"AA+", "AAA", "BBB", "BBB", "BB-", "BB+", "B+",
"AA+", "BBB", "BB+", "BB-", "B+", "B+")
bond.default <- c(0.05, 0.07, 0.10, 0.05, 0.2, 0.35, 0.48, 0.5,
0.10, 0.25, 0.05, 0.80, 0.10, 0.03, 0.38, 0.10,
0.05, 0.6, 0.45, 0.55)

bond.ratings.factor <- factor(bond.ratings)
default.average <- tapply(bond.default, bond.ratings.factor, mean)
default.average

default.sd <- tapply(bond.default, bond.ratings.factor, sd)
default.sd

As you can see from the results, the levels of our factor are in the default alphabetical order. We can instead set the level order ourselves. Suppose we would like to order them by rating: AAA, AA+, BBB, BBB-, BB+, BB-, and B+. We can do this by explicitly specifying the levels argument of factor(). (This only changes the order of the levels; to make R treat the factor as ordinal, also pass ordered = TRUE.)

Example: Specifying the level order for bond.ratings

bond.ratings <- c("AAA", "AAA", "AA+", "BBB", "BBB-", "BB+", "B+",
"AA+", "AAA", "BBB", "BBB", "BB-", "BB+", "B+",
"AA+", "BBB", "BB+", "BB-", "B+", "B+")

# creating a factor with the default (alphabetical) level order
bond.ratings.factor.unordered <- factor(bond.ratings)

# checking the levels
levels(bond.ratings.factor.unordered)

# creating a factor with an explicit level order (a character vector of levels)
bond.ratings.factor.ordered <- factor(bond.ratings, levels = c("AAA", "AA+", "BBB", "BBB-", "BB+", "BB-", "B+"))

# checking the levels
levels(bond.ratings.factor.ordered)

As seen from the output, levels of bond.ratings.factor.ordered factor starts from AAA, AA+ …, while levels of bond.ratings.factor.unordered factor starts from AA+, AAA, …. etc. based on alphabetical order.

Matrix

In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns. Matrix can be created with matrix() function.

Since a matrix is organised in rows and columns, it is called two-dimensional.

A matrix can hold only the same data type, such as numeric, character, and logical.

Suppose, for example, we collected stock prices for Google, Amazon, IBM, and Microsoft for Monday, Tuesday, Wednesday, Thursday and Friday - from March 19 to March 23, 2018.

Suppose the closing prices are:

• GOOGL: 1100.07, 1095.80, 1094.00, 1053.15, 1026.55
• AMZN: 1544.93, 1586.51, 1581.86, 1544.92, 1495.56
• IBM: 157.35, 156.20, 156.69, 152.09, 148.89
• MSFT: 92.89, 93.13, 92.48, 89.78, 87.18

Example: Let’s now create a matrix with each column representing stock tickers and each row representing stock prices. We should have 5 rows and 4 columns.

We can first check the arguments of matrix() with args() function. This is very useful for a quick look at the function.

args(matrix)
## function (data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
## NULL

As seen from the output, byrow is FALSE by default, so matrix() fills the matrix column by column.

# creating a vector with all stock prices
stock_prices <- c(1100.07, 1095.80, 1094.00, 1053.15, 1026.55, 1544.93, 1586.51, 1581.86, 1544.92, 1495.56, 157.35, 156.20, 156.69, 152.09, 148.89, 92.89, 93.13, 92.48, 89.78, 87.18)

# creating matrix with default parameters. The result is a 20x1 matrix (a single column)
stock_prices_matrix <- matrix(stock_prices)
stock_prices_matrix

The result is something we did not intend. Since we did not specify the ncol or nrow argument, matrix() put all 20 values into a single column.

Now let’s construct a matrix by specifying ncol argument.

# creating a vector with all stock prices
stock_prices <- c(1100.07, 1095.80, 1094.00, 1053.15, 1026.55, 1544.93, 1586.51, 1581.86, 1544.92, 1495.56, 157.35, 156.20, 156.69, 152.09, 148.89, 92.89, 93.13, 92.48, 89.78, 87.18)

# creating matrix by specifying ncol; nrow is inferred from the data. We need 5 rows and 4 columns in our example
stock_prices_matrix <- matrix(stock_prices, ncol = 4, byrow = FALSE)
stock_prices_matrix

The result is a 5x4 matrix, with each column representing a stock ticker and each row representing the stock prices for one day.
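
To see what the byrow argument changes, the sketch below fills two small matrices from the same numbers, once by column (the default) and once by row. The names by_column and by_row are chosen only for this illustration.

```r
# Default: values fill the first column, then the second
by_column <- matrix(1:6, ncol = 2)
by_column
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

# byrow = TRUE: values fill the first row, then the second, and so on
by_row <- matrix(1:6, ncol = 2, byrow = TRUE)
by_row
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
```

The stock price vector lists all prices of one stock before moving on to the next, which is why filling by column puts each stock in its own column.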

In our example we’ve created the matrix from a single vector, stock_prices. Another way is to build it from separate vectors. This way it is clearer which column is googl, which is amzn and so on.

# creating one vector of prices per stock
googl <- c(1100.07, 1095.80, 1094.00, 1053.15, 1026.55)
amzn <- c(1544.93, 1586.51, 1581.86, 1544.92, 1495.56)
ibm <- c(157.35, 156.20, 156.69, 152.09, 148.89)
msft <-  c(92.89, 93.13, 92.48, 89.78, 87.18)

# creating matrix by specifying ncol; nrow is inferred from the data. We need 5 rows and 4 columns in our example
stock_prices_matrix <- matrix(c(googl, amzn, ibm, msft), ncol = 4, byrow = FALSE)
stock_prices_matrix

Naming matrix elements

Yes, we constructed a 5x4 matrix in the example above; however, something is missing: row and column names. We do not know which column represents which stock ticker and which row represents which day of the week. To obtain a clearer matrix we may want to name the columns and rows.

In R, matrix columns and rows can be named in two ways: either by explicitly specifying row and column names in the dimnames argument, or by using the rownames() and colnames() functions.

Example: Naming the columns and rows of stock_prices_matrix via the dimnames argument.

# creating a vector with all stock prices
stock_prices <- c(1100.07, 1095.80, 1094.00, 1053.15, 1026.55, 1544.93, 1586.51, 1581.86, 1544.92, 1495.56, 157.35, 156.20, 156.69, 152.09, 148.89, 92.89, 93.13, 92.48, 89.78, 87.18)

# creating matrix by specifying ncol and dimnames. We need 5 rows and 4 columns in our example
stock_prices_matrix <- matrix(stock_prices, ncol = 4, dimnames = list( c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"), c("GOOGL", "AMZN", "IBM", "MSFT")))
stock_prices_matrix

As I’ve mentioned earlier, another way of naming the rows and columns of a matrix is to use the rownames() and colnames() functions.

# creating a vector with all stock prices
stock_prices <- c(1100.07, 1095.80, 1094.00, 1053.15, 1026.55, 1544.93, 1586.51, 1581.86, 1544.92, 1495.56, 157.35, 156.20, 156.69, 152.09, 148.89, 92.89, 93.13, 92.48, 89.78, 87.18)

# creating matrix by specifying ncol, then naming rows and columns. We need 5 rows and 4 columns in our example
stock_prices_matrix <- matrix(stock_prices, ncol = 4)
rownames(stock_prices_matrix) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
colnames(stock_prices_matrix) <- c("GOOGL", "AMZN", "IBM", "MSFT")
stock_prices_matrix

Arithmetic with matrices

In R, arithmetic operations on matrices are performed element-wise.

Suppose we have a matrix that has 3 rows and 3 columns. Elements are from 1 to 9.

my_matrix <- matrix(1:9, ncol = 3)
my_matrix
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

Multiplying my_matrix with 100

my_matrix <- matrix(1:9, ncol = 3)
my_matrix * 100
##      [,1] [,2] [,3]
## [1,]  100  400  700
## [2,]  200  500  800
## [3,]  300  600  900

Sum of two matrices

my_matrix_1 <- matrix(1:9, ncol = 3)
my_matrix_2 <- matrix(10:18, ncol = 3)
my_matrix_1 + my_matrix_2
##      [,1] [,2] [,3]
## [1,]   11   17   23
## [2,]   13   19   25
## [3,]   15   21   27

Product of two matrices (element-wise)

my_matrix_1 <- matrix(1:9, ncol = 3)
my_matrix_2 <- matrix(10:18, ncol = 3)
my_matrix_1 * my_matrix_2
##      [,1] [,2] [,3]
## [1,]   10   52  112
## [2,]   22   70  136
## [3,]   36   90  162
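
Note that * multiplies matching elements. It is not the matrix product from linear algebra, for which R provides the %*% operator. The sketch below contrasts the two on the same matrices; the result names are chosen only for this illustration.

```r
my_matrix_1 <- matrix(1:9, ncol = 3)
my_matrix_2 <- matrix(10:18, ncol = 3)

# element-wise product: each element times the element in the same position
elementwise <- my_matrix_1 * my_matrix_2
elementwise[1, 1]     # 1 * 10 = 10

# matrix product: rows of the first matrix times columns of the second
matrix_product <- my_matrix_1 %*% my_matrix_2
matrix_product[1, 1]  # 1*10 + 4*11 + 7*12 = 138
```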

Exercise: Suppose we have the following return series for each stock, from Monday to Friday:

• GOOGL: 0.05, 0.03, 0.02, -0.05, -0.10
• AMZN: -0.07, -0.05, 0.05, 0.04, 0.08
• IBM: -0.00, -0.01, 0.03, -0.04, 0.06
• MSFT: 0.05, -0.01, 0.10, -0.03, 0.04

googl.return <- c(0.05, 0.03, 0.02, -0.05, -0.10)
amzn.return <- c(-0.07, -0.05, 0.05, 0.04, 0.08)
ibm.return <- c(-0.00, -0.01, 0.03, -0.04, 0.06)
msft.return <-  c(0.05, -0.01, 0.10, -0.03, 0.04)

# creating matrix. specify the ncol and dimnames arguments
stock_return_matrix <- matrix(c(googl.return, amzn.return, ibm.return, msft.return), ncol = 4, dimnames = list(c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"), c("GOOGL", "AMZN", "IBM", "MSFT")))
stock_return_matrix

Suppose you discover that the returns in the code chunk below are entered as whole numbers (5 instead of 0.05).

• Convert stock_return_matrix to decimal (percentage) form by dividing its elements by 100 and assign the result to stock_return_matrix_percentage.
• Print out the stock_return_matrix_percentage
googl.return <- c(5, 3, 2, -5, -10)
amzn.return <- c(-7, -5, 5, 4, 8)
ibm.return <- c(-0, -1, 3, -4, 6)
msft.return <-  c(5, -1, 10, -3, 4)

days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
stock_names <- c("GOOGL", "AMZN", "IBM", "MSFT")

# creating matrix. specify the ncol and dimnames arguments
stock_return_matrix <- matrix(c(googl.return, amzn.return, ibm.return, msft.return), ncol = 4, dimnames = list(days, stock_names))

stock_return_matrix_percentage <-
• You are wondering what the total return is for each stock. Use the colSums function to find the total return. Assign the result to total_stock_return
googl.return <- c(0.05, 0.03, 0.02, -0.05, -0.10)
amzn.return <- c(-0.07, -0.05, 0.05, 0.04, 0.08)
ibm.return <- c(-0.00, -0.01, 0.03, -0.04, 0.06)
msft.return <-  c(0.05, -0.01, 0.10, -0.03, 0.04)

days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
stock_names <- c("GOOGL", "AMZN", "IBM", "MSFT")

# creating matrix. specify the ncol and dimnames arguments
stock_return_matrix <- matrix(c(googl.return, amzn.return, ibm.return, msft.return), ncol = 4, dimnames = list(days, stock_names))

total_stock_return <- colSums(....)
• Now you are wondering what the average return is for each day. Use the rowMeans function to find the average return for each day. Assign the result to average.daily.return
googl.return <- c(0.05, 0.03, 0.02, -0.05, -0.10)
amzn.return <- c(-0.07, -0.05, 0.05, 0.04, 0.08)
ibm.return <- c(-0.00, -0.01, 0.03, -0.04, 0.06)
msft.return <-  c(0.05, -0.01, 0.10, -0.03, 0.04)

days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
stock_names <- c("GOOGL", "AMZN", "IBM", "MSFT")

# creating matrix. specify the ncol and dimnames arguments
stock_return_matrix <- matrix(c(googl.return, amzn.return, ibm.return, msft.return), ncol = 4, dimnames = list(days, stock_names))

average.daily.return <-

Adding columns and rows to a matrix

The most standard way of adding columns or rows to a matrix is to use rbind() and cbind() functions.

• rbind(): to add by row
• cbind(): to add by column

Suppose, in our stock example, we would like to add one column to store the total return for each day and one row to store the total return for each stock. The example below shows how to do this.

googl.return <- c(0.05, 0.03, 0.02, -0.05, -0.10)
amzn.return <- c(-0.07, -0.05, 0.05, 0.04, 0.08)
ibm.return <- c(-0.00, -0.01, 0.03, -0.04, 0.06)
msft.return <-  c(0.05, -0.01, 0.10, -0.03, 0.04)

days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
stock_names <- c("GOOGL", "AMZN", "IBM", "MSFT")

# creating matrix. specify the ncol and dimnames arguments
stock_return_matrix <- matrix(c(googl.return, amzn.return, ibm.return, msft.return), ncol = 4, dimnames = list(days, stock_names))

# we create two vectors: total.stock (total return for each stock) and total.daily (total return for each day)
# then we combine stock_return_matrix with total.stock by row and with total.daily by column, assigning the result back to stock_return_matrix
total.stock <- colSums(stock_return_matrix)
stock_return_matrix <- rbind(stock_return_matrix, total.stock)

total.daily <- rowSums(stock_return_matrix)
stock_return_matrix <- cbind(stock_return_matrix, total.daily)
stock_return_matrix

Subsetting elements of a matrix

To subset elements of a matrix in R we use the syntax my_matrix[i, j], where i represents the row and j represents the column.

As with vectors, a matrix can be subsetted in four different ways. The logic is the same as for vectors, but rows and columns are separated by a comma: [i, j] where i = rows, j = columns

• A logical vector: my_matrix[TRUE ,TRUE]
• A vector of positive integers: my_matrix[1:3, c(1,3,5)]
• A vector of negative integers: my_matrix[-(1:2), -c(2,4,6)]
• A character vector of names: my_matrix[c("day 1", "day 2"), c("GOOGL", "AMZN", "DELL")]

CAUTION: You are not restricted to a single method. All methods can be combined, and this is often the case in data analysis.
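
As a sketch of combining methods, the example below mixes names, positions, negative indices and logical vectors in single calls. The matrix name fx_matrix and the result names are made up for this illustration; the row and column labels are borrowed from the "by name" example later in this section.

```r
fx_matrix <- matrix(1:9, ncol = 3,
                    dimnames = list(c("day1", "day2", "day3"),
                                    c("EURUSD", "JPYUSD", "GBPCHF")))

# row by name, columns by position
day2_first_and_third <- fx_matrix["day2", c(1, 3)]
day2_first_and_third    # EURUSD 2, GBPCHF 8

# rows by negative index (drop day1), column by name
jpy_without_day1 <- fx_matrix[-1, "JPYUSD"]
jpy_without_day1        # day2 5, day3 6

# rows by logical vector, columns by position
corner_block <- fx_matrix[c(TRUE, FALSE, TRUE), 2:3]
corner_block            # 2x2 matrix: rows day1 and day3, columns JPYUSD and GBPCHF
```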

Example: by logical

# creating 3x3 matrix
my_matrix <- matrix(1:9, ncol = 3)
my_matrix
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
# selecting every element
my_matrix[,]
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
# Subsetting second row , third column by logical
my_matrix[c(FALSE, TRUE, FALSE), c(FALSE, FALSE, TRUE)]
## [1] 8
# Subsetting the whole second row and all columns by logical
my_matrix[c(FALSE, TRUE, FALSE), c(TRUE, TRUE, TRUE)] # = my_matrix[c(F, T, F), ]
## [1] 2 5 8
# Subsetting every element of the first column by logical
my_matrix[, c(TRUE, FALSE, FALSE)]
## [1] 1 2 3

Example: by positive index

# creating 3x3 matrix
my_matrix <- matrix(1:9, ncol = 3)
my_matrix
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
# selecting every element
my_matrix[,]  # = my_matrix[1:3, 1:3] = my_matrix[c(1,2,3), c(1,2,3)]
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
# Subsetting second row , third column
my_matrix[2, 3]
## [1] 8
# Subsetting the whole second row and all columns
my_matrix[2, 1:3] # = my_matrix[2, ]
## [1] 2 5 8
# Subsetting every element of the first column
my_matrix[, 1]
## [1] 1 2 3

Example: by negative index

# creating 3x3 matrix
my_matrix <- matrix(1:9, ncol = 3)
my_matrix
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
# selecting every element
my_matrix[,]
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
# Subsetting second row , third column
my_matrix[-c(1,3), -c(1,2)]
## [1] 8
# Subsetting the whole second row and all columns
my_matrix[-c(1,3), 1:3] # = my_matrix[-c(1,3), ]
## [1] 2 5 8
# Subsetting every element of the first column
my_matrix[, -c(2:3)] # = my_matrix[, -2:-3]
## [1] 1 2 3

Example: by name

# creating 3x3 matrix
my_matrix <- matrix(1:9, ncol = 3, dimnames = list(c("day1", "day2", "day3"), c("EURUSD", "JPYUSD", "GBPCHF")))
my_matrix
##      EURUSD JPYUSD GBPCHF
## day1      1      4      7
## day2      2      5      8
## day3      3      6      9
# selecting every element
my_matrix[,] # = my_matrix[c("day1", "day2", "day3"), c("EURUSD", "JPYUSD", "GBPCHF")]
##      EURUSD JPYUSD GBPCHF
## day1      1      4      7
## day2      2      5      8
## day3      3      6      9
# Subsetting second row , third column
my_matrix["day2", "GBPCHF"]
## [1] 8
# Subsetting the whole second row and all columns
my_matrix["day2", ] # = my_matrix["day2", c("EURUSD", "JPYUSD", "GBPCHF")]
## EURUSD JPYUSD GBPCHF
##      2      5      8
# Subsetting every element of the first column
my_matrix[, "EURUSD"] # = my_matrix[c("day1", "day2", "day3"), "EURUSD"]
## day1 day2 day3
##    1    2    3

Example: by mix of everything

# creating 3x3 matrix
my_matrix <- matrix(1:9, ncol = 3, dimnames = list(c("day1", "day2", "day3"), c("EURUSD", "JPYUSD", "GBPCHF")))
my_matrix
##      EURUSD JPYUSD GBPCHF
## day1      1      4      7
## day2      2      5      8
## day3      3      6      9
# selecting every element
my_matrix[1:3, c("EURUSD", "JPYUSD", "GBPCHF")] # = my_matrix[c("day1", "day2", "day3"), 1:3] = my_matrix[c("day1", "day2", "day3"),]  = my_matrix[,]
##      EURUSD JPYUSD GBPCHF
## day1      1      4      7
## day2      2      5      8
## day3      3      6      9
# Subsetting second row , third column
my_matrix[2, "GBPCHF"]
## [1] 8
# Subsetting the whole second row and all columns
my_matrix[2, ]
## EURUSD JPYUSD GBPCHF
##      2      5      8
my_matrix["day2", ]
## EURUSD JPYUSD GBPCHF
##      2      5      8
# Subsetting every element of the first column
my_matrix[, 1]
## day1 day2 day3
##    1    2    3
my_matrix[, c(T, F, F)] # T = TRUE, F = FALSE
## day1 day2 day3
##    1    2    3

As seen from the last example all four methods can be combined.
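
In the outputs above, selecting a single row or column returns a plain vector rather than a matrix. A short sketch of how to keep the matrix shape with the drop argument of the [ operator; the names row_vec and row_mat are illustrative.

```r
my_matrix <- matrix(1:9, ncol = 3,
                    dimnames = list(c("day1", "day2", "day3"),
                                    c("EURUSD", "JPYUSD", "GBPCHF")))

# default behaviour: a single row is simplified to a named vector
row_vec <- my_matrix["day2", ]
is.matrix(row_vec)

# drop = FALSE keeps the result as a 1 x 3 matrix
row_mat <- my_matrix["day2", , drop = FALSE]
dim(row_mat)
```

Keeping the matrix shape matters when the result is passed to functions such as rbind() or colSums() that expect two dimensions.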

Exercise: Subsetting elements of a matrix. Let’s continue with our example and suppose you’d like to extract the return series for Microsoft.

• Extract the MSFT column by logical, negative index, positive index, and name, and assign the results to MSFT.logical, MSFT.negative.index, MSFT.positive.index, and MSFT.name, respectively. Print all of them.
• Extract Monday’s and Tuesday’s returns for Google and Amazon. Assign the result to google_amazon and print it.

googl.return <- c(0.05, 0.03, 0.02, -0.05, -0.10)
amzn.return <- c(-0.07, -0.05, 0.05, 0.04, 0.08)
ibm.return <- c(-0.00, -0.01, 0.03, -0.04, 0.06)
msft.return <-  c(0.05, -0.01, 0.10, -0.03, 0.04)

days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
stock_names <- c("GOOGL", "AMZN", "IBM", "MSFT")

# creating matrix. specify the ncol and dimnames arguments
stock_return_matrix <- matrix(c(googl.return, amzn.return, ibm.return, msft.return), ncol = 4, dimnames = list(days, stock_names))

MSFT.logical <-
MSFT.negative.index <-
MSFT.positive.index <-
MSFT.name <-

google_amazon <-

Data Frame

The data frame is by far the most used data structure in data analysis. In a data frame each column represents a variable of the data set and each row represents an observation. A data frame is a matrix-like structure whose columns may be of differing types (numeric, logical, factor, character, and so on). For example, one column can be numeric while another is character.

In R, we create a data frame with the data.frame() function.

Let’s see the help page for the data.frame function.

help(data.frame)

Example: Run the code to create the data frame df.

df <- data.frame(column1 = c("a", "b", "c"), column2 = c(1,2,3), column3 = c(TRUE, TRUE, FALSE))
df

As seen from the resulting data frame the first column is character, the second is numeric and the third column is logical.
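
One way to confirm the column types without reading them off the printed table is to ask for the class of every column. This is a minimal sketch; stringsAsFactors = FALSE is set explicitly here as an assumption, so that the first column is character regardless of the R version in use.

```r
df <- data.frame(column1 = c("a", "b", "c"),
                 column2 = c(1, 2, 3),
                 column3 = c(TRUE, TRUE, FALSE),
                 stringsAsFactors = FALSE)  # explicit, so the result does not depend on the R version

# class of every column, returned as a named character vector
col_classes <- sapply(df, class)
col_classes

# number of rows and columns
dim(df)
```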

Let’s now reconsider our example from the “Matrix” topic and use the same variables to create a data frame this time. Suppose we collected stock prices for Google, Amazon, IBM, and Microsoft from March 19 to March 23, 2018.

The closing prices are:

• GOOGL: 1100.07, 1095.80, 1094.00, 1053.15, 1026.55
• AMZN: 1544.93, 1586.51, 1581.86, 1544.92, 1495.56
• IBM: 157.35, 156.20, 156.69, 152.09, 148.89
• MSFT: 92.89, 93.13, 92.48, 89.78, 87.18

Let’s construct a data frame with 3 columns: the first column is “day”, the second is “stock”, and the third is “price”.

Run the example to create the data frame.

googl <- c(1100.07, 1095.80, 1094.00, 1053.15, 1026.55)
amzn <- c(1544.93, 1586.51, 1581.86, 1544.92, 1495.56)
ibm <- c(157.35, 156.20, 156.69, 152.09, 148.89)
msft <- c(92.89, 93.13, 92.48, 89.78, 87.18)

stock.prices.df <- data.frame(day = rep(c("Mon", "Tue", "Wed", "Thu", "Fri"), times = 4), stock = c(rep("google", times = 5), rep("amazon", times = 5), rep("ibm", times = 5), rep("microsoft", times = 5)), price =  c(googl, amzn, ibm, msft))

stock.prices.df

Now you have a data frame with 3 variables and 20 rows. In real-world problems you often need to deal with millions of rows and many variables. Therefore, before starting your analysis, it is a good idea to check the structure of the data with the str() function.

Run the code to see the structure of the data set.

googl <- c(1100.07, 1095.80, 1094.00, 1053.15, 1026.55)
amzn <- c(1544.93, 1586.51, 1581.86, 1544.92, 1495.56)
ibm <- c(157.35, 156.20, 156.69, 152.09, 148.89)
msft <- c(92.89, 93.13, 92.48, 89.78, 87.18)

stock.prices.df <- data.frame(day = rep(c("Mon", "Tue", "Wed", "Thu", "Fri"), times = 4), stock = c(rep("google", times = 5), rep("amazon", times = 5), rep("ibm", times = 5), rep("microsoft", times = 5)), price =  c(googl, amzn, ibm, msft))

str(stock.prices.df)

In R versions before 4.0.0, the output shows the day and stock columns as factors and the price column as numeric, because data.frame() converted strings to factors by default. Since R 4.0.0 the default is stringsAsFactors = FALSE, so these two columns are already character. Treating columns as factors is rarely useful unless we explicitly need factors. To get character columns regardless of the R version, set the stringsAsFactors argument to FALSE explicitly.

Run the code chunk below to create a data frame with stringsAsFactors argument specified.

googl <- c(1100.07, 1095.80, 1094.00, 1053.15, 1026.55)
amzn <- c(1544.93, 1586.51, 1581.86, 1544.92, 1495.56)
ibm <- c(157.35, 156.20, 156.69, 152.09, 148.89)
msft <- c(92.89, 93.13, 92.48, 89.78, 87.18)

stock.prices.df.no.factors <- data.frame(day = rep(c("Mon", "Tue", "Wed", "Thu", "Fri"), times = 4), stock = c(rep("google", times = 5), rep("amazon", times = 5), rep("ibm", times = 5), rep("microsoft", times = 5)), price =  c(googl, amzn, ibm, msft), stringsAsFactors = FALSE)

stock.prices.df.no.factors

Now run this code to check the structure of the data

str(stock.prices.df.no.factors)

As you can see from the output, the day and stock columns are of type character this time.
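
To see the effect of the argument side by side, the sketch below builds the same one-column data frame twice, once with each setting. The names day, with_factors, and no_factors are made up for illustration.

```r
day <- c("Mon", "Tue", "Wed")

with_factors <- data.frame(day = day, stringsAsFactors = TRUE)
no_factors   <- data.frame(day = day, stringsAsFactors = FALSE)

class(with_factors$day)   # factor
class(no_factors$day)     # character

# a factor stores integer codes plus a set of levels
levels(with_factors$day)

# a factor column can be converted back to character when needed
as.character(with_factors$day)
```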

Useful functions

• str(): shows the structure of the data set.
• head(): prints the first 6 rows (observations) by default.
• tail(): prints the last 6 rows (observations) by default.
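
Both head() and tail() accept an n argument to change how many rows are shown. A small sketch on a made-up data frame named obs:

```r
# made-up data frame with 10 rows
obs <- data.frame(id = 1:10, value = (1:10)^2)

str(obs)           # structure: 10 obs. of 2 variables

head(obs)          # first 6 rows (the default)
tail(obs)          # last 6 rows (the default)

first3 <- head(obs, n = 3)   # first 3 rows
last2  <- tail(obs, n = 2)   # last 2 rows
first3
last2
```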

To work with a richer example, let’s use the EuStockMarkets data set, which is built into R. To see the full list of available data sets, run this code chunk.

library(help = "datasets")

Imagine you are working in a hedge fund and you are asked to take a quick look at the EuStockMarkets data.

• Check the class of the data and transform it to a data frame if it is not one already. Assign the resulting data frame to the eu.stock.markets variable.

# checking the class of EuStockMarkets
class(EuStockMarkets)

# creating data frame from EuStockMarkets data
eu.stock.markets <- data.frame(....)

# re-checking the class
class(eu.stock.markets)

• Check the structure of the data set.
• Check the first and the last 6 observations.

# first 6 observations

# last 6 observations

Subsetting data frame elements

Every method available for subsetting a matrix is also available for data frames.

The logic remains the same as for a matrix. To subset elements of a data frame in R we use the syntax my_df[i, j], where i represents the row and j represents the column.

In addition to the [ ] operator, we can also extract a specific column with the $ sign. Syntax: my_df$column_name

As with matrices, a data frame can be subsetted in four different ways. The logic is the same as for matrices, but column selection can also be done with the $ sign.

• A logical vector: my_df[TRUE, TRUE]
• A vector of positive integers: my_df[1:3, c(1,3,5)]
• A vector of negative integers: my_df[-(1:2), -c(2,4,6)]
• A vector of strings: my_df[c("day 1", "day 2"), c("GOOGL", "AMZN", "DELL")]

Note: You are not limited to a single method. The methods can be combined, and in data analysis they often are.

Let’s continue with the EuStockMarkets data.

Example: Subsetting with logical vector

eu_stock_markets <- data.frame(EuStockMarkets)
# subsetting all rows of the DAX index by logical
# Caution: the result is a vector with the length of the DAX column
eu_stock_markets[, c(T,F,F,F)]
# subsetting all rows of the fourth column
eu_stock_markets[, c(F,F,F,T)]

With as many rows as in our example, writing out a logical vector by hand is not practical. Instead, logical subsetting is mostly used with functions that return a logical vector, for instance to exclude missing values from a data set. is.na() returns TRUE for every element that is NA and FALSE otherwise; applied to a whole data frame it returns a logical matrix, not one value per row. To keep only the rows without missing values, use complete.cases(), which returns TRUE for each row that contains no NA. Suppose we have a data frame df and would like to exclude the rows with missing values from it. To do so, we may use the following syntax:

df <- df[complete.cases(df), ]

For a single column, is.na() can be used directly as a row filter: df[!is.na(df$column_name), ]. The ! in front of is.na() negates the result, so rows with a missing value in that column become FALSE and are dropped.

Let’s remove NAs from the EU stock market data, if there are any.

eu_stock_markets <- data.frame(EuStockMarkets)
eu_stock_markets <- eu_stock_markets[complete.cases(eu_stock_markets), ]

Example: Subsetting with positive integer vector

eu_stock_markets <- data.frame(EuStockMarkets)
# subsetting all rows of the DAX index by position
# Caution: the result is a vector with the length of the DAX column
eu_stock_markets[, 1]
# subsetting all rows of the fourth column
eu_stock_markets[, 4]
# subsetting the first 20 rows of the third column
eu_stock_markets[1:20, 3]
# subsetting the 1st, 5th, and 7th rows of the third and fourth columns
eu_stock_markets[c(1, 5, 7), c(3, 4)]
# subsetting the first two rows of all columns
eu_stock_markets[1:2, ]

Example: Subsetting with negative integer vector

eu_stock_markets <- data.frame(EuStockMarkets)
# subsetting all rows of every column except the 2nd, 3rd, and 4th (only DAX remains)
# Caution: the result is a vector with the length of the DAX column
eu_stock_markets[, -c(2,3,4)]
# subsetting all rows of the fourth column
eu_stock_markets[, -c(1,2,3)]
# subsetting the first 20 rows of all columns except the 4th
eu_stock_markets[1:20, -4]
# subsetting the 1st, 5th, and 7th rows of the third and fourth columns
eu_stock_markets[c(1, 5, 7), -c(1, 2)]
# subsetting all columns, excluding the first two rows
eu_stock_markets[-(1:2), ]

Example: Subsetting by names

eu_stock_markets <- data.frame(EuStockMarkets)
# subsetting all rows of the SMI and CAC columns
# the result is a data frame because more than one column is selected
eu_stock_markets[, c("SMI", "CAC")]
# subsetting all rows of the fourth column (FTSE); the result is a vector
eu_stock_markets[, "FTSE"]
# subsetting the first 20 rows of DAX
eu_stock_markets[1:20, "DAX"]
# subsetting the 1st, 5th, and 7th rows of the CAC and FTSE columns
eu_stock_markets[c(1, 5, 7), c("CAC", "FTSE")]
# subsetting the first two rows of all columns
eu_stock_markets[(1:2), c("DAX", "SMI","CAC", "FTSE")]

Example: Extracting a specific column with the $ sign

eu_stock_markets <- data.frame(EuStockMarkets)

eu_stock_markets$DAX # extracting DAX column
eu_stock_markets$FTSE # extracting FTSE column
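
EuStockMarkets may not contain any missing values, so the effect of NA removal is easier to see on a small made-up data frame. The sketch below uses complete.cases() for whole rows and is.na() for a single column; the data frame prices and its values are invented for illustration.

```r
# made-up data with missing values
prices <- data.frame(DAX = c(1600, NA, 1620, 1630),
                     SMI = c(1700, 1710, NA, 1730))

# is.na() works element by element and returns a logical matrix
is.na(prices)

# complete.cases() returns one TRUE/FALSE per row
keep <- complete.cases(prices)
keep

# keep only the rows without any NA
prices_clean <- prices[keep, ]
prices_clean

# drop rows based on a single column only
dax_known <- prices[!is.na(prices$DAX), ]
dax_known
```

na.omit(prices) gives the same rows as prices[complete.cases(prices), ] and is a common shortcut.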

List

You can think of a list as a kind of super data structure that can hold vectors, matrices, data frames, and other lists. In R, a list is an object consisting of an ordered collection of objects known as its components. To create a list in R, we use the list() function.
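
The code that produced the output below does not appear in this section. A plausible reconstruction, assuming the component names shown in that output (name, ticker, hist.prices), might look like this; the variable name my_stock is made up for illustration.

```r
# assumed reconstruction: component names taken from the printed output
my_stock <- list(name = "google",
                 ticker = "GOOGL",
                 hist.prices = c(1500, 1600, 1300, 1200))

my_stock           # prints each component under its $name

length(my_stock)   # number of components
names(my_stock)    # component names
```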

## $name
## [1] "google"
##
## $ticker
## [1] "GOOGL"
##
## $hist.prices
## [1] 1500 1600 1300 1200

Length of a list:

## [1] 3

Exercise: Create a list named stock_lst and add the following components to stock_lst: stock.name = Amazon, ticker = AMZN, exchange = NASDAQ, hist.prices = c(100, 150, 120, 50, 30)

stock_lst <- list(stock.name = "Amazon", ticker = "...", exchange = ) # do not forget to add exchange and hist.prices components

stock_lst <- list(stock.name = "Amazon", ticker = "...", exchange = "NASDAQ", hist.prices = c(100, 150, 120, 50, 30))

Subsetting list components

Syntax:

Using double brackets "[[ ]]" by component position or component name:

my_list[[i]] where i represents its ith component.
my_list[["component_name"]]

Using the $ sign by component name:

my_list$component_name

Example: Subsetting by double brackets.

my_lst <- list(stock.name = "google", ticker = "GOOGL", hist.prices = c(1500, 1600, 1300, 1200))
# Subsetting the first component: stock.name by position
my_lst[[1]]
# subsetting the same component by name using [[ ]]
my_lst[["stock.name"]]

Exercise:

• Subset the ticker from my_lst using double brackets, both by position and by name.
• Subset the hist.prices from my_lst using the $ sign by name.

my_lst <- list(stock.name = "google", ticker = "GOOGL", hist.prices = c(1500, 1600, 1300, 1200))

# Subsetting ticker using "[[ ]]"
# by name
my_lst[["...."]]

# by position
my_lst[[...]]

# Subsetting hist.prices using "$" sign
my_lst$.....
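
A related point that often causes confusion is the difference between single and double brackets on a list. The sketch below uses a separate, hypothetical list fx_lst so that it does not give away the exercise.

```r
# hypothetical list, separate from my_lst
fx_lst <- list(pair = "EURUSD", rates = c(1.10, 1.12, 1.09))

single <- fx_lst["rates"]     # single brackets: a list of length 1
double <- fx_lst[["rates"]]   # double brackets: the numeric vector itself

class(single)
class(double)

# [[ ]] and $ return the same object
identical(double, fx_lst$rates)
```

Single brackets are useful for keeping several components at once, for example fx_lst[c("pair", "rates")]; double brackets and $ always return one component.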

Subset elements from a component

To subset elements in an R list, the logic is very simple:

1. Access the component from which we would like to choose elements.
2. Choose the elements from that component. The method depends on the structure (vector, matrix, data frame, list, etc.) of the component. You can use all the methods you’ve seen in previous sections.

Syntax:

my_lst[["component_name"]]["element_name"]

It is also possible to subset by position:

my_lst[[component_position]][element_position]

Position and name can be combined:

my_lst$component_name[element_position]

Exercise: Select the 3rd element from the hist.prices component.

my_lst <- list(stock.name = "google", ticker = "GOOGL", hist.prices = c(1500, 1600, 1300, 1200))
# selecting third element:1300
my_lst[["hist.prices"]][3]
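
The same two-step logic applies when the component is not a plain vector. A minimal sketch with a hypothetical list named portfolio that holds a named vector and a matrix (all values are made up):

```r
portfolio <- list(
  owner   = "demo",
  prices  = c(GOOGL = 1100.07, AMZN = 1544.93),
  returns = matrix(c(0.05, 0.03, -0.07, -0.05), ncol = 2,
                   dimnames = list(c("Mon", "Tue"), c("GOOGL", "AMZN")))
)

# named vector component: subset by element name
portfolio[["prices"]]["AMZN"]

# matrix component: reach it with $, then subset with [i, j]
portfolio$returns["Tue", "GOOGL"]

# the same element by position only: 3rd component, row 2, column 1
portfolio[[3]][2, 1]
```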