
Welcome

Welcome to our first interactive tutorial and the first week of our course Data Analysis and Visualization with R at the European University Viadrina. For more information, please check out the course page on Moodle.

In this tutorial, we introduce you to the programming language R and show you how to access it using the powerful Integrated Development Environment (IDE) RStudio. You will also learn the core features of R for handling data, and first introductory ideas for visualization will assist you on your journey to becoming a true data analyst.

In this week’s tutorial you will learn all about:

  • the role of data analysis today
  • accessing and handling data in R
  • standard programming structures such as if-else, loops and functions
  • dealing with programming errors

Unless otherwise indicated, the contents of this site are licensed under CC BY-SA 4.0 (© Rick Steinert). Embedded third-party content (e.g., YouTube videos) is subject to its own license and terms and is not covered by this Creative Commons license.

1. Introduction

In the first video, we introduce you to the process of analyzing data and how to set up R with RStudio:


In the previous video we discussed the data analysis process and the different steps involved in it.

In detail, we have learned that analyzing data is a multi-stage process, which often involves many departments of a single company. While the entry of data is often outsourced and thus outside the scope of the analyst, the retrieval of raw data is often a first crucial step in the process. Even though retrieval will not be discussed thoroughly in this course, we want to emphasize that in the real world data can be stored in many formats and potentially across many directories and even servers. R is best at dealing with already existing files, which often necessitates the use of other software such as Hadoop’s MapReduce framework before the data can finally be processed with R. Especially if this retrieval step is also outsourced to another department, the data analyst must be aware of potential misrepresentations in the data and clean the data appropriately. The commonly used IT phrase “garbage in, garbage out” refers to the tight connection between cleaning data and producing meaningful results, implying that flawed inputs will almost always lead to inferior results, no matter the skills or tools of the analyst.

The subsequent steps of the data analysis process are usually referred to as the core skills of the analyst. The first is the exploration of data: revealing hidden relations and structures as well as embedding “expert knowledge” into the data. Hidden relations are usually discovered with simple statistics such as correlation or distribution analysis, but can also be revealed by sophisticated piece-wise analysis of nonlinear patterns. Expert knowledge, on the other hand, we define as quality insights from people working in the field. Expert knowledge can be utilized when relations in the data are clear due to, e.g., physical constraints or common sense: a property with an area of -20, for instance, can only be treated as an error in the data and does not reveal any information on the matter, unless the error can be corrected. Using the information discovered so far, we sometimes have to adjust the data yet again, for instance by creating another column of data or deleting erroneous entries.

Second comes the most mathematically demanding step: building models and algorithms tailor-made for the data. These models are specific to the goal at hand, and the right choice will largely determine the success of the modeler. Given the plethora of different modeling approaches, a well-versed analyst is able to reduce the problem to its main components and thus limit the number of models in question. These models are then fitted to the current challenge and often estimated simultaneously to create benchmarks. Eventually, models are fine-tuned and sometimes even combined to create the most powerful model for the task. Occasionally, the modeler has to use the knowledge gained from the benchmarks to adjust the data yet again. However, this is a highly risky strategy as it increases the danger of data snooping, i.e. the discovery of patterns within the data which only exist because the modeler abused the knowledge given to him by the results of the previous modeling step. In such cases, the model is only capable of adjusting to that very set of data and consequently fails to generalize to the true nature of the problem. Statistical inference would thus be rendered meaningless.

Finally, a data analyst does not only aid other departments in implementing the findings in a product. It is of utmost importance that the analyst is also capable of communicating the results in a comprehensible manner. In most cases, the audience of the data analyst’s report is not very knowledgeable about the matter. The analyst must consequently aim for an accessible representation of the data and the relationships discovered. As most humans respond better to graphics than to numbers, the data analyst has to be able to create visually enticing depictions. R is specifically strong in that field, especially when combined with add-ons, which we will learn more about in future tutorials.

To revise your knowledge of the findings from the video and the text above, we kindly ask you to answer the following quiz questions.

1. Accessing data I

After understanding how to set up R with RStudio, we can now start to learn how to handle data:


Even though R natively comes with a lot of useful functions, it is best to extend its functionality with powerful packages. One of these packages, which is widely used in practice, is the tidyverse, loaded via library(tidyverse). While a full explanation of its functionality is beyond the scope of this course, we will explore the basics of this package throughout the upcoming weeks, as it is most beneficial when accessing and handling data. For those who are eager to learn all the details, we kindly refer to R for Data Science, written by the creator of the package, Hadley Wickham, together with Garrett Grolemund.
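
If you have never used the tidyverse before, it has to be installed once; afterwards it only needs to be loaded in each new R session. A minimal sketch:

# install the package once (uncomment if it is not yet installed)
# install.packages("tidyverse")
# load the package in every new R session
library(tidyverse)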

As we have mentioned previously, data can come from a variety of sources and can be stored in a variety of formats. For now, however, we want to assume that we have already prepared our data and are able to load it into our global environment. In this part of the course we will use the diamonds data, which contains the prices of diamonds of different size, color and shape. As this data already ships with the tidyverse, we first load the library by typing library(tidyverse). Afterwards we can include the data by using the command data("diamonds"). We can get a nice overview of the data by typing the following:

data("diamonds")
head(diamonds)

In this overview, which gives us the first 6 entries of our data, we can learn about the different columns of the data, which contain information on each of the diamonds. Each row of the data corresponds to a single diamond. If we are interested in a certain diamond, let us say the 15th diamond, we can access it with the following command:

diamonds[15,]

Please be aware that we specifically accessed the 15th diamond by using square brackets []. Also, we typed 15 followed by a comma. The reason is that in order to access data which has rows and columns, we need to specify which rows and columns we want to see, i.e. diamonds[rows,columns]. If we leave one of these fields blank, we will automatically see all rows or all columns, respectively.
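
To illustrate the diamonds[rows, columns] logic, here is a small sketch with a few indexing variants (not part of the original example):

diamonds[15, ]            # the 15th diamond, all columns
diamonds[15, "price"]     # the 15th diamond, only its price
diamonds[1:3, c(1, 7)]    # the first three diamonds, columns 1 (carat) and 7 (price)
diamonds[, "carat"]       # all diamonds, only the carat column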

It is also possible to print out or even store the number of diamonds as well as the number of columns within our data. This is important to get an overview of the magnitude of the data and is often needed for later calculations. For any object that has rows and columns, we can extract the number of rows (in our case the number of diamonds) in the following way:

n <- nrow(diamonds)
n
## [1] 53940

Did you see that we have actually used the letter n to print out the number of rows? The reason for that is that we have used the operator <- to pass information to a variable, which we named n. n is now written into the RAM of your computer at a very specific address. Each time you now type n, your computer will look into the address of n in your RAM and display its value, which, in our case, is exactly the number of rows of the diamonds data!
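
Analogously to nrow(), the built-in functions ncol() and dim() report the number of columns and both dimensions at once; a short sketch:

ncol(diamonds)   # number of columns: 10
dim(diamonds)    # rows and columns at once: 53940 10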

In the video we have also learned that we can combine numbers to access a certain set of rows or columns of data. As we always want to try something new, this time we will use the gapminder data, which contains information on population, life expectancy etc. for many countries in the world. In your first task you should therefore include the package “gapminder”, read in the data gapminder and print out the 2nd to last row together with columns 1 and 5 to the console. Which country did we select here and what is its population?

# call the package

# include the data gapminder

# retrieve and store the number of rows
n <-
# print out the 2nd to last row together with columns 1 and 5
gapminder[?,?]
# call the package
library(gapminder)
# include the data gapminder
data(gapminder)
# retrieve and store the number of rows
n <- nrow(gapminder)
# print out the 2nd to last row together with columns 1 and 5
gapminder[n-1,c(1,5)]

1. Exploring data

To get a feeling for the basic structures within the data, we will now learn which functions of R can help us to explore the data using certain statistics.


In the video we have learned to use built-in R functions such as mean() and sd() to retrieve information about the center and dispersion of our data. Further, we discussed logical operators such as & or |. Of course, when we combine these two approaches, we can analyse our data in much more depth. For instance, we could be interested in the biggest diamond measured by carat for which the price is still lower than 1500. There are two ways of doing that. The first is the standard R way:

max(diamonds$carat[diamonds$price<1500])
## [1] 1.03

Here, diamonds$price<1500 compares the prices of diamonds element-wise to check whether they are smaller than 1500 and returns TRUE whenever that is the case. We apply this result immediately to diamonds$carat by selecting only those entries for which the condition was TRUE. This works because both structures, i.e. the price and the carat, have the same number of entries. R is programmed such that if the first element of the logical structure (price<1500) is TRUE, then the first element of the data structure (carat) is kept. If it were FALSE, the element would not be considered any further. This repeats for every element in the data structure (carat) until the last element is reached, which is also why we require both structures to be of the same length. After selecting only those elements of carat for which the price condition was TRUE, we apply the max() function to print the maximum carat value. Doing this with the tidyverse yields the same result, but the code itself is easier to read:

max(diamonds %>% filter(price<1500) %>% select(carat))
## [1] 1.03

As both variants have their specific merits, both can be found equally often in coding examples. The native R functions are more common in scientific literature, while the tidyverse version is more common in practice.

Another neat thing to know is that the logical result TRUE is what we call a binary result. We will learn much more about data structures in the future, but for now it is important to know that a binary value is either 1 or 0. The logical outcomes TRUE and FALSE correspond to 1 and 0, respectively. This feature can be used to quickly count certain elements, for instance all diamonds which have a price smaller than 1500:

sum(diamonds$price<1500)
## [1] 20010

With the function sum() we simply sum over all elements of the given object. As we know that diamonds$price<1500 creates output in the form of TRUE and FALSE, which correspond to 1 and 0, we can immediately deduce the number of diamonds with a price below 1500.
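
As a small side note (not covered in the video), the same binary trick also works with mean(), which then returns the share instead of the count:

# share of diamonds cheaper than 1500 instead of their count
mean(diamonds$price < 1500)   # roughly 0.37, i.e. about 37 percent of all diamonds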

In case we are interested not just in a few statistics about some columns but in an overview of all of them, we can also use the function summary(). The output immediately gives us information about centrality and dispersion.

summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 

1. Accessing data II

Accessing, rearranging and slicing data is needed to create precise statistics about your data. In the following video you will therefore learn more about that:


There are several ways to access certain data specified by a common logic or a search criterion. One way we discovered in the previous video was using a logical operator such as > to obtain TRUE and FALSE values. Another way of extracting the correct subset of data is by using an index. An index represents the row or column number in which we can find the data we were looking for. For this, we can use the function which():

which(diamonds$depth>72)
## [1] 41919 46680 51929 52861 52862 53541

The function returns the number of each row in which it finds a diamond with a depth greater than 72. These numbers represent an index, i.e. they give the position of the entry where the logical comparison succeeded. We can apply the which() function in the same way as the direct comparison:

diamonds$depth[which(diamonds$depth>72)]
## [1] 78.2 73.6 72.2 79.0 79.0 72.9
diamonds$depth[diamonds$depth>72]
## [1] 78.2 73.6 72.2 79.0 79.0 72.9

The result is exactly the same. The function match() is closely related to the function which(); however, it can quickly be used to search for many objects, for instance keywords, in data and return their positions. It also differs in that match() will only provide you with the first occurrence of each keyword instead of returning all matches.
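
A minimal sketch (with a made-up vector) of this difference between match() and which():

x <- c("A", "E", "E", "B")
match("E", x)      # returns only the first occurrence: 2
which(x == "E")    # returns all occurrences: 2 3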

If we only wanted to display the column numbers of the columns price and table of the diamonds data, we would have to use which() as follows:

c(which(colnames(diamonds)=="price"),which(colnames(diamonds)=="table"))
## [1] 7 6

The much shorter version, however, is achieved with match():

match(c("price","table"),colnames(diamonds))
## [1] 7 6

If you still wonder why there are both which() and match(), solve the following task. All capital letters of the alphabet can be found natively in the data LETTERS in R, i.e. LETTERS[1] yields “A”. You are now supposed to output the positions of the letters “N,E,T,W,O,R,K” in the alphabet.

# create the search word data
word <- c("N","E","T","W","O","R","K")
# use match to retrieve the positions of the letters of word in the data LETTERS
match()
# use which to retrieve the positions of the letters of word in the data LETTERS
which()
# create the search word data
word <- c("N","E","T","W","O","R","K")
# use match to retrieve the positions of the letters of word in the data LETTERS
match(word,LETTERS)
# use which to retrieve the positions of the letters of word in the data LETTERS
c(which(word[1]==LETTERS),which(word[2]==LETTERS),which(word[3]==LETTERS),which(word[4]==LETTERS),which(word[5]==LETTERS),which(word[6]==LETTERS),which(word[7]==LETTERS))

The video also provided a condensed overview of how to access data using the tidyverse. In this example, we want to take another look at the mutate() functionality of the library. For that, we will use the gapminder data:

summary(gapminder)
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 

Sometimes in data analysis it is important to use expert knowledge to manually create new variables to investigate. These variables are usually transformations of existing variables; for instance, when measuring overall levels of health it is beneficial to calculate the amount of fat per total body weight and not fat alone. In the gapminder data, which displays different socio-economic features of countries, we might be interested in the quality of life measured as life expectancy. However, to understand the magnitude of a country’s level of life expectancy it is reasonable to weight the life expectancy with respect to the country for which the life expectancy is highest. This new number, e.g. 0.5, would immediately inform us about the relative performance of that country, e.g. people only live half as long as in the best country. We can achieve this using the function mutate():

gapminder <- gapminder %>% mutate("lifeExpRel" = lifeExp / max(lifeExp))
summary(gapminder)
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap          lifeExpRel    
##  Min.   :6.001e+04   Min.   :   241.2   Min.   :0.2857  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1   1st Qu.:0.5835  
##  Median :7.024e+06   Median :  3531.8   Median :0.7350  
##  Mean   :2.960e+07   Mean   :  7215.3   Mean   :0.7200  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5   3rd Qu.:0.8577  
##  Max.   :1.319e+09   Max.   :113523.1   Max.   :1.0000  
## 

Eventually we might be interested in only a certain subset of the data. Instead of using the function which() as we have learned before, we can also use the functions of the tidyverse, which are clearer and easier to read. Let us assume we want to see a snapshot of the data that holds all Asian and European countries for which the life expectancy was higher than 60 years in the year 1987. How can we achieve that in the tidyverse?

gapminder_new <-
  gapminder %>% filter((continent == "Europe" |
                          continent == "Asia") &
                         (year == 1987) & (lifeExp > 60))
summary(gapminder_new)
##                    country      continent       year         lifeExp     
##  Albania               : 1   Africa  : 0   Min.   :1987   Min.   :60.14  
##  Austria               : 1   Americas: 0   1st Qu.:1987   1st Qu.:67.83  
##  Bahrain               : 1   Asia    :25   Median :1987   Median :71.52  
##  Belgium               : 1   Europe  :30   Mean   :1987   Mean   :71.29  
##  Bosnia and Herzegovina: 1   Oceania : 0   3rd Qu.:1987   3rd Qu.:74.97  
##  Bulgaria              : 1                 Max.   :1987   Max.   :78.67  
##  (Other)               :49                                               
##       pop              gdpPercap         lifeExpRel    
##  Min.   :2.447e+05   Min.   :  820.8   Min.   :0.7280  
##  1st Qu.:4.195e+06   1st Qu.: 5178.5   1st Qu.:0.8212  
##  Median :9.915e+06   Median :13822.6   Median :0.8658  
##  Mean   :4.233e+07   Mean   :13807.3   Mean   :0.8631  
##  3rd Qu.:3.831e+07   3rd Qu.:21169.6   3rd Qu.:0.9076  
##  Max.   :1.084e+09   Max.   :31541.0   Max.   :0.9524  
## 

The combination of logical operators with the filter() function allows us to achieve a complex data query without losing readability.

1. Plotting

Communicating data structures is of utmost importance for a successful data analyst. In the following video you will get to know some of the capabilities of R for visualization.


In the previous video we have learned that there are two ways of creating plots in R. One is to use the built-in functionality of R with functions like plot(). The other is to use a library which extends the plotting functionality of R. A very powerful and thus commonly used library is ggplot2, which is already included in the tidyverse. Generally speaking, the built-in plotting functionality works best when we deal with data types native to R such as matrices and vectors. However, when using the data types which come with the tidyverse, i.e. when we use the functionality of the tidyverse, it is more advisable to use ggplot2. Please keep in mind that we will discuss data types in depth in a later tutorial. For now, we will focus on ggplot2, mostly because it is very widespread in practice and makes it comparably easy to generate complex plots.
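
To make the difference tangible, here is a minimal sketch (with made-up data, not part of the course examples) of the same scatter plot once with base R and once with ggplot2:

x <- 1:10
y <- x^2

# base R plotting
plot(x, y)

# ggplot2 equivalent (ggplot2 is loaded together with library(tidyverse))
ggplot(data.frame(x = x, y = y), aes(x = x, y = y)) +
  geom_point()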

For our first exercise we would like to review the histogram as a method to get an overview of the data. In fact, the histogram provides us with information on how the data is spread by giving us the counts for all possible value ranges. We will now take a look at the histogram of a subset of the gapminder data. In detail, we want to show the histogram of the GDP per capita in the year 2007 for the continents Africa, Asia, and Europe using the gapminder data. This can be done as follows:

gap_new <- gapminder %>% filter((continent == "Africa" |
                                   continent == "Asia" |
                                   continent == "Europe") &
                                  (year == 2007))

ggplot(data = gap_new, aes(gdpPercap, group = continent, fill = continent)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

While the histogram provides us with simple counts of the data, we can also smooth the counts to create an estimate of the true density of the data. This will help us to get a more realistic picture of the data’s origin. We achieve that by using the function geom_density(), which automatically estimates the density using a kernel density estimator.

gap_new <- gapminder %>% filter((continent == "Africa" |
                                   continent == "Asia" |
                                   continent == "Europe") &
                                  (year == 2007))

ggplot(data = gap_new, aes(gdpPercap, group = continent, fill = continent)) +
  geom_density(alpha=0.5)

The parameter alpha specifies the transparency level of the different curves and allows us to see even overlapping areas. Try changing the parameter to different levels on your own computer and check the outcome. This density plot gives us further insight into the distribution of income among people from different continents. As a data analyst you should now be able to explain certain aspects of the resulting plot, such as why we seem to have two peaks for the African continent or why the Asian line seems to be quite flat for a GDP per capita of over 15000.
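
As a small sketch of what such an experiment could look like, here is the same plot with a much lower transparency value:

ggplot(data = gap_new, aes(gdpPercap, group = continent, fill = continent)) +
  geom_density(alpha = 0.2)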

Plotting does not only help us to understand the data distribution better, it can also help us discover hidden relationships. Let us take the relationship between GDP per capita and life expectancy for all years and countries as the upcoming example. Can we assume a linear dependency, so that we can use models like linear regression?

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent))

Well, even if you are a master of linear regression, you will still fail to build a high-quality model if you do not consider the data visually. The depicted relationship is clearly not linear; instead, it seems to rise quickly until around a GDP per capita of 15000 and then flattens out or, for Asian countries, even decreases after a certain threshold.

For the final task of this first part of the tutorial you are asked to create a somewhat more complex yet intriguing plot. Plot the relationship of GDP per capita and life expectancy for the year 2007 for all African countries. The size of every dot should represent the population size (pop), i.e. the more people, the bigger the dot. In addition, the color of every dot must be set according to the size of the population (pop) as well. In particular, the least populated country should be colored “orchid”, while the most populated should be “magenta4”. All countries in between receive a color on the linear color scale from “orchid” to “magenta4”. Get familiar with the function scale_colour_gradient() to fulfill the task. See the following plot to understand what your result should look like:

# create the data subset
gap_new <- 
# plot the graph
ggplot() +
  geom_point() +
  scale_colour_gradient() +
  theme()
# create the data subset
gap_new <- gapminder %>% filter(continent == "Africa", year == 2007)
# plot the graph
ggplot(gap_new, aes(x = gdpPercap, y = lifeExp, color = pop)) +
  geom_point(aes(size = pop)) +
  scale_colour_gradient(low = "orchid", high = "magenta4") +
  theme(legend.position = "none")

2. Read CSV-data

In this video you will learn how to import external data into R. As covering the many different data formats would be a course of its own, we will primarily focus here on the standard format for data outside of a database, which is CSV.


CSV is one of the most common and simple formats for data. It stores data separated by commas and uses a period as the decimal delimiter. Because of that it is able to store large amounts of data with a fair requirement for storage space. Proprietary document types like XLS or XLSX as well as open-source types such as ODS can also be imported into R, but this usually comes with some tedious work. It is much more advisable to convert the data into CSV using the corresponding program before using R. R also has some interfaces to structured data sources like databases or JSON, which we will not consider in this part of the course. Generally speaking, the CSV format is usually enough when working on smaller projects like a thesis, a Kaggle competition or a common case study. In practice, when multitudes of data need to be integrated and selected from different sources, database management systems and similar file formats are more common. Data from GET requests to some internet databases are often stored as JSON or XML.
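
A minimal sketch of reading a CSV file, assuming a hypothetical file my_data.csv in your working directory (the file name is only an example and not part of the course data):

# tidyverse (readr) reader, returns a tibble
my_data <- read_csv("my_data.csv")
# base R alternative, returns a data.frame
my_data_base <- read.csv("my_data.csv")
head(my_data)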

2. If-else and functions

For those of you who are already familiar with programming, the following will mostly be a repetition of very basic programming principles: branching (conditions) as well as writing your own functionality using functions. Everyone else will now learn a core concept of almost all programming languages, which will be useful for any upcoming coding challenge you might face. Please enjoy.


Using conditions is a simple yet very powerful feature of programming languages. It can help us to apply simple logic to our decision making and, in a sense, teach the computer how to make these decisions for us. In the early times of computers, these were the only means of allowing a computer to learn anything. Nowadays, however, machine learning is mostly done using sophisticated algorithms, which you will also learn about on your way to becoming a data analyst. The simplicity of if-else constructions does have some drawbacks. First, the code can become quite messy and suffer from a lack of readability when if-else conditions are followed by many more if-else conditions. Further, and this is a bit more of a technical remark, switching between different code chunks depending on a variable check is a comparatively CPU-intensive task, as it requires a lot of instructions in the low-level language into which the R code is eventually translated. It is therefore advisable to avoid if-else whenever possible, for instance by using smart logical comparisons.
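
A minimal sketch of such a branching construct, using the diamond prices from earlier as an example:

price <- 1200
if (price < 1500) {
  print("This diamond is comparatively cheap.")
} else {
  print("This diamond is comparatively expensive.")
}
# the vectorised ifelse() often replaces a loop full of if-else checks
head(ifelse(diamonds$price < 1500, "cheap", "expensive"))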

Another great and widespread tool in programming is writing your own functions. Individual functions can be used whenever a certain predefined procedure needs to be executed multiple times. Instead of repeating the same instructions over and over and thus artificially bloating the number of lines, you can save your preferred method once and then call it whenever needed. In R, functions are very often used to calculate certain statistics, such as the mean. The arithmetic mean is calculated as:

\[\frac{1}{n} \sum_{i=1}^{n}x_i\, ,\]

where \(x_i\) is the i-th observation and \(n\) is the number of all observations. Even though this function already exists in R, we could write our own mean function as follows:

my_mean <- function(x){
  n <- length(x)
  result <- 1/n * sum(x)
  return(result)
}
my_mean(x=1:5)
## [1] 3

The round brackets of a function always contain the arguments to that function. Please note that here we did not specify \(n\) as a parameter, as it can be calculated from \(x\). The curly brackets contain the function body, i.e. the actual calculation. A function should end with a return() statement to make explicit which value is handed back (technically, R returns the result of the last evaluated expression even without it). Further, R evaluates a function call in its own environment, which means that no matter what you pass into the function, it will never change the input variables in your global environment. This is easily demonstrated with an example:

change_value <- function(x){
  x[1] <- "My"
  return(x)
}
x <- c("Hello","World")
change_value(x=x)
## [1] "My"    "World"
x
## [1] "Hello" "World"

The value of the original variable x has stayed the same even after applying our function. In order to overwrite x we would have needed to write x <- change_value(x).
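
For completeness, this is what that overwrite would look like:

x <- change_value(x = x)
x
## [1] "My"    "World"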

In R, you can view the source code of any function simply by typing the function name without the brackets. For our my_mean() function this leads to:

my_mean
## function (x) 
## {
##     n <- length(x)
##     result <- 1/n * sum(x)
##     return(result)
## }
## <environment: 0x62f68f74cd10>

R already comes with many built-in functions, such as sd() for the standard deviation, whose source can also be viewed in this way:

sd
## function (x, na.rm = FALSE) 
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
##     na.rm = na.rm))
## <bytecode: 0x62f697d40f30>
## <environment: namespace:stats>

It shows us that the standard deviation is calculated as the square root sqrt() of the variance var(). However, it first adjusts the argument x according to its current data type, which we will explain in a later tutorial. The information “namespace” tells us the origin of the function, i.e. from which package that function comes. Finally, some functions in R are actually written in C (another programming language) in order to allow for fast calculation. This is usually indicated by a .Call() or .Internal() statement. In this case, you cannot see the source code immediately; instead, you have to browse the R sources under “/src” for the implementation.

2. Loops

Loops are a very frequently used tool to repeatedly apply a certain task until a certain criterion is met or a certain quantity is reached. They give you access to the extreme speed of modern CPUs, which exceeds the calculation capabilities of a human by a large margin.


As you have seen, in R we mostly use only two types of loops, the for() and the while() loop. In real applications of data analysis, we often have to combine branching (if-else) and loops (for and while) with individually created functions. Hence, we now want to practice the application of these methods by investigating real-world data. We will focus on the returns of companies in the S&P 500, an index which contains the 500 biggest listed companies in the USA. The data is stored in the variable sp500 and looks as follows:

head(sp500)

The data contains the daily simple rates of return \(R_t=\frac{P_t-P_{t-1}}{P_{t-1}}\) of all companies of the S&P 500 which were included in the index from the beginning of January 1998 to the end of December 2018. A return of 0.01 would therefore coincide with a gain of 1 percent on that day, while -0.01 would indicate a loss of 1 percent. In finance, a common benchmark criterion to compare assets is the Sharpe-Ratio. The Sharpe-Ratio compares investments by their potential gain in relation to the risk the asset is endowed with. It also has some optimality properties and a tight relation to the CAPM, which we will not discuss here further. The Sharpe-Ratio of an asset \(i\) is calculated as:

\[SR_i=\frac{\mu_i-r_f}{\sigma_i},\]

where \(\mu_i\) and \(\sigma_i\) are the mean and standard deviation of the returns of asset \(i\). \(r_f\) represents the risk-free rate of return, i.e. the return of an asset which has no risk (\(\sigma_{rf}=0\)).
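
As a side note, here is a minimal sketch of how such simple returns would be computed from a price series (the prices are made up and not part of the sp500 data):

prices  <- c(100, 101, 99, 102)            # hypothetical prices
returns <- diff(prices) / head(prices, -1) # (P_t - P_{t-1}) / P_{t-1}
returns                                    # approximately 0.01 -0.0198 0.0303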

Your task is now to write the Sharpe-Ratio as a function in R. The user should be allowed to specify the returns of a single asset as well as the risk-free rate of return. Then, use an appropriate loop to calculate the Sharpe-Ratios of all assets in the sp500 data. Assume \(r_f=0\) for this task.

## create the function
SR <- function(returns,rf){
  mean_r <-
  std_r <-
  SR <-
  return(SR)
}
## create empty data where we later store each SR
SR_assets <- c()
## get all companies
company_names <- 
## loop through all assets
for(? in ?:?){
  SR_assets[?] <- 
}
## print the result
SR_assets
## create the function
SR <- function(returns,rf){
  mean_r <- mean(returns)
  std_r <- sd(returns)
  SR <- (mean_r-rf)/std_r
  return(SR)
}
## create empty data where we later store each SR
SR_assets <- c()
## get all companies
company_names <- colnames(sp500)[-1]
## loop through all assets
for(i in 1:length(company_names)){
  SR_assets[i] <- SR(pull(sp500,company_names[i]),rf=0)
}
## print the result
SR_assets

Just as in the video, we now want to buy the assets which fulfill a certain criterion. Here, we want to buy all assets which have a Sharpe-Ratio greater than 0.04. In each of those, we will invest exactly 1000. We can assume that we have no budget constraint, i.e. we can potentially buy all assets which meet the criterion. How much money will we have invested in total? Well, we could check that by using a for loop together with an if condition:

SR_assets <- c()
company_names <- colnames(sp500)[-1]
for(i in 1:length(company_names)){
  SR_assets[i] <- SR(pull(sp500,company_names[i]),rf=0)
}

total_investment <- 0

for(i in 1:length(SR_assets)){
  if(SR_assets[i]>0.04){
    total_investment = total_investment + 1000
  }
}

total_investment
## [1] 11000

However, the more you advance in programming, the more you will understand that writing code like that is not very efficient. First, we needed two loops and an if condition to fulfill just two simple tasks. Second, as mentioned briefly earlier on, we should avoid if conditions whenever possible due to their high CPU-time. Something similar holds true for loops: in terms of CPU-time it is almost always advisable to use built-in functions of R or to use slicing, i.e. providing the correct indices beforehand. In the following example we would therefore like to show you how to fully replace the code of the previous code box without using loops and if conditions. We will only use built-in functions as well as math to solve the tasks.

sum((apply(sp500[,-1],2,SR,rf=0)>0.04)*1000)
## [1] 11000

We would be happy if you take the time to discover the ideas behind this construction, as it would truly show that you mastered the materials in this tutorial.

2. Error handling

A German phrase states “Wer arbeitet, macht Fehler”, meaning that (only) those who work (can) make mistakes. So it is very likely that throughout this tutorial you have already encountered some errors. In the following video we will show you how to deal with them:


When encountering warnings, but even more importantly errors, you should be able to isolate the issue. In addition, you will very often receive a message telling you where the program stopped working. However, sometimes you will not be fully able to understand the error message, be it due to a lack of knowledge of the internal processes of R or due to a misunderstanding of how your code works. Especially in the first case, the quickest and most powerful solution is to simply copy the message, paste it into a search engine and find answers. It is highly likely that at your current stage of knowledge someone else has already had the same error and posted the message to a forum waiting for an answer. Helpful forums for coding are Stack Overflow, Quora, GitHub and others. Further, you should keep in mind that it is a bad idea to paste your own code into the search engine when looking for an answer. The code you have written, i.e. the specific variable names and commands, is specific to your own style and will therefore almost surely lead to no results. It is much better to search for the shortest version of the error message followed by what you did in general. Here is a small example.

left_side <- "A"
right_side <- 2 

If you now try to execute left_side + right_side, R will output the error “non-numeric argument to binary operator”. In order to find out what is wrong with your code, you should use a search engine with either only the error message or the error message in combination with what you wanted to do, i.e. adding a letter to a number. You will then very likely find a good explanation of why that is not possible and what to do to circumvent it. Do not, on the other hand, add your code left_side + right_side to the search, as it will be impossible for the search engine to understand that you have hidden a letter and a number behind these two variables.
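
Besides searching for the message, R also lets you handle such problems directly in the code with tryCatch(), which the final task below makes use of. A minimal sketch for the error above:

result <- tryCatch(
  left_side + right_side,
  error = function(e) {
    message("Caught an error: ", conditionMessage(e))
    NA   # fall back to a missing value instead of stopping the script
  }
)
result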

To conclude this section on warnings and errors as well as this week’s tutorial, we have a final task for you.

Take the first column of the data sp500, which contains the returns of the company Pfizer. Then, write a loop which iterates through all returns of the company and saves the square root sqrt() of each return in a separate variable. If a warning message occurs, the loop should instead calculate the square root of the absolute value abs().

## get pfizer
pfizer_ret <- pull(sp500,"PFIZER")
## create the sqrt data
sqrt_ret <- c()
## write what happens when the warning occurs
warning_func <- function(w) {
  return(?)
}
## create the loop
for (? in ?:?) {
  sqrt_ret[?] <- 
}
## print the result
sqrt_ret
## get pfizer
pfizer_ret <- pull(sp500,"PFIZER")
## create the sqrt data
sqrt_ret <- c()
## write what happens when the warning occurs
warning_func <- function(w) {
  return(sqrt(abs(pfizer_ret[i])))
}
## create the loop
for (i in 1:length(pfizer_ret)) {
  sqrt_ret[i] <- tryCatch(sqrt(pfizer_ret[i]),warning = warning_func)
}
## print the result
sqrt_ret

We hope you have learned a lot in this tutorial and are eager to see you next week!
