Chapter 16 Introduction to R

This lab was developed building upon R for Data Science, Smith College SDS 100, and materials by Noli Brazil (UC Davis).

The objectives of the guide are as follows:

  1. Install and set up R and RStudio.

  2. Understand working in R Markdown and Quarto.

This lab guide supplements the material presented in Introduction and Chapter 4 in the textbook R for Data Science (RDS).

16.1 What is R?

R is a free, open-source statistical programming language, useful for data cleaning, analysis, and visualization. R is an interpreted language, not a compiled one: you type a command and R executes it immediately, with no separate compilation step. It is both a command-line tool and a programming environment, and it runs on Windows, macOS, UNIX, and Linux. Because it is open source, anyone is free to distribute, study, change, and improve the software. You can think of it as a free, very large, and very capable calculator. You will be using R to accomplish all data analysis tasks in this class. You might be wondering “Why in the world do we need to know how to use a statistical software program?” Here are the main reasons:

  1. You will be learning about abstract concepts in lecture and the readings. Applying these concepts using real data is an important form of learning. A statistical software program is the most efficient (and in many cases the only) means of running data analyses, not just in the cloistered setting of a university classroom, but especially in the real world. Applied data analysis will be the way we bridge statistical theory to the “real world.” And R is the vehicle for accomplishing this.

  2. In order to do applied data analysis outside of the classroom, you need to know how to use a statistical program. There is no way around it as we don’t live in an exclusively pen and paper world. If you want to collect data on soil health, you need a program to store and analyze that data. If you want to collect data on the characteristics of recent migrants, you need a program to store and analyze that data.

The next question you may have is “I love Excel [or insert your favorite program]. Why can’t I use that and forget your stupid R?” Here are some reasons:

  1. It is free. Most programs are not;

  2. It is open source, which means the software is community supported. You get help not from some big corporation (e.g., Microsoft for Excel), but from people all around the world who use R. And R has a lot of users, so if you have a problem and pose it to the user community, someone will help you;

  3. It is powerful and extensible (meaning that procedures for analyzing data that don’t currently exist can be readily developed);

  4. It has the capability for mapping data, an asset not generally available in other statistical software;

  5. If it isn’t already there, R is well on its way to becoming the de facto data analysis tool in the social sciences.

R is different from Excel in that it is generally not a point-and-click program. You will be primarily writing code to clean and analyze data. What does writing or sourcing code mean? A basic example will help clarify. Let’s say you are given a data set with 10 rows representing people living in Red Hook, NY. You have a variable in the data set representing individual income. Let’s say this variable is named inc. To get the mean income of the 10 people in your data set, you would write code that would look something like this:

mean(inc)

The command tells the program to compute the mean of the variable inc. If you wanted the sum, you would write sum(inc).
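To make this concrete, here is a minimal sketch you could run yourself (the variable name inc comes from the example above; the ten income values are made up for illustration):

```r
# A hypothetical income variable for 10 people in the data set
# (these values are invented for illustration)
inc <- c(42000, 55000, 38000, 61000, 47000,
         52000, 39000, 70000, 45000, 58000)

mean(inc)  # average income: 50700
sum(inc)   # total income:   507000
```

Typing these lines into the R console prints each result as soon as the command is executed.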

Now, where do you write this command? You write it in a script. A script is basically a text file. Think of writing code as something similar to writing an essay in a word document. Instead of sentences to produce an essay, in a programming script you are writing code to run a data analysis. The basic process of sourcing code to run a data analysis task is as follows:

  1. Write code. First, you open your script file, and write code or various commands (like mean(inc)) that will execute data analysis tasks in this file.

  2. Send code to the software program. You then send some or all of your commands to the statistical software program (R in our case).

  3. Program produces results based on code. The program then reads your commands from the file and executes them, printing the results in its console.
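The three steps above can be sketched with a toy script. Suppose you save the following lines in a file; the file name analysis.R and the data values are assumptions for illustration:

```r
# analysis.R -- a toy script illustrating the write / send / run workflow
# (file name and values are hypothetical)

# Step 1: write code -- define (or, in practice, read in) some data
inc <- c(42000, 55000, 38000)

# Step 2: send the code to R, e.g. line by line or via source("analysis.R")

# Step 3: R executes the commands and prints results in the console
print(mean(inc))
```

Wrapping the result in print() ensures it appears in the console even when the whole file is run with source(), which by default executes commands without echoing their values.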

I am skipping over many details, most of which are dependent on the type of statistical software program you are using, but the above steps outline the general workflow. You might now be thinking that you’re perfectly happy pointing and clicking your mouse in Excel (or wherever) to do your data analysis tasks. So, why should you adopt the statistical programming approach to conducting a data analysis?

  1. Your script documents the decisions you made during the data analysis process. This is beneficial for many reasons.
  • It allows you to recreate your steps if you need to rerun your analysis many weeks, months or even years in the future.

  • It allows you to share your steps with other people. If someone asks you what were the decisions made in the data analysis process, just hand them the script.

  • Related to the above points, a script promotes transparency (here is what I did) and reproducibility (you can do it too). When you write code, you are forced to explicitly state the steps you took to do your research. When you do research by clicking through drop-down menus, your steps are lost, or at least documenting them requires considerable extra effort.

  2. If you make a mistake in a data analysis step, you can go back, change a few lines of code, and poof, you’ve fixed your problem.

  3. It is more efficient. In particular, cleaning data can encompass a lot of tedious work that can be streamlined using statistical programming.