Research Design & Data – MSDA IFP

Week 08 – Tutorial 01

Part 1: Your First R Markdown File

To produce short reports that combine code, output, and written explanation, we will use R Markdown.

1 Creating and running code in an .Rmd file

Create a new R Markdown document via:

File > New File > R Markdown…

Give your document a title (e.g. Intro Lab) and enter your name. Leave HTML as the default output format, then click OK.

RStudio creates a template file. Delete everything below the first code chunk so that you begin with a clean document.

Adding your first code chunk

Insert a new R code chunk using either:

  • Insert > R, or

  • Ctrl + Alt + I (Windows) / Option + Cmd + I (macOS)

Inside the chunk, type:

# Packages used in this document
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Running the chunk may produce several informational messages — this is normal.

Comments in code

Inside a code chunk, # creates a comment which R ignores. Use comments to explain what your code is doing.

2 Writing text in an .Rmd file

Outside code chunks you can write ordinary text.

Headings are created using #:

# R code examples

When the .Rmd file is rendered, this becomes a formatted heading.

Recall

  • Inside a code chunk: # creates a comment

  • Outside a code chunk: # creates a heading

Task

Choose a few of the symbols below and, in your .Rmd, write:

  1. a short explanation of what each one does

  2. an example inside a code chunk

Available symbols:

These symbols are called operators in R. They are used to perform calculations, assign values to objects, and compare values when analysing data.

| Symbol | Meaning | When to use it | Example in R |
|--------|---------|----------------|--------------|
| + | Addition | To add numbers or numeric variables | 3 + 5 |
| - | Subtraction | To subtract one number from another | 10 - 4 |
| * | Multiplication | To multiply numbers | 6 * 3 |
| / | Division | To divide numbers | 12 / 4 |
| () | Parentheses | To control the order of operations or group calculations | (3 + 2) * 4 |
| ^ | Exponent (power) | To raise a number to a power | 2^3 |
| <- | Assignment operator | To store a value or result in an object (variable) | x <- 5 |
| < | Less than | To test if one value is smaller than another | 3 < 5 |
| > | Greater than | To test if one value is larger than another | 8 > 2 |
| <= | Less than or equal to | To test if a value is smaller than or equal to another | 5 <= 5 |
| >= | Greater than or equal to | To test if a value is larger than or equal to another | 7 >= 3 |
| == | Equal to | To check whether two values are equal | 4 == 4 |
| != | Not equal to | To check whether two values are different | 3 != 5 |

You might structure it like this:

# Addition
2 + 3
[1] 5
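The same pattern extends to the other operators in the table. The chunk below is a minimal sketch: the object names (grouped, power, doubled, and so on) are made up for illustration and are not part of the tutorial.

```r
# Arithmetic: parentheses are evaluated first
grouped <- (3 + 2) * 4    # 20
power   <- 2^3            # 8

# Assignment: store a value, then reuse it
x <- 5
doubled <- x * 2          # 10

# Comparisons return TRUE or FALSE
is_smaller   <- 3 < 5     # TRUE
is_equal     <- x == 5    # TRUE
is_different <- x != 5    # FALSE

# Typing an object's name prints its value
grouped
power
doubled
is_smaller
is_equal
is_different
```

Note that == (two equals signs) tests for equality, whereas <- stores a value.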

Storing data in R

So far we have stored single numbers using <-, for example:

x <- 5

To store a sequence of numbers, combine them using the c() function.
A sequence of elements of the same type is called a vector.

For example:

myfirstvector <- c(1, 5, 3, 7)
myfirstvector

R prints:

[1] 1 5 3 7

You can carry out arithmetic on every element of the vector:

myfirstvector + 5

Returns:

[1] 6 10 8 12
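Other operators and many functions work on whole vectors in the same way. A short sketch, reusing the vector from above (the names doubled_vector and above_four are illustrative only):

```r
myfirstvector <- c(1, 5, 3, 7)

# Multiplication is applied to every element
doubled_vector <- myfirstvector * 2   # 2 10 6 14

# A comparison returns one TRUE/FALSE per element
above_four <- myfirstvector > 4       # FALSE TRUE FALSE TRUE

# Functions can summarise the whole vector
length(myfirstvector)                 # 4 elements
sum(myfirstvector)                    # 16

doubled_vector
above_four
```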

Vectors can also contain words (“character” data):

wordsvector <- c("cat", "dog", "parrot", "peppapig")
wordsvector

R prints:

[1] "cat" "dog" "parrot" "peppapig"

You may use single or double quotation marks; both work.

You can check the type of an object using class():

class(wordsvector)

Because addition only makes sense for numbers, R will give an error if you try an invalid operation:

wordsvector + 5

Finally, if you mix types inside c(), R converts everything to the most general type (usually characters):

mysecondvector <- c(4, "cat")
mysecondvector

Returns:

[1] "4" "cat"
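To see the conversion for yourself, class() can be applied to each kind of vector. The sketch below uses made-up object names (numbers, words, mixed) and shows one way to turn the text "4" back into a number with as.numeric():

```r
numbers <- c(1, 5, 3, 7)
words   <- c("cat", "dog")
mixed   <- c(4, "cat")

class(numbers)   # "numeric"
class(words)     # "character"
class(mixed)     # "character": the 4 was converted to the text "4"

# The first element of mixed is text, so convert it before doing arithmetic
recovered <- as.numeric(mixed[1]) + 5
recovered        # 9
```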

Putting all of this together, you now know how R stores individual values, sequences of numbers, and sequences of text, and how R decides what type each vector should be. These tools form the foundation for reading, inspecting, and describing real datasets — skills you will now begin applying in your coursework.

Part 2: Instructions

Group Setup

Work through tutorial tasks in groups of up to 5 students.

  • Each week, one person is the driver and the rest are navigators.
  • The driver types the code and edits the report template file.
  • The navigators read carefully, make suggestions, and help with debugging.
  • Rotate the driver each week so that everyone has a turn.

Lab Help and Support

The tutorials are designed with multiple layers of support. If you are unsure or stuck:

  • Raise your hand for help.
  • Hover over the superscript numbers to view quick hints for commands and functions.
  • Scroll to the Worked Example section at the bottom of the tutorial to see a parallel worked solution in R.

Even if you complete the task without the Worked Example, make sure you read it carefully during independent study: it is designed to reinforce the reasoning and R skills you will need for later weeks.

1.1 Data

CollegeScores Dataset

You will work with the CollegeScores dataset in this tutorial.
At the link CollegeScores_teaching.csv you will find information about 400 higher-education institutions in the United States.
The dataset includes variables describing each institution’s location, sector (public/private), tuition costs, enrolment, and the demographic composition of its student body.

You will use this dataset throughout the Stats block for your formative report.
Below is a description of the variables included in the dataset.

| Variable Name | Description |
|---------------|-------------|
| Name | Name of the institution |
| State | U.S. state where the institution is located |
| ID | Unique identifier for the institution |
| Main | Main campus indicator (1 = main campus, 0 = branch campus) |
| Control | Type of institution (Public, Private, For-profit) |
| Region | Geographical region in the U.S. |
| Locale | Setting of the institution (City, Suburb, Town, Rural) |
| Enrollment | Total undergraduate enrolment |
| PartTime | Percent of undergraduates who are part-time students |
| TuitionIn | In-state tuition and fees (USD) |
| TuitionOut | Out-of-state tuition and fees (USD) |
| White | Percent of undergraduates who identify as White |
| Black | Percent of undergraduates who identify as Black |
| Hispanic | Percent of undergraduates who identify as Hispanic |
| Asian | Percent of undergraduates who identify as Asian |
| Other | Percent of undergraduates who identify with another racial/ethnic group |

Tip

When you write about the data in your report, you will describe variables in words, not as code. This table is your reference for names and meanings.

1.2 Tasks

For the Formative Report you will eventually complete several tasks.
This week we focus only on Task 1.

  1. Read the CollegeScores dataset into R, inspect it, and write a concise introduction to the data and its structure.

  2. Display and describe the categorical variables.

  3. Display and describe a selection of numeric variables.

  4. Display and describe at least one relationship between two or three variables.

  5. Finish the report write-up, knit to PDF, and submit.

This tutorial is designed to help you complete Task 1.

1.3 Task 1, sub-tasks

Tip

Hover over the footnotes for hints showing useful R functions.

This week you only need to complete Task 1. The steps below are suggestions to guide your work.

  1. Read the data into R (from the provided CSV file in datasets/) and give the object a sensible name (for example college).1

  2. View the data in RStudio and check that it matches your expectations.2

  3. How many rows (observations) are there in the dataset?3

  4. How many columns (variables) are there?4

  5. What type is each variable in R (character, numeric, factor, etc.)?5

  6. Consider the following four variables from the dataset:

  • Control
  • Region
  • Enrollment
  • TuitionIn

For each variable state:

  • whether it is categorical or numeric
  • its level of measurement (nominal, ordinal, interval, ratio)
  • one example value from the dataset

  7. For several numeric variables (e.g. Enrollment, TuitionIn, TuitionOut), find the minimum, maximum, and mean values.6

  8. Identify which variables contain missing values (if any) and how many.7

  9. Write a short description of the dataset (one paragraph) for a generic reader. You might include:

    • What the data represent.
    • How many institutions and variables are included.
    • The kinds of information recorded (location, enrolment, costs, student characteristics).
    • A brief comment on any clear missing data issues.

You do not need to include R output in the report; you only need the written description. Keep this paragraph; you will re-use it in your report.

2 Worked example

Consider the dataset provided in datasets/HollywoodMovies.csv, containing 1295 observations on the following 15 variables:

| Variable Name | Description |
|---------------|-------------|
| Movie | Title of the movie |
| LeadStudio | Primary U.S. distributor |
| RottenTomatoes | Critics' rating (Rotten Tomatoes) |
| AudienceScore | Audience rating (Rotten Tomatoes) |
| Genre | Film genre (e.g., Action Adventure, Comedy, Thriller) |
| TheatersOpenWeek | Number of screens on opening weekend |
| OpeningWeekend | Opening weekend gross (in millions) |
| BOAvgOpenWeekend | Average box office income per theatre, opening weekend |
| Budget | Production budget (in millions) |
| DomesticGross | U.S. gross income (in millions) |
| ForeignGross | Foreign gross income (in millions) |
| WorldGross | Worldwide gross income (in millions) |
| Profitability | Worldwide gross as a percentage of budget |
| OpenProfit | Percentage of budget recovered on opening weekend |
| Year | Year of release |

These data were compiled from Box Office Mojo, The Numbers, and Rotten Tomatoes.

We load the tidyverse package as we will use the functions
read_csv() and glimpse() from this package.

1. Load tidyverse and import the data

library(tidyverse)

read_csv() reads CSV (comma-separated values) files.
The loaded data are stored in an object called movies using the arrow <-.

movies <- read_csv("HollywoodMovies.csv")

2. Take a first look at the data

head() shows by default the first six rows of the dataset.
Use n = 10 to show more rows (e.g., head(movies, n = 10)).

head(movies)
# A tibble: 6 × 15
  Movie           LeadStudio RottenTomatoes AudienceScore Genre TheatersOpenWeek
  <chr>           <chr>               <dbl>         <dbl> <chr>            <dbl>
1 2016: Obama's … Rocky Mou…             26            73 Docu…                1
2 21 Jump Street  Sony Pict…             85            82 Come…             3121
3 A Late Quartet  Entertain…             76            71 Drama                9
4 A Royal Affair  Magnolia …             90            82 Drama                7
5 Abraham Lincol… Twentieth…             35            51 Horr…             3108
6 Act of Valor    Relativit…             27            72 Acti…             3039
# ℹ 9 more variables: OpeningWeekend <dbl>, BOAvgOpenWeekend <dbl>,
#   Budget <dbl>, DomesticGross <dbl>, WorldGross <dbl>, ForeignGross <dbl>,
#   Profitability <dbl>, OpenProfit <dbl>, Year <dbl>
  • Number of variables
ncol(movies)
[1] 15
  • Number of observations
nrow(movies)
[1] 1295

Alternatively, you could use dim(), which returns both dimensions of the dataset at once, in this order:

  • number of rows (observations)
  • number of columns (variables)

dim(movies)
[1] 1295   15
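If you want to practise these functions without the movies file, a tiny dataset can be built from inline text. This is only a sketch: it uses base R's read.csv() with its text argument instead of read_csv(), and the three films and the name toy_movies are invented for illustration.

```r
# A made-up three-row CSV; the second film has no Budget recorded
csv_text <- "Movie,Genre,Budget
Film A,Drama,12.5
Film B,Comedy,
Film C,Drama,40"

# read.csv() can read from a character string via the text argument
toy_movies <- read.csv(text = csv_text, stringsAsFactors = FALSE)

nrow(toy_movies)   # 3 observations
ncol(toy_movies)   # 3 variables
dim(toy_movies)    # rows first, then columns: 3 3

# The empty Budget field is read as a missing value (NA)
toy_movies$Budget
```

Because dim() returns rows first and columns second, dim(toy_movies)[1] gives the number of rows and dim(toy_movies)[2] the number of columns.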

glimpse() provides a compact summary of the entire dataset.
It shows:

  • the number of rows and columns
  • each variable’s name
  • its type (e.g., <dbl>, <chr>)
  • a preview of the first few values

This is useful for checking that variables were read in correctly and for spotting problems such as missing values (shown as NA). Its compact layout also makes it well suited to large datasets.

glimpse(movies)
Rows: 1,295
Columns: 15
$ Movie            <chr> "2016: Obama's America", "21 Jump Street", "A Late Qu…
$ LeadStudio       <chr> "Rocky Mountain Pictures", "Sony Pictures Releasing",…
$ RottenTomatoes   <dbl> 26, 85, 76, 90, 35, 27, 91, 56, 11, 44, 93, 63, 87, 9…
$ AudienceScore    <dbl> 73, 82, 71, 82, 51, 72, 62, 47, 47, 63, 82, 51, 63, 9…
$ Genre            <chr> "Documentary", "Comedy", "Drama", "Drama", "Horror", …
$ TheatersOpenWeek <dbl> 1, 3121, 9, 7, 3108, 3039, 132, 245, 2539, 3192, 3, 1…
$ OpeningWeekend   <dbl> 0.03, 36.30, 0.08, 0.04, 16.31, 24.48, 1.14, 0.70, 11…
$ BOAvgOpenWeekend <dbl> 30000, 11631, 8889, 5714, 5248, 8055, 8636, 2857, 449…
$ Budget           <dbl> 3.0, 42.0, NA, NA, 68.0, 12.0, NA, 7.5, 35.0, 50.0, 1…
$ DomesticGross    <dbl> 33.35, 138.45, 1.56, 1.55, 37.52, 70.01, 1.99, 3.01, …
$ WorldGross       <dbl> 33.35, 202.81, 6.30, 7.60, 137.49, 82.50, 3.59, 8.54,…
$ ForeignGross     <dbl> 0.00, 64.36, 4.74, 6.05, 99.97, 12.49, 1.60, 5.53, 9.…
$ Profitability    <dbl> 1334.00, 482.88, NA, NA, 202.19, 687.50, NA, 113.87, …
$ OpenProfit       <dbl> 1.20, 86.43, NA, NA, 23.99, 204.00, NA, 9.33, 32.57, …
$ Year             <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012,…
3. Variable types shown by glimpse()

| Type label | Meaning | Example |
|------------|---------|---------|
| <dbl> | Double (numeric values with decimals) | 3.5, 1000, 12.8 |
| <int> | Integer (whole numbers) | 1, 25, 400 |
| <chr> | Character (text values) | "Public", "Texas" |
| <lgl> | Logical (TRUE/FALSE values) | TRUE, FALSE |
| <fct> | Factor (categorical variable stored as categories) | "Public", "Private" |

For the purposes of this course, you can interpret the main types like this:

  • <dbl> or <int>: numeric variable

  • <chr> or <fct>: categorical variable
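If you prefer a plain list of types, one base-R option is to apply class() to every column with sapply(). The sketch below uses an invented three-column data frame called toy. Note that class() reports a double column as "numeric", whereas glimpse() labels it <dbl>.

```r
# A made-up data frame with one text, one integer, and one decimal column
toy <- data.frame(
  Name     = c("A", "B", "C"),
  Enrolled = c(120L, 85L, 240L),   # the L suffix makes these integers
  Fee      = c(9.5, 12.0, 7.25),
  stringsAsFactors = FALSE
)

# Apply class() to each column in turn
column_types <- sapply(toy, class)
column_types
# Name: "character", Enrolled: "integer", Fee: "numeric"
```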

4. Summary statistics

The function summary() provides a quick statistical overview of data.

It can be used in two ways:

  1. Using summary() on an entire dataset

When you apply summary() to a dataset, R produces quick descriptive statistics for every variable, which makes it useful for a fast overview.
For numeric variables, it shows the minimum, maximum, median, mean, and quartiles.
For categorical variables, it shows frequency counts for each category, provided they are stored as factors.

summary(dataset) → overview of all variables

summary(movies)
    Movie            LeadStudio        RottenTomatoes  AudienceScore  
 Length:1295        Length:1295        Min.   : 0.00   Min.   :10.00  
 Class :character   Class :character   1st Qu.:33.00   1st Qu.:49.00  
 Mode  :character   Mode  :character   Median :61.00   Median :64.00  
                                       Mean   :57.58   Mean   :62.18  
                                       3rd Qu.:84.00   3rd Qu.:77.00  
                                       Max.   :99.00   Max.   :99.00  
                                       NA's   :6                      
    Genre           TheatersOpenWeek OpeningWeekend    BOAvgOpenWeekend
 Length:1295        Min.   :   1.0   Min.   :  0.020   Min.   :   204  
 Class :character   1st Qu.: 152.5   1st Qu.:  0.845   1st Qu.:  3482  
 Mode  :character   Median :2459.0   Median :  7.600   Median :  6586  
                    Mean   :2008.0   Mean   : 17.541   Mean   : 13400  
                    3rd Qu.:3213.5   3rd Qu.: 20.810   3rd Qu.: 14534  
                    Max.   :4529.0   Max.   :257.700   Max.   :240000  
                                                                       
     Budget       DomesticGross      WorldGross       ForeignGross    
 Min.   :  0.90   Min.   :  1.02   Min.   :   0.74   Min.   :  -0.76  
 1st Qu.: 12.00   1st Qu.:  6.40   1st Qu.:  13.09   1st Qu.:   3.91  
 Median : 30.00   Median : 26.46   Median :  50.37   Median :  21.58  
 Mean   : 51.38   Mean   : 58.16   Mean   : 147.01   Mean   :  88.84  
 3rd Qu.: 65.00   3rd Qu.: 66.44   3rd Qu.: 160.38   3rd Qu.:  89.75  
 Max.   :365.00   Max.   :936.66   Max.   :2068.22   Max.   :1369.54  
 NA's   :239                                                          
 Profitability       OpenProfit           Year     
 Min.   :    2.3   Min.   :   0.05   Min.   :2012  
 1st Qu.:  139.1   1st Qu.:  12.87   1st Qu.:2013  
 Median :  268.9   Median :  31.77   Median :2015  
 Mean   :  435.7   Mean   :  64.50   Mean   :2015  
 3rd Qu.:  483.0   3rd Qu.:  62.59   3rd Qu.:2017  
 Max.   :10176.0   Max.   :3373.00   Max.   :2018  
 NA's   :239       NA's   :239                     

You probably will not understand every part of the summary output yet, and that is completely fine. For now, focus on the minimum, maximum, mean, and the general spread of the numeric variables.

For numeric variables, the output includes:

  • Min – smallest value

  • 1st Qu. – first quartile (25% of values are below this)

  • Median – middle value

  • Mean – average

  • 3rd Qu. – third quartile (75% of values are below this)

  • Max – largest value

  • NA’s – number of missing values

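For categorical variables, the output shows the frequency of each category only when the variable is stored as a factor. In the output above, Movie, LeadStudio, and Genre are stored as character, so only Length, Class, and Mode appear for them.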

Using summary() on the full dataset is useful for obtaining a general overview of the entire dataset and quickly spotting things such as missing values or unusual ranges.
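When the only goal is to count missing values, is.na() combined with colSums() gives one count per column. This is a sketch of one possible approach rather than the method the tutorial requires. The data frame toy is hypothetical, filled with the first five Budget and WorldGross values shown by glimpse() above.

```r
# First five Budget and WorldGross values from the glimpse() output
toy <- data.frame(
  Budget     = c(3.0, 42.0, NA, NA, 68.0),
  WorldGross = c(33.35, 202.81, 6.30, 7.60, 137.49)
)

# is.na() marks each missing cell as TRUE; colSums() counts the TRUEs per column
missing_counts <- colSums(is.na(toy))
missing_counts                              # Budget 2, WorldGross 0

# Names of the columns that have at least one missing value
names(missing_counts[missing_counts > 0])   # "Budget"
```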

  2. Using summary() on a single variable

You can also apply summary() to one specific variable by selecting it with the $ operator.

For example:

summary(dataset$variable) → summary of one specific variable

summary(movies$Budget)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.90   12.00   30.00   51.38   65.00  365.00     239 

This produces summary statistics only for the Budget variable.

Using summary() on individual variables is useful when you want to:

  • focus on one variable at a time

  • report statistics such as minimum, maximum, and mean

  • explore a variable more carefully during analysis
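Individual statistics can also be computed one at a time with min(), max(), and mean(). When a variable contains missing values, these functions return NA unless you add na.rm = TRUE. A minimal sketch, using a hypothetical vector budget built from the first six Budget values shown by glimpse():

```r
# First six Budget values from the glimpse() output (two are missing)
budget <- c(3.0, 42.0, NA, NA, 68.0, 12.0)

# Without na.rm = TRUE, a single NA makes the result NA
mean(budget)                 # NA

# na.rm = TRUE drops the missing values before calculating
min(budget, na.rm = TRUE)    # 3
max(budget, na.rm = TRUE)    # 68
mean(budget, na.rm = TRUE)   # 31.25
```

summary() drops missing values automatically and reports how many there were, which is why no na.rm argument was needed earlier.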

Variables to Examine:

Say we want to examine the following four variables from the dataset:

  • Genre
  • LeadStudio
  • Budget
  • WorldGross

1. Variable types

# Look at the structure of the dataset
glimpse(movies)
Rows: 1,295
Columns: 15
$ Movie            <chr> "2016: Obama's America", "21 Jump Street", "A Late Qu…
$ LeadStudio       <chr> "Rocky Mountain Pictures", "Sony Pictures Releasing",…
$ RottenTomatoes   <dbl> 26, 85, 76, 90, 35, 27, 91, 56, 11, 44, 93, 63, 87, 9…
$ AudienceScore    <dbl> 73, 82, 71, 82, 51, 72, 62, 47, 47, 63, 82, 51, 63, 9…
$ Genre            <chr> "Documentary", "Comedy", "Drama", "Drama", "Horror", …
$ TheatersOpenWeek <dbl> 1, 3121, 9, 7, 3108, 3039, 132, 245, 2539, 3192, 3, 1…
$ OpeningWeekend   <dbl> 0.03, 36.30, 0.08, 0.04, 16.31, 24.48, 1.14, 0.70, 11…
$ BOAvgOpenWeekend <dbl> 30000, 11631, 8889, 5714, 5248, 8055, 8636, 2857, 449…
$ Budget           <dbl> 3.0, 42.0, NA, NA, 68.0, 12.0, NA, 7.5, 35.0, 50.0, 1…
$ DomesticGross    <dbl> 33.35, 138.45, 1.56, 1.55, 37.52, 70.01, 1.99, 3.01, …
$ WorldGross       <dbl> 33.35, 202.81, 6.30, 7.60, 137.49, 82.50, 3.59, 8.54,…
$ ForeignGross     <dbl> 0.00, 64.36, 4.74, 6.05, 99.97, 12.49, 1.60, 5.53, 9.…
$ Profitability    <dbl> 1334.00, 482.88, NA, NA, 202.19, 687.50, NA, 113.87, …
$ OpenProfit       <dbl> 1.20, 86.43, NA, NA, 23.99, 204.00, NA, 9.33, 32.57, …
$ Year             <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012,…
# Look at the first rows
head(movies)
# A tibble: 6 × 15
  Movie           LeadStudio RottenTomatoes AudienceScore Genre TheatersOpenWeek
  <chr>           <chr>               <dbl>         <dbl> <chr>            <dbl>
1 2016: Obama's … Rocky Mou…             26            73 Docu…                1
2 21 Jump Street  Sony Pict…             85            82 Come…             3121
3 A Late Quartet  Entertain…             76            71 Drama                9
4 A Royal Affair  Magnolia …             90            82 Drama                7
5 Abraham Lincol… Twentieth…             35            51 Horr…             3108
6 Act of Valor    Relativit…             27            72 Acti…             3039
# ℹ 9 more variables: OpeningWeekend <dbl>, BOAvgOpenWeekend <dbl>,
#   Budget <dbl>, DomesticGross <dbl>, WorldGross <dbl>, ForeignGross <dbl>,
#   Profitability <dbl>, OpenProfit <dbl>, Year <dbl>

We use:

  • glimpse() → understand variable types

  • head() → see example values

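| Variable | Type | Level of measurement | Example value |
|----------|------|----------------------|---------------|
| Budget | Numeric | Ratio | 3.0 |
| WorldGross | Numeric | Ratio | 33.35 |
| Genre | Categorical | Nominal | Documentary |
| LeadStudio | Categorical | Nominal | Rocky Mountain Pictures |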

2. Numerical Summaries of the Variables

a) Numeric Variables

summary(movies$Budget)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.90   12.00   30.00   51.38   65.00  365.00     239 
summary(movies$WorldGross)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.74   13.09   50.37  147.01  160.38 2068.22 

The output shows:

  • minimum value

  • first quartile

  • median

  • mean

  • third quartile

  • maximum value

  • number of missing values (if any)

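Reminder: Identifying Ratio vs Interval Variables

| Feature | Interval scale | Ratio scale |
|---------|----------------|-------------|
| Numeric variable | Yes | Yes |
| Equal intervals between values | Yes | Yes |
| Meaningful zero value | No | Yes |
| Ratios are meaningful (“twice as much”) | No | Yes |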

Ratio variables

A ratio variable has a true zero, meaning that a value of zero represents the complete absence of the quantity being measured.

| Variable | Why it is ratio |
|----------|-----------------|
| Budget (film production cost) | $0 means no budget |
| WorldGross (revenue) | $0 means no revenue |
| Enrollment (number of students) | 0 means no students |
| Age | 0 years means no age yet |
| Distance | 0 km means no distance travelled |
| Height | 0 cm means no height |
| Income | £0 means no income |

Example:

A film with a budget of $100 million has twice the budget of one with $50 million.

This type of comparison only makes sense with ratio variables.

Interval variables

An interval variable has equal spacing between values, but zero does not represent the absence of the quantity. Instead, zero is just an arbitrary reference point.

| Variable | Why it is interval |
|----------|--------------------|
| Temperature in Celsius | 0°C does not mean “no temperature” |
| Temperature in Fahrenheit | 0°F is arbitrary |
| Calendar year | Year 0 does not represent absence of time |
| IQ scores | 0 does not mean absence of intelligence |

Example:

20°C is not twice as hot as 10°C.

Even though the numbers are larger, the ratio interpretation does not make sense.

In the Celsius scale:

  • 0°C does not mean “no temperature.”
  • It simply corresponds to the freezing point of water.

Because of this, the scale is shifted relative to the physical quantity being measured (thermal energy). The zero point is arbitrary rather than representing the absence of temperature.

If we convert the same temperatures to Kelvin, which does have a true zero (absolute zero), we can see the difference:

| Celsius | Kelvin |
|---------|--------|
| 10°C | 283 K |
| 20°C | 293 K |

Now compare the ratio:

\[ \frac{293\ \text{K}}{283\ \text{K}} \approx 1.04 \]

So 20°C is only about 4% warmer than 10°C, not twice as hot. This is why Celsius temperature is considered an interval variable, not a ratio variable.
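The same comparison can be checked in R. This sketch uses the exact conversion (Kelvin = Celsius + 273.15) rather than the rounded values in the table; the object names celsius and kelvin are illustrative.

```r
celsius <- c(10, 20)
kelvin  <- celsius + 273.15          # 283.15 and 293.15

# Ratio on the Celsius scale: misleadingly suggests "twice as hot"
celsius_ratio <- celsius[2] / celsius[1]
celsius_ratio                        # 2

# Ratio on the Kelvin scale, which has a true zero
kelvin_ratio <- kelvin[2] / kelvin[1]
round(kelvin_ratio, 3)               # 1.035, i.e. roughly 4% higher
```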

b) Categorical Variables

summary(movies$Genre)
   Length     Class      Mode 
     1295 character character 

The output of summary(movies$Genre) is not very informative because Genre is currently stored as character text rather than as a factor. It reports only the length of the vector (1295 observations), its class (“character”), and its mode; it gives no counts or frequencies for the individual genres.

R does not treat character variables as categorical data by default, so summary() cannot yet show how many films fall into each genre. To obtain a more informative summary, convert the variable to a factor, which is R’s data type for representing categorical variables.

Convert categorical variables (character) to factors

What does factor() do?

factor() is a function in R that turns a variable into a factor, which is R’s way of storing categorical data (groups, labels, categories).

Many datasets import text categories as "character" data.
Using factor() tells R that these values represent categories, not free-text.

movies$Genre <- factor(movies$Genre)
movies$LeadStudio <- factor(movies$LeadStudio)
movies$Year <- factor(movies$Year)

For example:

  • movies$Genre contains labels such as "Comedy", "Drama", "Horror".
  • factor(movies$Genre) converts these labels into a categorical variable.

We use:

movies$Genre <- factor(movies$Genre)

to replace the original column with a factor version of itself.
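The effect of factor() is easy to see on a small example. The sketch below uses an invented five-element vector called genre, not the movies data. It also shows table(), a base-R function that counts categories directly, as an alternative to converting first.

```r
# A made-up character vector of genres
genre <- c("Drama", "Comedy", "Drama", "Horror", "Drama")

# As character: summary() reports only Length, Class, and Mode
summary(genre)

# As a factor: the distinct labels become levels, sorted alphabetically
genre_factor <- factor(genre)
levels(genre_factor)      # "Comedy" "Drama" "Horror"

# summary() now counts the observations in each level
genre_counts <- summary(genre_factor)
genre_counts              # Comedy 1, Drama 3, Horror 1

# table() gives the same counts without converting the vector
table(genre)
```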

The $ operator

  • The $ symbol is used to select a single column from a dataset.

For example,
movies$Genre means “the Genre column inside the movies dataset”.
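As a brief illustration, $ can be used both to read a column and to create a new one. The data frame toy and the column BudgetDoubled below are hypothetical.

```r
# A made-up two-row data frame
toy <- data.frame(
  Genre  = c("Drama", "Comedy"),
  Budget = c(12.5, 40),
  stringsAsFactors = FALSE
)

# Read a single column as a vector
toy$Genre                  # "Drama" "Comedy"

# Use a column in a calculation
toy$Budget * 2             # 25 80

# Assigning to a name that does not exist yet adds a new column
toy$BudgetDoubled <- toy$Budget * 2
ncol(toy)                  # 3
```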

Check the updated structure:

After converting character variables into factors, the output of summary() becomes much more informative for categorical data.

Instead of only showing "Length", "Class", and "Mode", R now displays the number of observations in each category (for example, how many films belong to each genre or studio).

This is why converting categorical variables to factors is useful: it allows R to treat them properly as categories and summarise them in a meaningful way.

summary(movies)
    Movie                             LeadStudio  RottenTomatoes 
 Length:1295        Warner Bros.           :136   Min.   : 0.00  
 Class :character   Universal Pictures     :114   1st Qu.:33.00  
 Mode  :character   Lionsgate              :106   Median :61.00  
                    Twentieth Century Fox  :103   Mean   :57.58  
                    Paramount Pictures     : 76   3rd Qu.:84.00  
                    Sony Pictures Releasing: 72   Max.   :99.00  
                    (Other)                :688   NA's   :6      
 AudienceScore         Genre     TheatersOpenWeek OpeningWeekend   
 Min.   :10.00   Drama    :386   Min.   :   1.0   Min.   :  0.020  
 1st Qu.:49.00   Comedy   :191   1st Qu.: 152.5   1st Qu.:  0.845  
 Median :64.00   Action   :170   Median :2459.0   Median :  7.600  
 Mean   :62.18   Adventure:164   Mean   :2008.0   Mean   : 17.541  
 3rd Qu.:77.00   Thriller :155   3rd Qu.:3213.5   3rd Qu.: 20.810  
 Max.   :99.00   Horror   : 85   Max.   :4529.0   Max.   :257.700  
                 (Other)  :144                                     
 BOAvgOpenWeekend     Budget       DomesticGross      WorldGross     
 Min.   :   204   Min.   :  0.90   Min.   :  1.02   Min.   :   0.74  
 1st Qu.:  3482   1st Qu.: 12.00   1st Qu.:  6.40   1st Qu.:  13.09  
 Median :  6586   Median : 30.00   Median : 26.46   Median :  50.37  
 Mean   : 13400   Mean   : 51.38   Mean   : 58.16   Mean   : 147.01  
 3rd Qu.: 14534   3rd Qu.: 65.00   3rd Qu.: 66.44   3rd Qu.: 160.38  
 Max.   :240000   Max.   :365.00   Max.   :936.66   Max.   :2068.22  
                  NA's   :239                                        
  ForeignGross     Profitability       OpenProfit        Year    
 Min.   :  -0.76   Min.   :    2.3   Min.   :   0.05   2012:169  
 1st Qu.:   3.91   1st Qu.:  139.1   1st Qu.:  12.87   2013:172  
 Median :  21.58   Median :  268.9   Median :  31.77   2014:197  
 Mean   :  88.84   Mean   :  435.7   Mean   :  64.50   2015:190  
 3rd Qu.:  89.75   3rd Qu.:  483.0   3rd Qu.:  62.59   2016:196  
 Max.   :1369.54   Max.   :10176.0   Max.   :3373.00   2017:182  
                   NA's   :239       NA's   :239       2018:189  
summary(movies$Genre)
         Action       Adventure    Black Comedy          Comedy         Concert 
            170             164              22             191               4 
    Documentary           Drama          Horror         Musical Romantic Comedy 
             41             386              85              18              49 
       Thriller         Western 
            155              10 
summary(movies$LeadStudio)
                                    -                                   A24 
                                    3                                    34 
                           Abramorama                          Affirm Films 
                                    2                                     2 
                              Alchemy                        Amazon Studios 
                                    1                                     4 
                     Anchor Bay Films                    Annapurna Pictures 
                                    1                                     7 
                    Arc Entertainment            Atlas Distribution Company 
                                    2                                     2 
                      Aviron Pictures                               BH Tilt 
                                    3                                    10 
                Bleecker Street Media                       Blue Sky Cinema 
                                   19                                     2 
             Briarcliff Entertainment                  Broad Green Pictures 
                                    1                                     8 
                            CBS Films          China Lion Film Distribution 
                                   11                                     1 
                           CineGalaxy                         Cinelou Films 
                                    1                                     2 
                Clarius Entertainment                     Cohen Media Group 
                                    4                                     2 
                      Dimension Films                      Drafthouse Films 
                                    1                                     1 
                           DreamWorks                     EchoLight Studios 
                                    2                                     1 
               Electric Entertainment                     Entertainment One 
                                    2                                     4 
Entertainment Studios Motion Pictures                    Eros International 
                                    5                                    11 
                           EuropaCorp                   Excel Entertainment 
                                    4                                     1 
                         FilmDistrict                        Focus Features 
                                    8                                    46 
                                  Fox              Fox Searchlight Pictures 
                                    1                                    42 
                  Freestyle Releasing           Fun Academy Motion Pictures 
                                    7                                     1 
             FUNimation Entertainment             Global Road Entertainment 
                                    1                                     4 
              Good Deed Entertainment                     Great India Films 
                                    1                                     2 
              Greenwich Entertainment                       Gunpowder & Sky 
                                    1                                     2 
                        GVN Releasing            Hammond Entertainment, LLC 
                                    1                                     1 
                            IFC Films                                  IMAX 
                                   12                                     1 
               Kenn Viselman Presents                      LD Entertainment 
                                    1                                     3 
                      Liberty Studios                             Lionsgate 
                                    1                                   106 
                    Magnolia Pictures             Metro-Goldwyn-Mayer (MGM) 
                                    8                                     3 
             Millennium Entertainment                       Music Box Films 
                                    2                                     4 
                                 Neon                                 Novus 
                                    5                                     1 
                 Open Road Films (II)                        Orion Pictures 
                                   35                                     1 
                         Oscilloscope                       Pantelion Films 
                                    1                                     5 
                   Paramount Pictures                     Paramount Vantage 
                                   76                                     1 
                         Picturehouse                   Purdie Distribution 
                                    1                                     2 
              Pure Flix Entertainment                          Quality Flix 
                                   12                                     2 
                           RADiUS-TWC                      RCR Distribution 
                                    5                                     1 
                     Relativity Media                 Reliance Big Pictures 
                                   26                                     2 
               River Rain Productions                  Roadside Attractions 
                                    1                                    35 
              Rocky Mountain Pictures                           Saban Films 
                                    2                                     1 
                          Screen Gems                    Screen Media Films 
                                   21                                     1 
               Sony Pictures Classics               Sony Pictures Releasing 
                                   44                                    72 
                             Studio 8                     STX Entertainment 
                                    2                                    23 
                         Ten Furlongs                       The Film Arcade 
                                    1                                     1 
                          The Orchard            The Samuel Goldwyn Company 
                                    3                                     5 
                The Weinstein Company                   Trafalgar Releasing 
                                   35                                     1 
                     TriStar Pictures         Triumph Releasing Corporation 
                                   18                                     1 
                Twentieth Century Fox                                Unison 
                                  103                                     1 
                   Universal Pictures                   UTV Motion Pictures 
                                  114                                     8 
               Vertical Entertainment   Walt Disney Studios Motion Pictures 
                                    1                                    72 
                         Warner Bros.                             Weinstein 
                                  136                                     2 
                       Yash Raj Films 
                                    2 

We can now complete the table:

Variable     Type          Level of measurement   Example value
Genre        Categorical   Nominal                Comedy
LeadStudio   Categorical   Nominal                Warner Bros.
Budget       Numeric       Ratio                  80
WorldGross   Numeric       Ratio                  350

Functions used in this section

Function    Purpose
glimpse()   identify variable types
head()      see example values
summary()   calculate statistics
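
As a rough illustration of how these three functions behave, the sketch below runs them on a tiny made-up data frame. The object name `films` and every value in it are hypothetical stand-ins, not rows from the real dataset; in the lab you would apply the same calls to the data you loaded.

```r
library(dplyr)  # provides glimpse() and tibble()

# Hypothetical toy data standing in for the real film dataset
films <- tibble(
  Genre      = c("Comedy", "Action", "Drama", "Comedy"),
  LeadStudio = c("Warner Bros.", "Lionsgate", "Focus Features", "Warner Bros."),
  Budget     = c(80, 150, 12, 35),
  WorldGross = c(350, 600, 40, 90)
)

glimpse(films)           # variable names, types, and the first few values
head(films, 3)           # first three rows (six by default)
summary(films$Budget)    # minimum, quartiles, mean, and maximum
table(films$LeadStudio)  # number of films per studio
```

Character columns such as Genre appear as `<chr>` in the glimpse() output and numeric columns such as Budget as `<dbl>`, which is what supports the Type column in the table above.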

3. Write about the dataset and the variables examined

Example writeup

The dataset contains information on 1,295 Hollywood films released between 2012 and 2018. It includes both categorical characteristics of the films and numerical measures of their financial performance. In this analysis, four variables were examined: Genre, LeadStudio, Budget, and WorldGross. Genre and LeadStudio are categorical variables measured on a nominal scale; they describe the type of film and the lead studio responsible for it. Budget and WorldGross are numeric variables measured on a ratio scale; they represent the production budget and worldwide box office revenue of each film. The summary statistics show that budgets and worldwide revenues vary considerably across films, indicating large differences in production scale and financial performance within the dataset.
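
The figures quoted in a writeup like this can be computed rather than typed by hand. The sketch below is one possible way to do that, using a made-up data frame; the object name `films`, the `Year` column, and all values are assumptions for illustration only.

```r
# Hypothetical toy values; replace `films` with your own dataset object
films <- data.frame(
  Year       = c(2012, 2015, 2018, 2016),
  Budget     = c(80, 150, NA, 35),
  WorldGross = c(350, 600, 40, 90)
)

n_films      <- nrow(films)                       # number of films
year_range   <- range(films$Year)                 # earliest and latest year
budget_range <- range(films$Budget, na.rm = TRUE) # ignore missing budgets
gross_sd     <- sd(films$WorldGross)              # spread of worldwide gross

n_films
year_range
budget_range
gross_sd
```

Setting `na.rm = TRUE` matters here: without it, a single missing budget would make the range come back as NA.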

3. Student Glossary

To conclude the lab, create a glossary of R functions.
Use Word, Excel, OneNote, or any tool you like, and make a simple table with two columns:

  • Function name

  • Brief description of what the function does

This “do it yourself” glossary will help you revise the key ideas from today’s lab and will be extremely useful when completing the weekly quizzes.

Below is a starter example you can expand:

Function   Use and package
read_csv   Reads comma-separated value (CSV) files. Part of the tidyverse (readr).
head       Shows the first few rows (six by default).
nrow       Returns the number of rows in a dataset.
ncol       Returns the number of columns in a dataset.
dim        Returns the number of rows and columns (dimensions).
glimpse    Provides a compact overview of all variables (dplyr).
summary    Gives summary statistics for each variable.
factor     Converts a variable into a categorical factor.

Feel free to add more as you encounter additional functions during the course.
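
When you add an entry, it can help to try the function on a small example first. The sketch below does this for a few of the base R functions in the starter glossary; the data frame `films` and its values are made up for illustration.

```r
# Hypothetical toy data for trying out glossary functions
films <- data.frame(
  Genre  = c("Comedy", "Action", "Drama", "Comedy"),
  Budget = c(80, 150, 12, 35),
  stringsAsFactors = FALSE
)

nrow(films)  # 4 rows
ncol(films)  # 2 columns
dim(films)   # 4 2 (rows first, then columns)

# Convert the character column into a categorical factor
films$Genre <- factor(films$Genre)
levels(films$Genre)   # "Action" "Comedy" "Drama" (sorted alphabetically)
summary(films$Genre)  # number of films in each category
```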


Reference

Lock, Robin H., Patti Frazer Lock, Kari Lock Morgan, Eric F. Lock, and Dennis F. Lock. 2020. Statistics: Unlocking the Power of Data. John Wiley & Sons.

Footnotes

  1. Hint: use read_csv() from the readr package.

  2. Hint: try View(DATA) or head(DATA).

  3. Hint: nrow(DATA) or dim(DATA)[1].

  4. Hint: ncol(DATA) or dim(DATA)[2].

  5. Hint: glimpse(DATA) from the dplyr package or str(DATA).

  6. Hint: summary(DATA$Variable) or min(), max(), mean().

  7. Hint: summary(DATA) will show NAs per variable.