Research Design & Data – MSDA IFP

Week 08 – Tutorial 01

Part 1: Your First R Markdown File

To produce short reports that combine code, output, and written explanation, we will use R Markdown.

1 Creating and running code in an `.Rmd` file

Create a new R Markdown document via:

File > New File > R Markdown…

Give your document a title (e.g. Intro Lab) and enter your name. Leave HTML as the default output format, then click OK.

RStudio creates a template file. Delete everything below the first code chunk so that you begin with a clean document.

Adding your first code chunk

Insert a new R code chunk using either:

Insert > R, or
Ctrl + Alt + I (Windows) / Option + Cmd + I (macOS)

Inside the chunk, type:

# Packages used in this document
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Running the chunk may produce several informational messages — this is normal.

Comments in code

Inside a code chunk, # creates a comment which R ignores. Use comments to explain what your code is doing.

2 Writing text in an `.Rmd` file

Outside code chunks you can write ordinary text.

Headings are created using #:

# R code examples

When the .Rmd file is rendered, this becomes a formatted heading.

Recall

Inside a code chunk: # creates a comment
Outside a code chunk: # creates a heading

Task

Choose a few of the symbols below and, in your .Rmd, write:

a short explanation of what each one does
an example inside a code chunk

Available symbols:

These symbols are called operators in R. They are used to perform calculations, assign values to objects, and compare values when analysing data.

Symbol	Meaning	When to use it	Example in R
`+`	Addition	To add numbers or numeric variables	`3 + 5`
`-`	Subtraction	To subtract one number from another	`10 - 4`
`*`	Multiplication	To multiply numbers	`6 * 3`
`/`	Division	To divide numbers	`12 / 4`
`()`	Parentheses	To control the order of operations or group calculations	`(3 + 2) * 4`
`^`	Exponent (power)	To raise a number to a power	`2^3`
`<-`	Assignment operator	To store a value or result in an object (variable)	`x <- 5`
`<`	Less than	To test if one value is smaller than another	`3 < 5`
`>`	Greater than	To test if one value is larger than another	`8 > 2`
`<=`	Less than or equal to	To test if a value is smaller than or equal to another	`5 <= 5`
`>=`	Greater than or equal to	To test if a value is larger than or equal to another	`7 >= 3`
`==`	Equal to	To check whether two values are equal	`4 == 4`
`!=`	Not equal to	To check whether two values are different	`3 != 5`

You might structure it like this:

# Addition
2 + 3

[1] 5
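
The same pattern works for the other operators in the table. The following is a minimal sketch rather than a model answer; the object names (x, is_bigger, and so on) are made up for illustration.

```r
# Assignment: store the value 5 in an object called x
x <- 5

# Comparison operators return TRUE or FALSE
is_bigger <- x > 3     # is x larger than 3?
is_four   <- x == 4    # is x equal to 4?

# Parentheses change the order of operations
with_brackets    <- (3 + 2) * 4   # the addition happens first
without_brackets <- 3 + 2 * 4     # the multiplication happens first

# Print the results
is_bigger
is_four
with_brackets
without_brackets
```

Here is_bigger is TRUE, is_four is FALSE, with_brackets is 20, and without_brackets is 11.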

Storing data in R

So far we have stored single numbers using <-, for example:

x <- 5

To store a sequence of numbers, combine them using the c() function.
A sequence of elements of the same type is called a vector.

For example:

myfirstvector <- c(1, 5, 3, 7)
myfirstvector

R prints:

[1] 1 5 3 7

You can carry out arithmetic on every element of the vector:

myfirstvector + 5

Returns:

[1] 6 10 8 12
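
Other operators work element by element in the same way, and some functions summarise the whole vector as a single number. A short sketch, assuming the same vector as above (the names doubled, n_values, the_total, and the_mean are illustrative):

```r
myfirstvector <- c(1, 5, 3, 7)

# Arithmetic is applied to each element in turn
doubled <- myfirstvector * 2

# Some functions summarise the whole vector as one number
n_values  <- length(myfirstvector)  # how many elements
the_total <- sum(myfirstvector)     # 1 + 5 + 3 + 7
the_mean  <- mean(myfirstvector)    # total divided by number of elements

doubled
n_values
the_total
the_mean
```

The doubled vector is 2 10 6 14, there are 4 values, their total is 16, and their mean is 4.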

Vectors can also contain words (“character” data):

wordsvector <- c("cat", "dog", "parrot", "peppapig")
wordsvector

R prints:

[1] "cat"      "dog"      "parrot"   "peppapig"

You may use single or double quotation marks; both work.

You can check the type of an object using class():

class(wordsvector)
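
To see what class() reports for different kinds of data, you could compare a few vectors side by side. This is only a sketch; numbersvector and the *_class names are invented for the example.

```r
numbersvector <- c(1, 5, 3, 7)
wordsvector   <- c("cat", "dog", "parrot", "peppapig")

# class() returns the type of the object as text
numbers_class <- class(numbersvector)  # "numeric"
words_class   <- class(wordsvector)    # "character"
logical_class <- class(TRUE)           # "logical"

numbers_class
words_class
logical_class
```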

Because addition only makes sense for numbers, R will give an error (non-numeric argument to binary operator) if you try to add a number to a character vector:

wordsvector + 5

Finally, if you mix types inside c(), R converts everything to the most general type (usually characters):

mysecondvector <- c(4, "cat")
mysecondvector

Returns:

[1] "4" "cat"
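
You can confirm the conversion with class(), and turn text that looks like a number back into a number with as.numeric(). A small sketch (the names mixed_class and four_again are illustrative):

```r
mysecondvector <- c(4, "cat")

# The whole vector is now character, including the 4
mixed_class <- class(mysecondvector)   # "character"

# as.numeric() converts the text "4" back into the number 4,
# so arithmetic works again
four_again <- as.numeric("4") + 1      # 5

mixed_class
four_again
```

This is worth remembering when a numeric column in a dataset is unexpectedly read in as character data.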

Putting all of this together, you now know how R stores individual values, sequences of numbers, and sequences of text, and how R decides what type each vector should be. These tools form the foundation for reading, inspecting, and describing real datasets — skills you will now begin applying in your coursework.

Part 2: Instructions

Group Setup

Work through tutorial tasks in groups of up to 5 students.

Each week, one person is the driver and the rest are navigators.
The driver types the code and edits the report template file.
The navigators read carefully, make suggestions, and help with debugging.
Rotate the driver each week so that everyone has a turn.

Lab Help and Support

The tutorials are designed with multiple layers of support. If you are unsure or stuck:

Raise your hand for help.
Hover over the superscript numbers to view quick hints for commands and functions.
Scroll to the Worked Example section at the bottom of the tutorial to see a parallel worked solution in R.

Even if you complete the task without the Worked Example, make sure you read it carefully during independent study. It is designed to reinforce the reasoning and R skills you will need for later weeks.

1.1 Data

CollegeScores Dataset

You will work with the CollegeScores dataset in this tutorial.
At the link CollegeScores_teaching.csv you will find information about 400 higher-education institutions in the United States.
The dataset includes variables describing each institution’s location, sector (public/private), tuition costs, enrolment, and the demographic composition of their student body.

You will use this dataset throughout the Stats block for your formative report.
Below is a description of the variables included in the dataset.

Variable Name	Description
Name	Name of the institution
State	U.S. state where the institution is located
ID	Unique identifier for the institution
Main	Main campus indicator (1 = main campus, 0 = branch campus)
Control	Type of institution (Public, Private, For-profit)
Region	Geographical region in the U.S.
Locale	Setting of the institution (City, Suburb, Town, Rural)
Enrollment	Total undergraduate enrolment
PartTime	Percent of undergraduates who are part-time students
TuitionIn	In-state tuition and fees (USD)
TuitionOut	Out-of-state tuition and fees (USD)
White	Percent of undergraduates who identify as White
Black	Percent of undergraduates who identify as Black
Hispanic	Percent of undergraduates who identify as Hispanic
Asian	Percent of undergraduates who identify as Asian
Other	Percent of undergraduates who identify with another racial/ethnic group

Tip

When you write about the data in your report, you will describe variables in words, not as code. This table is your reference for names and meanings.

1.2 Tasks

For the Formative Report you will eventually complete several tasks.
This week we focus only on Task 1.

1) Read the CollegeScores dataset into R, inspect it, and write a concise introduction to the data and its structure.

2) Display and describe the categorical variables.

3) Display and describe a selection of numeric variables.

4) Display and describe at least one relationship between two or three variables.

5) Finish the report write-up, knit to PDF, and submit.

This tutorial is designed to help you complete Task 1.

1.3 Task 1, sub-tasks

Tip

Hover over the footnotes for hints showing useful R functions.

This week you only need to complete Task 1. The steps below are suggestions to guide your work.

Read the data into R (from the provided CSV file in datasets/) and give the object a sensible name (for example college).¹
View the data in RStudio and check that it matches your expectations.²
How many rows (observations) are there in the dataset?³
How many columns (variables) are there?⁴
What type is each variable in R (character, numeric, factor, etc.)?⁵
Choose the following four variables from the dataset:

Control
Region
Enrollment
TuitionIn

For each variable state:

whether it is categorical or numeric
its level of measurement (nominal, ordinal, interval, ratio)
one example value from the dataset

For several numeric variables (e.g. Enrollment, TuitionIn, TuitionOut), find the minimum, maximum, and mean values.⁶
Identify which variables contain missing values (if any) and how many.⁷
Write a short description of the dataset (one paragraph) for a generic reader. You might include:
- What the data represent.
- How many institutions and variables are included.
- The kinds of information recorded (location, enrolment, costs, student characteristics).
- A brief comment on any clear missing data issues.

You do not need to include R output in the report; you only need the written description. Keep this paragraph; you will re-use it in your report.

2 Worked example

Consider the dataset provided in datasets/HollywoodMovies.csv, containing 1295 observations on the following 15 variables:

Variable Name	Description
Movie	Title of the movie
LeadStudio	Primary U.S. distributor
RottenTomatoes	Critics' rating (Rotten Tomatoes)
AudienceScore	Audience rating (Rotten Tomatoes)
Genre	Film genre (e.g., Action Adventure, Comedy, Thriller)
TheatersOpenWeek	Number of screens on opening weekend
OpeningWeekend	Opening weekend gross (in millions)
BOAvgOpenWeekend	Average box office income per theatre, opening weekend
Budget	Production budget (in millions)
DomesticGross	U.S. gross income (in millions)
ForeignGross	Foreign gross income (in millions)
WorldGross	Worldwide gross income (in millions)
Profitability	Worldwide gross as a percentage of budget
OpenProfit	Percentage of budget recovered on opening weekend
Year	Year of release

These data were compiled from Box Office Mojo, The Numbers, and Rotten Tomatoes.

We load the tidyverse package because we will use its read_csv() and glimpse() functions.

1. Load tidyverse and import the data

library(tidyverse)

read_csv() reads CSV (comma-separated values) files.
The loaded data are stored in an object called movies using the arrow <-.

movies <- read_csv("HollywoodMovies.csv")
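
If you want to see the whole import step run without the course file, the sketch below writes a tiny CSV file to a temporary location and reads it back. The rows are made up for illustration, and the sketch uses base R's read.csv() so that it runs without any packages; with the tidyverse loaded, read_csv(tmp_file) is called in the same way.

```r
# Write a tiny CSV file to a temporary location (made-up rows)
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("Title,Score",
             "Film A,85",
             "Film B,NA",
             "Film C,76"),
           tmp_file)

# Read the file and store the result in an object called toy
toy <- read.csv(tmp_file)

# Print the data: 3 rows, 2 columns, one missing Score
toy
```

The path you give to read_csv() is interpreted relative to the working directory. If R reports that the file does not exist, check the folder with getwd() and adjust the path accordingly.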

2. Take a first look at the data

head() shows by default the first six rows of the dataset.
Use n = 10 to show more rows (e.g., head(movies, n = 10)).

head(movies)

# A tibble: 6 × 15
  Movie           LeadStudio RottenTomatoes AudienceScore Genre TheatersOpenWeek
  <chr>           <chr>               <dbl>         <dbl> <chr>            <dbl>
1 2016: Obama's … Rocky Mou…             26            73 Docu…                1
2 21 Jump Street  Sony Pict…             85            82 Come…             3121
3 A Late Quartet  Entertain…             76            71 Drama                9
4 A Royal Affair  Magnolia …             90            82 Drama                7
5 Abraham Lincol… Twentieth…             35            51 Horr…             3108
6 Act of Valor    Relativit…             27            72 Acti…             3039
# ℹ 9 more variables: OpeningWeekend <dbl>, BOAvgOpenWeekend <dbl>,
#   Budget <dbl>, DomesticGross <dbl>, WorldGross <dbl>, ForeignGross <dbl>,
#   Profitability <dbl>, OpenProfit <dbl>, Year <dbl>

Number of variables

ncol(movies)

[1] 15

Number of observations

nrow(movies)

[1] 1295

Alternatively, you could use dim().

dim() returns the dimensions of the dataset:
- number of rows (observations)
- number of columns (variables)

dim(movies)

[1] 1295   15
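
Note the order in which dim() reports its results: rows first, then columns. A minimal sketch on a made-up data frame with four rows and three columns (toy and its contents are invented for illustration):

```r
# A small made-up dataset: 4 observations, 3 variables
toy <- data.frame(
  Title = c("Film A", "Film B", "Film C", "Film D"),
  Score = c(85, 90, 76, 35),
  Year  = c(2012, 2012, 2013, 2013)
)

n_obs    <- nrow(toy)  # number of rows (observations): 4
n_vars   <- ncol(toy)  # number of columns (variables): 3
toy_dims <- dim(toy)   # rows first, then columns: 4 3

n_obs
n_vars
toy_dims
```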

glimpse() provides a compact summary of the entire dataset.
It shows:
- the number of rows and columns
- each variable’s name
- its type (e.g., <dbl>, <chr>)
- a preview of the first few values

glimpse() is useful for checking that variables have been read correctly and for spotting problems such as missing values. It works well even for large datasets.

glimpse(movies)

Rows: 1,295
Columns: 15
$ Movie            <chr> "2016: Obama's America", "21 Jump Street", "A Late Qu…
$ LeadStudio       <chr> "Rocky Mountain Pictures", "Sony Pictures Releasing",…
$ RottenTomatoes   <dbl> 26, 85, 76, 90, 35, 27, 91, 56, 11, 44, 93, 63, 87, 9…
$ AudienceScore    <dbl> 73, 82, 71, 82, 51, 72, 62, 47, 47, 63, 82, 51, 63, 9…
$ Genre            <chr> "Documentary", "Comedy", "Drama", "Drama", "Horror", …
$ TheatersOpenWeek <dbl> 1, 3121, 9, 7, 3108, 3039, 132, 245, 2539, 3192, 3, 1…
$ OpeningWeekend   <dbl> 0.03, 36.30, 0.08, 0.04, 16.31, 24.48, 1.14, 0.70, 11…
$ BOAvgOpenWeekend <dbl> 30000, 11631, 8889, 5714, 5248, 8055, 8636, 2857, 449…
$ Budget           <dbl> 3.0, 42.0, NA, NA, 68.0, 12.0, NA, 7.5, 35.0, 50.0, 1…
$ DomesticGross    <dbl> 33.35, 138.45, 1.56, 1.55, 37.52, 70.01, 1.99, 3.01, …
$ WorldGross       <dbl> 33.35, 202.81, 6.30, 7.60, 137.49, 82.50, 3.59, 8.54,…
$ ForeignGross     <dbl> 0.00, 64.36, 4.74, 6.05, 99.97, 12.49, 1.60, 5.53, 9.…
$ Profitability    <dbl> 1334.00, 482.88, NA, NA, 202.19, 687.50, NA, 113.87, …
$ OpenProfit       <dbl> 1.20, 86.43, NA, NA, 23.99, 204.00, NA, 9.33, 32.57, …
$ Year             <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012,…

Variable types shown by glimpse()

Type label	Meaning	Example
`<dbl>`	Double (numeric values with decimals)	`3.5`, `1000`, `12.8`
`<int>`	Integer (whole numbers)	`1`, `25`, `400`
`<chr>`	Character (text values)	`"Public"`, `"Texas"`
`<lgl>`	Logical (TRUE/FALSE values)	`TRUE`, `FALSE`
`<fct>`	Factor (categorical variable stored as categories)	`"Public"`, `"Private"`
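
The types in the table can also be checked in base R by applying class() to every column. The sketch below uses a made-up data frame (the institution names and values are invented) with one column of each type. Note that class() reports "numeric" for the columns that glimpse() labels as double.

```r
# A made-up data frame with one column of each type
toy <- data.frame(
  Name       = c("College A", "College B", "College C"),   # character
  Enrollment = c(1200L, 560L, 8300L),                      # integer (the L suffix)
  Tuition    = c(9500.5, 31200, 12750),                    # double
  Control    = factor(c("Public", "Private", "Public")),   # factor
  MainCampus = c(TRUE, TRUE, FALSE)                        # logical
)

# sapply() applies class() to each column in turn
col_classes <- sapply(toy, class)
col_classes
```

The result is a named vector: "character", "integer", "numeric", "factor", and "logical", in column order.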

For the purposes of this course, you can interpret the main types like this:

<dbl> or <int> → numeric variable
<chr> or <fct> → categorical variable

4. Summary statistics

The function summary() provides a quick statistical overview of data.

It can be used in two ways:

Using summary() on an entire dataset

When you apply summary() to a dataset, R produces quick descriptive statistics for every variable.

For numeric variables, it shows the minimum, maximum, median, mean, and quartiles.
For categorical variables stored as factors, it shows frequency counts for each category. Character variables show only their length, class, and mode.
This is useful for a fast overview of the dataset.

summary(dataset) → overview of all variables

summary(movies)

    Movie            LeadStudio        RottenTomatoes  AudienceScore  
 Length:1295        Length:1295        Min.   : 0.00   Min.   :10.00  
 Class :character   Class :character   1st Qu.:33.00   1st Qu.:49.00  
 Mode  :character   Mode  :character   Median :61.00   Median :64.00  
                                       Mean   :57.58   Mean   :62.18  
                                       3rd Qu.:84.00   3rd Qu.:77.00  
                                       Max.   :99.00   Max.   :99.00  
                                       NA's   :6                      
    Genre           TheatersOpenWeek OpeningWeekend    BOAvgOpenWeekend
 Length:1295        Min.   :   1.0   Min.   :  0.020   Min.   :   204  
 Class :character   1st Qu.: 152.5   1st Qu.:  0.845   1st Qu.:  3482  
 Mode  :character   Median :2459.0   Median :  7.600   Median :  6586  
                    Mean   :2008.0   Mean   : 17.541   Mean   : 13400  
                    3rd Qu.:3213.5   3rd Qu.: 20.810   3rd Qu.: 14534  
                    Max.   :4529.0   Max.   :257.700   Max.   :240000  
                                                                       
     Budget       DomesticGross      WorldGross       ForeignGross    
 Min.   :  0.90   Min.   :  1.02   Min.   :   0.74   Min.   :  -0.76  
 1st Qu.: 12.00   1st Qu.:  6.40   1st Qu.:  13.09   1st Qu.:   3.91  
 Median : 30.00   Median : 26.46   Median :  50.37   Median :  21.58  
 Mean   : 51.38   Mean   : 58.16   Mean   : 147.01   Mean   :  88.84  
 3rd Qu.: 65.00   3rd Qu.: 66.44   3rd Qu.: 160.38   3rd Qu.:  89.75  
 Max.   :365.00   Max.   :936.66   Max.   :2068.22   Max.   :1369.54  
 NA's   :239                                                          
 Profitability       OpenProfit           Year     
 Min.   :    2.3   Min.   :   0.05   Min.   :2012  
 1st Qu.:  139.1   1st Qu.:  12.87   1st Qu.:2013  
 Median :  268.9   Median :  31.77   Median :2015  
 Mean   :  435.7   Mean   :  64.50   Mean   :2015  
 3rd Qu.:  483.0   3rd Qu.:  62.59   3rd Qu.:2017  
 Max.   :10176.0   Max.   :3373.00   Max.   :2018  
 NA's   :239       NA's   :239

You probably will not understand every part of the summary output yet, and that is completely fine. For now, focus on the minimum, maximum, mean, and the general spread of the numeric variables.

For numeric variables, the output includes:

Min – smallest value
1st Qu. – first quartile (25% of values are below this)
Median – middle value
Mean – average
3rd Qu. – third quartile (75% of values are below this)
Max – largest value
NA’s – number of missing values

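For categorical variables stored as factors, the output shows the frequency of each category. The character variables in the output above (Movie, LeadStudio, Genre) show only Length, Class, and Mode.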

Using summary() on the full dataset is useful for obtaining a general overview of the entire dataset and quickly spotting things such as missing values or unusual ranges.
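
If you only want the number of missing values per variable, one possible approach is to combine is.na() with colSums(). The sketch below is self-contained: it retypes the first four rows of three columns from the glimpse() output above rather than loading the full file.

```r
# First four rows of three columns, copied from the glimpse() output
toy <- data.frame(
  Movie      = c("2016: Obama's America", "21 Jump Street",
                 "A Late Quartet", "A Royal Affair"),
  Budget     = c(3.0, 42.0, NA, NA),
  WorldGross = c(33.35, 202.81, 6.30, 7.60)
)

# is.na() marks each missing cell as TRUE;
# colSums() then counts the TRUE values in each column
missing_per_column <- colSums(is.na(toy))
missing_per_column
```

In this small extract, Budget has 2 missing values and the other two columns have none. With the full dataset loaded, the equivalent call would be colSums(is.na(movies)).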

Using summary() on a single variable

You can also apply summary() to one specific variable by selecting it with the $ operator.

For example:

summary(dataset$variable) → summary of one specific variable

summary(movies$Budget)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.90   12.00   30.00   51.38   65.00  365.00     239

This produces summary statistics only for the Budget variable.

Using summary() on individual variables is useful when you want to:

focus on one variable at a time
report statistics such as minimum, maximum, and mean
explore a variable more carefully during analysis
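
If you need just one statistic, functions such as min(), max(), and mean() can be used directly, but they return NA when the variable contains missing values unless you add na.rm = TRUE. A sketch using the first six Budget values shown in the glimpse() output (the object names are illustrative):

```r
# First six Budget values from the glimpse() output (in millions)
budget <- c(3.0, 42.0, NA, NA, 68.0, 12.0)

# Without na.rm = TRUE the result is NA because of the missing values
mean_with_na <- mean(budget)

# na.rm = TRUE removes the missing values before calculating
budget_min  <- min(budget, na.rm = TRUE)   # 3
budget_max  <- max(budget, na.rm = TRUE)   # 68
budget_mean <- mean(budget, na.rm = TRUE)  # (3 + 42 + 68 + 12) / 4 = 31.25

mean_with_na
budget_min
budget_max
budget_mean
```

With the full dataset, the same idea would be written as mean(movies$Budget, na.rm = TRUE).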

Variables to Examine:

Say we want to examine the following four variables from the dataset:

Genre
LeadStudio
Budget
WorldGross

1. Variable types

# Look at the structure of the dataset
glimpse(movies)

Rows: 1,295
Columns: 15
$ Movie            <chr> "2016: Obama's America", "21 Jump Street", "A Late Qu…
$ LeadStudio       <chr> "Rocky Mountain Pictures", "Sony Pictures Releasing",…
$ RottenTomatoes   <dbl> 26, 85, 76, 90, 35, 27, 91, 56, 11, 44, 93, 63, 87, 9…
$ AudienceScore    <dbl> 73, 82, 71, 82, 51, 72, 62, 47, 47, 63, 82, 51, 63, 9…
$ Genre            <chr> "Documentary", "Comedy", "Drama", "Drama", "Horror", …
$ TheatersOpenWeek <dbl> 1, 3121, 9, 7, 3108, 3039, 132, 245, 2539, 3192, 3, 1…
$ OpeningWeekend   <dbl> 0.03, 36.30, 0.08, 0.04, 16.31, 24.48, 1.14, 0.70, 11…
$ BOAvgOpenWeekend <dbl> 30000, 11631, 8889, 5714, 5248, 8055, 8636, 2857, 449…
$ Budget           <dbl> 3.0, 42.0, NA, NA, 68.0, 12.0, NA, 7.5, 35.0, 50.0, 1…
$ DomesticGross    <dbl> 33.35, 138.45, 1.56, 1.55, 37.52, 70.01, 1.99, 3.01, …
$ WorldGross       <dbl> 33.35, 202.81, 6.30, 7.60, 137.49, 82.50, 3.59, 8.54,…
$ ForeignGross     <dbl> 0.00, 64.36, 4.74, 6.05, 99.97, 12.49, 1.60, 5.53, 9.…
$ Profitability    <dbl> 1334.00, 482.88, NA, NA, 202.19, 687.50, NA, 113.87, …
$ OpenProfit       <dbl> 1.20, 86.43, NA, NA, 23.99, 204.00, NA, 9.33, 32.57, …
$ Year             <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012,…

# Look at the first rows
head(movies)

# A tibble: 6 × 15
  Movie           LeadStudio RottenTomatoes AudienceScore Genre TheatersOpenWeek
  <chr>           <chr>               <dbl>         <dbl> <chr>            <dbl>
1 2016: Obama's … Rocky Mou…             26            73 Docu…                1
2 21 Jump Street  Sony Pict…             85            82 Come…             3121
3 A Late Quartet  Entertain…             76            71 Drama                9
4 A Royal Affair  Magnolia …             90            82 Drama                7
5 Abraham Lincol… Twentieth…             35            51 Horr…             3108
6 Act of Valor    Relativit…             27            72 Acti…             3039
# ℹ 9 more variables: OpeningWeekend <dbl>, BOAvgOpenWeekend <dbl>,
#   Budget <dbl>, DomesticGross <dbl>, WorldGross <dbl>, ForeignGross <dbl>,
#   Profitability <dbl>, OpenProfit <dbl>, Year <dbl>

We use:

glimpse() → understand variable types
head() → see example values

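Variable	Type	Level of measurement	Example value
Budget	Numeric	Ratio	`3.0` (millions)
WorldGross	Numeric	Ratio	`33.35` (millions)
Genre	Categorical	Nominal	`"Documentary"`
LeadStudio	Categorical	Nominal	`"Rocky Mountain Pictures"`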

2. Numerical Summaries of the Variables

a) Numeric Variables

summary(movies$Budget)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.90   12.00   30.00   51.38   65.00  365.00     239

summary(movies$WorldGross)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.74   13.09   50.37  147.01  160.38 2068.22

The output shows:

minimum value
first quartile
median
mean
third quartile
maximum value
number of missing values (if any)

Reminder: Identifying Ratio vs Interval Variables

Feature	Interval scale	Ratio scale
Numeric variable	✓	✓
Equal intervals between values	✓	✓
Meaningful zero value	✗	✓
Ratios are meaningful (“twice as much”)	✗	✓

Ratio variables

A ratio variable has a true zero, meaning that a value of zero represents the complete absence of the quantity being measured.

Variable	Why it is ratio
Budget (film production cost)	$0 means no budget
WorldGross (revenue)	$0 means no revenue
Enrollment (number of students)	0 means no students
Age	0 years means no age yet
Distance	0 km means no distance travelled
Height	0 cm means no height
Income	£0 means no income

Example:

A film with a budget of $100 million has twice the budget of one with $50 million.

This type of comparison only makes sense with ratio variables.

Interval variables

An interval variable has equal spacing between values, but zero does not represent the absence of the quantity. Instead, zero is just an arbitrary reference point.

Variable	Why it is interval
Temperature in Celsius	0°C does not mean “no temperature”
Temperature in Fahrenheit	0°F is arbitrary
Calendar year	Year 0 does not represent absence of time
IQ scores	0 does not mean absence of intelligence

Example:

20°C is not twice as hot as 10°C.

Even though the numbers are larger, the ratio interpretation does not make sense.

In the Celsius scale:

0°C does not mean “no temperature.”
It simply corresponds to the freezing point of water.

Because of this, the scale is shifted relative to the physical quantity being measured (thermal energy). The zero point is arbitrary rather than representing the absence of temperature.

If we convert the same temperatures to Kelvin, which does have a true zero (absolute zero), we can see the difference:

Celsius	Kelvin
10°C	283 K
20°C	293 K

Now compare the ratio:

\[ \frac{293\ \text{K}}{283\ \text{K}} \approx 1.04 \]

So 20°C is only about 4% warmer than 10°C, not twice as hot. This is why Celsius temperature is considered an interval variable, not a ratio variable.
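
The same comparison can be checked in R. This sketch uses the exact offset of 273.15 between Celsius and Kelvin, whereas the table above rounds to whole kelvins; the object names are illustrative.

```r
# Two temperatures in Celsius
low_c  <- 10
high_c <- 20

# Convert to Kelvin (Kelvin = Celsius + 273.15)
low_k  <- low_c + 273.15
high_k <- high_c + 273.15

# The ratio on the Celsius scale suggests "twice as hot"...
ratio_celsius <- high_c / low_c   # 2

# ...but on the Kelvin scale, which has a true zero, the ratio is close to 1
ratio_kelvin <- high_k / low_k    # about 1.035

ratio_celsius
ratio_kelvin
```

The two scales describe the same pair of temperatures, yet the ratios differ. Only the Kelvin ratio is meaningful, because only Kelvin has a true zero.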

b) Categorical Variables

summary(movies$Genre)

   Length     Class      Mode 
     1295 character character

The output of summary(movies$Genre) provides only very basic information because the variable is currently stored as character data rather than as a factor. It tells us the length of the vector (1295 observations), its class (“character”), and its mode, but nothing about the categories themselves, such as how many films fall into each genre. This is because summary() does not treat character variables as categorical data. To obtain a more informative summary, convert the variable to a factor, which is R’s data type for categorical variables.

Convert categorical variables (character) to factors

What does `factor()` do?

factor() is a function in R that turns a variable into a factor, which is R’s way of storing categorical data (groups, labels, categories).

Many datasets import text categories as "character" data.
Using factor() tells R that these values represent categories, not free-text.

movies$Genre <- factor(movies$Genre)
movies$LeadStudio <- factor(movies$LeadStudio)
movies$Year <- factor(movies$Year)

For example:

movies$Genre contains labels such as "Comedy", "Drama", "Horror".
factor(movies$Genre) converts these labels into a categorical variable.

We use:

movies$Genre <- factor(movies$Genre)

to replace the original column with a factor version of itself.
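
To see the effect of factor() on a small scale, you could try it on a short made-up vector first. In this sketch the five genre labels are invented for illustration and are not taken from the dataset.

```r
# A short made-up character vector of genre labels
genre <- c("Comedy", "Drama", "Drama", "Horror", "Drama")

# As character data, summary() only reports Length, Class, and Mode
summary(genre)

# Convert to a factor: R now knows these are categories
genre_factor <- factor(genre)

# levels() lists the distinct categories in alphabetical order
genre_levels <- levels(genre_factor)   # "Comedy" "Drama" "Horror"

# summary() on a factor counts the observations in each category
genre_counts <- summary(genre_factor)  # Comedy 1, Drama 3, Horror 1

genre_levels
genre_counts
```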

The $ operator

The $ symbol is used to select a single column from a dataset.

For example,
movies$Genre means “the Genre column inside the movies dataset”.

Check the updated structure:

After converting character variables into factors, the output of summary() becomes much more informative for categorical data.

Instead of only showing "Length", "Class", and "Mode", R now displays the number of observations in each category (for example, how many films belong to each genre or studio).

This is why converting categorical variables to factors is useful: it allows R to treat them properly as categories and summarise them in a meaningful way.

summary(movies)

    Movie                             LeadStudio  RottenTomatoes 
 Length:1295        Warner Bros.           :136   Min.   : 0.00  
 Class :character   Universal Pictures     :114   1st Qu.:33.00  
 Mode  :character   Lionsgate              :106   Median :61.00  
                    Twentieth Century Fox  :103   Mean   :57.58  
                    Paramount Pictures     : 76   3rd Qu.:84.00  
                    Sony Pictures Releasing: 72   Max.   :99.00  
                    (Other)                :688   NA's   :6      
 AudienceScore         Genre     TheatersOpenWeek OpeningWeekend   
 Min.   :10.00   Drama    :386   Min.   :   1.0   Min.   :  0.020  
 1st Qu.:49.00   Comedy   :191   1st Qu.: 152.5   1st Qu.:  0.845  
 Median :64.00   Action   :170   Median :2459.0   Median :  7.600  
 Mean   :62.18   Adventure:164   Mean   :2008.0   Mean   : 17.541  
 3rd Qu.:77.00   Thriller :155   3rd Qu.:3213.5   3rd Qu.: 20.810  
 Max.   :99.00   Horror   : 85   Max.   :4529.0   Max.   :257.700  
                 (Other)  :144                                     
 BOAvgOpenWeekend     Budget       DomesticGross      WorldGross     
 Min.   :   204   Min.   :  0.90   Min.   :  1.02   Min.   :   0.74  
 1st Qu.:  3482   1st Qu.: 12.00   1st Qu.:  6.40   1st Qu.:  13.09  
 Median :  6586   Median : 30.00   Median : 26.46   Median :  50.37  
 Mean   : 13400   Mean   : 51.38   Mean   : 58.16   Mean   : 147.01  
 3rd Qu.: 14534   3rd Qu.: 65.00   3rd Qu.: 66.44   3rd Qu.: 160.38  
 Max.   :240000   Max.   :365.00   Max.   :936.66   Max.   :2068.22  
                  NA's   :239                                        
  ForeignGross     Profitability       OpenProfit        Year    
 Min.   :  -0.76   Min.   :    2.3   Min.   :   0.05   2012:169  
 1st Qu.:   3.91   1st Qu.:  139.1   1st Qu.:  12.87   2013:172  
 Median :  21.58   Median :  268.9   Median :  31.77   2014:197  
 Mean   :  88.84   Mean   :  435.7   Mean   :  64.50   2015:190  
 3rd Qu.:  89.75   3rd Qu.:  483.0   3rd Qu.:  62.59   2016:196  
 Max.   :1369.54   Max.   :10176.0   Max.   :3373.00   2017:182  
                   NA's   :239       NA's   :239       2018:189

summary(movies$Genre)

         Action       Adventure    Black Comedy          Comedy         Concert 
            170             164              22             191               4 
    Documentary           Drama          Horror         Musical Romantic Comedy 
             41             386              85              18              49 
       Thriller         Western 
            155              10
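
Counts are easier to describe in a report when they are sorted and expressed as percentages. The sketch below retypes the genre counts from the output above so that it runs on its own; with the dataset loaded, you could start from summary(movies$Genre) instead. The object names are illustrative.

```r
# Genre counts copied from the summary(movies$Genre) output
genre_counts <- c(
  Action = 170, Adventure = 164, "Black Comedy" = 22, Comedy = 191,
  Concert = 4, Documentary = 41, Drama = 386, Horror = 85,
  Musical = 18, "Romantic Comedy" = 49, Thriller = 155, Western = 10
)

# Sort from most to least common
sorted_counts <- sort(genre_counts, decreasing = TRUE)

# Convert counts to percentages of all 1295 films, rounded to 1 decimal place
genre_percent <- round(100 * sorted_counts / sum(sorted_counts), 1)

# Show the three most common genres
head(sorted_counts, 3)
head(genre_percent, 3)
```

The three most common genres are Drama (386 films, 29.8%), Comedy (191 films, 14.7%), and Action (170 films, 13.1%).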

summary(movies$LeadStudio)

                                    -                                   A24 
                                    3                                    34 
                           Abramorama                          Affirm Films 
                                    2                                     2 
                              Alchemy                        Amazon Studios 
                                    1                                     4 
                     Anchor Bay Films                    Annapurna Pictures 
                                    1                                     7 
                    Arc Entertainment            Atlas Distribution Company 
                                    2                                     2 
                      Aviron Pictures                               BH Tilt 
                                    3                                    10 
                Bleecker Street Media                       Blue Sky Cinema 
                                   19                                     2 
             Briarcliff Entertainment                  Broad Green Pictures 
                                    1                                     8 
                            CBS Films          China Lion Film Distribution 
                                   11                                     1 
                           CineGalaxy                         Cinelou Films 
                                    1                                     2 
                Clarius Entertainment                     Cohen Media Group 
                                    4                                     2 
                      Dimension Films                      Drafthouse Films 
                                    1                                     1 
                           DreamWorks                     EchoLight Studios 
                                    2                                     1 
               Electric Entertainment                     Entertainment One 
                                    2                                     4 
Entertainment Studios Motion Pictures                    Eros International 
                                    5                                    11 
                           EuropaCorp                   Excel Entertainment 
                                    4                                     1 
                         FilmDistrict                        Focus Features 
                                    8                                    46 
                                  Fox              Fox Searchlight Pictures 
                                    1                                    42 
                  Freestyle Releasing           Fun Academy Motion Pictures 
                                    7                                     1 
             FUNimation Entertainment             Global Road Entertainment 
                                    1                                     4 
              Good Deed Entertainment                     Great India Films 
                                    1                                     2 
              Greenwich Entertainment                       Gunpowder & Sky 
                                    1                                     2 
                        GVN Releasing            Hammond Entertainment, LLC 
                                    1                                     1 
                            IFC Films                                  IMAX 
                                   12                                     1 
               Kenn Viselman Presents                      LD Entertainment 
                                    1                                     3 
                      Liberty Studios                             Lionsgate 
                                    1                                   106 
                    Magnolia Pictures             Metro-Goldwyn-Mayer (MGM) 
                                    8                                     3 
             Millennium Entertainment                       Music Box Films 
                                    2                                     4 
                                 Neon                                 Novus 
                                    5                                     1 
                 Open Road Films (II)                        Orion Pictures 
                                   35                                     1 
                         Oscilloscope                       Pantelion Films 
                                    1                                     5 
                   Paramount Pictures                     Paramount Vantage 
                                   76                                     1 
                         Picturehouse                   Purdie Distribution 
                                    1                                     2 
              Pure Flix Entertainment                          Quality Flix 
                                   12                                     2 
                           RADiUS-TWC                      RCR Distribution 
                                    5                                     1 
                     Relativity Media                 Reliance Big Pictures 
                                   26                                     2 
               River Rain Productions                  Roadside Attractions 
                                    1                                    35 
              Rocky Mountain Pictures                           Saban Films 
                                    2                                     1 
                          Screen Gems                    Screen Media Films 
                                   21                                     1 
               Sony Pictures Classics               Sony Pictures Releasing 
                                   44                                    72 
                             Studio 8                     STX Entertainment 
                                    2                                    23 
                         Ten Furlongs                       The Film Arcade 
                                    1                                     1 
                          The Orchard            The Samuel Goldwyn Company 
                                    3                                     5 
                The Weinstein Company                   Trafalgar Releasing 
                                   35                                     1 
                     TriStar Pictures         Triumph Releasing Corporation 
                                   18                                     1 
                Twentieth Century Fox                                Unison 
                                  103                                     1 
                   Universal Pictures                   UTV Motion Pictures 
                                  114                                     8 
               Vertical Entertainment   Walt Disney Studios Motion Pictures 
                                    1                                    72 
                         Warner Bros.                             Weinstein 
                                  136                                     2 
                       Yash Raj Films 
                                    2
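
With this many studios, the alphabetical listing is hard to scan. As a minimal sketch, assuming a small made-up vector `studio` in place of the LeadStudio column (the values are illustrative, not taken from the dataset), sorting the frequency table brings the most common categories to the top:

```r
# Hypothetical stand-in for the LeadStudio column (not the real data)
studio <- c("Lionsgate", "Warner Bros.", "Lionsgate", "Neon",
            "Warner Bros.", "Warner Bros.")

# Count how often each category occurs
studio_counts <- table(studio)

# Sort from most to least frequent, then keep the top two
top_studios <- head(sort(studio_counts, decreasing = TRUE), 2)
top_studios
```

The same pattern should work on the real column, for example by passing `DATA$LeadStudio` to `table()`, where `DATA` is whatever name you gave the imported dataset.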

We can now complete the table:

Variable	Type	Level of measurement	Example value
Genre	Categorical	Nominal	Comedy
LeadStudio	Categorical	Nominal	Warner Bros.
Budget	Numeric	Ratio	80
WorldGross	Numeric	Ratio	350

Functions used in this section

Function	Purpose
`glimpse()`	identify variable types
`head()`	see example values
`summary()`	calculate statistics
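
The three functions above can be tried together on a small example. The sketch below assumes a hypothetical four-row data frame called `films` with made-up values; the column names mirror the variables in this section, but the numbers are not from the real dataset.

```r
library(dplyr)  # provides glimpse()

# Hypothetical four-row data frame (illustrative values, not the real dataset)
films <- data.frame(
  Genre      = c("Comedy", "Drama", "Action", "Comedy"),
  Budget     = c(80, 20, 150, 35),
  WorldGross = c(350, 45, 610, 90)
)

glimpse(films)   # one line per variable: name, type, first values
head(films, 2)   # first two rows, to see example values

# Minimum, quartiles, median, mean and maximum of a numeric variable
budget_summary <- summary(films$Budget)
budget_summary
```

Here `glimpse()` reports Genre as character (`<chr>`) and the other two columns as numeric (`<dbl>`), which is the information needed for the "Type" column of the table.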

3. Write about the dataset and the variables examined

Example writeup

The dataset contains information on 1,295 Hollywood films released between 2012 and 2018. It includes both categorical characteristics of films and numerical measures of financial performance. In this analysis, four variables were examined: Genre, LeadStudio, Budget, and WorldGross. Genre and LeadStudio are categorical variables measured on a nominal scale; they describe the type of film and the lead studio behind it. Budget and WorldGross are numeric variables measured on a ratio scale; they represent the production budget and worldwide box office revenue of each film. The summary statistics show that budgets and worldwide revenues vary considerably across films, indicating large differences in production scale and financial performance within the dataset.

3. Student Glossary

To conclude the lab, create a glossary of R functions.
Use Word, Excel, OneNote, or any tool you like, and make a simple table with two columns:

Function name
Brief description of what the function does

This “do it yourself” glossary will help you revise the key ideas from today’s lab and will be extremely useful when completing the weekly quizzes.

Below is a starter example you can expand:

Function	Use and package
`read_csv`	Reads a comma-separated values (CSV) file into a data frame. Part of readr (tidyverse).
`head`	Shows the first few rows (six by default).
`nrow`	Returns the number of rows in a dataset.
`ncol`	Returns the number of columns in a dataset.
`dim`	Returns the number of rows and the number of columns (dimensions).
`glimpse`	Provides a compact overview of all variables (dplyr).
`summary`	Gives summary statistics for each variable.
`factor`	Converts a variable into a categorical factor.

Feel free to add more as you encounter additional functions during the course.
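
To check a glossary entry, it can help to run the function on a tiny example. The sketch below uses a hypothetical three-row data frame named `films` (made-up values, base R only) to try several of the functions from the starter table:

```r
# Hypothetical data frame standing in for the course dataset
films <- data.frame(
  Genre  = c("Comedy", "Drama", "Comedy"),
  Budget = c(80, 20, 35)
)

nrow(films)  # number of rows
ncol(films)  # number of columns
dim(films)   # rows, then columns

# Convert a character variable to a factor and inspect its categories
films$Genre <- factor(films$Genre)
levels(films$Genre)    # distinct categories, in alphabetical order
summary(films$Genre)   # for a factor, summary() counts each category
```

Note that `summary()` behaves differently depending on the variable type: it returns counts for a factor and summary statistics for a numeric variable.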

Reference

Lock, Robin H., Patti Frazer Lock, Kari Lock Morgan, Eric F. Lock, and Dennis F. Lock. 2020. Statistics: Unlocking the Power of Data. John Wiley & Sons.

Footnotes

1. Hint: use read_csv() from the readr package.
2. Hint: try View(DATA) or head(DATA).
3. Hint: nrow(DATA) or dim(DATA)[1].
4. Hint: ncol(DATA) or dim(DATA)[2].
5. Hint: glimpse(DATA) from the dplyr package or str(DATA).
6. Hint: summary(DATA$Variable) or min(), max(), mean().
7. Hint: summary(DATA) will show NAs per variable.
