Categorical Data – MSDA IFP

Week 09 – Tutorial 02

1. Tasks for today

This week’s task is highlighted in bold below. Please only focus on completing that task this week. In the next section, you will also find guided sub-steps you may want to consider to complete this week’s task.

  1. Read the College dataset into R, inspect it, and write a concise introduction to the data and its structure.

2) Identify, display, and describe the categorical variables.

  1. Display and describe a selection of numeric variables.

  1. Display and describe at least one relationship between two or three variables.

  1. Finish the report write-up, knit to PDF, and submit.

This tutorial is designed to help you complete Task 2.

1.1 Task 2 – sub-tasks

Tip

Tip: Hover over the footnotes for hints showing useful R functions.

This week you will focus on Task 2: Identify, display, and describe the categorical variables.
Below are some guided sub-steps you may want to consider to complete Task 2.

  • Reopen last week’s .qmd file, as you will continue last week’s work and build on it.1
  1. Read the CollegeScores dataset into R (from the provided CSV file in datasets/) and give the object a sensible name, for example college.2

  2. Use functions such as glimpse() or str() to inspect the structure of the dataset and identify which variables are categorical.3

  3. Separate the categorical variables into:

    • nominal variables (no natural order), such as State, Control, Region, Locale, and Main
    • ordinal variables (meaningful order), such as Selectivity
  4. Convert these categorical variables to factors in R so that they are treated correctly in plots and summaries.4

  5. Create frequency tables for at least three categorical variables, showing:

    • the count
    • the percentage of institutions in each category.5
  6. Create bar charts for at least three categorical variables
    (for example: number of institutions by Region, by Control, and by Locale).6

  7. Improve the readability of your bar charts by:

    • rotating x-axis labels if the names are long
    • ordering nominal categories by frequency using fct_infreq() where appropriate
    • keeping ordinal variables in their natural order
    • adding clear axis labels and an informative title with labs().7
  8. For one ordinal categorical variable (Selectivity), go further and calculate:

    • the median
    • the quartiles
    • the interquartile range (IQR)
  9. In your written Analysis section, describe the distributions in words. For example:

    • Which categories are most common?
    • Are some categories rare?
    • Is the distribution fairly balanced or dominated by one or two groups?
    • For Selectivity, what do the median and IQR tell you?
  10. For each categorical variable in your report, include either the bar plot or the frequency table (not both) to avoid duplicating information.

You do not need to include R output in the report; you only need the written description. Keep this paragraph; you will re-use it in your report.

Data

CollegeScores Dataset

You will work with the CollegeScores dataset in this tutorial.
The dataset includes variables describing each institution’s location, sector, admissions selectivity, tuition costs, enrolment, and the demographic composition of their student body.

You can download the dataset “CollegeScores_teaching_with_selectivity.csv” on Learn ,”Materials - Week 9”

Variable Name Description
ID Institution ID number
Name Name of the institution
State State where the institution is located
Main Main campus? (1 = yes, 0 = branch campus)
Control Control of institution (Private, Profit, Public)
Region Region of the country (Midwest, Northeast, Southeast, Territory, West)
Locale Locale of the institution (City, Rural, Suburb, Town)
Enrollment Undergraduate enrolment
PartTime Percent of undergraduates who are part-time students
TuitionIn In-state tuition and fees
TuitionOut Out-of-state tuition and fees
White Percent of undergraduates who report being white
Black Percent of undergraduates who report being black
Hispanic Percent of undergraduates who report being Hispanic
Asian Percent of undergraduates who report being Asian
Other Percent of undergraduates who do not report one of the above categories
AdmitRate Admission rate
Selectivity Ordinal category based on admission rate (Highly selective, Moderately selective, Less selective)

2 Worked Example

Consider the dataset provided in datasets/HollywoodMovies.csv, containing 1295 observations on the following 15 variables:

Variable Name Description
Movie Title of the movie
LeadStudio Primary U.S. distributor
RottenTomatoes Critics' rating (Rotten Tomatoes)
AudienceScore Audience rating (Rotten Tomatoes)
Genre Film genre (e.g., Action Adventure, Comedy, Thriller)
TheatersOpenWeek Number of screens on opening weekend
OpeningWeekend Opening weekend gross (in millions)
BOAvgOpenWeekend Average box office income per theatre, opening weekend
Budget Production budget (in millions)
DomesticGross U.S. gross income (in millions)
ForeignGross Foreign gross income (in millions)
WorldGross Worldwide gross income (in millions)
Profitability Worldwide gross as a percentage of budget
OpenProfit Percentage of budget recovered on opening weekend
Year Year of release

These data were compiled from Box Office Mojo, The Numbers, and Rotten Tomatoes.

We load the tidyverse package as we will use the functions
read_csv() and glimpse() from this package.

Load tidyverse and import the data

library(tidyverse)

read_csv() reads CSV (comma-separated values) files.
The loaded data are stored in an object called movies using the arrow <-.

2.1 Load tidyverse and import the data

movies <- read_csv("HollywoodMovies.csv")

2.2 Inspect the structure of the dataset

glimpse(movies)
Rows: 1,295
Columns: 15
$ Movie            <chr> "2016: Obama's America", "21 Jump Street", "A Late Qu…
$ LeadStudio       <chr> "Rocky Mountain Pictures", "Sony Pictures Releasing",…
$ RottenTomatoes   <dbl> 26, 85, 76, 90, 35, 27, 91, 56, 11, 44, 93, 63, 87, 9…
$ AudienceScore    <dbl> 73, 82, 71, 82, 51, 72, 62, 47, 47, 63, 82, 51, 63, 9…
$ Genre            <chr> "Documentary", "Comedy", "Drama", "Drama", "Horror", …
$ TheatersOpenWeek <dbl> 1, 3121, 9, 7, 3108, 3039, 132, 245, 2539, 3192, 3, 1…
$ OpeningWeekend   <dbl> 0.03, 36.30, 0.08, 0.04, 16.31, 24.48, 1.14, 0.70, 11…
$ BOAvgOpenWeekend <dbl> 30000, 11631, 8889, 5714, 5248, 8055, 8636, 2857, 449…
$ Budget           <dbl> 3.0, 42.0, NA, NA, 68.0, 12.0, NA, 7.5, 35.0, 50.0, 1…
$ DomesticGross    <dbl> 33.35, 138.45, 1.56, 1.55, 37.52, 70.01, 1.99, 3.01, …
$ WorldGross       <dbl> 33.35, 202.81, 6.30, 7.60, 137.49, 82.50, 3.59, 8.54,…
$ ForeignGross     <dbl> 0.00, 64.36, 4.74, 6.05, 99.97, 12.49, 1.60, 5.53, 9.…
$ Profitability    <dbl> 1334.00, 482.88, NA, NA, 202.19, 687.50, NA, 113.87, …
$ OpenProfit       <dbl> 1.20, 86.43, NA, NA, 23.99, 204.00, NA, 9.33, 32.57, …
$ Year             <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012,…
  • This helps us identify which variables are categorical and how R is currently storing them.

2.3 Identify categorical variables

Some variables in this dataset are categorical.

a. Nominal categorical variables

  • LeadStudio

  • Genre

  • These have no natural order.

b. Ordinal categorical variable

  • Year

Although Year is stored as a number, in this worked example we will treat it as an ordered categorical variable because the years follow a meaningful order.

Important

Movie is an identifier variable, not a useful categorical variable for summary plots or frequency tables.

Each movie title appears only once, so plotting Movie would not help us describe the data.

2.4 Convert categorical variables to factors

movies <- movies |>
  mutate(
    LeadStudio = factor(LeadStudio),
    Genre = factor(Genre))

We need the variable Year to be ordered.

  • first we check the current levels of the variable
levels(movies$Year)
NULL
  • We also check that R is reading the variable as a factor and not as a character:
class(movies$Year)
[1] "numeric"

to make sure the variable Year is a factor and is order :

movies <- movies |>
  mutate(
    Year = factor(Year, levels = sort(unique(Year)), ordered = TRUE)
  )

Functions:

  • unique() → find categories
  • sort() → order them
  • levels = → tell R the correct order
  • so, levels = sort(unique(Year)) means “take all the different years in the dataset, sort them from smallest to largest, and use that as the correct order of categories”.
  • unique(Year)

    → finds all the distinct year values in the dataset

    (removes duplicates)

  • sort(...)

    → arranges those values in increasing order

    (e.g. 2008, 2009, 2010, …)

  • levels = ...

    → tells R:

    “these are the categories, and this is their correct order”

  • ordered = TRUE

    tells R that the order matters, so Year is treated as an ordinal variable rather than a nominal one

3 Nominal categorical variables

For nominal variables, we focus on:

  • frequencies

  • percentages

  • bar charts

  • the most common category (the mode)

We do not calculate the median or IQR for nominal variables.

3.1 Frequency table: Genre

tbl_genre <- movies |>
  count(Genre, name = "n") |>
  mutate(
    Percent = round((n / sum(n)) * 100, 1)
  ) |>
  arrange(desc(n))

tbl_genre
# A tibble: 12 × 3
   Genre               n Percent
   <fct>           <int>   <dbl>
 1 Drama             386    29.8
 2 Comedy            191    14.7
 3 Action            170    13.1
 4 Adventure         164    12.7
 5 Thriller          155    12  
 6 Horror             85     6.6
 7 Romantic Comedy    49     3.8
 8 Documentary        41     3.2
 9 Black Comedy       22     1.7
10 Musical            18     1.4
11 Western            10     0.8
12 Concert             4     0.3

3.2 Bar chart: Genre

plot_genre <- ggplot(movies, aes(x = Genre)) +
  geom_bar() +
  labs(x = "Movie Genre", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
plot_genre

ggplot(movies, aes(x = fct_infreq(Genre))) +
  geom_bar(fill = "steelblue4") +
  labs(
    x = "Genre",
    y = "Count",
    title = "Frequency of movie genres"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

3.3 Frequency table: LeadStudio

tbl_studio <- movies |>
  count(LeadStudio, name = "n") |>
  mutate(
    Percent = round((n / sum(n)) * 100, 1)
  ) |>
  arrange(desc(n))

tbl_studio
# A tibble: 99 × 3
   LeadStudio                              n Percent
   <fct>                               <int>   <dbl>
 1 Warner Bros.                          136    10.5
 2 Universal Pictures                    114     8.8
 3 Lionsgate                             106     8.2
 4 Twentieth Century Fox                 103     8  
 5 Paramount Pictures                     76     5.9
 6 Sony Pictures Releasing                72     5.6
 7 Walt Disney Studios Motion Pictures    72     5.6
 8 Focus Features                         46     3.6
 9 Sony Pictures Classics                 44     3.4
10 Fox Searchlight Pictures               42     3.2
# ℹ 89 more rows

Because there are many studios, it is often clearer to show only the most common ones.

tbl_studio_top10 <- tbl_studio |>
  slice_head(n = 10)

tbl_studio_top10
# A tibble: 10 × 3
   LeadStudio                              n Percent
   <fct>                               <int>   <dbl>
 1 Warner Bros.                          136    10.5
 2 Universal Pictures                    114     8.8
 3 Lionsgate                             106     8.2
 4 Twentieth Century Fox                 103     8  
 5 Paramount Pictures                     76     5.9
 6 Sony Pictures Releasing                72     5.6
 7 Walt Disney Studios Motion Pictures    72     5.6
 8 Focus Features                         46     3.6
 9 Sony Pictures Classics                 44     3.4
10 Fox Searchlight Pictures               42     3.2

3.4 Bar chart: LeadStudio

library(tidyverse)
library(forcats)

plot_LeadStudio  <- ggplot(movies, aes(x = fct_infreq(LeadStudio))) +
  geom_bar() +
  labs(
    x = "Lead Studio (sorted by frequency)",
    y = "Count",
    title = "Frequency of Lead Studios"
  ) +
  scale_y_continuous(
    breaks = seq(0, 150, by = 20)   # More detailed count axis
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(
      angle = 45,
      hjust = 1,
      size = 6,                    # Smaller labels so they fit
      lineheight = 0.8
    ),
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.y = element_text(size = 9)
  )

plot_LeadStudio

Flip

ggplot(movies, aes(x = fct_infreq(LeadStudio))) +
  geom_bar() +
  coord_flip() +
  labs(x = "Lead Studio (sorted by frequency)", y = "Count") +
  theme_minimal() +
  scale_y_continuous(breaks = seq(0, 150, by = 10))

ggplot(tbl_studio_top10, aes(x = fct_reorder(LeadStudio, n), y = n)) +
  geom_col(fill = "steelblue4") +
  labs(
    x = "Lead Studio",
    y = "Count",
    title = "Top 10 most common lead studios"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

3.5 What is the most common category?

For nominal variables, the most common category is the mode.

a. Most common Genre

slice_max(tbl_genre, n, n = 1)
# A tibble: 1 × 3
  Genre     n Percent
  <fct> <int>   <dbl>
1 Drama   386    29.8

This returns the genre with the highest number of movies.

b. Most common Lead Studio

slice_max(tbl_studio, n, n = 1)
# A tibble: 1 × 3
  LeadStudio       n Percent
  <fct>        <int>   <dbl>
1 Warner Bros.   136    10.5

This returns the studio that appears most frequently in the dataset.

What does slice_max() do?

  • n → the column containing the counts

  • n = 1 → return only the top category

So this code means:

“Give me the category with the highest count.”

How to describe this in your report

For example:

The most common genre in the dataset is Drama, accounting for 29.8% of all movies. This indicates that drama films are the most frequently represented category in the dataset. Similarly, the most common lead studio is Warner Bros., representing 10.5% of the observations. This suggests that this studio appears most frequently in the dataset.

3.6 Ordinal categorical variable: Year

Unlike Genre and LeadStudio, the variable Year has a natural order (time), so it is treated as an ordinal categorical variable.

This means:

  • the order of categories matters

  • we can summarise it using median and spread (IQR), not just counts

3.6.1 Frequency distribution of Year

tbl_year <- movies |>
  count(Year, name = "n") |>
  mutate(
    Percent = round((n / sum(n)) * 100, 2)
  )

tbl_year
# A tibble: 7 × 3
  Year      n Percent
  <ord> <int>   <dbl>
1 2012    169    13.0
2 2013    172    13.3
3 2014    197    15.2
4 2015    190    14.7
5 2016    196    15.1
6 2017    182    14.0
7 2018    189    14.6

This shows how many movies were released in each year.

  • Year frequency plot
ggplot(movies, aes(x = factor(Year))) +
geom_bar() +
labs(
x = "Year of Release",
y = "Frequency",
title = "Number of Movies Released per Year"
) +
theme_minimal() +
theme(
axis.text.x = element_text(
angle = 45,
hjust = 1
)
)

Formatting tables nicely

library(kableExtra)

Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':

    group_rows
kbl(
  list(
    tbl_genre,
    tbl_studio_top10
  ),
  booktabs = TRUE,
  digits = 2,
  col.names = c("Category", "Count", "Percent")
)
Category Count Percent
Drama 386 29.8
Comedy 191 14.7
Action 170 13.1
Adventure 164 12.7
Thriller 155 12.0
Horror 85 6.6
Romantic Comedy 49 3.8
Documentary 41 3.2
Black Comedy 22 1.7
Musical 18 1.4
Western 10 0.8
Concert 4 0.3
Category Count Percent
Warner Bros. 136 10.5
Universal Pictures 114 8.8
Lionsgate 106 8.2
Twentieth Century Fox 103 8.0
Paramount Pictures 76 5.9
Sony Pictures Releasing 72 5.6
Walt Disney Studios Motion Pictures 72 5.6
Focus Features 46 3.6
Sony Pictures Classics 44 3.4
Fox Searchlight Pictures 42 3.2

If you prefer separate tables with captions you can reference in the text:

# Genre table
tbl_genre |>
  kbl(
    booktabs = TRUE,
    digits = 2,
    caption = "Frequency distribution of movie genres",
    col.names = c("Genre", "Count", "Percent")
  )
Frequency distribution of movie genres
Genre Count Percent
Drama 386 29.8
Comedy 191 14.7
Action 170 13.1
Adventure 164 12.7
Thriller 155 12.0
Horror 85 6.6
Romantic Comedy 49 3.8
Documentary 41 3.2
Black Comedy 22 1.7
Musical 18 1.4
Western 10 0.8
Concert 4 0.3
# Lead studio table (top 10)
tbl_studio_top10 |>
  kbl(
    booktabs = TRUE,
    digits = 2,
    caption = "Top 10 most common lead studios",
    col.names = c("Lead Studio", "Count", "Percent")
  )
Top 10 most common lead studios
Lead Studio Count Percent
Warner Bros. 136 10.5
Universal Pictures 114 8.8
Lionsgate 106 8.2
Twentieth Century Fox 103 8.0
Paramount Pictures 76 5.9
Sony Pictures Releasing 72 5.6
Walt Disney Studios Motion Pictures 72 5.6
Focus Features 46 3.6
Sony Pictures Classics 44 3.4
Fox Searchlight Pictures 42 3.2

Then in your .qmd text you can reference them like:

Table \@ref(tab:genre-table) shows the frequency and percentage of each film genre.

Table \@ref(tab:studio-table) summarises the ten most common lead studios.

(using the corresponding chunk labels, e.g. {r genre-table} and {r studio-table} for the chunks that create each kbl).

What is the most common category?

  • Most common genre = category with the highest frequency

  • Most common lead studio = category with the highest frequency

You can identify this from the sorted tables or by using slice_max():

slice_max(tbl_genre, n, n = 1)
# A tibble: 1 × 3
  Genre     n Percent
  <fct> <int>   <dbl>
1 Drama   386    29.8
slice_max(tbl_studio, n, n = 1)
# A tibble: 1 × 3
  LeadStudio       n Percent
  <fct>        <int>   <dbl>
1 Warner Bros.   136    10.5

Footnotes

  1. Hint: Open the same .qmd file you used for Task 1 so you keep building a single report.↩︎

  2. Hint: Use read_csv("datasets/CollegeScores_teaching.csv") from the readr package.↩︎

  3. Hint: Try glimpse(college) from dplyr or str(college).↩︎

  4. Hint: Use mutate() with factor(), for example
    college <- college |> mutate(Control = factor(Control), Region = factor(Region)).↩︎

  5. Hint: college |> count(Control) |> mutate(Percent = round(n / sum(n) * 100, 2)).↩︎

  6. Hint: Use ggplot(college, aes(x = Control)) + geom_bar().↩︎

  7. Hint: theme(axis.text.x = element_text(angle = 45, hjust = 1)) and fct_infreq() from forcats.↩︎