Box Office Data Analysis

This project explored publicly available data on top-grossing US films to identify patterns across genres, studios, and release schedules. I used R and ggplot2 for data exploration and visualisation.

💻 Tech Stack:

R for data import, filtering, and manipulation
ggplot2 for building custom, layered plots

🧪 Data Pipeline:

Import & inspect data: Used read.csv(), summary(), and str()
Initial exploration: A bar plot of Day.of.Week showed that the dataset contains no Monday releases
Filtering for significance: Narrowed to key genres and major studios
Visualisation: Created jitter + box plots comparing domestic gross
Aesthetics: Tuned themes for clarity and presentation
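The import-and-inspect step can be sketched as below. This is a minimal illustration, not the project's actual script: the temporary file, the two rows of data, and the raw header names ("Day of Week", "Gross % US", "Budget ($mill)") are all assumptions made up for the example. It shows why the columns carry names such as Gross...US: read.csv() runs headers through make.names() by default, which replaces every character that is not valid in an R name with a dot.

```r
# Toy CSV written to a temporary file. The rows and the raw header
# names are made up for illustration; they are not the project's data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  '"Day of Week","Genre","Studio","Gross % US","Budget ($mill)"',
  '"Friday","action","WB",45.2,150',
  '"Wednesday","comedy","Fox",60.1,40'
), csv_path)

# check.names = TRUE is the default, so headers are made syntactically valid
mov_demo <- read.csv(csv_path)

str(mov_demo)      # column types and a preview of the values
summary(mov_demo)  # per-column summaries
names(mov_demo)    # sanitised column names as used in the plotting code
```

Passing check.names = FALSE would keep the raw headers, at the cost of needing backticks around them in aes().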

📊 Code Snippets & Visualisations:

library(ggplot2)  # `mov` is assumed to be loaded already via read.csv()

# Releases per weekday; the dataset contains no Monday releases (Figure 1)
ggplot(data = mov, aes(x = Day.of.Week)) +
  geom_bar()

# Filter dataset for desired genres (%in% returns FALSE, not NA, for missing values):
filt <- mov$Genre %in% c("action", "adventure", "animation", "comedy", "drama")

# Filter dataset for desired studios (%in% returns FALSE, not NA, for missing values):
filt2 <- mov$Studio %in% c("Buena Vista Studios", "WB", "Fox",
                           "Universal", "Sony", "Paramount Pictures")

# Apply filters
mov2 <- mov[filt & filt2, ]

# Prepare the plot's data and aes layers (Figure 2)
p <- ggplot(data = mov2, aes(x = Genre, y = Gross...US))

# Jittered points (sized by budget, coloured by studio) with box plots drawn on top
q <- p +
  geom_jitter(aes(size = Budget...mill., colour = Studio)) +
  geom_boxplot(alpha = 0.7, outlier.colour = NA)

# Non-data info: axis labels and title
q <- q +
  xlab("Genre") +
  ylab("Gross % US") +
  ggtitle("Domestic Gross % by Genre")

# Theme
q <- q +
  theme(
    text = element_text(family = "Times New Roman"),
    axis.title.x = element_text(colour = "Blue", size = 30),
    axis.title.y = element_text(colour = "Blue", size = 30),
    axis.text.x = element_text(size = 20),
    axis.text.y = element_text(size = 20),
    plot.title = element_text(colour = "Black", size = 40),
    legend.title = element_text(size = 20),
    legend.text = element_text(size = 12)
  )

Figure 1: No releases on Mondays

Figure 2: Domestic gross by genre

🌟 Key Insights:

Profitable genres are concentrated among a few studios. No film in the dataset was released on a Monday, which may reflect a deliberate scheduling strategy.
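A bar plot simply omits categories that never occur, so a missing weekday is easy to overlook. One way to confirm it numerically is to tabulate the column against an explicit list of weekdays, so that absent days appear with a count of zero. The sketch below uses a short made-up vector in place of mov$Day.of.Week; the names days, week_levels, and day_counts are hypothetical.

```r
# Made-up stand-in for mov$Day.of.Week (illustration only)
days <- c("Friday", "Friday", "Wednesday", "Thursday", "Friday", "Saturday")

# Explicit factor levels keep weekdays with zero releases in the table
week_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday")
day_counts <- table(factor(days, levels = week_levels))

day_counts                            # counts for all seven weekdays
names(day_counts)[day_counts == 0]    # weekdays with no releases
```

The same idea carries over to the plot: converting Day.of.Week to a factor with all seven levels and adding scale_x_discrete(drop = FALSE) keeps an empty Monday slot on the x-axis.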

🧗🏾 Challenge Faced:

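Layering geom_boxplot() over geom_jitter() draws every outlier twice, once as a jittered point and once as a box plot outlier marker, which cluttered the plot. I resolved this by hiding the box plot's outlier markers with outlier.colour = NA and making the boxes semi-transparent with alpha = 0.7, so the jittered points underneath stay visible.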
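The effect can be reproduced on a small synthetic data set. The sketch below is illustrative only: the demo data frame and its values are made up, and the object names cluttered and clean are hypothetical. It builds the same two layers with and without the fix so the two plots can be compared side by side.

```r
library(ggplot2)

set.seed(1)  # reproducible toy data; not the project's data
demo <- data.frame(
  Genre = rep(c("action", "comedy"), each = 20),
  # 19 typical values plus one extreme value per genre
  Gross = c(rnorm(19, 50, 5), 95, rnorm(19, 60, 5), 10)
)

base <- ggplot(demo, aes(x = Genre, y = Gross))

# Cluttered: each extreme value is drawn by both geom_jitter() and geom_boxplot()
cluttered <- base +
  geom_jitter() +
  geom_boxplot()

# Cleaner: hide the box plot's outlier markers and let the points show through
clean <- base +
  geom_jitter() +
  geom_boxplot(alpha = 0.7, outlier.colour = NA)

# print(cluttered); print(clean)  # render each plot to compare
```

The ggplot2 documentation also suggests outlier.shape = NA for hiding outliers when raw points are overlaid; either setting removes the duplicate markers.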

View on GitHub
