Box Office Data Analysis

This project explored publicly available data on top-grossing US films to identify patterns across genres, studios, and release schedules. I used R and ggplot2 for data exploration and visualisation.

💻 Tech Stack:

🧪 Data Pipeline:

📊 Code Snippets & Visualisations:

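library(ggplot2)

# `mov` is the data frame of films (loading step not shown).
# Count releases per weekday: no film in the dataset was released on a Monday. (Figure 1)
ggplot(data = mov, aes(x = Day.of.Week)) +
  geom_bar()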

# Keep only the genres of interest.
# %in% returns FALSE (not NA) for missing values, so rows with NA are dropped cleanly.
filt <- mov$Genre %in% c("action", "adventure", "animation", "comedy", "drama")

# Keep only the studios of interest.
filt2 <- mov$Studio %in% c("Buena Vista Studios", "WB", "Fox",
                           "Universal", "Sony", "Paramount Pictures")

# Apply both filters: a row is kept only if it matches a genre AND a studio
mov2 <- mov[filt & filt2, ]

# Set up the plot's data and aesthetic mappings (Figure 2)
p <- ggplot(data = mov2, aes(x = Genre, y = Gross...US))

# Draw the jittered points first, then overlay semi-transparent boxplots;
# outlier.colour = NA hides the boxplot's own outlier markers
q <- p +
  geom_jitter(aes(size = Budget...mill., colour = Studio)) +
  geom_boxplot(alpha = 0.7, outlier.colour = NA)

# Axis labels and title
q <- q +
  xlab("Genre") +
  ylab("Gross % US") +
  ggtitle("Domestic Gross % by Genre")

# Theme: font family, colours, and text sizes
q <- q +
  theme(
    text = element_text(family = "Times New Roman"),
    axis.title.x = element_text(colour = "Blue", size = 30),
    axis.title.y = element_text(colour = "Blue", size = 30),
    axis.text.x = element_text(size = 20),
    axis.text.y = element_text(size = 20),
    plot.title = element_text(colour = "Black", size = 40),
    legend.title = element_text(size = 20),
    legend.text = element_text(size = 12)
  )

# Render the finished plot
q

						

🌟 Key Insights:

Profitable genres are concentrated among a few studios. No film in the dataset was released on a Monday, which may reflect a deliberate scheduling strategy.
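
The Monday observation can also be checked numerically rather than read off the bar chart. The following is a minimal base R sketch, assuming a column of weekday names like Day.of.Week. The `releases` data frame is invented toy data for illustration, not the project's dataset.

```r
# Toy data (hypothetical): stands in for the Day.of.Week column of `mov`.
releases <- data.frame(
  Day.of.Week = c("Friday", "Friday", "Wednesday", "Thursday",
                  "Friday", "Tuesday", "Saturday", "Sunday")
)

# Fix the weekday order and keep every day as a factor level, so days
# with no releases show up as an explicit zero instead of vanishing.
weekdays_ordered <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                      "Friday", "Saturday", "Sunday")
releases$Day.of.Week <- factor(releases$Day.of.Week, levels = weekdays_ordered)

# Count releases per weekday; unused factor levels are reported with count 0.
day_counts <- table(releases$Day.of.Week)
print(day_counts)

# Names of the weekdays that have no releases at all.
missing_days <- names(day_counts)[as.vector(day_counts) == 0]
print(missing_days)
```

With the weekday stored as a factor like this, adding scale_x_discrete(drop = FALSE) to a ggplot2 bar chart should keep the empty Monday slot on the axis, which makes the gap easier to see.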

๐Ÿง—๐Ÿพ Challenge Faced:

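In Figure 2, the boxplot's outlier markers overlapped the jittered points, so extreme values were drawn twice and the plot looked cluttered. I resolved this by hiding the boxplot outliers with outlier.colour = NA and making the boxes semi-transparent (alpha = 0.7) so the points underneath remain visible.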
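
To make the problem and the fix easier to compare, here is a small self-contained sketch on invented toy data. The `toy` data frame and the names `cluttered` and `clean` are hypothetical and are not part of the project code.

```r
library(ggplot2)

# Toy data (hypothetical): two groups, each with one extreme value.
set.seed(1)
toy <- data.frame(
  group = rep(c("a", "b"), each = 30),
  value = c(rnorm(29, mean = 50, sd = 5), 95,
            rnorm(29, mean = 60, sd = 5), 10)
)

# Shared base layer: the raw observations, jittered horizontally.
base <- ggplot(toy, aes(x = group, y = value)) +
  geom_jitter(width = 0.2)

# Default boxplot: it draws its own outlier markers on top of the
# jittered points, so each extreme value appears twice.
cluttered <- base + geom_boxplot()

# Fix used in this project: hide the boxplot's outlier markers and make
# the boxes semi-transparent so the jittered points stay visible.
clean <- base + geom_boxplot(alpha = 0.7, outlier.colour = NA)

# Print either object to render it, e.g. `clean`.
```

The ggplot2 documentation also suggests outlier.shape = NA as a way to hide boxplot outliers when raw points are overlaid; either setting should give a similar result here.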
