This project explored publicly available data on top-grossing US films to identify patterns across genres, studios, and release schedules. I used R and ggplot2 for data exploration and visualisation.
I loaded and inspected the data with read.csv(), summary(), and str().
library(ggplot2)
# Day.of.Week: no movies in the dataset were released on a Monday (Figure 1)
ggplot(data = mov, aes(x = Day.of.Week)) +
  geom_bar()
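The Monday observation can also be checked numerically. The following is a minimal base R sketch, assuming a made-up `mov_demo` data frame in place of the real data; the weekday values are illustrative only. Giving the column explicit factor levels makes days with no releases appear as zero counts instead of silently dropping out.

```r
# Toy data standing in for the real dataset (values are made up).
mov_demo <- data.frame(
  Day.of.Week = c("Friday", "Friday", "Wednesday", "Thursday", "Friday", "Tuesday")
)

# Fix the level order so every weekday is represented, even with no releases.
weekdays_ordered <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                      "Friday", "Saturday", "Sunday")
mov_demo$Day.of.Week <- factor(mov_demo$Day.of.Week, levels = weekdays_ordered)

# table() keeps unused factor levels, so days with no releases show up as 0.
day_counts <- table(mov_demo$Day.of.Week)
print(day_counts)

# Days with zero releases in the toy data.
empty_days <- names(day_counts)[day_counts == 0]
print(empty_days)
```

In a ggplot2 bar chart, adding `scale_x_discrete(drop = FALSE)` keeps such empty levels on the axis, which makes the gap visible in the figure as well.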
# Filter dataset for desired genres:
filt <- (mov$Genre == "action") |
  (mov$Genre == "adventure") |
  (mov$Genre == "animation") |
  (mov$Genre == "comedy") |
  (mov$Genre == "drama")
# Filter dataset for desired studios:
filt2 <- (mov$Studio == "Buena Vista Studios") |
  (mov$Studio == "WB") |
  (mov$Studio == "Fox") |
  (mov$Studio == "Universal") |
  (mov$Studio == "Sony") |
  (mov$Studio == "Paramount Pictures")
# Apply filters
mov2 <- mov[filt & filt2, ]
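The same genre and studio filters can be written more compactly with `%in%`. This is a sketch under assumptions, not the code used in the project: `mov_demo` and its rows are made up for illustration. One practical difference is that `==` yields `NA` for missing values, which produces all-`NA` rows when used as a row index, whereas `%in%` returns `FALSE`.

```r
# Toy data standing in for the real dataset (values are made up).
mov_demo <- data.frame(
  Genre  = c("action", "comedy", "horror", "drama", NA),
  Studio = c("WB", "Fox", "Sony", "Lionsgate", "Universal"),
  stringsAsFactors = FALSE
)

genres  <- c("action", "adventure", "animation", "comedy", "drama")
studios <- c("Buena Vista Studios", "WB", "Fox",
             "Universal", "Sony", "Paramount Pictures")

# %in% returns FALSE (never NA) for missing values, so no all-NA rows appear.
keep <- mov_demo$Genre %in% genres & mov_demo$Studio %in% studios
mov2_demo <- mov_demo[keep, ]
print(mov2_demo)
```

Keeping the allowed values in named vectors also makes it easier to change the selection later without editing a chain of `|` conditions.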
# Prepare the plot's data and aes layers (Figure 2)
p <- ggplot(data = mov2, aes(x = Genre, y = Gross...US))
q <- p +
  geom_jitter(aes(size = Budget...mill., colour = Studio)) +
  geom_boxplot(alpha = 0.7, outlier.colour = NA)
# Non-data info
q <- q +
  xlab("Genre") +
  ylab("Gross % US") +
  ggtitle("Domestic Gross % by Genre")
# Theme
q <- q +
  theme(
    text = element_text(family = "Times New Roman"),
    axis.title.x = element_text(colour = "Blue", size = 30),
    axis.title.y = element_text(colour = "Blue", size = 30),
    axis.text.x = element_text(size = 20),
    axis.text.y = element_text(size = 20),
    plot.title = element_text(colour = "Black", size = 40),
    legend.title = element_text(size = 20),
    legend.text = element_text(size = 12)
  )
Profitable genres are concentrated among a few studios. Monday releases are avoided, possibly as a scheduling strategy.
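The boxplot layer draws its outliers as points, and the jitter layer had already plotted those same observations, so they appeared twice and cluttered the figure. I resolved this by hiding the boxplot outliers with outlier.colour = NA and making the boxes semi-transparent (alpha = 0.7) so the jittered points remain visible underneath.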
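To see which observations would be drawn twice, the outliers can be listed directly. The sketch below uses base R's `boxplot.stats()` on made-up values (`gross_demo` is hypothetical). It applies the same 1.5 × IQR rule as a default boxplot, although ggplot2 computes the quartiles slightly differently, so treat the result as an approximation of what `geom_boxplot()` would flag.

```r
# Toy values standing in for one genre's Gross % US (made up).
gross_demo <- c(35, 40, 42, 45, 47, 50, 52, 55, 95)

# boxplot.stats() flags points beyond 1.5 * IQR from the hinges.
bs <- boxplot.stats(gross_demo)

# These are the points a boxplot layer would draw in addition to the
# jittered copy of the same observation.
print(bs$out)

# Number of observations that would appear twice in a combined plot.
n_duplicated <- length(bs$out)
print(n_duplicated)
```

Setting `outlier.shape = NA` in `geom_boxplot()` is another common way to suppress these points.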