Machine Learning the Steam Video Games Database
How do video game genres evolve on Steam? This thesis combines bipartite network analysis, community detection, and regression to understand genre influence and predict player engagement.
27,075 games analysed
29 genre nodes
277 genre-genre edges
8 communities detected
Overview
The Steam dataset behind this thesis covers 27,075 games across 29 genres. The thesis explores how those genres co-evolve, which are most influential, and what game properties best explain player engagement, using a mix of network science and machine learning.
The core contribution is a bipartite genre network connecting games to their genres, projected into a weighted, unipartite genre-to-genre graph. That graph is then analysed with centrality measures, Louvain community detection, and temporal slicing across five eras (1995–2020). A regression pipeline on game properties attempts to predict average playtime.
Dataset
Source
Key Features
Pre-processing
Network Size
Methodology
The pipeline had four sequential stages:
- EDA & Pre-processing: correlation matrix, distribution checks, genre explosion into separate rows.
- Bipartite Network: games and genres as two node types; edges represent game-genre membership. Projected to a unipartite genre network via matrix multiplication in Gephi's Multimode plugin.
- Network Analysis: centrality measures (degree, betweenness, closeness, eigenvector), temporal genre evolution across 5-year windows, Louvain community detection, node attributes (net ratings, median playtime).
- Regression: Ridge, Decision Tree, and Random Forest models trained to predict Average_Playtime. Features include dummy-encoded platforms, genres, categories, owners, price, achievements, and a computed ratings_ratio. 10-fold cross-validation for model selection.
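The genre explosion and projection steps can be sketched in a few lines of plain Python. This is only an illustration, not the thesis pipeline (which ran the projection in Gephi): the three toy games, the semicolon-separated genre strings, and the names `games`, `incidence`, and `project` are all assumptions made for the example. With B as the games × genres incidence matrix, the product BᵀB gives a weighted genre-genre adjacency in which each off-diagonal entry counts the games carrying both genres.

```python
# Toy data: the titles and the ";"-separated genre format are assumptions.
games = {
    "Game A": "Action;Indie",
    "Game B": "Action;Indie;Casual",
    "Game C": "Casual;Strategy",
}

# "Explode" each genre string into a list of genres per game.
exploded = {title: genre_str.split(";") for title, genre_str in games.items()}
genres = sorted({g for gs in exploded.values() for g in gs})

# Incidence matrix B (games x genres): 1 if the game carries the genre.
incidence = [[1 if g in gs else 0 for g in genres] for gs in exploded.values()]


def project(b):
    """Return B^T B: entry [i][j] counts games that have both genre i and genre j."""
    n = len(b[0])
    return [[sum(row[i] * row[j] for row in b) for j in range(n)] for i in range(n)]


co_occurrence = project(incidence)

# Keep off-diagonal, non-zero entries as weighted genre-genre edges.
edges = {
    (genres[i], genres[j]): co_occurrence[i][j]
    for i in range(len(genres))
    for j in range(i + 1, len(genres))
    if co_occurrence[i][j] > 0
}
print(edges)
```

The diagonal of BᵀB holds the number of games per genre, which is why only the off-diagonal entries are kept as edges.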
Centrality Results: Spotlight Genres
Indie and Casual are each connected to every other genre in the network (normalised degree centrality = 1.0): every other genre on Steam co-occurs with both of them in at least one game.
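A degree centrality of 1.0 is the normalised form: a genre's number of neighbours divided by the n − 1 other genres. With 29 genre nodes, a score of 1.0 therefore means 28 neighbours. A small sketch on a hypothetical five-genre graph (the edge list is invented for illustration and is not the thesis network):

```python
from collections import defaultdict

# Hypothetical toy edge list, not the thesis network.
edges = [
    ("Indie", "Action"), ("Indie", "Casual"), ("Indie", "Racing"),
    ("Indie", "Sports"), ("Casual", "Action"), ("Casual", "Racing"),
    ("Casual", "Sports"), ("Racing", "Sports"),
]

neighbours = defaultdict(set)
for u, v in edges:
    neighbours[u].add(v)
    neighbours[v].add(u)

n = len(neighbours)
# Normalised degree centrality: number of neighbours / (n - 1).
degree_centrality = {g: len(nb) / (n - 1) for g, nb in neighbours.items()}

for genre, score in sorted(degree_centrality.items(), key=lambda kv: -kv[1]):
    print(f"{genre:8s} {score:.2f}")
```

In this toy graph Indie and Casual reach 1.0 because each touches all four other genres, while Action, linked only to those two, scores 0.5.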
Community Detection
Louvain modularity clustering at resolution 0.5 identified 8 communities:
- Cluster 1: Game Development, Utilities, Education, Audio/Photo/Video Production, Animation, Web Publishing, Accounting, Software Training, Design & Illustration
- Cluster 2: Nudity, Violent, Gore, Sexual Content
- Cluster 3: Sports, Racing
- Cluster 4: Action, Indie
- Cluster 5: Strategy, Simulation
- Cluster 6: Massively Multiplayer, Free to Play
- Cluster 7: Early Access, RPG
- Cluster 8: Adventure, Casual
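Louvain works by greedily moving nodes between communities to increase modularity. The sketch below does not implement Louvain itself; it only computes the weighted modularity score Q for a given partition, with a resolution multiplier `gamma`, so that the objective being optimised is concrete. The toy graph and the function name `modularity` are assumptions. Tools also define "resolution" differently (in some a lower value yields more communities, in others fewer), so the 0.5 quoted above should not be read as `gamma = 0.5` without checking the tool's convention.

```python
def modularity(edges, partition, gamma=1.0):
    """Weighted modularity Q = sum_c [ W_c / m - gamma * (D_c / (2m))^2 ].

    edges: list of (u, v, weight) for an undirected graph without self-loops.
    partition: dict mapping node -> community label.
    """
    m = sum(w for _, _, w in edges)          # total edge weight
    internal = {}                            # W_c: edge weight inside community c
    strength = {}                            # D_c: summed weighted degree of c
    for u, v, w in edges:
        cu, cv = partition[u], partition[v]
        strength[cu] = strength.get(cu, 0.0) + w
        strength[cv] = strength.get(cv, 0.0) + w
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + w
    return sum(
        internal.get(c, 0.0) / m - gamma * (d / (2 * m)) ** 2
        for c, d in strength.items()
    )


# Toy graph: two triangles joined by a single bridge edge (all weights 1).
toy_edges = [
    ("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
    ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),
    ("c", "d", 1),
]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

q_two = modularity(toy_edges, two_groups)
q_one = modularity(toy_edges, one_group)
print(f"two communities: Q = {q_two:.3f}")
print(f"one community:   Q = {q_one:.3f}")
```

Splitting the toy graph at the bridge scores higher than lumping everything together, which is the kind of comparison Louvain makes repeatedly while it merges and moves nodes.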
Genre Evolution (1995–2020)
- "Action" and "Indie" are the dominant parent genres; most emerging sub-genres trace back to one of them.
- Violent, Casual, Adventure, RPG, and Strategy all emerged from Action/Indie in the 2000s.
- "Free to Play" was first incorporated by Indie games, spawning the Massively Multiplayer cluster.
- Non-gaming genres (Education, Utilities, etc.) emerged independently but were later absorbed into mainstream gaming genres.
- "Early Access" appeared from 2010 onward as developers sought community feedback loops.
- Racing gave birth to the Sports sub-genre, a separate evolutionary branch from the Action/Indie lineage.
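The temporal slicing behind these observations amounts to bucketing games by release year and noting the window in which each genre first appears. A minimal sketch, assuming half-open 5-year windows starting in 1995 (the exact boundaries used in the thesis are not stated here) and a handful of invented releases:

```python
# Hypothetical (release_year, genres) records; not taken from the dataset.
releases = [
    (1998, ["Action"]),
    (2003, ["Action", "Strategy"]),
    (2008, ["Indie", "Casual"]),
    (2013, ["Indie", "Early Access"]),
    (2018, ["Action", "Indie", "Free to Play"]),
]

START, WIDTH = 1995, 5  # assumed window layout: [1995, 2000), [2000, 2005), ...


def window(year):
    """Return the (start, end) of the 5-year window containing `year`."""
    lo = START + ((year - START) // WIDTH) * WIDTH
    return (lo, lo + WIDTH)


first_seen = {}          # genre -> earliest window in which it appears
genres_by_window = {}    # window -> set of genres present in that window
for year, genres in releases:
    w = window(year)
    genres_by_window.setdefault(w, set()).update(genres)
    for g in genres:
        if g not in first_seen or w < first_seen[g]:
            first_seen[g] = w

for g, w in sorted(first_seen.items(), key=lambda kv: kv[1]):
    print(f"{g:13s} first appears in {w[0]}-{w[1] - 1}")
```

Building one genre network per entry in `genres_by_window` and comparing consecutive windows is one way to trace which existing genres a newcomer attaches to first.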
What Players Engage With
- Action games received the highest net ratings on the platform; Indie games came second.
- Indie games had the highest median playtime, likely due to their deep storylines and experimental gameplay that reward extended investment.
- The combination of high ratings and high playtime makes Indie + Action the twin pillars of Steam engagement.
Regression Results
Predicting Average_Playtime from game properties proved difficult: the available features account for very little of the variability in engagement. Of the three models, Ridge Regression generalised best, with the lowest RMSE under 10-fold cross-validation.
The top feature for predicting average playtime was ratings_ratio (importance 0.20), followed by release year, achievements count, and price.
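To make the evaluation concrete, here is a small sketch of 10-fold cross-validated RMSE and R² for a ridge model. It is deliberately reduced to a single feature with a closed-form fit, and it runs on synthetic data in which noise swamps the signal. The data, the penalty `alpha`, and all function names are assumptions; it does not reproduce the thesis models or their scores.

```python
import math
import random

random.seed(0)

# Synthetic data: a weak linear signal buried in noise (assumed, for illustration).
n = 500
x = [random.random() for _ in range(n)]              # stand-in for one feature
y = [2.0 * xi + random.gauss(0, 5) for xi in x]      # stand-in for playtime


def fit_ridge(xs, ys, alpha):
    """One-feature ridge with an unpenalised intercept (closed form)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sxy / (sxx + alpha)
    return slope, my - slope * mx


def cross_validate(xs, ys, k=10, alpha=1.0):
    """Return (RMSE, R^2) computed over all out-of-fold predictions."""
    idx = list(range(len(xs)))
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_err = 0.0
    for fold in folds:
        held_out = set(fold)
        train_x = [xs[i] for i in idx if i not in held_out]
        train_y = [ys[i] for i in idx if i not in held_out]
        slope, intercept = fit_ridge(train_x, train_y, alpha)
        sq_err += sum((ys[i] - (slope * xs[i] + intercept)) ** 2 for i in fold)
    mean_y = sum(ys) / len(ys)
    total = sum((v - mean_y) ** 2 for v in ys)
    return math.sqrt(sq_err / len(ys)), 1 - sq_err / total


rmse, r2 = cross_validate(x, y)
print(f"10-fold CV RMSE = {rmse:.2f}, R^2 = {r2:.3f}")
```

On data like this the out-of-fold R² should sit close to zero, the same qualitative pattern as the low R² reported here: a model can be the best of its peers by RMSE and still explain very little variance.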
Honestly, the R² of ~2% surprised me at first. I expected static game properties to explain a decent chunk of playtime. But once I thought about it, it made sense: what actually drives engagement (streamer hype, patch updates, social trends, seasonal sales) is dynamic and community-driven. The dataset only captured snapshots of fixed properties. This was the biggest lesson from the project: sometimes the most valuable finding is learning that your features don't capture what matters, because it points you toward what does.
Key Findings
- Indie and Casual are the most structurally central genres; every other genre co-occurs with them.
- "Massively Multiplayer" and "Free to Play" form their own community, which suggests shared player interaction models and monetisation strategies.
- ratings_ratio is the most important feature for predicting playtime, although overall predictive power is weak; lower-priced or free-to-play games show higher engagement on average.
- Games with more achievements tend to show higher playtime, plausibly because achievements improve in-game motivation; the models show an association rather than a causal effect.
- Regression R² values stayed below 2%, suggesting community-driven factors (streamers, social trends, patches) dominate over static game properties.
- A heavily skewed playtime distribution (few blockbusters vs. many niche titles) is a further barrier to regression accuracy.
What I'd do differently
The network analysis side of this project holds up well. If I redid it, I'd keep the bipartite projection and Louvain clustering largely as-is. The part I'd rethink completely is the regression.
I went in assuming that genre, price, and achievements would explain engagement. They don't, at least not from a static snapshot. If I did this again, I'd scrape time-series data: player counts over weeks, patch dates, discount events, Twitch viewer numbers. Those dynamic signals are what actually move playtime. The static feature approach was the wrong tool for the question I was asking.
I'd also try graph neural networks on the genre network instead of just computing centrality scores. Centrality gives you rankings, but GNNs could learn richer interaction patterns between genres and possibly feed those into the engagement prediction as embeddings.
Limitations & Future Work
- The dataset cuts off at 2019, so the post-pandemic gaming boom and live-service genre growth are not captured.
- Adding game tag data alongside genre data could reveal finer sub-genre dynamics.
- Temporal regression (time-series of playtime per game) could outperform static predictors.
- Graph neural networks on the genre graph could learn richer genre interaction embeddings than centrality scores alone.