Network Analysis · ML · Case Study

Machine Learning the Steam Video Games Database

How do video game genres evolve on Steam? This thesis combines bipartite network analysis, community detection, and regression to understand genre influence and predict player engagement.

Python · NetworkX · Gephi · Ridge Regression · Random Forest · Louvain Algorithm

  • 27,075 games analysed
  • 29 genre nodes
  • 277 genre-genre edges
  • 8 communities detected


Overview

The Steam platform hosts over 27,000 games spanning dozens of genres. This thesis explores how those genres co-evolve, which are most influential, and what game properties best explain player engagement, using a mix of network science and machine learning.

The core contribution is a bipartite genre network connecting games to their genres, projected into a weighted, unipartite genre-to-genre graph. That graph is then analysed with centrality measures, Louvain community detection, and temporal slicing across five eras (1995–2020). A regression pipeline on game properties attempts to predict average playtime.

Dataset

Source

Steam Store Games dataset by Nik Davis on Kaggle (CC BY 4.0). Temporal coverage through February 2019. 27,075 unique game entries with 18 columns.

Key Features

Game ID, name, release date, English availability, developer/publisher, platforms, genres, categories, tags, achievements, positive/negative ratings, median and average playtime, owner range, and price.

Pre-processing

The dataset contained no duplicate or null values. Multi-genre entries (semicolon-separated) were exploded into one row per game-genre pair. Correlation analysis flagged median playtime as a leaky variable, so it was dropped before regression.
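
The genre explosion step can be sketched with pandas. This is a minimal illustration on toy rows, not the thesis code. The column names `name` and `genres` and the sample values are assumptions.

```python
import pandas as pd

# Toy rows; column names and values are assumed for illustration only.
games = pd.DataFrame(
    {
        "name": ["Game A", "Game B", "Game C"],
        "genres": ["Action;Indie", "Casual", "Indie;RPG;Strategy"],
    }
)

# Split the semicolon-separated string into a list, then emit one row per genre.
pairs = (
    games.assign(genre=games["genres"].str.split(";"))
    .explode("genre")
    .drop(columns="genres")
    .reset_index(drop=True)
)

print(pairs)
```

Each row of `pairs` is one game-genre membership, which is the edge list of the bipartite network.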

Network Size

29 genre nodes and 277 edges in the projected genre network. Graph density of 0.682, meaning most genres co-occur with most others.
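
The density figure follows directly from the node and edge counts, assuming the projected graph is simple and undirected (no self-loops, one edge per genre pair). The helper name `undirected_density` is hypothetical.

```python
def undirected_density(n_nodes: int, n_edges: int) -> float:
    """Density of a simple undirected graph: edges present / edges possible."""
    max_edges = n_nodes * (n_nodes - 1) // 2  # 29 nodes -> 406 possible pairs
    return n_edges / max_edges


density = undirected_density(29, 277)
print(f"{density:.3f}")  # 277 / 406
```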

Methodology

The pipeline had four sequential stages:

  • EDA & Pre-processing: correlation matrix, distribution checks, genre explosion into separate rows.
  • Bipartite Network: games and genres as two node types; edges represent game-genre membership. Projected to a unipartite genre network via matrix multiplication in Gephi's Multimode plugin.
  • Network Analysis: centrality measures (degree, betweenness, closeness, eigenvector), temporal genre evolution across 5-year windows, Louvain community detection, node attributes (net ratings, median playtime).
  • Regression: Ridge, Decision Tree, and Random Forest models trained to predict Average_Playtime. Features include dummy-encoded platforms, genres, categories, owners, price, achievements, and a computed ratings_ratio. 10-fold cross-validation for model selection.
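
The projection in the Bipartite Network stage comes down to one matrix product. The sketch below shows the idea in NumPy on a toy game-by-genre matrix. It illustrates the technique and does not reproduce the Gephi plugin's output. The matrix `B` and the genre order are made up.

```python
import numpy as np

genres = ["Action", "Indie", "RPG"]

# Rows are games, columns are genres; 1 marks membership (toy data).
B = np.array(
    [
        [1, 1, 0],  # game 1: Action, Indie
        [0, 1, 1],  # game 2: Indie, RPG
        [1, 1, 1],  # game 3: Action, Indie, RPG
    ]
)

# (B^T B)[i, j] counts the games tagged with both genre i and genre j.
W = B.T @ B

# The diagonal holds per-genre game counts; zero it to drop self-loops.
np.fill_diagonal(W, 0)

print(W)
```

Off-diagonal entries of `W` are the edge weights of the unipartite genre graph: here Action and Indie share two games, Action and RPG share one.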

Centrality Results: Spotlight Genres

Indie and Casual genres are connected to every other genre in the network (degree centrality = 1.0), meaning no genre on Steam exists in isolation from them.

  • Indie: degree 1.0 (connected to every genre)
  • Casual: degree 1.0 (connected to every genre)
  • Free to Play: degree 0.93, closeness 0.93
  • Strategy: degree 0.93, closeness 0.93
  • RPG: degree 0.93, closeness 0.93
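
Degree and closeness both coming out at 0.93 is what the arithmetic predicts. Because Indie and Casual touch every genre, any two genres are at most two hops apart, so a genre with k of 28 possible neighbours has closeness 28 / (56 - k). The sketch below checks this with NetworkX on a stand-in graph. It assumes 0.93 corresponds to 26 neighbours (the only integer count that rounds to it), and the integer node labels are placeholders, not the real genre graph.

```python
import networkx as nx

N = 29  # genre nodes, as reported above
G = nx.Graph()
G.add_nodes_from(range(N))

hub, near_hub = 0, 1  # stand-ins for a degree-1.0 genre and a degree-0.93 genre

# The hub is adjacent to every other node.
G.add_edges_from((hub, v) for v in range(1, N))

# near_hub gets 25 more neighbours (26 in total), missing nodes 27 and 28.
G.add_edges_from((near_hub, v) for v in range(2, 27))

degree = nx.degree_centrality(G)        # neighbours / (N - 1)
closeness = nx.closeness_centrality(G)  # (N - 1) / sum of shortest-path lengths

print(round(degree[near_hub], 3), round(closeness[near_hub], 3))
```

For `near_hub`, degree centrality is 26/28 and closeness is 28/30: 26 nodes at distance one plus 2 nodes at distance two via the hub.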

Community Detection

Louvain modularity clustering at resolution 0.5 identified 8 communities:
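
For readers who want to try Louvain in Python, here is a minimal sketch using NetworkX's `louvain_communities` on a toy weighted graph. The node labels and weights are made up. The resolution is left at NetworkX's default of 1.0 rather than 0.5, because tools define the resolution scale differently (in NetworkX, values below 1 favour larger communities), so a value from one tool does not necessarily transfer to another.

```python
from itertools import combinations

import networkx as nx
from networkx.algorithms.community import louvain_communities

# Two tightly knit toy groups joined by a single weak edge.
group_a = ["a1", "a2", "a3", "a4"]
group_b = ["b1", "b2", "b3", "b4"]

G = nx.Graph()
for group in (group_a, group_b):
    G.add_edges_from(combinations(group, 2), weight=5)  # strong within-group ties
G.add_edge("a1", "b1", weight=1)  # weak bridge between the groups

# seed fixes the node visiting order so the run is repeatable.
communities = louvain_communities(G, weight="weight", resolution=1.0, seed=42)
partition = sorted(sorted(c) for c in communities)

print(partition)
```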

Cluster 1

Game Development, Utilities, Education, Audio/Photo/Video Production, Animation, Web Publishing, Accounting, Software Training, Design & Illustration

Cluster 2

Nudity, Violent, Gore, Sexual Content

Cluster 3

Sports, Racing

Cluster 4

Action, Indie

Cluster 5

Strategy, Simulation

Cluster 6

Massively Multiplayer, Free to Play

Cluster 7

Early Access, RPG

Cluster 8

Adventure, Casual

Genre Evolution (1995–2020)

  • "Action" and "Indie" are the dominant parent genres; most emerging sub-genres trace back to one of them.
  • Violent, Casual, Adventure, RPG, and Strategy all emerged from Action/Indie in the 2000s.
  • "Free to Play" was first incorporated by Indie games, spawning the Massively Multiplayer cluster.
  • Non-gaming genres (Education, Utilities, etc.) emerged independently but were later absorbed into mainstream gaming genres.
  • "Early Access" appeared from 2010 onward as developers sought community feedback loops.
  • Racing gave birth to the Sports sub-genre, a separate evolutionary branch from the Action/Indie lineage.
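
Tracing evolution like this depends on slicing the co-occurrence network by release era. A minimal sketch of that slicing follows. It assumes 5-year windows starting in 1995, and the games and the helper `era_start` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# (release year, genres) for a few made-up games.
games = [
    (1998, ["Action"]),
    (2004, ["Action", "Strategy"]),
    (2012, ["Indie", "Early Access"]),
    (2013, ["Action", "Indie"]),
    (2017, ["Indie", "Action", "RPG"]),
]


def era_start(year: int, first: int = 1995, width: int = 5) -> int:
    """Return the first year of the window that contains `year`."""
    return first + ((year - first) // width) * width


# One co-occurrence counter per era; keys are alphabetically sorted genre pairs.
edges_by_era = defaultdict(Counter)
for year, genres in games:
    for pair in combinations(sorted(genres), 2):
        edges_by_era[era_start(year)][pair] += 1

for era in sorted(edges_by_era):
    print(era, dict(edges_by_era[era]))
```

Comparing which genre pairs first appear in each era's counter is one way to see when a genre starts co-occurring with an established one.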

What Players Engage With

  • Action games received the highest net ratings on the platform; Indie games came second.
  • Indie games had the highest median playtime, likely due to their deep storylines and experimental gameplay that reward extended investment.
  • The combination of high ratings and high playtime makes Indie + Action the twin pillars of Steam engagement.

Regression Results

Predicting Average_Playtime from game properties proved difficult. Most variability in engagement is explained by factors outside the available feature set. Ridge Regression was the best model by cross-validated RMSE.

  • Ridge Regression: R² 0.019, RMSE 3198, MAE 275
  • Random Forest: R² 0.003, RMSE 3224, MAE 259
  • Decision Tree: R² -0.35, RMSE 3753, MAE 303
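
A negative R², as for the Decision Tree, means the model did worse than always predicting the mean playtime. The sketch below shows this with the standard definition R² = 1 - SS_res / SS_tot on made-up numbers. The function `r_squared` is a hand-rolled helper for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [0, 10, 20, 30]
baseline = [15, 15, 15, 15]  # always predict the mean
overshoot = [30, 0, 30, 0]   # confident but badly wrong predictions

print(r_squared(y_true, baseline))   # mean predictor scores exactly zero
print(r_squared(y_true, overshoot))  # worse than the mean, so negative
```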

10-fold cross-validation confirmed Ridge as the strongest generaliser:

  • Ridge Regression (10-fold CV): mean RMSE 1405
  • Random Forest (10-fold CV): mean RMSE 1610
  • Decision Tree (10-fold CV): mean RMSE 2160
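
A comparison like this can be set up in scikit-learn roughly as follows. This is a sketch on synthetic data with a deliberately weak signal, not the thesis pipeline. The features, hyperparameters, and random seeds are all placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: one weakly informative feature plus heavy noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 0.3 * X[:, 0] + rng.normal(scale=3.0, size=300)

models = {
    "Ridge Regression": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
}

# The same 10 folds are reused for every model so the scores are comparable.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

mean_rmse = {}
for name, model in models.items():
    # scikit-learn reports negated RMSE so that higher is always better.
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    mean_rmse[name] = -scores.mean()

for name, rmse in sorted(mean_rmse.items(), key=lambda item: item[1]):
    print(f"{name}: mean RMSE {rmse:.2f}")
```

Reusing one `KFold` object across models keeps the fold assignments identical, so differences in mean RMSE reflect the models rather than the splits.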

The top feature for predicting average playtime was ratings_ratio (importance 0.20), followed by release year, achievements count, and price.

Honestly, the R² of ~2% surprised me at first. I expected static game properties to explain a decent chunk of playtime. But once I thought about it, it made sense: what actually drives engagement (streamer hype, patch updates, social trends, seasonal sales) is dynamic and community-driven. The dataset only captured snapshots of fixed properties. This was the biggest lesson from the project: sometimes the most valuable finding is learning that your features don't capture what matters, because it points you toward what does.

Key Findings

  • Indie and Casual are the most structurally central genres; every other genre co-occurs with them.
  • "Massively Multiplayer" and "Free to Play" have become synonymous, sharing player interaction models and monetisation strategies.
  • Ratings are the strongest predictor of playtime; lower-priced or free-to-play games show higher engagement on average.
  • Games with achievements tend to show higher playtime, plausibly because achievements improve in-game motivation.
  • Regression R² values stayed below 2%, suggesting community-driven factors (streamers, social trends, patches) dominate over static game properties.
  • A heavily skewed playtime distribution (few blockbusters vs. many niche titles) is the primary barrier to regression accuracy.

What I'd do differently

The network analysis side of this project holds up well. If I redid it, I'd keep the bipartite projection and Louvain clustering largely as-is. The part I'd rethink completely is the regression.

I went in assuming that genre, price, and achievements would explain engagement. They don't, at least not from a static snapshot. If I did this again, I'd scrape time-series data: player counts over weeks, patch dates, discount events, Twitch viewer numbers. Those dynamic signals are what actually move playtime. The static feature approach was the wrong tool for the question I was asking.

I'd also try graph neural networks on the genre network instead of just computing centrality scores. Centrality gives you rankings, but GNNs could learn richer interaction patterns between genres and possibly feed those into the engagement prediction as embeddings.

Limitations & Future Work

  • The dataset cuts off at 2019, so the post-pandemic gaming boom and live-service genre growth are not captured.
  • Adding game tag data alongside genre data could reveal finer sub-genre dynamics.
  • Temporal regression (time-series of playtime per game) could outperform static predictors.
  • Graph neural networks on the genre graph could learn richer genre interaction embeddings than centrality scores alone.