What 27,000 Steam Games Reveal About Genre Evolution
Steam has over 27,000 games, each tagged with one or more genres. I wanted to know which genres tend to appear together, and whether that structure changes over time. So I built a bipartite network connecting games to their genres, projected it into a weighted genre-to-genre graph, and ran centrality and community detection on it. I also tried predicting playtime with regression. The network stuff worked. The regression was a near-total failure, and honestly that ended up being the more useful result.
Why model genres as a network?
Most analyses of the Steam catalog treat genres as flat labels. A game is "Action" or "RPG" or both, you count them up, make a bar chart. Fine for a quick overview, but it throws away the relationships between genres. Which ones tend to appear together? Which ones bridge otherwise separate clusters?
A bipartite graph (two node types: games and genres, with edges between them) lets you project into a genre-only graph where edge weights equal the number of shared games. From there, centrality tells you which genres are most connected, community detection finds natural clusters, and temporal slicing shows how the whole structure shifted over 25 years.
The dataset
The data comes from Nik Davis's Steam Store Games dataset on Kaggle (CC BY 4.0), covering games through February 2019. 27,075 unique entries with 18 columns: app ID, name, release date, developer, publisher, platforms, genres, categories, tags, achievements, positive/negative ratings, average and median playtime, owner ranges, and price.
The part that matters for network analysis: genres are semicolon-separated strings. A single game might be tagged Action;RPG;Indie. That multi-label structure is what makes the bipartite approach work. One game with three genres creates three edges.
No nulls, no duplicates. Surprisingly clean for a Kaggle dataset. The only preprocessing was exploding the multi-genre column so each game-genre pair got its own row.
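To make the explode step concrete, here is a toy sketch. The two games and their tags are made up for illustration; only the `split` + `explode` pattern mirrors the real preprocessing.

```python
import pandas as pd

# Toy stand-in for the real table; both rows are invented.
toy = pd.DataFrame({
    "name": ["Game A", "Game B"],
    "genres": ["Action;RPG;Indie", "Casual"],
})

# Split each genre string into a list, then give every list item its own row.
pairs = toy.assign(genres=toy["genres"].str.split(";")).explode("genres")
print(pairs)
```

"Game A" ends up on three rows (one per genre) and "Game B" on one, which is exactly the game-genre pair shape the bipartite edges need.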
import pandas as pd
steam = pd.read_csv('steam.csv')
steam['release_date'] = pd.to_datetime(steam['release_date'])
# Explode semicolon-separated genres into one row per game-genre pair
genres = steam.assign(genres=steam['genres'].str.split(';')).explode('genres')
genres.to_csv('genres.csv', index=False)
Building the bipartite network
Each game becomes a node, each genre becomes a node, and edges connect games to their genres. In NetworkX you just tell it which nodes are "Games" and which are "Genres."
import networkx as nx
B = nx.Graph()
B.add_nodes_from(genres['name'].unique(), bipartite='Games')
B.add_nodes_from(genres['genres'].unique(), bipartite='Genres')
edges = [(row['genres'], row['name']) for _, row in genres.iterrows()]
B.add_edges_from(edges)
nx.write_gexf(B, "steam_bipartite.gexf")
This gives you ~27,000 game nodes and 29 genre nodes. The game nodes have wildly different degrees (one genre = degree 1, five genres = degree 5), while genre nodes have massive degrees because thousands of games connect to each one.
You can visualize the bipartite graph in Gephi, but it's hard to analyze directly because the two node types sit at such different scales. The projection is where it gets useful.
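One thing worth guarding against when node keys are plain names: two games with the same title collapse into one node, and a game whose title matches a genre name collides with that genre. A possible workaround, sketched on made-up rows and assuming the `appid` column is carried through the exploded table, is to prefix the keys (the `game:` and `genre:` prefixes are my own convention, not part of the original pipeline):

```python
import networkx as nx
import pandas as pd

# Toy rows in the exploded shape; appids and titles are invented.
# Note the game titled "Action" and the two different games titled "Remake".
toy = pd.DataFrame({
    "appid": [1, 1, 2, 3],
    "name": ["Action", "Action", "Remake", "Remake"],
    "genres": ["Action", "Indie", "Indie", "Casual"],
})

# Prefixed string keys keep the two node types in separate namespaces.
game_nodes = [f"game:{appid}" for appid in toy["appid"].unique().tolist()]
genre_nodes = [f"genre:{g}" for g in toy["genres"].unique()]

B_safe = nx.Graph()
B_safe.add_nodes_from(game_nodes, bipartite="Games")
B_safe.add_nodes_from(genre_nodes, bipartite="Genres")
B_safe.add_edges_from(
    (f"game:{appid}", f"genre:{g}") for appid, g in zip(toy["appid"], toy["genres"])
)

print(B_safe.number_of_nodes(), B_safe.number_of_edges())
```

With name-based keys the same rows would produce only four nodes, one of them linked to itself.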
Projecting to a genre-only graph
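The projection collapses the bipartite graph down to just genres. Two genres get an edge if at least one game belongs to both, and the weight is how many games they share. Mathematically, it's the biadjacency matrix (genres as rows, games as columns) multiplied by its transpose: the off-diagonal entries are the shared-game counts, and the diagonal holds the number of games per genre.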
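As a sketch of that equivalence on a toy graph (three made-up games, three genres), NetworkX's built-in `weighted_projected_graph` and the matrix product give the same weights. This is an illustration, not the script that produced the real edgelist.

```python
import networkx as nx
import numpy as np
from networkx.algorithms import bipartite

# Toy bipartite graph; the games g1..g3 are invented.
games = ["g1", "g2", "g3"]
genre_names = ["Action", "Indie", "RPG"]
B_toy = nx.Graph()
B_toy.add_nodes_from(games, bipartite=0)
B_toy.add_nodes_from(genre_names, bipartite=1)
B_toy.add_edges_from([
    ("g1", "Action"), ("g1", "Indie"),
    ("g2", "Action"), ("g2", "Indie"), ("g2", "RPG"),
    ("g3", "Indie"),
])

# Built-in projection: edge weight = number of shared neighbours (games).
P = bipartite.weighted_projected_graph(B_toy, genre_names)

# Matrix view: rows are genres, columns are games, 1 where the game has the genre.
M = np.array([[1 if B_toy.has_edge(genre, game) else 0 for game in games]
              for genre in genre_names])
co = M @ M.T  # off-diagonal = shared games, diagonal = games per genre

print(co)
print(sorted(P.edges(data="weight")))
```

On the real data, the `P.edges(data="weight")` triples would be the rows of a Source/Target/Weight edgelist.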
In practice, I prepared the projected edgelist as a CSV with Source, Target, and Weight columns, then loaded it into NetworkX:
df_genres = pd.read_csv('genres-fixed.csv')
G = nx.from_pandas_edgelist(
df_genres, source='Source', target='Target', edge_attr='Weight'
)
print(f"Nodes: {G.number_of_nodes()}") # 29
print(f"Edges: {G.number_of_edges()}") # 277
print(f"Density: {nx.density(G):.3f}") # 0.682
29 nodes, 277 edges, density 0.682. Two-thirds of all possible genre pairs co-occur in at least one game. Dense, but not surprising when you consider how liberally Steam games get tagged, and that some genres (Indie, Action) appear on almost everything.
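The density figure can be checked by hand: for an undirected graph it is the edge count divided by the number of possible node pairs. A quick sketch using only the numbers reported above:

```python
n_nodes, n_edges = 29, 277  # figures reported for the projected genre graph

# Every unordered pair of distinct genres is a possible edge.
possible_pairs = n_nodes * (n_nodes - 1) // 2
density = n_edges / possible_pairs

print(possible_pairs, round(density, 3))  # 406 0.682
```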
Centrality: which genres hold the network together
I computed four centrality measures: degree (how connected), betweenness (how much shortest-path traffic flows through a node), closeness (average distance to everyone else), and eigenvector (connected to other well-connected nodes).
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
closeness = nx.closeness_centrality(G)
eigenvector = nx.eigenvector_centrality(G, max_iter=1000)
# Sort by degree
for genre, score in sorted(degree.items(), key=lambda x: -x[1])[:5]:
    print(f"{genre}: {score:.3f}")
Indie and Casual both came back with degree centrality of 1.0. They're connected to every other genre in the network: there is no genre on Steam that doesn't share at least one game with each of them. Free to Play, Strategy, and RPG are close behind at 0.93.
| Genre | Degree | Betweenness | Closeness | Eigenvector |
|---|---|---|---|---|
| Indie | 1.000 | 0.021 | 1.000 | 0.218 |
| Casual | 1.000 | 0.021 | 1.000 | 0.207 |
| Free to Play | 0.929 | 0.009 | 0.933 | 0.199 |
| Strategy | 0.929 | 0.009 | 0.933 | 0.204 |
| RPG | 0.929 | 0.009 | 0.933 | 0.203 |
| Action | 0.893 | 0.005 | 0.903 | 0.210 |
| Simulation | 0.893 | 0.005 | 0.903 | 0.199 |
Top genres by centrality measures across the projected network
Betweenness is low everywhere, which is a side effect of the density. When most nodes already connect directly to each other, shortest paths don't need to route through intermediaries. In a sparser network, betweenness would be more informative.
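The density effect is easy to reproduce on two toy graphs: a complete graph, where every pair is directly linked, and a chain, where most shortest paths have to pass through the middle. A minimal sketch (the 5-node graphs are arbitrary examples, not derived from the Steam data):

```python
import networkx as nx

# Two 5-node extremes: every pair linked directly vs. a single chain.
dense = nx.complete_graph(5)
sparse = nx.path_graph(5)

bc_dense = nx.betweenness_centrality(dense)
bc_sparse = nx.betweenness_centrality(sparse)

print(bc_dense)   # every node scores 0.0: no shortest path needs an intermediary
print(bc_sparse)  # the middle node (2) scores highest, since most paths cross it
```

The genre graph sits much closer to the first case than the second, which is why the betweenness column is nearly flat. The same kind of sanity check works for the degree column: degree centrality is degree divided by n - 1, so with 29 nodes a score of 0.929 corresponds to 26 of 28 possible neighbors.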
Community detection with Louvain
The Louvain algorithm optimizes modularity to find groups of nodes that are more densely connected to each other than to the rest of the network. Running it in Gephi at resolution 0.5 with edge weights on produced 8 communities with a modularity score of 0.273.
Moderate modularity, expected given the density. But the clusters themselves are sensible:
- Creative Tools: Game Dev, Utilities, Education, Audio/Photo/Video Production, Animation, Design & Illustration
- Mature Content: Nudity, Violent, Gore, Sexual Content
- Competitive Physical: Sports, Racing
- Core Gaming: Action, Indie
- Systems & Sandbox: Strategy, Simulation
- Live Service: Massively Multiplayer, Free to Play
- Emerging & Deep: Early Access, RPG
- Accessible: Adventure, Casual
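The communities above came from Gephi. For anyone who prefers to stay in Python, NetworkX ships its own Louvain implementation; the sketch below runs it on a made-up weighted graph with two obvious groups. Treat it as an alternative route, not a reproduction: results can differ between tools and between random seeds, and the two tools don't necessarily define `resolution` the same way, so 0.5 in Gephi should not be assumed to mean the same thing here.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

# Toy weighted graph: two tight triangles joined by one weak edge (weights invented).
G_toy = nx.Graph()
G_toy.add_weighted_edges_from([
    ("a1", "a2", 10), ("a1", "a3", 8), ("a2", "a3", 9),
    ("b1", "b2", 7), ("b1", "b3", 6), ("b2", "b3", 5),
    ("a3", "b3", 1),
], weight="Weight")

# Fix the seed so repeated runs give the same partition.
communities = louvain_communities(G_toy, weight="Weight", resolution=1.0, seed=42)
score = modularity(G_toy, communities, weight="Weight")

print([sorted(c) for c in communities], round(score, 3))
```

On the real projected graph the call would take `G` and the same `weight="Weight"` argument, since that is the edge attribute name used when loading the edgelist.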
The average clustering coefficient came out to about 0.82. If genre A co-occurs with B and C, then B and C very likely co-occur too. Combined with the short paths that come with such a dense graph, this makes the genre space on Steam tightly interconnected, a "small world" in network terms.
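Both halves of the small-world picture, clustering and path length, are one-liners in NetworkX. A minimal sketch on a four-node toy graph (a triangle with one pendant node), small enough to verify by hand:

```python
import networkx as nx

# Triangle A-B-C with a pendant node D hanging off C.
T = nx.Graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])

# Per-node clustering: A and B are 1.0, C is 1/3, D is 0 (only one neighbor).
avg_clustering = nx.average_clustering(T)

# Mean shortest-path length over all node pairs (the graph must be connected).
avg_path = nx.average_shortest_path_length(T)

print(round(avg_clustering, 3), round(avg_path, 3))  # 0.583 1.333
```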
"Massively Multiplayer" and "Free to Play" landing in the same community was expected but still worth noting. At this point they're basically the same genre wearing different labels. Same player interaction model, same monetization, and the network structure confirms it.
How the network evolved over 25 years
To track genre evolution, I sliced the dataset into four temporal windows and built a separate bipartite network for each. The slices: 1995-2005, 2005-2010, 2010-2015, and 2015-2020.
# Temporal slicing example. Slice the exploded game-genre table (`genres`),
# not `steam`, so each row carries a single genre rather than "Action;RPG;Indie".
df_95_05 = genres[
    (genres['release_date'] >= '1995-01-01') &
    (genres['release_date'] < '2005-01-01')
]
# Build bipartite graph for this era
B = nx.Graph()
B.add_nodes_from(df_95_05['name'].unique(), bipartite='Games')
B.add_nodes_from(df_95_05['genres'].unique(), bipartite='Genres')
edges = [(row['genres'], row['name']) for _, row in df_95_05.iterrows()]
B.add_edges_from(edges)
nx.write_gexf(B, "95_05_steam.gexf")
In the earliest window, Action and Indie are the only real hubs. Through the 2000s, Strategy, RPG, Casual, and Adventure grow out of those two. Free to Play shows up in the 2010s, initially attached to Indie games, and quickly fuses with Massively Multiplayer into its own cluster. The overall pattern is branching followed by re-convergence.
- Racing was an independent branch. Unlike most genres that trace back to Action/Indie, Racing evolved separately and gave rise to Sports as a sub-genre.
- Non-gaming genres got absorbed. Education and Utilities started as independent categories but were pulled into mainstream gaming clusters over time.
- Early Access appeared post-2010 as developers started using community feedback loops as a development model. It clustered with RPG, likely because RPGs are the genre most associated with long development cycles and community involvement.
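The four windows can be built in one loop rather than four copies of the same code. A minimal sketch on made-up rows in the exploded (one genre per row) shape; the titles, dates, and the `era_graphs` dictionary are illustrative, not taken from the original notebook:

```python
import networkx as nx
import pandas as pd

# Toy exploded table (one row per game-genre pair); titles and dates are invented.
toy = pd.DataFrame({
    "name": ["Old Shooter", "Old Shooter", "Mid Quest", "New Farm", "New Farm"],
    "genres": ["Action", "Indie", "RPG", "Casual", "Indie"],
    "release_date": pd.to_datetime(
        ["1999-06-01", "1999-06-01", "2007-03-15", "2016-09-30", "2016-09-30"]
    ),
})

# Half-open windows [start, end): a game released on a boundary lands in one slice only.
windows = [("1995", "2005"), ("2005", "2010"), ("2010", "2015"), ("2015", "2020")]

era_graphs = {}
for start, end in windows:
    in_window = (toy["release_date"] >= f"{start}-01-01") & (toy["release_date"] < f"{end}-01-01")
    era = toy[in_window]
    B_era = nx.Graph()
    B_era.add_nodes_from(era["name"].unique(), bipartite="Games")
    B_era.add_nodes_from(era["genres"].unique(), bipartite="Genres")
    B_era.add_edges_from(zip(era["genres"], era["name"]))
    era_graphs[f"{start}-{end}"] = B_era

for label, graph in era_graphs.items():
    print(label, graph.number_of_nodes(), graph.number_of_edges())
```

Keeping the graphs in a dictionary makes it straightforward to compare node and edge counts across eras, or to export each one to GEXF in a second loop.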
What genres drive engagement?
To tie the network analysis back to player behavior, I attached two node attributes to the genre graph: net ratings (positive minus negative) and median playtime, both summed across the games in each genre.
steam['net_ratings'] = steam['positive_ratings'] - steam['negative_ratings']
# Explode genres first so each game counts toward every genre it is tagged with;
# grouping the raw column would produce keys like "Action;RPG" that match no node.
per_genre = steam.assign(genres=steam['genres'].str.split(';')).explode('genres')
ratings_by_genre = per_genre.groupby('genres')['net_ratings'].sum()
playtime_by_genre = per_genre.groupby('genres')['median_playtime'].sum()
nx.set_node_attributes(G, ratings_by_genre.to_dict(), 'Net Ratings')
nx.set_node_attributes(G, playtime_by_genre.to_dict(), 'Median Play Time')
nx.write_gexf(G, "genre_attributes.gexf")
Action games had the highest net ratings. Indie came second. But for median playtime, the ranking flipped: Indie led, probably because Indie games tend toward experimental gameplay that keeps people playing longer. Action gets the most positive attention; Indie gets the most time.
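Because these per-genre totals are sums, they grow with the number of games in a genre as well as with how long each game is played. A quick way to separate the two effects is to put the count, sum, and mean side by side. The sketch below uses invented numbers purely to show the mechanics:

```python
import pandas as pd

# Toy table: five games tagged Indie, two tagged Action (playtimes are invented).
toy = pd.DataFrame({
    "name": ["g1", "g2", "g3", "g4", "g5", "g6"],
    "genres": ["Indie", "Indie", "Indie", "Indie", "Indie;Action", "Action"],
    "median_playtime": [30, 30, 30, 30, 40, 60],
})

pairs = toy.assign(genres=toy["genres"].str.split(";")).explode("genres")

# count = games per genre, sum = total playtime, mean = playtime per game.
per_genre = pairs.groupby("genres")["median_playtime"].agg(["count", "sum", "mean"])
print(per_genre)
```

In this toy case Indie leads on the sum while Action leads on the per-game mean, so it is worth checking both before reading a ranking as "games in this genre get played longer".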
I didn't expect this, but lower-priced and free-to-play games showed higher engagement on average. Zero barrier to entry probably helps, and cheap Indie games tend to attract audiences who actually care about the game rather than impulse-buying during a sale.
The regression: when your model teaches you the question was wrong
I also tried predicting average playtime from static game properties. One-hot encoded platforms, genres, categories, and owner tiers, plus numeric stuff: price, achievement count, release year, and a computed ratings_ratio.
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
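# Numeric features from the text; the ratings_ratio formula is an assumption (share of positive ratings)
steam['release_year'] = steam['release_date'].dt.year
total_ratings = steam['positive_ratings'] + steam['negative_ratings']
steam['ratings_ratio'] = (steam['positive_ratings'] / total_ratings).fillna(0)
numeric = steam[['price', 'achievements', 'release_year', 'ratings_ratio']]
# One-hot encode the semicolon-separated columns, one indicator per label
multi = [steam[col].str.get_dummies(sep=';').add_prefix(f'{col}_')
         for col in ['platforms', 'genres', 'categories']]
owners = pd.get_dummies(steam['owners'], prefix='owners')
# median_playtime, identifiers, and free-text columns are left out of X
X = pd.concat([numeric, *multi, owners], axis=1).astype(float)
y = steam['average_playtime']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)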
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.2, random_state=42
)
Three models: Ridge Regression (alpha=100), Random Forest, and Decision Tree. I dropped median_playtime because it leaks information about the target, and identifiers because they carry no predictive signal.
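For reference, a minimal sketch of how a three-model comparison like this can be scored. It runs on synthetic, mostly-noise data rather than the Steam table, and everything apart from `alpha=100` (the tree and forest settings, the seed, the variable names) is an assumption for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: 300 rows, 5 features, target is almost entirely noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = 0.1 * X_demo[:, 0] + rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    "Ridge Regression": Ridge(alpha=100),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
}

results = {}
for label, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    # scikit-learn reports this scorer as a negative number, so flip the sign.
    cv_scores = cross_val_score(model, X_tr, y_tr, cv=10,
                                scoring="neg_root_mean_squared_error")
    results[label] = {
        "r2": r2_score(y_te, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
        "cv_rmse": float(-cv_scores.mean()),
    }

for label, scores in results.items():
    print(label, {k: round(v, 3) for k, v in scores.items()})
```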
| Model | R² | RMSE | 10-fold CV RMSE |
|---|---|---|---|
| Ridge Regression | 0.019 | 3,198 | 1,405 |
| Random Forest | 0.003 | 3,224 | 1,610 |
| Decision Tree | -0.35 | 3,753 | 2,160 |
Ridge wins, but all models explain almost none of the variance
An R² of 0.019 means these features explain less than 2% of the variance in average playtime. The Decision Tree actually performed worse than predicting the mean for every game (negative R²). Ridge Regression was the "best" model, but only because regularization prevented it from fitting noise as aggressively as the tree models.
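A negative R² can look like a bug the first time you see one, so here is the arithmetic on four made-up playtimes. R² is one minus the ratio of the model's squared error to the squared error of always guessing the mean, which means any model that does worse than the mean lands below zero.

```python
def r_squared(actual, predicted):
    """1 - SS_res / SS_tot, the usual coefficient of determination."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

playtime = [0, 0, 0, 10]          # invented, skewed like the real target
mean_guess = [2.5, 2.5, 2.5, 2.5]  # always predict the mean
overshoot = [5, 5, 5, 5]           # a model that is confidently too high

print(r_squared(playtime, mean_guess))           # 0.0
print(round(r_squared(playtime, overshoot), 3))  # -0.333
```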
The top feature by importance was ratings_ratio (0.20), followed by release year, achievement count, and price. But even the most important feature barely moved the needle.
At first this felt like a failure. I expected genre, price, and achievements to explain at least something about why people keep playing. But the more I thought about it, the more it made sense. What drives playtime is streamer hype, patch cadence, seasonal sales, multiplayer network effects. None of that lives in a static CSV. The dataset is a snapshot of fixed properties. Engagement is a moving target.
There's also a skewed-target problem, the regression version of class imbalance. A handful of blockbusters have average playtimes in the thousands of minutes. Most games sit near zero. The distribution is so skewed that the models just learn to predict a low number and give up. The outliers blow up RMSE, but the features don't distinguish a future hit from the ten thousand games with similar metadata.
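A quick way to put numbers on that skew, and to see how much a log transform of the target would tame it, is sketched below. The playtime values are invented to mimic the shape described above, and the `log1p` transform is a common remedy I'm suggesting, not something the original experiment used:

```python
import numpy as np
import pandas as pd

# Invented target: mostly zeros, a few small values, two blockbusters.
playtime = pd.Series([0] * 90 + [10] * 8 + [5000, 20000])

share_zero = (playtime == 0).mean()   # fraction of games with no recorded playtime
raw_skew = playtime.skew()            # sample skewness of the raw target
log_skew = np.log1p(playtime).skew()  # skewness after log(1 + x)

print(share_zero, round(raw_skew, 2), round(log_skew, 2))
```

The transform reduces the skew but cannot remove the spike at zero, so it helps the error metric more than it helps the underlying problem that the features carry little signal.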
Worth trying next
Time-series playtime data instead of a single snapshot. If I could get weekly or monthly player counts per game, along with patch dates, discount events, and Twitch viewer numbers, those dynamic signals are much more likely to predict engagement than a fixed metadata table. The static feature approach was the wrong tool for the question I was asking.
Graph neural networks on the genre network. Computing centrality gives you rankings, but GNNs could learn richer interaction patterns between genres and possibly produce embeddings that feed into the engagement prediction. The 29-node genre graph is small enough that even a simple GCN could work without serious compute.
Incorporate game tags alongside genres. Steam tags are user-generated and much more granular than the 29 official genres. Tags like "roguelike", "souls-like", or "walking simulator" capture sub-genre dynamics that the coarse genre labels miss entirely. The bipartite network with tags instead of genres would have a much richer projected graph.
Updated data beyond 2019. The dataset cuts off before the pandemic-era gaming boom and the rise of live-service games. The genre landscape looks different now, and a fresh dataset would show whether the community structure shifted too.
Wrapping up
The network side of this project still looks right to me. The bipartite projection, centrality measures, and Louvain clusters all produce results that match how people actually think about game genres. Indie and Casual really are everywhere. The temporal slicing tells a clear story about 25 years of genre branching.
The regression taught me something different. Static metadata tells you what a game is. It doesn't tell you why someone plays it for hundreds of hours. That gap between what the data contains and what the question requires is worth knowing about early, before you spend weeks tuning hyperparameters on features that were never going to work.
If you're working on game analytics or network-based approaches to recommendation systems, feel free to reach out.