April 2026 · 10 min read

Learning ML and Deep Learning by Building Everything Twice

I learned machine learning by implementing things twice: once from scratch in NumPy, then again in PyTorch. Linear regression, K-Means, a full neural network in raw NumPy that gets 94% on MNIST, then the same ideas rebuilt with autograd and nn.Module. Then CNNs, RNNs, transformers. This post walks through what I built and what actually stuck.

Python · NumPy · PyTorch · scikit-learn · TensorBoard · Deep Learning
All implementations are in two repositories: ML-DL-implementations (algorithms + miniprojects) and PyTorch-course (structured fundamentals).

The approach: scratch first, framework second

When I first started learning ML, I jumped straight to model.fit() and figured I understood what was going on. Then someone asked me what the gradient of MSE loss looks like and I had nothing. I couldn't explain what was happening inside the model I'd just trained.

So I went back and implemented everything in NumPy first, where you have to write every matrix multiply and gradient update yourself. Then I'd rebuild it in PyTorch to learn the framework. The NumPy version forces understanding. The PyTorch version teaches you how to actually ship things.

[Figure: The path: classical ML, then NumPy from-scratch, then PyTorch, branching into vision, sequences, and attention. Stages shown: Regression (sklearn); Clustering (K-Means · DBSCAN); NumPy MLP (from scratch); PyTorch (fundamentals); CNNs (CIFAR · CelebA); RNNs (IMDB · Char LM); Miniprojects (MNIST · Mileage · Smile); Transformers (attention).]

Classical ML: getting the foundations right

Regression and clustering first, because they're small enough to fully understand and they force you through the basics: feature scaling, distance metrics, loss functions.

Regression on real housing data

Linear and polynomial regression on the Ames Housing dataset. On synthetic data, going from linear to quadratic dropped MSE from 570 to 61 and pushed R² from 0.83 to 0.98. On the full housing dataset with dozens of features, the gains were less clean. Picking which features to include and how to scale them mattered more than whether I used degree 2 or degree 3.

python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Quadratic features capture nonlinear relationships
poly = PolynomialFeatures(degree=2)
X_quad = poly.fit_transform(X)

model = LinearRegression()
model.fit(X_quad, y)

y_pred = model.predict(X_quad)
print(f"MSE: {mean_squared_error(y, y_pred):.2f}")    # 61.33
print(f"R²:  {r2_score(y, y_pred):.3f}")              # 0.982

Three flavors of clustering

Running K-Means, hierarchical clustering, and DBSCAN on the same data made the tradeoffs obvious. K-Means needs K upfront and assumes roughly spherical clusters. Hierarchical gives you a dendrogram so you can pick K after looking at the data. DBSCAN skips K entirely and handles weird shapes, but epsilon and min-samples need tuning.

The make_moons dataset is the best demonstration of why DBSCAN exists. K-Means splits the two crescents vertically. DBSCAN traces their actual shape.
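
Here's a minimal sketch of that comparison, assuming scikit-learn's make_moons with light noise. The sample count, noise level, eps, and min_samples are illustrative choices, not necessarily the values used in the repository. The adjusted Rand index scores each clustering against the true crescent labels.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents; y_true records which crescent each point belongs to
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

# K-Means needs K upfront and, for K=2, splits the plane with a straight boundary
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN grows clusters through dense neighborhoods, so it can follow curved shapes
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means the clustering matches the true crescents exactly
ari_km = adjusted_rand_score(y_true, km_labels)
ari_db = adjusted_rand_score(y_true, db_labels)
print(f"K-Means ARI: {ari_km:.2f}")
print(f"DBSCAN  ARI: {ari_db:.2f}")
```

Changing eps by a factor of two in either direction is a quick way to see how sensitive DBSCAN is to its tuning.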

The NumPy neural network: 94% on MNIST with no framework

This was the exercise that taught me the most. A multi-layer perceptron in raw NumPy means writing everything yourself: forward pass, activations, loss computation, backprop, weight updates. There's no framework hiding the math from you.

python
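import numpy as np


class NeuralNetMLP:
    def __init__(self, num_features, num_hidden, num_classes):
        self.num_features = num_features
        self.num_hidden = num_hidden
        self.num_classes = num_classes

        # Small random-normal initialization (mean 0, std 0.1).
        # Note: a fixed std is not Xavier/Glorot scaling, which depends on layer size.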
        rng = np.random.RandomState(123)
        self.weight_h = rng.normal(
            loc=0.0, scale=0.1, size=(num_hidden, num_features)
        )
        self.bias_h = np.zeros(num_hidden)
        self.weight_out = rng.normal(
            loc=0.0, scale=0.1, size=(num_classes, num_hidden)
        )
        self.bias_out = np.zeros(num_classes)

[Figure: The MNIST MLP: 784 inputs, one hidden layer, 10 class outputs. Diagram labels: Input (784), Hidden (100), Output (10), ReLU, Softmax.]

Forward pass: matrix multiply, add bias, sigmoid. Backward pass: chain rule, step by step, from output error back to input gradients. It's tedious to write out. But once you've manually coded the backward pass for a two-layer network, loss.backward() stops feeling like a black box.
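
To make the chain rule concrete, here's a minimal sketch of one forward and backward pass. It assumes a sigmoid hidden layer and a softmax output with cross-entropy loss; the repository's exact activations and loss may differ, and the function names and tiny layer sizes are hypothetical. The finite-difference check at the end is the useful part: it confirms the hand-written gradient matches what the loss actually does.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X, W_h, b_h, W_out, b_out):
    a_h = sigmoid(X @ W_h.T + b_h)           # hidden activations, shape (n, hidden)
    probs = softmax(a_h @ W_out.T + b_out)   # class probabilities, shape (n, classes)
    return a_h, probs

def cross_entropy(probs, y_onehot):
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=1))

def backward(X, y_onehot, a_h, probs, W_out):
    n = X.shape[0]
    d_z_out = (probs - y_onehot) / n         # gradient of softmax + cross-entropy
    d_W_out = d_z_out.T @ a_h
    d_b_out = d_z_out.sum(axis=0)
    d_a_h = d_z_out @ W_out                  # chain rule back into the hidden layer
    d_z_h = d_a_h * a_h * (1.0 - a_h)        # sigmoid derivative is a * (1 - a)
    d_W_h = d_z_h.T @ X
    d_b_h = d_z_h.sum(axis=0)
    return d_W_h, d_b_h, d_W_out, d_b_out

# Tiny random problem (hypothetical sizes): 5 samples, 4 features, 6 hidden, 3 classes
rng = np.random.RandomState(0)
X = rng.normal(size=(5, 4))
y_onehot = np.eye(3)[rng.randint(0, 3, size=5)]
W_h, b_h = rng.normal(scale=0.1, size=(6, 4)), np.zeros(6)
W_out, b_out = rng.normal(scale=0.1, size=(3, 6)), np.zeros(3)

a_h, probs = forward(X, W_h, b_h, W_out, b_out)
d_W_h, d_b_h, d_W_out, d_b_out = backward(X, y_onehot, a_h, probs, W_out)

# Gradient check: nudge one hidden weight and compare with the analytic value
eps = 1e-6
W_plus, W_minus = W_h.copy(), W_h.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
loss_plus = cross_entropy(forward(X, W_plus, b_h, W_out, b_out)[1], y_onehot)
loss_minus = cross_entropy(forward(X, W_minus, b_h, W_out, b_out)[1], y_onehot)
numeric = (loss_plus - loss_minus) / (2 * eps)
print(f"analytic: {d_W_h[0, 0]:.8f}  numeric: {numeric:.8f}")
```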

On MNIST with a single hidden layer of 100 units: ~94.5% test accuracy. Training reached about 95.6%, so not much overfitting. A PyTorch version with the same architecture gives the same numbers. The math is identical; the only difference is who typed the matrix multiplications.

PyTorch: replacing hand-coded pieces one at a time

Instead of jumping from NumPy to full PyTorch, I wrote the same linear regression four times. Each version replaced one hand-coded piece with a PyTorch equivalent. Doing this made each abstraction layer click in a way that reading docs never did.

  1. NumPy only: manual gradients, manual weight update
  2. + Autograd: PyTorch computes gradients, manual update
  3. + Loss & Optim: nn.MSELoss + optim.SGD handle the math
  4. + nn.Module: full PyTorch model with nn.Linear

Four versions of the same linear regression, each replacing one hand-coded piece with PyTorch

Step 1: Pure NumPy

Manual everything. Compute the prediction, the loss, and the gradient by hand. Update weights with a learning rate.

python
# Pure NumPy: manual gradients, manual update
import numpy as np

X = np.array([1, 2, 3, 4], dtype=np.float32)
Y = np.array([2, 4, 6, 8], dtype=np.float32)
w = 0.0

def forward(x): return w * x
def loss(y, y_pred): return ((y_pred - y) ** 2).mean()
# d/dw mean((w*x - y)^2) = mean(2 * x * (w*x - y)); average over samples to match the loss
def gradient(x, y, y_pred): return (2 * x * (y_pred - y)).mean()

for epoch in range(20):
    y_pred = forward(X)
    l = loss(Y, y_pred)
    dw = gradient(X, Y, y_pred)
    w -= 0.01 * dw

Step 4: Full nn.Module

Same regression, same data. PyTorch handles gradients, loss, optimization, and the model. Half the code, same math, and it handles edge cases you'd miss writing it by hand.

python
import torch
import torch.nn as nn

X = torch.tensor([[1], [2], [3], [4]], dtype=torch.float32)
Y = torch.tensor([[2], [4], [6], [8]], dtype=torch.float32)

class LinearRegression(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.lin = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        return self.lin(x)

model = LinearRegression(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(100):
    y_pred = model(X)
    l = criterion(y_pred, Y)
    l.backward()
    optimizer.step()
    optimizer.zero_grad()

What clicked for me: l.backward() is doing the same thing as the manual gradient function from Step 1, just generalized to arbitrary computation graphs. optimizer.step() is w -= lr * dw applied to every parameter. No magic.
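
A small sketch of that equivalence, assuming the same toy data and a single scalar weight. It compares the gradient autograd produces with the hand-derived formula for mean squared error, then applies one SGD step by hand. The variable names are illustrative.

```python
import torch

X = torch.tensor([1.0, 2.0, 3.0, 4.0])
Y = torch.tensor([2.0, 4.0, 6.0, 8.0])
w = torch.tensor(0.0, requires_grad=True)

# Autograd: the forward pass records the graph, backward() walks it in reverse
loss = ((w * X - Y) ** 2).mean()
loss.backward()
autograd_dw = w.grad.item()

# Manual: d/dw mean((w*x - y)^2) = mean(2 * x * (w*x - y))
with torch.no_grad():
    manual_dw = (2 * X * (w * X - Y)).mean().item()

# One SGD step written out by hand: the per-parameter update optimizer.step() applies
lr = 0.01
with torch.no_grad():
    w -= lr * w.grad
w.grad.zero_()  # same role as optimizer.zero_grad()

print(autograd_dw, manual_dw, w.item())
```

Both gradients come out to -30 at w = 0, so the first update moves w to 0.3.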

XOR: the case for depth

XOR: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0. Four data points, and no single line can separate the classes. A single-layer network tops out at 50% because the boundary isn't linear.

Add one hidden layer with ReLU and it jumps to ~95% on training, ~90% on validation. I like this example more than MNIST for explaining why depth matters. You can plot the decision boundary and literally see where the single-layer version fails. Four data points, and the difference between 50% and 95% is one hidden layer.
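
To see why one hidden layer is enough, here's a sketch on the four canonical XOR points that skips training entirely. The hidden-layer weights are hand-picked for illustration (they are not learned, and not taken from the repository). The best linear fit, by contrast, can only predict 0.5 everywhere.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Best linear model: least squares with a bias column
A = np.hstack([X, np.ones((4, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
linear_pred = A @ coef

# One hidden layer with two ReLU units and hand-picked weights
W1 = np.array([[1.0, 1.0],    # h1 = relu(x1 + x2)
               [1.0, 1.0]])   # h2 = relu(x1 + x2 - 1)
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])    # output = h1 - 2 * h2

hidden = np.maximum(0.0, X @ W1.T + b1)
mlp_pred = hidden @ w2

print("linear:", np.round(linear_pred, 3))  # 0.5 for every input
print("hidden:", mlp_pred)                  # matches the XOR targets
```

The second hidden unit only switches on for (1,1), and that bend is what lets the output come back down to 0.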

Convolutions: from scratch, then applied

Before touching nn.Conv2d, I wrote 1D and 2D convolution in NumPy. A convolution is a sliding window doing element-wise multiplication and summing. Written as nested for-loops it looks trivial, but that's the point. You see exactly what the operation does before the framework hides it.
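
The 1D case can be sketched in a few lines. This is an illustrative version with no padding and stride 1, not the repository's code. The flip flag is an addition to show one detail that is easy to miss: textbook convolution reverses the kernel, while deep learning layers usually skip the flip.

```python
import numpy as np

def conv1d(signal, kernel, flip=True):
    """Sliding-window 1D convolution with no padding and stride 1."""
    k = kernel[::-1] if flip else kernel    # textbook convolution flips the kernel
    out_len = len(signal) - len(kernel) + 1
    output = np.zeros(out_len)
    for i in range(out_len):
        # Multiply the current window by the kernel element-wise, then sum
        output[i] = np.sum(signal[i:i + len(kernel)] * k)
    return output

signal = np.array([1.0, 3.0, 2.0, 4.0, 5.0])
kernel = np.array([1.0, 0.0, -1.0])

flipped = conv1d(signal, kernel, flip=True)
unflipped = conv1d(signal, kernel, flip=False)

# NumPy's built-ins give the same results as the loop versions
print(np.allclose(flipped, np.convolve(signal, kernel, mode="valid")))
print(np.allclose(unflipped, np.correlate(signal, kernel, mode="valid")))
```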

python
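import numpy as np

def conv2d(image, kernel):
    """NumPy 2D convolution (no padding, stride 1), the operation behind nn.Conv2d.

    Like nn.Conv2d, this slides the kernel without flipping it (cross-correlation).
    """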
    ki, kj = kernel.shape
    out_h = image.shape[0] - ki + 1
    out_w = image.shape[1] - kj + 1
    output = np.zeros((out_h, out_w))

    for i in range(out_h):
        for j in range(out_w):
            output[i, j] = np.sum(
                image[i:i+ki, j:j+kj] * kernel
            )
    return output

Then I applied it to actual datasets: CIFAR-10 classification with a LeNet-style CNN, and CelebA smile detection as a binary task.

CIFAR-10 CNN

A classic architecture: two conv layers with ReLU and max pooling, then three fully-connected layers. The conv layers go from 3 channels (RGB) to 6 to 16, with 5×5 kernels. After pooling, the feature map is 16×5×5 = 400 values, which feed into a 120→84→10 classifier.
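
The 16×5×5 figure follows from the standard output-size formula for convolution and pooling layers. A quick sketch of the arithmetic, assuming 32×32 CIFAR-10 inputs, no padding, and 2×2 pooling with stride 2 (conv_out is a hypothetical helper name):

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a conv or pooling layer on a square input."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32                                   # CIFAR-10 images are 32x32
size = conv_out(size, kernel=5)             # conv1: 32 -> 28
size = conv_out(size, kernel=2, stride=2)   # pool:  28 -> 14
size = conv_out(size, kernel=5)             # conv2: 14 -> 10
size = conv_out(size, kernel=2, stride=2)   # pool:  10 -> 5

flat_features = 16 * size * size            # 16 channels of 5x5
print(size, flat_features)                  # 5 400
```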

python
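import torch.nn as nn
import torch.nn.functional as F


class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)        # 3x32x32 -> 6x28x28
        self.pool = nn.MaxPool2d(2, 2)         # halves height and width
        self.conv2 = nn.Conv2d(6, 16, 5)       # 6x14x14 -> 16x10x10
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)           # one output per CIFAR-10 class

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))   # -> 6x14x14
        x = self.pool(F.relu(self.conv2(x)))   # -> 16x5x5
        x = x.view(-1, 16 * 5 * 5)             # flatten to (batch, 400)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)                     # raw logits; no softmax here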

CelebA smile classification

CelebA was a step up: over 200,000 face images, binary target (smiling or not). Best epoch hit ~89.6% validation accuracy. This was the first project where I ran into problems that don't exist in toy datasets. Data augmentation actually mattered. Class balance shifted the loss landscape. And the model started overfitting on faces fast if I wasn't careful with regularization.

Sequences: RNNs and character-level language models

I started sequences by writing a manual RNN forward pass and comparing it to nn.RNN's output. Same numbers. An RNN is the same small layer (a linear map followed by a tanh) applied at every time step, passing its hidden state forward each time. Seeing that loop spelled out made the architecture less intimidating.
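
A sketch of that comparison, assuming a single-layer nn.RNN with its default tanh nonlinearity. The sizes are arbitrary, and the manual loop reuses the layer's own weights, so any mismatch would point to a mistake in the loop rather than in the initialization.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=5, hidden_size=3, batch_first=True)
x = torch.randn(1, 4, 5)                   # (batch, time steps, features)

# Parameters of the first (and only) layer
w_xh, w_hh = rnn.weight_ih_l0, rnn.weight_hh_l0
b_xh, b_hh = rnn.bias_ih_l0, rnn.bias_hh_l0

with torch.no_grad():
    expected, _ = rnn(x)                   # hidden state at every time step

    h = torch.zeros(1, 3)                  # initial hidden state
    manual = []
    for t in range(x.shape[1]):
        # Same weights at every step; only the input and hidden state change
        h = torch.tanh(x[:, t] @ w_xh.T + b_xh + h @ w_hh.T + b_hh)
        manual.append(h)
    manual = torch.stack(manual, dim=1)    # (batch, time steps, hidden)

print(torch.allclose(manual, expected, atol=1e-6))
```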

IMDB sentiment analysis

IMDB sentiment classification with torchtext: tokenize reviews, build a vocabulary, embed, feed into an RNN. The first epochs are rough (56% accuracy, barely above a coin flip), but it learns.

Honestly the bigger lesson here was about infrastructure, not models. torchtext changed its API between versions, and things that worked in the tutorial I was following just broke. I spent more time debugging the data pipeline than the RNN itself.

Character-level language model

An RNN/LSTM trained to predict the next character given previous ones. Feed it a text corpus, let it train. By epoch 9,500 the loss was down to 1.04 and the generated text had recognizable words and sentence fragments. It's the same principle as GPT, just at character scale and with a much smaller model.
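
The training data for a model like this is just the text shifted by one character. A minimal sketch of building the input and target pairs, using a stand-in string and a hypothetical context length of 4 (the real corpus and sequence length aren't stated here):

```python
import numpy as np

text = "hello world"                        # stand-in corpus for illustration
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
encoded = np.array([char_to_idx[ch] for ch in text])

seq_len = 4                                 # hypothetical context length
num_windows = len(encoded) - seq_len
# Each target sequence is its input sequence shifted one character to the right
inputs = np.array([encoded[i:i + seq_len] for i in range(num_windows)])
targets = np.array([encoded[i + 1:i + seq_len + 1] for i in range(num_windows)])

idx_to_char = np.array(chars)
first_in = "".join(idx_to_char[inputs[0]])
first_out = "".join(idx_to_char[targets[0]])
print(first_in, "->", first_out)            # hell -> ello
```

At every position the model sees the characters so far and is scored on the one that comes next.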

Transfer learning with ResNet

Training from scratch works when you have enough data (CIFAR, CelebA). For small datasets it doesn't. I fine-tuned a pretrained ResNet-18 on a small ants-vs-bees dataset two ways:

  • Full fine-tuning: unfreeze all layers, train everything with a small learning rate and a step scheduler.
  • Feature extraction: freeze the backbone, only train the final classification head. Faster, less risk of overfitting, works well when the pretrained domain is close to yours.
python
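import torch
import torch.nn as nn
from torchvision import models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# pretrained=True is deprecated since torchvision 0.13; use the weights enum instead
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# For feature extraction: freeze the pretrained backbone
# (skip this loop for full fine-tuning)
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully-connected layer for 2 classes.
# A new nn.Linear has requires_grad=True by default, so only this head trains.
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, 2)
model.to(device)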

Transformers: building the attention mechanism

Last stretch: token embeddings, positional encoding, and the attention mechanism. I didn't build a full trainable transformer, but I got far enough to understand what F.softmax(scores, dim=-1) actually does in QKV attention, and why you need positional encoding once you drop the sequential assumption of RNNs.

The way I think about it now: attention is a soft lookup table. The query says "what am I looking for", the keys say "here's what I have", and the values are what gets returned. The softmax over dot products is just a differentiable way to pick which values matter most. Once I saw it that way, the architecture felt less alien.
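
The soft-lookup idea fits in a few lines of NumPy. This is a single-head sketch with random projection matrices standing in for learned ones. Dividing the scores by the square root of the key dimension is the usual scaled dot-product convention and is an assumption here. The shuffle at the end shows why positional encoding is needed: reorder the tokens and the outputs are reordered the same way, so attention by itself has no notion of position.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # how well each query matches each key
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights              # weighted mix of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # 4 tokens, 8-dim embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

out, weights = attention(X @ W_q, X @ W_k, X @ W_v)

# Shuffle the tokens: the outputs come back shuffled in exactly the same way
perm = np.array([2, 0, 3, 1])
X_perm = X[perm]
out_perm, _ = attention(X_perm @ W_q, X_perm @ W_k, X_perm @ W_v)
print(np.allclose(out_perm, out[perm]))
```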

Miniprojects: putting it together

Alongside the fundamentals, I did a few end-to-end projects to see if I could actually wire everything together into working pipelines:

  • MNIST Classification: PyTorch MLP + Lightning, 96.5% test accuracy (PyTorch · torchmetrics)
  • Mileage Prediction: tabular regression, test MSE ~9.59 (PyTorch · Pandas)
  • Smile Detection: CNN on CelebA, ~89.6% val accuracy (PyTorch · torchvision)
  • Character LM: RNN text generation, loss ~1.04 (PyTorch · LSTM)

Tooling that helped

Two things I'm glad I set up early:

  • TensorBoard for watching training runs. Loss curves, accuracy over epochs, image grids of predictions, even the computation graph. When the loss plateaus, you see it immediately instead of staring at printed numbers. I logged MNIST metrics and PR curves per class to figure out which digits the model kept confusing.
  • Poetry for dependencies. Both repos use pyproject.toml with pinned PyTorch versions (2.3 and 2.4). Even for learning projects, you don't want to debug your environment at the same time as your model.

What I'd tell someone starting this path

  • Write backprop by hand at least once. It's tedious. But after you've manually applied the chain rule through two layers in NumPy, loss.backward() becomes a function call you actually understand.
  • XOR teaches more than MNIST for understanding depth. MNIST is the standard benchmark, but XOR shows you why hidden layers matter with 4 data points. You can plot the decision boundary and see the failure.
  • Data pipelines break more often than models. I spent more time debugging DataLoader issues, file paths, and text tokenization than I spent debugging model architectures. Budget time accordingly.
  • Use real datasets early. Synthetic data teaches you the algorithm. Real data teaches you everything else: class imbalance, noisy labels, preprocessing decisions that change your results more than model architecture does.
  • Don't skip the boring parts. Model saving, checkpointing, GPU placement, environment setup. I put these off at first and regretted it. A model you can't reload or reproduce is a model you'll end up retraining from scratch.

Worth trying next

Implement a small transformer from scratch. I built the attention mechanism and embedding layers, but didn't assemble a full trainable transformer. A character-level GPT on a small corpus would close the loop between the RNN language model and the transformer attention components.

Add proper evaluation beyond accuracy. Most of my projects just report accuracy or loss. That's not enough. Confusion matrices, per-class metrics, calibration curves. I should build a reusable evaluation harness instead of ad-hoc metrics in every notebook.
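
As a starting point, a confusion matrix and per-class precision and recall take only a few lines of NumPy. This is a sketch with made-up labels and hypothetical function names, not an existing harness from either repository.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)       # rows: true class, columns: predicted
    return cm

def per_class_metrics(cm):
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)               # how often each class was predicted
    actual = cm.sum(axis=1)                  # how often each class actually occurred
    # Guard against division by zero for classes that never appear
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    return precision, recall

# Made-up labels for illustration
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

cm = confusion_matrix(y_true, y_pred, num_classes=3)
precision, recall = per_class_metrics(cm)
accuracy = np.trace(cm) / cm.sum()

print(cm)
print("precision:", np.round(precision, 3))
print("recall:   ", np.round(recall, 3))
```

Here overall accuracy is 5/7, but class 1 has perfect recall with only 2/3 precision, the kind of detail a single accuracy number hides.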

Try distributed training. Everything here runs on one GPU or CPU. I haven't touched DistributedDataParallel or Lightning's multi-GPU support. That's the gap between "I can train a model" and "I can train a model at scale."

Build something with Hugging Face. I understand the transformer mechanism now, but I haven't fine-tuned a pretrained language model on an actual task yet. Classification, summarization, NER, something practical with the ecosystem people actually use in production.

Wrapping up

Building everything twice is slow. There are faster ways to get productive with ML. But when a model misbehaves now, I know which layer to check and what the gradients should look like. When I read a paper, I can map the equations to code without much effort. That understanding came from the NumPy versions, not the PyTorch ones.

The full set: 12 algorithm implementations, 5 miniprojects, 15 PyTorch scripts going from tensors to TensorBoard. None of it is production code. All of it is why I can write production code now.

If you're starting out, I'd recommend the same thing. Build it from scratch first. Then build it properly. The second pass goes much faster when you already know what's supposed to happen.