ML / Sensors · Case Study

Activity Recognition with XGBoost

19 human physical activities classified from smartphone and smartwatch sensor data. Final accuracy reached 85.6% after tuning an XGBoost model with randomised search.

XGBoost · Python · scikit-learn · Feature Engineering · Random Search

Overall accuracy: 85.6%
Activity classes: 19
Sensor placements: 2
F1 on Sitting / Writing: ~96%

Overview

Human activity recognition (HAR) matters for health monitoring, fitness tracking, elderly care, and sports analytics. This project built a multi-class classifier to distinguish 19 daily physical activities using raw inertial measurement unit (IMU) data captured simultaneously from a smartphone and a smartwatch.

XGBoost was selected for its strong handling of tabular sensor features, robustness to outliers, and efficiency on medium-scale datasets. Hyperparameter tuning via randomised search pushed accuracy past the 85% threshold.

Dataset & Activities

Sensor readings (three-axis accelerometer and three-axis gyroscope) were collected from two placements: a smartphone and a smartwatch. The dataset covers 19 labelled activity classes:

Walking (A), Jogging (B), Stairs (C), Sitting (D), Standing (E), Typing (F), Brushing Teeth (G), Eating Soup (H), Eating Chips (I), Eating Pasta (J), Drinking (K), Eating Sandwich (L), Kicking (M), Catch (O), Dribbling (P), Writing (Q), Clapping (R), Folding (S)

Methodology

Feature Engineering

Statistical and frequency-domain features were extracted from raw 6-axis IMU windows (mean, variance, energy, correlation between axes) for both sensor placements.
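As a rough illustration of the feature extraction described above, the sketch below computes per-axis mean, variance, and energy plus pairwise axis correlations for each window. The window length, step size, column order, and the definition of energy as mean squared amplitude are assumptions made for this example, not details taken from the project, and the input here is random noise rather than real sensor data.

```python
import numpy as np


def window_features(window: np.ndarray) -> np.ndarray:
    """Summarise one IMU window of shape (n_samples, 6).

    Assumed column order: ax, ay, az, gx, gy, gz.
    """
    mean = window.mean(axis=0)               # 6 values
    var = window.var(axis=0)                 # 6 values
    energy = (window ** 2).mean(axis=0)      # 6 values, mean squared amplitude
    corr = np.corrcoef(window, rowvar=False) # 6 x 6 correlation matrix
    upper = np.triu_indices(window.shape[1], k=1)
    # A constant axis gives NaN correlations, so map those to 0.
    pair_corr = np.nan_to_num(corr[upper])   # 15 unique axis pairs
    return np.concatenate([mean, var, energy, pair_corr])


def sliding_windows(signal: np.ndarray, size: int, step: int):
    """Yield overlapping windows along the time axis."""
    for start in range(0, len(signal) - size + 1, step):
        yield signal[start:start + size]


# Synthetic stand-in for one recording: 1000 samples of 6-axis data.
rng = np.random.default_rng(0)
signal = rng.normal(size=(1000, 6))

features = np.stack(
    [window_features(w) for w in sliding_windows(signal, size=200, step=100)]
)
print(features.shape)
```

Each row of `features` is one window, ready to be paired with that window's activity label. The same function would be applied separately to the phone and watch streams.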

Model

XGBoost gradient-boosted trees were trained on the engineered feature vectors. Separate models were evaluated for phone and watch placements.

Hyperparameter Tuning

Randomised search was run over key XGBoost parameters (n_estimators, max_depth, learning_rate, subsample, colsample_bytree, gamma) to maximise validation accuracy.

Evaluation

Per-class precision, recall, and F1-score were reported alongside the confusion matrix. Overall accuracy was computed on a held-out test set.
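The metrics listed above can be computed with scikit-learn roughly as follows. The labels and predictions are a small made-up example covering three of the activity codes; they are not results from this project.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Toy ground truth and predictions for three activity codes (illustrative only).
labels = ["A", "C", "D"]
y_true = np.array(["A", "A", "A", "C", "C", "C", "D", "D", "D", "D"])
y_pred = np.array(["A", "A", "C", "C", "A", "C", "D", "D", "D", "D"])

# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# One precision / recall / F1 / support value per class, in `labels` order.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)
accuracy = accuracy_score(y_true, y_pred)

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Passing `labels` explicitly fixes the row and column order, which makes the confusion matrix easier to compare between the phone and watch models.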

Best Hyperparameters

n_estimators: 982
max_depth:
learning_rate: 0.102
colsample_bytree: 0.94
subsample: 0.85
gamma:

Per-class Performance (Phone)

Stationary activities (sitting, writing) were classified near-perfectly. High-motion activities with similar kinematics (stairs vs. walking, kicking vs. dribbling) remained the hardest to separate.

Sitting (D): ~96% F1-score
Writing (Q): ~96% F1-score
Eating Soup (H): ~89% F1-score
Drinking (K): ~88% F1-score
Typing (F): ~86% F1-score
Kicking (M): ~52% F1-score
Stairs (C): ~49% F1-score
Dribbling (P): ~49% F1-score

Key Findings

Overall accuracy reached 85.59% after hyperparameter tuning, up noticeably from baseline defaults.
Sitting (D) and Writing (Q) achieved ~96% F1, confirming that highly distinctive postures are easy to identify.
Jogging (B) was frequently confused with Walking (A); the two share similar limb kinematics and are hard to separate without gait-specific features.
Stairs (C) showed lower performance due to its overlap with walking in stride pattern and acceleration profile.
Smartwatch and smartphone placements yielded different confusion profiles, suggesting sensor fusion could push accuracy further.
On the watch, Brushing Teeth (G) and Eating Soup (H) were often confused with Standing (E). Wrist movements alone are too ambiguous for fine-grained hand-to-mouth activities.
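Findings like the ones above are usually read off the confusion matrix. The helper below is a sketch of how the most frequently confused class pairs could be ranked; the function name `top_confusions` and the counts in the example matrix are hypothetical and are not the project's actual numbers.

```python
import numpy as np


def top_confusions(cm: np.ndarray, labels: list, k: int = 3) -> list:
    """Return up to k (true, predicted, rate) triples with the highest
    off-diagonal confusion rate, where rate = count / row total."""
    off_diag = cm.astype(float).copy()
    np.fill_diagonal(off_diag, 0.0)
    row_totals = cm.sum(axis=1, keepdims=True)
    rates = np.divide(
        off_diag, row_totals, out=np.zeros_like(off_diag), where=row_totals > 0
    )
    order = np.argsort(rates, axis=None)[::-1][:k]
    pairs = []
    for flat_index in order:
        i, j = np.unravel_index(flat_index, rates.shape)
        if rates[i, j] > 0:
            pairs.append((labels[i], labels[j], float(rates[i, j])))
    return pairs


# Made-up counts: rows are true classes, columns are predicted classes.
labels = ["Walking", "Jogging", "Stairs"]
cm = np.array([
    [80, 5, 15],
    [20, 75, 5],
    [30, 2, 68],
])

for true_label, predicted_label, rate in top_confusions(cm, labels):
    print(f"{true_label} -> {predicted_label}: {rate:.0%}")
```

Normalising by row total expresses each confusion as a share of the true class, so rare classes are not hidden by frequent ones.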

What I'd do differently

If I were to redo this project, I'd fuse the phone and watch features into a single model from the start instead of evaluating them separately. The confusion patterns were clearly complementary: watch data struggled with eating activities while phone data handled them better, and vice versa for some motion classes. I left that on the table.

I'd also skip the manual feature engineering and go straight to a CNN-LSTM on raw windowed signals. XGBoost was a solid baseline, but the temporal structure in IMU data is exactly what sequence models are built for. The 85.6% accuracy is respectable, but I think a well-tuned deep model could push past 90% without much more data.

Finally, I didn't think enough about deployment. An activity recognition model that can't run on a phone in real time is an academic exercise. I'd benchmark inference latency on actual mobile hardware next time.
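A first pass at such a benchmark could be as simple as timing single-sample predictions. In the sketch below, `measure_latency_ms` and `dummy_predict` are hypothetical names, and the stand-in predictor is not the trained model. Numbers from a development machine only indicate relative cost and say little about real mobile hardware.

```python
import statistics
import time


def measure_latency_ms(predict_fn, sample, n_runs=200, warmup=20):
    """Time repeated single-sample calls and report median and p95 in ms."""
    for _ in range(warmup):
        predict_fn(sample)  # warm caches before timing

    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict_fn(sample)
        timings.append((time.perf_counter() - start) * 1000.0)

    timings.sort()
    p95_index = min(len(timings) - 1, int(0.95 * len(timings)))
    return {
        "median_ms": statistics.median(timings),
        "p95_ms": timings[p95_index],
    }


def dummy_predict(features):
    """Stand-in for model.predict: index of the largest feature value."""
    return max(range(len(features)), key=features.__getitem__)


stats = measure_latency_ms(dummy_predict, [0.1] * 33)
print(stats)
```

To benchmark the real classifier, `predict_fn` would wrap the model's predict call on one feature vector, ideally run on the target device itself.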

Limitations & Future Work

A single sensor placement per model limits robustness; fusing phone and watch features in one model is a natural next step.
Larger, more diverse participant pools would improve generalisability across body types and movement styles.
Deep learning approaches (LSTM, CNN-LSTM) could automatically learn temporal patterns without manual feature engineering.
Real-time inference on-device was not evaluated. Latency and memory constraints of XGBoost on embedded hardware still need investigation.