ML / Sensors · Case Study

Activity Recognition with XGBoost

19 human physical activities classified from smartphone and smartwatch sensor data. Final accuracy reached 85.6% after tuning an XGBoost model with randomised search.

XGBoost · Python · scikit-learn · Feature Engineering · Random Search

Overall accuracy: 85.6%
Activity classes: 19
Sensor placements: 2
F1 on Sitting / Writing: ~96%


Overview

Human activity recognition (HAR) matters for health monitoring, fitness tracking, elderly care, and sports analytics. This project built a multi-class classifier to distinguish 19 daily physical activities using raw inertial measurement unit (IMU) data captured simultaneously from a smartphone and a smartwatch.

XGBoost was selected for its strong handling of tabular sensor features, robustness to outliers, and efficiency on medium-scale datasets. Hyperparameter tuning via randomised search raised overall accuracy to 85.6%.

Dataset & Activities

Sensor readings (three-axis accelerometer and three-axis gyroscope) were collected from two placements: a smartphone and a smartwatch. The dataset covers 19 labelled activity classes:

Walking (A), Jogging (B), Stairs (C), Sitting (D), Standing (E), Typing (F), Brushing Teeth (G), Eating Soup (H), Eating Chips (I), Eating Pasta (J), Drinking (K), Eating Sandwich (L), Kicking (M), Catch (O), Dribbling (P), Writing (Q), Clapping (R), Folding (S)

Methodology

Feature Engineering

Statistical and frequency-domain features were extracted from raw 6-axis IMU windows (mean, variance, energy, correlation between axes) for both sensor placements.
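
The feature extraction described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the channel names, the window size and step, and the exact definition of "energy" (mean squared FFT magnitude with the DC term removed) are all assumptions.

```python
import numpy as np

# Hypothetical channel names: 3-axis accelerometer followed by 3-axis gyroscope.
AXES = ("ax", "ay", "az", "gx", "gy", "gz")


def window_features(window: np.ndarray) -> dict:
    """Compute per-window features from a (n_samples, 6) IMU array."""
    if window.ndim != 2 or window.shape[1] != len(AXES):
        raise ValueError("expected an array of shape (n_samples, 6)")
    feats = {}
    for i, name in enumerate(AXES):
        x = window[:, i]
        feats[f"{name}_mean"] = float(x.mean())
        feats[f"{name}_var"] = float(x.var())
        # Assumed energy definition: mean squared FFT magnitude, DC removed.
        spectrum = np.fft.rfft(x - x.mean())
        feats[f"{name}_energy"] = float(np.mean(np.abs(spectrum) ** 2))
    # Pairwise Pearson correlation between axes (upper triangle only).
    corr = np.corrcoef(window, rowvar=False)
    for i in range(len(AXES)):
        for j in range(i + 1, len(AXES)):
            feats[f"corr_{AXES[i]}_{AXES[j]}"] = float(corr[i, j])
    return feats


def sliding_windows(signal: np.ndarray, size: int, step: int):
    """Yield fixed-size windows from a (n_samples, 6) recording."""
    for start in range(0, len(signal) - size + 1, step):
        yield signal[start:start + size]
```

Each window yields 33 features under these assumptions (three per axis plus 15 pairwise correlations). Running the same function on phone and watch recordings gives one feature table per placement.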

Model

XGBoost gradient-boosted trees were trained on the engineered feature vectors. Separate models were evaluated for phone and watch placements.

Hyperparameter Tuning

Randomised search was run over key XGBoost parameters (n_estimators, max_depth, learning_rate, subsample, colsample_bytree, gamma) to maximise validation accuracy.
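
A search of this kind might be set up with scikit-learn along the following lines. The parameter names come from the text; the ranges, the number of iterations, and the use of 3-fold cross-validation are assumptions, since the write-up does not state them. `build_search` is a hypothetical helper name.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler, RandomizedSearchCV

# Hypothetical search ranges; the source names the parameters, not their bounds.
PARAM_DISTRIBUTIONS = {
    "n_estimators": randint(100, 1001),      # integers in [100, 1000]
    "max_depth": randint(3, 11),             # integers in [3, 10]
    "learning_rate": uniform(0.01, 0.29),    # floats in [0.01, 0.30]
    "subsample": uniform(0.5, 0.5),          # floats in [0.5, 1.0]
    "colsample_bytree": uniform(0.5, 0.5),   # floats in [0.5, 1.0]
    "gamma": [0, 0.1, 0.5, 1.0],             # sampled uniformly from this list
}


def build_search(n_iter=50, cv=3, seed=42):
    """Return an unfitted RandomizedSearchCV wrapping an XGBoost classifier."""
    from xgboost import XGBClassifier  # imported lazily; requires xgboost

    return RandomizedSearchCV(
        estimator=XGBClassifier(),
        param_distributions=PARAM_DISTRIBUTIONS,
        n_iter=n_iter,          # assumed budget
        scoring="accuracy",
        cv=cv,                  # assumed; a fixed validation split also works
        random_state=seed,
        n_jobs=-1,
    )


# Preview a few candidate configurations without training anything.
candidates = list(ParameterSampler(PARAM_DISTRIBUTIONS, n_iter=5, random_state=0))
```

Calling `build_search().fit(X_train, y_train)` with integer-encoded labels would then expose the winning configuration as `best_params_`.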

Evaluation

Per-class precision, recall, and F1-score were reported alongside the confusion matrix. Overall accuracy was computed on a held-out test set.
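
As a rough sketch of how these metrics can be produced with scikit-learn, the helper below bundles accuracy, per-class scores, and the confusion matrix. `evaluate` is a hypothetical name and the labels are toy values, not project results.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)


def evaluate(y_true, y_pred, labels):
    """Return overall accuracy, per-class metrics, and the confusion matrix."""
    precision, recall, f1, support = precision_recall_fscore_support(
        y_true, y_pred, labels=labels, zero_division=0
    )
    per_class = {
        label: {
            "precision": float(p),
            "recall": float(r),
            "f1": float(f),
            "support": int(s),
        }
        for label, p, r, f, s in zip(labels, precision, recall, f1, support)
    }
    return {
        "accuracy": float(accuracy_score(y_true, y_pred)),
        "per_class": per_class,
        # Rows are true labels, columns are predicted labels.
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=labels),
    }


# Toy labels using the dataset's activity codes; not real predictions.
y_true = ["A", "A", "B", "B", "D", "D"]
y_pred = ["A", "B", "B", "B", "D", "D"]
report = evaluate(y_true, y_pred, labels=["A", "B", "D"])
```

Passing `labels` explicitly keeps the row and column order of the confusion matrix fixed, which makes phone and watch results easier to compare side by side.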

Best Hyperparameters

n_estimators: 982
max_depth: 6
learning_rate: 0.102
colsample_bytree: 0.94
subsample: 0.85
gamma: 0
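
For reference, the reported values can be written as a parameter dictionary and passed to an XGBoost classifier. This is only a sketch: `make_model` and the seed are hypothetical, and because the listed values appear rounded, refitting with them may not reproduce the reported accuracy exactly.

```python
# Values as listed in the write-up (likely rounded).
BEST_PARAMS = {
    "n_estimators": 982,
    "max_depth": 6,
    "learning_rate": 0.102,
    "colsample_bytree": 0.94,
    "subsample": 0.85,
    "gamma": 0,
}


def make_model(params=None, seed=42):
    """Build an unfitted XGBoost classifier from a parameter dict."""
    from xgboost import XGBClassifier  # imported lazily; requires xgboost

    chosen = dict(BEST_PARAMS if params is None else params)
    return XGBClassifier(**chosen, random_state=seed)  # seed is an assumption
```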

Per-class Performance (Phone)

Stationary activities (sitting, writing) were classified most reliably, at roughly 96% F1. High-motion activities with similar kinematics (stairs vs. walking, kicking vs. dribbling) remained the hardest to separate, at roughly 50% F1.

Sitting (D): ~96% F1-score
Writing (Q): ~96% F1-score
Soup (H): ~89% F1-score
Drinking (K): ~88% F1-score
Typing (F): ~86% F1-score
Kicking (M): ~52% F1-score
Stairs (C): ~49% F1-score
Dribbling (P): ~49% F1-score

Key Findings

  • Overall accuracy reached 85.59% after hyperparameter tuning, up noticeably from baseline defaults.
  • Sitting (D) and Writing (Q) achieved ~96% F1, confirming that highly distinctive postures are easy to identify.
  • Jogging (B) was frequently confused with Walking (A); the two share similar limb kinematics and are hard to separate without gait-specific features.
  • Stairs (C) scored only ~49% F1, most likely because its stride pattern and acceleration profile overlap with walking.
  • Smartwatch and smartphone placements yielded different confusion profiles, suggesting sensor fusion could push accuracy further.
  • On the watch, Brushing Teeth (G) and Eating Soup (H) were often confused with Standing (E). Wrist movements alone are too ambiguous for fine-grained hand-to-mouth activities.
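
Confusions like the ones listed above can be read off a confusion matrix programmatically. The sketch below ranks the largest off-diagonal cells; `top_confusions` is a hypothetical helper, and the counts are made up for illustration, not taken from the project.

```python
import numpy as np


def top_confusions(cm, labels, k=3):
    """Return the k largest off-diagonal entries as (true, predicted, count)."""
    off_diag = np.array(cm)          # copy so the caller's matrix is untouched
    np.fill_diagonal(off_diag, 0)    # ignore correct predictions
    # Sort flattened indices by count, largest first.
    order = np.argsort(off_diag, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(order, off_diag.shape)
    return [
        (labels[r], labels[c], int(off_diag[r, c]))
        for r, c in zip(rows, cols)
        if off_diag[r, c] > 0
    ]


# Made-up counts for illustration; rows are true labels, columns predictions.
labels = ["Walking", "Jogging", "Stairs"]
cm = [
    [80, 5, 15],
    [12, 85, 3],
    [30, 2, 68],
]
pairs = top_confusions(cm, labels)
```

Running this separately on the phone and watch matrices is one way to compare their confusion profiles directly.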

What I'd do differently

If I were to redo this project, I'd fuse the phone and watch features into a single model from the start instead of evaluating them separately. The confusion patterns were clearly complementary: watch data struggled with eating activities while phone data handled them better, and vice versa for some motion classes. I left that on the table.
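
One simple way to fuse the two placements is feature-level concatenation, sketched below. It assumes phone and watch windows can be matched by a shared key such as (subject, activity, window index); that key scheme and the name `fuse_features` are hypothetical.

```python
import numpy as np


def fuse_features(phone, watch):
    """Concatenate per-window phone and watch features column-wise.

    `phone` and `watch` map a window key, e.g. (subject, activity, index),
    to a 1-D feature vector. Only keys present in both placements are kept.
    """
    keys = sorted(set(phone) & set(watch))
    X = np.array([np.concatenate([phone[k], watch[k]]) for k in keys])
    return keys, X


# Toy vectors for illustration; real inputs would be engineered features.
phone = {("s1", "A", 0): np.ones(3), ("s1", "A", 1): np.zeros(3)}
watch = {("s1", "A", 0): np.full(2, 2.0), ("s1", "B", 0): np.zeros(2)}
keys, X = fuse_features(phone, watch)
```

Windows that exist for only one device are dropped here; imputing the missing half instead would be a different design choice with its own trade-offs.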

I'd also skip the manual feature engineering and go straight to a CNN-LSTM on raw windowed signals. XGBoost was a solid baseline, but the temporal structure in IMU data is exactly what sequence models are built for. The 85.6% accuracy is respectable, but I think a well-tuned deep model could push past 90% without much more data.

Finally, I didn't think enough about deployment. An activity recognition model that can't run on a phone in real time is an academic exercise. I'd benchmark inference latency on actual mobile hardware next time.
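
A first-pass latency check could look like the sketch below, which times repeated single-sample predictions. `benchmark` is a hypothetical helper, and numbers measured this way on a desktop say little about a phone; the same harness would need to run on the target device.

```python
import statistics
import time


def benchmark(predict, sample, warmup=10, runs=200):
    """Time single-sample calls to `predict` and return latency stats in ms."""
    for _ in range(warmup):
        predict(sample)  # warm caches before measuring
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    return {
        "median_ms": statistics.median(timings),
        "p95_ms": timings[int(0.95 * (len(timings) - 1))],
        "max_ms": timings[-1],
    }


# Stand-in workload; in practice `predict` would wrap the trained model.
stats = benchmark(lambda x: sum(x), list(range(100)), warmup=2, runs=20)
```

Reporting a high percentile alongside the median matters for real-time use, since occasional slow predictions are what a user would notice.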

Limitations & Future Work

  • Training each model on a single sensor placement limits robustness; fusing phone and watch features in one model is a natural next step.
  • Larger, more diverse participant pools would improve generalisability across body types and movement styles.
  • Deep learning approaches (LSTM, CNN-LSTM) could automatically learn temporal patterns without manual feature engineering.
  • Real-time inference on-device was not evaluated. Latency and memory constraints of XGBoost on embedded hardware still need investigation.