Gold Recovery Process Modeling

Multi-stage regression pipeline to predict rougher and final gold recovery from plant telemetry. Cleaned process data, validated recovery calculations, and optimized against a weighted sMAPE business metric.

Key metrics.

Best model: Random Forest (best final notebook performance)

Overall sMAPE: 7.26% (weighted business metric on test data)

Rougher RMSE: 4.70 (rougher-stage recovery error)

Final RMSE: 7.74 (final recovery prediction error)

What the project tries to solve.

Predict both rougher-stage and final-stage gold recovery from process telemetry so plant operators can estimate output quality earlier in the pipeline.

Status: Study project, being refactored for portfolio use.

Notebook: gold_recovery_process_modeling.ipynb

Repo path: data-science-projects/gold-recovery-process-modeling

This project reads more like real industrial data science than a standard classroom notebook or toy prediction task: it has separate modeling targets tied to a multi-stage system, plant-process constraints, process-oriented data cleaning, domain-specific evaluation logic, and a need to reconcile train and test feature availability.

How I approached it.

Validated the recovery calculation itself before trusting the labels.

Aligned training and test features to avoid relying on unavailable plant measurements at inference time.

Modeled rougher and final recovery separately, then combined them with the weighted business metric.

Compared linear, decision tree, and random forest regressors against the weighted sMAPE objective.
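The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code. It assumes the recovery labels follow the standard two-product formula (concentrate, feed, and tails grades), that sMAPE uses the common symmetric definition, and that the weighted metric combines the stages as 25% rougher and 75% final. The weights are an assumption, since this page does not state them, and all function and parameter names are hypothetical.

```python
import numpy as np
import pandas as pd


def recovery(concentrate, feed, tails):
    """Recovery in percent from the two-product formula (assumed here).

    recovery = C * (F - T) / (F * (C - T)) * 100
    Rows with a zero denominator are returned as NaN.
    """
    c, f, t = (np.asarray(x, dtype=float) for x in (concentrate, feed, tails))
    denom = f * (c - t)
    with np.errstate(divide="ignore", invalid="ignore"):
        rec = c * (f - t) / denom * 100.0
    return np.where(denom == 0, np.nan, rec)


def shared_features(train, test, targets):
    """Columns available in both frames, excluding targets, in train order."""
    return [c for c in train.columns if c in test.columns and c not in targets]


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent.

    Pairs where both values are zero are counted as zero error.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = np.abs(y_true - y_pred)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    ratio = np.divide(diff, denom, out=np.zeros_like(diff), where=denom > 0)
    return float(ratio.mean() * 100.0)


def weighted_smape(smape_rougher, smape_final, w_rougher=0.25, w_final=0.75):
    """Combine per-stage sMAPE values; the default weights are an assumption."""
    return w_rougher * smape_rougher + w_final * smape_final
```

To validate the labels, the recomputed recovery can be compared against the recorded recovery column with a mean absolute error. A value near zero suggests the labels are consistent with the formula. Each stage model is then scored with `smape`, and the two scores are combined with `weighted_smape` to rank the candidate regressors.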

What I would improve next.

Refactor the dual-target workflow into a more production-style pipeline with shared preprocessing.

Add clearer diagnostics around train / test distribution shift and feature drift across process stages.

Explain the business meaning of sMAPE and where the current model would and would not be trusted operationally.

Python, Pandas, Scikit-learn, sMAPE