Used Car Price Prediction
Gradient-boosted regression model for used-car valuation on large, messy marketplace data. Compared LightGBM and CatBoost, handled mixed categorical features, and balanced model quality against inference speed.
Key metrics.
Best model: LightGBM, best overall speed / quality balance.
RMSE 1,768: best reported test error.
MAE 1,076: average absolute pricing error.
R-squared 0.849: variance explained on held-out data.
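The three reported metrics are standard regression scores and can be computed with scikit-learn. A minimal sketch on toy predictions (the numbers below are illustrative, not the project's):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy true prices and model predictions (illustrative only).
y_true = np.array([8500, 12300, 4100, 19800, 7600], dtype=float)
y_pred = np.array([9000, 11800, 4700, 18500, 8100], dtype=float)

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large misses
mae = mean_absolute_error(y_true, y_pred)           # average absolute pricing error
r2 = r2_score(y_true, y_pred)                       # share of variance explained
print(f"RMSE={rmse:.0f}  MAE={mae:.0f}  R2={r2:.3f}")
```

RMSE is always at least as large as MAE, which is why the 1,768 / 1,076 pair above is internally consistent.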
What the project tries to solve.
Estimate market value from used-car listing attributes quickly enough that a customer-facing pricing workflow can return a useful number in near real time.
Status: Study project, being refactored for portfolio use.
Notebook: used_car_price_prediction.ipynb
Repo path: data-science-projects/used-car-price-prediction
This is a strong portfolio problem because it combines real marketplace messiness (missing values, mixed categorical features), regression metrics that matter to product teams, and the practical tradeoff between model quality and prediction speed that matters outside a classroom notebook.
How I approached it.
Cleaned and normalized a noisy marketplace dataset with missing values and mixed categorical fields.
Built preprocessing pipelines for numeric and categorical features.
Compared baseline linear and tree models against boosted methods including LightGBM and CatBoost.
Tracked runtime along with RMSE, MAE, and R-squared to judge production usefulness rather than accuracy alone.
What I would improve next.
Tighten feature selection, dropping weaker date-derived features with explicit reasoning about leakage.
Refactor the notebook into a cleaner training pipeline with reproducible config and saved artifacts.
Add error slices by price band and vehicle segment to show where the model is reliable and where it misses.
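The error-slicing idea is straightforward to sketch with pandas: bin listings by price band and aggregate absolute error per band. The bands, sample data, and column names below are hypothetical; the same groupby pattern would apply to a vehicle-segment column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Illustrative predictions whose error grows with price,
# exactly the kind of pattern a per-band report should expose.
y_true = rng.uniform(2_000, 40_000, 1_000)
y_pred = y_true + rng.normal(0, 0.08 * y_true)

df = pd.DataFrame({"price": y_true, "abs_err": np.abs(y_true - y_pred)})
bands = [0, 5_000, 10_000, 20_000, 40_000]  # hypothetical price bands
df["band"] = pd.cut(df["price"], bins=bands)
report = df.groupby("band", observed=True)["abs_err"].agg(["count", "mean", "median"])
print(report)
```

A flat overall MAE can hide exactly this: acceptable errors on cheap cars and unusable ones at the top of the market.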