Used Car Price Prediction
Gradient-boosted regression model for used-car valuation on large, messy marketplace data. Compared LightGBM and CatBoost, handled mixed categorical features, and balanced model quality against inference speed.
Key metrics.
Best model: LightGBM, best overall speed / quality balance.
RMSE 1,768: best reported test error.
MAE 1,076: average absolute pricing error.
R-squared 0.849: variance explained on held-out data.
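The three reported metrics are standard regression scores and can be computed with scikit-learn. A minimal sketch on toy predictions (the numbers below are illustrative, not the project's):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy true prices and model predictions (illustrative only).
y_true = np.array([8500, 12300, 4100, 19800, 7600], dtype=float)
y_pred = np.array([9000, 11800, 4700, 18500, 8100], dtype=float)

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large misses
mae = mean_absolute_error(y_true, y_pred)           # average absolute pricing error
r2 = r2_score(y_true, y_pred)                       # share of variance explained
print(f"RMSE={rmse:.0f}  MAE={mae:.0f}  R2={r2:.3f}")
```

RMSE is always at least as large as MAE, which is why the 1,768 / 1,076 pair above is internally consistent.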
What the project tries to solve.
Estimate market value from used-car listing attributes quickly enough that a customer-facing pricing workflow can return a useful number in near real time.
Status: Study project, being refactored for portfolio use.
Notebook: used_car_price_prediction.ipynb
Repo path: data-science-projects/used-car-price-prediction
This is a strong portfolio problem because it combines real marketplace messiness (missing values, mixed categorical features), regression metrics that matter to product teams, and the practical tradeoff between model quality and prediction speed that matters outside a classroom notebook.
How I approached it.
Cleaned and normalized a noisy marketplace dataset with missing values and mixed categorical fields.
Built preprocessing pipelines for numeric and categorical features.
Compared baseline linear and tree models against boosted methods including LightGBM and CatBoost.
Tracked runtime along with RMSE, MAE, and R-squared to judge production usefulness rather than accuracy alone.
What I would improve next.
Tighten feature selection, dropping weaker date-derived features with explicit reasoning about leakage.
Refactor the notebook into a cleaner training pipeline with reproducible config and saved artifacts.
Add error slices by price band and vehicle segment to show where the model is reliable and where it misses.
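The error-slicing idea is straightforward to sketch with pandas: bin listings by price band and aggregate absolute error per band. The bands, sample data, and column names below are hypothetical; the same groupby pattern would apply to a vehicle-segment column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Illustrative predictions whose error grows with price,
# exactly the kind of pattern a per-band report should expose.
y_true = rng.uniform(2_000, 40_000, 1_000)
y_pred = y_true + rng.normal(0, 0.08 * y_true)

df = pd.DataFrame({"price": y_true, "abs_err": np.abs(y_true - y_pred)})
bands = [0, 5_000, 10_000, 20_000, 40_000]  # hypothetical price bands
df["band"] = pd.cut(df["price"], bins=bands)
report = df.groupby("band", observed=True)["abs_err"].agg(["count", "mean", "median"])
print(report)
```

A flat overall MAE can hide exactly this: acceptable errors on cheap cars and unusable ones at the top of the market.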