Back to index

Evaluating success of a data project

Define a problem

a. Define a success metric
Data

a. Does the model need to work in real time?

b. Does the model need to be trained in real time?

c. Is there an inconsistency between train and test data
Evaluation

a. Train test split at random assumes data is homogenous. Probably not. Mayb slpit by time instead.

b. Baseline model
Features

a. Choose good ones

b. Remove redundant ones
Modelling

a. Does it need to be interperable?

b. How to tune
Experiment

a. Go into production and evaluate and update quickly

Standard process

Have a standard boilerplate way to process data to investigate. Saves lots of reengineering.

e.g. Qualitataive / quantitative cols

Null values

correlations

Uac / roc to score

Classification_report to show the precision of models

Models

Xgboost is new and good

Random forests -- tree decision, good at lots of things.

Use ridge not Lasso as lasso removes some features -- try Eslatic net regression

Blueprint technologies flow chart to find best model. Also one on scikit

Mean absolute error is a good accuracy test.

Move to K fold

Open grid cv to go through parameters. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

ML ideas

tpot for automated optimisation
EthicalML/awesome-production-machine-learning: lots of good learning resources
shap for model explanation:

Deployment ideas

Heroku - dont need VM, just script. Charged per run of script.
Azure, with linux
cloud.digitalocean.com:
- easier to use than google and AWS.
- Good for hosting also. "droplets" are VMs. $5 pm
- need to have a firewall with some ports open, need to lock down VM to make sure can't be hacked.
- more data security than using AWS etc.

Model Selection

in order of evolution:

train on all data, test on all data (code check)
simple train test split
k fold cross evaluation and get averages and sensitivity
Tuly held out test set

Grid Search (e.g. GridSearchCV) does cross valiation and parameter optimisation.

Default measure used is accuracy, but can pass an evaluation metric as scoring parameter.

Get a different decision boundary when using different scoring pameters to tune the model.

Just using cross validation is prone to data leakage, test data used partially for model selection. Hold out a final test set, not used in cross validation.

In practice, use 3 data splits

training set
validation set (model selection)
test set (final evaluation)

Select an evaluation metric which suits the particualar application.