PyCaret can underperform if the sample is heavily imbalanced
If duplicates make up too many of the rows, they can make the model dumb
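A quick pandas sketch of both checks (df and the target column "y" are made-up names):

    import pandas as pd

    df = pd.DataFrame({"x": [1, 1, 2, 3], "y": [0, 0, 0, 1]})  # toy data
    print(df["y"].value_counts(normalize=True))  # how imbalanced is the target?
    df = df.drop_duplicates()                    # drop exact duplicate rows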
Types of learning
Supervised learning:
classification & regression models
Unsupervised learning:
k-means, PCA
Reinforcement learning:
deep learning models (e.g., deep Q-networks)
AI winters:
1st: mid-1970s, hardware was too weak and problems took far more computation than promised
2nd: 1980s-1990s, same story again; recovered thanks to technical advances and renewed government funding
AI singularity: the hypothetical point where AI surpasses human intelligence; the timing is uncertain
M2
preprocess
Preprocess because results can be badly skewed otherwise
NaNs can be handled by doing nothing (leaving them), filling with a specific value, or imputing with KNN (estimating the value from similar points)
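Rough scikit-learn sketch of the fill-with-a-value and KNN options (toy array, NaN in row 2):

    import numpy as np
    from sklearn.impute import KNNImputer, SimpleImputer

    X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0]])
    X_const = SimpleImputer(strategy="constant", fill_value=0).fit_transform(X)  # fill with a fixed value
    X_knn = KNNImputer(n_neighbors=2).fit_transform(X)  # estimate from the 2 most similar rows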
encodings
label encoding is handy when the number of categories is high, but it can introduce an ordering that doesn't exist in the labels
dummy encoding can create multicollinearity and cause the model to overfit
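Sketch of both encodings (the city column is made up); note drop_first, which drops one dummy column to ease the multicollinearity problem:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF"]})
    df["city_label"] = LabelEncoder().fit_transform(df["city"])  # compact ints, but implies an order
    dummies = pd.get_dummies(df["city"], prefix="city", drop_first=True)  # one column dropped on purpose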
scaling
Whether to scale (StandardScaler) or not:
Scale: distance-based methods (SVM, KNN), because feature magnitudes seriously matter there
NOT to scale: tree-based models, or models that need to stay interpretable
Whether to normalize (MinMaxScaler) or not:
when values need to be in a specific range
when the data is not Gaussian
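Sketch of both scalers on a toy column:

    import numpy as np
    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    X = np.array([[1.0], [5.0], [10.0]])
    X_std = StandardScaler().fit_transform(X)  # mean 0, std 1 — what SVM/KNN want
    X_mm = MinMaxScaler().fit_transform(X)     # squeezed into [0, 1]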
sampling
Oversampling makes up new data points for the minority class
Undersampling cuts data points from the majority class
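Sketch of both, assuming the third-party imbalanced-learn package (pip install imbalanced-learn):

    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
    X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)                 # synthesize minority points
    X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)  # trim majority points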
feature engineering
RFE (recursive feature elimination) is used to eliminate unnecessary features
cluster labels can be used as features
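Sketch of both ideas with scikit-learn (toy data):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_features=8, random_state=0)
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
    print(rfe.support_)  # mask of the features RFE kept
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    X_plus = np.column_stack([X, labels])  # cluster label bolted on as an extra feature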
metrics
Accuracy can be misleading due to class imbalance
Precision does not consider false negatives
Recall does not consider false positives
F1 weighs recall and precision equally, so it cannot reflect uneven importance between them
ROC curve and Area Under the ROC Curve (AUC) cannot explain the model at a specific operating point
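Sketch of all of these on a tiny imbalanced example:

    from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                                 recall_score, roc_auc_score)

    y_true = [0, 0, 0, 0, 1, 1]
    y_pred = [0, 0, 0, 0, 0, 1]
    y_prob = [0.1, 0.2, 0.3, 0.4, 0.45, 0.9]
    print(accuracy_score(y_true, y_pred))   # 0.83, flattered by the majority class
    print(precision_score(y_true, y_pred))  # ignores the missed positive (FN)
    print(recall_score(y_true, y_pred))     # ignores false alarms (FP)
    print(f1_score(y_true, y_pred))         # fixed 50/50 blend of the two
    print(roc_auc_score(y_true, y_prob))    # threshold-free, says nothing about one cutoff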
M3
metrics
why care about them:
model evaluation, comparison & optimization
understanding model behaviour
aligning with business goals, aiding decision making
types: MSE, RMSE, MAE, MAPE, R2
how to choose:
MSE can be heavily influenced by outliers
RMSE is good when large errors are disproportionately costly
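Sketch of all five on toy numbers; note how the one big miss (100 vs 60) dominates MSE/RMSE:

    import numpy as np
    from sklearn.metrics import (mean_absolute_error, mean_absolute_percentage_error,
                                 mean_squared_error, r2_score)

    y_true = np.array([3.0, 5.0, 7.0, 100.0])
    y_pred = np.array([2.5, 5.5, 7.5, 60.0])
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)  # back in the target's units, still punishes the big miss
    mae = mean_absolute_error(y_true, y_pred)
    mape = mean_absolute_percentage_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)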
bias is the model being too simplistic; variance is the model being too sensitive to noise in the training data
model fit
underfitting -> make the model more complex
overfitting -> regularization, which dumbs the model down by adding a penalty term related to model complexity
L1: add the absolute values of the weights to the loss function
L2: add the squared values of the weights to the loss function
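In scikit-learn terms that is Lasso (L1) vs Ridge (L2); rough sketch:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso, Ridge

    X, y = make_regression(n_features=10, n_informative=3, noise=10, random_state=0)
    lasso = Lasso(alpha=1.0).fit(X, y)  # loss + alpha * sum(|w|): pushes some weights to exactly 0
    ridge = Ridge(alpha=1.0).fit(X, y)  # loss + alpha * sum(w^2): shrinks all weights towards 0
    print((lasso.coef_ == 0).sum(), "weights zeroed by L1")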
heteroscedasticity is when the variance of the errors changes across the range of predictions
multicollinearity is when two or more predictors are highly correlated. Debug by computing all pairwise correlations and eliminating the redundant ones
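Quick debug sketch — build two near-duplicate predictors and spot them in the correlation matrix:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    a = rng.normal(size=100)
    df = pd.DataFrame({"a": a,
                       "b": 2 * a + rng.normal(scale=0.01, size=100),  # nearly a copy of a
                       "c": rng.normal(size=100)})
    print(df.corr().round(2))  # a/b shows up as ~1.0 — drop one of them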
Ensembles
bagging: train multiple identical models on subsets of the original training set (drawn with replacement) and vote to get the output
boosting: each subsequent model tries to correct the errors of the previous one
stacking: train multiple different models, then train a meta-model on top of their outputs
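Sketch of the three flavours with scikit-learn:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                                  StackingClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(random_state=0)
    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50).fit(X, y)  # bootstrap + vote
    boost = GradientBoostingClassifier().fit(X, y)  # each tree corrects the previous one's errors
    stack = StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier()), ("svm", SVC())],
        final_estimator=LogisticRegression(),  # the meta-model trained on the base models' outputs
    ).fit(X, y)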
NFL (the No Free Lunch theorem) says one size can't fit all: no single model is best for every problem
M4
Curves
ROC curve, y = true positive rate, x = false positive rate; bending towards the top left is considered good
An area above 0.8 is considered good
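Sketch of computing both:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, roc_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_te, probs)  # x = FPR, y = TPR
    print(roc_auc_score(y_te, probs))              # the area under that curve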
propensity: the likelihood that a given instance belongs to a particular class
errors:
FP = type I error
FN = type II error
cost sensitive classification does not treat these 2 errors equally
can be implemented by assigning class weights
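Sketch with scikit-learn class weights (the 1:10 ratio is made up — set it from your actual error costs):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(weights=[0.9, 0.1], random_state=0)
    # a false negative on class 1 now costs 10x a false positive
    clf = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000).fit(X, y)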
the Rashomon set is a set of different models that perform equally well on the same task
LightGBM vs XGBoost:
LGBM grows trees leaf-wise, thus:
faster training
lower resource usage
more sensitive to hyperparameters
XGB grows trees level-wise
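Sketch of the knob each one cares about, assuming the lightgbm and xgboost packages are installed:

    from lightgbm import LGBMClassifier
    from sklearn.datasets import make_classification
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    lgbm = LGBMClassifier(num_leaves=31).fit(X, y)  # leaf-wise: num_leaves is the sensitive knob
    xgb = XGBClassifier(max_depth=6).fit(X, y)      # level-wise by default: depth is the main knob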
M5
themes of RAI:
transparency and explainability of the model itself
fairness and bias mitigation when the model is applied to real-life cases
safeguarding users' data
supporting human decision making
creating social impact, and accountability on social issues
Interpretability vs. explainability
explainability: the model's transformation of the data is easy to understand
interpretability: how the model makes that transformation, algorithm-wise, is easy to understand
Global explanation
PDP: how a change in one feature impacts the output, keeping the others constant
SHAP: shows each feature's contribution to the difference between the model's prediction and the average prediction across all points
Surrogate models: mimic the black box while being more explainable
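Sketch of PDP (scikit-learn) and SHAP (third-party shap package) on a toy forest:

    import shap  # pip install shap
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import partial_dependence

    X, y = make_classification(random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X, y)
    pdp = partial_dependence(model, X, features=[0])        # vary feature 0, hold the rest fixed
    shap_values = shap.TreeExplainer(model).shap_values(X)  # per-feature contributions vs. the average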
Local explanation
Local Interpretable Model-agnostic Explanations (LIME): approximate the model locally using a more explainable model; shows why a row is classified a certain way, and how each feature contributed to that prediction (see the sketch after this list)
debugging and model validation
SHAP
high-stakes operations
Counterfactual explanations: change the value of one feature and see if the result changes
understanding decision boundaries
Anchors: find a set of feature conditions that guarantee the same prediction whenever satisfied, regardless of everything else
explaining individual predictions for tree-based models
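Sketch of LIME on one row (third-party lime package):

    from lime.lime_tabular import LimeTabularExplainer  # pip install lime
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X, y)
    explainer = LimeTabularExplainer(X, mode="classification")
    exp = explainer.explain_instance(X[0], model.predict_proba)  # fit a simple local model around row 0
    print(exp.as_list())  # each feature's push on this one prediction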
RAI challenges
engage with stakeholders, which means:
ensuring fairness
addressing social implications
implementing mechanisms for transparency
establishing guidelines
M6
instrumental vars: a variable that has no direct effect on the outcome y at all, and influences y only through the treatment variable
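Hand-rolled two-stage least squares sketch on simulated data (all names made up) — the instrument z touches y only through x, which is what lets stage 2 recover the true effect:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n = 1000
    z = rng.normal(size=n)              # instrument: no direct effect on y
    u = rng.normal(size=n)              # unobserved confounder
    x = z + u + rng.normal(size=n)      # treatment, tainted by the confounder
    y = 2 * x + u + rng.normal(size=n)  # true effect of x on y is 2

    x_hat = LinearRegression().fit(z.reshape(-1, 1), x).predict(z.reshape(-1, 1))  # stage 1
    print(LinearRegression().fit(x_hat.reshape(-1, 1), y).coef_)  # stage 2: ~2; naive OLS overshoots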