Big Data Machine Learning


M1

  • PyCaret (AutoML) can underperform when the classes are heavily imbalanced
  • If duplicate rows make up too large a share of the data, they can degrade the model (a quick check sketch follows this module)
  • Types of learning
    • Supervised learning: classification & regression models
    • Unsupervised learning: k-means, PCA
    • Reinforcement learning: deep learning models
  • AI winters:
    • 1st: mid-1970s, computers were too weak for the reasoning expected of them and expectations were inflated
    • 2nd: late 1980s–1990s, similar causes; the field recovered thanks to technical advances and renewed government funding
  • AI singularity: the hypothetical point at which AI surpasses human intelligence; its timing is uncertain
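
A quick sanity-check sketch for the two caveats above (class imbalance and duplicates), using pandas; the file name data.csv and the column name target are hypothetical:

```python
import pandas as pd

# Hypothetical dataset: "target" is the label column
df = pd.read_csv("data.csv")

# Class balance: a heavily skewed distribution can hurt (Auto)ML results
print(df["target"].value_counts(normalize=True))

# Duplicates: repeated rows get over-weighted during training, so drop them
print("duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates()
```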

M2

  • preprocess
    • Preprocess because results can be badly skewed otherwise (a combined sketch follows at the end of this module)
    • NaNs can be handled by dropping them (or doing nothing, if the model tolerates them), filling with a specific value, or imputing with KNN (estimating the value from similar data points)
  • encodings
    • label encoding is useful when the number of categories is high, but it can introduce an ordering/hierarchy that does not exist in the labels
    • dummy (one-hot) encoding can create multicollinearity (the dummy-variable trap) and may cause the model to overfit
  • scaling
    • Whether scale (StandardScaler) or not:
    • Scale: distance-based methods (SVM, KNN), because feature magnitudes directly affect the distances
    • NOT to scale: tree-based models, or when coefficients need to stay interpretable in the original units
    • whether to normalize (MinMaxScaler) or not:
    • when values need to be in a specific range
    • when the data is not Gaussian
  • sampling
    • Oversampling: create additional data points for the minority class (duplicated or synthetic)
    • Undersampling: remove data points from the majority class
  • feature engineering
    • RFE (recursive feature elimination) iteratively removes the least useful features
    • cluster labels (e.g., from k-means) can be used as an extra feature
  • metrics
    • Accuracy can be misleading due to class imbalance
    • Precision does not consider false negatives
    • Recall does not consider false positives
    • F1 weights precision and recall equally, so it cannot reflect uneven importance between them
    • the ROC curve and the area under it (AUC) summarize performance across all thresholds, so they do not explain the model at a specific operating point
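
To make these M2 pieces concrete, here is a minimal sketch chaining KNN imputation, dummy encoding, scaling, SMOTE oversampling, RFE, and imbalance-aware metrics with scikit-learn and imbalanced-learn; the file data.csv, the columns city and target, and the choice of 10 selected features are hypothetical:

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

df = pd.read_csv("data.csv")  # hypothetical dataset: numeric features with NaNs, categorical "city", binary "target"

# Dummy (one-hot) encoding; drop_first=True avoids the dummy-variable trap (multicollinearity)
df = pd.get_dummies(df, columns=["city"], drop_first=True)

X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# KNN imputation: fill NaNs from the most similar rows (fit on the training split only to avoid leakage)
imputer = KNNImputer(n_neighbors=5)
X_train, X_test = imputer.fit_transform(X_train), imputer.transform(X_test)

# Standard scaling: matters for distance-based models such as KNN and SVM
scaler = StandardScaler()
X_train, X_test = scaler.fit_transform(X_train), scaler.transform(X_test)

# Oversample the minority class with synthetic points (SMOTE), on the training data only
X_train, y_train = SMOTE(random_state=0).fit_resample(X_train, y_train)

# RFE: recursively eliminate the least useful features
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
X_train, X_test = selector.fit_transform(X_train, y_train), selector.transform(X_test)

# Report precision/recall/F1 and ROC AUC, since accuracy alone can mislead under class imbalance
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```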

M3

  • metrics
    • why metrics matter:
    • model evaluation, comparison, and optimization
    • understanding model behaviour
    • aligning with business goals, aiding decision making
    • types: MSE, RMSE, MAE, MAPE, R2
    • how to choose:
    • MSE is strongly influenced by outliers (errors are squared)
    • RMSE is good when large errors are disproportionately costly; it is in the same units as the target
  • bias: the model is too simple and systematically wrong; variance: the model is too sensitive to the training data and fits noise
  • model fit
    • underfitting -> make model more complex
    • overfitting -> regularization: rein the model in by adding a penalty term related to model complexity to the loss
    • L1 (lasso): add the sum of the absolute values of the weights to the loss function
    • L2 (ridge): add the sum of the squared weights to the loss function
    • heteroscedasticity: the variance of the errors changes across observations instead of staying constant
    • multicollinearity: two or more predictors are highly correlated; diagnose by checking the pairwise correlations and eliminating redundant predictors
  • Ensembles
    • bagging: train multiple copies of the same model on bootstrap samples of the training set (drawn with replacement) and aggregate their outputs, e.g., by voting
    • boosting: each subsequent model tries to correct the errors of the previous one
    • stacking: train several different base models, then train a meta-model on their predictions (a sketch of all three follows this module)
  • NFL (No Free Lunch) theorem: no single model is best for every problem
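
A minimal scikit-learn sketch of the three ensemble styles on synthetic data; the model choices and sizes are arbitrary illustrations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data stands in for the real dataset
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    # Bagging: many copies of the same base model on bootstrap samples, outputs aggregated by vote
    "bagging": BaggingClassifier(n_estimators=100, random_state=0),
    # Boosting: each new tree focuses on the errors of the previous ones
    "boosting": GradientBoostingClassifier(random_state=0),
    # Stacking: different base models, plus a meta-model trained on their predictions
    "stacking": StackingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=0)),
                    ("gb", GradientBoostingClassifier(random_state=0))],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
}

# No Free Lunch in practice: compare the candidates, don't assume one winner
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```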

M4

  • Curves
    • ROC curve: y = true positive rate, x = false positive rate; a curve bending towards the top-left is good
    • An area under the curve above roughly 0.8 is generally considered good
  • propensity: the likelihood that a given instance belongs to a particular class
  • errors:
    • FP = type I error
    • FN = type II error
    • cost sensitive classification does not treat these 2 errors equally
    • can be implemented by assigning class weights (see the sketch after this module)
  • Rashomon set: a set of different models that perform equally well on the same task
  • LightGBM vs XGBoost:
    • LightGBM grows trees leaf-wise, thus
    • faster training
    • lower resource usage
    • more sensitive to hyperparameters (easier to overfit)
    • XGBoost grows trees level-wise
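
A minimal sketch of cost-sensitive classification via class weights, on synthetic imbalanced data; the 10x weight on the positive class is an arbitrary illustration, not a recommended value:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: ~10% positives
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Plain model: treats false positives (type I) and false negatives (type II) equally
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Cost-sensitive model: a false negative is treated as 10x more costly via class weights
cost_sensitive = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 10}).fit(X_train, y_train)

for name, model in [("plain", plain), ("cost-sensitive", cost_sensitive)]:
    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])  # AUC over the propensity scores
    print(f"{name}: FP={fp} FN={fn} AUC={auc:.3f}")
```

Raising the weight on the positive class typically trades false negatives for false positives while the AUC (a ranking measure over the propensity scores) barely moves, which is why the operating point still has to be chosen explicitly.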

M5

  • themes of RAI:
    • transparency and explainability of the model itself
    • fairness and bias mitigation when the model is applied to real-life cases
    • safeguarding users' data (privacy and security)
    • supporting human decision making
    • creating positive social impact, and accountability on social issues
  • Interpreability vs. explainability
    • explainability: what the model does to the data (its input-to-output mapping) is easy to understand
    • interpretability: how the model arrives at that transformation, algorithm-wise, is easy to understand
  • Global explanation
    • PDP: how changing one feature impacts the output, keeping the others constant
    • SHAP: the contribution of each feature to the difference between the model's prediction and the average prediction across all points (a PDP/SHAP sketch follows this module)
    • Surrogate models: mimic the black box while being more explainable
  • Local explanation
    • LIME (Local Interpretable Model-agnostic Explanations): approximates the model locally with a more explainable model; shows why a row is classified a certain way and how each feature contributed to that prediction
    • typical use: debugging and model validation
    • SHAP
    • typical use: high-stakes operations
    • Counterfactual explanations: change the value of one feature and see whether the prediction changes
    • typical use: understanding decision boundaries
    • Anchors: find a set of feature conditions that, when satisfied, guarantee the same prediction regardless of everything else
    • typical use: explaining individual predictions for tree-based models
  • RAI challenges
    • engaging with stakeholders, so as to:
    • ensure fairness
    • address social implications
    • implement mechanisms for transparency
    • establish guidelines
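
A minimal sketch of a global PDP plus global/local SHAP values for a tree model on synthetic data; it assumes the shap package is installed, and exact return shapes can vary across shap versions:

```python
import shap  # pip install shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# PDP (global): vary one feature over a grid while averaging out the others
pdp = partial_dependence(model, X, features=[0])
print(pdp["average"])

# SHAP: each feature's contribution to the gap between a prediction and the average prediction
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # (n_rows, n_features) for this binary model

# Global view: mean absolute contribution per feature
print(abs(shap_values).mean(axis=0))

# Local view: why row 0 got its score (the same goal LIME pursues with a local surrogate model)
print(shap_values[0])
```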

M6

  • instrumental variables: variables that influence the treatment but have no direct effect on the outcome y (they affect y only through the treatment)
  • causal inference: the fundamental problem is that we can never observe the counterfactual outcome
  • confounders: things that have effect on treatment assignment and outcome
    • the outcome might itself influence treatment assignment
    • they can be time-varying
    • experiments need to be carefully designed (as in biological/clinical trials, e.g., with randomization) to mitigate them
  • challenges of causal ML
    • time & space complexity
    • units cannot be completely isolated from one another (interference between units)
    • ensuring interpretability and transparency
  • correlation != causality
  • Double ML, how it works:
    • compare a control group vs. an experimental (treated) group after using ML models to partial out the confounders from both the treatment and the outcome (sketch below)
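
The bullet above is terse, so here is a minimal simulated sketch of the core Double ML mechanic: use flexible ML models (with cross-fitting) to predict both the treatment and the outcome from the confounders, then regress the residuals on each other; the data-generating numbers below are made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Simulated data: X are confounders affecting both treatment T and outcome Y; the true effect of T on Y is 2.0
n = 5000
X = rng.normal(size=(n, 5))
T = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)        # treatment depends on confounders
Y = 2.0 * T + X[:, 0] - X[:, 2] + rng.normal(size=n)    # outcome depends on treatment + confounders

# Stage 1 (with cross-fitting): predict Y from X and T from X with flexible ML models,
# then take residuals -- this partials the confounders out of both
y_res = Y - cross_val_predict(RandomForestRegressor(random_state=0), X, Y, cv=5)
t_res = T - cross_val_predict(RandomForestRegressor(random_state=0), X, T, cv=5)

# Stage 2: regress outcome residuals on treatment residuals; the slope estimates the causal effect
effect = LinearRegression().fit(t_res.reshape(-1, 1), y_res).coef_[0]
print("estimated treatment effect:", round(effect, 3))  # should land close to 2.0
```

The residual-on-residual slope recovers the treatment effect even though the confounders influence both T and Y, which is exactly what a naive treated-vs-control comparison would get wrong.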

M7
