Big Data Machine Learning


M1

  • PyCaret (AutoML) can underperform when the classes are heavily imbalanced
  • If duplicate rows make up too large a share of the data, they can degrade the model (a quick check sketch follows this module)
  • Types of learning
    • Supervised learning: classification & regression models
    • Unsupervised learning: k-means, PCA
    • Reinforcement learning: deep learning models
  • AI winters:
    • 1st: mid-1970s, computers were too weak for the reasoning expected of them and expectations were inflated
    • 2nd: late 1980s–1990s, similar causes; the field recovered thanks to technical advances and renewed government funding
  • AI singularity: the hypothetical point at which AI surpasses human intelligence; its timing is uncertain
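
A quick sanity-check sketch for the two caveats above (class imbalance and duplicates), using pandas; the file name data.csv and the column name target are hypothetical:

```python
import pandas as pd

# Hypothetical dataset: "target" is the label column
df = pd.read_csv("data.csv")

# Class balance: a heavily skewed distribution can hurt (Auto)ML results
print(df["target"].value_counts(normalize=True))

# Duplicates: repeated rows get over-weighted during training, so drop them
print("duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates()
```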

M2

  • preprocess
    • Preprocess because results can be badly skewed otherwise (a combined sketch follows at the end of this module)
    • NaNs can be handled by dropping them (or doing nothing, if the model tolerates them), filling with a specific value, or imputing with KNN (estimating the value from similar data points)
  • encodings
    • label encoding is useful when the number of categories is high, but it can introduce an ordering/hierarchy that does not exist in the labels
    • dummy (one-hot) encoding can create multicollinearity (the dummy-variable trap) and may cause the model to overfit
  • scaling
    • Whether scale (StandardScaler) or not:
    • Scale: distance-based methods (SVM, KNN), because feature magnitudes directly affect the distances
    • NOT to scale: tree-based models, or when coefficients need to stay interpretable in the original units
    • whether to normalize (MinMaxScaler) or not:
    • when values need to be in a specific range
    • when the data is not Gaussian
  • sampling
    • Oversampling: create additional data points for the minority class (duplicated or synthetic)
    • Undersampling: remove data points from the majority class
  • feature engineering
    • RFE (recursive feature elimination) iteratively removes the least useful features
    • cluster labels (e.g., from k-means) can be used as an extra feature
  • metrics
    • Accuracy can be misleading due to class imbalance
    • Precision does not consider false negatives
    • Recall does not consider false positives
    • F1 weights precision and recall equally, so it cannot reflect uneven importance between them
    • the ROC curve and the area under it (AUC) summarize performance across all thresholds, so they do not explain the model at a specific operating point
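
To make these M2 pieces concrete, here is a minimal sketch chaining KNN imputation, dummy encoding, scaling, SMOTE oversampling, RFE, and imbalance-aware metrics with scikit-learn and imbalanced-learn; the file data.csv, the columns city and target, and the choice of 10 selected features are hypothetical:

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

df = pd.read_csv("data.csv")  # hypothetical dataset: numeric features with NaNs, categorical "city", binary "target"

# Dummy (one-hot) encoding; drop_first=True avoids the dummy-variable trap (multicollinearity)
df = pd.get_dummies(df, columns=["city"], drop_first=True)

X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# KNN imputation: fill NaNs from the most similar rows (fit on the training split only to avoid leakage)
imputer = KNNImputer(n_neighbors=5)
X_train, X_test = imputer.fit_transform(X_train), imputer.transform(X_test)

# Standard scaling: matters for distance-based models such as KNN and SVM
scaler = StandardScaler()
X_train, X_test = scaler.fit_transform(X_train), scaler.transform(X_test)

# Oversample the minority class with synthetic points (SMOTE), on the training data only
X_train, y_train = SMOTE(random_state=0).fit_resample(X_train, y_train)

# RFE: recursively eliminate the least useful features
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
X_train, X_test = selector.fit_transform(X_train, y_train), selector.transform(X_test)

# Report precision/recall/F1 and ROC AUC, since accuracy alone can mislead under class imbalance
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```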

M3

  • metrics
    • why metrics matter:
    • model evaluation, comparison, and optimization
    • understanding model behaviour
    • aligning with business goals, aiding decision making
    • types: MSE, RMSE, MAE, MAPE, R2
    • how to choose:
    • MSE is strongly influenced by outliers (errors are squared)
    • RMSE is good when large errors are disproportionately costly; it is in the same units as the target
  • bias: the model is too simple and systematically wrong; variance: the model is too sensitive to the training data and fits noise
  • model fit
    • underfitting -> make model more complex
    • overfitting -> regularization: rein the model in by adding a penalty term related to model complexity to the loss
    • L1 (lasso): add the sum of the absolute values of the weights to the loss function
    • L2 (ridge): add the sum of the squared weights to the loss function
    • heteroscedasticity: the variance of the errors changes across observations instead of staying constant
    • multicollinearity: two or more predictors are highly correlated; diagnose by checking the pairwise correlations and eliminating redundant predictors
  • Ensembles
    • bagging: train multiple copies of the same model on bootstrap samples of the training set (drawn with replacement) and aggregate their outputs, e.g., by voting
    • boosting: each subsequent model tries to correct the errors of the previous one
    • stacking: train several different base models, then train a meta-model on their predictions (a sketch of all three follows this module)
  • NFL (No Free Lunch) theorem: no single model is best for every problem
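
A minimal scikit-learn sketch of the three ensemble styles on synthetic data; the model choices and sizes are arbitrary illustrations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data stands in for the real dataset
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    # Bagging: many copies of the same base model on bootstrap samples, outputs aggregated by vote
    "bagging": BaggingClassifier(n_estimators=100, random_state=0),
    # Boosting: each new tree focuses on the errors of the previous ones
    "boosting": GradientBoostingClassifier(random_state=0),
    # Stacking: different base models, plus a meta-model trained on their predictions
    "stacking": StackingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=0)),
                    ("gb", GradientBoostingClassifier(random_state=0))],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
}

# No Free Lunch in practice: compare the candidates, don't assume one winner
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```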

M4

  • Curves
    • ROC curve: y = true positive rate, x = false positive rate; a curve bending towards the top-left is good
    • An area under the curve above roughly 0.8 is generally considered good
  • propensity: the likelihood that a given instance belongs to a particular class
  • errors:
    • FP = type I error
    • FN = type II error
    • cost sensitive classification does not treat these 2 errors equally
    • can be implemented by assigning class weights (see the sketch after this module)
  • Rashomon set: a set of different models that perform equally well on the same task
  • LightGBM vs XGBoost:
    • LightGBM grows trees leaf-wise, thus
    • faster training
    • lower resource usage
    • more sensitive to hyperparameters (easier to overfit)
    • XGBoost grows trees level-wise
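
A minimal sketch of cost-sensitive classification via class weights, on synthetic imbalanced data; the 10x weight on the positive class is an arbitrary illustration, not a recommended value:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: ~10% positives
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Plain model: treats false positives (type I) and false negatives (type II) equally
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Cost-sensitive model: a false negative is treated as 10x more costly via class weights
cost_sensitive = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 10}).fit(X_train, y_train)

for name, model in [("plain", plain), ("cost-sensitive", cost_sensitive)]:
    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])  # AUC over the propensity scores
    print(f"{name}: FP={fp} FN={fn} AUC={auc:.3f}")
```

Raising the weight on the positive class typically trades false negatives for false positives while the AUC (a ranking measure over the propensity scores) barely moves, which is why the operating point still has to be chosen explicitly.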

M5

  • themes of RAI:
    • transparency and explainability of the model itself
    • fairness and bias mitigation when the model is applied to real-life cases
    • safeguarding users' data (privacy and security)
    • supporting human decision making
    • creating positive social impact, and accountability on social issues
  • Interpreability vs. explainability
    • explainability: what the model does to the data (its input-to-output mapping) is easy to understand
    • interpretability: how the model arrives at that transformation, algorithm-wise, is easy to understand
  • Global explanation
    • PDP: how changing one feature impacts the output, keeping the others constant
    • SHAP: the contribution of each feature to the difference between the model's prediction and the average prediction across all points (a PDP/SHAP sketch follows this module)
    • Surrogate models: mimic the black box while being more explainable
  • Local explanation
    • LIME (Local Interpretable Model-agnostic Explanations): approximates the model locally with a more explainable model; shows why a row is classified a certain way and how each feature contributed to that prediction
    • typical use: debugging and model validation
    • SHAP
    • typical use: high-stakes operations
    • Counterfactual explanations: change the value of one feature and see whether the prediction changes
    • typical use: understanding decision boundaries
    • Anchors: find a set of feature conditions that, when satisfied, guarantee the same prediction regardless of everything else
    • typical use: explaining individual predictions for tree-based models
  • RAI challenges
    • engaging with stakeholders, so as to:
    • ensure fairness
    • address social implications
    • implement mechanisms for transparency
    • establish guidelines
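
A minimal sketch of a global PDP plus global/local SHAP values for a tree model on synthetic data; it assumes the shap package is installed, and exact return shapes can vary across shap versions:

```python
import shap  # pip install shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# PDP (global): vary one feature over a grid while averaging out the others
pdp = partial_dependence(model, X, features=[0])
print(pdp["average"])

# SHAP: each feature's contribution to the gap between a prediction and the average prediction
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # (n_rows, n_features) for this binary model

# Global view: mean absolute contribution per feature
print(abs(shap_values).mean(axis=0))

# Local view: why row 0 got its score (the same goal LIME pursues with a local surrogate model)
print(shap_values[0])
```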

M6

  • instrumental variables: variables that influence the treatment but have no direct effect on the outcome y (they affect y only through the treatment)
  • causal inference: the fundamental problem is that we can never observe the counterfactual outcome
  • confounders: things that have effect on treatment assignment and outcome
    • the outcome might itself influence treatment assignment
    • they can be time-varying
    • experiments need to be carefully designed (as in biological/clinical trials, e.g., with randomization) to mitigate them
  • challenges of causal ML
    • time & space complexity
    • units cannot be completely isolated from one another (interference between units)
    • ensuring interpretability and transparency
  • correlation != causality
  • Double ML, how it works:
    • compare a control group vs. an experimental (treated) group after using ML models to partial out the confounders from both the treatment and the outcome (sketch below)
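
The bullet above is terse, so here is a minimal simulated sketch of the core Double ML mechanic: use flexible ML models (with cross-fitting) to predict both the treatment and the outcome from the confounders, then regress the residuals on each other; the data-generating numbers below are made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Simulated data: X are confounders affecting both treatment T and outcome Y; the true effect of T on Y is 2.0
n = 5000
X = rng.normal(size=(n, 5))
T = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)        # treatment depends on confounders
Y = 2.0 * T + X[:, 0] - X[:, 2] + rng.normal(size=n)    # outcome depends on treatment + confounders

# Stage 1 (with cross-fitting): predict Y from X and T from X with flexible ML models,
# then take residuals -- this partials the confounders out of both
y_res = Y - cross_val_predict(RandomForestRegressor(random_state=0), X, Y, cv=5)
t_res = T - cross_val_predict(RandomForestRegressor(random_state=0), X, T, cv=5)

# Stage 2: regress outcome residuals on treatment residuals; the slope estimates the causal effect
effect = LinearRegression().fit(t_res.reshape(-1, 1), y_res).coef_[0]
print("estimated treatment effect:", round(effect, 3))  # should land close to 2.0
```

The residual-on-residual slope recovers the treatment effect even though the confounders influence both T and Y, which is exactly what a naive treated-vs-control comparison would get wrong.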

M7
