Super cool model baseline technique

From scikit-learn, this is a super cool way of getting a baseline. In the past I had done this majority-class assignment manually, but it is super cool that this is built in:

```python
from sklearn.dummy import DummyClassifier

X, y = get_some_data()
X_train, X_test, y_train, y_test = do_some_splitting(X, y)

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)
dummy_clf.score(X_test, y_test)
```
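As a self-contained sketch (the toy dataset here is mine, via `make_classification`, not from the note above), the `most_frequent` baseline's accuracy just matches the majority-class share of the data:

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Imbalanced toy data: roughly 75% of samples land in class 1.
X, y = make_classification(n_samples=1000, weights=(0.25,), random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(Counter(y_train))             # which class is most frequent
print(dummy.score(X_test, y_test))  # ~ the majority-class share of y_test
```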
Handies
Mainly notes from reading the Natural Language Processing with Transformers book

Really nice book! I have the urge to write down some snippets for myself so I can refer to them more easily later.

Read a dataset to pandas

```python
import pandas as pd
from datasets import load_dataset

emotions = load_dataset("emotion")
# emotions["train"]  # this is still a datasets.arrow_dataset.Dataset
emotions.set_format(type="pandas")
df = emotions["train"][:]  # but adding that "[:]" slice grants a DataFrame!
df.head()

# Go back to the initial format
emotions.reset_format()
```

Cool mini tokenization example

```python
from transformers import AutoTokenizer

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

text = "Tokenizing text is a core task of NLP."
encoded_text = tokenizer(text)
print(encoded_text)
# {'input_ids': [101, 19204, 6026, 3793, 2003, 1037, 4563, 4708, 1997, 17953, 2361, 1012, 102],
#  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

# Oh cool, and the tokenizer lets you convert back:
tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)
# ['[CLS]', 'token', '##izing', 'text', 'is', 'a', 'core', 'task', 'of', 'nl', '##p', '.', '[SEP]']
```

And finally ...
Loss functions vs Metric functions

I like the phrasing in this SO answer: loss functions are optimized directly during training, while metrics are only optimized indirectly. I was trying to figure out last year why functions commonly used as metrics (F1 and AUC) are not listed among the TensorFlow Keras loss functions. I did earlier try using F1 as a loss function when trying to understand my particular problem. (At least one error I ran into hints that it is not that simple, because you need to write extra code for computing the gradient.) But even if you can produce the code to compute a gradient for your custom loss function, maybe some metrics are more expensive to optimize with SGD than others. (Also, F1 is clearly less sensitive than a function that uses probabilities or logits directly.) ...
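A toy illustration of that last point (my own small example, not from the answer): F1 only sees hard labels after thresholding, which kills the gradient, while log loss consumes the probabilities directly.

```python
import numpy as np
from sklearn.metrics import f1_score, log_loss

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

# F1 needs hard labels, so the probabilities get thresholded first.
# That step function has zero gradient almost everywhere, which is
# one reason F1 does not work out of the box as a loss.
y_pred = (y_prob >= 0.5).astype(int)
print(f1_score(y_true, y_pred))

# Log loss uses the probabilities directly, so nudging y_prob always
# changes the loss -- it is smooth in the model outputs.
print(log_loss(y_true, y_prob))
```

Small nudges to `y_prob` leave F1 unchanged (the thresholded labels stay the same) but do move the log loss.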
TPR, FPR

```python
tpr = 1.0 * TP / (FN + TP)  # aka recall
fpr = 1.0 * FP / (FP + TN)
```

Confusion matrix

Given a testdf whose first column contains the actual labels (0 or 1), and y_prob is a list of predicted probabilities:

```python
y_pred = (y_prob >= 0.08)
confusion = pd.crosstab(index=y_true, columns=y_pred,
                        rownames=['actual'], colnames=['predictions'])
```

    predictions  False  True
    actual
    0              509   132
    1               32     22

Also there is a super nice helper in scikit-learn. Below, using some pre-baked results from running some of the chapter 2 code from https://transformersbook.com/ . ...
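As a quick self-contained check (toy labels of my own), the TPR/FPR formulas above line up with the cells that `sklearn.metrics.confusion_matrix` returns:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_prob = np.array([0.05, 0.2, 0.6, 0.3, 0.7, 0.8, 0.9, 0.04])
y_pred = (y_prob >= 0.5).astype(int)

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (fn + tp)  # aka recall / sensitivity
fpr = fp / (fp + tn)
print(tpr, fpr)
```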
One particularly killer feature of the ColumnTransformer is that you can apply a specific preprocessor to a subset of the columns, and then set remainder="passthrough" for the others.

```python
import numpy as np
from sklearn.preprocessing import (MinMaxScaler, OneHotEncoder, LabelEncoder)
from sklearn.preprocessing import Binarizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split


def make_data():
    X, y = make_classification(n_samples=1000, n_features=4,
                               n_informative=2, n_redundant=0,
                               random_state=42, shuffle=False,
                               weights=(0.25,),
                               )
    # In [90]: Counter(y)
    # Out[90]: Counter({0: 258, 1: 742})
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42,
    )
    return X_train, X_test, y_train, y_test


def forest_one():
    # Scale only columns 0 and 1; pass the rest through untouched.
    preprocessor = ColumnTransformer([
        ("scaler", MinMaxScaler(), [0, 1]),
    ], remainder="passthrough")
    pipeline = Pipeline([
        ("preproc", preprocessor),
        ("clf", RandomForestClassifier(
            max_depth=2, random_state=0, n_estimators=100,
            # class_weight="balanced_subsample",
            # "balanced", "balanced_subsample" or {0: 0.1, 1: 0.9} weights per class
        )),
    ])
    return pipeline


def forest_balanced():
    pipeline = Pipeline([
        ("scale", MinMaxScaler()),
        ("clf", RandomForestClassifier(
            max_depth=2, random_state=0, n_estimators=100,
            class_weight="balanced_subsample",
            # "balanced", "balanced_subsample" or {0: 0.1, 1: 0.9} weights per class
        )),
    ])
    return pipeline


def e2e(X_train, y_train, pipeline):
    scorers = ["f1_micro", "roc_auc"]
    scores = [
        [scorer,
         cross_val_score(pipeline, X_train, y_train, cv=3, scoring=scorer)]
        for scorer in scorers
    ]
    pipeline.fit(X_train, y_train)
    return pipeline, scores


def holdout_test(X_test, y_test, pipeline):
    y_preds = pipeline.predict(X_test)
    f1 = metrics.f1_score(y_test, y_preds, average="micro")
    y_preds = pipeline.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_preds, pos_label=1)
    auc = metrics.auc(fpr, tpr)
    return {"f1": f1, "auc": auc}
```

MinMaxScaler refresher:

```python
X = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]])
scaler = MinMaxScaler()
print(scaler.fit(X))
print(scaler.data_min_, scaler.data_max_)
print(scaler.transform(X))
# X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```

Usage:

```python
X_train, X_test, y_train, y_test = qp.make_data()

p1 = qp.forest_one()
p2 = qp.forest_balanced()

_, scores1 = qp.e2e(X_train, y_train, p1)
print("p1", scores1)
qp.holdout_test(X_test, y_test, p1)

_, scores2 = qp.e2e(X_train, y_train, p2)
print("p2", scores2)
qp.holdout_test(X_test, y_test, p2)
```
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(
    random_state=0,
    penalty="l2",
    class_weight="balanced",  # or dict {0: 0.1, 1: 0.9}
).fit(X, y,
      # sample_weight=  # array of length n_samples, one weight per row.
      )
clf.predict(X[:2, :])
clf.predict_proba(X[:2, :])
clf.score(X, y)
clf.decision_function(X)
```

```python
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
clf = RandomForestClassifier(
    max_depth=2, random_state=0, n_estimators=100,
    class_weight="balanced",  # "balanced", "balanced_subsample" or {0: 0.1, 1: 0.9} weights per class
)
clf.fit(X, y,
        # sample_weight=  # array of length n_samples, one weight per row.
        )
print(clf.predict([[0, 0, 0, 0]]))
```

    In [16]: pd.DataFrame(X).corr()
    Out[16]:
              0         1         2         3
    0  1.000000  0.065124  0.026765  0.028988
    1  0.065124  1.000000  0.031176 -0.026317
    2  0.026765  0.031176  1.000000 -0.006788
    3  0.028988 -0.026317 -0.006788  1.000000

    In [17]: clf.feature_importances_
    Out[17]: array([0.14205973, 0.76664038, 0.0282433 , 0.06305659])

    In [18]: print(clf.predict_log_proba([[0, 0, 0, 0]]))
    [[-1.72562562 -0.19608985]]

    In [19]: print(clf.predict_proba([[0, 0, 0, 0]]))
    [[0.17806162 0.82193838]]

    In [20]: from math import log

    In [21]: log(0.82193838)
    Out[21]: -0.19608985023951067

    In [22]: print(clf.predict([[0, 0, 0, 0]]))
    [1]

```python
y_true = y
y_pred = clf.predict(X)

metrics.accuracy_score(y_true, y_pred)
# Out[29]: 0.925

metrics.confusion_matrix(y_true, y_pred)
# Out[30]:
# array([[434,  70],
#        [  5, 491]])

fpr, tpr, thresholds = metrics.roc_curve(y_true, y_pred, pos_label=1)
metrics.auc(fpr, tpr)
# Out[32]: 0.9255152329749103

metrics.log_loss(y_true, y_pred)
# Out[33]: 2.590464201438415
```

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
```

Cross Validation

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold

```python
>>> import numpy as np
>>> from sklearn.model_selection import (
        KFold,
        StratifiedKFold,  # preserves the percentage of samples per class.
    )
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4])
>>> kf = KFold(n_splits=2)
>>> kf.get_n_splits(X)
2
>>> print(kf)
KFold(n_splits=2, random_state=None, shuffle=False)
>>> for train_index, test_index in kf.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]
```

Also handy for class weights:

```python
from sklearn import utils

utils.class_weight.compute_class_weight()   # takes class_weight, classes, y
utils.class_weight.compute_sample_weight()  # takes class_weight, y
```

Other handy references

https://ml-cheatsheet.readthedocs.io/

...
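Since those two `compute_*` helpers need arguments to actually run, here is a minimal sketch of calling them (the imbalanced toy `y` is mine):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

y = np.array([0, 0, 0, 1])  # imbalanced toy labels

# "balanced" weights: n_samples / (n_classes * bincount(y)),
# so the rare class gets the larger weight.
class_weights = compute_class_weight(class_weight="balanced",
                                     classes=np.unique(y), y=y)
print(class_weights)  # [0.6667, 2.0]

# Same idea, but expanded to one weight per sample.
sample_weights = compute_sample_weight(class_weight="balanced", y=y)
print(sample_weights)
```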
Read

```python
loc = "dbfs:/databricks-datasets/wine-quality/winequality-red.csv"
blah_df = spark.read.csv(loc, sep=";", header=True)
```

Map an existing function

```python
import pyspark.sql.functions as F

loc = "dbfs:/databricks-datasets/wine-quality/winequality-red.csv"
df = spark.read.csv(loc, sep=";", header=True)
df = df.withColumn("sugar_rounded", F.round(df["residual sugar"]))
df.select("residual sugar", "sugar_rounded").show(5)
```

    +--------------+-------------+
    |residual sugar|sugar_rounded|
    +--------------+-------------+
    |           1.9|          2.0|
    |           2.6|          3.0|
    +--------------+-------------+

Also you can split a column into an array. Here imagine there is a column "_c0" which holds tab-separated data:

```python
df = df.withColumn("col_split", F.split(F.col("_c0"), "\t"))
```

And casting:

```python
df = df.withColumn("foo", df["foo"].cast("double"))
```

Unique ids!

```python
df = df.withColumn("id", F.monotonically_increasing_id())
df.write.parquet("foo.parquet")
```

User Defined Functions

A user defined function needs to be defined with a return type. For instance, say there is a dataframe df with a name column whose values have spaces between first and last names. You can split them up like so, grabbing only the first n parts (here 2), by also using F.lit to pass a literal value to the function:

```python
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

def split_name(name, n):
    return name.split(" ")[:n]

udfSplitter = F.udf(split_name, ArrayType(StringType()))
df = ...
df = df.withColumn("separated_names", udfSplitter(df.name, F.lit(2)))
```

Quick Spark ml lib Logistic Regression Pipeline

Given a dataframe with features you would like to use/transform in a LogisticRegression: similarly to sklearn taking an input without feature names, the spark flavor does the same, taking a single column holding the input features. ...
mean/std technique

For xgboost/random-forest type models, per this article, the proposed idea is to use the predictions of all the individual trees as a prediction space, i.e. a kind of uncertainty interval. I wonder if we can say that a model is more certain about predictions whose per-tree distribution is tight, and conversely that it is unsure about its predictions when that distribution is wide. I have a feeling that the LSS approach to XGBoost here tries to automate something like that.

References

hmm mean/std technique intervals
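A minimal sketch of the idea as I understand it (my own toy data, using scikit-learn's RandomForestRegressor rather than xgboost): each fitted tree in `estimators_` gives its own prediction, and the mean/std across trees serve as a point estimate plus a rough uncertainty band.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# One prediction per tree: shape (n_trees, n_samples).
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])

mean = per_tree.mean(axis=0)  # matches forest.predict(X[:5])
std = per_tree.std(axis=0)    # wide std -> the trees disagree
print(np.c_[mean, std])
```

`forest.predict` is exactly this per-tree average, so the mean is free; the std is the extra uncertainty signal.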
Initial stab at interpreting the Variance Inflation Factor (VIF)

So far my skim of https://en.wikipedia.org/wiki/Variance_inflation_factor and https://en.wikipedia.org/wiki/Multicollinearity tells me that a high Variance Inflation Factor (VIF) indicates high multicollinearity with one or more other independent variables. And that's bad because (a) when building a linear model (at least using ordinary least squares (OLS); not yet sure if this is still true if you use regularization), the coefficients calculated for the independent variables can change "erratically" given slightly different data. ...
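To make the definition concrete, a hand-rolled sketch on toy data of mine: the VIF for feature i is 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on the remaining features. (statsmodels also ships a `variance_inflation_factor` helper for this.)

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)  # nearly collinear with x1
X = np.c_[x1, x2, x3]

def vif(X, i):
    # Regress column i on all the other columns; VIF = 1 / (1 - R^2).
    others = np.delete(X, i, axis=1)
    r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
    return 1.0 / (1.0 - r2)

# x1 and x3 show large VIFs (they nearly duplicate each other);
# x2 sits near 1 because nothing else predicts it.
print([round(vif(X, i), 1) for i in range(X.shape[1])])
```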
Create new ssh key

With ssh-keygen, with a passphrase too.

Let ssh-agent manage the ssh key passphrase

With

```
ssh-add ~/.ssh/path/to/key
```

And save to macbook keychain

Save that passphrase with

```
ssh-add -K ~/.ssh/path/to/private/key
```

But apparently, according to this stackoverflow answer, with Monterey ssh-add uses

```
ssh-add --apple-use-keychain ~/.ssh/path/to/private/key
```

because --apple-use-keychain is the new -K. And similarly, --apple-load-keychain is the new -A, to load a key into your ssh-agent after logging in:

```
ssh-add --apple-load-keychain ~/.ssh/path/to/private/key
```