Super cool model baseline technique

From scikit-learn, this is a super cool way of getting a baseline. In the past I had done this majority-class assignment manually, but it turns out it is built in:

```python
from sklearn.dummy import DummyClassifier

X, y = get_some_data()
X_train, X_test, y_train, y_test = do_some_splitting(X, y)

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)
dummy_clf.score(X_test, y_test)
```
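A minimal runnable sketch of the same idea, filling in the placeholder data and split with make_classification and train_test_split (my fill-ins, not from the original post):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier

# Imbalanced toy data: roughly 90% of samples in the majority class.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)

# Accuracy of always predicting the majority class -- around 0.9 here,
# which is the bar any real model needs to clear.
print(dummy_clf.score(X_test, y_test))
```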

(updated February 26, 2023) · 1 min · 61 words · Michal Piekarczyk

Mainly notes from reading the Natural Language Processing with Transformers book

Really nice book! I have the urge to write down some snippets for myself so I can refer to them more easily later.

Read a dataset to pandas

```python
import pandas as pd
from datasets import load_dataset

emotions = load_dataset("emotion")
# emotions["train"]  # this is still a datasets.arrow_dataset.Dataset
emotions.set_format(type="pandas")
df = emotions["train"][:]  # but adding that "[:]" slice grants a DataFrame!...
```
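One follow-up I find handy (not in the excerpt above, but part of the same datasets API): reset_format() switches the dataset back to returning plain rows after you are done with pandas:

```python
from datasets import load_dataset

emotions = load_dataset("emotion")
emotions.set_format(type="pandas")
df = emotions["train"][:]          # pandas DataFrame view of the train split

print(df["label"].value_counts())  # e.g. inspect the class balance

emotions.reset_format()            # back to the default dict-style rows
```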

(updated February 26, 2023) · 1 min · 177 words · Michal Piekarczyk

Loss functions vs Metric functions

I like the phrasing in this SO answer: loss functions are optimized directly during training, while metrics are only optimized indirectly. I was trying to figure out last year why functions commonly used as metrics (F1 and AUC) are not listed among the TensorFlow Keras loss functions. (The usual explanation: they are not smooth, differentiable functions of the model outputs, so gradient descent has nothing to work with.) I did, however, earlier try using F1 as a loss function when trying to understand my particular problem....
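A small sketch of the distinction, using scikit-learn rather than Keras (my illustration, not from the post): the model below minimizes log loss directly, and F1 is only computed afterwards as a metric.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# LogisticRegression optimizes log loss (cross-entropy) directly...
clf = LogisticRegression().fit(X_train, y_train)
print("log loss:", log_loss(y_test, clf.predict_proba(X_test)))

# ...while F1 is computed afterwards on hard predictions; training never
# sees it, so it only improves indirectly.
print("F1:", f1_score(y_test, clf.predict(X_test)))
```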

(updated February 26, 2023) · 1 min · 142 words · Michal Piekarczyk

TPR, FPR

```python
tpr = 1.0*TP/(FN + TP)  # aka recall
fpr = 1.0*FP/(FP + TN)
```

Confusion matrix

Given a testdf whose first column contains the actual labels (0, 1), and a vector y_prob of predicted probabilities:

```python
y_pred = (y_prob >= 0.08)
confusion = pd.crosstab(index=y_true, columns=y_pred,
                        rownames=['actual'], colnames=['predictions'])
```

```
predictions  False  True
actual
0              509   132
1               32    22
```

Also, there is a super nice helper in scikit-learn. Below, using some pre-baked results from running some of the chapter 2 code from https://transformersbook....
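The scikit-learn helper the excerpt is pointing at is presumably sklearn.metrics.confusion_matrix; a minimal sketch with made-up toy values (not the chapter 2 results):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.02, 0.30, 0.10, 0.75, 0.05, 0.01])

y_pred = (y_prob >= 0.08)

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

tpr = tp / (fn + tp)  # recall
fpr = fp / (fp + tn)
print(cm, tpr, fpr)
```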

(updated February 26, 2023) · 3 min · 582 words · Michal Piekarczyk

One particularly killer feature of the ColumnTransformer is that you can apply a specific preprocessor to a subset of the columns, and then set remainder="passthrough" for the others.

```python
import numpy as np
from sklearn.preprocessing import (MinMaxScaler, OneHotEncoder, LabelEncoder)
from sklearn.preprocessing import Binarizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split

def make_data():
    X, y = make_classification(n_samples=1000, n_features=4,
                               n_informative=2, n_redundant=0,
                               random_state=42, shuffle=False,
                               weights=(0....
```
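A self-contained sketch of the remainder="passthrough" pattern the excerpt describes (my own minimal example; the column indices and transformers are illustrative):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, Binarizer

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))

# Scale column 0, binarize column 1, and pass columns 2-3 through untouched.
ct = ColumnTransformer(
    transformers=[
        ("scale", MinMaxScaler(), [0]),
        ("binarize", Binarizer(threshold=0.0), [1]),
    ],
    remainder="passthrough",
)
print(ct.fit_transform(X))
```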

(updated February 26, 2023) · 2 min · 312 words · Michal Piekarczyk