```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(
    random_state=0,
    penalty="l2",
    class_weight="balanced",  # or a dict, e.g. {0: 0.1, 1: 0.9}
).fit(
    X, y,
    # sample_weight=  # array of shape (n_samples,), one weight per row
)
clf.predict(X[:2, :])
clf.predict_proba(X[:2, :])
clf.score(X, y)
clf.decision_function(X)

from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0, shuffle=False)
clf = RandomForestClassifier(
    max_depth=2,
    random_state=0,
    n_estimators=100,
    class_weight=  # "balanced", "balanced_subsample" or {0: 0....
```
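For reference, scikit-learn's `class_weight="balanced"` mode computes weights as `n_samples / (n_classes * np.bincount(y))`. As a side note on the commented-out `sample_weight` above, here is a minimal sketch of passing explicit per-row weights instead; the weighting scheme itself is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical weights: upweight rows of class 1 by 2x
weights = np.where(y == 1, 2.0, 1.0)
clf = LogisticRegression(random_state=0).fit(X, y, sample_weight=weights)
```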

(updated February 26, 2023) · 2 min · 326 words · Michal Piekarczyk

Read

```python
loc = "dbfs:/databricks-datasets/wine-quality/winequality-red.csv"
blah_df = spark.read.csv(loc, sep=";", header=True)
```

Map an existing function

```python
from pyspark.sql import functions as F

loc = "dbfs:/databricks-datasets/wine-quality/winequality-red.csv"
df = spark.read.csv(loc, sep=";", header=True)
df = df.withColumn("sugar_rounded", F.round(df["residual sugar"]))
df.select("residual sugar", "sugar_rounded").show(5)
```

```
+--------------+-------------+
|residual sugar|sugar_rounded|
+--------------+-------------+
|           1.9|          2.0|
|           2.6|          3.0|
+--------------+-------------+
```

Also you can split a col to a json array. Here imagine there is a column "_c0" which has tab-separated data, df = df....
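The excerpt cuts off mid-example, but a minimal sketch of that split, assuming a DataFrame whose single column "_c0" holds tab-separated strings (the file path below is hypothetical), would look something like:

```python
from pyspark.sql import functions as F

# Hypothetical: reading with the default comma delimiter leaves each
# tab-separated line in one column, "_c0"
raw_df = spark.read.csv("dbfs:/tmp/some_tab_separated_file.txt")

# F.split turns the tab-separated string into an array-typed column,
# which serializes as a JSON array if the DataFrame is written out as JSON
raw_df = raw_df.withColumn("parts", F.split(raw_df["_c0"], "\t"))
raw_df.select("parts").show(2, truncate=False)
```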

(updated February 26, 2023) · 10 min · 2119 words · Michal Piekarczyk

mean/std technique

For xgboost/random forest type models, per this article, the proposed idea is to use the predictions of all the individual trees as the prediction space, i.e. as a kind of uncertainty interval. I wonder if we can say that predictions the model is more certain about have a tighter distribution of per-tree predictions, and conversely that the model is unsure about a prediction when the distribution of per-tree predictions is wide....
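A minimal sketch of that idea with scikit-learn, assuming a RandomForestRegressor (the dataset and the ±2 standard deviation band are illustrative choices, not from the article):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# One prediction per tree: shape (n_trees, n_samples)
per_tree = np.stack([tree.predict(X) for tree in model.estimators_])

mean_pred = per_tree.mean(axis=0)  # matches model.predict(X), the tree average
std_pred = per_tree.std(axis=0)    # wider spread across trees -> arguably less certain
lower, upper = mean_pred - 2 * std_pred, mean_pred + 2 * std_pred
```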

(updated February 26, 2023) · 1 min · 93 words · Michal Piekarczyk

Initial stab on interpreting Variance inflation factor (VIF)

So far my skim of https://en.wikipedia.org/wiki/Variance_inflation_factor and https://en.wikipedia.org/wiki/Multicollinearity tells me that a high Variance Inflation Factor (VIF) indicates high multicollinearity with one or more of the other independent variables. And that's bad because (a) when building a linear model (at least using ordinary least squares (OLS); not yet sure if this is still true if you use regularization), the coefficients calculated for the independent variables can change "erratically" given slightly different data....
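For concreteness, a sketch of computing VIFs with statsmodels' variance_inflation_factor; the data here is synthetic, deliberately constructed so that x3 is nearly a linear combination of x1 and x2 and should therefore show a high VIF:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = 0.7 * x1 + 0.3 * x2 + rng.normal(scale=0.05, size=200)  # almost collinear
X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing
# variable i on all the other independent variables
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))
```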

(updated February 26, 2023) · 1 min · 155 words · Michal Piekarczyk

Create a new ssh key

With ssh-keygen, with a passphrase too.

Let ssh-agent manage the ssh key passphrase

With ssh-add ~/.ssh/path/to/key.

And save to the macbook keychain

Save that passphrase with ssh-add -K ~/.ssh/path/to/private/key. But apparently, according to this stackoverflow answer, on Monterey this becomes ssh-add --apple-use-keychain ~/.ssh/path/to/private/key, because --apple-use-keychain is the new -K. And similarly --apple-load-keychain is the new -A, to load a key into your ssh-agent after logging in....

(updated February 26, 2023) · 1 min · 75 words · Michal Piekarczyk