port my notes from here https://gist.github.com/namoopsoo/fa903799b958ffc9f279cd293e83e9d9 and here https://gist.github.com/namoopsoo/df08c674b4e3e4794e97601682242c51 and here https://gist.github.com/namoopsoo/607f29e923ceaba890588e69293413cf
Handies
tools.trace, for debugging exceptions and stack traces: https://github.com/clojure/tools.trace

dependency: `[org.clojure/tools.trace "0.7.9"]`

```clojure
user=> (use 'clojure.tools.trace)
```
hmm… clojure.test is built in, but you still have to start `use`-ing it before `is` resolves:

```clojure
boot.user=> (is (= 4 (+ 2 2)))
java.lang.RuntimeException: Unable to resolve symbol: is in this context
clojure.lang.Compiler$CompilerException: java.lang.RuntimeException: Unable to resolve symbol: is in this context, compiling:(/var/folders/7_/sbz867_n7bdcdtdry2mdz1z00000gn/T/boot.user2780891586981282255.clj:1:1)
boot.user=> (use 'clojure.test)
nil
boot.user=> (is (= 4 (+ 2 2)))
true
```

lein test

Running all tests in a file:

```
lein test module/blah/test_file.py
```

Running a specific deftest in module_hmm/blah/test_file.py called test-foo:

```
lein test :only module-hmm.blah.test-file/test-foo
```
Passing large dataframes with dbutils.notebook.run !

At one point, when migrating Databricks notebooks to be usable purely with dbutils.notebook.run, the question came up: dbutils.notebook.run is a great way of calling notebooks explicitly, avoiding the global variables that make code difficult to lint and debug, but what about Spark dataframes? I had come across this nice bit of documentation, https://docs.databricks.com/notebooks/notebook-workflows.html#pass-structured-data , about using the Spark global temp view to handle name references, which lets you shuttle dataframes around by reference. Given that a caller notebook and a callee notebook share a JVM, this is theoretically instantaneous. ...
Comparing really large Spark dataframes

I had a use case where I wanted to check whether very large (multi-million row, multi-thousand column) dataframes were equal, but the advice online about using df1.subtract(df2) just was not cutting it, because it was too slow. It seems to me the df1.subtract(df2) approach is more or less O(n^2), since it needs to compare each row in df1 with each row in df2. Instead I wondered: if there are known index columns in these dataframes, maybe we can cheat a little, join the dataframes on those columns first, and then do the comparison after joining them. ...
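The join-then-compare idea can be sketched as follows — shown with pandas here so it is small and self-contained (the column names `id`/`x` and the helper `compare_on_index` are made up; in Spark the same shape applies via `df1.join(df2, on=index_cols)` with aliased columns):

```python
import pandas as pd

def compare_on_index(df1, df2, index_cols):
    """Join two dataframes on known index columns, then compare the
    remaining columns pairwise, instead of an all-rows subtract()."""
    joined = df1.merge(df2, on=index_cols, suffixes=("_a", "_b"))
    value_cols = [c for c in df1.columns if c not in index_cols]
    # count mismatching rows per value column
    return {
        col: int((joined[f"{col}_a"] != joined[f"{col}_b"]).sum())
        for col in value_cols
    }

df1 = pd.DataFrame({"id": [1, 2, 3], "x": [10, 20, 30]})
df2 = pd.DataFrame({"id": [1, 2, 3], "x": [10, 99, 30]})
print(compare_on_index(df1, df2, ["id"]))  # {'x': 1}
```

The join is what the engine can do efficiently (hash or sort-merge on the index columns); the per-column equality check afterwards is a single linear pass.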
My rules of thumb for dbutils.widgets

(0) Reading a widget that does not exist raises `com.databricks.dbutils_v1.InputWidgetNotDefined`.
(1) `dbutils.widgets.text(name, value)` will set the value of a widget only if it does not already exist. If it already exists, this does nothing.
(2) You cannot change the value of a widget, but you can remove it and then set it again with the same name, with `dbutils.widgets.text(name, value)`. However, if a widget was set in cell1, then cell2 cannot both remove and re-set the widget. This will, surprisingly, have no effect! ...
Do a build

Login from shell:

```
$(aws --profile my-local-aws-profile ecr get-login --no-include-email --region us-east-1)
```

Build:

```
docker build -t name-of-image -f path/to/Dockerfile path/to/docker/context
```

Run your container:

```
# run using an image name,
# note that -v takes an absolute path...
docker run -i -t -v $(pwd)/local/path:/docker/path <name-of-image>:<tag>

# or with a specific image id... say "ad6576e"
docker run -d=false -i -t ad6576e
```

If you need your container to have your aws creds, a nice hack is to map your .aws directory onto the container's "root" user home directory ...
F test statistic to evaluate the features

An F-test produces a ratio (called an F-value) comparing the variation between two populations' sample means with the variation within the samples. With greater variation between the population samples, we are more likely to reject the null hypothesis that the samples come from the same source distribution. The higher the F-value, the lower the p-value associated with the distribution of this test. [1] There is also a good example at [2].

Below, I had a DataFrame df with some features, f1, f2, f3, f4, and a target, y. Based on my results, f3 is great and f4 is barely better than random numbers. Note that f_regression is meant for a real-valued y, while f_classif should only be used for a classification problem where y is a class. [3]

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

def evaluate_feature(df, feature, target):
    X = np.array(df[feature].tolist())
    num_rows = X.shape[0]
    X = np.reshape(X, (num_rows, 1))
    y = df[target].tolist()
    f_value, _ = f_regression(X, y)
    print(feature, f_value)

# Random baselines for comparison: three draws of pure noise against y.
num_rows = df.shape[0]
y = df['y'].tolist()
for _ in range(3):
    X = np.random.rand(num_rows, 1)
    f_value, _ = f_regression(X, y)
    print('random, ', f_value)

for feature in ['y', 'f1', 'f2', 'f3', 'f4', ]:
    evaluate_feature(df, feature, 'y')
```

```
random,  [0.42851302]
random,  [0.60725371]
random,  [0.56094036]
y [3.50677485e+16]
f1 [52.90786486]
f2 [900.76441029]
f3 [4145.1618757]
f4 [1.22335227]
```

Refs

[1] https://www.statology.org/what-does-a-high-f-value-mean/
[2] https://scikit-learn.org/stable/auto_examples/feature_selection/plot_f_test_vs_mi.html
[3] https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif
...
Appreciated this post on helping to choose between a few available tests for determining whether there are meaningful relationships between feature data. In particular:

- ANOVA compares two variables, where one is categorical (binning is helpful here) and one is continuous.
- Chi-square, on the other hand, is useful for comparing two categorical variables.
- Pearson correlation can be used between two continuous variables, with the caveat that this test assumes both variables are normally distributed, and outliers should be chopped off with some preprocessing.
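A quick sketch of all three tests side by side, using scipy on synthetic data (the group means, contingency counts, and the `2 * x + noise` relationship are all made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# ANOVA: one categorical variable (the group label) vs one continuous one.
group_a = rng.normal(0.0, 1.0, 100)
group_b = rng.normal(0.5, 1.0, 100)
f_stat, p_anova = stats.f_oneway(group_a, group_b)

# Chi-square: two categorical variables, via a contingency table of counts.
table = np.array([[30, 10],
                  [20, 40]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# Pearson correlation: two continuous variables (assumes normality and is
# sensitive to outliers).
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)
r, p_pearson = stats.pearsonr(x, y)

print(f"ANOVA p={p_anova:.3g}, chi-square p={p_chi2:.3g}, pearson r={r:.2f}")
```

Each call returns a test statistic plus a p-value, so the "is there a relationship?" question gets answered the same way regardless of which variable types you have.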
Get the parents of a <blah-branch>:

```
git rev-list --parents <blah-commit>
beaaaafffff1111111111111111111 fe0000aaaad111111111111111
```

One of them will typically be the hash of <blah-branch> itself.