Trying Databricks

https://databricks.com/try-databricks

## 2021-03-21 Running a quick start notebook

Based on the notes here, it is pretty easy to create an auto-scaling cluster. I am not sure yet what events prompt the cluster to add more workers, but I would be curious to run the same job with fewer workers and with more workers, to see how the outcomes compare.

I also like that this notebook supports SQL and also Python, using what looks like a first-line magic such as `%python` to indicate the language. Is this Spark SQL or plain SQL? From the quick start notebook…

```sql
CREATE TABLE diamonds
USING csv
OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")
```

## 2021-04-03 Revisit my earlier problem

Last time, I found this CDC dataset called “COVID-19_Case_Surveillance_Public_Use_Data.csv”. The basic initial question I would like to answer is “how do the symptomatic rates compare by age bin”, since this dataset has an `onset_dt` column, which is either blank if no symptoms or has a date if symptomatic.

More dataset metadata:

- 22.5M rows; each row is a de-identified patient
- created: 2020-05-15
- updated: 2021-03-31 (not sure what was being updated though)
- Temporal Applicability: 2020-01-01/2021-03-16
- Update Frequency: Monthly

Columns:

| Column Name | Description | Type |
| --- | --- | --- |
| cdc_case_earliest_dt | Calculated date: the earliest available date for the record, taken from either the available set of clinical dates (date related to the illness or specimen collection) or the calculated date representing the initial date the case was received by CDC. This variable is optimized for completeness and may change for a given record from time to time as new information is submitted about a case. | Date & Time |
| cdc_report_dt | Calculated date representing the initial date the case was reported to CDC. Deprecated; CDC recommends researchers use cdc_case_earliest_dt in time series and other time-based analyses. | Date & Time |
| pos_spec_dt | Date of first positive specimen collection | Date & Time |
| onset_dt | Symptom onset date, if symptomatic | Date & Time |
| current_status | Case Status: Laboratory-confirmed case; Probable case | Plain Text |
| sex | Sex: Male; Female; Unknown; Other | Plain Text |
| age_group | Age Group: 0 - 9 Years; 10 - 19 Years; 20 - 39 Years; 40 - 49 Years; 50 - 59 Years; 60 - 69 Years; 70 - 79 Years; 80 + Years | Plain Text |
| race_ethnicity_combined | Race and ethnicity (combined): Hispanic/Latino; American Indian / Alaska Native, Non-Hispanic; Asian, Non-Hispanic; Black, Non-Hispanic; Native Hawaiian / Other Pacific Islander, Non-Hispanic; White, Non-Hispanic; Multiple/Other, Non-Hispanic | Plain Text |
| hosp_yn | Hospitalization status | Plain Text |
| icu_yn | ICU admission status | Plain Text |
| death_yn | Death status | Plain Text |
| medcond_yn | Presence of underlying comorbidity or disease | Plain Text |

### Get data in there

Per the Databricks web console, I can specify an S3 bucket and create a table from my file that way. They refer to “DBFS” as the “Databricks File System”. From the example, you can load from the File Store like this:

```python
sparkDF = spark.read.csv('/FileStore/tables/state_income-9f7c5.csv',
                         header="true", inferSchema="true")

# then you can create a temp table from that df
sparkDF.createOrReplaceTempView("temp_table_name")
```

There was also an interesting note in the help notebook about permanent tables available across cluster restarts:

```python
# Since this table is registered as a temp view, it will only be available to this notebook.
# If you'd like other users to be able to query this table, you can also create a table from
# the DataFrame. Once saved, this table will persist across cluster restarts as well as allow
# various users across different notebooks to query this data.
# To do so, choose your table name and uncomment the bottom line.
permanent_table_name = "{{table_name}}"
# df.write.format("{{table_import_type}}").saveAsTable(permanent_table_name)
```

I am looking for how to do this with S3… Ah, according to the docs you mount S3 files as regular files and then proceed as usual. OK, will try that…

```python
aws_bucket_name = "my-databricks-assets-alpha"
s3fn = "s3://my-databricks-assets-alpha/cdc-dataset/COVID-19_Case_Surveillance_Public_Use_Data.csv"
s3fn = "s3://my-databricks-assets-alpha/cdc-dataset/COVID-19_Case_Surveillance_Public_Use_Data.head1000.csv"
mount_name = "blah"
dbutils.fs.mount("s3a://%s" % aws_bucket_name, "/mnt/%s" % mount_name)
display(dbutils.fs.ls("/mnt/%s" % mount_name))
```

Funny thing: I was trying to run this cell in the Databricks notebook, but it would not run and no error was given. The reason, I am pretty sure, is that no cluster was attached to the notebook. ...
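To sketch the “symptomatic rate by age bin” computation before running it at full scale, here is a minimal pure-Python version of the aggregation logic. The sample rows are hypothetical, shaped like the dataset's real `age_group` and `onset_dt` columns; in Spark this would just be a `GROUP BY age_group` over the full table.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows mimicking the CDC dataset's columns.
SAMPLE = """age_group,onset_dt
0 - 9 Years,
0 - 9 Years,2020-06-01
20 - 39 Years,2020-05-11
20 - 39 Years,2020-07-02
20 - 39 Years,
"""

def symptomatic_rates(csv_text):
    """Return {age_group: fraction of rows with a non-blank onset_dt}."""
    totals = defaultdict(int)
    symptomatic = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["age_group"]] += 1
        if row["onset_dt"].strip():  # blank onset_dt means asymptomatic
            symptomatic[row["age_group"]] += 1
    return {g: symptomatic[g] / totals[g] for g in totals}

rates = symptomatic_rates(SAMPLE)
# The Spark SQL equivalent (assumed, not from the source) would be roughly:
#   SELECT age_group,
#          AVG(CASE WHEN onset_dt IS NOT NULL AND onset_dt != '' THEN 1.0 ELSE 0.0 END)
#   FROM covid GROUP BY age_group
```

This treats any non-blank `onset_dt` as symptomatic, which matches the column description above, though at full scale the blank-vs-missing distinction would be worth double-checking.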

March 21, 2021 · (updated February 26, 2023) · 7 min · 1435 words · Michal Piekarczyk

Spark Weekend

## Trying out Spark this weekend

These are just my casual notes from doing that, updating them as I go along.

Following this post to get Kubernetes running in Docker for Mac. Per this post, I just ticked the “Enable Kubernetes” option in the Docker settings. Kubernetes is taking quite a while to start up though; several minutes. Kind of weird?

### Download Spark image

From here.

## 2021-01-24 OK, back up my Docker images

Per notes, I backed up local Docker images, like this…

```shell
docker save citibike-learn:0.9 -o citibike-learn-0.9.tar
# image:citibike-learn, tag:latest, image-id:1ff5cd891f00
# image:citibike-learn, tag:0.9,    image-id:c8d430e84654
```

Then I did the factory reset, enabled Kubernetes, and wow! Nice, finally got the green light. And restored with `docker load` like this:

```shell
docker load -i citibike-learn-0.9.tar
```

OK, now I can continue trying to get Spark set up.

Per the post, I grabbed Spark, albeit 3.0.1 instead of 2.x (from here), because according to the release notes, 3.0 and 2.x sound very compatible.

```shell
./bin/docker-image-tool.sh -t spark-docker build
```

… following along…

```shell
kubectl create serviceaccount spark
# serviceaccount/spark created

kubectl create clusterrolebinding spark-role --clusterrole=edit \
  --serviceaccount=default:spark --namespace=default
# clusterrolebinding.rbac.authorization.k8s.io/spark-role created
```

And submitting an example job:

```shell
bin/spark-submit \
  --master k8s://https://localhost:6443 \
  --deploy-mode cluster \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=spark:spark-docker \
  --class org.apache.spark.examples.SparkPi \
  --name spark-pi \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.1.jar
```

Taking 4 minutes so far. Not sure how long this is meant to take, haha. ...
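For reference, the SparkPi example submitted above estimates π by Monte Carlo: sample random points in the unit square and count how many land inside the quarter circle. A minimal single-process Python sketch of the same computation (not the actual Scala job, just the idea):

```python
import random

def estimate_pi(n_samples, seed=0):
    """Estimate pi by sampling points in the unit square and counting
    the fraction that fall inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Area of quarter circle / area of square = pi/4
    return 4.0 * inside / n_samples

pi_est = estimate_pi(100_000)
```

In the real job this loop is what gets parallelized across the Spark executors, which is why `spark.executor.instances` matters for how fast it finishes.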

January 23, 2021 · (updated February 26, 2023) · 17 min · 3414 words · Michal Piekarczyk