https://databricks.com/try-databricks
2021-03-21
Running a quick start notebook
- Based on the notes here, it is pretty easy to create an auto-scaling cluster.
- Not sure yet what events prompt the cluster to get more workers.
- But I would be curious to run the same job with fewer and then more workers, to see how the outcomes compare.
- I also like that this notebook supports SQL and also Python, using what looks like a first-line %python magic to indicate the cell language.
Is this Spark SQL or plain SQL?
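A minimal sketch of how those cell magics behave (my own cell, not from the quick start; %python and %sql are the standard Databricks magics):
%python
# this cell runs as Python even if the notebook's default language is something else
print(spark.version)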
- From the quick start notebook…
CREATE TABLE diamonds
USING csv
OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")
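As a quick sanity check of the table created above, something like this should work (my own follow-up cell, not from the quick start; column names are from the standard ggplot2 diamonds dataset):
# rough check: average price per diamond color from the table registered above
display(spark.sql("""
  SELECT color, avg(price) AS avg_price
  FROM diamonds
  GROUP BY color
  ORDER BY color
"""))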
2021-04-03
Revisit my earlier problem
- Last time, I found this CDC dataset called “COVID-19_Case_Surveillance_Public_Use_Data.csv”
- The basic initial question I would like to answer is “how do the symptomatic rates compare by age bin”, since this dataset has an onset_dt column, which is blank if there were no symptoms and has a date if there were.
More dataset metadata..
- 22.5 M rows
- each row is a de-identified patient
- created: 2020-05-15
- updated: 2021-03-31 (not sure what was being updated though)
- Temporal Applicability: 2020-01-01/2021-03-16
- Update Frequency: Monthly
- columns
Column Name | Description | Type |
---|---|---|
cdc_case_earliest_dt | Calculated date–the earliest available date for the record, taken from either the available set of clinical dates (date related to the illness or specimen collection) or the calculated date representing initial date case was received by CDC. This variable is optimized for completeness and may change for a given record from time to time as new information is submitted about a case. | Date & Time |
cdc_report_dt | Calculated date representing initial date case was reported to CDC. Deprecated; CDC recommends researchers use cdc_case_earliest_dt in time series and other time-based analyses. | Date & Time |
pos_spec_dt | Date of first positive specimen collection | Date & Time |
onset_dt | Symptom onset date, if symptomatic | Date & Time |
current_status | Case Status: Laboratory-confirmed case; Probable case | Plain Text |
sex | Sex: Male; Female; Unknown; Other | Plain Text |
age_group | Age Group: 0 - 9 Years; 10 - 19 Years; 20 - 39 Years; 40 - 49 Years; 50 - 59 Years; 60 - 69 Years; 70 - 79 Years; 80 + Years | Plain Text |
race_ethnicity_combined | Race and ethnicity (combined): Hispanic/Latino; American Indian / Alaska Native, Non-Hispanic; Asian, Non-Hispanic; Black, Non-Hispanic; Native Hawaiian / Other Pacific Islander, Non-Hispanic; White, Non-Hispanic; Multiple/Other, Non-Hispanic | Plain Text |
hosp_yn | Hospitalization status | Plain Text |
icu_yn | ICU admission status | Plain Text |
death_yn | Death status | Plain Text |
medcond_yn | Presence of underlying comorbidity or disease | Plain Text |
Get data in there
- Per the Databricks web console, I can specify an S3 bucket and create a table from my file that way
- And they refer to “DBFS” as “Databricks File System”
- from the example, you can load from the FileStore like this:
sparkDF = spark.read.csv('/FileStore/tables/state_income-9f7c5.csv', header="true", inferSchema="true")
# then you can create a temp table from that df
sparkDF.createOrReplaceTempView("temp_table_name")
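Once the temp view exists, it should be queryable from the same notebook; a minimal sketch (using the placeholder view name from above):
# query the temp view registered above
display(spark.sql("SELECT * FROM temp_table_name LIMIT 5"))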
- There was also an interesting note in the help notebook about permanent tables available across cluster restarts…
# Since this table is registered as a temp view, it will only be available to this notebook. If you'd like other users to be able to query this table, you can also create a table from the DataFrame.
# Once saved, this table will persist across cluster restarts as well as allow various users across different notebooks to query this data.
# To do so, choose your table name and uncomment the bottom line.
permanent_table_name = "{{table_name}}"
# df.write.format("{{table_import_type}}").saveAsTable(permanent_table_name)
- I am looking for how to do this w/ s3…
- Ah, according to the docs you mount the S3 bucket so its contents show up as regular DBFS files, then proceed as usual
- ok will try that …
aws_bucket_name = "my-databricks-assets-alpha"
s3fn = "s3://my-databricks-assets-alpha/cdc-dataset/COVID-19_Case_Surveillance_Public_Use_Data.csv"
# override with a 1000-row sample of the same file for a quicker first pass
s3fn = "s3://my-databricks-assets-alpha/cdc-dataset/COVID-19_Case_Surveillance_Public_Use_Data.head1000.csv"
mount_name = "blah"
# mount the bucket under /mnt/<mount_name>, then list its contents
dbutils.fs.mount("s3a://%s" % aws_bucket_name, "/mnt/%s" % mount_name)
display(dbutils.fs.ls("/mnt/%s" % mount_name))
Funny thing: I was trying to run this cell in the Databricks notebook, but it would not run and no error was given. The reason, I am pretty sure, is that no cluster was attached to the notebook.
Then, when I started the cluster creation process and tried executing a cell, I saw “Waiting for cluster to start: Starting Spark” in the output.
Now getting AccessDenied. Will try to tweak the Role which I created for Databricks. Ah, I see it has no S3 access at all… Hmm, I tweaked permissions and tried again, but now got a new error that the directory is already mounted.
So going to try just the second part (the ls)..
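Side note to self: mounting is not idempotent, so a small guard like this sketch (mine, not from the docs) would avoid the “already mounted” error when re-running the cell:
mount_point = "/mnt/%s" % mount_name
# if this mount point already exists, unmount it first, then mount fresh
if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.unmount(mount_point)
dbutils.fs.mount("s3a://%s" % aws_bucket_name, mount_point)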
Now getting Access Denied for getFileStatus on s3a://my-databricks-assets-alpha/.
I unmounted with dbutils.fs.unmount("/mnt/%s" % mount_name), gave all the read S3 permissions available, but am still getting that Access Denied for getFileStatus on s3a://my-databricks-assets-alpha/.
But oddly enough, when I go to AWS EC2, the workers which were created have no “IAM Role” associated with them, so that’s weird.
Trying to use these docs to troubleshoot s3a
Ok, going to just try the upload instead because I cannot figure out the permissions. But I feel it is possibly because of the missing IAM Role on the EC2 workers.
Wait oops!
- At the very top of this sample notebook I completely ignored the link, https://docs.databricks.com/administration-guide/cloud-configurations/aws/instance-profiles.html , which has super detailed IAM role instructions #$%*#$$#!! haha
- Ok going to try the upload route anyway just so I can possibly get started ..
upload
- ok I stepped away from my upload of this 1.5 gig file, and when I came back there was no evidence of success or error, hehe.
- I looked around the file system w/ the notebook and stumbled upon this interesting dir, which looks like it has my file
import os
print(os.listdir("/dbfs/FileStore/tables"))
# ['COVID_19_Case_Surveillance_Public_Use_Data.csv']
- So going to try making a table from it…
# File location and type
file_location = "/dbfs/FileStore/tables/COVID_19_Case_Surveillance_Public_Use_Data.csv"
file_type = "csv"
# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.load(file_location)
display(df)
- Got …
AnalysisException: Path does not exist: dbfs:/dbfs/FileStore/tables/COVID_19_Case_Surveillance_Public_Use_Data.csv;
- Maybe without the leading /dbfs/?
- Ah bingo.. so it is not a local absolute path; it gets resolved under dbfs:/ like a relative path..
# File location and type
file_location = "/FileStore/tables/COVID_19_Case_Surveillance_Public_Use_Data.csv"
file_type = "csv"
# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.load(file_location)
display(df)
Ok that was pretty fast..
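Side note on the path confusion above (my understanding, worth double-checking): spark.read resolves paths against DBFS, so /FileStore/... and dbfs:/FileStore/... are the same thing, while /dbfs/FileStore/... is the local FUSE mount meant for ordinary Python file APIs. A tiny sketch:
import os
# DBFS path, which is what spark.read and dbutils expect
display(dbutils.fs.ls("/FileStore/tables"))
# local /dbfs mount, which is what os / open expect
print(os.listdir("/dbfs/FileStore/tables"))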
And create that table..
# Since this table is registered as a temp view, it will only be available to this notebook. If you'd like other users to be able to query this table, you can also create a table from the DataFrame.
# Once saved, this table will persist across cluster restarts as well as allow various users across different notebooks to query this data.
# To do so, choose your table name and uncomment the bottom line.
permanent_table_name = "covid"
# table_import_type?
df.write.format("json").saveAsTable(permanent_table_name)
- Took 50 seconds
- try read ..
select * from covid limit 10
- wow ok, it actually worked … I am now seeing the first ten rows ..
quick stat
%sql
select age_group, count(1)
from covid
group by age_group
order by age_group asc
ok try that group by from last time..
select age_group, sum(case when onset_dt is null then 0 else 1 end)/count(1) as symptomatic_rate
from covid
group by age_group
order by age_group
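The same breakdown via the DataFrame API, as a rough cross-check (my own sketch against the covid table registered above):
from pyspark.sql import functions as F

covid = spark.table("covid")
# rate of rows with a non-null onset_dt (i.e. symptomatic) per age_group
symptomatic = (covid.groupBy("age_group")
                    .agg(F.avg(F.when(F.col("onset_dt").isNotNull(), 1).otherwise(0))
                          .alias("symptomatic_rate"))
                    .orderBy("age_group"))
display(symptomatic)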
- And side notes: in the Databricks browser notebook, I generated the plot above by clicking the “bar chart” option, which produced a “Bokeh”-looking graphic, then downloading it. But when I tried to export the notebook as ipynb, the images did not get saved; it looked like only the tabular data was saved.
And although this data is pretty sparse, look at the comorbidity data too
Column Name | Description | Type |
---|---|---|
hosp_yn | Hospitalization status | Plain Text |
icu_yn | ICU admission status | Plain Text |
death_yn | Death status | Plain Text |
medcond_yn | Presence of underlying comorbidity or disease | Plain Text |
select medcond_yn, count(1)
from covid
group by medcond_yn
medcond_yn|count(1)
--|--
Unknown|1290951
No|938017
Yes|944195
Missing|10242673
select hosp_yn, count(1)
from covid
group by hosp_yn
hosp_yn | count(1) |
---|---|
Unknown | 2139571 |
No | 4830822 |
Yes | 703734 |
Missing | 5741709 |
%sql
select medcond_yn,
sum(case when hosp_yn = 'Yes' then 1 else 0 end)/count(1) as hospYes_rate,
sum(case when hosp_yn = 'No' then 1 else 0 end)/count(1) as hospNo_rate,
sum(case when hosp_yn = 'Unknown' then 1 else 0 end)/count(1) as hospUnknown_rate,
sum(case when hosp_yn = 'Missing' then 1 else 0 end)/count(1) as hospMissing_rate
from covid
group by medcond_yn
medcond_yn | hospMissing_rate | hospYes_rate | hospUnknown_rate | hospNo_rate |
---|---|---|---|---|
Unknown | 0.12071333458822217 | 0.06108210148952207 | 0.6504840230186894 | 0.16772054090356645 |
No | 0.22393943819781517 | 0.03035552660559457 | 0.021185117114082153 | 0.7245199180825082 |
Yes | 0.10189102886585928 | 0.21121802170102574 | 0.04902694888238129 | 0.6378640005507337 |
Missing | 0.5154523628744176 | 0.03875697291127033 | 0.12044365762726195 | 0.32534700658705007 |