spotting the difference
I had a crazy issue where a Databricks job was taking over 13 hours and then failing, when the same job had previously taken just over 1 hour on another workspace. After a bit of staring at logs I had a facepalm moment: yet again, I had been bitten by spot instances (spot workers). In the Spark UI logs I found these weird errors, “Executor 1 removed”, “Executor 2 removed”, and so on. My first thought was memory issues or asymmetric shuffle issues, but then I hovered my mouse over them and saw ...
