I ran into an interesting situation where I wanted to plot some numbers that were nested inside struct columns. They were row counts from a Delta table history output, but in any case, when I tried to plot them, my plot treated them as categories.
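For context, here's a minimal sketch of where numbers like that come from. The table path is a placeholder, and this assumes a Spark session with Delta Lake configured (e.g. a Databricks notebook where spark is predefined); DESCRIBE HISTORY exposes the counts inside the operationMetrics column, and they come back as strings:

from pyspark.sql import functions as f

# operationMetrics is a map of string-valued metrics, so the row count
# comes out as a string, not a number. Path below is hypothetical.
history = spark.sql("DESCRIBE HISTORY delta.`/tmp/events`")
history.select(
    "version",
    f.col("operationMetrics")["numOutputRows"].alias("numOutputRows"),
).show(truncate=False)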
Realizing they were strings, I cast them to integers, but then I got nulls. After a bit of trial and error I realized they were probably larger than 32 bits! (A 32-bit signed integer tops out at 2,147,483,647, and 3,731,556,164 is well past that, so Spark's cast quietly returns null.)
Casting to big int, a.k.a. long, did the trick:
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# cast("int") nulls out anything above 2,147,483,647; cast("long") keeps it.
spark.createDataFrame(
    [["3731556164"], ["3731530835"], ["1731530835"]], ["numOutputRows"]
).withColumn(
    "numOutputRows_i", f.col("numOutputRows").cast("int")
).withColumn(
    "numOutputRows_l", f.col("numOutputRows").cast("long")
).show()
+-------------+---------------+---------------+
|numOutputRows|numOutputRows_i|numOutputRows_l|
+-------------+---------------+---------------+
| 3731556164| null| 3731556164|
| 3731530835| null| 3731530835|
| 1731530835| 1731530835| 1731530835|
+-------------+---------------+---------------+
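And that was the whole point: once the column is a proper long, the plot gets a numeric axis instead of categories. A quick sketch of the plotting step, assuming pandas and matplotlib are available alongside the Spark session; the variable names are mine:

import matplotlib.pyplot as plt
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Rebuild the demo frame, cast to long, then hand the small result to pandas.
pdf = (
    spark.createDataFrame(
        [["3731556164"], ["3731530835"], ["1731530835"]], ["numOutputRows"]
    )
    .withColumn("numOutputRows", f.col("numOutputRows").cast("long"))
    .toPandas()
)
pdf.plot(y="numOutputRows")  # numeric axis now, not categorical labels
plt.show()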
I posted my answer on Stack Overflow too.