Tutorial 2: More PySpark, and Geospatial Visualisations
This week we will cover a lot of content that is essential for succeeding in Project 1. We’ll look at more advanced data manipulation in PySpark, and at how to visualise our TLC data on a map of NYC.
We’ll spend pretty much the whole tutorial working through the notebook, which should act as a valuable reference for you as you continue working on Project 1.
Project 1 Reminder
Important
If you haven’t started, get started…
Tutorial Outline
In the session today we will:
Quickly look at the content we missed at the end of last week
Discuss Spark’s usage of lazy evaluation
Troubleshoot broken .parquet schemas
Look at more operations in PySpark
Renaming
Creating derived columns
User Defined Functions (UDFs)
Look at Spark SQL
Take a break
Install and process shapefiles with GeoPandas
Produce geospatial visualisations with folium
Provide tips for including useful visualisations in Project 1
Produce other visualisations, including heatmaps for visualising correlation
Look at just a few functions from pyspark.ml
There’s a lot to cover, and this might be the most technical/dry tutorial, but it’s all valuable for Project 1 and building your understanding of Spark.