Tutorial 1: Introduction, Project 1, and PySpark#

Week 1 is a big tutorial. Here’s what we’re covering:

  1. Intro and overview of the subject.

  2. Git and Jupyter.

  3. The Prerequisite notebook.

  4. Project 1.

  5. Intro to PySpark.

Git, Jupyter, and the Prerequisite Notebook#

The tutorial content is hosted on GitHub here. To get started:

  • open the prerequisite notebook

    • if you’ve already done this, thanks!

    • if not, make sure you’re reading our Canvas announcements

  • clone the repository

  • run the prerequisite notebook as a Jupyter notebook

    • you can choose to run Jupyter:

      • with Anaconda

      • in Visual Studio Code

      • with JupyterLab (pip install jupyterlab, then launch with jupyter-lab)

      • with Jupyter Notebook (pip install notebook, then launch with jupyter notebook)

This was a prerequisite, so if you haven’t done it already, work through it while we discuss Project 1. A quick environment check is sketched below.
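If you want a sanity check before opening the prerequisite notebook, a cell like the following confirms that Python and PySpark are visible to your Jupyter instance. This is a minimal sketch, not part of the notebook itself, and it assumes you installed PySpark with pip install pyspark:

```python
# Minimal environment sanity check (a sketch, not the prerequisite notebook).
# Assumes PySpark was installed with `pip install pyspark`.
import sys

print(f"Python version: {sys.version}")

try:
    import pyspark
    print(f"PySpark version: {pyspark.__version__}")
except ImportError:
    print("PySpark is not installed - try `pip install pyspark` first.")
```

Run it in the same Jupyter instance you plan to use for the tutorials, since that is the environment that matters.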

Digression: Project 1#

Project 1 has been released. It is due on the first day of Week 5.

Important

If you haven’t started already, start today.

  • I’m serious!!

  • No one listens to this, and then in Week 5 everyone complains that we didn’t give them enough time.

  • This project used to be due at the start of Week 4, so you already have an extra week; we have little sympathy for people who ignore this warning.

  • The handbook says this will take 30 hours. This isn’t a joke.

View the assignment specification and complete the Project 1 Canvas module today!

In short, Project 1 asks you to undertake an independent investigation into a data set we have selected for you. The data set contains information about taxi journeys in New York City, and it is up to you to come up with an interesting research topic based on it. Some (overused) examples include:

  • How do weather conditions impact demand for taxis in the different boroughs of NYC?

  • What was the impact of COVID-19 on the usage of taxis throughout the city?

Try to come up with an interesting and unique question that makes use of an external data set. If you’re struggling to come up with a research topic, I would recommend:

  • Thinking about your other interests: could they interact with taxi demand somehow?

  • Looking at the features the data set contains, and remembering that your investigation can use any of them.

    • Produce some initial visualisations: does anything interesting stand out? (A minimal sketch appears after this list.)

  • Looking at what external data is available; this can guide your research topic.

  • Considering geospatial data/analysis. We will look at visualising this in the coming weeks, and it can be a good way to develop new skills and produce interesting results.
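To make the visualisation point concrete, here is a minimal sketch of an initial exploration. The file path and column name are assumptions based on the public TLC yellow taxi schema and may differ from the snapshot we provide, so treat it as a starting point rather than a recipe:

```python
# Illustrative first look at the taxi data. The path and the
# `tpep_pickup_datetime` column are assumptions from the public TLC
# yellow taxi schema; adjust them to match the data you download.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("project1-first-look").getOrCreate()

trips = spark.read.parquet("data/yellow_tripdata_2023-01.parquet")  # hypothetical path

# How many trips start in each hour of the day?
hourly = (
    trips
    .withColumn("pickup_hour", F.hour("tpep_pickup_datetime"))
    .groupBy("pickup_hour")
    .count()
    .orderBy("pickup_hour")
)

# The aggregate is tiny, so it is safe to bring into pandas for a quick
# plot (requires matplotlib).
hourly.toPandas().plot(x="pickup_hour", y="count", kind="bar")
```

If a pattern jumps out (a morning peak, a weekend dip, a COVID-era drop), that is often the seed of a research question.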

In general, strong submissions show evidence of a well-planned and refined research question that makes central use of an external data set. Submissions that pick general/bland questions (e.g. how does the day of the week affect tips) and submissions that make limited use of external data (e.g. adding data that does not meaningfully contribute to the research question) receive lower marks.

Intro to PySpark#

Hopefully we’ve got about an hour left to spend on the Tutorial 1 notebook.

Open this up in your Jupyter instance and we’ll start talking through the content.
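If you want a preview of where the notebook starts, the core pattern is creating a SparkSession and reading some data into a Spark DataFrame. The sketch below shows the general shape only; the configuration option and file path are placeholders, and the actual notebook will differ:

```python
# A minimal PySpark starting point - a sketch, not the tutorial notebook itself.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tutorial-1")
    # Illustrative config: pretty-print DataFrames in Jupyter.
    .config("spark.sql.repl.eagerEval.enabled", "true")
    .getOrCreate()
)

# Read a Parquet file (path is hypothetical) and take a first look.
sdf = spark.read.parquet("data/sample.parquet")
sdf.printSchema()
sdf.show(5)
```

Everything else builds on this object: transformations on the DataFrame are lazy, and nothing is computed until an action such as show() or count() is called.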

If you’re struggling to run the notebook, ask the people on your table for help, then ask me.