Big Data: Introduction to Apache Spark

Welcome to the Spark course page

This material was initially used for the Big Data: Introduction to Apache Spark course, part of the Data Science Master 2 program at the University of Evry Val d'Essonne. It supports teaching the Apache Spark big data library through its Python interface. By the end of the course, students should be able to manipulate and analyze data, and to apply basic learning algorithms using Spark.

Overview

This course is organized into 4 sessions of 4 hours each.

Setup

The students can use any computer with access to the internet during the course, or even a tablet with an external keyboard, if they are into those things.

The hands-on labs and assignments are intended to be done on an online Jupyter Notebook-compatible platform. Databricks is a good option: a basic Spark 2 installation can be used for free through Databricks Community. Using this platform requires registration at this page.

Course Content

Notebooks

Data Files