INFOMDW: Data Wrangling

Materials for the ADS course (INFOMDW)

The materials on this website are CC-BY-4.0 licensed

Introduction

With current advances in data collection, we gather vast amounts of data that come from different sources and follow different structures. The volume, variety and velocity of data collection pose extra challenges for maintaining data quality, which influences any data analytics and decision-making task. To prepare the data for these tasks, data wrangling steps transform raw, unstructured data into clean, organized and suitable formats. The course is designed to balance conceptual understanding with practical implementation on applied, real-world data, using both SQL and Python for data extraction, integration, preparation and validation.

Prerequisites

We assume that students have knowledge of:
1. Programming (Python)
2. Statistics at a typical undergraduate level (mean, variance, basic linear regression)
The course is open to students from the master's programme in Applied Data Science (ADS) and the master's programme in Data Science (DASC).

Objectives

After completing this course, students should be able to:
1. Extract parts of the data that are relevant for the data analytics task.
2. Validate the data against integrity constraints.
3. Prepare the data and test its suitability for the analytics process using different data preparation techniques including cleaning, normalization, discretization and reduction.
4. Handle and process large volumes of data, for example from data streams, and integrate data from multiple sources.
5. Discover bias in the data and apply bias mitigation algorithms.
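
To give a feel for objectives 1 to 3, the following is a minimal sketch using only Python's built-in `sqlite3` module. The `measurements` table, its columns, the value range in the `CHECK` constraint and the choice of three bins are all hypothetical and chosen for illustration; they are not taken from the course material.

```python
import sqlite3

# Hypothetical toy table; schema and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE measurements (
        id INTEGER PRIMARY KEY,
        sensor TEXT NOT NULL,
        temperature REAL CHECK (temperature BETWEEN -50 AND 60)
    )
    """
)

rows = [
    (1, "a", 18.0),
    (2, "b", 24.0),
    (3, None, 21.0),   # violates NOT NULL on sensor
    (4, "c", 999.0),   # violates the CHECK range on temperature
    (5, "d", 30.0),
]

# Validation: let the database enforce the integrity constraints.
rejected = []
for row in rows:
    try:
        conn.execute("INSERT INTO measurements VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError:
        # Keep violating rows aside for inspection instead of loading them.
        rejected.append(row)

# Extraction: select only the column relevant to the analysis.
temps = [t for (t,) in conn.execute(
    "SELECT temperature FROM measurements ORDER BY id"
)]

# Preparation: min-max normalization to the range [0, 1].
lo, hi = min(temps), max(temps)
normalized = [(t - lo) / (hi - lo) for t in temps]

# Preparation: equal-width discretization into three bins (0, 1, 2).
n_bins = 3
bins = [min(int(x * n_bins), n_bins - 1) for x in normalized]

print("rejected:", rejected)
print("normalized:", normalized)
print("bins:", bins)
```

Declaring the constraints in the schema means the validation rule lives next to the data rather than in application code, which is one possible design among several.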

Policy

This course is one of the mandatory courses for Applied Data Science students.

Weekly course flow

A regular week in this course consists of two lectures (Tuesday and Thursday) and one tutorial session (Thursday afternoon). The material is introduced on a theoretical level in the lectures and then put into practice in the tutorial sessions. The practical work in these tutorials is drawn from real-life situations, so that students experience how to solve data science problems. In addition, students spend time during the week on the two take-home group assignments and the two individual peer-graded assignments.

The lectures are in-person. The required readings must be read before the lecture; they are not optional. The tutorial sessions are in-person, interactive sessions in which you apply the methods covered in the lectures. The answers to the exercises discussed during the tutorial sessions will be uploaded after the sessions. The skills acquired in the lectures and tutorials provide the basis for the assignments, which are handed in via Brightspace.

Synchronous course policy

INFOMDW is an offline-first course, with mostly in-person lectures and tutorial sessions. It is important for interactive and collaborative learning that the course is offline-first. If you miss a session, e.g., due to sickness, you should catch up in the regular way:
1. Read the readings
2. Go through the lecture slides
3. Do the practicals
4. Ask your peers if you have questions
5. After the above, ask the lab teacher for further explanation

Who to ask what

There are many teachers in this course. If you have questions, first ensure the answer isn’t in this syllabus and then follow the table below:

Question type	How to ask
Course proceedings	Email course coordinator (Hakim)
Content - general	Email / ask the teaching assistant
Practical content	Email / ask the teaching assistant
Assignment content	Email / ask the teaching assistant
Lecture content	Email the lecturer

Grading policy

Your final grade in the course consists of the following grading components:
1. Group assignments (20% of the final grade). There are two group assignments. Each assignment is graded and worth 10% of the final grade.
2. Individual assignments (10% of the final grade). There will be two individual assignments, for each of which every student submits their own report and grades three submissions from other students. Each assignment is worth 5% of the final grade.
3. Midterm exam (35% of the final grade): In the fifth week of the course, there is a midterm exam with multiple choice and open questions that covers the first part of the course.
4. Final exam (35% of the final grade): At the end of the course, there is a final exam with multiple choice and open questions that covers (mainly) the material after the midterm.
To pass the course, the weighted final grade across all components must be at least 5.5.
In order to qualify for the resit exam:
1. the final grade must be greater than or equal to 4.0 and strictly less than 5.5; and
2. a minimum of two assignments must have been submitted; and
3. at least one of the exams has been attended; missing both exams will result in a direct ND grade.
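
The arithmetic behind these rules can be sketched as follows. The weights and thresholds come from the list above; the function names, the assumption that every component is graded on the same 1 to 10 scale, and the absence of any rounding rule are illustrative assumptions, not official policy.

```python
from typing import Mapping

# Weights and thresholds as stated in the grading policy.
WEIGHTS = {"group": 0.20, "individual": 0.10, "midterm": 0.35, "final": 0.35}
PASS_MARK = 5.5
RESIT_MIN = 4.0


def weighted_final_grade(grades: Mapping[str, float]) -> float:
    """Weighted sum of the four component grades.

    Assumes each component grade is already averaged over its parts
    (e.g. both group assignments) and uses the same scale.
    """
    total = sum(WEIGHTS[name] * grades[name] for name in WEIGHTS)
    # Rounding to 10 decimals only removes floating-point noise;
    # it is not an official rounding rule.
    return round(total, 10)


def passes(final_grade: float) -> bool:
    return final_grade >= PASS_MARK


def resit_eligible(final_grade: float, assignments_submitted: int,
                   exams_attended: int) -> bool:
    """All three resit conditions must hold at the same time."""
    return (
        RESIT_MIN <= final_grade < PASS_MARK
        and assignments_submitted >= 2
        and exams_attended >= 1
    )


# Hypothetical student: strong assignments, weaker exams.
example = {"group": 7.0, "individual": 8.0, "midterm": 4.0, "final": 5.0}
grade = weighted_final_grade(example)
print(grade, passes(grade), resit_eligible(grade, 4, 2))
```

For this hypothetical student the weighted grade is 0.20 × 7.0 + 0.10 × 8.0 + 0.35 × 4.0 + 0.35 × 5.0 = 5.35, which is below 5.5 but at least 4.0, so the student does not pass but would qualify for the resit.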

Material

Required Software

In this course, we will use a variety of software, but mainly SQLite, Python and R. Try to set these up on your computer before the start of the course; we will also have a set-up computer lab on the first day to help you with this process.

Installing DB Browser for SQLite: For the SQL parts, we recommend installing DB Browser for SQLite. Installation instructions for macOS, Windows, and Linux can be found here.

Installing Python & Jupyter: For the Python parts of the course, we will use Google Colab, which is an interactive online notebook environment; this means no installation is necessary! However, you do need a Google account, so make sure you have one (or make one specifically for the course).

Reading Material

ID	Title and authors	URL
DBSC	[Database System Concepts] Silberschatz, Korth, Sudarshan	db-book.com
MMDS	[Mining Massive Datasets] Leskovec, Rajaraman, Ullman	mmds.org
PDA	[Python for Data Analysis, 3E] Wes McKinney	wesmckinney.com/book/
DMCT	[Data Mining: Concepts and Techniques] Han, Kamber, Pei	3rd edition
MHRB	[A survey on bias and fairness in machine learning] Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., and Galstyan, A.	URL

Schedule

You can find the up-to-date class schedule with locations on mytimetable.uu.nl.

Week	Topic	Type	Reading
1	Introduction to the course	Lecture	Visiting the course website
1	Setting up your computer	Tutorial
