INFOMDW: Data Wrangling
Introduction
With current advances in data collection, we gather vast amounts of data that come from different sources and follow different structures. The volume, variety and velocity of this data pose extra challenges for maintaining data quality, which influences any data analytics and decision-making task. To prepare the data for these tasks, data wrangling steps transform raw, unstructured data into clean, organized and suitable formats. The course is designed to balance conceptual understanding with practical implementation (applied and real-world data), using both SQL and Python for data extraction, integration, preparation and validation.
Prerequisites
We assume that students have knowledge of:
- Programming (Python)
- Statistics at a typical undergraduate level (mean, variance, basic linear regression)
Objectives
After completing this course, students should be able to:
- Extract parts of the data that are relevant for the data analytics task.
- Validate the data against integrity constraints.
- Prepare the data and test its suitability for the analytics process using different data preparation techniques including cleaning, normalization, discretization and reduction.
- Handle and process large volumes of data, such as data streams, and integrate data from multiple sources.
- Discover the bias in the data and apply bias mitigation algorithms.
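As an illustration of the objective of validating data against integrity constraints, here is a minimal sketch using Python's built-in `sqlite3` module. The `students` and `grades` tables, their columns and the sample rows are hypothetical and are not taken from the course material; the sketch only shows how NOT NULL, CHECK and foreign-key constraints can reject invalid rows.

```python
import sqlite3

# In-memory database; all table and column names below are hypothetical.
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE grades (
        student_id INTEGER NOT NULL REFERENCES students(student_id),
        grade      REAL CHECK (grade BETWEEN 1 AND 10)
    )
""")
conn.execute("INSERT INTO students VALUES (1, 'Ada')")

# One valid row, one out-of-range grade, one unknown student.
candidate_rows = [(1, 7.5), (1, 11.0), (2, 6.0)]

accepted, rejected = [], []
for row in candidate_rows:
    try:
        conn.execute("INSERT INTO grades VALUES (?, ?)", row)
        accepted.append(row)
    except sqlite3.IntegrityError as exc:
        # The database refuses rows that violate a declared constraint.
        rejected.append((row, str(exc)))
conn.commit()

print("accepted:", accepted)
print("rejected:", [row for row, _ in rejected])
```

Declaring the constraints in the schema lets the database do the validation, so invalid rows are caught at insertion time rather than during later analysis.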
Policy
This course is one of the mandatory courses for Applied Data Science students.
Weekly course flow
A regular week in this course consists of two lectures (Tuesday and Thursday) and one tutorial session (Thursday afternoon). The material is introduced on a theoretical level in the lectures and then put into practice in the tutorial sessions. The practical work done in these tutorials is drawn from real-life situations that allow students to experience how data science problems are solved.
In addition, students will spend time during each week on two take-home group assignments and two individual peer-graded assignments.
The lectures are in-person. The required readings should be read before the lecture. These are not optional.
The tutorial sessions are in-person interactive sessions in which you apply the methods you learn about in the lectures. The answers to the exercises that are discussed during the tutorial sessions will be uploaded after the sessions.
The skills acquired in the lectures and the tutorials provide the basis for doing the assignments. These assignments are handed in via Brightspace.
Synchronous course policy
INFOMDW is an offline-first course, with mostly in-person lectures and tutorial sessions. It is important for interactive and collaborative learning that the course is offline-first. If you miss a session, e.g., due to sickness, you should catch up in the regular way:
- Read the readings
- Go through the lecture slides
- Do the practicals
- Ask your peers if you have questions
- (after the above) ask the lab teacher for further explanation
Who to ask what
There are many teachers in this course. If you have questions, first check that the answer isn’t in this syllabus and then follow the table below:
| Question type | How to ask |
|---|---|
| Course proceedings | Email course coordinator (Hakim) |
| Content - general | Email / ask the teaching assistant |
| Practical content | Email / ask the teaching assistant |
| Assignment content | Email / ask the teaching assistant |
| Lecture content | Email the lecturer |
Grading policy
Your final grade in the course consists of the following grading components:
- Group assignments (20% of the final grade). There are two group assignments. Each assignment is graded and worth 10% of the final grade.
- Individual assignments (10% of the final grade). There will be two individual assignments. For each one, every student submits their own report and grades 3 submissions from other students. Each assignment is worth 5% of the final grade.
- Midterm exam (35% of the final grade): In the fifth week of the course, there is a midterm exam with multiple choice and open questions that covers the first part of the course.
- Final exam (35% of the final grade): At the end of the course, there is a final exam with multiple choice and open questions that covers (mainly) the material after the midterm.
- the final grade must be greater than or equal to 4.0 and strictly less than 5.5; and
- a minimum of two assignments must have been submitted; and
- at least one of the exams has been attended; missing both exams will result in a direct ND grade.
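The weights of the four grading components listed in this syllabus can be combined as a simple weighted average. The sketch below is only an illustration: the function name, the example grades and the assumption of a 1 to 10 grading scale are hypothetical, and any rounding rules or additional requirements are not modelled.

```python
# Weights taken from the grading components listed in this syllabus.
WEIGHTS = {
    "group_assignments": 0.20,       # two assignments, 10% each
    "individual_assignments": 0.10,  # two assignments, 5% each
    "midterm_exam": 0.35,
    "final_exam": 0.35,
}


def weighted_final_grade(scores):
    """Return the weighted average of the component grades.

    `scores` maps every component name in WEIGHTS to a grade
    (assumed here to be on a 1-10 scale).
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical example grades, for illustration only.
example = {
    "group_assignments": 8.0,
    "individual_assignments": 7.0,
    "midterm_exam": 6.0,
    "final_exam": 7.0,
}
print(round(weighted_final_grade(example), 2))
```

Because the two exams together account for 70% of the total, they dominate the result; the assignments shift it by at most 30%.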
Material
Required Software
In this course, we will use a variety of software, but mainly SQLite, Python and R. Try to set these up on your computer by the start of the course; we will also have a set-up computer lab on the first day to help you with this process.
Installing DB Browser for SQLite: For the SQL parts, we recommend installing DB Browser for SQLite. Installation instructions for macOS, Windows, and Linux can be found here.
Installing Python & Jupyter: For the Python parts of the course, we will use Google Colab, which is an interactive online notebook environment; this means no installation is necessary! However, you do need a Google account, so make sure you have one (or make one specifically for the course).
Reading Material
| ID | Title and authors | URL |
|---|---|---|
| DBSC | [Database System Concepts] Silberschatz, Korth, Sudarshan | db-book.com |
| MMDS | [Mining Massive Datasets] Leskovec, Rajaraman, Ullman | mmds.org |
| PDA | [Python for Data Analysis, 3E] Wes McKinney | wesmckinney.com/book/ |
| DMCT | [Data Mining: Concepts and Techniques] Han, Kamber, Pei | 3rd edition |
| MHRB | [A survey on bias and fairness in machine learning] Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., and Galstyan, A. | URL |
Schedule
| Week | Date | Topic | Type | Reading |
|---|---|---|---|---|
| 1 | | Introduction to the course | Lecture | Visiting the course website |
| 1 | | Setting up your computer | Tutorial | |
Materials for the