Session number: 11
Expected participants: Relevant to all, special relevance to those working on data projects
Type: Training
Length: 2-3 hours
Exercises: Yes
Web based exercises: Yes
What to bring: Slides, Web-Connected Laptop
Session Flow:
- Understanding data cleaning - The facilitator should outline the importance of cleaning Open Data, both for publishers preparing to release data and for users accessing data for their projects. The facilitator should guide the participants through the main types of errors found in Open Data, such as mixed numerical scales, duplicated records, redundant data and spelling errors.
- Tools for data cleaning - The facilitator should introduce key tools for data cleaning, including OpenRefine, Excel and any others relevant to the audience. The facilitator should highlight the key features of each tool and explain the criteria for choosing the right tool for a given problem.
- Practical data cleaning - The participants will undertake a data cleaning exercise using OpenRefine (exercise and datasets below). The facilitator should first play the OpenRefine video (below), then support the participants as they work through the exercise steps.
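
For facilitators who want a concrete illustration of the error types named in the first step, the following is a minimal Python sketch. It is not part of the session materials, which use OpenRefine. The records, the `KNOWN_NAMES` list, the `SCALE` table and the `clean` function are all hypothetical and exist only to show one duplicated record, one spelling error, one mixed numerical scale and one column that becomes redundant after cleaning.

```python
import difflib

# Hypothetical records, invented for illustration only.
raw_records = [
    {"name": "Baton Rouge", "amount": "1200", "unit": "USD"},
    {"name": "Baton Rouge", "amount": "1200", "unit": "USD"},          # duplicated record
    {"name": "Baton Rogue", "amount": "1.5", "unit": "thousand USD"},  # spelling error, different scale
    {"name": "New Orleans", "amount": "3", "unit": "thousand USD"},    # different scale
]

# Assumed reference list of correct spellings and assumed unit multipliers.
KNOWN_NAMES = ["Baton Rouge", "New Orleans"]
SCALE = {"USD": 1, "thousand USD": 1000}


def clean(records):
    """Return records with names corrected, amounts in plain USD, duplicates removed."""
    seen = set()
    cleaned = []
    for rec in records:
        # Spelling errors: snap the name to the closest known spelling, if any is close enough.
        matches = difflib.get_close_matches(rec["name"], KNOWN_NAMES, n=1, cutoff=0.8)
        name = matches[0] if matches else rec["name"]

        # Mixed numerical scales: convert every amount to the same unit (plain USD).
        amount_usd = float(rec["amount"]) * SCALE[rec["unit"]]

        # Duplicated records: skip any (name, amount) pair already seen.
        key = (name, amount_usd)
        if key in seen:
            continue
        seen.add(key)

        # Redundant data: the "unit" column carries no information once all
        # amounts share one scale, so it is dropped from the output.
        cleaned.append({"name": name, "amount_usd": amount_usd})
    return cleaned


cleaned = clean(raw_records)
for row in cleaned:
    print(row)
```

In OpenRefine the same corrections are made interactively rather than in code, which is why the exercise uses that tool; the sketch only makes the four error categories tangible.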
Resources:
- OpenRefine - Open source data cleaning tool
- OpenRefine Video - To be shown to participants before the OpenRefine exercise
- OpenRefine Exercise - Data cleaning exercise using OpenRefine
- Exercise Dataset 1 - Louisiana Secretary of State Officials open dataset
- Exercise Dataset 2 - Projects open dataset
- Exercise Dataset 3 - UK GP Earnings open dataset
- OpenRefine Pro - Hosted OpenRefine service that can be used for courses by special request
Companion eLearning Modules:
When running this session, we recommend that participants complete the following eLearning module before attending:
Completion of the module will help your learners develop a shared understanding of the material before the course and allow you to focus in greater depth on those topics of most interest to the trainees.