Home icon
Data Visualisation Guide

Tidying data

4 minutes read

Tidy data

Tidying data is the practice of turning untidy data into tidy data, and is a necessary step in the preparation of data visualisation based on the Grammar of Graphics. Below are some of the more common operations for tidying data.

Separating data

This table contains 2 variables in the rate column:

country year rate
Belgium 2020 14.444/11.522.440
Belgium 2021 14.791/11.554.767
Bulgaria 2020 23.128/6.916.548
Bulgaria 2021 22.305/6.951.482
Czechia 2020 8.610/10.494.836
Czechia 2021 8.990/10.693.939

To make the table tidy, the values in the rate column need to be split into 2 new columns. This operation is often called separating or spreading a column into new columns. In order to separate or spread a column, you need to provide the character (or characters) that are used as the separator between the values in the column. In this case, the separator is the forward slash “/”.

After separating the values in a column, you need to give the newly created columns new column names (which are the variable names of the created tidy data set).

Making data wider

The table below is untidy, because the count column contains the values for 2 variables (cases and population).

country year type count
Belgium 2020 cases 14.444
Belgium 2020 population 11.522.440
Belgium 2021 cases 14.791
Belgium 2021 population 11.554.767
Bulgaria 2020 cases 23.128
Bulgaria 2020 population 6.916.548
Bulgaria 2021 cases 22.305
Bulgaria 2021 population 6.951.482
Czechia 2020 cases 8.610
Czechia 2020 population 10.494.836
Czechia 2021 cases 8.990
Czechia 2021 population 10.693.939

To make the data tidy, the values in the count column need to be organised in two columns, based on the values in the type column. This operation is called making data wider, or pivot data wider.

Making data longer

Consider the following untidy data table:

country 2020 2021
Belgium 14.444 14.791
Bulgaria 23.128 22.305
Czechia 8.610 8.990

This table is untidy, because the 2020 and 2021 column names are values of the year variable, so this variable is spread over multiple columns. The tidy version of this table looks like this:

country year cases
Belgium 2020 14.444
Bulgaria 2020 23.128
Czechia 2020 8.610
Belgium 2021 14.791
Bulgaria 2021 22.305
Czechia 2021 8.990

The operation of taking the names of multiple columns and “collapsing” them into a single column (in this case this is the year column) and putting the values of the original columns into another column (this is the cases column here) is called making data longer, pivotting data longer or unpivotting data).

This operation usually results in tables that are longer (with more rows, and less columns) than the original table. You can read more about wide and long data in Wide versus long data.

Transposing data

Below is another example of an untidy data table:

country Belgium Belgium Bulgaria Bulgaria Czechia Czechia
year 2020 2021 2020 2021 2020 2021
cases 14.444 14.791 23.128 22.305 8.610 8.990
population 11.522.440 11.554.767 6.916.548 6.951.482 10.494.836 10.693.939

This table is untidy, because it has the variables as rows and the observations in the columns, instead of vice versa. The operation to make this data table tidy is called transposing a table. Transposing a table will make rows out of the columns, and columns out of the rows.

Related pages

Intro to tidy data

Advantages of tidy data

Grammar of Graphics in practice: Tableau

Grammar of Graphics in practice: Vega-Lite

Grammar of Graphics in practice: ggplot2

Grammar of Graphics in practice: Observable Plot

Tidy data