Wide versus long data

5 minutes read

Data structures

To make a line chart of the employment rate by country, many visualisation tools expect your data to look like this:

Country	2017	2018	2019	2020	2021
Belgium	69,8	71	71,8	71,5	71,9
Bulgaria	71,4	72,4	75	73,4	73,2
Czechia	78,4	79,8	80,2	79,6	79,8
Denmark	77,8	78,7	79,4	78,8	79,8

However, you might notice that the data you downloaded from the source is structured differently. It might look like this:

Country	Year	Value
Belgium	2017	69,8
Belgium	2018	71
Belgium	2019	71,8
Belgium	2020	71,5
Belgium	2021	71,9
Bulgaria	2017	71,4
Bulgaria	2018	72,4
Bulgaria	2019	75
Bulgaria	2020	73,4
Bulgaria	2021	73,2
Czechia	2017	78,4
Czechia	2018	79,8
Czechia	2019	80,2
Czechia	2020	79,6
Czechia	2021	79,8
Denmark	2017	77,8
Denmark	2018	78,7
Denmark	2019	79,4
Denmark	2020	78,8
Denmark	2021	79,8

Both tables contain exactly the same data, but the tables are structured very differently.

The first table is using the wide data format. This is a common structure in spreadsheets, and is very convenient to look up values: you quickly know where to look in the table when you want to know what the employment rate in Bulgaria in 2019 was, for example.

With this format, it is also easy to calculate the average employment rate for each country: you can add a column to the right of the table to calculate it with a spreadsheet formula. In the same manner you can add a row at the bottom of the table to calculate the average employment rate over all countries by year.

The wide format is also efficient: it does not contain cells with the same values.

But the wide format has its limitations. What if you would like to publish employment data for men and for women separately, and maybe also break the data down by age class? With the wide format, you need to create a separate table for each combination of gender and age, which is not handy to work with.

The long data format has repeated values (in the table above each country name is present 5 times, and each year 4 times), is less effective for looking up values visually, and requires some work if you want to calculate averages. But it has a major advantage over the wide format: if the data needs to be broken down by gender and age, you can just add these columns to your data:

Country	Year	Value	Gender	Age
Belgium	2017	69,8	M	18-30
Belgium	2018	71	M	18-30
Belgium	2019	71,8	M	18-30
Belgium	2020	71,5	M	18-30
Belgium	2021	71,9	M	18-30
Belgium	2017	71,4	F	18-30
Belgium	2018	72,4	F	18-30
Belgium	2019	75	F	18-30
Belgium	2020	73,4	F	18-30
Belgium	2021	73,2	F	18-30
Belgium	2017	78,4	M	30-50
Belgium	2018	79,8	M	30-50
Belgium	2019	80,2	M	30-50
Belgium	2020	79,6	M	30-50
Belgium	2021	79,8	M	30-50
Belgium	2017	77,8	F	30-50
Belgium	2018	78,7	F	30-50
Belgium	2019	79,4	F	30-50
Belgium	2020	78,8	F	30-50
Belgium	2021	79,8	F	30-50
Bulgaria	…

This table is of course a little wider than the long table that did not contain the gender and age breakdown. But it is still called a ‘long’ table because each row contains a value for one numerical variable only. In this example: the value for the employment rate. The other cells in each row contain values for the dimensions country, year, gender and age, specifying to which value of the respective dimension the employment rate refers to. These dimensions are categorical variables that can be used to break down the values of the numerical variable.

Compared to the wide format, the long format data has column names that “make sense”: they describe what they contain. In the wide format, the column names “2017”, “2018”, … do not describe what they contain: they are actually part of the data.

The long data format reflects how data is stored in databases. It is a format that is very machine readable, and it allows for easy data processing like filtering and aggregating over multiple dimensions.

(Un)pivoting

Because the first two tables in this section contain the exact same data, they can be transformed into one another. The process of going from long data to wide data is called “pivotting” a table. You choose which one of the two dimensions (Country or Year) goes in the rows and which of them goes in the columns, and Excel (or any other spreadsheet or data analysis tool) will take care of putting the values in the right cells of the wide table. Pivotting a table in Excel is demonstrated in the Working with data: cleaning data in practice T7 LINK video.

Unpivoting data (going from wide to long format) is also possible but is less common and not straight forward in Excel.

Advantages of tidy data

Tidying data

Data structures

Wide versus long data

(Un)pivoting

Related pages