Missing values and outliers

Cleaning data

How you should handle missing values depends on the data itself and on the goal of your analysis or visualisation. In some cases you may want to filter out the records that contain missing values; in others you might assign an average or interpolated value; and in yet others you might fill in the missing data from a secondary data source. In that last case, make sure the secondary source contains data that is comparable to the data you were already using.
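
As a rough illustration of the first two options, here is a minimal Python sketch that filters out missing values, fills them with the mean, or interpolates them linearly. The `readings` list and the function names are hypothetical, and `None` is assumed to mark a missing value; your own data may mark gaps differently.

```python
from statistics import mean

# Hypothetical readings; None marks a missing value (an assumption).
readings = [12.0, None, 14.0, 15.0, None, None, 18.0]


def drop_missing(values):
    """Filter out the missing records entirely."""
    return [v for v in values if v is not None]


def fill_with_mean(values):
    """Replace each missing value with the mean of the observed values."""
    avg = mean(drop_missing(values))
    return [avg if v is None else v for v in values]


def interpolate_linear(values):
    """Fill gaps by linear interpolation between the nearest known neighbours.

    Gaps at the very start or end have only one neighbour and stay None.
    """
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result


print(drop_missing(readings))
print(fill_with_mean(readings))
print(interpolate_linear(readings))
```

Which option is appropriate depends on the data: interpolation only makes sense when the records have a meaningful order, such as a time series.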

When you check the range of values in the numerical columns, you might spot outliers: values that are much higher or lower than most other values. Outliers can be caused by errors, such as a misinterpreted decimal sign or a manual data entry mistake, but they can also be genuine values that are simply part of the data.
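
Checking the range can be as simple as looking at the minimum and maximum of a column. To flag candidate outliers more systematically, one common heuristic is the interquartile range (IQR) rule. This page does not prescribe that rule; it is used here only as an example. The `values` list and the `find_outliers` function are hypothetical.

```python
from statistics import quantiles

# Hypothetical measurements; 1200 could be 12.00 read with the wrong decimal sign.
values = [10, 12, 11, 13, 12, 11, 1200, 12]


def find_outliers(data, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (the IQR rule)."""
    q1, _, q3 = quantiles(data, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in data if v < lower or v > upper]


print("range:", min(values), "to", max(values))
print("flagged:", find_outliers(values))
```

A flagged value is only a candidate: whether it is an error or a genuine observation still has to be judged against what you know about the data.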

If outliers are due to errors, they should be corrected. If outliers represent real data, you can filter them out so that they do not distort the analysis and visualisation of the other data, or you can take them into account by using statistical techniques that are less sensitive to outliers. For example, to summarise the data you could use the median instead of the mean.
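
The difference between the two summaries is easy to see on a small, made-up example. In the sketch below, the hypothetical `values` list contains one extreme value: it pulls the mean far away from the bulk of the data, while the median barely reacts.

```python
from statistics import mean, median

# Hypothetical data with a single extreme value.
values = [10, 12, 11, 13, 12, 11, 1200, 12]
without_outlier = [v for v in values if v != 1200]

print("mean with outlier:     ", mean(values))
print("median with outlier:   ", median(values))
print("mean without outlier:  ", round(mean(without_outlier), 2))
print("median without outlier:", median(without_outlier))
```

Here the median is 12 with or without the extreme value, whereas the mean jumps from about 11.6 to more than 160.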
