From messy Excel data to a tidy dataframe: a script for the Flemish municipal monitor data

Feb 27, 2020 5 min read

Data is often stored in excel files with a messy layout. Messy for research purposes, that is. One way around this is to cut and paste your way to a tidy format with observations in rows and variables in columns. That’s one way, but maybe a script can do the work for you. This post has an example worked out.

The excel data comes from the municipal monitor of the Flemish government. Representative surveys have been conducted in all 308 municipalities in Flanders. The Agency for Local Government shares the [excel data](/files/survey 2.zip) with researchers. The data is in Dutch.

The data is organised in 9 excel workbooks with different topics. Every workbook has a sheet for every item in the questionnaire. The first line is a description of the topic. The variable names are in the second row and the observations in the remaining rows. In some rows, there are totals of subgroups of municipalities. What we need is a database with municipalities in rows and all variables in columns, preferably in one .csv file.

If you just want to download the end result as an integrated csv file; it’s here.

Two packages are used. Readxl for reading excel data into R and dplyr for data wrangling.

library(dplyr)
library(readxl)

First, unzip the map in your working directory. Next, we need a list of all the excel workbooks in the unzipped map. Here, the excel workbooks are in a map called ‘survey 2’. We only need the files that start with ‘Survey’. The ^ sign says ‘begins with’.

listfiles <- list.files("survey 2/", pattern = "^Survey")

Next, we will loop over the workbooks (i loop) and over the sheets within the workbooks (j loop) The script says that we want data for every workbook and for every sheet.

the excel_sheets() function allows retrieving a list of sheets for every workbook in the listfiles vector. With the paste0() function, the path to the right workbook is created.

Next, the j loop loops through the sheets in workbook i. Each sheet has the topic of the item in cell A1. We will need this topic to name the variables in the final dataset. With read_xlsx(), we can extract this cell by defining a range (here: the first row). ````read_xlsx()returns a dataframe that needs to be unlisted. We just need a vector. The vector is stored in an object calledvarname```.

When running the script, we see a lot of warnings because R wants to give names while applying read_xlsx(). No problem, we just want the vector.

We also need the categories for the answer. The variables in the sheets are the municipal percentages on the answering categories. A sheet with an item with a 5 point Likert hence has 5 variables. The approach is the same as for the variable name.

Now, we can compose a vector of names to use (nametouse in the script). We need to paste varname (i.e. the topic of the sheet) to the categories of the answers in the varcat vector. If not, variables will be named “agreed” or “yes”, which is not very useful. Yet, we don’t want the pasting for the unique NIS identifier of municipalities and for the year. We will need those to join datasets in the next stage.

Now, we read the core data. We read sheet j of workbook i. The range is constrained to the first 309 rows. Some older data for 13 cities is added to the sheets and the ambition here is to only retrieve the 2017 wave. The left join will make sure only 2017 data is retained.

For the colnames, we use the nametouse vector we just created.

In the first loop if(i=1 and j = 1) the dataframe is created.

In the following loops else, the temporary file is added to the dataframe with left join using the NIS identifier and the year.

This is the whole script.

for (i in 1:length(listfiles)) {
    sheets <- excel_sheets(path = paste0("survey 2/",listfiles[i]))
  for (j in 1:length(sheets)) {
    
    varname <- unlist(read_xlsx(path = paste0("survey 2/",listfiles[i]),
    sheet=j,range = cell_rows(1:1), col_names = F))
    
    varcat <- unlist(read_xlsx(path = paste0("survey 2/",listfiles[i]),
    sheet=j,range = cell_rows(2:2), col_names = F))
    
    nametouse <- c("NIS","Jaar",paste(varname,varcat[-(1:2)]))
    
    tmp<-read_xlsx(path = paste0("survey 2/",listfiles[i]), sheet=j,
                   range = cell_rows(3:309),col_names = nametouse)
                   
    if (i==1 & j==1) {
      monitor_data <- tmp
    } else {
      monitor_data <- left_join(monitor_data,tmp, by = c("NIS","Jaar"))
    }
  }
}

The totals of the answers add up to 1. Those variables are not needed in the dataframe. We remove variables that contain the string “Total”. The end result is a dataframe with 857 variables.

monitor_data_clean <- select(monitor_data, -contains(" Total"))

Finally, we save the dataframe in a suitable format. A csv file is probably most convenient and can easily be important Stata, SPSS or free software such as Jamovi or JASP. With the haven package (library(haven)), you can write to proprietary software formats directly with write_sas() for SAS, write_dta() for Stata, or, write_sav() for SPSS.

write_csv(monitor_data_clean, "GemeenteMonitorSurvey2017.csv")

If you want see another case, this blog by Sophie Bennet may be useful as well.

REFERENCES

Hadley Wickham and Jennifer Bryan (2019). readxl: Read Excel Files. R package version 1.3.1. https://CRAN.R-project.org/package=readxl Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2019). dplyr: A Grammar of Data Manipulation. R package version 0.8.3. https://CRAN.R-project.org/package=dplyr Hadley Wickham and Evan Miller (2019). haven: Import and Export ‘SPSS’, ‘Stata’ and ‘SAS’ Files. R package version 2.2.0 https://CRAN.R-project.org/package=haven

Method blog English

Wouter Van Dooren

Professor of Public Administration

My research interests include public sector performance; performance information, accountability and learning; and conflict in public participation