What exploratory data analysis (EDA) has taught me about improving design

First, it’s important to mention I am not a data scientist. I work in digital design and am doing my MBA. I have been in the fortunate position of taking a business analytics class which has afforded me a limited view of data science and statistics. I enjoy using the insights I gather and relating them back to my daily job.

With this framing in mind, let me share with you how design and data connect for me and what insights this has afforded.

It’s hard to get your hands on raw, meaningful data

We all assume there are abundant amounts of data ready for us to gather insights from, and while that can be the case (if the data was stored and transformed correctly), often it is not available, not in a useful format, or not meaningful. The data reality hit me hard. The data I wanted, ‘number of users’, ‘user time through workflows’ & ‘where users get lost’, was simply not available. Extrapolating from what I could get became my next focus. With this learning in mind: if you want to measure your designs properly with identified data, you have to know what data you want to gather and build its collection into the product before launch. Planning is therefore required. Otherwise you will be left with more qualitative than quantitative data.

## 'data.frame':    48842 obs. of  15 variables:
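To make the “data reality” concrete, here is a toy sketch in R of the kind of check I ended up doing — how much of each field is actually there. The data frame and column names below are invented for illustration, not my real dataset.

```r
# A toy stand-in for a product dataset (names and values invented).
dataset <- data.frame(
  user_id          = 1:5,
  signup_source    = c("ad", NA, "organic", NA, NA),
  time_in_workflow = c(NA, NA, NA, NA, NA)   # never instrumented
)

# Share of missing values per column: 1.0 means the field was never captured.
colMeans(is.na(dataset))
```

A column that comes back as 100% missing is the quickest way to learn that the measurement you wanted was never built into the product.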

Data doesn’t have a binary state of impact or no impact. It’s more subtle: variables have strong or weak connections.

Assuming you get past the first hurdle of getting any data (which I did), you are then left naively looking for insights. I thought I would see strong indicators and clear directions for improvement. Again, it was not so simple. What I started to recognise were strong and weak connections. Variables can show a connection to each other without pointing to any improvement, and some variables turn out to have nothing to do with making improvements at all. The reality of data, as it turns out, is incredibly nuanced.

My data correlations
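The strong and weak connections above can be sketched with base R’s cor(). The data frame below is an invented stand-in, not my real dataset.

```r
# A toy stand-in for my dataset (column names and values invented).
dataset <- data.frame(
  hours_active   = c(2, 5, 1, 8, 4, 7),
  tasks_finished = c(1, 4, 1, 7, 3, 6),
  errors_logged  = c(9, 3, 10, 1, 4, 2)
)

# Pairwise correlations: values near 1 or -1 are strong connections,
# values near 0 are weak ones.
round(cor(dataset), 2)
```

A strong correlation still doesn’t tell you *which* variable to change to get an improvement — it only tells you the two move together.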

Insights and extrapolations are still human generated

This is related to the consideration above: you can automate your data gathering to a point, and then you are left to finish the thought (& the report) on your own. Exploring data (& AI), as it turns out, did not relieve me of the work it takes to be concrete about insights. As one human with her data, I found myself pulling and stitching different pieces of information back together again. This part, for me, was manual. I am sure there are people out there who can program this, but I am not one of them. Pulling together your improvement report, if you have limited programming skills, is a hands-on activity.

My data grid of ‘relatedness’
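The manual “stitching” I describe is mostly joining tables that arrived separately. A minimal sketch in base R, with invented table and column names:

```r
# Two toy tables standing in for separate data exports (names invented).
users  <- data.frame(id = c(1, 2, 3),
                     segment = c("new", "new", "returning"))
visits <- data.frame(id   = c(1, 1, 2, 3),
                     page = c("home", "help", "home", "home"))

# merge() stitches the tables back together on the shared id column,
# producing one row per matching pair.
merge(users, visits, by = "id")
```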

You will spend time trying to fix your code

I wish this were not true, but gathering and analysis were where I sank most of my time in this exercise. I spent a LOT of time fixing my code, trialling chart-generation packages, and getting stuck on syntax. Genuinely, around half of my time went on fixing my own mistakes. My console log (and terminal) can corroborate my story. If you want to get to the next level in modelling data for design improvements, you will have to learn to code.

bash: syntax error near unexpected token `"tidyverse"'

Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
    filter, lag
The following objects are masked from ‘package:base’:
    intersect, setdiff, setequal, union
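For what it’s worth, the bash error above is the shell rejecting an R command: install.packages() only exists inside an R session, not at the bash prompt. A small sketch of the distinction:

```r
# Typing install.packages("tidyverse") at the bash prompt produces the
# syntax error above, because bash is not R. Start R first:
#   $ R
#   > install.packages("tidyverse")
# The dplyr "masked" messages are normal, not errors: dplyr::filter()
# shadows stats::filter(), and the originals stay reachable via ::.
exists("install.packages")   # TRUE inside a running R session
```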

Data can have bias and errors

This was an uncomfortable learning for me. I had always believed that data was agnostic. To a certain extent it is, but the data frame is human, and the context the data sits in is also human. I was not prepared to discover this. It first hit me with a ‘gender’ field that was stored as a binary variable: you are either male or female, with no other options. I do not want to discuss gender identity in this article, but I do want to point out that scientifically there is more variation than two categories — people born with both kinds of reproductive organs, for example. What we do with data is ‘binning’. As human users of data, we group it into chunks to make it more ‘manageable’ to process. We don’t like having a handful of outliers, so we group. This pre-processing of data, before or after gathering, is where we inflict our own bias, and how data can end up with biases built in.

ggplot( data = dataset ) +
  geom_bar( aes( x = gender, fill = variable ) ) +
  scale_fill_manual( values = c( "blue", "red" ) )

ggplot( data = dataset ) +
  geom_bar( aes( x = gender, fill = variable ), position = "fill" ) +
  scale_fill_manual( values = c( "blue", "red" ) )
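The ‘binning’ described above can be sketched with base R’s cut(). The breaks and labels below are invented — and choosing them is exactly where the bias creeps in.

```r
# A toy sketch of binning: grouping a continuous variable into chunks.
ages <- c(17, 23, 35, 41, 52, 68, 74)
bins <- cut(ages,
            breaks = c(0, 30, 60, 100),             # who decided these cut points?
            labels = c("young", "middle", "older"))  # and these names?
table(bins)
```

Shift a break point or rename a label and the same raw numbers tell a different story.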

It’s what you do with the data that actually matters.

Doing data analysis only matters if you are actually going to use the insights; otherwise, why bother? Knowing how to improve a piece of software only matters if you are prepared to make the changes needed and then measure the impact again to see if you actually got an improvement. Just gathering the information is not enough. Use what you know, suggest improvements, roadmap the feature improvements, and then check again.

Otherwise it becomes a nice article to share on your blog. Yes, I just ‘self-owned’.
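Re-measuring after a change can be as simple as comparing the metric before and after. A toy sketch with invented task times:

```r
# Invented task-completion times (seconds) before and after a design change.
before <- c(48, 52, 61, 45, 58, 50, 63, 55)
after  <- c(41, 44, 39, 47, 42, 40, 45, 43)

# A two-sample t-test: a small p-value suggests the shift is real,
# not noise. (Real product data needs more care than this sketch.)
t.test(after, before)$p.value
```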

A few of my data pairs