Data preparation is not a chore.
- I often find that users and vendors think of data preparation and analytics as separate processes. You’ll read that Data preparation accounts for about 80% of the work of data scientists.
- Perhaps through association with the phrase data cleansing, users certainly seem to see it as a dirty job. Vendors agree, pitching tools that ease the burden of getting data ready for analysis.
- I can’t agree. Whether you call it data wrangling, data blending or some other new phrase, the distinction between data preparation and real analytics is often unrealistic and damaging.
- As an analyst, the insights you may have are determined by the nature of the data you work with. To borrow a phrase from the designer, Don Norman (who borrowed it from JJ Gibson), the affordances of the data – its actionable properties offered up to analysis – are largely determined by your data preparation. The analysis you can do – or more exactly, the analysis that you perceive you can do – is determined by the shape of your data, its quality, completeness, range of values, range of time and so on.
- None of these properties of your data has one perfect outcome that you can prepare or clean the data for. De-duplicating records; inferring missing values; concatenating fields; joining tables: each of these actions removes information about the nature of the source system, which may yet be useful in scenarios you have not happened upon. I have seen duplicate credit-card swipes cleaned up for the benefit of sales analysts, only to the despair of the fraud-prevention team.
- You must know something in advance about your intended use of data in order to prepare it effectively for that case. Equally, you must know something about your data in order to know what uses are appropriate. Analytics is a synthesis of the work you do with the data and the subsequent aggregations, algorithmic computations, visualizations or other manipulations you apply.
- In short, data preparation and analytics are two sides of the same coin.
- As you would expect, vendors in the BI space have taken some very different approach to this problem. I am not going to try to review all the tools and techniques here. But I always have some tools at the top of my mind – you’ll soon recognize them as my favourites – and their methods are quite various enough for one blog post.
- Although some tools are super smart in their analytics – ThoughtSpot, I am talking about you – they throw the needs of data preparation over to other applications, and most likely to other people in your organization. However, many serious BI tools today have data preparation either built-in or available to the app in some way, but even then, most often as a very distinct process: an unfortunate, messy prelude to the real work of building a dashboard, or visualization or running algorithms.
- In Qlik you need to drop into a specific data source mode, though the visual tools they provide are attractive and smart. YellowFin also has a data profiling and preparation workspace integrated, with data virtualisation helping to address concerns about replicated data sets. Tableau’s new Maestro product will be a separate workspace which features a visual data flow and some promising algorithmic help for data quality. Similarly, Microsoft’s Power Query is a separate tool and users find it’s Excel base enables it to perform powerful column transformations. All of these tools can be used by the analyst, but they all impose some discontinuity between preparation and analysis that impedes the users flow of thinking.
- Alteryx has exceptionally powerful data preparation tools – it is often used as a standalone ETL application for Tableau, Qlik or PowerBI users. Yet it has an excellent analytic stack too, especially for machine learning. There is a well-thought-out workflow between those two capabilities, although there are still some disconnects.
- There are other approaches. Domo, to my mind is most successful when focussed on a business user, often an executive, who needs analytic power and insight but would not think of themselves as analyst. In their case, data connectivity is often through rich application connectors which deeply understand the technical and business metadata of their sources: the result is data that is ready for use by that class of user. Domo data connectivity is smart, more so than the data preparation, which goes by the unfortunate name of ETL Magic . This is very productive for executives. You might compare these smart connectors favourably with Microsoft’s supposedly wide range of rather rudimentary connectors which remind me of the old British sitcom, set in a tailors’ shop: Never mind the quality, feel the width.
- ClearStory, to my mind, takes an even smarter approach, without deep understanding of business application metadata, but with outstanding support for Hadoop and NoSQL sources. ClearStory finds patterns, discontinuities, potential derivations and insights in data in a way that is mostly seamless with its analytic user experience – they call it Harmonization. Doesn’t that sound nicer than data wrangling? If ClearStory’s distinctive style of analytics works for you, there will be no dirty work involved, but rather a continuous process of discovery.
- I am not a fan of multi-purpose tools, so don’t think I am advocating some sort of monstrous chimera of ETL and analytics. However, I do strongly believe that good analysts and data science need to be able to work with their raw material – their data – in a thoroughly engaged manner. The tools I have mentioned all take different approaches, but each have some advantages for analysts who realise they need to get their hands dirty in order to do their best work.
Whether for data preparation or analytics, there is no one-size-fits-all solution for your business. Different teams have varying capabilities, diverse needs and ultimately different objectives. The TreeHive Analytics Gameplan process can help you to identify well-scoped, achievable steps towards your analytic ambitions. You can read more about it here.