27 August 2019
What is a data lake, where does it live and why is creating one a critical but often overlooked step in enabling digital transformation?
You might have heard of a data lake. And like a lot of people, you're probably wondering - what is it? Is a data lake different to a database? And are they actually important for enabling digital transformation?
Why a data lake?
So why would you actually need to create a data lake? There are lots of use cases, but in essence, you need a data lake if you're working with big data that exists in lots of different forms and requires lots of different treatments in order to be useful.
As an example, imagine a number of sensors monitoring a chemical process. Data points from each sensor are recorded into a historian, with a new data point recorded every 5 seconds.
The output of this process is sampled by lab technicians, who perform tests and record their results daily in a spreadsheet. There's a problem though - the results lack accuracy. So to achieve accurate results, additional datasets need to be analysed in conjunction, in the form of maintenance logs and calibration certificates.
Results improve with this approach, but then machine learning is suggested to further increase accuracy and to solve other problems as they occur.
While this is a fantastic step towards digital transformation, there is a good chance that:
- the required datasets overlap significantly
- multiple different teams are copying and using them independently
- there is a major duplication of effort and data.
This is the point so many organisations reach. But as in this example, it's a case of the cart going before the horse. Before you start implementing digital transformation in the form of artificial intelligence and machine learning, you first need to have the right data available in the right format. And this is why data lakes are such an important piece of the puzzle.
What's actually needed is a data lake.
Data lake, data warehouse or database?
Once you know that you're dealing with massive amounts of data, it's important to consider if a data lake is actually the best solution. There are lots of ways data can be stored and accessed, including databases and data warehouses.
The first and most obvious difference between these forms of storage is the type of data they can accept. A database and a data warehouse are generally built to house structured data that's been organised and arranged. A data lake is much more versatile, as it can accept structured, unstructured or semi-structured data.
A data lake accepts data in its raw form and this makes it highly agile. Unlike a data warehouse or database, which can be very costly and time consuming to update or change, the lack of structure in a data lake allows developers and data scientists to easily configure and reconfigure datasets and models.
Typically, if you're working with big data and you want to use machine learning to extract insights from it - a data lake is going to be your best bet for getting the right access and agility.
An appetite for data
Training a machine learning model to the point where it can actually provide useful insights typically requires huge amounts of data - and often from more than one source.
Copying large datasets around is slow and often infeasible, especially since data processing often needs to be co-located with the data itself. On top of this, different teams of subject matter experts will usually need self-service access to the data to draw out usable insights.
When these factors coalesce, establishing a data lake is the next logical step. But it's probably time for a more detailed definition of what exactly a data lake is.
What exactly is a data lake?
A data lake is a system or repository of data stored in its natural/raw format.
In essence, creating a data lake involves removing the duplication of effort by collecting data from disparate sources into one location.
Raw data is collected in a single place, from a multitude of unique data sources, across departmental silos, from legacy systems, at different cadences, in different formats, and it can be structured or unstructured.
The tools for processing data are co-located with the data in the data lake, so there's no requirement to repeatedly copy data in order to work with it. This makes it possible to deal with much larger datasets, without the need to wait for them to download to local machines.
As we can’t presume from the outset to know all of the potential ways that our data can be used, a data lake is ideal as the place to store it in its raw form. The data is catalogued and available for use with exploration and visualisation tools which can query across data from all sources. With this approach, it's even possible to join finance, production, and OH&S data to gain new and intriguing insights.
Once a user need is determined, it also becomes easy to transform subsets of the data and save it into staging areas for use. An example might be a widely distributed monthly report dashboard that requires inputs from many sources.
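As a rough illustration of that staging step, the sketch below takes the earlier example of 5-second sensor readings and daily lab results, averages the readings per day, and joins them onto the lab results. The function names, tuple layouts and sample values are all assumptions made for illustration, and a real data lake would usually do this with its own query or processing service rather than in-memory Python.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_sensor_means(readings):
    """Average each sensor's readings per calendar day.

    readings: iterable of (iso_timestamp, sensor_id, value) tuples
    (a hypothetical layout). Returns {(date, sensor_id): mean_value}.
    """
    buckets = defaultdict(list)
    for ts, sensor_id, value in readings:
        day = datetime.fromisoformat(ts).date()
        buckets[(day, sensor_id)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


def build_staging_rows(readings, lab_results):
    """Join daily sensor means onto daily lab results.

    lab_results: iterable of (iso_date, result) tuples (hypothetical layout).
    Returns one dict per lab result with that day's sensor means attached.
    """
    means = daily_sensor_means(readings)
    rows = []
    for iso_date, result in lab_results:
        day = datetime.fromisoformat(iso_date).date()
        row = {"date": day.isoformat(), "lab_result": result}
        for (mean_day, sensor_id), avg in means.items():
            if mean_day == day:
                row[f"{sensor_id}_mean"] = avg
        rows.append(row)
    return rows


# Made-up sample data: two days of readings and two lab results.
readings = [
    ("2019-08-26T00:00:00", "temp_1", 80.0),
    ("2019-08-26T00:00:05", "temp_1", 82.0),
    ("2019-08-27T00:00:00", "temp_1", 90.0),
]
lab_results = [("2019-08-26", 0.93), ("2019-08-27", 0.88)]
staging = build_staging_rows(readings, lab_results)
```

The resulting rows are small and tidy enough to feed a dashboard directly, while the raw readings stay untouched in the lake for other uses.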
Where does a data lake live?
Typically, data lakes exist in the cloud, where there is access to cheap storage, an ever-growing ecosystem of services for processing, and elastic resources that only need to be paid for when the need arises to run transformation workloads.
Getting data into the lake
The data loading part of the story is simple. For data in larger batches, it's ideal to provide a secure endpoint to upload CSV, JSON or ZIP files. For real-time data, it's best to establish streaming infrastructure to publish to.
The more difficult part is getting the data out of disparate data sources. For example, legacy devices may use protocol gateways or need custom code developed to extract their data and transform it to a format ready for sending to the data lake. In other cases, you may have a historian that requires additional tools or connectors to be purchased or built before its data can be extracted.
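To make the "extract and transform" step a little more concrete, here is a minimal sketch that converts a historian CSV export into newline-delimited JSON ready for upload. The column names (timestamp, tag, value), the `source` field and the function name are assumptions for illustration only; a real historian export will have its own layout.

```python
import csv
import io
import json


def historian_csv_to_json_lines(csv_text, source):
    """Convert a CSV export into newline-delimited JSON records.

    Assumes hypothetical columns: timestamp, tag, value.
    The value is parsed to a float where possible; anything that
    cannot be parsed is kept as the raw string so nothing is lost.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row["value"]
        try:
            value = float(raw)
        except ValueError:
            value = raw
        lines.append(json.dumps({
            "source": source,
            "timestamp": row["timestamp"],
            "tag": row["tag"],
            "value": value,
        }))
    return "\n".join(lines)


# Made-up export with one good reading and one unparseable reading.
export = (
    "timestamp,tag,value\n"
    "2019-08-27T00:00:00,FT-101,12.5\n"
    "2019-08-27T00:00:05,FT-101,bad\n"
)
payload = historian_csv_to_json_lines(export, source="historian")
```

Keeping unparseable values instead of dropping them fits the idea of storing data in its raw form and deciding later how to treat it.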
Types of data
- Batched data
  - Extracted from historians, databases, operation and maintenance logs
  - CSV file uploaded periodically to a secure endpoint
- Streaming data
  - IoT devices
  - Real-time data
  - Streaming analytics, alerts, dashboarding and feedback
  - MQTT streams
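For the streaming side, the sketch below shows one way a single MQTT-style message might be turned into a flat record before it lands in the lake. It does not use an MQTT client library; the topic layout "site/area/sensor_id", the JSON payload with a "value" field and the function name are all assumptions chosen for illustration.

```python
import json
from datetime import datetime, timezone


def message_to_record(topic, payload, received_at=None):
    """Turn one MQTT-style message into a flat record.

    Assumes a hypothetical topic layout "site/area/sensor_id" and a
    JSON payload containing a "value" field.
    """
    site, area, sensor_id = topic.split("/")
    body = json.loads(payload)
    # Stamp the record with the arrival time if none was supplied.
    received_at = received_at or datetime.now(timezone.utc)
    return {
        "site": site,
        "area": area,
        "sensor_id": sensor_id,
        "value": body["value"],
        "received_at": received_at.isoformat(),
    }


record = message_to_record(
    "plant1/reactor/temp_1",
    b'{"value": 81.5}',
    received_at=datetime(2019, 8, 27, tzinfo=timezone.utc),
)
```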
Once there is data in your data lake, it becomes easy to explore and visualise.
Exploration options are virtually countless, but some common approaches include:
- visualising time series plots for a number of sensors
- looking for clustering to see if a system exhibits clearly defined states or modes of operation.
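As a small illustration of the clustering idea, the sketch below runs a very basic one-dimensional k-means over a single sensor's readings to see whether they fall into distinct operating modes. The function name, the evenly spaced starting centroids and the sample motor current values are assumptions for this sketch; in practice you would more likely use an established library across many sensors at once.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Very small k-means for one-dimensional data.

    Starts the centroids evenly spaced between the minimum and maximum
    (a simple deterministic choice for this sketch), then alternates
    between assigning points and updating centroids.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    labels = [
        min(range(k), key=lambda i: abs(v - centroids[i])) for v in values
    ]
    return centroids, labels


# Made-up readings that switch between an "off" and a "running" mode.
motor_current = [0.1, 0.2, 0.1, 9.8, 10.1, 10.0, 0.2, 9.9]
centroids, labels = kmeans_1d(motor_current, k=2)
```

With these sample values the readings split cleanly into two groups, which would suggest the system has two clearly defined modes of operation.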
When exploring data, it's also important to consider which features are useful and which aren't. Perhaps some sensors are essentially duplicates of each other, or others are misconfigured, broken or providing invalid data. In cases like this, it can be useful to remove these data points from the dataset being explored. This reduces the amount of data that you're working with and can improve the quality of the predictions being made.
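One simple way to approach this pruning is sketched below: sensors whose values never change are treated as stuck, and sensors that are almost perfectly correlated with one already kept are treated as duplicates. The 0.99 threshold, the "flatlined" rule, the function names and the sample data are all assumptions for illustration, and real sensor faults can look quite different.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series has no meaningful correlation
    return cov / sqrt(vx * vy)


def select_features(series, corr_threshold=0.99):
    """Split sensors into kept and dropped.

    series: {sensor_id: [values]}, all the same length.
    Returns (kept_ids, {dropped_id: reason}).
    """
    kept, dropped = [], {}
    for name, values in series.items():
        if max(values) == min(values):
            dropped[name] = "flatlined"
            continue
        duplicate_of = next(
            (k for k in kept
             if abs(pearson(series[k], values)) >= corr_threshold),
            None,
        )
        if duplicate_of is not None:
            dropped[name] = f"duplicates {duplicate_of}"
        else:
            kept.append(name)
    return kept, dropped


# Made-up readings: one backup sensor and one stuck sensor.
series = {
    "temp_1": [80.0, 81.0, 83.0, 82.0],
    "temp_1_backup": [80.1, 81.1, 83.1, 82.1],
    "pressure": [2.0, 1.5, 1.9, 1.2],
    "flow_stuck": [5.0, 5.0, 5.0, 5.0],
}
kept, dropped = select_features(series)
```

Recording a reason for each dropped sensor makes it easier for subject matter experts to review the decision later.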
Predictive analytics (powered by artificial intelligence algorithms) is a deeply technical and rapidly advancing field. But once you have a workable data lake in place it becomes that much easier to progress to training predictive models.
We can train models to identify abstract states of a system and determine which of those states the system is most likely to be in now. For example, we could define a state as "due to fail within the next 24 hours". A model can then give a prediction for which state the system is in, along with the confidence of that prediction.
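To show what "a state plus a confidence" could look like, here is a minimal sketch using a k-nearest-neighbours vote, where the confidence is simply the share of neighbours that agree. This is one simple technique chosen for illustration, not necessarily what a production system would use. The feature pair (vibration, temperature), the "healthy" label and the sample history are all made up.

```python
from collections import Counter
from math import dist

FAILING = "due to fail within the next 24 hours"


def predict_state(history, sample, k=3):
    """Predict a state by k-nearest-neighbours vote.

    history: list of (features, state_label) pairs from past observations.
    Returns (state, confidence), where confidence is the fraction of the
    k nearest neighbours that voted for the winning state.
    """
    nearest = sorted(history, key=lambda item: dist(item[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    state, count = votes.most_common(1)[0]
    return state, count / len(nearest)


# Made-up labelled history: (vibration, temperature) -> state.
history = [
    ((0.20, 40.0), "healthy"),
    ((0.30, 42.0), "healthy"),
    ((0.25, 41.0), "healthy"),
    ((1.10, 70.0), FAILING),
    ((1.30, 75.0), FAILING),
    ((1.20, 72.0), FAILING),
]
state, confidence = predict_state(history, (1.15, 71.0))
```

Note that the features here are not scaled, so temperature dominates the distance; a real model would normally normalise its inputs first.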
We could also train a model to predict the signature behaviour of a system in normal operation. For example, take the current of a motor driving a pump: when the measured current deviates from the signature current, we could raise an alert that an anomaly has occurred.
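The sketch below illustrates the alerting idea in its simplest form. Instead of a trained model, it stands in a plain statistical baseline: the "signature" is just the mean and standard deviation of readings taken during normal operation, and a reading is flagged when it sits more than a chosen number of standard deviations away. The function name, the tolerance of 3 and the sample currents are assumptions for illustration.

```python
from statistics import mean, stdev


def find_anomalies(baseline, measured, tolerance=3.0):
    """Flag readings that deviate from the baseline signature.

    baseline: readings captured during normal operation.
    measured: new readings to check.
    Returns a list of (index, value) for readings further than
    `tolerance` standard deviations from the baseline mean.
    """
    expected = mean(baseline)
    spread = stdev(baseline)
    return [
        (i, value)
        for i, value in enumerate(measured)
        if abs(value - expected) > tolerance * spread
    ]


# Made-up motor currents in amps.
baseline_current = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0]
live_current = [10.1, 9.9, 12.5, 10.0]
alerts = find_anomalies(baseline_current, live_current)
```

A trained model would replace the fixed mean with a prediction that changes with operating conditions, but the comparison and alert step would look much the same.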
Digital transformation often starts with a data lake
We're not overstating it when we say that a data lake is, more often than not, the critical missing piece that's blocking our clients from progressing their digital transformation journey. Implementing innovative new technology practices and platforms using artificial intelligence requires huge amounts of data, and a data lake is usually the best form that data can take.