Description: While IoT discussions often focus on different types of sensors, what is equally important is what you do with the data that has been collected. Buzzwords used loosely, like ‘big data’ or ‘analytics’, can obscure the critical step of transforming data into valuable insights. How are complex formulas implemented in the collection of data? How are cleansing operations performed? How is data presented in a way that people can really make decisions with it? This case study looks at what to do with IoT data after it is collected.
Source: Webinar titled Obtaining Analytics from IoT Data by Jorge Lizama, lead architect for data and analytics solutions, GHD
Biography: Jorge is part of GHD’s initiative to unlock new value for clients by combining data analytics with the company’s understanding of global infrastructure and built assets. With more than 10 years’ experience in data technology, Jorge’s career has included leading consulting companies in Australia and Latin America.
Data analytics encompasses many buzzwords including big data, in-memory computing, cloud reporting and cloud infrastructure. The underlying question is how to make data really helpful, which this article will illustrate. We start with the identification of the requirement and the problems faced, cover the technical instrumentation, then look at why certain technologies are decided on and the final outcomes achieved.
The case study is focused on a project implemented in 2016. The brief was around observing, studying and understanding the vibrations or the movement of a building, which hosts a facility with highly sensitive instrumentation. There was a need to understand if a new tunnel that was being built under the building could potentially cause problems for the laboratory through increased movement or vibrations. The instruments are so sensitive that any movement beyond a certain threshold potentially causes a problem either in the readings or in the outputs this instrumentation provides.
The building is a very tall, lean tower. It was thought that wind was the main cause of movement and the concern was that the tunnel underneath could increase the susceptibility of the structure to wind-driven movement or vibrations, beyond the threshold for sensitive instrumentation.
The project involved monitoring any events that potentially can create a movement over a certain threshold. The aim was to identify potential factors affecting this movement, capture these events, alert clients about them and provide insights about these events.
The project began with installing instruments, including weather monitoring. Seismic sensors were installed to understand the movement, and a Trimble sensor was used to measure the displacement from a specific centre. Slope or tilt meters were also installed to detect any slope of the building away from a centre or from a 90-degree angle.
All data is collected by a server-based system that is able to capture the data every time one of the sensors provides a reading over a threshold.
The monitoring system sends email alerts when the threshold is reached. For example, in a scenario of more than 100 mm of deviation or vibration, an alert is sent by email to relevant people, and data is collected for the event to improve understanding of what happened.
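The threshold check described above can be sketched in a few lines. This is only an illustration, not the Trimble Pivot logic: the `Reading` type, the sensor names and the sample values are hypothetical, and actually sending the email is left out.

```python
from dataclasses import dataclass

ALERT_THRESHOLD_MM = 100.0  # example threshold quoted in the text


@dataclass
class Reading:
    sensor_id: str          # hypothetical sensor name
    displacement_mm: float  # deviation reported by the sensor


def exceeds_threshold(reading: Reading, threshold_mm: float = ALERT_THRESHOLD_MM) -> bool:
    """Return True when the absolute deviation is more than the threshold."""
    return abs(reading.displacement_mm) > threshold_mm


def build_alert(reading: Reading) -> str:
    """Format the body of an alert message (delivery is not shown)."""
    return (
        f"Sensor {reading.sensor_id} reported {reading.displacement_mm:.1f} mm, "
        f"above the {ALERT_THRESHOLD_MM:.0f} mm threshold"
    )


# Illustrative readings, not project data.
readings = [Reading("tilt-01", 42.0), Reading("disp-01", 130.5)]
alerts = [build_alert(r) for r in readings if exceeds_threshold(r)]
```

Only the second reading produces an alert here; a reading of exactly 100 mm would not, because the text speaks of "more than 100 mm".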
Obtaining analytics from big datasets
The volume of the data is massive, with 12 sensors collecting more than a million records a day, which makes it very hard to analyse the data as a whole. That's the crux of the problem. When the data set is large enough or complex, it cannot be dealt with in a standard database or a spreadsheet in the normal way.
The concept of big data is very loose. How much data is big data? A good definition is that it is data that cannot be dealt with using traditional technologies. Normally, a million records a day would not be called big data, but if the questions being asked of that data cannot be easily answered by traditional technologies, then it becomes complex.
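A quick calculation puts the volume in perspective. The sketch below assumes each of the 12 sensors reports about once per second, which is an inference from the quoted numbers rather than something stated in the webinar, and relates that rate to the 300 million to 400 million records analysed in the project.

```python
SENSORS = 12                    # from the text
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Assumption: one reading per sensor per second (not stated in the source).
records_per_day = SENSORS * SECONDS_PER_DAY

# Days of collection needed to reach the 300 to 400 million records
# that the case study analyses in one go.
days_for_300m = 300_000_000 / records_per_day
days_for_400m = 400_000_000 / records_per_day

print(f"{records_per_day:,} records per day")
print(f"{days_for_300m:.0f} to {days_for_400m:.0f} days of data")
```

Under that assumption the daily count is just over a million, which is consistent with the figure quoted above.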
Our challenge was to understand how the movement of the structure correlates to all the readings. We started by combining the datasets and looking at how the readings from each sensor were related. For example, how are the direction and the magnitude of a displacement related? The displacement happens in two measures – distance and position of the displacement. So how do they correlate with each other? Is it relevant? How is the relationship visualized in a way that is meaningful? These are questions that are not simple to address using real time analytics.
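One simple way to quantify how two sensor series move together is a Pearson correlation coefficient. The sketch below is illustrative only: the project used interactive visual analysis rather than this code, and the sample values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Illustrative values, not project data.
wind_speed_ms = [2.0, 5.0, 9.0, 4.0]
displacement_mm = [11.0, 10.0, 12.0, 11.0]
r = pearson(wind_speed_ms, displacement_mm)
```

A value near +1 or -1 suggests the two readings rise and fall together; a value near 0 suggests no linear relationship.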
This is where the suite of approaches that make up the Internet of Things provides a potential solution. However, a practitioner needs to determine which of the approaches outlined below is relevant for any given problem.
No to Predictive Analytics
For this case study, even though someone could call the correlational analysis advanced analytics, or similar, there is no need to use any advanced tool. Any rapid data miner should be able to create a simple visualisation of real time analytics that are able to give a clear correlation between the different trends.
Obviously, looking at the data, there are a lot of things that could be put in place using predictive analytics. For example, understanding what leads to an event and how this may predict when another event can happen. But in this case the aim is not about predicting movement because there is not much that can be done about the vibrations. The challenge is about understanding if the vibrations are going to have an effect on instrumentation or not.
No to Big Data
With big data, one considers the use of a Hadoop type of environment. In this case, we had 300 million to 400 million records to be analysed in one go, but this can be compressed to as little as 20 gigabytes of data. So the volume of data can be compressed enough to be managed without any need to go into big data.
Big data is excellent to store and retrieve incredibly massive amounts of data, but the key operations in this case are things like joining databases, different datasets and doing the computing. The problem of the displacement sensor is that the deviation and the direction of the deviation can’t really be interpreted in just one dataset. It has to be transformed. This is going to take a lot of computation for processing it. That’s not what Hadoop is for.
Other tools have to be worked with aside from Hadoop, like Spark, to be able to do that type of computing. We are not talking about a big data environment by itself. So for this case we discard the need for big data.
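The kind of per-record transformation mentioned above can be sketched as converting each displacement reading from a distance and a direction into east and north components. This assumes the direction is a compass bearing in degrees, measured clockwise from north; the actual format of the Trimble data is not given in the source.

```python
import math


def to_components(distance_mm, bearing_deg):
    """Convert (distance, compass bearing) into (east_mm, north_mm).

    Assumes 0 degrees is north and 90 degrees is east.
    """
    theta = math.radians(bearing_deg)
    return distance_mm * math.sin(theta), distance_mm * math.cos(theta)


# Illustrative reading: 10 mm towards the northwest (bearing 315 degrees).
east_mm, north_mm = to_components(10.0, 315.0)
```

Once every record is expressed as components, readings can be averaged, joined to other datasets and plotted without ambiguity about direction. The cost is that the conversion has to run on every record.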
No to SQL Database
SQL databases are beautiful to put data in. They can be ideal for a lot of sensors, but with just 12 being considered in this case, they are not really ideal. SQL databases have the same problem as Hadoop when getting data out, especially when real-time analytics is included.
Cloud computing would normally be a perfect environment to build this application in. IBM, SAP, Amazon and the other cloud providers all offer enough data environments to create a full real-time analytics environment that can do what is needed. The only reason that cloud computing wasn’t used in this case is that all the elements we needed were already in place in-house at GHD.
Data integration was needed because it was known that the existing Trimble Pivot database was not able to cope with the type of real-time analytics that had to be done. The database could store the data, but was not able to handle complex queries. With real-time analytics, all the computation needs to happen on the spot and provide a fast outcome.
Data integration makes use of Extract, Transform, Load (ETL) tools. There are other new players like Enterprise Feedback Management (EFM). These are the tools that are utilised to bring the data from one point to the other.
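The three ETL steps can be illustrated with a minimal sketch using only the Python standard library. It is not the SAP tooling used in the project: the CSV layout, the column names and the use of an in-memory SQLite database as the target are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; the second row has a missing value.
RAW = """sensor_id,timestamp,value_mm
disp-01,2016-05-01T00:00:00,12.5
disp-01,2016-05-01T00:30:00,
tilt-01,2016-05-01T00:00:00,3.1
"""


def extract(text):
    """Extract: read the raw rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no value and convert values to floats."""
    cleaned = []
    for row in rows:
        if not row["value_mm"]:
            continue  # cleansing step: skip incomplete readings
        cleaned.append((row["sensor_id"], row["timestamp"], float(row["value_mm"])))
    return cleaned


def load(conn, records):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (sensor_id TEXT, ts TEXT, value_mm REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW)))
row_count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

Keeping the three steps as separate functions mirrors how ETL tools separate the source connection, the cleansing rules and the target system.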
In Memory Computing
In terms of the core data environment, what is in-memory computing and why is it used? In short, it provides computation power! In databases, one of the biggest problems is computations that have to happen before aggregation; only after that can the data be joined to the aggregate, which makes it easier to process. But when one needs to do this record by record, other data environments simply cannot handle it, or will take days to actually deliver the outcome.
In-memory computing is able to take on this type of challenge and was needed in this case. Since in-memory computing usually has very good compressing algorithms, it is possible to also utilise the space.
In-memory computing means doing all the computing and complex calculations in Random Access Memory (RAM). In-memory is estimated to be up to 5000 times faster than computing that accesses disk storage. It is useful for real-time analytics that needs to happen on the spot.
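Why some computations must happen record by record, before aggregation, can be seen in a small example. The values below are hypothetical east and north displacement components; the point is only that aggregating first and computing afterwards gives a different answer from computing on each record and then aggregating.

```python
import math

# Hypothetical (east_mm, north_mm) displacement records for one sensor.
records = [(3.0, 4.0), (-3.0, -4.0)]

# Record by record: compute each magnitude first, then average.
mean_of_magnitudes = sum(math.hypot(e, n) for e, n in records) / len(records)

# Aggregate first: average the components, then compute one magnitude.
mean_east = sum(e for e, _ in records) / len(records)
mean_north = sum(n for _, n in records) / len(records)
magnitude_of_means = math.hypot(mean_east, mean_north)

print(mean_of_magnitudes, magnitude_of_means)
```

The first figure shows that the building moved 5 mm on average, while the second suggests it did not move at all because the two movements cancel out. Answering the first kind of question means touching every record, which is the workload the text says in-memory computing was needed for.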
Popular options include Apache Ignite and, very recently, Apache Geode, as well as niche ones like MemSQL and VoltDB. There are also the big brands like IBM, SAP, Oracle and Microsoft.
Data can be moved from one point to the other. Someone can store it and create all the calculations that are needed to be able to have the outcomes, but how can they be presented?
That's where interactive analytics kicks in. Interactive analytics is basically a set of tools that allows data analysts to investigate the data, create visualizations that present easy-to-read results and figures, share them with other people, and answer the questions of the users or the client. The main vendors in this space include Microsoft (Power BI), Tableau, IBM, SAP, Oracle, SAS, MicroStrategy and QlikTech.
Presenting data in charts and graphs enables users to understand what is happening at a glance and then refine further investigations of a complex phenomenon (such as building vibrations). This is what was needed in this case.
The solution we adopted is illustrated below:
The sensors capture the data to a 3G router. This is sent to a Trimble Pivot system that collects the data and is the monitoring system generating the alert. This fulfils the core task of recording every event and alerting the relevant people about each occurrence.
Next is the batch load and transformation component, the ETL component. In this case the SAP product is utilised but it could be AWS for example or any other product; SAP have their own data orchestration system.
SAP HANA is used for doing the in-memory computing. Around 400 million records can be compressed to around 20 gigabytes in memory. That's one of the key advantages of in-memory computing. It has very good compression capability, so it can handle a lot of data.
It has some limitations though. The biggest in-memory solution that can be installed on a single server at this point is 2 terabytes. If you are accumulating a lot of data then the limit will be reached.
Also, just to give an idea of costs, doing that in AWS is around $13-$14/hr. So, it comes with a price tag. Obviously, as technology advances, in-memory computing will start becoming a bit more of a commodity, but still it is a little bit pricey.
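The figures quoted above (about 400 million records in about 20 gigabytes, a 2 terabyte ceiling, and $13 to $14 per hour) can be turned into rough per-record and per-month numbers. This is a back-of-the-envelope sketch. It assumes decimal units, a 30-day month with the instance running continuously, and a compression ratio that stays the same as data grows, none of which is stated in the source.

```python
RECORDS = 400_000_000            # from the text
COMPRESSED_BYTES = 20 * 10**9    # 20 GB, assuming decimal gigabytes
SERVER_LIMIT_BYTES = 2 * 10**12  # 2 TB ceiling quoted in the text
HOURS_PER_MONTH = 24 * 30        # assumption: always on, 30-day month

# Average compressed footprint of one record.
bytes_per_record = COMPRESSED_BYTES / RECORDS

# Records that would fit under the ceiling at the same compression ratio.
records_at_limit = SERVER_LIMIT_BYTES / bytes_per_record


def monthly_cost(hourly_rate_usd, hours=HOURS_PER_MONTH):
    """Cost in USD of running continuously for the given number of hours."""
    return hourly_rate_usd * hours


low_usd, high_usd = monthly_cost(13), monthly_cost(14)
print(f"{bytes_per_record:.0f} bytes per record")
print(f"{records_at_limit:,.0f} records at the 2 TB limit")
print(f"${low_usd:,} to ${high_usd:,} per month")
```

On these assumptions the hourly price, rather than the memory ceiling, is the constraint a project of this size is likely to feel first.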
Next is data analysis, featuring a dashboard interface for users and decision-makers. SAP have one called Lumira, which is one of the competitors of Tableau and Power BI.
Finally, we have a publishing server. That is where the data analysts are able to publish the outcomes. The dashboard answers the client's questions in real time; the questions go to the in-memory computing layer. Every half an hour, the system receives updates from the data service component of the sensors.
That's pretty much the flow. The lap from sensor to data to decision-makers takes around half an hour. It can actually be pushed to real time, but that is not needed in this case.
The outcome is that the client was able to make the key inference for the project. Two charts were used showing two different metrics below.
The left hand diagram is the displacement of the building showing the direction of the displacement and the displacement distance. The other diagram shows velocity of the wind and its direction.
Looking at the two diagrams together shows the building is very clearly being displaced in one direction, northwest, while the biggest gusts of wind are either south or east. So this first visualisation seems to indicate that they are not related. This is what the client was most concerned about: that the wind may be affecting the displacement and that this would be exacerbated by the tunnelling. Much time was spent digging into the data and eventually it was confirmed that there is no impact from the new tunnel that is being built.
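The visual comparison of directions described above can also be expressed numerically by taking a circular mean of the bearings in each dataset and measuring the angle between the two means. The sketch below is not the analysis performed in the project. The bearings are made up, and it assumes both series use the same convention (degrees clockwise from north, pointing the same way), which would need checking against the real sensor documentation.

```python
import math


def mean_bearing(bearings_deg):
    """Circular mean of compass bearings in degrees (0 = north, 90 = east)."""
    east = sum(math.sin(math.radians(b)) for b in bearings_deg)
    north = sum(math.cos(math.radians(b)) for b in bearings_deg)
    if math.hypot(east, north) < 1e-12:
        raise ValueError("bearings cancel out; the mean direction is undefined")
    return math.degrees(math.atan2(east, north)) % 360.0


def angular_difference(a_deg, b_deg):
    """Smallest angle between two bearings, in the range 0 to 180 degrees."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


# Illustrative bearings, not project data.
displacement_bearings = [310.0, 315.0, 320.0]  # roughly northwest
gust_bearings = [170.0, 180.0, 190.0]          # roughly south
separation = angular_difference(
    mean_bearing(displacement_bearings), mean_bearing(gust_bearings)
)
```

A plain arithmetic average would fail for bearings that straddle north (for example 350 and 10 degrees), which is why the mean is taken over sine and cosine components instead.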
Questions and Answers
Question: What do these tools such as Hadoop and Spark do?
Answer: Hadoop is a very intelligent file system for storing massive data in a cheap way and then retrieving it at a very fast speed. It also creates redundancy. If there’s a log file of IoT data with a million records, it cannot be captured directly by Hadoop. Some kind of middleware has to be put in between, but eventually the log file is sent to Hadoop.
A key thing about Hadoop is that it allows the data to be accessed by the cheapest laptop or computer available. Hadoop avoids the need for massive servers that are very expensive.
Spark is an in-memory computing component that goes over the top of Hadoop. That's when the processing of data starts because Hadoop itself is just putting data in and retrieving data. It’s not for doing analytics, or doing real time analytics.
Initially this processing was done with a product called Hive, which is closely tied to Hadoop, but then Spark came in, which is now the preferred option for many. What it does is give the client the capability of building analytical applications, like running statistical models and complex mathematical models.
Question: How would in-memory computing work if this was a cloud-based solution?
Answer: Many of the cloud vendors such as IBM, SAP and AWS offer their own kind of in-memory processing product. So the client effectively has an in-memory database but it's not on their network.
The only trick is that clients have to take care that there could be data restriction on where the data lives. The cloud vendors will tell you, for example, that their in-memory databases are on data centres in Sydney but the disaster recoveries of those cloud environments are in Singapore and not Australian soil. This can be an issue if, for example, you are dealing with government clients.
Question: How will in-memory computing become integrated into greater business functions?
Answer: In-memory computing is used to support core business functions even more than big data. For example, governments utilise a planning component of an ERP to support their internal finances and budgeting. In-memory computing can be used for forecasting and trends from a whole year of data, which is pretty much impossible with classical databases.
Question: How long is the data retained in memory?
Answer: In most of the newest in-memory environments, a client chooses what is held in memory through a hot and cold data management strategy. Hot data is the data that is really needed for the current computation, and the data management strategy will push it into memory. Even if they want to add another year of data, they can say “I don’t need this in memory for now, so I can send it down to big data and let it wait there”.
So the most advanced in-memory environments work on top of big data. They are connected so that if more data is needed a click of a button will give the capability to manage that.
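The hot and cold split described above amounts to partitioning records by how recently they are needed. The sketch below uses a simple date cutoff. The record layout and the cutoff are hypothetical, and real in-memory products apply their own tiering rules.

```python
from datetime import date


def split_hot_cold(records, cutoff):
    """Partition (day, value) records: on or after the cutoff is hot, older is cold."""
    hot = [r for r in records if r[0] >= cutoff]
    cold = [r for r in records if r[0] < cutoff]
    return hot, cold


# Illustrative records, not project data.
records = [
    (date(2015, 6, 1), 1.2),
    (date(2016, 3, 1), 0.8),
    (date(2016, 9, 1), 2.4),
]
hot, cold = split_hot_cold(records, cutoff=date(2016, 1, 1))
```

In this picture the hot list stays in memory for the current computation, while the cold list is what would be sent down to the cheaper big data store until it is needed again.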
Question: Does in-memory computing induce any high risks in terms of security?
Answer: Not at all, because access to it is exactly the same as with any other database environment. If there is a power outage then the whole in-memory system could fail, but there is always persistence on disk. In-memory usually runs on Linux machines, so it is the same as any other database system.
For cloud data environments, all the vendors have very good built-in security models. A client can choose to leave everything open or to shut everything down so that nobody can access the machine except that client using one specific laptop. Some cloud security systems are better than the internal systems of individual companies.
Question: If it is much faster to access in-memory data, does it make sense to just hold the raw data, or to pre-process it in anticipation of query requests for faster response to user queries?
Answer: The only reason to pre-process data when working with in-memory computing is probably to have less data. If it is already pre-processed and aggregated into a smaller dataset, then less in-memory space will be utilised. If it's just for the sake of trying to speed things up, to be quite honest, it is already quite fast by itself. At this point the main factor that makes in-memory challenging is the cost, which can be up to $50/hour.
FINDING OUT MORE
You can view a recording of the webinar on which this case study was based.
If you want to know more about the Internet of Things and the various components like those described in the above case study, then do please join Engineers Australia’s Applied IoT Engineering Community by registering on this website. We run a regular program of webinars exploring the applications, opportunities and challenges of these technologies.