When working with large, distributed, multi-detector sensor suites, such as those found in large-scale Internet of Things (IoT) deployments, the sheer volume of available data can delay timely analysis. For example, a suite of 5,000 roadway sensors transmitting one measurement per second generates 300,000 data points per minute, or 432 million measurements per day. This flood of data will quickly swamp a brute-force collection and analysis methodology, so a more calculated approach is required. Consider the example in Figure 1: a spacecraft with hundreds of sensors reporting several million data points per minute. How can data of such volume be captured, processed, and analyzed in real time? This article discusses methods of large-dataset analysis that can be used to produce actionable information.
Figure 1. Sensor signal processing pathway
Time Series Analysis Techniques
There are many types of sensors available for creating IoT sensor suites, including micro-sensors for temperature, geographic position, and many other quantities. Most of these sensors report their data to the collection board (for example, Intel® Edison, as shown in Figure 1) as a single value, such as angle of tilt or temperature, collected at a particular sampling frequency. The first step in processing these datasets is to reduce their dimensionality, lowering the overall number of data points while retaining important aspects of the data, such as anomalies or other out-of-band events. A variety of techniques have been developed for this purpose (see Table 1). Note that much of this preprocessing and data conversion can be performed directly on the collection board, dramatically reducing the quantity of data that must be transmitted to the data collectors.
Table 1. Data simplification techniques (that is, dimension reduction)
Figure 2. Conversion of time series data to discrete segments.
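As a rough illustration of converting a time series into discrete segments, the sketch below follows the commonly published SAX recipe: z-normalize the series, average it over equal-width segments (piecewise aggregate approximation), and map each segment mean to a letter. The alphabet size of four, the function names, and the sample temperature readings are assumptions chosen for illustration; they are not taken from this article.

```python
from statistics import mean, pstdev

# Breakpoints that split a standard normal distribution into four
# equal-probability regions (alphabet size 4 is an assumption here).
BREAKPOINTS = [-0.67, 0.0, 0.67]
ALPHABET = "abcd"


def znormalize(series):
    """Rescale a series to zero mean and unit standard deviation."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def paa(series, n_segments):
    """Piecewise aggregate approximation: mean of each equal-width segment."""
    if len(series) % n_segments != 0:
        raise ValueError("series length must be a multiple of n_segments")
    width = len(series) // n_segments
    return [mean(series[i:i + width]) for i in range(0, len(series), width)]


def to_symbols(segment_means):
    """Map each segment mean to a letter by counting the breakpoints below it."""
    word = []
    for value in segment_means:
        index = sum(1 for b in BREAKPOINTS if value > b)
        word.append(ALPHABET[index])
    return "".join(word)


def sax(series, n_segments):
    """Convert a numeric series into a short symbolic word."""
    return to_symbols(paa(znormalize(series), n_segments))


# Hypothetical temperature readings: 16 values reduced to 4 symbols.
readings = [20.1, 20.3, 20.2, 20.4, 25.9, 26.1, 26.0, 25.8,
            20.2, 20.1, 20.3, 20.2, 14.9, 15.1, 15.0, 15.2]
word = sax(readings, n_segments=4)
print(word)  # bdba
```

Sixteen readings collapse to a four-letter word, while the warm and cold excursions remain visible as the letters d and a. A reduction of this kind is small enough that it could plausibly run on the collection board itself.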
Large-Scale Data Management
To process large quantities of data, you must have sufficient storage allocated for that data. To take advantage of the iSAX approach (an indexable extension of SAX), you first need a way to handle both the storage requirements and the distributed processing that large-scale analysis demands. There are many ways to manage large-scale data storage and analysis; for time series processing, Apache Hadoop, with its distributed file system (HDFS), is a widely used and efficient option. For more information, see Kampf and Kantelhardt, “Hadoop.TS: Large-Scale Time-Series Processing”.
This approach stores the converted time series data in a distributed fashion, breaking the data files into segments that each contain a portion of the full time series dataset. With MapReduce, the segments are analyzed in parallel across multiple servers, and the partial outputs are then combined to produce the final result. The iSAX technique is well suited to this kind of distributed data analysis because it operates equally well on smaller subsections of the overall sensor data time series.
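The map-and-combine pattern can be sketched in plain Python without a Hadoop cluster. The example below is only an illustration of the idea under simple assumptions: the symbolic series, the segment length, and the function names are hypothetical, and the built-in map and functools.reduce stand in for work that a real deployment would spread across servers.

```python
from collections import Counter
from functools import reduce


def split_into_segments(symbols, segment_length):
    """Break one long symbolic series into fixed-length pieces."""
    return [symbols[i:i + segment_length]
            for i in range(0, len(symbols), segment_length)]


def map_segment(segment):
    """Map step: count the symbols in one segment (one node's share of work)."""
    return Counter(segment)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Hypothetical symbolic series, such as the output of a SAX-style conversion.
symbols = "bdbabdbaccdbabba"
segments = split_into_segments(symbols, segment_length=4)
partial_counts = map(map_segment, segments)
total = reduce(reduce_counts, partial_counts, Counter())
print(total)
```

Symbol counting was chosen because each segment can be processed with no knowledge of its neighbors. An analysis that looks for patterns spanning a segment boundary would need overlapping segments or an extra merge step.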
Real-Time Sensor Data Analysis
As noted earlier, the SAX technique converts time series data from time points to a symbolic sequence, represented by a defined set of letters (for example, ARTYNNTSS…). To handle millions (or in some cases, billions) of time series datasets, an indexing method has been developed that allows for efficient storage of the symbolic sequence, with consequent processing improvements. In an interesting scientific domain crossover, the symbols selected for a converted time series sequence were matched to the single letters that represent amino acid residues in a protein sequence. This approach allowed the researchers to use previously developed sequence-matching algorithms to search for anomalies and patterns (see Transell, “The Use of Bioinformatics Techniques to Perform Time-Series Trend Matching and Prediction”). For time series data that involves multiple sensor types (that is, multivariate data), the National Aeronautics and Space Administration (NASA) has developed a technique that monitors multiple space vehicle sensors and applies real-time predictive analytics to flag potential issues before they occur (see Iverson et al., “General Purpose Data-Driven System Monitoring for Space Operations”).
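To give a feel for pattern search over a symbolic sequence, the sketch below scans a sequence for windows that nearly match a target pattern. It uses a plain Hamming-distance comparison as a simple stand-in; it is not the alignment algorithm used in the cited bioinformatics work, and the function names, sample sequence, and mismatch tolerance are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_matches(sequence, pattern, max_mismatches=1):
    """Return (offset, mismatches) for every window close enough to pattern."""
    hits = []
    width = len(pattern)
    for start in range(len(sequence) - width + 1):
        distance = hamming(sequence[start:start + width], pattern)
        if distance <= max_mismatches:
            hits.append((start, distance))
    return hits


# Hypothetical symbolic sequence containing one exact and one near match.
sequence = "ARTYNNTSSARTYNQTSS"
matches = find_matches(sequence, "NNTSS", max_mismatches=1)
print(matches)  # [(4, 0), (13, 1)]
```

The exact occurrence is reported with zero mismatches and the near occurrence with one, which is the kind of tolerance needed when a recurring trend is distorted slightly by sensor noise.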
Another useful technique for large-scale real-time data analysis is to find so-called “out-of-band” quality control failures. In this case, the time series data is analyzed to find measurements that fall outside predefined quality bands (for example, a sensor temperature reading that exceeds a predetermined limit). Detection of these non-conformant readings may trigger an alarm directly, or it may initiate additional business rule–based steps that verify the anomaly and avoid false-positive notifications (see Figure 1). A more detailed description of how to perform this kind of analysis by using the Intel IoT Analytics site is available in The Internet of Things—Analytics: Using the Intel® IoT Analytics Website for Data Mining.
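A minimal sketch of out-of-band detection follows. It assumes one very simple verification rule, namely that a reading counts as an event only when several consecutive readings fall outside the band, so that an isolated spike does not raise a notification. The band limits, the confirmation count, the function name, and the sample temperatures are all hypothetical.

```python
def out_of_band_events(readings, low, high, confirm=3):
    """Return (start_index, length) for each run of at least `confirm`
    consecutive readings that fall outside the band [low, high]."""
    events = []
    run_start = None
    for i, value in enumerate(readings):
        if value < low or value > high:
            # Start a new run if one is not already open.
            if run_start is None:
                run_start = i
        else:
            # Back in band: keep the run only if it was long enough.
            if run_start is not None and i - run_start >= confirm:
                events.append((run_start, i - run_start))
            run_start = None
    # Close a run that is still open at the end of the data.
    if run_start is not None and len(readings) - run_start >= confirm:
        events.append((run_start, len(readings) - run_start))
    return events


# Hypothetical temperatures: one isolated spike and one sustained excursion.
temps = [21.0, 21.4, 35.2, 21.1, 20.9, 34.8, 35.5, 36.1, 35.9, 21.2]
events = out_of_band_events(temps, low=15.0, high=30.0, confirm=3)
print(events)  # [(5, 4)]
```

The single spike at index 2 is ignored, while the four-reading excursion starting at index 5 is reported. Real business rules would likely be richer, for example cross-checking a neighboring sensor before alerting.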
As the IoT finds increasing applications in personal, corporate, educational, and industrial settings, it is likely that the quantity and frequency of sensor data collection will increase exponentially. To use this information effectively, we must develop techniques to properly capture, condense, convert, and analyze millions of time series datasets. Often, these analyses will be required in near real time to meet the specific needs of businesses or individuals (for example, self-driving car networks). By applying techniques similar to those outlined in this article, it will be possible to scale a sensor network from a few individual sensors to many thousands of sensors without requiring a complete reworking of the data collection network.
- Visit the Intel Developer Zone sensors web site to learn more.
- Visit the Intel® IoT Gateway web site to learn more.
- Explore the IoT gateway offerings to compare features and necessities.