You can read our previous article on Media Services here.
What is Big Data?
Big Data refers to massive volumes of data that are collected at great velocity and in varied contexts. Any collection of data, from weather forecasts to updates on Facebook, could be classified as Big Data. There are around 130 exabytes of data available in the world today, and that figure is expected to increase to about 40,000 exabytes by 2020. Around 90% of the data in the world today has been collected over the past 3 years, which is no surprise given the rate at which the Internet is growing.
For Big Data to provide insights for decision making, it needs to be accessible, cleaned, analyzed and presented in an appropriate manner. Data without analysis serves no purpose. Traditionally, various BI (Business Intelligence) tools have provided reports, alerts and notifications based on data held on local servers. Probably the most popular technology stack for Big Data analysis is Apache Hadoop. Microsoft’s Azure platform gives Hadoop the benefits of a cloud implementation through the service called Azure HDInsight.
Analytics on Azure is possible through two services: SQL Reporting and HDInsight. SQL Reporting is essentially a subset of the reporting services included with SQL Server; it lets you build reporting into applications running on Windows Azure, in formats such as HTML, XML, PDF and Excel. This article focuses on HDInsight and the machine learning aspects of Windows Azure.
Why Analytics on the cloud?
Approaches based solely on analyzing historical data are falling behind, given the exponential proliferation of data. Today, there is greater demand for real-time solutions and for predicting future trends through predictive analytics. This kind of workload is well suited to cloud platforms such as Azure, which offer virtually unlimited computing power and storage space on a pay-as-you-go model. Azure lets the user auto-scale as the data grows, and its built-in functions for analytics and reporting reduce the need for 3rd party tools. This makes the entire process of analyzing data a lot easier compared to spending months developing a solution with Hadoop and Java programming. Seamless integration with other Azure services, such as Websites and Databases, makes analysis easier and much more comprehensive.
A Hadoop cluster is a special type of computational cluster designed chiefly for storing and analyzing huge amounts of data in a distributed computing environment. Azure HDInsight lets you manage these clusters through PowerShell, a powerful scripting environment. The clusters also have a second head node added to them to improve the reliability and availability of the service. Data storage is a lot more efficient and economical with Blob storage. Unstructured transactional data can be stored in Azure Table storage or any other NoSQL database for high-performance solutions.
A Brief Overview of HDInsight
HDInsight is 100% compliant with Apache Hadoop, and is built on top of the Hortonworks Data Platform (HDP). It works in a master-slave fashion, where the head node controls the overall operation of the cluster. As already mentioned, a secondary head node is also included, providing more reliability and availability. The data can be stored either on HDFS (Hadoop Distributed File System) or on Azure Blob storage, the latter having the following advantages:
1) Data remains available on Blob storage regardless of whether the cluster is provisioned or has been destroyed. This makes it very cost effective, as there is no need to keep the cluster running just to access the data.
2) Storing the data on Blob storage lets other tools and processes access it for their own purposes.
Either way, user data and job metadata reside in Windows Azure Blob storage. The head node reads the job metadata from Blob storage and uses it to do the processing, the results of which are also stored in Blob storage.
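The article does not go into how a job is split across the cluster. Hadoop's classic processing model is MapReduce, and the sketch below imitates its three stages (map, shuffle, reduce) in a single Python process. It is only an illustration: the function names and the sample text are made up, and the chunks merely stand in for blocks of data that would live on different worker nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # Each chunk stands in for a block of data held by a different worker node.
    pairs = []
    for chunk in chunks:
        pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    # Made-up sample input, for illustration only.
    sample = [
        "big data on azure",
        "azure hdinsight runs hadoop",
        "hadoop stores big data",
    ]
    print(word_count(sample))
```

On a real cluster the map and reduce steps run in parallel on the worker nodes, while the head node coordinates the job.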
Here is a quick run-through of how to provision an HDInsight cluster. First, ensure that you have the following:
• An Azure subscription.
• Office 2013 Professional Plus, or Office 365 Pro Plus, or Excel 2013 Standalone, or Office 2010 Professional Plus.
How to provision an HDInsight Cluster
Step 1: Creating an Azure Storage Account
Since HDInsight uses Azure Blob storage (also referred to as WASB), we need to specify an Azure storage account while provisioning an HDInsight cluster.
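Data held in Blob storage is addressed from the cluster with a WASB URI of the form wasb[s]://container@account.blob.core.windows.net/path. As a small sketch, the helper below assembles such a URI; the function name `wasb_uri` and the container, account and path values are hypothetical placeholders, not names taken from this article.

```python
def wasb_uri(container, account, path="", secure=True):
    """Build a WASB URI for a blob path in a given container and storage account.

    All argument values are supplied by the caller; nothing is checked against Azure.
    """
    scheme = "wasbs" if secure else "wasb"  # wasbs is the SSL-encrypted variant
    clean_path = path.lstrip("/")           # avoid a double slash after the host
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{clean_path}"


if __name__ == "__main__":
    # Placeholder names, for illustration only.
    print(wasb_uri("mycontainer", "mystorageaccount", "example/data/sample.log"))
```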
Sign in to the Azure Management portal and go to NEW -> DATA SERVICES -> STORAGE -> QUICK CREATE. Enter the URL, LOCATION and REPLICATION, and then click CREATE STORAGE ACCOUNT. Once the STATUS of the new storage account changes to ONLINE, the account is ready to use.
Step 2: Provisioning the HDInsight Cluster
Sign in to the Management Portal. Click HDINSIGHT and then click NEW in the bottom left corner. Go to Data Services>HDInsight>Hadoop and enter the required details. Cluster name is the name you want to give the cluster. Cluster size is the number of data nodes you want to deploy; the default is 4. Then enter the password for the admin account, and select the storage account you created previously from the dropdown box.
Then click Create HDInsight Cluster. Once the status column shows Running, you have successfully provisioned an HDInsight cluster. You can then run queries on it and import the results into Excel or other BI tools for further processing.
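Before opening the portal, it can help to write down the values that Step 2 asks for. The sketch below is one way to do that in Python. The `ClusterSettings` class and its checks are hypothetical conveniences: it does not call Azure, and it does not encode the portal's actual naming or password rules.

```python
from dataclasses import dataclass


@dataclass
class ClusterSettings:
    """The inputs requested in Step 2 (hypothetical helper, not an Azure API)."""
    cluster_name: str
    admin_password: str
    storage_account: str
    data_nodes: int = 4  # the default cluster size mentioned in Step 2

    def problems(self):
        """Return a list of obviously missing or invalid values."""
        issues = []
        if not self.cluster_name.strip():
            issues.append("cluster name is empty")
        if not self.admin_password:
            issues.append("admin password is empty")
        if not self.storage_account.strip():
            issues.append("storage account is empty")
        if self.data_nodes < 1:
            issues.append("at least one data node is required")
        return issues


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    settings = ClusterSettings("demo-cluster", "choose-a-strong-password", "demostorage")
    print(settings, settings.problems())
```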
Predictive analytics looks at trends across volumes of gathered data to find insights that apply to the future, such as the probability of rain or fluctuations in the stock market. Many platform developers are hurrying to offer better ways to build machine learning and predictive analytics. Azure sidesteps much of the programming this requires and offers machine learning and predictive analysis as a service, making it easier and faster for businesses to build analytics. Azure Machine Learning lets you build cloud-based predictive analytics solutions that are assembled from templates and common workflows, rather than written from scratch by a programmer or a data scientist. These solutions can then be published as APIs and consumed by third-party applications or other Azure services. All you need to do in order to do predictive analytics using Azure ML is:
1) Upload, or import online, the data that you wish to base your analysis on.
2) Build and validate a model (for example, one that predicts consumer segments on different days of the week).
3) Create a web service that uses your models to make fast, real-time predictions.
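The three steps above can be sketched in plain Python to show the shape of the workflow. This is not how Azure ML is used (ML Studio is a drag-and-drop tool); it is a stand-in that fits a least-squares line to made-up numbers, checks it on held-out rows, and wraps the model in a `predict` function of the kind a web service would expose. All names and data here are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs, ys):
    """Average absolute gap between predictions and held-out values."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


def make_predictor(model):
    """Wrap the fitted model in the function a web service would call."""
    slope, intercept = model
    return lambda x: slope * x + intercept


if __name__ == "__main__":
    # Step 1: "upload" data (made-up numbers that follow y = 2x + 1).
    xs = [1, 2, 3, 4, 5, 6]
    ys = [3, 5, 7, 9, 11, 13]
    # Step 2: build on the first four rows, validate on the last two.
    model = fit_line(xs[:4], ys[:4])
    print("validation error:", mean_abs_error(model, xs[4:], ys[4:]))
    # Step 3: expose predictions through a single callable.
    predict = make_predictor(model)
    print("prediction for x = 10:", predict(10))
```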
Azure Machine Learning (ML) has very effective algorithms in place to offer powerful predictive analytics. ML Studio, the development tool of Azure ML, uses simple drag-and-drop gestures to set up entire processes, sometimes without needing a single line of code. Azure has pre-built ML templates that cover standard machine-learning functions and can be used as-is. For veteran data scientists, Azure also supports more than 350 R-language packages, letting the user build upon these templates and customize or expand them to their liking. Sharing and collaborating through ML Studio is a breeze as well, since the people you work with do not even need to pay for an Azure subscription.
Azure ML has a lot of useful built-in algorithms for classification (multiclass and two-class decision forests), regression (Bayesian linear, linear regression, neural network regression, decision forests etc.), and clustering (standard K-means), among others. In short, Azure ML helps the beginner do away with the heavy programming involved in creating analytics, while giving experts a platform that does away with redundancies.
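To give a feel for what one of these algorithms does, here is a minimal plain-Python sketch of standard K-means. It is not Azure ML's implementation; the function names, the deterministic choice of the first k points as starting centroids, and the sample points are all assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(members):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(members)
    return tuple(sum(coords) / n for coords in zip(*members))


def kmeans(points, k, iterations=10):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [points[i] for i in range(k)]  # simple deterministic start
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = [
            mean_point(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


if __name__ == "__main__":
    # Made-up points forming two loose groups.
    sample = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
    print(kmeans(sample, k=2))
```

Production implementations usually pick starting centroids more carefully (for example at random, with several restarts), since the result depends on the starting positions.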
By offering powerful cloud-based predictive analytics without any major operational costs, Azure makes machine learning more accessible to developers. Now, it’s your turn to tinker with the platform.
You can read our introductory slideshow here.