Friday, October 31, 2014

Big data: A pragmatic overview


1. Introduction


We generally treat big data as a new buzzword that will give a cool tag to our products. So, almost inadvertently, we bolt on some of the popular technologies just to claim that our application does some kind of big data analysis. It may not always be done correctly or for any intended use. I remember the same thing happening when Web 2.0 took off: every company wanted to join the bandwagon without knowing what the technology really meant. Using NoSQL or Hadoop is not big data. We need to understand the use cases of big data, its advantages and its pitfalls before we can extract any real value from it.
There are numerous references available on this topic, most of which talk about the 6 Vs of big data (Volume, Variety, Velocity, Variance, Veracity, Value) and the challenges they pose, and then delve into NoSQL, MapReduce and related issues. That treatment is quite theoretical and hard to appreciate for someone who is just getting introduced to the topic.
I came across this book [8] last month, and it covers big data concepts in a more pragmatic way, using lots of case studies and real-world examples. It is one of the very few books on the market that talks mostly about the value proposition big data brings to the table rather than delving into the nitty-gritty of the technologies. You start to understand why big data can be a game changer in the near future. Cloud, NoSQL, MapReduce and Hadoop are all enablers, but you need to know why and when you want to use them.

In this article I try to summarize the key concepts the book talks about and the various case studies it uses to drive the point home. I have also added my own inputs wherever I felt the description needed more explanation or I had some extra information.
Wherever I use "the authors", I am referring to a section from the book; for additional material I have provided the related references at the end.
According to the authors, there is no rigorous definition of big data. In general, if the volume of data to be analyzed is so huge that it cannot fit in the memory of the computers used for general processing and cannot be analyzed with conventional analytical tools, we can call it a big data problem.
The authors begin the book by giving us a brief tour of reality. In today's world we cannot employ the traditional methods of collecting data and performing analysis, simply because the data is huge and noisy (messy). It may not always be possible to calculate an exact value due to that noise, but trends and predictions are the next best thing we can strive for. In this article I will argue that predictions and correlations are often more valuable than exact values. At its core, big data is about probabilistic analysis that produces predictions.
In this article I will first discuss the challenges we face when dealing with big data analysis. I will then discuss how data, if collected in the right form (datafication), can be analyzed and used in ways that were unimaginable some time back. After that I will discuss how big data affects the valuation of a business. Finally, I will discuss the risks that big data analysis poses in its current state.


2. Challenges

2.1 Challenge #1: Lots of data is available to analyze

Today we have the ability to collect, store and analyze vast amounts of data. We are collecting data from sensors, cameras, phones, computers and basically anything that can connect to the internet. We have so much data to work with that we no longer need to limit our analysis to a sample. We have the technology to gather everything possible and then make our predictions. The conventional approach of taking samples is no longer required, since we have affordable technology to work with very large datasets.
Sampling is a good and efficient way to perform analysis, but the accuracy of the results depends largely on the sample set. If the sample is not random, the results can be biased. It is also very difficult to find trends in a sample, since the data is discrete rather than continuous. When you have a large amount of data, your predictions are more accurate. Previously we did not have the technical ability to work with all the data, so we tended to take samples; today, with affordable storage and massive computing power, we can use a sample space of N=ALL, i.e. take all the data for the analysis.

Often, the anomalies communicate the most valuable information, but they are only visible with all of the data.

- Google's flu predictions are relatively accurate because they are based on the billions of search queries Google receives per day.
- Farecast, a company that predicts whether the airfare for a seat is going to go up or down, analyzes about 225 billion flight and price records to make that prediction [8,10].
- Xoom, a firm that specializes in international money transfers, some time back uncovered fraud in certain card transactions originating from New Jersey. It did so by spotting a pattern in the transaction data that was visible only when the entire dataset was analyzed; random samples would not have revealed it. Sampling can leave out important data points, and then the patterns go undiscovered (see the sketch below).
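
To make the sampling point concrete, here is a small Python sketch of my own (not from the book). The transaction data and the "suspicious" label are invented: a rare pattern that is obvious when we count over all records can easily vanish in a small random sample.

    import random
    from collections import Counter

    random.seed(0)

    # Invented data: 1,000,000 card transactions, of which a tiny fraction
    # (0.02%) follows a suspicious pattern originating from one region.
    transactions = ["NJ-suspicious"] * 200 + ["normal"] * 999_800
    random.shuffle(transactions)

    # N = ALL: the anomaly is plainly visible.
    print(Counter(transactions)["NJ-suspicious"])   # 200

    # A 0.1% random sample: the anomaly all but disappears.
    sample = random.sample(transactions, k=1_000)
    print(Counter(sample)["NJ-suspicious"])         # typically 0, sometimes 1 or 2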

2.2 Challenge #2: Messiness of data


Highly curated data is a thing of the past, when only a small amount of information was at hand and you wanted to use it to get exact results in the best possible way.
But now we are dealing with huge amounts of data (petabytes). Messiness (errors, noise) can creep in for many reasons:
  • The simple fact that the likelihood of error increases as you add more data points.
  • Inconsistent formatting of data coming from different sources.
  • When collecting data from 1000 sensors, some sensors may give faulty readings, adding to the errors.
  • Collecting data at a higher frequency can result in out-of-order readings due to network delays.
Cleaning is an option only with a small amount of data, where we could look for errors, formatting issues and noise and remove them before the analysis, ensuring the correctness of the result. But in today's world, when the data is huge, it is not feasible to clean up every error. The velocity of the data is also so high that guaranteeing its cleanliness in real time becomes practically impossible.
But the book suggests that more data beats clean data.

The authors argue that even if we leave some degree of error and noise in the data, its effect will be nullified because we use a huge number of data points. Any particular sensor reading may be incorrect, but the aggregate of many readings provides a more comprehensive picture and cancels out the erroneous ones. Instead of exactness, analysts should ask "Is it good enough?" We can give up a bit of accuracy in return for knowing the general trend.
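
Here is a tiny sketch of this "aggregate beats the individual reading" argument, with sensor values I made up: even when a couple of percent of the readings are wildly wrong, the aggregate stays close to the true value.

    import random
    import statistics

    random.seed(1)

    # Invented sensor fleet: the true value is 100.0, most sensors report it with
    # small noise, and about 2% of readings are wildly wrong (faulty sensors).
    readings = []
    for _ in range(100_000):
        if random.random() < 0.02:
            readings.append(random.uniform(0, 1000))   # faulty reading
        else:
            readings.append(random.gauss(100.0, 2.0))  # noisy but honest reading

    # Any single reading may be far off, but the aggregate stays near the truth.
    print(statistics.median(readings))   # ~100, barely affected by the bad 2%
    print(statistics.mean(readings))     # drifts a little, still shows the trend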

Sometimes the messiness of data can be turned to one's advantage, as Google's spell check has done. The program is based on all the misspellings that users type into the search window and then "correct" by clicking on the right result. With almost three billion queries a day, those corrections soon mount up [10].
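
The mechanics can be sketched very simply. This is not Google's actual pipeline, just my illustration of the core idea with an invented query log: record which correction users accept for each misspelling and suggest the most frequent one.

    from collections import Counter, defaultdict

    # Invented query log of (typed, accepted-after-click) pairs.
    correction_log = [
        ("recieve", "receive"), ("recieve", "receive"), ("recieve", "recipe"),
        ("definately", "definitely"), ("definately", "definitely"),
    ]

    suggestions = defaultdict(Counter)
    for typed, accepted in correction_log:
        suggestions[typed][accepted] += 1

    def suggest(query):
        """Return the most frequently accepted correction, or the query itself."""
        if query in suggestions:
            return suggestions[query].most_common(1)[0][0]
        return query

    print(suggest("recieve"))   # receive
    print(suggest("hadoop"))    # hadoop (no evidence, leave unchanged)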


Google used the same concept in its Translate service, which translates text from one language to another. The translations are based on the probability of words occurring in a particular order, i.e. how a particular word is generally used in, say, English, learned from the huge corpus of documents Google has collected. Some of those documents are poorly written, but because the system works on probabilities rather than exact matches, that kind of messiness gets ignored.

Messiness aids in creating big datasets, which are ideal for performing a probabilistic analysis.

There are several other examples where the messiness of big data was neglected in probabilistic analysis and the collection of all possible data helped improve an existing process. British Petroleum was able to determine the effect of crude oil on pipe corrosion by collecting pipe stress data from sensors over a period of time. Some of the sensors will have given wrong readings, since physical devices undergo wear and tear, but with data coming from a large number of sensors this noise has an insignificant effect on the calculations.

The inflation index that policy makers use to decide on interest rates and salary hikes is calculated by gathering price data for various commodities from different parts of the country. This is done manually and is a highly time-consuming and expensive task; decision makers have to wait a long time for the results. PriceStats instead scrapes commodity price data from different websites and trains its algorithms to calculate inflation from it, making the process largely automatic and extremely fast. Again, the web-scraped data is not always correct or up to date, but with a probabilistic approach such errors can be neglected. People are usually interested in the trend rather than the exact values on a daily basis, so the messiness of the data is acceptable.
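
As a rough illustration (not PriceStats' real methodology), a scraped-price index can be as simple as averaging each day's price relatives against a base day while tolerating missing or stale entries. The basket and prices below are invented.

    # Invented daily prices scraped for a small basket of items; a real pipeline
    # covers thousands of items and handles far messier input.
    daily_prices = {
        "2014-10-01": {"milk": 1.00, "bread": 2.00, "fuel": 3.00},
        "2014-10-02": {"milk": 1.01, "bread": 2.00, "fuel": 3.10},
        "2014-10-03": {"milk": 1.02, "bread": 2.05, "fuel": None},  # missing scrape
    }

    base = daily_prices["2014-10-01"]

    for day in sorted(daily_prices):
        prices = daily_prices[day]
        # Average price relative to the base day, skipping missing scrapes.
        ratios = [prices[item] / base[item]
                  for item in base if prices.get(item) is not None]
        index = 100.0 * sum(ratios) / len(ratios)
        print(day, round(index, 2))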

Consider "tagging" of photos on Flickr: the tags are irregular and include misspellings, but the sheer volume of tags allows us to search images in many ways and combine terms in a manner that would not be possible in a precise system [1].

2.3 Challenge #3: Correlations, not exactness

With so much data to analyze and with unavoidable messiness, the best we can strive for is to find trends. Exact answers are not feasible and, in most cases, not required either. For example, based on machine data we can predict that an engine is likely to fail in about two months, but the exact date cannot be calculated because the data is not 100% accurate.
Predictions and correlations tell you what is going to happen, with some probability, but not why it will happen. They help businesses "foresee events before they happen", allowing them to make informed and profitable decisions. Usually correlations are found by joining one or two datasets together.

If a user watched movie A, what is the chance that he will like movie B? If the user bought book A, will he be interested in book B? Amazon and Netflix were among the first in the recommendation space, which is based on deriving correlations between the item the user has bought and similarity scores with the other items available in the store.
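
Here is a minimal item-to-item similarity sketch of my own, assuming a toy purchase history; real recommenders at Amazon or Netflix are far more sophisticated, but the correlation idea is the same.

    from math import sqrt

    # Invented purchase history: which users bought which items.
    purchases = {
        "alice": {"Book A", "Book B"},
        "bob":   {"Book A", "Book B", "Book C"},
        "carol": {"Book A", "Book C"},
    }

    def buyers(item):
        return {user for user, items in purchases.items() if item in items}

    def similarity(a, b):
        """Cosine similarity between the buyer sets of two items."""
        ba, bb = buyers(a), buyers(b)
        if not ba or not bb:
            return 0.0
        return len(ba & bb) / sqrt(len(ba) * len(bb))

    print(similarity("Book A", "Book B"))   # high: frequently bought together
    print(similarity("Book B", "Book C"))   # lower: bought together less often
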
The authors give a number of case studies:
By analyzing its data, Walmart found a correlation between buying habits and weather conditions: people prefer to buy certain types of products in specific weather. Walmart uses this analysis to stock its inventory with a different set of products accordingly.

Google predicted flu-infected regions by finding patterns in search strings: a person infected with a particular disease is more likely to search for a particular set of keywords.

FICO invented a Medication Adherence Score, which measures how well we handle a drug prescription; the number indicates how likely we are to take prescriptions as they were written for us. The algorithm works out who does or doesn't fill their prescriptions: the more adherent we are, the higher the score. The score is based on personal parameters such as what a person buys, where he lives and what his food habits are. FICO appears to have found a correlation between a person's habits and his adherence to prescriptions. This data is valuable for health insurers, who can concentrate their follow-up on people with a lower score and not bother the others. [2]

Aviva can predict from a customer's personal data whether he suffers from a medical condition. This lets it skip the usual medical tests for customers who are predicted to be healthy.

Retail stores like Target use female customers' buying patterns to predict whether a customer is pregnant. This helps them send coupons targeted at purchases for the different stages of pregnancy.

The authors argue that a good correlation is one where a change in one variable affects the other, while a bad, or not so useful, correlation is one where a change in one variable does not affect the other significantly. The real challenge is finding the correct variables with which to establish the correlation.
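
In code, that variable-selection step often starts with something as plain as a correlation coefficient: keep the variables that move with the outcome, drop the ones that do not. The numbers below are fabricated for illustration, and statistics.correlation needs Python 3.10+.

    import statistics  # statistics.correlation requires Python 3.10+

    # Fabricated numbers: one candidate variable tracks the outcome, one does not.
    engine_temp      = [70, 75, 80, 85, 90, 95]
    paint_color_code = [1, 3, 2, 3, 1, 2]
    days_to_failure  = [400, 360, 300, 250, 180, 120]

    # Keep variables whose correlation with the outcome is strong; drop the rest.
    print(statistics.correlation(engine_temp, days_to_failure))       # close to -1
    print(statistics.correlation(paint_color_code, days_to_failure))  # close to 0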

In one of the case studies the authors discuss how the New York City public utilities department can predict which manhole is going to explode based on the age of the manhole and its prior history of explosions. Correlating a manhole's likelihood of exploding with its age and past history sounds intuitive, but arriving at that conclusion by analyzing a huge amount of heterogeneous data was a genuine big data challenge.

3. Datafication

Big data is all about analyzing large chunks of data to come up with something useful, and the data is the key input to any big data use case. Even though the data can be messy, incomplete and heterogeneous, one thing is fundamental: it should be in a form that can be read and analyzed by computers. The authors call the process of converting data into a form that can be used for analysis "datafication" (making the data quantifiable).
There is a fine difference between digitization and datafication. Digitization converts data into a form that can be stored on a computer or storage system, while datafication stores the data in a form that can be analyzed by algorithms.

People generally talk about digitizing every written document (scanning it and storing it to disk) so that it can be retrieved whenever needed. But the data has little value from an analysis point of view if it remains only a set of images, because the content of those documents is not in a form we can analyze: we cannot build indexes on images, so we cannot search the text inside them. Datafication is the next step: transforming these documents, or any information, into a form that computers can analyze. Google Books (http://books.google.com/) is a classic example. Google first scanned millions of books so that they were available for online reading; that is digitization. It then applied OCR to extract the words from the pages and stored them in a form from which a search index could be built; that is datafication.
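
A toy sketch of that second step: once OCR has produced text, an inverted index turns the scanned pages into something searchable. The "OCR output" below is invented; this is just the general idea, not Google's pipeline.

    from collections import defaultdict

    # Invented OCR output: page identifiers mapped to the extracted text.
    ocr_output = {
        "book1_page1": "the whale rose from the deep",
        "book2_page7": "deep learning of the sea currents",
    }

    # Datafication step: build an inverted index so the text becomes searchable.
    index = defaultdict(set)
    for page, text in ocr_output.items():
        for word in text.lower().split():
            index[word].add(page)

    print(sorted(index["deep"]))   # ['book1_page1', 'book2_page7']
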
Some case studies that emphasize the importance of datafication:

Maury's work of extracting data from old sea logs to find shorter and safer sea routes is one of the first examples of large-scale datafication of content. The data available in old navy logbooks was manually extracted and converted into a form that could be analyzed. Tabulated properly, together with added information such as weather conditions and tidal behavior, it helped find shorter routes that saved a lot of time and money. Even though this work was done entirely by hand and does not fall under the realm of modern big data analysis, it shows the importance of data in unveiling unknowns.

Datafication of a person's vital signs has become easy with the advent of health bands. They collect data about the vital signs and push it to a server where the analysis takes place; the person then gets valuable trends about his health from these inputs.

Datafication of location

The ability to track the location of a person or a thing in a consistent manner using GPS has led to various interesting use cases. In some countries the insurance premium depends on where and when you drive the car: in places where theft is high, the premium is likely to go up. You no longer have to pay a fixed annual insurance fee based primarily on age, sex and past record.

UPS uses GPS trackers in its trucks to gather data about the various routes. It can quantify things like which route took the most time, how many turns there were, how many traffic signals, and which areas are prone to congestion (where the truck moved slowly). This data is used to optimize routes so that trucks take shorter paths with fewer turns and less congestion, which has helped save millions of dollars in fuel cost.
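
A stripped-down sketch of what such telematics analysis could look like, with a GPS log I invented: count the left turns and the near-stationary samples that hint at congestion. UPS's real systems are obviously far more elaborate.

    # Invented GPS log: (timestamp in seconds, speed in km/h, turn direction).
    route_log = [
        (0,   40, None), (30,  5, None), (60,  0, "left"),
        (90,  35, None), (120, 2, None), (150, 0, "left"),
    ]

    left_turns   = sum(1 for _, _, turn  in route_log if turn == "left")
    slow_samples = sum(1 for _, speed, _ in route_log if speed < 5)

    print("left turns:", left_turns)                  # candidates for re-routing
    print("near-stationary samples:", slow_samples)   # a crude congestion signal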

The ability to collect someone's location through smartphones and home routers is turning out to be very useful. Targeted ads are shown to the user based on where he is and where he is predicted to go, and the locations of people's smartphones are used to estimate the traffic conditions plotted on Google Maps.

Datafication of relationships

Social interactions have always existed but were never formally documented. These relations can take the form of friendships, professional links or general emotions.

Facebook enabled the datafication of social relationships, i.e. friendships, in the form of the social graph. This graph can give you information about your friends, friends of friends (FOAF), your posts, photos and likes: almost everything you do on Facebook. The best part is that the social graph can be analyzed by algorithms to find out details about a person and his interactions. Credit card companies consult social graphs before issuing cards, and recruiters check Facebook profiles to judge a person's social behavior [7].
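
As a flavor of how algorithms walk such a graph, here is a tiny friends-of-a-friend query over an adjacency map with made-up names; the real Facebook graph and its APIs are, of course, a different matter.

    # Invented friendship graph as an adjacency map.
    graph = {
        "asha": {"bala", "chen"},
        "bala": {"asha", "dana"},
        "chen": {"asha", "dana"},
        "dana": {"bala", "chen"},
    }

    def friends_of_friends(person):
        """People reachable through a friend but not already a direct friend."""
        direct = graph[person]
        foaf = set()
        for friend in direct:
            foaf |= graph[friend]
        return foaf - direct - {person}

    print(friends_of_friends("asha"))   # {'dana'}
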
F-commerce is a new business paradigm based on the idea that if a friend has bought a product, his friends are likely to buy it too.
Bing boosts the rankings of movie theaters and restaurants in its search results based on the likes and posts in the user's Facebook social graph. The prediction rests on the premise that if your friend likes a theater or a restaurant, you are likely to enjoy it too.

Twitter datafied people's sentiments. Earlier there was no effective means of collecting people's comments and opinions on a topic; with Twitter it is very convenient. Sentiment analysis is the big data way of extracting information from tweets.
Predictions of a Hollywood movie's success or failure are now guided by the tweets about it, and media channels monitor tweets on various topics to gauge public opinion.
There has been research on performing real-time semantic analysis of tweets to detect breaking news [3], and tweets are analyzed to spot traffic jams, cases of flu, epidemics and so on [4,5].
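
At its simplest, tweet sentiment scoring can be sketched as lexicon lookups; real systems use far richer language models, but the idea is the same. The word lists and tweets below are made up.

    # Invented word lists and tweets; real systems use far richer language models.
    POSITIVE = {"great", "love", "awesome", "win"}
    NEGATIVE = {"terrible", "hate", "flop", "jam"}

    tweets = [
        "love the new movie and its awesome soundtrack",
        "terrible traffic jam on the bridge again",
        "great quarter for the company",
    ]

    def score(tweet):
        words = tweet.lower().split()
        return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

    for tweet in tweets:
        print(score(tweet), tweet)   # positive, negative or neutral mood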

Investment bankers use tweets to sense the public mood about a company and its policies; favorable tweets mean people are more likely to invest in the company.

LinkedIn pioneered the datafication of the professional network. It represents professional relationships as a graph in which you can see how you are connected to another colleague, get recommendations for your work from colleagues, and find out what kind of work colleagues are doing at different companies to inform career decisions. These professional graphs are heavily used by recruiters today to learn how a person is judged by his colleagues, what his skill set is, how well connected he is, and so on. This was not possible earlier by just scrutinizing a resume [7].

4. Value of Data

The value of data lies in its use and its potential for multiple reuses. With so much data being collected, new and previously unthinkable uses keep coming up. More than the primary use of the data, it is the secondary or tertiary uses that are the most valuable. Sometimes the value can only be unleashed by combining multiple datasets: data mashups (joining two or three data sources together). Such joins reveal correlations that are quite surprising and have real business impact. It is this untapped potential that makes big data analytics so lucrative and in such demand.
Companies like Zillow combine property data with local business data to calculate a walk score, which tells potential buyers how walkable the daily stores, schools and restaurants are from a particular property.
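
Here is a toy version of such a mashup, joining an invented property listing with an invented local-business dataset; the real Walk Score methodology is of course far richer than a distance cut-off.

    # Invented property listing and local-business dataset.
    properties = [{"id": "P1", "lat": 40.750, "lon": -73.990}]
    businesses = [
        {"name": "grocery", "lat": 40.751, "lon": -73.991},
        {"name": "school",  "lat": 40.760, "lon": -73.980},
        {"name": "cafe",    "lat": 40.800, "lon": -73.950},
    ]

    def rough_km(a, b):
        # Crude flat-earth distance; good enough for a sketch.
        return 111 * ((a["lat"] - b["lat"]) ** 2 + (a["lon"] - b["lon"]) ** 2) ** 0.5

    # The "mashup": join the two datasets on proximity.
    for prop in properties:
        nearby = [b["name"] for b in businesses if rough_km(prop, b) <= 2.0]
        print(prop["id"], "walkable amenities:", nearby)   # ['grocery', 'school']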

Data collected for one purpose can be put to many other uses. Its value needs to be considered in terms of all the possible ways it can be employed in the future, not simply how it is being used in the present.

- Google collected data for Street View but reused the GPS data to improve its mapping service and to support its self-driving car.

- Google's reCAPTCHA service is another example where data collected for one purpose serves multiple uses. The reCAPTCHA images are generally taken from Street View and Google Books, where the computer could not digitize them successfully; when users type in the text, they help Google digitize those images. Google uses this to improve Maps, Books and Street View.

-Farecast harnessed the data from previously sold tickets to predict the future price of the airfare.

- Google reused search terms to uncover the prevalence of flu.

-SEO or search engine optimization is primarily driven by the search terms.

- Companies use search queries to find out what users are looking for and then, perhaps, shape their product strategy accordingly.

- Google reused the data from its books corpus to feed its Translate service and improve the translations.
- FlyONTime.com combines weather and flight data to find out delays in flights at a particular airport.

Sometimes data that is otherwise thought to be useless, like typos or clicks, can be used in innovative ways. This has been termed "data exhaust". There are companies that use click patterns on a website to find UI design issues, and Google is a pioneer in using this extra data to improve its services.

A study conducted by GE suggests that service data from aircraft engines can provide insights leading to a 1% improvement in fuel efficiency, which has the potential to save about $1Bn per year. This has been referred to as the "Power of 1".
Today we can collect sensor data from machines and monitor it continuously for anomalies or patterns that indicate a particular part is about to wear out. Service engineers can then proactively visit the site and take corrective action: rather than reactive maintenance, they provide proactive care to the engines. Servicing happens on a need basis rather than per the annual maintenance contract, which also means zero unplanned downtime: maintenance happens when the part is about to wear out, not after it has stopped working. The data from the engine raises a service request whenever the analytics finds that a part needs servicing [6,9].
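
A minimal sketch of such a condition-based trigger, with readings and thresholds I invented: flag a service request when recent sensor values drift well outside the historical norm.

    import statistics

    # Invented vibration readings: a long normal history and a few recent values.
    history = [100 + i % 3 for i in range(1000)]   # normal operating range
    recent  = [100, 101, 118, 121, 125]            # latest readings from the part

    mean  = statistics.mean(history)
    stdev = statistics.pstdev(history)

    # Flag readings that drift well outside the historical norm.
    drifting = [r for r in recent if abs(r - mean) > 3 * stdev]
    if len(drifting) >= 3:
        print("open proactive service request: part is likely wearing out")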

A company's worth today is measured not only by its book value (physical assets) but also by the kind of data it generates and maintains, and the value of that data again depends largely on its secondary usage potential. A major reason for the aggressive valuations of Facebook, WhatsApp and Twitter is the massive amount of user data they hold.

5. Risks

The huge amount of data and the ability to analyze it for different purposes have also produced some negative and dangerous scenarios which, if not handled properly, can cause panic among people. The authors highlight three main risks.
Privacy:
Google Street View can show people's houses and cars, which can give burglars an idea of a person's wealth. Google will blur the images on request, but asking for the blur itself gives the impression that there is something valuable to hide.
Based on the search strings coming from a particular company's employees, competitors can infer the kind of research it is doing.
Google scans your Gmail inbox to find your hotel bookings and travel dates and then sends you reminders before your trip. It also plots this data on Google Maps, so that when you search for that hotel it shows the dates of your reservation. This is a serious privacy concern, but Google sees it as a business opportunity.

Propensity: Big data predictions can be used to judge a person and punish him. A bank may deny a loan just because an algorithm predicted that the person will not be able to repay it. The US parole system uses predictions of whether a person is likely to commit a crime in future to make parole decisions. This may be right in some cases, but in others it can stop the right person from getting what he deserves.

Dictatorship of data: Over-reliance on data and predictions is also bad. Poor-quality data can lead to biased predictions, which can be wrong and end in catastrophic decisions. It now appears that America went to war in Vietnam partly because of a few high officials' obsession with data.

6. Conclusion

I have tried to summarize Viktor Mayer-Schönberger and Kenneth Cukier's book [8] to the best of my ability, and to bring in some more examples and my own understanding of the subject, which I thought might add some clarity to the topic. Even though I have seen mixed reviews of the book on the net, I personally feel this is a book for everyone. It is excellent for someone who wants to know what big data is without getting into the technical aspects (mostly managers, architects, principal engineers). It also acts as a reference for someone who is already working with big data but does not see the practical value of the technology (mostly developers).
There are a few other sections of the book that discuss how, as an individual or a business, one should find a place in the big data value chain. I think this deserves an in-depth discussion of its own, and it may be a good idea to write another article focusing on the big data value chain.


References

  1. http://www.data-realty.com/pdf/BigDataBook-TenThings.pdf
  2. http://patients.about.com/od/followthemoney/f/What-Is-The-FICO-Medication-Adherence-Score.htm  
  3. Predicting Flu Trends using Twitter data, IEEE Conference on Computer Communications Workshops (April 2011), pp. 702-707 by Harshavardhan Achrekar, Avinash Gandhe, Ross Lazarus, Ssu-Hsin Yu, Benyuan Liu
  4. TwitterStand: News in tweets, by Jagan Sankaranarayanan , Benjamin E. Teitler , Michael D. Lieberman , Hanan Samet , Jon Sperling
  5. Semantic twitter: analyzing tweets for real-time event notification In Proceedings of the 2008/2009 international conference on Social software: recent trends and developments in social software (2010), pp. 63-74 by Makoto Okazaki, Yutaka Matsuo
  6. http://www.ge.com/docs/chapters/Industrial_Internet.pdf
  7. Assessing Technical Candidates on the Social Web, IEEE Software, vol.30, no. 1, pp. 45-51, Jan.-Feb. 2013, by Andrea Capiluppi, Alexander Serebrenik, Leif Singer
  8. Big Data: A Revolution That Will Transform How We Live, Work and Think, Viktor Mayer-Schönberger and Kenneth Cukier, John Murray Publishers, UK, 2013
  9. http://rohitagarwal24.blogspot.in/2013/12/a-software-engineer-perspective-on-iot.html
  10. "Data, Data Everywhere", Kenneth Cukier, The Economist Special Report, February 2010, pp. 1-14