What’s ‘big data’? How big should data be to be called ‘big’?

Big data talk usually centers around those four V’s: We say that data is big if it has volume, variety, velocity and veracity.

Volume? That’s easy to understand. It means lots and lots of data; much, much more data than what a single server’s disk can hold.

Variety? That means not just numbers (Tendulkar’s test and ODI scores), but also text (all that prose and poetry in praise of Tendulkar), video (Tendulkar’s straight drive) and audio (Tendulkar exclaiming “Ailaa!”).

Velocity? That’s data coming in real fast. Data on the trajectory of a launch vehicle, data on the growing spread of the Ebola virus, data about how rapidly some song is going viral on the Internet.

Veracity? Data that is valid, accurate, verifiable, not manipulated and trustworthy.

What do you do with all this big data? You use this data to find associations, insights or connections. And the real fun starts when you discover hidden associations; associations that were otherwise hard to imagine or deduce.

For example, associations that tell you that the beta blocker you take to check hypertension could also possibly block cancer. Or purchasing patterns that tell you that the buyer making these purchases may be a pregnant young lady even if her father still doesn’t know.

That’s what got everyone excited. You apparently only had to throw a lot of data into a number-crunching machine, and  unexpected – but valuable – associations could emerge out of the blue!

Changing paradigm

It wasn’t like that in the past. Fifty – or even twenty – years ago data was scarce! Those days we first postulated a plausible hypothesis, gathered data to verify it, and then used this data to declare the hypothesis true or false.

But suddenly – with a surfeit of data from digital networks, social media, sensors, probes,  smartphones and everything else – the boot is now on the other foot. It is now data first, and hypothesis later.


And today’s data isn’t just numbers neatly arranged in a spreadsheet or database. Such structured data may be no more than a small fraction of the big data pie; most of the data is unstructured or semi-structured.

It could be data read off from mails or log files; sales, medical or travel documents; transcribed conversations from phone calls and meetings; or photos, scanned pictures, animations (detailing a process) and videos (footage from CCTVs).

How do we play with all this big data? To start with, we distribute this data across multiple servers. If your first server disk fills up, seamlessly go to the next … and you can go on to fill up dozens or hundreds or thousands of server clusters with data.

How do we process this multitude of data? Instead of moving the data to the processor, we move processors to the data! We write programs to tell the processors what (and how) to do, and we then ask all processors to start processing in tandem. In other words, we do massive parallel processing of this multitude of data.

Just as an example, imagine that you have an enormous mountain of cricket-related data. You want to count how many times the names (Cronje) and (Gilchrist) appear in this ‘mountain’. The big data way would be to distribute this data across a hundred servers, allot a processor to each server, instruct each processor to look for these two names, and then ask all hundred processors to ‘go’. Each processor would finish its count in a jiffy, and you could get the total count an instant later.
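To make the counting idea concrete, here is a minimal sketch in Python. It assumes the 'mountain' has already been split into chunks of text, and it uses a thread pool as a stand-in for the hundred servers; the function names and the sample sentences are invented for illustration, and this is not how Hadoop itself is programmed.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

NAMES = ("Cronje", "Gilchrist")

def count_names(chunk):
    """'Map' step: one worker counts both names in its own chunk of text."""
    words = chunk.replace(",", " ").replace(".", " ").split()
    return Counter(w for w in words if w in NAMES)

def total_counts(chunks, workers=4):
    """'Reduce' step: add up the partial counts returned by every worker."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_names, chunks):
            total.update(partial)  # Counter.update adds the counts together
    return total

# Made-up sample text; each string plays the role of one server's share.
chunks = [
    "Cronje scored a fifty. Gilchrist kept wicket.",
    "Gilchrist opened the batting, Gilchrist hit a six.",
    "Cronje bowled. Cronje fielded at cover.",
]
totals = total_counts(chunks)
print(totals["Cronje"], totals["Gilchrist"])
```

Each worker only ever sees its own chunk, which is the whole point: the counting happens where the data sits, and only the small partial totals travel back to be added up.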

Let us suppose that (Cronje) and (Gilchrist) appear approximately the same number of times. We could now ask the processors to count the combined instances of the words (Cronje and dollar) and (Gilchrist and dollar). If the (Cronje and dollar) count is much higher, then we might have found a hidden association!
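A rough sketch of this 'combined count' is shown below. It assumes each document is a short piece of text and simply asks: of the documents that mention a name, what fraction also mention the keyword? The helper names and the sample documents are made up for illustration, and a high rate only flags something worth a closer look; it proves nothing by itself.

```python
def tokens(doc):
    """Lower-case a document and split it into a set of words."""
    return set(doc.lower().replace(".", " ").replace(",", " ").split())

def cooccurrence_rate(documents, name, keyword):
    """Fraction of the documents mentioning `name` that also mention `keyword`."""
    name, keyword = name.lower(), keyword.lower()
    with_name = [t for t in map(tokens, documents) if name in t]
    if not with_name:
        return 0.0
    both = sum(1 for t in with_name if keyword in t)
    return both / len(with_name)

# Invented sample documents, not real data.
docs = [
    "Cronje and the dollar report.",
    "Cronje took a wicket.",
    "Gilchrist walked without waiting.",
    "Gilchrist hit a hundred.",
    "Cronje, dollar, headline.",
    "Gilchrist signed a dollar contract.",
]
cronje_rate = cooccurrence_rate(docs, "Cronje", "dollar")
gilchrist_rate = cooccurrence_rate(docs, "Gilchrist", "dollar")
print(cronje_rate, gilchrist_rate)
```

Comparing rates rather than raw counts keeps the comparison fair even when one name appears far more often than the other.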

Here’s another example. Instead of a word like (Cronje), let us consider a Karnataka car number plate, say (KA 10 N 10). Imagine that we have 100 hours of video footage across  Bangalore streets acquired over a week. We will distribute one hour of footage to each of the 100 processors, ask each processor to ‘look’ for the car (KA 10 N 10), and note the time, date and location of every static frame in the footage in which the car was spotted. By aggregating the data from each of the 100 processors, we can track the movement of the car over the week.
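The aggregation step can be sketched as follows. The hard part, reading number plates off video frames, is assumed to be done already: each frame here is a small record listing the plates seen in it. The record layout, function names, dates and locations are all hypothetical.

```python
from datetime import datetime
from itertools import chain

def find_sightings(footage, plate):
    """One processor's job: keep the frames in which the plate was read."""
    return [(f["time"], f["location"]) for f in footage if plate in f["plates"]]

def track(all_footage, plate):
    """Merge every processor's sightings and sort them by time."""
    per_processor = [find_sightings(footage, plate) for footage in all_footage]
    return sorted(chain.from_iterable(per_processor))

# Two made-up hours of footage, each handled by a different processor.
footage_a = [
    {"time": datetime(2016, 5, 3, 9, 15), "location": "MG Road",
     "plates": {"KA 10 N 10", "KA 01 A 1234"}},
    {"time": datetime(2016, 5, 2, 18, 40), "location": "Indiranagar",
     "plates": {"KA 05 B 4321"}},
]
footage_b = [
    {"time": datetime(2016, 5, 1, 8, 5), "location": "Jayanagar",
     "plates": {"KA 10 N 10"}},
]
route = track([footage_a, footage_b], "KA 10 N 10")
for when, where in route:
    print(when, where)
```

Sorting the merged sightings by time is what turns a hundred separate lists into a single trail across the week.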

While (KA 10 N 10) belongs to a much loved Indian cricketer – Anil Kumble – we could just as well be tracking car numbers possibly belonging to a terrorist organization. With big data we could track hundreds of such suspicious vehicles, and their interactions, on a regular basis.

Here’s another interesting example: a flu epidemic is breaking out, and the health authorities are trying to determine where the flu will hit. They stumble upon a valuable but unexpected ally: Google searches and Twitter and FB feeds! By scanning all this social media data one can identify states or cities with the most queries or messages relating to a fever, headache or sore throat! Plot these instances as dots on a map and you instantly get the big picture.

And there are more: If you own a car you are more disciplined about taking your medication; strawberry Pop-Tarts sales shoot up seven-fold before a hurricane; and you are more likely to be cleared to board a crowded plane if you’ve asked for a vegetarian meal!

When should you call in the elephant

In all these examples we used Hadoop’s programming framework to solve big data problems. Sometimes Hadoop works like a charm, but at other times it struggles or fails. So when does Hadoop get it easy, and when is it hard?


Speaking informally, Hadoop likes to be left alone to do its stuff. You can pile an enormous mountain of data on this elephant’s back and it cheerfully takes on the burden … but don’t break its rhythm or repeatedly change the tune! The elephant doesn’t like to be hassled or rushed.

So call in the elephant if you want to count or sum really huge volumes of archived data; if you want to make those ‘top 3’, ‘top 5’ or ‘top 10’ lists; if you want to match patterns or images; or if you want to discover correlations that allow Google or Amazon to make recommendations, and Netflix to know, even before you know, what films you might like!
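A ‘top 3’ list is a good example of elephant-friendly work, and a small sketch shows why. It assumes each server has already counted the items in its own share of the data; the server names and the counts are invented.

```python
from collections import Counter

def top_n(partial_counts, n=3):
    """Merge per-server counts first, then rank the combined totals."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)  # adds counts item by item
    return total.most_common(n)

# Made-up per-server tallies of items sold.
server_a = Counter({"tea": 3, "bat": 1, "ball": 2})
server_b = Counter({"tea": 1, "bat": 2, "pads": 1})

top3 = top_n([server_a, server_b])
print(top3)  # [('tea', 4), ('bat', 3), ('ball', 2)]
```

Note that the merge comes before the ranking: an item that is only second on every individual server can still top the combined list.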

Also call in the elephant if you want to extract, transform and load data from diverse sources; classify or group people with similar likes or interests; look for evidence of fraud in a huge messy data pile, or scan thousands of medical images to detect signatures of malignancy.

However, there are many situations where you want the elephant to be agile and responsive. You want to query an online database and obtain near instant answers, or you want to rapidly solve problems that involve processing of streaming data. In such cases the Hadoop elephant struggles, essentially because its batch-oriented MapReduce model writes intermediate results to disk between steps and was not built for quick, repeated passes over data. A horse named Spark, which keeps working data in memory, will probably be required to provide the sparkle.

Big data in action

So, whether you ride an elephant or a horse, the objective of the big data engine remains the same: Uncover new associations or patterns, and exploit these patterns to help society or business.

If big data analysis reveals the reasons why a subscriber moves from Vodafone to Airtel, then set up systems to identify and prevent such churn. If  you suspect that a next gen Nick Leeson may attempt another heist, then learn from the Barings Bank mistakes and create checks and processes to quickly identify and punish potential rogue traders.

That’s how things should evolve. Big data should lead to a bigger bang.

Today everyone seems to agree that the big bang will happen when big data marries artificial intelligence (AI). Big data will provide the knowledge, and AI will convert this knowledge into winning actions.

AI, what’s that?

Most of us think that intelligence is ‘human’; we are the smart guys and computers are the dumb morons. Could computers get smart, could they think? If yes, that would be ‘artificial’ intelligence or AI.

What should a computer do to be called ‘smart’ or ‘intelligent’? Rapidly calculate the value of Pi up to its first million digits? Or multiply two very big numbers in a fraction of a nanosecond? We all agree today that there’s nothing smart about that; all computers were built for such dumb computations.

Would a computer be smart if it consistently defeated Garry Kasparov at chess? Deep Blue created a flutter by achieving a famous win, but it wasn’t necessarily smart. The massively parallel and powerful Deep Blue could calculate millions and millions of future positions and choose the best based on a ‘win-maximizing’ algorithm. But it didn’t have a mind of its own.

If a computer uses algorithms with explicitly programmed (‘if-then’) rules, that tell you exactly what to do in every conceivable case, then it cannot really be called smart.

A computer becomes smart if it can, in some sense, ‘break free’. With explicitly programmed rules you are always telling the computer: “do this” or “don’t do this”. A smart computer, on the other hand, will try to figure out what to do on its own. It will try to learn.

The learning platform is provided by neural networks with architectures that mimic the human brain. Individually every neuron is pretty dumb and simple-minded, but collectively,  they can be taught to become very smart.

Cross the road

Consider, for example, the problem of teaching a computer how to cross a road. If the algorithm only lists very precise and explicit rules (“Look left”, “look right”, “don’t cross if you see a car or cat”, “use only zebra crossings”, “wait for the green light” etc.) that’s not smart. But if the computer is shown videos of 10,000 people who safely crossed the road, and also of 10,000 people who failed to cross the road, and if it then consistently crosses the road safely, then you have to say it has become ‘smart’.

How did the computer ‘learn’ to cross the road? Basically it was ‘trained’. Every time it tried to cross the road and failed it was told “Look, this is why you failed”. Every time it successfully negotiated a difficult stretch it was ‘rewarded’. A vast variety of such experiences are created and ‘stored’. As these simulated experiences proliferate, the computer keeps getting ‘smarter’. Today’s ‘driver-less’ cars are a good example of how smart you can get.

How did Google learn to write and translate?

There was a time when you lost marks in your English composition exam for wrong spellings. When you wrote ‘seperate’ instead of ‘separate’, ‘accomodate’ instead of ‘accommodate’, or ‘occurence’ for ‘occurrence’. That won’t happen now because your computer will automatically insert the correct spellings.

How does the computer do it? You might guess that it has a built-in dictionary, and a library of the most common ‘wrong’ spellings that it would look up and correct. But that’s exactly like those ‘if-then’ rules. It doesn’t make the computer smart.

Today’s smart computer learns from the available statistics. It uses big data analytics to count how often ‘separate’ appeared versus how often ‘seperate’ appeared (remember our example of Cronje versus Gilchrist?). The more it scans, the more it keeps finding the spelling ‘separate’ … and the better it learns that ‘separate’ is indeed the right spelling.
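The counting idea behind this can be sketched in a few lines. Real spelling correctors are far more elaborate; this toy version simply assumes we already know which spellings are competing and picks whichever one the corpus uses most. The function name and the sample sentence are made up.

```python
from collections import Counter

def preferred_spelling(corpus_words, variants):
    """Return the variant that appears most often in the corpus, or None."""
    counts = Counter(w for w in corpus_words if w in variants)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Invented mini-corpus: 'separate' appears three times, 'seperate' once.
corpus = ("please keep these separate we need separate rooms "
          "a seperate bill is fine separate entrances").split()

best = preferred_spelling(corpus, {"separate", "seperate"})
print(best)  # separate
```

With a corpus this small the answer is fragile; the claim in the text is that with enough scanned text the correct spelling wins by a wide margin.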

In fact Google has taken this statistical learning idea much farther: it uses it for translations! The explicit rule-based translator, attempted in the 1950s and 1960s, had dictionaries linking synonyms from different pairs of languages with a lot of syntax rules (telling for example that, when you translate from English to Hindi, the verbs should appear at the end). Such ‘machine translators’ failed badly beyond a point.

Google did something very different. It entered millions of documents in which authenticated versions of the same text were available in multiple languages. So, when you tried to translate, e.g., a French sentence to English, Google scanned all the available French -> English translated texts, did its statistical ‘machine learning’ and came up with the most probable ‘good’ translation.
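A heavily simplified sketch of ‘most probable translation’ follows. It assumes the parallel texts have already been cut into matching phrase pairs, which is itself a hard problem, and it is only an illustration of the statistical idea, not a description of Google’s actual system. The function names and the sample pairs are invented.

```python
from collections import Counter, defaultdict

def build_phrase_table(aligned_pairs):
    """Count how often each source phrase was translated as each target phrase."""
    table = defaultdict(Counter)
    for source, target in aligned_pairs:
        table[source][target] += 1
    return table

def most_probable(table, source):
    """Return (translation, estimated probability), or None if never seen."""
    options = table.get(source)
    if not options:
        return None
    target, count = options.most_common(1)[0]
    return target, count / sum(options.values())

# Made-up aligned French -> English phrase pairs.
pairs = [
    ("bonjour", "hello"),
    ("bonjour", "hello"),
    ("bonjour", "good morning"),
    ("merci", "thank you"),
]
table = build_phrase_table(pairs)
result = most_probable(table, "bonjour")
print(result)
```

No grammar rule appears anywhere in this sketch; the ‘knowledge’ is nothing more than counts taken from translations that humans have already made.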

If the ‘intelligent’ user accepts this translation, Google learns from this feedback. If the user wants to tweak the translation a bit, Google willingly lets you do it. But if the user eventually rejects it … well, even that offered valuable learning. The learning never stops.

An added advantage of statistical machine translation is that syntax rules are not needed, and the idiom is often captured more successfully. A little while ago I  asked Google to translate <plus ça change, plus c’est la même chose> into English. It did an admirable job to come up with <what goes around comes around>.


Indeed we aren’t far away from the day when a computer will use ideas of  statistical machine learning to ‘storify’. Give it some keywords, point it to some data sources, hint at the broad message you want to convey, and the computer will come up with a very plausible ‘copy’. But I do wonder who will hold the copyright.

For now there are already algorithms for ‘data storytelling’.  Data, encapsulated in a dazzling animated presentation with bar charts, pie charts, trend analysis, statistical hypothesis testing, and balloons that grow till they pop, is incredibly powerful. But if you tag a story to this data it becomes unforgettable. Quill, from Narrative Science, is a good example of data storytelling.

Doctor in Jeopardy!?

Consider the example of your old doctor named Watson. Dr Watson examined you diligently, asked you questions about your symptoms, allergies and existing medical condition, occasionally asked for specialized medical tests and scans, made a diagnosis, wrote out his prescription … and you recovered quickly.

A year later you fall ill again, but Dr Watson is sadly no more. So you call in IBM’s ‘artificial’ Watson. He’s already won Jeopardy!, but can he treat you for your suspected jaundice? Try telling Watson your symptoms, and he’ll surprise you with great insight. This is because Watson can quickly dig up all your old medical records and prescriptions, compare them with the largest medical data repository ever created, and make guesses on your likely ailment with probability estimates for every option!

So is this ‘Dr’ Watson smart? Douglas Hofstadter, the eminent professor of cognitive science, doesn’t think so, but he’s probably looking at a much more technical definition of AI. IBM Watson shows exceptional mastery in natural language processing. Instead of returning a bland ‘I’m-not-programmed-to-answer-this-question’, it wades through a mountain of unstructured data and images, and makes surprisingly good guesses about your ailment. Watson is a fast learner, and it will get better and better as the learning improves. Already Watson can probably identify lung cancers with greater accuracy than real flesh-and-blood doctors.

It is interesting that while Watson takes natural language as an input to get to insight, Quill offers natural language as an output by crunching analytical insight. Either way natural language is poised to become a vital cog in the future AI engine.

What does the AlphaGo win mean?

In March 2016, Google’s AlphaGo defeated one of the world’s best Go players, Lee Sedol, with effortless ease. This brought back memories of Kasparov versus Deep Blue, although the two contests were hardly comparable.

In terms of complexity, Go is much, much more complex than chess. So complex that the ‘brute force’ enumeration method that worked with Deep Blue would completely fail in this case. For AlphaGo to win, it definitely had to ‘learn’ on the way. There has been a huge amount of discussion on AlphaGo’s enigmatic Move No 37 in the second game that it won. The probability that a professional Go player would make that move was 1 out of 10,000. “It’s not a human move; I’ve never seen a human make such a move!”, Fan Hui, a European Go champion, exclaimed. But AlphaGo made that move “through its own process of introspection and analysis” … and won!

This win, rather like Leicester City winning the English Premier League, was a very big surprise. No one expected it to happen for many more years. Scientific American wrote that it has “startling implications for the future of artificial intelligence”. Fan Hui could only keep gushing: “So beautiful, so beautiful!”

AI and automation

While Watson and AlphaGo could well be the future face of AI, today’s AI is likely to be more practical and down to earth. Think of the first level (“T1 or L1”) support that companies offer, often out of call centers. Things are pretty trivial and straightforward here and you wonder whether you really need a ‘man-in-the-loop’, especially if he has to be awake at 3 am.

Apart from offering his human voice (which Donald Trump will ridicule if it is Indian) there isn’t much else that this person can offer. He’s usually looking at a template, asking questions in the proper sequence, and giving answers with suitable decorum even when the chap at the other end is unbelievably rude.

You can almost guess the outcome. T1 will probably escalate the problem to T2 who will figure out that he needs to enter the user’s network to delete some cache files, change some database settings or do something else that’s certainly not earth-shaking.

So why do we need T1, or indeed even T2 in many cases? T1 is essentially gathering data and clues interactively to reach a preliminary ‘diagnosis’, and T2 is trying to be the ‘doctor’ attempting different possible interventions.

It seems eminently feasible to use AI to create a kind of ‘learning software robot’ that can perform both the T1 and T2 functions faster, smarter and better. Indeed, if you read the papers, that’s exactly where every Indian IT company wants to go. Automation reduces head counts, speeds up delivery, guarantees greater reliability, and increases customer confidence.

The cat and the eagle

What about the future? Although AI has gone through at least two ‘blow hot blow cold’ phases in the past, this time things are likely to stay hot. Kevin Kelly, the founding executive editor of Wired, says that AI will keep improving because of “this perfect storm of (cheaper) parallel computation, bigger data and deeper algorithms”.

Kelly also suggests that we are really heading towards artificial smartness, the ‘nerdily autistic’ sibling of artificial intelligence. While that appears slightly callous, it is also strangely reassuring. We may not need blade runners to vanquish the rogue AI replicants that we all privately fear.

But the AI nirvana is still many years away. For now we are trying to add learning layer over learning layer on the neural network platform to get computers to recognize a cat based on its paws, claws and whiskers (but I still haven’t heard “meoow”). Facebook is using these layers to classify and identify faces … and so the adventure continues.

James Vincent explains this AI learning progression very well in a recent article in The Verge. He says that AI will one day be like a plane that can fly. But it can never fly like the eagle that swoops down with infinite grace to capture its prey.

2 thoughts on “Big data, AI and all that”

  1. Hello Sir,

    Most technologies seem to be stamped with some animal and given a brand-new, non-existent, meaningless name. Kids are good at naming things: Doug Cutting’s son used to call his toy elephant “hadoop”, so he named the framework Hadoop. But I liked the way you interpreted Hadoop/Spark as elephant/horse. Amazing example for explanation in this article!

    As of now, data processing on clustered nodes of the Hadoop distributed file system scales linearly over petabytes of data. I am sure there will be a time when this linear nature turns projectile. 10, 20, 30 years down the line? No idea! Perhaps AI will benefit from the best features of Hadoop, like its file system and data locality. Hope to see AI in action before Hadoop turns out to be projectile in nature.

