We ask this question every year in India: *Will it be a good monsoon year?*

Speculation typically starts in April, and the anxiety grows rapidly as we approach the end-weeks of May. There are the official forecasts from the India Meteorological Department (IMD), and, in recent years, also forecasts from other players such as Skymet Weather, Potsdam Institute for Climate Impact Research (PIK), and South Asian Climate Outlook Forum (SASCOF).

The answer everyone wants to hear is that it will be a ‘normal’ monsoon.

India’s long-period average (LPA) rainfall, during the summer monsoon months of June through September (JJAS), over the past 50 years, is about 90 cm. This means that if you gather all the JJAS monsoon rainfall in a heavenly bucket and pour it all over India, there would be a water column 90 cm high everywhere in the country. We define this to be the ‘normal’ figure.

If the prediction is within ±5% of the LPA, the rainfall is defined to be ‘normal’. If the deviation is between 5% and 10% of the LPA, the rainfall is either ‘above normal’ or ‘below normal’. And if the deviation is more than ±10% of the LPA, then it is either ‘excessive’ or ‘deficient’. ‘Deficient’ rainfall is more worrisome, and there is then talk of a ‘bad’ or ‘failed’ monsoon.
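These bands are easy to state as code. A minimal sketch (the function name is ours, purely for illustration; the percentage bands and the 90 cm LPA are from the text):

```python
def classify_monsoon(predicted_cm, lpa_cm=90.0):
    """Classify a predicted JJAS rainfall total against the long-period average."""
    deviation = 100.0 * (predicted_cm - lpa_cm) / lpa_cm   # percent deviation from LPA
    if abs(deviation) <= 5:
        return "normal"
    if abs(deviation) <= 10:
        return "above normal" if deviation > 0 else "below normal"
    return "excessive" if deviation > 0 else "deficient"

print(classify_monsoon(92))   # about 2% above 90 cm -> normal
print(classify_monsoon(83))   # about 8% below -> below normal
print(classify_monsoon(78))   # over 13% below -> deficient
```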

Even a normal rainfall could be bad news: there could be individual regions with very high or very low rainfall, or the rainfall may not be evenly spread across the four monsoon months. Good techniques for monsoon prediction are therefore vitally important, and *we must ideally develop such techniques in India itself* because the rest of the world isn’t impacted and doesn’t really care.

How does one predict rainfall? The best way would be to solve the equations of the underlying physics and dynamics – which, surprisingly enough, are well-known and well-validated. But these equations are treacherously non-linear: they require tremendous computing power, they don’t give straightforward answers, and predictions usually become completely unreliable when you look sufficiently far ahead; the eminent mathematicians Vladimir Arnold and Edward N Lorenz (also known for his ‘toy’ weather prediction equations) said that no prediction can be trustworthy after a week or 10 days.

Why? What happens? Here’s an informal explanation. Predictions using very complicated non-linear equations usually have an ‘orderly’ and a ‘chaotic’ component. Over time, the chaos overwhelms order, and, beyond a certain threshold, it is *all* chaos. It is ridiculous to expect predictability from this point onward … unless we have much more prior information and insight.

The other way to predict rainfall is by empirical means. This is a classical statistical approach: we use a regression equation in which the amount of rainfall is the dependent variable, and there are a host of (hopefully independent and informative) predictors.
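To give a flavour of the empirical route, here is a minimal regression sketch. Everything in it is invented for illustration – the synthetic predictor values, the coefficients, the number of years; real forecasts use carefully chosen physical indices, not random numbers:

```python
import numpy as np

# Synthetic "historical record": 30 years of three hypothetical predictors
# (say, a sea-surface-temperature anomaly, a pressure index, a snow-cover
# index) and the JJAS rainfall (cm) they partly explain.
rng = np.random.default_rng(0)
n_years = 30
X = rng.normal(size=(n_years, 3))
true_coef = np.array([3.0, -2.0, 1.5])                    # invented for the demo
rain = 90 + X @ true_coef + rng.normal(scale=2.0, size=n_years)

# Ordinary least squares: rain ~ intercept + X @ beta
A = np.column_stack([np.ones(n_years), X])
coef, *_ = np.linalg.lstsq(A, rain, rcond=None)

# Forecast for a new year with hypothetical predictor values [0.5, -0.2, 0.1]
new_year = np.array([1.0, 0.5, -0.2, 0.1])                # leading 1 = intercept
forecast = new_year @ coef
print(f"forecast: {forecast:.1f} cm")
```

The design choice that matters is the one the article describes: which predictors enter `X`, and how their influence is re-weighted as the record evolves.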

This is how monsoons have been predicted for over a century (IMD is attempting a dynamical or mathematical approach for the first time this year). Over all these years, some predictor variables dropped off, and some new variables came in. Some predictors grew in impact, others diminished in impact. The story kept evolving.

One of these predictor variables: El Nino Southern Oscillation (ENSO), or simply ‘El Nino’, has, during the last 10-15 years, acquired superstar status, becoming the Sachin Tendulkar or, now, Virat Kohli, of the prediction game. El Nino is about a greater warming of the waters in the Pacific Ocean a few months before the monsoon onset. It has been observed that rainfall can be scarce in El Nino years. The ‘opposite’ (greater ocean cooling) is La Nina, which is believed to bring in abundant rain. That’s why Indian scientists, and Indian media even more, love La Nina and hate El Nino.

This love affair, however, has only modest justification if you look at real numbers. The El Nino/La Nina phenomenon – *when it happens*: it is *not* an annual event – accounts for only about 33% of the total variation in the monsoon prediction system, *which is indicative, but unlikely to be statistically significant*. In fact, we’ve quite often had good rains in El Nino years, and bad rains in La Nina years.

We’re therefore puzzled by the undue prominence given to El Nino in the Indian media. Think about it this way: the monsoon is an amazingly powerful, highly persistent, and unbelievably ubiquitous phenomenon capable of colossal fury and ferocity. It is like an awesome parade of elephants running amok with majesty and authority. How much can a solitary variable influence its staggering march?

Indeed, this is what creates hope for the forecaster. The monsoon is really big; and really quite orderly. It therefore seems eminently feasible to predict the monsoon rainfall *for the limited Indian region* for periods *ranging up to a month* (rather than just 7-10 days), especially if we play with *averaged data* … that always holds the promise of greater order and predictability.

We will now recount a fascinating story. But, before that, let us (again informally) explain how we do rainfall prediction. We’ll only talk of the dynamical (not empirical) route because that’s where we expect to do big and interesting things in the future.

We’ve already talked of the equations used to predict rainfall. These equations are complex and daunting, and *can only be solved numerically*. To kick off the computing, we need to know the initial values of the input variables (relating to heat, moisture, temperature, wind velocity, radiation, boundary layers, etc.). This data is now easy to obtain with the proliferation of satellites, sensors and weather agencies around the world (we can, e.g., get the data from the National Centers for Environmental Prediction, or NCEP, in the USA).

Let us start! Our process to numerically solve the equations will involve ‘time-marching’. Let us, say, obtain the initial values of the input variables as on 00:00 hours on 1 July 2006 and kick off the number crunching. We will crunch across 256 latitudes, 512 longitudes and 18 vertical levels, to cover the ‘entire’ atmosphere around the globe. The first pause in our ‘time-march’ will be at 00:15 hours on 1 July 2006. As we pause, we notice that the values of the input variables have changed. An indirect computation using the initial and changed values of the input variables gives us an *estimate of the likely rainfall* in these 15 minutes. We resume our time-march, now kicking off the computation with the changed values of the input variables, and begin crunching for the next time-step of 15 minutes. At the second pause, at 00:30 hours, we find that the values of the input variables have evolved further. As before, we again estimate the rainfall during the period 00:15 to 00:30 (which we add to the first estimate), and resume our time-march with the next set of input variables. This way we keep time-marching, cumulating our rainfall estimate and refreshing the input variables. After 96 time-steps (which, depending on the computing power, can be completed in seconds or minutes) we have advanced by a full day! We continue our forward march! When we finish marching forward by a month, *we will have our rainfall estimate for the month* (often expressed as a daily average; e.g., 10.75 mm/day).
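The bookkeeping of this march can be sketched in a few lines. This is a toy, not *Varsha*: the real computation advances the full atmospheric state on a 256 × 512 × 18 grid, while here a single scalar ‘state’ and a made-up step function stand in, so only the loop structure is genuine:

```python
from datetime import datetime, timedelta

def step_model(state):
    """Advance the pretend state by 15 minutes; return it with a rain estimate."""
    new_state = 0.999 * state + 0.05                            # invented dynamics
    rain_mm = max(0.0, 0.01 * (new_state - state + 0.1))        # mm in this step
    return new_state, rain_mm

t = datetime(2006, 7, 1, 0, 0)          # initial conditions: 00:00, 1 July 2006
dt = timedelta(minutes=15)
state, total_rain_mm, steps = 1.0, 0.0, 0

while t < datetime(2006, 8, 1):         # march forward by a full month
    state, rain = step_model(state)     # refresh inputs, estimate this step's rain
    total_rain_mm += rain               # cumulate the rainfall estimate
    t += dt
    steps += 1

print(steps)                            # 96 steps/day x 31 days = 2976
print(f"{total_rain_mm / 31:.2f} mm/day")
```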

It is not advisable to base the monthly rainfall prediction on just one computation. An established practice, therefore, is to create an ‘ensemble’ of 4-5 computations, using different sets of initial conditions, and obtain 4-5 estimates of the monthly rainfall. By averaging these 4-5 estimates we get a value for the whole month that seems ‘safer’.

Starting 2006, one of the authors (U N Sinha) led an annual exercise to predict the *monthly July rainfall*. The prediction was made by running the *Varsha* code on the Flosolver computer. Flosolver was developed at National Aerospace Laboratories (NAL) by India’s Council of Scientific and Industrial Research (CSIR); *Varsha* grew out of NCEP’s T-80 atmospheric global circulation model. The Flosolver-*Varsha* computing engine received generous funding from several Indian national agencies such as New Millennium Indian Technology Leadership Initiative (NMITLI), championed by Department of Science and Technology (DST), and Ministry of Earth Sciences (MoES).

Why July? Chiefly because the monsoon is ‘best behaved’ in July, and July also receives the highest rainfall among the monsoon months. The approximate split of the 90 cm across the JJAS months is: June 18 cm, July 30 cm, August 25 cm and September 17 cm; so the average daily July rainfall is about 10 mm.

Year after year, *Varsha*’s rainfall prediction for July was communicated to IMD, which is India’s official rainfall forecaster. Generally speaking, *Varsha*’s prediction agreed reasonably well with the observed July monsoon rainfall … till 2013! That year *Varsha*’s July prediction was just below 7 mm/day, while the observed rainfall was about 10 mm/day (9.95 mm/day, to be precise).

It would have been easy to dismiss the 2013 failure as “one of those bad years”. It would have been even easier to offer sufficiently convincing explanations about the “innate underlying treachery” of non-linear equations. But these would be excuses, and no one is ever truly appreciative of good or bad excuses.

We needed to do better; we needed to return to the drawing board.

The first guess was that the ensemble size wasn’t large enough. Could it be that averaging over just 4-5 predictions wasn’t capturing all the inherent variation? So, we labored hard and long to create an ensemble of 24 predictions for July 2013. *But the average over the larger ensemble did not change*. It was still about 7 mm/day.

This was a puzzle … that we’ll investigate in a moment. But first we explain how we generated our 24-prediction ensemble for the month of July. As before, we ‘time-marched’ forward by a month (in actual practice, it is helpful to go forward by about 40 days to get some extra cushion if required … and then discard the poor end predictions), with initial values at 00:00, 06:00, 12:00 and 18:00 hours (for which NCEP has the data), *for six days*: 28, 29, 30 June, and 1, 2, 3 July of 2013. That gave us our 6 x 4 = 24 predictions. Note, in passing, that our “July monthly forecast” actually started on 4 July, and briefly spilled into August, but that shift is not hard to fix.
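The 6 × 4 = 24 start times are easy to enumerate – the four NCEP analysis hours on each of the six days from 28 June to 3 July 2013:

```python
from datetime import datetime, timedelta

# Four analysis hours (00:00, 06:00, 12:00, 18:00) on each of six days.
start_times = [
    datetime(2013, 6, 28) + timedelta(days=day, hours=hour)
    for day in range(6)
    for hour in (0, 6, 12, 18)
]

print(len(start_times))       # 24
print(start_times[0])         # 2013-06-28 00:00:00
print(start_times[-1])        # 2013-07-03 18:00:00
```

Each of these 24 timestamps seeds one time-march, giving the 24-member ensemble.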

Let us return to our puzzle: Why did our July 2013 predictions average to 7 mm/day, and come nowhere close to the actual 9.95 mm/day? We decided to look at each of our 24 predictions individually, and see if there was something unusual happening anywhere. And then we found something truly intriguing: Our July 2013 prediction using the input variables as at *12:00 on 29 June 2013* was 10.02 mm/day. But when we next used input variables as at *18:00 on 29 June 2013* – just 6 hours later – our July 2013 prediction dropped to 7.08 mm/day! (Look at the picture that appears below)

Curious, very curious! We had simply pushed everything forward by six hours, without changing anything else. Surely this small push shouldn’t alter our monthly rainfall prediction by too much, if at all? For more reassurance, we checked our computation again. There were no errors and no confusion. But the value still dropped suddenly by some 30%. How? Why? And the story doesn’t end here. When we advanced by six more hours and kicked off the time-march using the input variables as at *00:00 hours on 30 June 2013* we got 7.07 mm/day. So this time absolutely nothing changed after six hours. Curiouser and curiouser, as Alice might have said.

Clearly this was the chaotic behavior in our non-linear equations announcing itself loud and clear. In a deterministic world, small perturbations in the input lead to small perturbations in the output. But such good behavior breaks down in a chaotic world where we are confronted with ‘sensitive dependence on initial conditions’, and the unexpected consequences of the ‘butterfly effect’ (“the flapping of a butterfly’s wings in Brazil can cause a tornado in Texas”).
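This sensitivity is easy to demonstrate with Lorenz’s own three-variable ‘toy’ weather system (standard parameters). Two runs start one part in 10⁸ apart; a crude forward-Euler march – sufficient for a demonstration, though not for serious work – lets us watch the microscopic gap grow by many orders of magnitude:

```python
def lorenz_step(x, y, z, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One forward-Euler step of the Lorenz equations."""
    return (x + dt * sigma * (y - x),
            y + dt * (x * (rho - z) - y),
            z + dt * (x * y - beta * z))

a = (1.0, 1.0, 1.0)
b = (1.0 + 1e-8, 1.0, 1.0)          # the 'butterfly': a one-in-10^8 perturbation

max_gap = 0.0
for _ in range(3000):                # 30 model-time units
    a = lorenz_step(*a)
    b = lorenz_step(*b)
    max_gap = max(max_gap, abs(a[0] - b[0]))

print(max_gap)                       # vastly larger than the initial 1e-8
```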

Is this chaotic, non-linear world completely ill-behaved? No! Our first value of 7.08 mm/day reappeared almost instantly as 7.07 mm/day. It was persistent!

Puzzled, but excited, we looked at all our July 2013 predictions (in mm/day) starting with input variables as on *28 June, 00:00 hours*, and advancing the input variables by six hours, to get: 10.56, 8.96, 8.76, 8.82, 9.58, 8.92, 10.02, 7.08, 7.07, 8.02, 4.10, 5.01, 7.51, 5.59, 5.57, 4.00, 5.86 and so on.

These numbers aren’t random; there is perceptible persistence. There is a marked tendency to cluster around two or three bands or intervals. For example, there is some clustering just under 9 mm/day, and also some clustering around 5.5 mm/day; so there was indeed a significant probability that our random ensemble of 4-5 predictions would average to 7 mm/day. And, since the predicted values kept dropping from the highs of 8-9 mm/day to 4-5 mm/day (2013 was indeed curious!), our ensemble using all the 24 predictions also averaged to about 7 mm/day. What, then, is the story that these numbers are trying to tell?
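The clustering can be seen directly in the numbers. A minimal sketch, using the 17 predictions quoted above and a simple gap-based grouping (the 0.7 mm/day gap threshold is our choice, purely for illustration):

```python
# The ensemble predictions quoted above (mm/day).
preds = [10.56, 8.96, 8.76, 8.82, 9.58, 8.92, 10.02, 7.08, 7.07,
         8.02, 4.10, 5.01, 7.51, 5.59, 5.57, 4.00, 5.86]

print(f"ensemble mean: {sum(preds) / len(preds):.2f} mm/day")   # 7.38

# Group the sorted values: start a new band whenever the jump from the
# previous value exceeds the (illustrative) threshold of 0.7 mm/day.
ordered = sorted(preds)
bands, threshold = [[ordered[0]]], 0.7
for v in ordered[1:]:
    if v - bands[-1][-1] > threshold:
        bands.append([v])
    else:
        bands[-1].append(v)

for band in bands:
    print(f"band around {sum(band) / len(band):.2f} mm/day: {band}")
```

This crude split recovers bands near 4, 5.5, 7.4 and 9.4 mm/day: almost no individual member sits near the 7.38 mm/day mean, which is precisely why blind averaging misleads.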

The most obvious story or message is: *Don’t average blindly, and averaging over larger ensembles may not get you any closer!* Averaging can in fact be counter-productive because it masks some real information that could otherwise have been discovered.

The less obvious, but more important, story or message is: *Each prediction we compute is valid, and a potentially winning candidate* (e.g. the prediction of 10.02 mm/day obtained after time-marching from 12:00 hours on 29 June 2013 was unerringly accurate).

That’s how non-linear equations behave: they have multiple solutions which are all valid! See the pictures below: in deterministic systems, there is only one solution and small input perturbations only create a small error band around the final solution; but chaotic systems can admit multiple solutions!

That, in essence, is the real problem. Which of these multiple solutions should we pick? And why?

To investigate further, we immersed ourselves in a massive hind-casting adventure. We generated ensembles for every year from 2000 to 2016. In practically every instance we found a significant clustering around 2, 3 or at most 4 values. The challenge was to pick the most probable cluster. This was rather easy for some years: the correct prediction was staring us in the face when we looked at the clusters. For other years it was trickier. Even for 2002 – the biggest prediction nightmare in recent times – our prediction didn’t do as badly as some others.

What, then, is the secret? If our predictions fall into two or three candidate clusters or bands, how do we pick the correct band?

The answer is not easy to obtain, but we think we can get there with some thoughtful, but intense, data analytics, and innovative data visualization. For many months now, some of us at CSIR’s Fourth Paradigm Institute (CSIR-4PI) have been working on cluster analysis and visualization constructs. We are devising a set of rules, based on these constructs, to help us make the most robust monsoon rainfall prediction for the month of July. Over time, these rules, which today involve visual judgment, will be automated. This journey should take the project into the realm of machine learning and artificial intelligence (AI), which is clearly the wave of the future. But that AI adventure is only just starting, and it would be premature to say anything more at this stage.

But what we can certainly say, nay *assert*, right now, is that the computing paradigm is poised to change. For far too long it has been: *model first, data later*. Every time India’s monsoon prediction isn’t good enough – which, sadly, is often enough – we think something’s wrong with our model and try to tweak it. That ‘tweak balm’ provides some brief relief, but soon the pain surfaces somewhere else. This has to change: in future, the data – *not* the model – must drive the prediction.

We know the fundamental equations to predict weather, our computing power and prowess is increasing phenomenally, the data gathering ecosystem is growing exponentially in size and reliability … so shouldn’t we also change the way we compute and predict? There’s too much at stake here; not just the nation’s economy but, even more, the well-being of the nation’s people. We can’t afford to fail.

*Srinivas Bhogle with U N Sinha and Vidyadhar Y Mudkavi. This note will also appear in the IISA newsletter. Sinha was recently awarded Cray’s A P J Abdul Kalam Award for his lifetime achievements in fluid dynamics and parallel computing; Mudkavi heads CSIR’s Fourth Paradigm Institute.*

“The first pause in our ‘time-march’ will be at 00.15 hours on 1 July 2016. After this first iteration, we can obtain our estimate of the likely rainfall in these 15 minutes, and new computed values that become our next set of input variables. We plug in these computed values and begin crunching for the next time-step of 15 minutes.”

Slightly confused. From what I understand, you take a whole bunch of variables, let us say 100 variables, and come up with an estimate of the amount of rainfall that would occur in the first 15 minutes of July 1st. Then you say “… and the new computed values that become our next set of input variables.” How did the one variable, which is the amount of rainfall, become a new set of 100 variables?

It’s not as if rainfall is the dependent variable and these 100 (actually more like 10) variables are predictors. We’re time-marching. The input variables are evolving. We estimate rainfall indirectly based on the changed values of the input variables after every time-step.

OK, Srinivas now I get it (I think). I have other questions and thoughts, but will communicate via email.