India is entering an election year. During the 2009 general election, 417 million out of the 719 million eligible voters actually voted. Now, five years later, the numbers will be even larger.

Predicting the outcome of such a mammoth election must seem impossibly difficult. Surprisingly, it isn’t! We’ve had opinion polls in India for almost 30 years now and most of them have performed reasonably well.

Arguably India’s best pollster is Rajeeva Karandikar; he first developed his prediction model in 1998 and has since made poll forecasts in practically every state and national election.

Since 2005, Rajeeva has collaborated with the Centre for the Study of Developing Societies (CSDS) and the CNN-IBN news channel to call numbers in 27 election campaigns so far, including the four state elections in December 2013. In 14 of these elections his predictions were “very good”, in 9 they were “good” and in the remaining 4 they were “not good”.

This interview with Rajeeva Karandikar (RK) actually never happened, but I thought it was a good format to explain this complex business.

*SB: You mention “very good”, “good” and “not good” predictions. What’s your criterion when you make such appraisals?*

RK: I say “very good” if our seat prediction was very precise, or if we spotted the winner when rival polls missed it. I say “good” if we were at least as good as the others, and “not good” if we were off the mark or if the others got it better than us.

*SB: How did you fare in the December 2013 state elections?*

RK: I’m giving myself “very good” for Madhya Pradesh and Chhattisgarh, and “good” for Rajasthan and Delhi.

*SB: I would’ve given you a “very good” for Delhi too because only your poll predicted single digit seats for the Congress. But isn’t that the crux of the matter in a parliamentary democracy? Getting the number of seats right.*

RK: Absolutely. The puzzle has two pieces. The easier part is to estimate the percentages of votes polled by a party, the harder part is to convert these percentages into seat numbers!

*SB: So let us start with the easier problem. How do you estimate the percentage of votes polled by the major political parties? For example, how would you do it for the forthcoming 2014 general election?*

RK: We have to pick a representative sample … a sample of voters that will truly mirror the view of all the voters.

*SB: How do you choose such a sample?*

RK: It is vital that we choose a random sample; a sample where every eligible voter has the same chance or probability of being selected.

*SB: Is randomness so important?*

RK: It is of paramount importance. The quality – and indeed even the validity – of the prediction crucially depends on randomness.

*SB: And what should be the size of the random sample, e.g., for the 2014 election?*

RK: I’d be comfortable with a sample of 50,000, although even a sample of 20,000 should do the job.

*SB: A sample of 20,000 to predict how over 425 million might vote? Surely you must be joking!*

RK: Now you are talking just like the politicians of the losing party that we meet on CNN-IBN! I admit that this number goes against our intuitive grain … but that’s because we are conditioned to think that every sample should be a reasonable percentage (5% or 10%) of the total number of voters. That’s not how it works in practice!

*SB: So how does it work?*

RK: We don’t have to worry about any percentage; we only have to worry that our sample size is large enough.

*SB: What do you mean by “large enough”?*

RK: Let me see if I can explain it informally. Suppose my random sample size is just 2000, instead of 20,000 or 50,000. With this sample size we could still get a good estimate of the percentage of votes polled by the leading political parties if we are very lucky … but that won’t happen too often; perhaps we’ll get it right just 20 times out of 100! What if we choose a random sample of 10,000? Now we might get it right 70 times out of 100.

That’s the idea: keep raising the size of your random sample till you can get it right 95 or 99 times out of 100. We can show mathematically that if we use a random sample of 16,000, we’ll get our prediction of vote shares of the leading parties right 99 times out of 100.
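The idea of "large enough" can be sketched with the textbook formula for a simple random sample. This is only an illustration: the interview does not say what margin of error counts as "getting it right", so the 1 percentage point margin and the worst-case vote share of 50% used below are assumptions, and the function names are made up for this sketch.

```python
import math
from statistics import NormalDist

def required_sample_size(margin, confidence, share=0.5):
    """Smallest simple random sample whose estimate of a vote share falls
    within +/- margin of the truth with the given confidence
    (normal approximation; share=0.5 is the worst case)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z * z * share * (1 - share) / (margin * margin))

def hit_probability(n, margin, share=0.5):
    """Chance that a sample of size n lands within +/- margin of the truth."""
    std_error = math.sqrt(share * (1 - share) / n)
    return 2 * NormalDist().cdf(margin / std_error) - 1

# Assumed target: within 1 percentage point, 99 times out of 100.
print(required_sample_size(margin=0.01, confidence=0.99))
print(hit_probability(16000, margin=0.01))
```

Under these assumptions the required sample comes to roughly 16,600, the same order as the 16,000 quoted above. Note that the number of eligible voters appears nowhere in the formula, which is why the sample does not need to be a fixed percentage of the electorate.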

*SB: How do we choose this random sample in practice?*

RK: We use official Election Commission lists. These list all the 543 parliamentary constituencies, all the polling booths within each constituency, and all the eligible voters within each polling booth. A big advantage with these lists is that the data is organized contiguously; so the constituency listed after Bangalore South is likely to be Bangalore North, and not, e.g., Phulpur in UP. And polling booths with consecutive numbers will be geographical neighbors.

*SB: How does that help?*

RK: The geographical contiguity allows us to implement a circular random sampling process. Suppose – because of cost and time constraints – it is possible to sample from only 20% of the constituencies (say 108 out of 543 constituencies). Then I can draw a number at random from 1 to 543 … let us suppose I draw 378 … and proceed to pick every fifth constituency cyclically: so I start with constituency number 378, then pick constituency 383 … then 543, 5, 10 and so on.

This way, we ‘span’ the entire country, ensure that our sample is sufficiently ‘separated’, and gain exposure to the largest diversity of influences.
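The circular scheme described above is easy to sketch in code. The function name and parameters are hypothetical; only the numbers (543 constituencies, every fifth one, 108 picks, a start of 378) come from the interview.

```python
import random

def circular_systematic_sample(population_size, step, count, start=None, rng=random):
    """Pick `count` unit numbers (1-based) by taking every `step`-th unit
    from a random start, wrapping around past the end of the list."""
    if start is None:
        start = rng.randint(1, population_size)  # random starting point
    return [(start - 1 + i * step) % population_size + 1 for i in range(count)]

picks = circular_systematic_sample(543, step=5, count=108, start=378)
print(picks[:3], picks[33:36])  # [378, 383, 388] [543, 5, 10]
```

Because 5 and 543 share no common factor, the walk does not revisit a constituency within the 108 picks.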

*SB: Is that the best way to do it?*

RK: Look, we could do better if we had supporting socio-economic data … but in India such data is not available at the polling booth level or even at the constituency level. So that’s not an option. But I have no worries; we have observed over the years that the circular random sampling scheme performs well.

*SB: So to choose the polling booths to be included in the random sample, and voters within the chosen polling booths, I suppose you pick the 71st, 161st, 251st … etc. polling booth, and within every polling booth the 92nd, 132nd, 172nd … etc. voter?*

RK: Yes, that’s approximately the idea. So if we pick 25 voters from every booth, and 8 booths from every constituency, we’ll have a nationwide random sample of 25*8*108 = 21,600 voters. It would be even better if we could pick 50 voters from every booth. That’ll give us a sample of 43,200 voters. For our 2009 national election survey we worked with a sample of just over 50,000. This gave us fairly good vote estimates and seat projections.
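The sample-size arithmetic and the fixed-interval picks can be written out as a small sketch. The helper names are invented for illustration; the intervals (every 90th booth, every 40th voter) are simply read off the interviewer's example and are not claimed to be the intervals actually used.

```python
def planned_sample_size(voters_per_booth, booths_per_constituency, constituencies):
    """Total voters interviewed in a three-stage design."""
    return voters_per_booth * booths_per_constituency * constituencies

def every_kth(first, step, count):
    """Positions on a list picked at a fixed interval."""
    return [first + i * step for i in range(count)]

print(planned_sample_size(25, 8, 108))  # 21600
print(planned_sample_size(50, 8, 108))  # 43200
print(every_kth(71, 90, 3))             # [71, 161, 251]
print(every_kth(92, 40, 3))             # [92, 132, 172]
```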

*SB: But you told me that even with a sample of 16,000 you’ll get it right 99 times out of 100. So why do you want a sample of 50,000?*

RK: For at least two reasons. First, a larger sample size offers greater ‘security’ to a pollster who is always worried about randomness. Second, the prediction process needs estimates from sub-samples based on states or regions within a state. So if the total sample is just 20,000, some sub-samples may be just too small to work with.

If you ask me, 50,000 may be just right. It is large enough to draw valid inferences, but manageable enough from the point of view of cost and timeliness.

*SB: Timeliness?*

RK: The Indian voter is especially fickle and vulnerable to influences. We have data to prove that there can be a wide swing in preferences even in the last week before voting. So the timing of our survey is crucial. That’s why we prefer to do our survey after the vote, rather than before.

*SB: Looks like a real tightrope walk …*

RK: It is! Being a pollster in India is so much harder than being a pollster in the UK, although we have an identical voting process in both the countries. When I set out to become a psephologist, I consulted extensively with Prof Clive Payne, the doyen of BBC opinion polls. We agreed that the India poll model had to be different because we had far fewer ‘party-loyal’ voters in India (so fewer ‘safe’ seats for a party), lacked sufficient socio-economic data at the polling booth level, and had many more national and regional political parties.

*SB: It seems quite a miracle that you can still get it right so often.*

RK: You could call it a miracle, or you could call it the power and beauty of science. But to succeed you must respect science, or in this case the power of statistical random sampling.

That’s why we work really hard to get unbiased feedback, even if it costs us more. For example, instead of directly asking the voter who he voted for, we recreate the actual voting process by asking him to cast his vote in secrecy using a dummy ballot paper and a dummy ballot box. And if our investigator cannot meet the randomly selected voter, he doesn’t talk to the voter’s neighbor instead! He returns the next day to try again.

*SB: And your investigators essentially obtain the voting preference of each of your 21,600 (or 43,200) randomly selected voters?*

RK: Exactly. How many of these voters voted for BJP, how many for Congress, and how many for the BSP and other political parties.

*SB: Let us say that your sample says that Congress gets 24.6% of the votes, and the BJP gets 25.8% of the votes. How does that help you predict what will be the Congress and BJP vote share in each of the 543 constituencies? Your sampling scheme only touched 108 constituencies.*

RK: That’s a good question and deserves a careful answer. You are right … the primary objective of the poll survey is to estimate the vote percent of every major party in every constituency. So we should ideally run the survey in every constituency. But that will make the survey unacceptably large and expensive!

*SB: So?*

RK: So we create a reasonable model to ‘extrapolate’ from 108 to all 543 constituencies. Our model is based on the key assumption that the vote swing remains constant within a state (or within a region in a big or volatile state).

*SB: (looking puzzled)*

RK: Let me say this differently. Consider any of the 543 constituencies … consider your own constituency of Bangalore North. I want to estimate the BJP vote percentage in this constituency. How would I do it? I would look at the BJP vote percentage in Bangalore North in the last election (it was 40.9% in 2009) … and to it add the vote swing percentage in Bangalore North this time (2014).

But here’s the problem! My sample isn’t good enough to estimate the vote swing specifically in Bangalore North. So I pretend that the 2014 vote swing in Bangalore North is the same as the vote swing in the state of Karnataka (which could be -2%), and add the state vote swing percentage – instead of the constituency vote swing – to the Bangalore North equation. My estimate for the Bangalore North vote percentage in 2014 could therefore be (40.9 – 2) = 38.9%.
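In code, the constant-swing assumption amounts to a single addition. This is a minimal sketch using the Bangalore North figures above; the function name is hypothetical, and the -2 point Karnataka swing is the illustrative value from the text, not a real estimate.

```python
def project_vote_share(previous_share, swing):
    """Uniform-swing projection: the party's share in the last election plus
    the swing measured over a wider area (state or region), in percentage points."""
    return previous_share + swing

bangalore_north_2014 = project_vote_share(previous_share=40.9, swing=-2.0)
print(round(bangalore_north_2014, 1))  # 38.9
```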

*SB: But what makes you sure that the swing percent remains the same across the state? If you think of a state like Andhra Pradesh you would expect the Congress to have a positive swing in the Telangana region and a negative swing in other parts of the state.*

RK: That’s a distinct possibility. So in such situations we make the granularity richer. I would therefore add the swing percentage of the region instead of the state. Or I would take a weighted average with a 70% weight for the region swing and the remaining 30% for the state swing.
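The weighted version can be sketched as follows. The 70/30 weights come from the answer above; the swing values in the example (+4 points for a region, -2 points for the state) are purely hypothetical, and the function name is invented.

```python
def blended_swing(region_swing, state_swing, region_weight=0.7):
    """Weighted average of the regional and state-wide swings (percentage points)."""
    return region_weight * region_swing + (1 - region_weight) * state_swing

# Hypothetical numbers: the region swings +4 points while the state as a whole swings -2.
swing = blended_swing(region_swing=4.0, state_swing=-2.0)
print(round(swing, 1))  # 2.2
```

Setting `region_weight=1.0` recovers the simpler rule of using the regional swing alone.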

*SB: Looks as though we’re finally getting somewhere. We can now estimate the vote share of every major party for any national constituency; be it Phulpur or Warangal or Bangalore North!*

RK: Yes, but don’t forget that the vote share percentage could wobble a little this way or the other because of sampling errors, especially if the sample size is relatively small. It is in such situations that a larger sample size really helps.

*SB: But I do begin to get the drift. I’m beginning to understand why those election shows with Rajdeep Sardesai on CNN-IBN are so riveting. But the high point on those shows is when you convert ‘vote percentages’ to ‘projected number of seats’. How exactly do you do that?*

RK: Let me bounce back the question to you. Let us pretend that only two parties (BJP and Congress) are in the race in a certain constituency. Let us say BJP gets 54% of the vote in our survey and the Congress the remaining 46%. Who do you say would win?

*SB: BJP almost surely.*

RK: Now suppose the survey says BJP 51% and Congress 49%?

*SB: This is trickier! I’d say it is too close to call, especially because those sampling errors you mentioned could also kick in to mess up the prediction.*

RK: It is indeed tricky. Even though the numbers suggest a 2% edge for BJP, in reality Congress may actually be marginally ahead after taking sampling errors into account! This is the possibility that truly challenges the psephologist. I solve this problem by computing probabilities. I ask: What is the probability that the survey shows Congress 1% short of the majority mark when their true percentage is actually just a whisker over 50% after accounting for the likely sampling error?

*SB: And your answer is …?*

RK: My answer is that this probability is 45% … so something like this could happen 45 times out of 100. I therefore assign a 45% win probability to Congress. Since BJP is the only other party, its win probability is 55%.
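One simple way to turn an estimated share into a win probability is to treat the estimate as normally distributed around the true share. This is a sketch under assumptions, not a description of the actual model: the interview does not say how the 45% is computed, and the standard error of 8 percentage points below is a hypothetical value chosen only because it gives a result close to 45%.

```python
from statistics import NormalDist

def win_probability(estimated_share, std_error, threshold=50.0):
    """Probability that the true share exceeds `threshold` (in percent),
    treating the estimate as normal around the truth with the given std_error."""
    return 1 - NormalDist(mu=estimated_share, sigma=std_error).cdf(threshold)

# Two-party race: survey says Congress 49%, BJP 51%.
congress = win_probability(49.0, std_error=8.0)  # std_error is an assumption
bjp = 1 - congress
print(round(congress, 2), round(bjp, 2))  # 0.45 0.55
```

The smaller the standard error, the further these probabilities move away from 50-50, which is one way to see why a larger sample helps in close contests.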

*SB: And if there are three parties in the race instead of two?*

RK: Nothing really changes, except that the probability computation gets harder. As an example, suppose that the BJP-AAP-Congress vote percentages in a Delhi constituency are respectively 32%, 28% and 26%, then BJP wins 46 times out of 100, AAP 30 times out of 100 and Congress in the remaining 24 times out of 100.
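With three or more parties, a simulation is a convenient way to get win probabilities. The sketch below adds independent normal noise to each estimated share and counts how often each party comes out on top. Both the independence and the 5 point standard error are simplifying assumptions of this example, so it is not expected to reproduce the 46-30-24 split quoted above.

```python
import random

def simulate_win_probabilities(shares, std_error, draws=20000, seed=0):
    """Monte Carlo estimate of how often each party has the largest share
    when every estimate is perturbed by independent normal noise."""
    rng = random.Random(seed)
    wins = {party: 0 for party in shares}
    for _ in range(draws):
        noisy = {party: rng.gauss(share, std_error) for party, share in shares.items()}
        wins[max(noisy, key=noisy.get)] += 1
    return {party: count / draws for party, count in wins.items()}

probs = simulate_win_probabilities(
    {"BJP": 32.0, "AAP": 28.0, "Congress": 26.0},
    std_error=5.0,  # hypothetical value, not taken from the interview
)
print(probs)
```

The ordering of the three probabilities follows the ordering of the vote shares, but the trailing parties keep a real chance of winning, which is the point of the example above.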

*SB: But how do you call those seat numbers that eventually show up on the large CNN-IBN screen after Sardesai’s moment of delicious drama?*

RK: I just add the seat win probabilities! Let us return to your state of Karnataka. Karnataka has 28 constituencies. Imagine that our survey and analysis indicate that Congress wins 54 times out of 100 in Bagalkot, 56 times out of 100 in Bangalore North, 38 times out of 100 in Bangalore South and so on … then I just add 0.54 + 0.56 + 0.38 and so on over all the 28 constituencies. If that total is 18.05 or whatever, I privately tell myself that Congress will get 18 seats, give or take a few to account for sampling errors. So when Rajdeep asks me to call the numbers on TV I declare that “Congress will get 16-20 seats in Karnataka”. If I’m not too comfortable with the sampling process, I might widen the intervals to say “15-21 seats”.
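The seat call is then just a sum and a rounding step. In this sketch the function names are hypothetical, only the three win probabilities mentioned above are used (the other 25 are not given), and the half-width of the quoted range is left as a parameter because the interview treats it as a judgment call.

```python
def expected_seats(win_probabilities):
    """Expected number of seats: the sum of per-constituency win probabilities."""
    return sum(win_probabilities)

def seat_range(total, half_width=2):
    """Range to announce: the rounded total, give or take `half_width` seats."""
    centre = round(total)
    return centre - half_width, centre + half_width

partial = [0.54, 0.56, 0.38]  # Bagalkot, Bangalore North, Bangalore South
print(round(expected_seats(partial), 2))  # 1.48 expected seats from these three
print(seat_range(18.05))                  # (16, 20)
print(seat_range(18.05, half_width=3))    # (15, 21)
```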

*SB: Are you tense when you call these numbers? Do you worry that your numbers could go horribly wrong?*

RK: Well, there is obviously some tension, but I never worry. In fact I enjoy doing this! We have a good model, we have obtained reasonably good results for over 15 years … and we remain faithful to our method. So why would I worry? As long as I am honest with myself, and with my science, I have nothing to fear.

–This post was written for my blog: Bhogle The Mind appearing on statistics2013.org

This was very good and informative. Also nice to note that this came in the same month Caravan came out with this on poll predictions: http://caravanmagazine.in/reportage/spot

Thank you for sending the Caravan article link, although I’m surprised that article doesn’t mention Rajeeva Karandikar (Rajeeva-Yogen Yadav have been a team till Yogen joined AAP).

Yes. Surprising, considering that the CSDS survey is mentioned prominently.

Sir: You mention that this interview never actually happened. Is this posting then your assessment of how Prof. Karandikar does his job on projections or is this from many different conversations that you have had with him? Thank you for clarifying.

This is based on numerous personal conversations, email exchanges, and one of his recent research papers. Rajeeva Karandikar, who, I forgot to mention, is currently Director of Chennai Mathematical Institute, has been a dear friend for over 35 years.

Sir, can you share the details about the research paper that you are referring to? This is purely for my reference and better understanding about this fascinating area. Thank you.

I have emailed you the research paper. I have also copied Prof Karandikar’s mail id. So you can ask him your questions directly.