
Multiple Sclerosis and Statistics


Many of the things that are known, or believed to be known, about MS are derived from statistical surveys or depend upon statistics for their validity and, as such, are subject to the uncertainties inherent in statistics. Although statistics are always error prone, they are particularly so when applied to MS, precisely because of the unpredictable nature of the disease. It's not that statistics are in themselves bad - without them there would be very little science at all. It's just the way that they are often interpreted by scientists and journalists. Here is an apposite quotation I read on the newsgroup ASMS: "Too many people use statistics the way a drunk uses a lamp post. For support, and not for illumination."

I'd like to address the problems inherent in statistics in this section. A lot of MS research comes up with some very doubtful claims and we must not accept these claims as fact because the "scientists" have said so. What I'd like to do here is to look at how to "read" statistics and I'd like to start off by telling you a story.

Ok, tell me.

One time, I was playing backgammon with a girl who I was trying to impress. My first two throws of the dice were double sixes. Before my third throw, we were both watching the dice keenly and stared at each other in astonishment when the dice registered two more sixes.

What were the chances of this happening? 1 in 46,656. That's around a 0.002% chance of happening. Put another way, we can be 99.998% confident that this won't happen. Does this mean that the dice were weighted? Of course not - as the game progressed, it became clear that the dice were behaving normally and I returned to throwing my normal junk.
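The arithmetic behind that figure is short enough to check in a few lines of Python. This is only a sketch: it assumes two fair dice and throws that are independent of each other, and the variable names are mine.

```python
from fractions import Fraction

# Assumption: two fair dice, each throw independent of the others.
p_double_six = Fraction(1, 6) * Fraction(1, 6)   # one double six: 1 in 36
p_three_in_a_row = p_double_six ** 3             # three double sixes running

print(p_three_in_a_row)                          # 1/46656
print(f"{float(p_three_in_a_row):.4%}")          # 0.0021%
```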

So what has this got to do with MS?

I shall use this story to illustrate why caution needs to be exercised when interpreting statistics. I hope to explain the methods that statisticians use without getting too mathematical, but first we need to look at why statistics are so relevant to MS research.

There are three ways that we can do MS research.

The first is to look at tissue samples, genes, proteins and other biochemicals from people with MS or animals with EAE. We can then try to work out what is happening to their central nervous systems, to the cells and to other parts of their bodies. For example we might want to look at the site of an MS lesion and try to find some cells or other biochemicals which we wouldn't expect to be there. This will help us attempt to understand how the disease does what it does - to derive a potential "mechanism" to explain the nuts and bolts of the disease. Armed with such a mechanism we can look at ways to mitigate the damage, to stop it in its tracks or even to reverse it.

This is the theory, anyway. In reality, luck and politics play a much bigger role than might be expected. Statistics are also very important - often what you have is a set of numbers derived from analytic experiments and you need to be confident that your margins of error are within an acceptable range.

The second way to do MS research is through surveys. This involves planning. You have to select a group of people with MS to test for a particular trait. This may not be as simple as just asking whether or not a person has the disease, because you may be looking into how it affects different population groups. For example, you might be asking whether where a person is brought up affects their chances of contracting the disease or whether they have any relatives with the disease. Usually, you also have to select a good control group. A control group is necessary in order to show that what is happening to the first group is not also happening to the control group and is therefore meaningful. The control group needs to be a valid cross-section of the population as a whole and to be randomised for the circumstances being tested. Both groups have to be large enough to return valid results.

You also have to be asking the right questions - design is critical here. Does everyone asking and answering, doctors, patients and surveyors alike, understand the same thing by a question? You must eliminate ambiguity - I've often responded to surveys or polls and haven't understood exactly what the surveyor is looking for. All these issues and we haven't even begun to address the problems inherent with statistics!

The third way is a mix of the first two methods. People's bodies are very complex and there is an awful lot going on in a person's body that has nothing whatsoever to do with MS. For example, you may find evidence of a pathogen attacking several PwMS. You may then look at a control group and find that very few of them show evidence of being attacked by that same pathogen. Is this a fluke result or has the higher incidence of this pathogen got something to do with MS? You may wish to know whether a certain genetic configuration is in any way correlated with MS or whether a particular drug has any effect on the progress of the disease. Researchers use statistics to attempt to decide.

So what about the actual statistics?

The key to statistics is to attempt to establish that your results are very unlikely to be caused by chance. As a consequence, the results that statistics yields are to do with "probability" (using the mathematical meaning of the word). Statistics are all about the likelihood that this result, obtained from a small sample population, says something meaningful about the population as a whole.

How can this be achieved? Well, there are a variety of methods and these usually involve understanding a lot of quite sophisticated mathematics involving terms like "normal population", "skewed distribution", "standard deviation", "variance", "binomial test", "t-test" and "chi-squared analysis". This is all good stuff and very valid.

Fortunately, it is not necessary to understand the mathematics in order to understand statistics or even to practice them. However it is important to understand the concepts. In my view, no one should be placed in a position of acting upon statistical results unless they have been schooled in interpreting them. Far too many civil servants and politicians are reading off the results of scientific research and demographic surveys and then making decisions based on them without having the least idea of what the statistics actually mean.

Statistics come in two flavours - parametric and non-parametric. Parametric statistics are where we have a lot of measurements which yield a lot of "real" numbers like 1.632, 517.1 or 0.00001724. These numbers could be the concentration of a certain cell type per litre of CSF or even rankings on the EDSS scale. Non-parametric statistics are where you have a set of "conditions" (usually two) into which the data falls. Statisticians usually assign 1 to all the data that matches a condition and 0 to all the data that doesn't. These could be whether a PwMS smokes or not or whether a person has MS or not. Parametric statistics yield much more reliable results than non-parametric statistics and statisticians generally prefer them. Science practiced in a laboratory tends to produce parametric data and surveys conducted by market researchers tend to produce non-parametric data.

Whatever sort of data you have, it is important to apply the appropriate statistical test. Statistical tests that work for one type of dataset (a group of individual data items) are likely to be completely inappropriate for another type. Many researchers both select and apply the tests by rote. Implementing the tests can be done by simply applying formulae taken from a book or a computer program and reading off the results. This is perfectly okay. It is how the results are used and interpreted that is often a problem.

A trick that is often used in statistics is to assume that a trait (or "variate") being examined has no bearing on the outcome. For example, that mercury amalgam fillings do not affect your chances of contracting MS. This is known as the "null hypothesis". By applying the appropriate statistical test, you can determine the probability (likelihood) that the null hypothesis is incorrect - i.e. that you are looking at something real. This probability is expressed as a number which equates to the degree of "confidence" that the null hypothesis is wrong. Researchers aim to achieve certain accepted confidence levels which allow them to say that their work is "significant". 95% and 99% are generally agreed by many statisticians to be significant results. This agreement is, in fact, purely arbitrary and approximately relates to a statistical concept called the "standard deviation", which, in itself, is actually quite arbitrary. "Significant", "confidence" and "probability" are all statistical terms whose meanings are not the same as the common English usage.

So what's the problem - 95% confidence seems pretty confident to me.

95% confidence still leaves one chance in twenty of a fluke. If we had a dice with 20 faces marked from 1 to 20, we'd consider ourselves pretty lucky to throw a 20 on the first throw. However, if we were to chuck the same dice 20 times, we'd consider ourselves a shade unlucky not to have thrown a 20 on any throw at all. This is the problem with statistics - very often we are not throwing the metaphorical 20-faced dice just once, we are throwing it many times, but not noticing all the times that it fails to come up with a 20.
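A rough sketch of the twenty-faced dice argument, assuming a fair dice and independent throws. The function name, the number of trials and the seed are my own choices for illustration.

```python
import random

def chance_of_at_least_one_twenty(throws: int, faces: int = 20) -> float:
    """Exact chance that a fair dice shows its top face at least once."""
    return 1 - (1 - 1 / faces) ** throws

print(round(chance_of_at_least_one_twenty(1), 3))    # 0.05
print(round(chance_of_at_least_one_twenty(20), 3))   # 0.642

# Cross-check by simulation: repeat the 20-throw experiment many times.
rng = random.Random(1)  # fixed seed so the run is repeatable
trials = 20_000
hits = sum(
    any(rng.randint(1, 20) == 20 for _ in range(20))
    for _ in range(trials)
)
print(hits / trials)  # should land close to the exact figure
```

So one throw gives a 5% chance of a 20, but twenty throws give roughly a two in three chance of seeing at least one.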

This may seem a little strange. Wasn't the initial survey or experiment only done once? It was. However, we have to broaden our horizons to understand why it is that many more than one in twenty 95% significant correlations later prove to be chance observations. Let me explain.

There are several ways that this can happen including Statistical Fishing, Statistical Clustering, Small Dataset Size, Poor Control Group Selection, Unpredetermined Goals, Selective Sampling, Dishonesty and Manipulating the Data Published.

Here is an abstract from MedLine that demonstrates some of these errors in operation. I make some assumptions about the methods used by the researchers. Whether these assumptions are true or not doesn't matter - it's the principles that are important.

Multiple sclerosis in Key West, Florida.

Helmick CG, Wrigley JM, Zack MM, Bigler WJ, Lehman JL, Janssen RS, Hartwig EC, Witte JJ

Division of Chronic Disease Control, Center for Environmental Health and Injury Control, Atlanta, GA.

In 1984, a press release by a Miami, Florida, neurologist described a possible cluster of persons with multiple sclerosis in Key West, Florida. The authors examined the cluster using prevalence rates, which are recognized as having a latitudinal gradient for multiple sclerosis, being generally high at high latitudes and low at low latitudes. Case ascertainment showed 32 definite or probable cases among residents of the study area (latitude, 24.5 degrees N) on September 1, 1985, a prevalence rate of 70.1/100,000 population--14 times the rate estimated for this latitude by modeling techniques based on US and international data, 7-44 times the rate for areas at similar latitudes (Mexico City, Mexico; Hawaii; New Orleans, Louisiana; and Charles County, South Carolina), and 2.5 times the expected rate for all US latitudes below 37 degrees N. This finding could not be explained by changes in diagnostic criteria, case ascertainment bias, immigration of people from high-risk areas, an unusual population structure, a large percentage of related cases, or better survival. Prevalent cases (n = 22) were more likely than general population controls (n = 76), matched by sex and 10-year age group, to have: lived longer in Key West, been a nurse, ever owned a Siamese cat, had detectable antibody titers to coxsackievirus A2 and poliovirus 2, and ever visited a local military base (Fleming Key). Key West has an unusually high prevalence of multiple sclerosis that may be related to these risk factors.

This is such a badly designed survey, based on such a poor understanding of probability, it's hard to know where to start picking the bones out of it. Let's have a look at the motivation for doing this work. A neurologist reports seeing a large number of people presenting with probable or definite multiple sclerosis. Somehow the Center for Environmental Health and Injury Control are asked to look into the possibility of an MS plague and send a team in to investigate.

Actually, this is a great survey because it commits nearly the whole gamut of statistical errors and winds up being a completely worthless waste of funds that would be better spent on real MS research. Don't imagine for a moment that this is an isolated case - MedLine is full of such trash.

Statistical Fishing

For now let's put ourselves into the position of the people carrying out the work. Rather than examine the possibility that this high MS prevalence is due to statistical clustering (a concept which I shall examine below), they are trying to find possible causes for it. Judging by the questions asked (Siamese cats!?), they have interviewed the sample population and taken blood samples. Their lives and antibody titers have been dissected from every conceivable angle and used to build up a questionnaire and blood sampling program which was then applied to the control group. The resulting data was probably analysed using t-tests and any variates that turned up a 95% or 99% confidence level were reported in the article.

So what's wrong with that? A lot!

I don't know how many variates were actually examined, but it seems likely that it was a pretty large number if it included Siamese cat ownership. Let's be generous and assume that they didn't interview the sample population before devising the survey (I'll explain what is wrong with doing this soon). They have come up with six correlations which are all non-parametric. If the results required 95% confidence limits and all six were chance findings, one would expect that they examined about 120 variates; if they had required 99%, we would expect about 600. It's possible, of course, that they didn't even require these undemanding levels of confidence.

It's like the twenty-sided dice we talked about before. You expect one twentieth (5%) of all your throws to be a 20. That means that if you examine enough variates in any given dataset you should expect to turn up chance correlations. In fact, you need as few as fourteen variates to have a one half chance of finding at least one correlation within 95% confidence limits.
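To put numbers on the fishing problem, here is a minimal sketch. It assumes that every variate is tested independently and that none of them has any real effect, which will not be exactly true of any real survey. The function names are hypothetical.

```python
def expected_chance_findings(variates: int, confidence: float = 0.95) -> float:
    """Average number of spurious 'significant' results among unrelated variates."""
    return variates * (1 - confidence)

def chance_of_any_finding(variates: int, confidence: float = 0.95) -> float:
    """Chance that at least one unrelated variate passes the test by luck."""
    return 1 - confidence ** variates

# Six chance findings is what 120 variates would yield on average at 95%,
# or 600 variates at 99%.
print(round(expected_chance_findings(120, 0.95), 2))   # 6.0
print(round(expected_chance_findings(600, 0.99), 2))   # 6.0

# Smallest number of variates giving at least an even chance of one finding.
n = 1
while chance_of_any_finding(n) < 0.5:
    n += 1
print(n)  # 14
```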

In this study, the researchers have looked at the lives and blood of the PwMS in this area and come up with a whole load of tests to do on or questions to ask the control group. This is fishing and, lo and behold, they have caught something. But what they have pulled out are minnows and the best thing to do is to chuck them back.

Statistical fishing has, in the past, turned up a lot of completely spurious chance findings. It is so flawed that it is looked down upon by the more serious researchers, but it is still practiced by a whole load of people.

Statistical Clustering

When you look up on a clear night, you can see little patches of sky where there are more stars than there are in other areas. This is the nature of random distribution. If the stars were evenly distributed in the heavens, all equally distant from their neighbours, then they would appear as a grid and we wouldn't say that they were randomly distributed. In fact, the stars aren't distributed completely randomly but, for the sake of the argument and to all intents and purposes, we can say that they are.

Let's imagine taking a photo of the sky then getting it developed and printed. Now we draw a line around a little area where there are a lot of stars to mark it off. Then we mark off a similar sized area where there aren't many stars. Then we count the stars in both areas and, lo and behold, the one area has a much higher density than the other! The only way we wouldn't have been able to do this would have been if the stars had been dotted in the sky in some kind of cosmic grid.

Are we to assume that stars aren't randomly distributed from this exercise or are we looking at a statistical cluster?

Well, of course it's a statistical cluster - we fixed it that way!
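The star-counting exercise is easy to imitate. The sketch below scatters points uniformly at random over a square "sky", divides it into a grid, and then picks out the fullest and emptiest cells after the fact. The number of stars, the grid size and the seed are arbitrary choices of mine.

```python
import random
from collections import Counter

def count_stars_per_cell(n_stars: int, grid: int, seed: int) -> list[int]:
    """Scatter stars uniformly over a unit square and count them per grid cell."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_stars):
        col = int(rng.random() * grid)
        row = int(rng.random() * grid)
        counts[(row, col)] += 1
    # Include empty cells, which a Counter would otherwise leave out.
    return [counts[(r, c)] for r in range(grid) for c in range(grid)]

cells = count_stars_per_cell(n_stars=1000, grid=10, seed=42)
average = sum(cells) / len(cells)
print("average per cell:", average)   # 10.0 by construction
print("fullest cell:", max(cells))
print("emptiest cell:", min(cells))
```

Every cell had the same chance of catching each star, yet the fullest cell will typically hold several times as many as the emptiest. Drawing the line after looking at the picture is what creates the "cluster".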

Our researchers from Atlanta have done precisely the same thing. They have selected a small area with a higher than average density of PwMS and drawn a line around it. Then they have calculated the prevalence and compared it to the overall prevalence for that latitude and, lo and behold, just like our stars, they have a higher than expected prevalence.

We could equally have gone to another area, even one at a higher latitude, and found there to be no people at all with MS. What would it prove? That we have found a freak area where MS can't survive? Not at all - it'd just be a reverse statistical cluster.

Is this example from Florida a statistical cluster? Probably! I say probably because the possibility that it is an MS plague remains. However, since we are only looking at 22 people with probable or definite MS, it's not really a large enough dataset to draw any such conclusions. There do appear to be areas of the world where the incidence suddenly rises steeply in a plague-like manner - in the Faroe Islands, for example. Whether these are plagues or not is a matter that I deal with in another section.

There have been some great statistical clusters in the past: Israeli airforce pilots fathering a higher proportion of girls, Swedish men having a lower sperm count than their fathers to name but two.

Small Dataset size

In our example of the backgammon dice at the start of this essay, we saw three throws of the dice turn up three double sixes. We can accurately calculate that the chances of this happening are 0.002%. That sounds pretty unlikely, doesn't it?

The trouble is that we are looking at a very small dataset size - just three throws. The key fact here is to recognise that any number on any throw of just one dice has a one in six chance of happening. A six is no more rare than a two. A two and a four, followed by a three and a two, followed by a one and a six is actually no more likely than three double sixes. It has exactly the same chances of happening.

When you throw the dice three times there are 46,656 possible outcomes of which the three double sixes are just one possibility along with the 2/4, 3/2 and 1/6 and all the other possible permutations. We just notice the three double sixes.
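As a small illustration, the outcomes can simply be listed and counted. The sketch assumes the two dice can be told apart (say, one red and one white), so that a 2 and a 4 is a different result from a 4 and a 2; that is the convention under which the 46,656 figure holds.

```python
from itertools import product

# Assumption: the two dice are distinguishable, so results are ordered pairs.
one_throw = list(product(range(1, 7), repeat=2))    # 36 results for two dice
three_throws = list(product(one_throw, repeat=3))   # every sequence of three throws

print(len(three_throws))                             # 46656

memorable = ((6, 6), (6, 6), (6, 6))
forgettable = ((2, 4), (3, 2), (1, 6))
# Each sequence appears exactly once in the list, so each is equally likely.
print(three_throws.count(memorable), three_throws.count(forgettable))  # 1 1
```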

So when can we say that the throws definitely mean that the dice are weighted?

Well, essentially, if we use this kind of method, we can never be absolutely certain. It is always possible that, even when using completely unweighted dice, we will keep on throwing double sixes forever.

The philosophy of this gets pretty deep. To cut a long story short, let me say that it is generally agreed that science should proceed by attempting to propose mechanisms that explain why things happen the way they do. Such a mechanism is called a hypothesis. The task is then to make predictions based on this hypothesis. You design experiments to test the hypothesis with the aim of disproving it. If you fail to disprove it, you haven't proved it (science never proves anything) but you have just shown that your hypothesis remains a possibility. The more times your hypothesis fails to get disproved then the greater the confidence everybody will have in it, especially if alternative hypotheses get disproved.

For example, with the dice example above, we might put forward the hypothesis that the faces opposite the sixes are made of a magnetic substance that is attracted by an iron layer in the board. Based on this hypothesis we can make these two predictions:

  1. that by cutting one of the dice up in a laboratory and analysing its composition, we will find some magnetic material.
  2. that by throwing the dice 6000 times we will get a lot more than 1000 sixes.

Experiment #1 is very destructive and might not be desirable. Experiment #2 is actually what we did - we went on playing backgammon and found that the dice didn't register significantly more sixes than normal dice.
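A hedged sketch of what prediction 2 amounts to, using a fair dice for comparison. The simulated throws stand in for the real game, and the cut-off of three standard deviations is my own illustrative choice, not a rule from this essay.

```python
import math
import random

throws = 6000
p_six = 1 / 6

# For a fair dice, the number of sixes in a fixed number of throws is binomial.
expected = throws * p_six                              # 1000 sixes
spread = math.sqrt(throws * p_six * (1 - p_six))       # standard deviation

rng = random.Random(7)  # fixed seed so the run is repeatable
sixes = sum(rng.randint(1, 6) == 6 for _ in range(throws))

print(round(expected), round(spread, 1))               # 1000 28.9
print(sixes)
# A count far above expected + 3 * spread would count against fair dice.
print(sixes > expected + 3 * spread)
```

With 6000 throws, a fair dice should rarely stray more than a hundred or so from 1000 sixes, which is why a long run of play says far more than three throws do.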

But aren't you contradicting yourself? You said that throwing the dice over and over again doesn't prove anything!

No, it doesn't prove anything but it may strongly indicate that the hypothesis is either true or false.

It's really about two things:

  1. Dataset size - you need to run your experiment with a large sample population. Three throws of the dice are not really enough, hundreds are better and thousands are better still. In the Florida plague example, the sample population was only 22. In statistical terms, 22 is a very feeble number. You can't say a lot with much degree of confidence from a population of this kind of size unless you go for confidence limits way in excess of the traditional 95% and 99% confidence limits.
  2. You need to know what you are looking for before you run the experiment - you need to have predetermined goals.

Predetermined goals

Because we know that any set of results with the dice was possible, we cannot look at the results that have already turned up and then say that our hypothesis about the nature of the dice has been tested. This is because we already know what has happened. Provided that we make a hypothesis that explains the facts, that hypothesis is bound to remain untested. This is a difficult point to get across.

Let me illustrate this with the Florida example. The researchers analysed the lives of the PwMS in the sample population and found that a number of them had Siamese cats. It is almost certain that they found something that these people had in common with each other but that was very slightly unusual. I'm not saying that owning a Siamese cat is unusual in a weird way, just that only a minority of people in any town are likely to own a Siamese cat.

There are a great many things that are unusual in this way: working as a computer programmer, owning a model railway set, eating bacon for breakfast, wearing nylon vests, growing up on a farm and so on. In fact, most of the things that we do are unusual in the sense that a minority of all the people in the world do them.

We all do an awful lot of things - over the course of our lifetimes, we do many hundreds of thousands of things. So if you take any group of 22 people, you will almost certainly find at least two things that they all do which are unusual in the way just described.

The problem comes when you ascribe a causal relationship between those two things. This is because you haven't tested it. We know we can always find such pairings and we know also that they are usually random. If we want to test the hypothesis that they are related, then we must go out and get a new population plus a control group and test the hypothesis on them.

This is one reason why experimental results must always be replicated.

----Selective Sampling - Interpretation - cause and effect ----
