2. Future studies should use additional measures of average and variation.

If there was one billionaire living in a small village of 1000 people, would it be fair to say that the average person in the village was a millionaire? If all the money was shared out equally, then each person in the village would be a millionaire (the usual way of working out an average). They would probably be quite happy with that, but it would be very unlikely to happen. So does it give a fair picture?

This way of calculating an average, by adding everything together and sharing it all out, is called the mean (or to be precise, the arithmetic mean). There is also a calculation that gives a measure of the variation between the values, and that is called the standard deviation.

For adult males in the U.K., the mean (average) height is 5 feet 10 inches, with a standard deviation of 3 inches. If we add and subtract the standard deviation from the mean, we get a range of 5 feet 7 inches to 6 feet 1 inch, and roughly two-thirds of men in the U.K. will fall within those heights. That is because the distribution of heights is Normal: there are the same numbers and variations of height on either side of 5 feet 10 inches (a less confusing term than Normal is Gaussian).

Consider these numbers - 10, 10, 10, 10, 10, 10, 10, 10, 10, and 0. The mean of these numbers is 9 and the standard deviation is 3·2. Suppose this was the number of fingers that each of ten people had. If we used the mean and standard deviation to give us an idea of average and range, we would say that on average the group had 9 fingers, and the normal range was from 6 to 12 (3·2 either side of 9).
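For readers who want to check those figures, here is a minimal sketch in Python (standard library only), using the ten values above; the quoted range of 6 to 12 is simply 5·8 to 12·2 rounded:

```python
from statistics import mean, stdev

# The ten values from the example above: nine people with ten fingers, one with none.
fingers = [10] * 9 + [0]

avg = mean(fingers)       # arithmetic mean: 9
spread = stdev(fingers)   # sample standard deviation: about 3.2

print(f"mean = {avg}, standard deviation = {spread:.1f}")
print(f"range = {avg - spread:.1f} to {avg + spread:.1f}")   # 5.8 to 12.2 (rounded to 6 and 12 in the text)
```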

But of course you have already dismissed this: you know that it would be daft to say that the average person has 9 fingers, and that most people have between 6 and 12.

Using the mean and standard deviation here is wrong, because, unlike with men's heights, the distribution of values is very unbalanced.

The data used in the PACE trial is also very unbalanced (as it is in many other studies). In circumstances like that we have to use other methods to give a fair and clear picture. It isn't statistics that is at fault, but a poor choice of method.

Regrettably, we tend to take statistics for granted, especially when it comes to averages, but even averages are much more complex and potentially deceptive than they appear. When we talk about an average, we normally think of the arithmetic mean, but we have other choices. The problem with the mean, and even more so with the standard deviation, is that they are greatly influenced by unusual values (such as the billionaire).

For what are known as Normal (Gaussian) distributions (such as people's height, weight or I.Q.) the pattern is well understood, and we know that if we add and subtract the standard deviation from the mean, we get a range of values covering the middle two-thirds of people, as stated above for heights. But most collections of data do not fit this distribution. Many are very lopsided, or skewed, with a very long tail on one side (such as income distribution - there is a large clump of people on typical wages, but a very long tail of a small number of people getting very large incomes).
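As a quick, hedged check of that two-thirds rule, here is a small simulation sketch in Python. The heights are simulated, not real survey data; we simply use the mean of 70 inches (5 feet 10) and standard deviation of 3 inches quoted above:

```python
import random
from statistics import mean, stdev

# Simulated adult male heights in inches, drawn from a Normal (Gaussian) distribution
# with mean 70 (5 feet 10) and standard deviation 3, as quoted above.
random.seed(0)
heights = [random.gauss(70, 3) for _ in range(50_000)]

m, sd = mean(heights), stdev(heights)
within_one_sd = sum(m - sd <= h <= m + sd for h in heights) / len(heights)
print(f"fraction within one standard deviation of the mean: {within_one_sd:.1%}")   # about 68%
```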

This is a problem with the average/mean levels reported in the PACE trial. The data is heavily skewed, and that moves the mean to an inappropriate level. The mean is like the balance point on a seesaw. If one of the arms is very short, and the other relatively long, it takes a lot of people on the short end to balance one out on the long end. If one patient improves by quite a lot, it takes many patients remaining pretty much as they are to keep the average down.

For the group that only received Specialist Medical Care, the average/mean score on the fatigue questionnaire dropped by 16%, from 28.3 to 23.8 out of 33, but the measure of variation (the standard deviation) increased by 91%. We do not have access to the actual data, but we do know that this pattern can only be obtained when most of the group improve by only a little, if at all, and a very few (say 1 in 8) do much better.
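We cannot reproduce the trial's figures, but a toy sketch with invented scores shows the mechanism at work: a single large improver barely moves the mean, yet inflates the standard deviation dramatically. The numbers below are purely illustrative and are not the PACE data.

```python
from statistics import mean, stdev

# Purely illustrative numbers (NOT the trial data, which is unavailable): eight hypothetical
# fatigue scores out of 33, where seven patients barely change and one improves a great deal.
baseline  = [24, 26, 27, 28, 29, 30, 31, 31]
follow_up = [23, 25, 26, 27, 27, 28, 29, 8]

for label, scores in (("baseline", baseline), ("follow-up", follow_up)):
    print(f"{label:10s} mean = {mean(scores):5.2f}  sd = {stdev(scores):5.2f}")

# The single large improver pulls the mean down only modestly,
# but inflates the standard deviation dramatically.
```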

If a distribution is well balanced on either side, then using the arithmetic mean as an average is fair and clear. But when distributions are skewed, or the measurements themselves are not fairly linear, other types of average are more appropriate (and mathematicians have many to choose from). The one that should also be quoted in situations like this is the median - the middle value - and this is done in good-quality reports where distributions are skewed.

We are pleased to see that the recent study by Nacul et al. (mentioned in 1-details), undertaken as part of the M.E./C.F.S. Observatory Research Programme, and looking at the functional status of people with M.E., specified both sets of averages in table 2 on the physical function S.F.-36 scores: the medians were consistently below the means (e.g. 25.0 versus 30.1), as is typical of a strongly skewed set of data.

Using the mean as an average in a situation like this consistently overstates the effectiveness of the treatments or therapies for the majority of the patients, but more importantly, it draws attention away from the clustering at the bottom end.

Into the Mathematics:

In 2007/2008 what would you guess the average income was for people in the UK - £18500 or £26800?

Obviously you are looking for the catch now, but what exactly do we mean by average? Most of us, when asked to find an average of eleven numbers, say 3, 3, 4, 5, 7, 8, 8, 8, 9, 9, and 10, would add them all together then divide by 11, in this case to get 6·7.

The term average simply means a representative value: we have several different ways of defining an average in maths, and this method is properly known as the arithmetic mean.

There is another calculation that gives a measure of the spread of the marks, called the standard deviation (in this case it is 2·5). It is usually added to and subtracted from the mean to give a spread of values which would contain approximately the middle two-thirds of a larger, well-balanced set of data (in this example, 4·2 to 9·2).

Another method of finding an average is to take the middle value (the median) and use that. The middle values of the two separate halves are then known as the quartiles (the quarter-way marks). Here this would give 8 as the median, and the marks 4 and 9 would be the quartiles, meaning that half of the marks lie between 4 and 9. For a small sample like this we do not nit-pick the fact that you cannot have half of eleven results.
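As a check on the arithmetic, here is a short Python sketch (standard library only) that reproduces the mean, standard deviation, median and quartiles for those eleven marks:

```python
from statistics import mean, stdev, median, quantiles

marks = [3, 3, 4, 5, 7, 8, 8, 8, 9, 9, 10]

m, sd = mean(marks), stdev(marks)
print(f"mean = {m:.1f}, sd = {sd:.1f}")                 # 6.7 and 2.5
print(f"mean +/- sd = {m - sd:.1f} to {m + sd:.1f}")    # about 4.2 to 9.3 (the text adds the rounded figures)

q1, q2, q3 = quantiles(marks, n=4)   # the default ("exclusive") method matches the half-and-half description above
print(f"median = {median(marks)}, quartiles = {q1:g} and {q3:g}")   # 8, 4 and 9
```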

The advantage of this method is that it genuinely gives you a central figure that is not distorted by extreme values, but of course it doesn't actually use all of the figures (which would be important, say, in looking at scores in sports).

Obviously, this is not a worthwhile exercise for such a small set of numbers - it is simpler just to look at the actual data.

In the U.K. in 2007/2008 the arithmetic mean income was £26,800, whereas the median income - the income of the middle person - was £18,500. Like our example of the billionaire living in a village, the arithmetic mean was boosted by a small number of people earning very large salaries. Unless you watched the programme on incomes by Jon and Dan Snow in 2008, you are probably surprised at how low the median income is. When a distribution is evenly balanced, the median and the mean work out at the same value, but when a distribution is skewed the more extreme values have a disproportionate effect on the mean, and even more so on the standard deviation.

The lower and upper quartiles for income in 2007/2008 were £11,800 and £29,500, which means that half the country earned between those two figures. Probably a little under 70% of the country had an income below the arithmetic mean of £26,800. The standard deviation is even more strongly affected by the few very high incomes, and works out at around £29,500. If we combine that with the mean to say that typical wages are £26,800 plus or minus £29,500, it clearly does not make sense.
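To see how a long upper tail produces exactly this pattern, here is a small simulation sketch. The sample is synthetic (a lognormal distribution with arbitrarily chosen parameters), not the official income data, but it shows the mean sitting well above the median, a very large standard deviation, and well over half the sample earning below the mean:

```python
import random
from statistics import mean, median, stdev, quantiles

# A synthetic, roughly income-shaped (lognormal) sample, purely to show how a long
# upper tail separates the mean from the median. These are NOT the official figures.
random.seed(1)
incomes = [random.lognormvariate(9.8, 0.7) for _ in range(50_000)]

m, med, sd = mean(incomes), median(incomes), stdev(incomes)
q1, _, q3 = quantiles(incomes, n=4)
below_mean = sum(x < m for x in incomes) / len(incomes)

print(f"mean = {m:,.0f}   median = {med:,.0f}   sd = {sd:,.0f}")
print(f"quartiles: {q1:,.0f} to {q3:,.0f}")
print(f"share of the sample earning below the mean: {below_mean:.0%}")   # well over half
```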

The reason the arithmetic mean and standard deviation are so prevalent in statistics is that, because they are the result of mathematical calculations, further calculations, such as tests of significance, can be performed using them. But if the distribution concerned is heavily skewed, they are not good indicators of typical values.

There is an instinctive belief that the arithmetic mean is somehow "more accurate" because it is the result of a calculation and can be quoted to a number of decimal places. The term "accurate" is misleading, though: we are looking for something that is truly representative, and that means a value judgement has to be made. If a distribution is balanced, the mean and median are close to each other, and either is appropriate. If not, then the median gives you a better idea of a typical value, whereas the mean is appropriate when it would be wrong to exclude rarer high or low scores (e.g. working out a sporting average). The same is true of using the quartiles or the standard deviation as a measure of spread. In general, the median and quartiles are better for a general description, and the mean and standard deviation are best reserved for use where further calculations are necessary.

Why is all of this relevant to our discussion? Simply because much of the data in the PACE trial, and in other studies of ME/CFS, is skewed, so great care must be taken to ensure that the values quoted are truly representative and not deceptively large.

When the Chalder Fatigue Scale was used for people with ME/CFS in the PACE trial, the scores were heavily weighted towards the bottom/very fatigued end, and so are heavily skewed. There are several ways to measure how skewed a distribution is; in fact the Chalder Scale turns out to be as skewed as the income distribution above. Using the mean and standard deviation to summarise those results does not give a good indication of typical values. The use of medians and quartiles (or even key percentiles, as the Office for National Statistics does) would have given a much more realistic idea. It is likely that the distribution of improvements in each of the other measures used in the PACE trial is similarly skewed: just as very high earners boosted the mean income, a small proportion of patients who showed great improvement would have had a disproportionate effect and would have boosted each mean score.
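There are indeed several ways of measuring skew. As one simple illustration (not necessarily the measure used for the comparison above), Pearson's second skewness coefficient compares the mean with the median, scaled by the standard deviation:

```python
from statistics import mean, median, stdev

def pearson_skew(data):
    # Pearson's second skewness coefficient: 3 * (mean - median) / sd.
    # Close to zero for a balanced distribution; far from zero when skewed.
    return 3 * (mean(data) - median(data)) / stdev(data)

symmetric = [4, 5, 5, 6, 6, 6, 7, 7, 8]   # balanced either side of 6
lopsided  = [10] * 9 + [0]                # the "fingers" example from earlier

print(f"balanced set:  skew = {pearson_skew(symmetric):+.2f}")   # +0.00
print(f"lopsided set:  skew = {pearson_skew(lopsided):+.2f}")    # about -0.95
```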

Even more Mathematics:

Understanding how standard deviation works is often essential when assessing many medical studies. The authors of the PACE study, for example, decided to set the boundary for "normal scores" at 18 or less on the Chalder Fatigue Scale, by adding the standard deviation of 4·6 to the mean of 14·2 calculated from a sample of patients attending doctors' surgeries. This, quite simply, is wrong.

The explanation is fairly straightforward, and easily understood, but does need to start with a simple example. Let us use the example of nine people with ten fingers each, and one with none. First we calculate the mean: the total is 90 fingers: then we divide by ten to get the mean/average of 9 fingers per person.

Next we work out how far each person is from that mean value. Nine of them have one finger more than the average, and one person has nine fingers less. If we were to add up all those values of 1 together with the minus 9, it would come to zero: that is because the mean is the balancing point. So we square all the values to make them positive, and get 1, 1, 1, 1, 1, 1, 1, 1, 1, and 81. Notice how important that last, extreme figure has become.

Now we total these numbers to get 90; then, perhaps surprisingly, we divide by 9 rather than 10. That is because we are looking at differences: the ten values are rather like ten fence posts, between which we can only fit nine fence panels, because there are really only nine independent differences.

That gives us 10. We then find the square root of 10 (to "undo" the squaring that we did earlier), which gives 3.2.
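The same steps, written out as a short Python sketch:

```python
from math import sqrt

fingers = [10] * 9 + [0]

m = sum(fingers) / len(fingers)               # mean: 90 / 10 = 9
deviations = [x - m for x in fingers]         # nine values of +1 and one of -9 (they sum to zero)
squares = [d ** 2 for d in deviations]        # nine 1s and one 81
variance = sum(squares) / (len(fingers) - 1)  # divide by 9, not 10: 90 / 9 = 10
sd = sqrt(variance)                           # "undo" the squaring: about 3.2

print(m, sd)   # 9.0 3.162...
```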

But the damage by squaring that original extreme value has been done. We now have a large standard deviation of 3.2.

With a Normal (Gaussian) distribution, which we would get if we considered the heights of adult males, there are not many extreme values, and most heights are clumped equally on either side of the middle. So if we add and subtract the standard deviation of 3 inches from the mean height of 5 feet 10 inches, we can estimate that two-thirds of the population lie between 5 feet 7 inches and 6 feet 1 inch.

This idea of mean plus or minus the standard deviation is often used to define what we would call normal, or everyday measurements. But it only really applies to Gaussian Distributions. If the distribution is of an unusual shape, we are unable to judge what this calculation would give us.

In the example of incomes (on the previous page), the mean was £26,800 and the standard deviation was around £29,500 (this is a calculated estimate, and, if anything, is too small: we used a ceiling income of £330,000 in the calculation). If we add and subtract the standard deviation from the mean, we get an income range of minus £2,700 (a negative amount, meaning that the employee pays £2,700 for the privilege of working) to £56,300. This could hardly be used to represent the range of everyday incomes, as it covers around 93% of all incomes. The distribution is, of course, heavily skewed, with a very small number of people having very large incomes.

The "finger" example on the "more" page (where the average of 10, 10, 10, 10, 10, 10, 10, 10, 10, and 0 is 9) is a rather silly one, but instead think of it as being the set of class marks in a statistics test. What would you report to the parents as a normal mark out of 10 in the class - would you ignore the mark of zero and say 10, or would you use the mean and standard deviation and say between 6 (9–3·2) and 12 (9+3·2) out of 10 was normal? This is, in fact, a major problem in education when it comes to setting pass marks at examinations. In the 1970s only 20% of each year's U.K. intake were allowed to gain a pass at O-level Mathematics, whereas 40% were allowed to gain a pass at O-level English (which is why we have so many people who think they are so much worse at Mathematics than English). Since the inception of C.S.E. and then Key Stage tests, examiners have moved away from these percentages to setting certain expected standards for each grade, and these are decided by experienced examiners and teachers. It is a very difficult task, but there is no way that they would use means and standard deviations to define levels, as these are so easily manipulated by the entry of additional weak candidates. If several schools suddenly entered significant numbers of poor students, it would become much easier to pass.

This is exactly what has happened with the target pass mark of 18 points set on the Chalder Fatigue Scale. To determine that, they used data from Pawlikowska and from Cella, which included a disproportionate number of ill patients, to calculate the mean and standard deviation (14 and 4 respectively), which they added to produce the boundary of 18 points (on this scale, large scores mean lots of fatigue). Examiners, who have been determining standards for many years, would be aghast at this - it would mean that the ill patients, like weak students, would have a strong influence in lowering standards. The only professional way to determine this is to examine the scale, completed by many different people in various stages of good health and illness, and decide where the actual boundary is. This may well be what the authors did at the start of the study when, using a different scoring system for the scale (more of that in the next section), they set the target at halving the fatigue score. They also added a spurious alternative: scoring 3 or less (anyone scoring 3 or less would have halved their score). Whatever method they used to determine the boundary, it is very clear that the final target in the study was much easier to attain than the target in the agreed protocol.

Surely doctors and specialists have enough skill to be able to agree amongst themselves, just as teachers and examiners do, about where the borderlines should be drawn, rather than simply using inappropriate calculations? Good health is a decision about quality: it is not a statistical calculation.