* Department of Neurosurgery, University Hospital of Alexandroupolis, Greece* Department of Neurosurgery
Introduction
The use of statistics in the biomedical arena has increased considerably during the last six decades, and the types of statistics used have become much more complex (1). This growth in the use of statistics has been accompanied by an increase in statistical errors (2-4). The most commonly encountered problem is the analysis of ordinal data with parametric methods, which are not meant to deal with such data (5). This in turn calls the results into question and can lead to wrong conclusions (6).
The Harvard psychologist S.S. Stevens (1946) defined measurement as the assignment of numerals to objects or events according to rules and suggested that different rules result in different kinds of scales and different kinds of measurement. He also described four types of scales: nominal, ordinal, interval, and ratio (7). The application of this typology to statistics has been reported to raise many subtle problems (8). For example, in the ordinal scale the numbers represent rank-orders and do not provide any information regarding the differences between adjacent ranks (9, 10). However, each value indicates a progressively lesser or greater ranking of the property described by the scale (6).
The Glasgow Coma Scale (GCS) and the Acute Physiology and Chronic Health Evaluation (APACHE) II scoring system are ordinal data scales (11). The first was introduced in 1974 by Teasdale and Jennett as a means to evaluate the depth and duration of coma and impaired consciousness (12). The second is a revision of the initial APACHE system presented in 1981 by Knaus et al. as a severity of disease classification system based on numerous physiological variables (13). Since their publication, a vast number of researchers have employed them in the evaluation of head-injured patients.
The mean, standard deviation, analysis of variance and Student’s t-test are frequently used in the analysis of ordinal data (14). Yet, all of them are parametric in nature and not appropriate for such data (5, 6, 15), even though the opposite has also been suggested (16-19). Moreover, many nonparametric methods have been proposed to deal with ordinal data (10, 20). The aim of the present study was to assess the distribution of GCS and APACHE II data collected from a population of head trauma patients admitted to the intensive care unit (ICU) of a tertiary hospital in a town of northern Greece. The formulated hypothesis was that these ordinal data followed the normal distribution. If so, they would fulfill a major assumption of parametric statistical methods and could thus be correctly analyzed with parametric tests.
Materials and methods
The medical charts of 185 patients with traumatic brain injury who were admitted consecutively to the mixed ICU of the University Hospital of Alexandroupolis, Greece, during the decade 1994-2003 were retrospectively examined. The medical findings at admission to the ICU were recorded using the GCS and the APACHE II scoring system. The inclusion criteria were the absence of any other kind of injury and properly completed medical records including total GCS and APACHE II scores.
The mean and standard deviation for both GCS and APACHE II scores were calculated for illustration only, since they are not proper descriptors of ordinal data (4, 14) due to the lack of a consistent level of magnitude between numeric units of the scale (21). The mode and median were also determined. Normality testing was conducted with the χ² goodness of fit test, the one-sample Kolmogorov-Smirnov test (Z) (2-tailed p), the Lilliefors test (ZL), the Shapiro-Wilk test (W), the rule of thumb (the ratio of median to mean), and the calculation of skewness and kurtosis (including the quotient range/standard deviation and the Pearsonian coefficient of skewness). The level of statistical significance was set at 0.05. Finally, frequency histograms and normal probability (P-P) plots were constructed for the two scales. For the P-P plots, Blom’s fractional rank estimation method was employed.
The χ² goodness of fit test was conducted using S-PLUS 2000 Professional software for Windows (22), while all other tests, histograms and P-P plots were produced with the Statistical Package for Social Sciences (SPSS) v. 15.0 for Windows (23). The rule of thumb, the ratio range/standard deviation, and the Pearsonian coefficient of skewness were calculated manually.
Results
The inclusion criteria were met by 74 patients. The male:female ratio was 9:1, the mean age was 45.19±2.555 years, and the main cause of injury was road traffic accidents (75.7% of cases). Some patients were sedated while others were recovering from general anesthesia.
GCS and APACHE II data are summarized in Table 1. For GCS, the median and mode coincided at a value of 3 and the data had a J-shaped distribution. A GCS of 3 was recorded in 60.8% of the patients and a score of 15 in 12.2%, while the remaining 27% had scores almost uniformly distributed across the rest of the scale. For APACHE II, multiple modes existed and the smallest value is shown. The range of APACHE II scores was 38. The variance for GCS was 22.707 and for APACHE II 58.607. The standard error for skewness was 0.279 and for kurtosis 0.552 for both GCS and APACHE II.
The ratio of median to mean for GCS and APACHE II was 0.469 and 1.069 respectively. The corresponding values of three times the standard deviation were 14.31 and 22.98. The ratio of range to standard deviation for GCS was 2.516, and for APACHE II 4.961 (with critical values of 4.11-5.68). The Pearsonian coefficient of skewness gave a value of 2.509 for GCS and of -1.140 for APACHE II.
The χ² and p values are shown in Table 2, while the Kolmogorov-Smirnov, Lilliefors and Shapiro-Wilk statistics are presented in Table 3. For the χ² goodness of fit test the critical tabular χ² value was 16.92 (for 9 degrees of freedom (df) and level of significance α=0.05). In the one-sample Kolmogorov-Smirnov test the maximum differences for GCS were 0.370 (positive) and -0.238 (negative) and for APACHE II 0.073 and -0.082.
In contrast to the APACHE II data, which were normally distributed, the GCS data were not normally distributed. The lack of normality for GCS and the normality of the APACHE II data distribution are shown graphically in the frequency histograms (Fig. 1 and 2) and the normal P-P plots (Fig. 3 and 4).
Discussion
Parametric statistical methods concern parameters of distributions. In order to apply these methods, specific conditions about the distributions must be verified (10). In practice, these tests are applied when the sampling distributions of the data variables reasonably satisfy the normal model (24). Nonparametric tests, on the other hand, make no assumptions regarding the distributions of the data variables; only a few mild conditions must be satisfied. They are also suitable for small samples, for which a parametric test would require the distributions to be known precisely (9). In addition, nonparametric tests often concern different hypotheses about populations in comparison to parametric tests (14). Finally, unlike parametric tests, there are nonparametric tests that can be applied to ordinal and nominal data (5,25). Examples of nonparametric tests are the Kolmogorov-Smirnov, Mann-Whitney, Wilcoxon, Kruskal-Wallis, and Friedman tests (15,20,26).
Ordinal data are a specific form of categorical data, where the order of the response categories is of importance (5). Data measured on an ordinal scale are distinguished from one another on the basis of the relative amounts of some characteristic they possess (8). Yet, the differences between rankings are not necessarily equal (10). For example, a difference of one point in GCS between scores 14 and 15 is not the same as a one-point difference between scores 8 and 9, given that a score of 8 is the critical score for intubating head-injured patients. Similarly, APACHE II scores of 3 and 4 on the one hand, and of 27 and 28 on the other, each differ by one point. Nevertheless, the one-point difference in the second case would have a greater impact on mortality. Furthermore, the mean of such data seems meaningless. For example, only 1.4% of the patients studied had a GCS score of 6 and only 2.7% had an APACHE II score of 16, with corresponding means for GCS and APACHE II of 6.39 and 15.91. Thus, a “typical” patient is not easy to find.
The χ² goodness of fit test (or χ² test for a single sample) tests whether a significant difference exists between the observed number of objects or responses falling in each category and the number expected under the null hypothesis (in our case, a gaussian distribution of the data) (9,15). Our findings showed that for GCS data χ² was 297.676 > 16.92 (tabular value) (p=0.000), so normality of the distribution was rejected, while for APACHE II data χ² was 10.324 < 16.92 (p=0.325), so normality was not rejected. This finding is in accordance with the results of the tests described below and with other studies as well (6). At this point, it should be emphasized that the literature contains few studies on this topic, i.e. the distribution of GCS and APACHE II data.
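The comparison of observed and expected counts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counts and bin boundaries are hypothetical (they are not the study data), the function names are invented for this example, and df is taken as the number of bins minus one, which is why the tabulated value for 4 df is used here instead of the 16.92 quoted for 9 df.

```python
from statistics import NormalDist


def normal_expected_counts(edges, mean, sd, n):
    """Expected counts per bin under N(mean, sd); both tail bins are open-ended."""
    dist = NormalDist(mean, sd)
    cdf = [0.0] + [dist.cdf(e) for e in edges] + [1.0]
    return [n * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]


def chi_square_statistic(observed, expected):
    """Pearson's goodness-of-fit statistic: sum of (O - E)^2 / E over the bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical J-shaped counts for 74 patients in five score bins
# (<=3, 4-6, 7-9, 10-12, >=13); these are NOT the study data.
observed = [45, 6, 6, 6, 11]
edges = [3.5, 6.5, 9.5, 12.5]  # assumed bin boundaries
expected = normal_expected_counts(edges, mean=6.39, sd=4.77, n=sum(observed))

statistic = chi_square_statistic(observed, expected)
critical = 9.488  # tabulated chi-square value for 4 df at alpha = 0.05
reject_normality = statistic > critical
print(f"chi-square = {statistic:.2f}, reject normality: {reject_normality}")
```

A statistic above the critical value leads to rejection of the normal model, which is the decision rule applied to the GCS data above.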
The Kolmogorov-Smirnov one-sample test is also a test of goodness of fit, and is regarded as a nonparametric analog of the unpaired t-test (15). It is concerned with the degree of agreement between the distribution of a set of sample values (observed scores) and some specified theoretical distribution (expected scores). In other words, it determines whether the scores in the sample can reasonably be thought to have come from a population having the theoretical distribution. In short, this test specifies the cumulative frequency distribution which would occur under the theoretical distribution (in our case the normal one) and compares it with the observed cumulative frequency distribution (27). Yet, some authors report drawbacks: it performs badly on data with single outliers, 10% outliers and skewed data at sample sizes < 100 (28). We found a Kolmogorov-Smirnov statistic (Z) of 3.181 (p=0.000) for GCS and 0.704 (p=0.704) for APACHE II, suggesting that only APACHE II data followed the gaussian distribution.
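As a rough check of how Z relates to the maximum differences given in the Results, the sketch below scales the largest absolute difference by the square root of the sample size and evaluates the asymptotic Kolmogorov series for the two-tailed p value. The function names are invented for this example, and the use of the asymptotic series is an assumption; the software used in the study may compute p differently.

```python
import math


def ks_z(d_max, n):
    """Kolmogorov-Smirnov Z: the largest absolute difference scaled by sqrt(n)."""
    return math.sqrt(n) * d_max


def kolmogorov_p(z, terms=100):
    """Asymptotic two-tailed p value: 2 * sum((-1)^(k-1) * exp(-2 k^2 z^2))."""
    if z < 0.27:
        return 1.0  # p is effectively 1 for such small Z; the series is unreliable here
    total = sum((-1) ** (k - 1) * math.exp(-2 * k * k * z * z)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * total))


n = 74  # patients meeting the inclusion criteria
z_gcs = ks_z(0.370, n)     # largest absolute difference reported for GCS
z_apache = ks_z(0.082, n)  # largest absolute difference reported for APACHE II
p_gcs = kolmogorov_p(z_gcs)
p_apache = kolmogorov_p(z_apache)
print(f"GCS: Z = {z_gcs:.3f}, p = {p_gcs:.3f}")
print(f"APACHE II: Z = {z_apache:.3f}, p = {p_apache:.3f}")
```

Because the inputs are the rounded differences from the text, the results agree with the reported Z and p values only to about two decimal places.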
The Lilliefors test for normality is a modification of the Kolmogorov-Smirnov test for use when the mean and variance of the population are unknown. It allows researchers to use estimates of the unknown population parameters when performing hypothesis tests in which the population specified in the null hypothesis is either normally or exponentially distributed (9, 10). In our case, the calculated statistic (ZL) was 0.370 (p=0.000) for GCS and 0.082 (p=0.200) for APACHE II, thus confirming the non-normality of the GCS data distribution.
As an alternative to the Lilliefors test, the Shapiro-Wilk statistic (W) is often calculated. In this test non-integer weights are specified, and the W statistic can also be viewed as the square of the correlation coefficient obtained from a normal P-P plot, hence the notion of a correlation test. Values of W close to one indicate normality, while values smaller than unity indicate non-normality (29). GCS data gave a W of 0.690 (p=0.000) (far from one), while APACHE II data gave a W of 0.977 (p=0.196) (very close to one).
Sachs (1984) mentions a rule of thumb for roughly checking the normality of data. This rule states that when 0.9 < (median/mean) < 1.1 and 3*standard deviation < mean, a sample distribution is assumed to be approximately normally distributed (30). For the GCS data the ratio was 0.469, which does not lie between 0.9 and 1.1, and 3*4.77=14.31, which is greater than the mean (6.39). For the APACHE II data the ratio was 1.069, which lies between 0.9 and 1.1, but 3*7.66=22.98, which is greater than the mean (15.91). Thus, neither data set satisfied both conditions of the rule.
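The two conditions of the rule of thumb can be written out as a short Python sketch. The function name is invented for this example, the means and standard deviations are the values quoted in the text, and the APACHE II median of 17 is an assumption inferred from the reported ratio of 1.069.

```python
def rule_of_thumb(mean, median, sd):
    """Return (ratio, 3*sd, verdict) for the rough normality check quoted from Sachs."""
    ratio = median / mean
    three_sd = 3 * sd
    verdict = 0.9 < ratio < 1.1 and three_sd < mean
    return ratio, three_sd, verdict


# GCS: mean, median and standard deviation as reported in the text.
gcs = rule_of_thumb(mean=6.39, median=3, sd=4.77)
# APACHE II: a median of 17 is an assumption, inferred from the reported ratio of 1.069.
apache = rule_of_thumb(mean=15.91, median=17, sd=7.66)

for name, (ratio, three_sd, verdict) in (("GCS", gcs), ("APACHE II", apache)):
    print(f"{name}: median/mean = {ratio:.3f}, 3*SD = {three_sd:.2f}, passes: {verdict}")
```

Both conditions must hold for the verdict to be true, so a data set with an acceptable ratio can still fail on the second condition.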
In addition, skewness and kurtosis have been used in testing normality (31). Skewness reflects the degree to which a distribution is asymmetrical (32). The best example of a symmetrical distribution is the bell-shaped normal distribution, in which the mean, median and mode always have the same value and the skewness equals 0 (14). In asymmetrical distributions some scores fall either to the left or to the right of the middle of the distribution, and the mean, median, and mode do not have the same value. A distribution with a significant positive skewness has a long right tail. A distribution with a significant negative skewness has a long left tail. As a guideline, a skewness value more than twice its standard error is taken to indicate a departure from symmetry (9). The skewness for GCS was 0.921 and for APACHE II only 0.125 (standard error of 0.279 for both scales). The Pearsonian coefficient of skewness (sk) is given by the equation: 3*(mean-median)/standard deviation. Its values fall within the range -3 to +3, where 0 means a symmetrical distribution. When the mean is greater than the median, sk will be a positive value, and the larger the value of sk the larger the degree of positive skew. Conversely, when the median is greater than the mean, sk will be a negative value, and the larger the absolute value of sk, the larger the degree of negative skew (30). In our data, sk was 2.509 for GCS and -1.140 for APACHE II, demonstrating that GCS data are positively skewed (skewed to the right) and APACHE II data negatively skewed (skewed to the left).
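The sign behaviour of the Pearsonian coefficient can be illustrated with a minimal Python sketch. The scores below are hypothetical and are not the study data, the function name is invented for this example, and the sample standard deviation (n-1 denominator) is assumed.

```python
from statistics import mean, median, stdev


def pearson_skewness(values):
    """Pearsonian coefficient of skewness: 3 * (mean - median) / standard deviation."""
    return 3 * (mean(values) - median(values)) / stdev(values)


# Hypothetical scores, not the study data.
right_tailed = [3, 3, 3, 3, 4, 5, 7, 10, 15]
symmetric = [1, 2, 3, 4, 5]

print(f"right-tailed sample: sk = {pearson_skewness(right_tailed):.3f}")
print(f"symmetric sample:    sk = {pearson_skewness(symmetric):.3f}")
```

A sample with a long right tail has a mean above its median and therefore a positive coefficient, while a symmetric sample gives a coefficient of zero.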
Kurtosis reflects the degree to which a distribution is peaked, that is, it provides information regarding the height of a distribution relative to its standard deviation (31,32). For a gaussian distribution, the value of the kurtosis statistic is zero (9). Positive kurtosis indicates that the observations cluster more and have longer tails than those in the normal distribution (inadequately occupied flanks with a surplus of values near the mean and in the tails of the distribution), and negative kurtosis indicates that the observations cluster less and have shorter tails (the maximum is lower, the bell is more squat, and the distribution is flatter than the gaussian one) (30). All the above are valid for both scales studied. GCS demonstrated a kurtosis statistic of -0.908 and APACHE II a statistic of 0.336.
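The standard errors quoted for skewness and kurtosis depend only on the sample size. The sketch below uses the usual large-sample formulas; that these are the formulas behind the reported values of 0.279 and 0.552 is an assumption, although they reproduce both figures for n = 74. The function names are invented for this example.

```python
import math


def se_skewness(n):
    """Standard error of sample skewness for a sample of size n."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n):
    """Standard error of sample excess kurtosis for a sample of size n."""
    return 2 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))


def exceeds_twice_se(statistic, se):
    """Guideline from the text: flag a statistic larger than twice its standard error."""
    return abs(statistic) > 2 * se


n = 74  # patients meeting the inclusion criteria
ses, sek = se_skewness(n), se_kurtosis(n)
print(f"SE skewness = {ses:.3f}, SE kurtosis = {sek:.3f}")
print("GCS skewness departs from symmetry:", exceeds_twice_se(0.921, ses))
print("APACHE II skewness departs from symmetry:", exceeds_twice_se(0.125, ses))
```

The twice-the-standard-error guideline is stated in the text for skewness only, so the example applies it to the two skewness values and not to kurtosis.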
The ratio range/standard deviation is also used as a crude measure of normality. If in a sample this ratio is less than the theoretical lower bound or greater than the theoretical upper bound (both critical values are available in tables), then at the given significance level the sample does not come from a normally distributed population (30). The above ratio was found to be 2.516 for GCS and 4.961 for APACHE II, with theoretical critical values of 4.11 and 5.68. Thus, only APACHE II data seemed to present a normal distribution.
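The range/standard deviation check amounts to one division and a comparison with the tabulated bounds, as the following Python sketch shows. The function names are invented for this example, and the GCS range of 12 is an assumption inferred from the fact that scores of both 3 and 15 occurred in the sample.

```python
def range_sd_ratio(value_range, sd):
    """Quotient of the sample range and the sample standard deviation."""
    return value_range / sd


def within_bounds(ratio, lower=4.11, upper=5.68):
    """True when the ratio lies between the tabulated critical values from the text."""
    return lower <= ratio <= upper


# GCS: a range of 12 (scores 3 to 15) is an assumption inferred from the Results.
gcs_ratio = range_sd_ratio(12, 4.77)
# APACHE II: range and standard deviation as reported in the text.
apache_ratio = range_sd_ratio(38, 7.66)

print(f"GCS: {gcs_ratio:.3f}, compatible with normality: {within_bounds(gcs_ratio)}")
print(f"APACHE II: {apache_ratio:.3f}, compatible with normality: {within_bounds(apache_ratio)}")
```

The critical bounds depend on the sample size and significance level, so the defaults used here apply only to the values quoted in the text.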
Furthermore, evaluation of normality is greatly assisted by the graphical presentation of the data to be analyzed. This method serves in two ways: (a) it uncovers characteristics of the data that are suggestive of mathematical properties of the underlying phenomena (exploratory technique), and (b) it is used in conjunction with formal numerical techniques to verify the inferences they suggest (29). Histograms have bars and are plotted along an equal interval scale. The height of each bar is the count of values of a quantitative variable falling within the interval. Histograms demonstrate the shape, center, and spread of the distribution. A normal curve superimposed on a histogram may aid in determining whether the data are normally distributed or not (Fig. 1 and 2) (30). In these two figures, one can see the departure of GCS data from symmetry, while APACHE II data seem more normally distributed. In addition, normal probability plots are generally used to determine whether the distribution of a variable matches a normal distribution. If it does, the points cluster around a straight line (Fig. 3 and 4) (29). This is the case for APACHE II data (Fig. 4), where most points are very close to the straight line. In contrast, GCS data points are far away from the straight line, indicating a lack of normality (Fig. 3).
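The coordinates of a normal P-P plot can be sketched in Python as follows. Blom's fractional rank (r - 3/8)/(n + 1/4) gives the observed cumulative proportion, and the fitted normal distribution gives the expected cumulative probability. The data are hypothetical, the function names are invented for this example, and tied values simply receive consecutive ranks, which is a simplification and may differ from how statistical packages handle ties.

```python
from statistics import NormalDist, mean, stdev


def blom_proportions(n):
    """Blom's fractional ranks (r - 3/8) / (n + 1/4) for ranks r = 1..n."""
    return [(r - 0.375) / (n + 0.25) for r in range(1, n + 1)]


def pp_points(values):
    """Pairs (observed cumulative proportion, expected normal cumulative probability)."""
    ordered = sorted(values)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    observed = blom_proportions(len(ordered))
    return [(p, fitted.cdf(x)) for p, x in zip(observed, ordered)]


def max_deviation(values):
    """Largest vertical distance of the P-P points from the diagonal."""
    return max(abs(obs - exp) for obs, exp in pp_points(values))


# Hypothetical samples, not the study data.
j_shaped = [3] * 12 + [5, 7, 9, 15, 15, 15]
bell_shaped = [10, 12, 13, 14, 14, 15, 15, 15, 16, 16, 17, 18, 20]

print(f"J-shaped sample:    max deviation = {max_deviation(j_shaped):.3f}")
print(f"bell-shaped sample: max deviation = {max_deviation(bell_shaped):.3f}")
```

Points from the bell-shaped sample stay close to the diagonal, whereas the J-shaped sample departs from it markedly, which mirrors the visual reading of the P-P plots described above.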
Certain limitations of this study should be mentioned: (a) the relatively small sample size, (b) the extremely frequent GCS score of 3 (due to prior sedation and intubation), (c) the low frequencies of GCS scores in the intermediate range, and (d) the limited experience of those collecting the data (residents).
This study showed that GCS data were far from normally distributed and as such should be analyzed with nonparametric tests (6). In contrast, APACHE II data demonstrated a much more normal distribution, which facilitates the use of parametric procedures, even though nonparametric ones are still preferred. Results based on data from skewed distributions should be troubling to practitioners who routinely use parametric tests without considering the form of the underlying distribution (31). It is true that it is possible to obtain a sound statistical result by employing improper statistical tests (6). Nonetheless, the correct application of the existing parametric and nonparametric tests is considered essential for reaching reliable statistical and medical conclusions (14,26). Moreover, since ordinal data are often used in clinical research, further studies on the behaviour of both parametric and nonparametric statistical methods should be undertaken (2).
References
1. BRIDGE, P.D., SAWILOWSKY, S.S. - Increasing Physicians’ Awareness of the Impact of Statistics on Research Outcomes: Comparative Power of the t-test and Wilcoxon Rank-Sum Test in Small Samples Applied Research. J. Clin. Epidemiol., 1999, 52:229.
2. FORREST, M., ANDERSEN, B. - Ordinal scale and statistics in medical research. Br. Med. J., 1986, 292:537.
3. GLANTZ, S.A. - Biostatistics: How to Detect, Correct and Prevent Errors in the Medical Literature. Circulation, 1980, 61:1.
4. JAKOBSSON, U. - Statistical presentation and analysis of ordinal data in nursing research. Scand. J. Caring. Sci., 2004, 18:437.
5. DE LAND, P.N., CHASE, W.W. - Statistics Notebook: Entry I.J, Types of Data: Nominal, Ordinal, Interval, and Ratio Scales. Optom. Vis. Sci., 1990, 67:155.
6. GADDIS, G.M., GADDIS, M.L. - Non-normality of Distribution of Glasgow Coma Scores and Revised Trauma Scores. Ann. Emerg. Med., 1994, 23:75.
7. STEVENS, S.S. - On the Theory of Scales of Measurement. Science, 1946, 103:677.
8. VELLEMAN, P.F., WILKINSON, L. - Nominal, Ordinal, Interval, and Ratio Typologies Are Misleading. Am. Stat., 1993, 47:65.
9. SHESKIN, D.J. - Handbook of parametric and nonparametric statistical procedures. 2nd ed. Chapman & Hall/CRC (New York) 2000.
10. DANIEL, W.W. - Applied nonparametric statistics. 2nd ed. Pacific Grove (Duxbury) 1990.
11. FELDMANN, U., STEUDEL, I. - Methods of ordinal classification applied to medical scoring systems. Statist. Med., 2000, 19:575.
12. TEASDALE, G., JENNETT, B. - Assessment of coma and impaired consciousness. A practical scale. Lancet, 1974, 2:81.
13. KNAUS, W.A., DRAPER, E.A., WAGNER, D.P., ZIMMERMANN, J.E. - APACHE II: A severity disease classification system. Crit. Care Med., 1985, 13:818.
14. CASSIDY, L.D. - Basic Concepts of Statistical Analysis for Surgical Research. J. Surg. Res., 2005, 128:199.
15. GADDIS, G.M., GADDIS, M.L. - Introduction to Biostatistics: Part 5, Statistical Inference Techniques for Hypothesis Testing with Nonparametric Data. Ann. Emerg. Med., 1990, 19:1054.
16. BONEAU, C.A. - The effects of violations of assumptions underlying the t test. Psychol. Bull., 1960, 57:49.
17. HEEREN, T., D’AGOSTINO, R. - Robustness of the two independent samples t-test when applied to ordinal scaled data. Statist. Med., 1987, 6:79.
18. LUCKE, J.F. - Student’s t Test and the Glasgow Coma Scale. Ann. Emerg. Med., 1996, 28:408.
19. LUMLEY, T., DIEHR, P., EMERSON, S., CHEN, L. - The importance of the normality assumption in large public health data sets. Annu. Rev. Public. Health, 2002, 23:151.
20. BAUMGARDNER, K.R. - A review of key research design and statistical analysis issues. Oral Surg. Oral. Med. Oral Pathol. Oral Radiol. Endod., 1997, 84:550.
21. GADDIS, G.M., GADDIS, M.L. - Introduction to Biostatistics: Part 2, Descriptive Statistics. Ann. Emerg. Med., 1990, 19:309.
22. DATA ANALYSIS PRODUCTS DIVISION. MATHSOFT, INC. S-PLUS 2000 User’s Guide. Seattle: MathSoft, 1999.
23. SPSS INC. SPSS Base 14.0 User’s Guide. Chicago: SPSS Inc, 2005:325-6, 330-1, 335, 508-10.
24. GADDIS, G.M., GADDIS, M.L. - Introduction to Biostatistics: Part 4, Statistical Inference Techniques in Hypothesis Testing. Ann. Emerg. Med., 1990, 19:820.
25. MARQUES DE SÁ, J.P. - Applied Statistics Using SPSS, STATISTICA, MATLAB and R. Berlin Heidelberg: Springer-Verlag, 2007:111-69, 171-222.
26. WHITLEY, E., BALL, J. - Statistics review 6: Nonparametric methods. Crit. Care, 2002, 6:509.
27. SIEGEL, S. - Nonparametric statistics for the behavioral sciences. McGraw-Hill Kogakusha (Tokyo), 1956, pag. 47-52.
28. SCHODER, V., HIMMELMAN, A., WILHELM, K.P. - Preliminary testing for normality: some statistical aspects of a common concept. Clin. Exp. Dermatol., 2006, 31:757.
29. D’AGOSTINO, R.B., STEPHENS, M.A. - Goodness-of-fit techniques. Marcel Dekker (New York) 1986.
30. SACHS, L. - Applied Statistics. A Handbook of Techniques. 2nd ed. Springer-Verlag (New York), 1984:55, 101, 323, 325-9.
31. CHAFFIN, W.W., RHIEL, S.G. - The effect of skewness and kurtosis on the one-sample T test and the impact of knowledge of the population standard deviation. J. Statist. Comput. Simul., 1993, 46:79.
32. ROYSTON, P. - Which measures of skewness and kurtosis are best? Statist. Med., 1992, 11:333.