Essentials of Biostatistics
Indian Pediatrics 1999;36: 1127-1134
4. Numerical Methods to Summarize Data
A. Indrayan and L. Satyanarayana*
From the Division of Biostatistics and Medical Informatics,
University College of Medical Sciences, Dilshad Garden, Delhi 110 095, India and
*Institute of Cytology and Preventive Oncology, Maulana Azad Medical College Campus,
Bahadur Shah Zafar Marg, New Delhi 110 002, India.
Methods of collection of data are described in the previous article of this series(1). When data on, say, 300 infants with acute respiratory infection are available, how does one make sense of them? The initial step is to summarize the data in such a manner that none of their important features is lost. This is done by tabulation and by calculating a few summary values that can adequately represent the data. Before we go on to the summarization methods, it is necessary to describe the various scales of measurement and types of data available, because the summarization methods depend on these features.
4.1 Scales of Data Measurement
A scale is an "instrument" on which the characteristics are measured. It can be quantitatively calibrated in the usual sense, or can be qualitative also. The following are some details of these scales.
Observations such as site of a malignancy, clinical course of an infection, and gender are only names, but they are measurements in the statistical sense. Such a "scale" is called nominal. The measurements on this scale do not have any specific order. Sites of lymphadenopathy such as cervical, axillary and generalized have no order. Similarly, gender is either male or female and neither is higher or more than the other. Diagnosis of liver disease into hepatitis/cirrhosis/malignancy is nominal, and so are the criteria for defining aplastic anemia from peripheral blood such as neutrophils, platelets and reticulocytes. The only way to associate numerics with such a scale is by assigning a code to each category.
In contrast to the nominal scale, the ordinal scale consists of ranks or ordering in the categories of a measurement. Disease severity is measured in ordered categories such as none, mild, moderate, serious or critical. The self-perception of health can be ordered from very bad to very good on, say, a 7-point scale; and presence of disease can be scaled as definitely absent, doubtful, likely and definitely present.
The characteristics which can be exactly measured in terms of a quantity, such as weight, height, hemoglobin level and heart rate, are said to be measured on a metric scale.
Exact measurements on a metric scale are statistically preferable to ordinal measurements. Yet, the irony is that circumstances sometimes lead to grouping of metric data into categories even after the exact data are obtained. Birth weight of a newborn may be categorized as ≤2.4 kg, 2.5-2.9 kg and ≥3.0 kg. Such ordinal categories are sometimes easier to comprehend than exact metric measurements. In the process, however, valuable exact information is lost.
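The grouping of birth weight described above can be sketched in a few lines of Python. This is only an illustration: the function name `birth_weight_category`, the category labels and the sample weights are hypothetical, and the sketch assumes that weights are recorded to one decimal place so that no value falls between two groups.

```python
def birth_weight_category(weight_kg: float) -> str:
    """Map a birth weight in kg to one of the three ordinal groups in the text."""
    # Assumption: weights are recorded to one decimal place, so rounding
    # first leaves no gap between adjacent groups.
    w = round(weight_kg, 1)
    if w <= 2.4:
        return "<=2.4 kg"
    if w <= 2.9:
        return "2.5-2.9 kg"
    return ">=3.0 kg"


# Hypothetical birth weights (kg), for illustration only.
weights = [2.1, 2.5, 2.9, 3.0, 3.4]
categories = [birth_weight_category(w) for w in weights]
print(categories)
```

Once the categories are stored in place of the weights, the exact values cannot be recovered, which is the loss of information noted above.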
4.2 Variable and Its Types
Generally speaking, a characteristic that tends to vary from subject to subject or from unit to unit is called a variable. All biological characteristics (age, gender, birth order, body temperature, ponderal index, duration of survival, disease severity, diagnostic category) are variables.
The methods to summarize data depend on the type of variables used. These types are broadly classified as quantitative and qualitative.
Qualitative and Quantitative Variables
Characteristics on a metric scale already have a numeric outcome, such as Apgar score 7 for a particular child or respiration rate 68 per minute. Characteristics on nominal and ordinal scales do not have this feature. Gender is male or female, though we can assign a code such as 0 for male and 1 for female. Similarly, malnutrition of a child in none, mild, moderate and severe categories can be assigned codes 0, 1, 2, and 3, respectively. When such numeric assignment is done, the characteristic becomes a variable in the true sense. This does not mean that all the variables are quantitative. It is important to maintain the distinction between values and codes, and between quality and quantity. We often use the terms qualitative data and quantitative data in this series to keep the distinction intact. The former includes categorized data such as nominal, ordinal, and even categorized metric data, and the latter uncategorized metric data. Medical practice has a preponderance of qualities over quantities. Use of quantitative variables such as body temperature, birth weight and gestation is of course common but, at some stage in patient management, they tend to be interpreted as "qualities" such as high/normal/low and preterm/term. Clinical interpretation becomes easier by assigning such "qualities". We certainly are not suggesting that exact measurements are not important. They definitely are, and are always preferable, because it is only through such quantities that borderline cases and the trend in terms of improvement or deterioration can be identified.
Discrete and Continuous Variables
A variable that can take only a finite, generally small, number of values in a range is called discrete. Apgar score, blood group and birth order are some examples of a discrete variable.
A variable that can take an infinite number of values within a range is called continuous. Anthropometric measurements such as weight, height and mid-arm circumference, and laboratory measurements such as iodine level, Hb level and serum bilirubin level, are examples of a continuous variable. Theoretically, these can be measured to any number of decimal places provided a sufficiently precise instrument is available. Weight can be measured as 32.543 kg. Even between 32 and 33 kg, it can theoretically take an infinite number of values depending upon the number of decimal places used. Practically though, such accuracy is redundant. Variables such as heart rate, platelet count and respiration rate are in fact discrete, yet are considered continuous because of the large number of possible values. Only those variables which can take a small number of values, say, less than 10, are generally considered discrete. Others can be treated as continuous for practical purposes even when they are theoretically discrete.
Stochastic and Deterministic Variables
Variables can be categorized on the basis of yet another feature. There are measurements that are considered known for the subjects. The others are those that are subsequently obtained. The former are factors and the latter responses, as described in a previous article of this series(2). Responses are considered stochastic because they are subject to chance fluctuations. They cannot be exactly predicted. Factors are considered deterministic because they are known beforehand and are no longer subject to chance fluctuations. Most of the inferential statistical methods we discuss in these articles apply to the stochastic rather than to the deterministic variables.
4.3 Data Tabulation
Tables and graphs are just about the only way that data on a large number of subjects can be condensed or summarized for presentation. We deal with graphs in the next article. Some details of the tables are as follows.
Not all tables contain data. For example, a list of drugs used in a neonatal intensive care unit is textual information but can be written in the form of a table. Our interest is restricted to only those tables that contain numeric data. These tables may contain the number of subjects, called "frequency", say, on different types of drug or with different characteristics in a group of people. Such a table is called a frequency table or a frequency distribution since it depicts the number of subjects distributed among the various groups or categories of a characteristic. The distribution is univariate when the division of the subjects is presented by categories of one variable only.
It is bivariate when presented by categories of two variables simultaneously. An example of bivariate is the frequency distribution of asthmatic children by severity of disease and gender.
A frequency table in which subjects are classified into mutually exclusive and exhaustive categories is called a contingency table. Categories are called mutually exclusive when only one of them is applicable to one subject and exhaustive when a subject cannot be classified beyond the specified categories. A contingency table is called one-way, two-way, or r-way depending upon the number of variables on which the subjects are cross-classified.
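A univariate and a bivariate frequency distribution can be sketched with Python's standard `collections.Counter`. The records below are hypothetical (they are not taken from any study cited here) and merely mimic the example of asthmatic children classified by severity of disease and gender.

```python
from collections import Counter

# Hypothetical records: (severity, gender) for each asthmatic child.
records = [
    ("mild", "male"), ("mild", "female"), ("moderate", "male"),
    ("severe", "female"), ("mild", "male"), ("moderate", "female"),
]

# Univariate distribution: subjects counted by one variable only.
by_severity = Counter(severity for severity, _ in records)

# Bivariate (two-way) distribution: subjects counted by both variables at once.
by_severity_gender = Counter(records)

# Each child falls in exactly one cell (mutually exclusive categories), so
# the cell frequencies add up to the total number of children.
assert sum(by_severity_gender.values()) == len(records)

print(by_severity)
print(by_severity_gender)
```

Because every record contributes to exactly one cell and no record is left out, the two-way counts here satisfy the conditions stated above for a contingency table.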
Table I: Age Distribution of Girls According to Menstruating Status and Socioeconomic Status
Source: Rao et al.(3)
Table I is an example of a three-way contingency table. In this table, metric scale for age in years is grouped while the socioeconomic status is ordinal. The table shows frequencies of girls in various age groups according to menstruating status and socioeconomic status. The categories are mutually exclusive. For example, a menstruating girl of age 11½ years and high socioeconomic group belongs to only one category in the frequency table. In the same way, no girl can be classified beyond these categories shown in this table. If the column for nonmenstruating is omitted then the categories are not exhaustive and the table is not a contingency table. Table II on clinical symptoms in 29 children of megaloblastic anemia is an example where categories are not mutually exclusive. One patient can have two or more symptoms. This is called multiple response. This table contains frequencies but is not a contingency table. Table III is a data table but is not a contingency table since the entries are not the count of cases.
Table II: Clinical Symptoms in 29 Children of Megaloblastic Anemia Between 3½ Months and 12 Years of Age
Source: Gomber et al.(4)
Table III: Hematological Data on 29 Children of Megaloblastic Anemia Between 3½ Months and 12 Years of Age
Source: Gomber et al.(4).
A problem frequently encountered in preparing a contingency table on all continuous variables is in deciding the number and width of intervals. Table I shows age grouped into class intervals, say, <11, 11-12, 12-13, etc. These are five groups. The choice mostly depends on common sense evaluation of the utility of such groups in conveying the basic features of the data. As a rule of thumb, it may be suggested that the number of such groups should generally be between four and eight.
Percentage in a Frequency Table
Percentages are generally calculated and presented in a table for each category using total number of subjects. For example, proportion of high socioeconomic menstruating girls in Table I in less than 11 years of age is 4/44. This awkward looking fraction can be converted into a convenient and a nice number using a multiplier. If the multiplier is 100 then it is called a percentage. The proportion or percentage of total is called a relative frequency, while the count of girls less than 11 years of age, 4, is a frequency. The main purpose of computing a percentage is to be able to compare groups or class intervals in a frequency table. Examples of percentages are given in Tables I and II. The multiplier can be 1000, 10,000 or 100,000 if the frequency of the characteristic is low.
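The conversion of the proportion 4/44 into a percentage, or into a rate per 1000, can be sketched as follows. The function name `relative_frequency` is an illustrative choice; the counts 4 and 44 are those quoted above from Table I.

```python
def relative_frequency(frequency: int, total: int, multiplier: int = 100) -> float:
    """Express a frequency relative to the total, scaled by the multiplier."""
    if total <= 0:
        raise ValueError("total must be positive")
    return frequency * multiplier / total


# High socioeconomic menstruating girls aged less than 11 years: 4 out of 44.
pct = relative_frequency(4, 44)               # multiplier 100: a percentage
per_thousand = relative_frequency(4, 44, 1000)

print(f"{pct:.1f} per cent")
print(f"{per_thousand:.1f} per thousand")
```

The same function serves for a larger multiplier such as 10,000 or 100,000 when the characteristic is rare.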
Rate: Rate is a measure of frequency of occurrence of a phenomenon. Since this frequency can change over time, a rate is time specific. Popular examples of rate are infant mortality rate of a country per year and monthly case fatality rate in a hospital.
Essential components of a rate are (i) a numerator, (ii) a denominator, (iii) specification of a time and (iv) a multiplier like per cent or per thousand. The numerator in a rate is a part of the denominator. For a situation where the numerator is not a part of the denominator, the term ratio is used.
Ratio: Broadly speaking, a ratio is one quantity relative to another. It can be expressed as a : b or as a/b. The ratio of menstruating (a = 44) to nonmenstruating (b = 28) girls in the high socioeconomic group is an example from Table I. Sex ratio in a population is another popular example of a ratio. This is expressed as the number of females per 1000 males in a particular area. In a broad sense, all rates too are ratios because there is a numerator and there is a denominator. A ratio is not a proportion but a rate is. The number of male children born in a hospital out of total live births is a proportion. On the other hand, the number of male children born relative to the number of female children is not a proportion. This is a ratio.
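The distinction between a ratio and a proportion can be illustrated with the two counts quoted above from Table I (44 menstruating and 28 nonmenstruating girls in the high socioeconomic group). The sketch uses Python's standard `fractions.Fraction` only so that the results stay exact; the variable names are illustrative.

```python
from fractions import Fraction

menstruating = 44
nonmenstruating = 28

# Ratio: the numerator is NOT a part of the denominator.
ratio = Fraction(menstruating, nonmenstruating)

# Proportion: the numerator IS a part of the denominator (all girls in the group).
proportion = Fraction(menstruating, menstruating + nonmenstruating)

print(ratio, round(float(ratio), 2))            # 11/7 1.57
print(proportion, round(float(proportion), 2))  # 11/18 0.61
```

A proportion can never exceed 1, whereas a ratio such as 44 : 28 can.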
Rates and ratios are calculated for qualitative data only. The methods to summarize quantitative data are as follows.
4.4 Measures of Central and Other Locations
Suppose there are data on duration of treatment in 10 children with Sydenham chorea. These durations in months may be 1, 1, 3, 4, 4, 4, 4, 4, 6, and 9. One way to represent variation in this duration, of course, is to prepare a table showing the number of cases with different duration. A step further is to be able to say that the most common duration is 4 months and the duration ranged between 1 and 9 months. The former represents a central value and the latter scatter or dispersion. These two in many cases give a fairly good idea of the entire data set. Measures of dispersion or variation are discussed in the next section. Apart from the central location, there are other locations, viz., quantiles which are sometimes considered important. Some important measures of location are given below.
The mean is computed by dividing the sum of all observed values by the number of observations. It is a popular measure because it is simple to calculate and easy to understand. If x1, x2, ..., xn are n observations, then the sample mean is
Mean or $\bar{x} = \sum x_i / n$.
The mean of the 10 observations on duration of treatment in the preceding paragraph is:
Mean = 40/10 = 4 months.
The median is a positional central value that divides the total number of observations into two halves. After the n observations are arranged in order of magnitude, the (n + 1)/2th observation is the median if n is odd, and the mean of the (n/2)th and (n/2 + 1)th observations is the median if n is even. In our example on duration of treatment of Sydenham chorea, n = 10 and the median is the mean of the 5th and 6th values after they are arranged in ascending order. Thus median = (4 + 4)/2 = 4 months. The median duration of treatment for Sydenham chorea is 4 months in these 10 subjects.
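The positional rule for the median can be written out directly. The following is a minimal sketch; the function name `median_by_position` is an illustrative choice, and the data are the ten treatment durations used in the example.

```python
def median_by_position(values):
    """Median by the positional rule: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        # n odd: the (n + 1)/2-th value (1-based), i.e. index (n + 1)//2 - 1.
        return ordered[(n + 1) // 2 - 1]
    # n even: mean of the (n/2)-th and (n/2 + 1)-th values (1-based).
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


durations = [1, 1, 3, 4, 4, 4, 4, 4, 6, 9]  # months
print(median_by_position(durations))  # 4.0
```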
The most frequently occurring value is called the mode. This has a special significance because it indicates the peak among observations. There can be more than one mode in a set of observations, while the mean and median are unique. The practical utility of the mode is lost when there is more than one. The mode in the above example on duration of treatment is 4 months. The three measures, mean, median and mode, happen to coincide in this example but this is not always so.
The values of a variable that divide the total number of subjects into ordered groups of equal size are called quantiles. There are different measures for various numbers of divisions. Important among them are percentiles or centiles, quartiles and tertiles. These are positional values at different locations in 100 divisions, 4 divisions and 3 divisions, respectively. The percentiles are most commonly used for child growth monitoring purposes.
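As a check, the three measures of central location for the ten treatment durations can be obtained with Python's standard `statistics` module. The second data set, `bimodal`, is hypothetical and is included only to show a case with more than one mode.

```python
import statistics

durations = [1, 1, 3, 4, 4, 4, 4, 4, 6, 9]  # months

mean = statistics.mean(durations)        # 40 / 10
median = statistics.median(durations)    # mean of the 5th and 6th ordered values
modes = statistics.multimode(durations)  # list of all most frequent values

print(mean, median, modes)

# Hypothetical data with two modes: the mode is no longer a single summary value.
bimodal = [1, 1, 2, 3, 3]
print(statistics.multimode(bimodal))
```

`statistics.multimode` returns every value that attains the highest frequency, so a list with more than one entry signals that the mode is not unique.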
Percentiles: The values which divide the total number of subjects into 100 equal parts are called percentiles. For example, 90th percentile is the value below which 90% of values occur. Similarly, 50% of observations lie below the 50th percentile. This percentile is equivalent to median. For n = 200 subjects, after arranging the observations in the ascending order of magnitude, the 3rd and 95th percentiles, first quartile and second tertile are as follows.
3rd percentile = (3 × 200/100) = 6th value
95th percentile = (95 × 200/100) = 190th value
1st quartile = (1 × 200/4) = 50th value
2nd tertile = (2 × 200/3) = 133.3, i.e., the 133rd value
If the 97th percentile of weight of 2-year old boys is 14.2 kg then it means that 97% of such children have weight 14.2 kg or less. The other 3% have a higher weight. Similarly, the other quantiles can be interpreted according to their divisions.
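The positions listed above for n = 200 follow one rule: the k-th quantile among m divisions is located at the (k × n/m)th ordered value. The sketch below implements that rule; the function name `quantile_position` is illustrative, and rounding to the nearest whole position is an assumption made here to reproduce the 133rd value for the second tertile. Statistical packages often interpolate between neighboring values instead, so their results may differ slightly.

```python
def quantile_position(k: int, divisions: int, n: int) -> int:
    """1-based position of the k-th quantile among n ordered values.

    Computes k * n / divisions and rounds to the nearest whole position
    (assumption: halves are rounded up). Integer arithmetic avoids
    floating-point surprises.
    """
    if not 0 < k < divisions:
        raise ValueError("k must lie strictly between 0 and the number of divisions")
    return (2 * k * n + divisions) // (2 * divisions)


n = 200
print(quantile_position(3, 100, n))   # 3rd percentile  -> 6
print(quantile_position(95, 100, n))  # 95th percentile -> 190
print(quantile_position(1, 4, n))     # 1st quartile    -> 50
print(quantile_position(2, 3, n))     # 2nd tertile     -> 133
```

The quantile itself is the observation found at that position after the data are sorted in ascending order.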
4.5 Measures of Variability
It is clear from the example on duration of treatment that variation is present from patient to patient. As explained in Article 1 of this series(5), such variation occurs due to a large number of factors. Variation, dispersion and scatter connote the same phenomenon. A measure of this variation is often helpful, along with a measure of location, in providing a fairly good idea of the entire data set. The most frequently used measures of variation are range and variance or standard deviation (sd).
The range is defined as the difference between the largest and the smallest values. In the above example on the duration of treatment, the range is from 1 to 9 months, that is, 8 months. Clearly, this measure of variability uses only two observations and thus is greatly affected by the extreme values. If the duration for one patient is 16 months then the range jumps to 15 months. A better measure that uses all the observations in the data is standard deviation.
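The sensitivity of the range to a single extreme value can be seen in a short sketch. The function name `data_range` is illustrative, and it is assumed here that the 16-month duration replaces the largest observed value of 9 months.

```python
def data_range(values):
    """Range: difference between the largest and the smallest values."""
    return max(values) - min(values)


durations = [1, 1, 3, 4, 4, 4, 4, 4, 6, 9]  # months
print(data_range(durations))     # 9 - 1 = 8

# Assumption: one patient's duration is 16 months in place of 9 months.
with_extreme = durations[:-1] + [16]
print(data_range(with_extreme))  # 16 - 1 = 15
```

Nine of the ten observations are unchanged, yet the range nearly doubles.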
Variance and Standard Deviation
The magnitude of variation is the extent to which each value differs from each of the others. Instead of calculating so many differences, it is convenient to compute the difference of each value from a central value, namely, the mean. An average of these deviations could be a measure of variation. But this would always be zero since the positive and negative deviations from the mean cancel out exactly. One way out is to ignore the sign, take absolute values and calculate the average of the absolute differences. This is called mean deviation. Absolute values are mathematically difficult to handle. An easy way to get rid of the negative sign is by squaring the deviations. The average of squared deviations is called variance. This can be written as:
Variance $= \sum (x_i - \bar{x})^2 / n$
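The reasoning that leads from deviations to variance can be traced on the ten treatment durations. The sketch below uses only built-in Python and the divisor n, as in the definition just given; the variable names are illustrative.

```python
durations = [1, 1, 3, 4, 4, 4, 4, 4, 6, 9]  # months
n = len(durations)
mean = sum(durations) / n                    # 4.0

deviations = [x - mean for x in durations]

# Signed deviations from the mean cancel out, so their sum is zero.
sum_of_deviations = sum(deviations)

# Mean deviation: average of the absolute deviations.
mean_deviation = sum(abs(d) for d in deviations) / n

# Variance with divisor n: average of the squared deviations.
variance_n = sum(d ** 2 for d in deviations) / n

print(sum_of_deviations, mean_deviation, variance_n)  # 0.0 1.4 4.8
```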
The variance has been extensively studied and has been found to be a very adequate measure. The only difficulty is that the unit of the variable also gets squared (such as square meters for area). The original unit is retrieved by taking the square root of the variance. The quantity so obtained is called standard deviation (sd). We explain in a future article that, in the case of samples, the denominator (n - 1) in place of n is more adequate for calculating variance. Thus, for a sample of n observations x1, x2, ..., xn, the standard deviation is
$$\text{sd} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
For our example on duration of treatment, mean of 10 observations is 4 and squared deviations from the mean are 9, 9, 1, 0, 0, 0, 0, 0, 4 and 25, respectively. The sum of these is 48. Thus the sd is
sd $= \sqrt{48/9} = 2.3$ months.
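The sample standard deviation for the ten treatment durations can be verified both from the definition and with `statistics.stdev` from the Python standard library, which also uses the divisor (n - 1). This is a minimal sketch with illustrative variable names.

```python
import math
import statistics

durations = [1, 1, 3, 4, 4, 4, 4, 4, 6, 9]  # months
n = len(durations)
mean = sum(durations) / n

# Sum of squared deviations from the mean: 9+9+1+0+0+0+0+0+4+25 = 48.
sum_sq = sum((x - mean) ** 2 for x in durations)

# Sample sd with divisor (n - 1).
sd_sample = math.sqrt(sum_sq / (n - 1))

print(sum_sq)                                  # 48.0
print(round(sd_sample, 1))                     # 2.3
print(round(statistics.stdev(durations), 1))   # 2.3, same divisor (n - 1)
```

`statistics.pstdev` would give the value based on the divisor n instead, which is slightly smaller for the same data.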
Table III contains mean and sd for various hematological parameters on a sample of 29 children of megaloblastic anemia(4).
Coefficient of Variation
There is another important measure based on standard deviation called coefficient of variation (CV). This expresses the standard deviation as a percentage of the mean. The formula for this is
CV $= (\text{sd} / \text{Mean}) \times 100$.
CV helps in comparing the relative variability among different measures. For example, a sd of 5 mmHg in the case of systolic BP readings is small but a sd of even 3 g/dl in the case of Hb level is large. This is because the sd has to be assessed in relation to the values themselves or their mean. If the mean systolic BP level of the subjects under study is 100 mmHg then a sd = 5 mmHg is only 5% of the mean. If the mean Hb level is 10 g/dl then a sd = 3 g/dl is 30% of the mean. This sd is relatively much higher. Thus sd by itself cannot be used to compare variability in two different kinds of variables. Table III contains CV of various hematological parameters of children of megaloblastic anemia in the age group of 3½ months to 12 years(4). Note that the variability in mean corpuscular volume is much lower than in TLC. It is relatively more consistent between patients. Such comparison cannot be done on the basis of sd.
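The two comparisons in the preceding paragraph can be reproduced with a short function. The name `coefficient_of_variation` is an illustrative choice; the sd and mean values are those quoted in the text for systolic BP and Hb level.

```python
def coefficient_of_variation(sd: float, mean: float) -> float:
    """CV: standard deviation expressed as a percentage of the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * sd / mean


cv_bp = coefficient_of_variation(sd=5, mean=100)  # systolic BP, mmHg
cv_hb = coefficient_of_variation(sd=3, mean=10)   # Hb level, g/dl

print(cv_bp, cv_hb)  # 5.0 30.0
```

Because the units cancel in the division, the two CVs can be compared directly even though the variables are measured in mmHg and g/dl.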
1. Indrayan A, Satyanarayana L. Essentials of Biostatistics: 3. Methods of sampling and data collection. Indian Pediatr 1999; 36: 905-910.
2. Indrayan A, Satyanarayana L. Essentials of Biostatistics: 2. Designs of medical studies. Indian Pediatr 1999; 36: 691-696.
3. Rao S, Joshi S, Kanade A. Height velocity, body fat and menarcheal age of Indian girls. Indian Pediatr 1998; 35: 619-628.
4. Gomber S, Kela K, Dhingra N. Clinico-hematological profile of megaloblastic anemia. Indian Pediatr 1998; 35: 55-58.
5. Indrayan A, Satyanarayana L. Essentials of Biostatistics: 1. Medical uncertainties. Indian Pediatr 1999; 36: 476-483.