Essentials of Biostatistics
Indian Pediatrics 1999;36: 905-910
3. Methods of Sampling and Data Collection
A. Indrayan and L. Satyanarayana*
From the Division of Biostatistics and Medical Informatics,
University College of Medical Sciences, Dilshad Garden, Delhi 110 095, India and
*Institute of Cytology and Preventive Oncology, Maulana Azad Medical College Campus,
Bahadur Shah Zafar Marg, New Delhi 110 002, India.
Clinicians usually deal with samples of blood, urine, stool, etc., and also with specimens such as biopsies. Besides these, the subjects included in a medical study are almost invariably a sample. We earlier explained the special meaning of the term population in statistics(1). It can be understood as the target group to which the investigation results are intended to be extrapolated. For example, in a descriptive study on acute respiratory infection (ARI) among children in a country, the target population could be all existing cases of ARI in that country. Cost and logistic considerations seldom allow study of all the subjects. Sampling becomes a natural choice.
In the context of health and medicine, full coverage is rarely possible. All existing cases can possibly be included in an investigation but the cases occurring in future cannot be included. And the findings would most likely be extrapolated to the future cases. Thus, sampling arguments are an integral part of the science of medicine. The obvious advantages of sampling are reduced cost and faster results. A less obvious advantage is the possibility of getting more accurate results because better technology can be more easily used on a small number of subjects. Sampling can, however, create a feeling of discrimination in some cases when sampled subjects get special attention and the others do not.
We first describe some terms specially used in the context of sampling in Section 3.1 and then various methods of sampling in Section 3.2. In Section 3.3, a brief account of study tools that are used in data collection is given.
3.1 Some Terms and Concepts
Sampling Unit: The unit of inquiry is the subject on which the information is obtained. The sampling unit is the one which is used for selection. In a community survey on protein energy malnutrition, the sampling unit could be a family but the unit of inquiry could be a child of less than 5 years. One sampling unit can have none, one or more than one unit of inquiry.
Sampling Frame: A list of all sampling units in the target population is called a sampling frame. The units are chosen from this frame. The units must be mutually exclusive and the frame should be an exhaustive list. Preparation of the frame requires a precise definition of the unit as well as of the population.
Sample Size: The number of units or subjects sampled for inclusion in the study is called sample size. Determination of this requires detailed consideration of the objectives of the study. This aspect will be discussed in a future Article.
Parameter and Statistic: Most medical studies draw conclusions on the basis of the estimates of either means or proportions. Such quantities of interest are called parameters when calculated for the entire population and statistics (as plural) when calculated for the sample. The objective of the sampling is to provide statistics which are adequate estimates of the parameters.
Sampling Error: One sample from a population will in all probability be different from a second sample from the same population. Some or all individuals in the second sample may be new. Thus the results obtained from one sample may not match those from another sample. We included this variability in the list of the sources of uncertainties in a previous Article of this series(1), and called it sampling fluctuation or sampling error. The magnitude of this "error" depends primarily on three factors: (i) The variability between the subjects in the population. The larger the variability, the greater is the sampling error; (ii) The size of the sample. When the samples include a large number of subjects, the picture obtained from one sample is not likely to be very different from that of another sample of the same size because both tend to be fair representatives. This cannot be said for small samples; and (iii) The method of sampling. The subjects should be selected in such a manner that a wide spectrum gets adequate representation. Note that sampling "error" is not a mistake but signifies only variation from sample to sample.
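The dependence of sampling error on sample size can be illustrated by a small simulation. The sketch below is only indicative: the population of 5000 values (mean 100, standard deviation 15), the sample sizes of 10 and 100, and the function name spread_of_sample_means are all assumptions made for illustration and do not come from any actual study.

```python
import random
import statistics


def spread_of_sample_means(population, sample_size, repeats, rng):
    """Draw repeated simple random samples and return the standard
    deviation of their means, a rough measure of sampling error."""
    means = [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)


rng = random.Random(36)  # fixed seed so that the run can be repeated

# Hypothetical population of 5000 measurements (assumed values).
population = [rng.gauss(100, 15) for _ in range(5000)]

small = spread_of_sample_means(population, sample_size=10, repeats=200, rng=rng)
large = spread_of_sample_means(population, sample_size=100, repeats=200, rng=rng)

print("Spread of sample means, n = 10 :", round(small, 2))
print("Spread of sample means, n = 100:", round(large, 2))
```

The means of the larger samples should vary much less from one sample to another than the means of the smaller samples, which is factor (ii) above.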
3.2 Methods of Sampling
Sampling necessarily entails an argument from the fraction to the whole. The validity of this argument depends on the representativeness of the sample. Methods are available that make such representativeness likely. Sampling can either be random or purposive. Purposive sampling is the one in which no random component is present and specific subjects are intentionally included. Volunteers are an example. Volunteer studies do have a special place in medical research because they help to identify at least some adverse effects and can also provide clues on the efficacy of the treatment regimen. But that is where the utility of a volunteer study ends. The findings so obtained are not amenable to generalization and would rarely apply to other subjects. For this generalizability, it is necessary that the sample is representative of the population. This can be fairly assured by what is called a probability or a random sample. This means that inclusion of a particular unit in the sample cannot be predicted and would depend, at least to some extent, on chance. A random sample is also a decisive advantage in evaluating the statistical significance of the results, which we discuss in later Articles.
A large number of methods are available for choosing a probability sample. But only some are commonly used in medicine. These are briefly described below. For the interested reader, further details of sampling techniques can be found in standard texts(2,3).
Simple Random Sampling
If you want to select 5 children from a group of 30 attending a clinic on a particular day, which method is least biased? Random selection, just as in a lottery, springs to mind as the foremost choice for such sampling.
When the scheme is such that each unit of the population has the same chance of being included in the sample, it is called simple random sampling (SRS). It is customary to denote the size of the sample by n and the size of the population by N. The SRS is like picking n chits from a lot containing N chits numbered 1 through N. A more scientific method is to use random numbers. These can be very easily generated on a computer and are otherwise available in table form in some statistics books. A pre-requisite for choosing SRS is the availability of the frame. A disadvantage of SRS is that it does not guarantee adequate representation of different segments of the target population. It is possible that the sample happens to include an adequate number of cases of severe or moderate type but very few cases of mild type. When adequate representation of various segments is required, the method of selection should be stratified sampling.
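Computer-generated random numbers can be used for SRS along the following lines. This is a minimal sketch for the example of selecting 5 children out of 30, assuming that the children have already been numbered 1 through 30 in the frame; the function name simple_random_sample and the seed value are illustrative choices only.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Pick sample_size distinct unit numbers from 1..population_size,
    each unit having the same chance of selection."""
    if not 0 <= sample_size <= population_size:
        raise ValueError("sample size must lie between 0 and the population size")
    rng = random.Random(seed)
    frame = range(1, population_size + 1)  # chits numbered 1 through N
    return sorted(rng.sample(frame, sample_size))


# N = 30 children in the frame, n = 5 to be selected.
chosen = simple_random_sample(population_size=30, sample_size=5, seed=1999)
print("Selected children:", chosen)
```

Fixing the seed makes the selection reproducible for audit; omitting it gives a fresh random selection on every run.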
Stratified Random Sampling
In a study on malnutrition, as assessed by weight for age, as a predictor for death in children with specified infections, it is necessary that children of different nutritional status are included in the sample. The sampling unit in this case could be a child, presenting with the specified infections, reporting in a particular clinic. An SRS of size, say, 60 in this case can yield a sample in which severe malnutrition is not represented at all or inadequately represented.
The procedure therefore should be to first divide the subjects in the frame by nutritional status such as normal, moderate and severe malnutrition, and then draw an independent SRS of size, say, 20 from each division. Such division of the frame is called stratification and each division a stratum. The investigator decides how many units are to be selected from different strata; they need not be equal. This procedure of choosing a sample is called stratified random sampling (StRS).
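The procedure of drawing an independent SRS from each stratum can be sketched as follows. The frame here is hypothetical: the stratum sizes (150 normal, 80 moderate and 30 severe) and the child identifiers are assumed for illustration only, as is the function name stratified_random_sample. The sample size of 20 per stratum follows the example above.

```python
import random


def stratified_random_sample(frame, per_stratum, seed=None):
    """Draw an independent simple random sample from each stratum.

    frame       : dict mapping stratum name to the list of units in it
    per_stratum : dict mapping stratum name to the number of units to draw
    """
    rng = random.Random(seed)
    sample = {}
    for stratum, units in frame.items():
        n = per_stratum[stratum]
        if n > len(units):
            raise ValueError(f"stratum {stratum!r} has fewer than {n} units")
        sample[stratum] = rng.sample(units, n)
    return sample


# Hypothetical frame divided by nutritional status (assumed sizes).
frame = {
    "normal": [f"N{i}" for i in range(1, 151)],
    "moderate": [f"M{i}" for i in range(1, 81)],
    "severe": [f"S{i}" for i in range(1, 31)],
}
per_stratum = {"normal": 20, "moderate": 20, "severe": 20}

sample = stratified_random_sample(frame, per_stratum, seed=3)
for stratum, units in sample.items():
    print(stratum, len(units), units[:5])
```

Unequal allocation only requires changing the numbers in per_stratum.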
Cluster Random Sampling
In a study to assess the prevalence of goiter in school children of a state, we may need to have a list of schools in various PHC areas, urban blocks, census towns, etc. Each school constitutes a primary sampling unit. The children in each primary unit are the final sampling units in this prevalence study.
When the primary sampling units are not large, i.e., when they generally contain a small number of subjects, then it is sometimes advisable that these units are not sampled further. All the subjects in the selected primary units are then surveyed. When this is done, it is convenient to understand a primary unit as a cluster. Thus n clusters out of N are randomly selected. All subjects in the selected clusters are investigated. This is called cluster random sampling (CRS).
The only frame required for CRS is the list of clusters. Since survey of elements within a cluster is quick, CRS is sometimes considered a rapid assessment method. The World Health Organization (WHO) recommends this kind of sampling to estimate the percentage of children immunized in a community, particularly in developing countries. They recommend that a cluster size of 7 be chosen and 30 clusters be studied from the area for which a coverage estimate is required. A disadvantage of CRS is that the subjects within a cluster tend to be similar to one another and produce what is called a clustering effect. This effect reduces the chance of getting the full spectrum of subjects in the sample. To compensate for this loss, double or a larger sample may be required relative to SRS. A good reference on a general method for cluster-sample surveys is Bennett et al.(4).
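The selection of n clusters out of N, followed by inclusion of every subject in the selected clusters, can be sketched as below. The list of 12 schools, their sizes of 5 to 8 children, and the choice of 4 clusters are assumptions made purely for illustration; the function name cluster_random_sample is likewise hypothetical.

```python
import random


def cluster_random_sample(clusters, n_clusters, seed=None):
    """Randomly select n_clusters clusters and include every subject
    belonging to the selected clusters.

    clusters : dict mapping cluster name to the list of its subjects
    """
    rng = random.Random(seed)
    # Only the list of cluster names is needed as the frame.
    chosen = sorted(rng.sample(sorted(clusters), n_clusters))
    subjects = [s for name in chosen for s in clusters[name]]
    return chosen, subjects


# Hypothetical frame: 12 schools, each with 5 to 8 children (assumed).
clusters = {
    f"school_{i:02d}": [f"school_{i:02d}/child_{j}" for j in range(1, 6 + i % 4)]
    for i in range(1, 13)
}

chosen, subjects = cluster_random_sample(clusters, n_clusters=4, seed=7)
print("Selected clusters:", chosen)
print("Subjects to be surveyed:", len(subjects))
```

Note that the final number of subjects is not fixed in advance in this scheme; it depends on which clusters happen to be selected.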
Systematic Random Sampling
A sampling method in which the first unit is randomly selected and the others are automatically included on the basis of the sampling interval is called systematic random sampling (SyRS). The units are numbered 1 through N in any order and the sampling interval k = N/n is calculated. If k is not an integer then the integer part is taken. The first unit is selected at random from the first k units. Suppose this is the rth unit. Then the subsequent units in the sample are (r + k)th, (r + 2k)th, etc. Thus the first selected unit determines the entire sample. For example, for N = 101 subjects in the target population out of which n = 8 are proposed to be selected by the systematic method, the sampling interval is 101/8 and its integer part is 12. If the randomly selected unit out of the first 12 is the 9th then the remaining units have numbers 21, 33, 45, 57, 69, 81 and 93.
An advantage of SyRS is that it is very easy to execute. When the patients are coming to the clinic, every kth after the random first can be easily pulled out for inclusion in the study. Like SRS, sometimes SyRS also fails to give adequate representation to different segments of the population.
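The systematic selection described above can be reproduced with a few lines of code. This is a minimal sketch; the function name systematic_sample and its start argument (used here to fix r = 9 as in the worked example) are illustrative assumptions.

```python
import random


def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Return the sampling interval k and the selected unit numbers.

    k is the integer part of N/n. The first unit r is chosen at random
    from 1..k unless start is given; the rest are r + k, r + 2k, etc.
    """
    k = population_size // sample_size
    if k < 1:
        raise ValueError("sample size cannot exceed population size")
    if start is None:
        start = random.Random(seed).randint(1, k)
    elif not 1 <= start <= k:
        raise ValueError("start must lie between 1 and k")
    return k, [start + i * k for i in range(sample_size)]


# N = 101, n = 8 and random start r = 9, as in the example in the text.
k, units = systematic_sample(101, 8, start=9)
print(k, units)  # 12 [9, 21, 33, 45, 57, 69, 81, 93]
```

When start is omitted, the first unit is drawn at random from the first k units and the rest of the sample follows automatically.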
3.3 Tools of Data Collection
The data collection tools in medicine include existing records, questionnaires and schedules, interview and examination, laboratory and radiological investigations, etc. Each method of eliciting information has merits and demerits in terms of validity and reliability on one hand, and cost and time on the other. The following paragraphs contain some guidelines in this respect.
Almost all hospitals and clinics maintain a fairly good record of the patients served by them. The civil registration system may have records of births and deaths along with cause of death. Then there are records of ad hoc surveys that are done by different agencies for specific purposes. Records are the cheapest source of data. A demerit of records is that they tend to be incomplete and not sufficiently focused on the topic of interest of a study.
Questionnaires and Schedules
A questionnaire contains a series of questions for answer by the respondents. This could be self-administered or could be administered by an interviewer. In the case of the former, the education and the attitudes of the respondent towards the survey can substantially influence the response. In the case of the latter, the skill of the interviewer can make a material difference. The term schedule is used when it contains a list of items on which information is to be collected.
Whether a questionnaire or a schedule, sufficient space is always provided to record the response. A list of possible or expected responses can be given against the questions or items. Then the form is called close-ended. In the list of responses, the last can be kept as "others (specify)" to make the list exhaustive. The responses are sometimes pre-coded for easy entry into the computer. When the response is to be recorded verbatim then the question or item is called open-ended.
Framing of questions is a difficult exercise because the structure of the sentence and the choice of words become important. It is always helpful to divide the questionnaire into sections, both for the interviewee and for the interviewer. Use of features such as italics, bold face and capitals can help to clarify the theme of the question. The questionnaire or schedule must always be accompanied by a statement of the objectives of the survey so that the respondent is aware of them. Easy-to-follow instructions to record the response, and explanatory notes where needed, are always helpful.
Needless to say, the recording should be accurate and should follow the correct and full procedure required in the protocol. What needs to be emphasized is the legibility of writing, more so because many of the schedules for such purposes could be open-ended. Computer-generated forms, which allow direct entry of data without the help of a middle man, are a big advantage in this context. These are not used in India yet.
It is generally considered essential that all questionnaires, schedules, laboratory procedures, etc., are tested for their efficacy before they are finally used for the main study. This is called pre-testing. Many unforeseen problems or lacunae can be detected by such an exercise. The tools can thus be adjusted and improved accordingly. A study done on a small number of subjects, before the actual study based on a specified design, is called a pilot study. This can reveal whether the questions are sufficiently clear, whether the items of information are feasible, whether the length of the interview is within limits, and whether the instructions are adequate. Sometimes a pilot study is repeated to standardize the methodology of eliciting correct and valid information.
A pilot study performs one more very important function. This is to provide a preliminary estimate of the parameter of interest. This preliminary estimate may be required for calculation of the sample size needed for the main investigation. If the phenomenon under study has not been investigated earlier at all then a pilot study is the only way to get some preliminary estimate of the parameter.
In clinic-based follow-up studies, the patient may not show up next time. In the case of domiciliary contact, non-response can occur for reasons such as non-availability of the selected subjects (due to unexpected death, migration or visit to other places) and refusal to cooperate even after providing informed consent. Those who remain non-respondents despite repeated efforts could be of a different type from the respondents with regard to some of the characteristics under study. They may either be disproportionately normal or disproportionately abnormal. For example, in a study on post counseling follow-up of thalassemia in high risk communities, a random sample of thalassemia traits and non-traits was followed up for 5-7 years after holding initial awareness camps(5). Among traits, those who do not show up for a follow-up may be the ones who belong to a low income group or are illiterate. This could be because they do not want to get themselves branded as thalassemics due to fear of social stigma. As we explain in a future Article, the extent of non-response is also a consideration in determining sample size.
One way to find out whether the non-respondents are different from the respondents is to call back a random sub-sample of non-respondents. In case these efforts are successful, the pattern of responses obtained from this sub-sample can be compared with that of the respondents. This comparison may indicate in what respect, if at all, the two groups are different and what kind of adjustment is required.
It is possible in some cases, as for those lost to follow-up, that baseline information is available on non-respondents too. This baseline of non-respondents can also be compared with that of the respondents to assess whether the non-respondents are any different. In case some differences are detected then, again, an adjustment in the results may be needed.
Whether or not an adjustment is made, the extent of non-response must always be stated in the report so that readers can decide for themselves how much confidence to place in the results.
1. Indrayan A, Satyanarayana L. Essentials of Biostatistics: 1. Medical uncertainties. Indian Pediatr 1999; 36: 476-483.
2. Cochran WG. Sampling Techniques, 3rd edn. New York, John Wiley and Sons, 1977.
3. Som RK. Practical Sampling Techniques, 2nd edn. New York, Marcel Dekker Inc., 1996.
4. Bennett S, Woods T, Liyanage WM, Smith DL. A simplified general method for cluster-sample surveys of health in developing countries. Wld Hlth Statist Quart 1991; 44: 98-106.
5. Yagnik H. Post counselling follow-up of thalassemia in high risk communities. Indian Pediatr 1997; 34: 1115-1118.