SPECIAL ARTICLE
Year: 2019 | Volume: 2 | Issue: 1 | Page: 36-40

Types of data, methods of collection, handling and distribution

Umesh Goneppanavar^{1}, Zulfiqar Ali^{2}, S Bala Bhaskar^{3}, Jigeeshu V Divatia^{4}

^{1} Department of Anaesthesia, Dharwad Institute of Mental Health and Neurosciences, Dharwad, Karnataka, India
^{2} Department of Anaesthesia, Sher-i-Kashmir Institute of Medical Sciences, Srinagar, Jammu and Kashmir, India
^{3} Department of Anaesthesia, Vijayanagar Institute of Medical Sciences, Ballari, Karnataka, India
^{4} Department of Anaesthesia, Critical Care and Pain, Tata Memorial Hospital, Homi Bhabha National Institute, Mumbai, Maharashtra, India

Statistics is often seen as a tough nut to crack by novices and young researchers, mainly because of a lack of understanding of the fundamentals. This article describes the types of data and the methods for compiling raw data in an orderly fashion, followed by appropriate handling of the collected data to ensure completeness and quality. Once the data are entered into statistical software, the distribution of the data should be assessed so that appropriate statistical tests can be applied. Since the type of data and the nature of its distribution are the main determinants of the statistical test to be applied, researchers should have a thorough understanding of these aspects to derive meaningful outcomes from their research.
Introduction

Research involves collecting data in several forms, assimilating this raw data and filtering it to ensure that relevant and complete data are available for final analysis. This is followed by understanding the type of data and the nature of its distribution, which helps in applying appropriate statistical tests.[1],[2] Mistakes in any of these steps may produce erroneous results and render the outcome of the study meaningless. A researcher must therefore understand these aspects.

Data and Types of Data

Data refer to the information obtained during scientific research. Because its value varies from one sample to another, a datum is also termed a variable.[2] In airway studies, examples of variables include airway parameters, whether measurements (thyromental distance, sternomental distance or inter-incisor gap) or descriptions (presence or absence of buck teeth, missing teeth or a beard), and demographic data such as height, weight and gender. Some of these variables can be quantified (numerical or quantitative data), whereas some can only be categorised with a description (e.g., presence/absence, gender; qualitative or categorical data). The type of data collected during the research will eventually determine the statistical test to be applied [Table 1].[3] Variables, whether measured or categorised, can also be classified based on the scale on which they are defined. Statistics helps provide a better understanding of the data. Descriptive statistics describes the relationship between variables in a sample or population,[1] summarising the data in the form of the mean, median and mode. Inferential statistics uses a random sample of data taken from a population to describe and make inferences about the whole population.
It is helpful when we cannot examine each member of an entire population.[1] Inferential statistics rests on probability theory and uses point estimation, interval estimation and hypothesis testing.

Numerical/Quantitative data

Numerical data can be subclassified as discrete or continuous.[4] Discrete data are limited to a finite number of values, or to as many values as there are integers (e.g., number of laryngoscopy attempts). Continuous data can take any value over a particular range and are measured to the nearest accurate value (e.g., height can be in inches, centimetres or millimetres); the scale units are not restricted to integers and may include decimals. These variables may be represented on an interval scale or a ratio scale. An interval scale has a constant interval between measurements [e.g., central venous pressure (CVP) in cmH2O as −5, 0, 5, 10, 15 or 20] such that differences are meaningful but ratios are not (e.g., the ratio of 15 to 20 is 0.75, but this does not mean the risk of developing pulmonary oedema is 0.75 times as great when the CVP changes from 15 to 20). In addition, a negative CVP is possible, making zero an arbitrary value. In contrast, a ratio scale has a true zero, and both differences and ratios are meaningful (e.g., mouth opening of 1, 2 or 3 cm, where 3 cm is actually three times 1 cm and literally means three times the space for laryngoscopy, and zero mouth opening means no mouth opening at all). For obvious reasons, there cannot be a negative mouth opening.

Categorical/Qualitative data

These can be measured either on a nominal scale without any particular order (e.g., male or female, where neither category is superior to the other) or on an ordered (ordinal) scale (e.g., increasing grades of difficulty of laryngoscopy, or the number of laryngoscopy attempts needed as one, two and so on).
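The interval-versus-ratio distinction described above can be shown as a small arithmetic sketch (values taken from the text's CVP and mouth-opening examples):

```python
# Interval scale (CVP, cmH2O): differences are meaningful, ratios are not,
# because negative values are possible and zero is arbitrary.
cvp_low, cvp_high = 15, 20
cvp_difference = cvp_high - cvp_low        # a meaningful 5 cmH2O change
cvp_ratio = cvp_low / cvp_high             # 0.75, but clinically meaningless

# Ratio scale (mouth opening, cm): a true zero exists, so ratios are valid.
opening_small, opening_large = 1, 3
opening_ratio = opening_large / opening_small  # genuinely 3x the space

print(cvp_difference, round(cvp_ratio, 2), opening_ratio)
```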
The data in most studies will be a mix of qualitative and quantitative variables.[5] Nominal data can be dichotomous (e.g., successful intubation – yes/no) or polychotomous (e.g., blood type – A, B, AB and O). The same variable can be expressed in different ways. For instance, a difficult airway can be recorded as difficult/normal (categorical); as the degree of difficulty – easy, mildly difficult, moderately difficult, extremely difficult (ordinal); or as the number of intubation attempts (discrete). The nature of the data is also interchangeable, depending on the researcher. For example, a visual analogue scale (VAS) score for intubation difficulty can range from 0 to 100 (quantitative) but can also be presented as categorical data: mild (0–40), moderate (41–70) and severe (71–100) difficulty. However, one must understand that although the differences 70 − 40 and 100 − 70 both equal 30, the change in the degree of intubation difficulty is not the same for both; hence, the statistical tests applicable to quantitative data cannot be applied here. Based on the role they play, variables may be 'independent', when determined by the researcher and hence unaffected by the study, or 'dependent', when they vary according to the independent variable [Table 2].[5] For example, consider a study comparing two different videolaryngoscopes with respect to the duration of intubation and the number of intubation attempts. The videolaryngoscopes are chosen at the beginning of the study by the researcher and hence are the independent variables (also called treatment variables), whereas the duration of intubation and the number of intubation attempts are the dependent variables.

Collection and Handling of Data

Tabulation

Once raw data are obtained, it is of paramount importance to ensure that the data are properly organised so that meaningful conclusions can be drawn.
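The recoding of a quantitative VAS score into the three categories mentioned above can be sketched as follows (a minimal illustration; the thresholds are those given in the text, and the function name is hypothetical):

```python
def vas_category(score):
    """Map a 0-100 VAS intubation-difficulty score to a category:
    mild (0-40), moderate (41-70) or severe (71-100)."""
    if not 0 <= score <= 100:
        raise ValueError("VAS score must be between 0 and 100")
    if score <= 40:
        return "mild"
    if score <= 70:
        return "moderate"
    return "severe"

scores = [15, 55, 85]
print([vas_category(s) for s in scores])  # ['mild', 'moderate', 'severe']
```

Note that after this recoding the variable is ordinal, so tests for categorical data, not those for quantitative data, apply.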
Clear identification of the dependent and independent variables in the data entry sheet is vital, and data should be entered as and when they become available during the study. At the completion of the study, the variables are tabulated for statistical analysis in Microsoft Excel® or any statistical software (commonly used: SPSS®). While tabulating the data, each row represents one case and each column represents one variable. An abbreviated name (preferably under eight characters) is assigned to each variable. For instance, systolic blood pressure (SBP) may be recorded at baseline, at induction, at laryngoscopy and at 1, 3, 5 and 10 min after intubation; these can be abbreviated as SBPbase, SBPind, SBPlar, SBPint1, SBPint3, SBPint5 and SBPint10.

Data rectification, cleansing/scrubbing

This is the procedure of amending, removing, rectifying and rationalising the raw data accumulated at the end of the study. For example, some of the raw data might be duplicated, incomplete or even obsolete by the time the study is completed, and this needs to be rectified. The COUNT, MIN, MAX and AVERAGE formulae in the Excel sheet can be used to identify wrong or missing numerical data. The COUNTIF formula helps to count how often particular entries (e.g., the minimum, maximum or intermediate values) occur in a variable. The COUNTBLANK formula identifies all the empty cells, and the QUARTILE function can be used to identify statistical outliers in numerical data. All of these help to identify and easily rectify incorrect or incomplete data. Correct spelling and case for variables should be ensured during data entry (e.g., restricted/Restricted or restrict/restricted are not one and the same and will be treated as different responses by the statistical software). A double entry verification method can also be used to minimise data entry errors.
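The spreadsheet checks described above can equally be run in a script; the following standard-library sketch mirrors COUNTBLANK, MIN/MAX and the quartile-based outlier check (the column name and values are hypothetical):

```python
import statistics

# One tabulated column: SBP at baseline; None marks an empty cell,
# and 240 is a deliberately suspect entry.
sbp_base = [118, 124, None, 131, 119, 240, 122, None, 127]

present = [v for v in sbp_base if v is not None]
n_blank = sum(v is None for v in sbp_base)   # like COUNTBLANK
lo, hi = min(present), max(present)          # like MIN / MAX

# Quartile fences (like the QUARTILE function) to flag outliers:
q1, q2, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
outliers = [v for v in present if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print(n_blank, lo, hi, outliers)  # 2 118 240 [240]
```

Flagged values such as the 240 above should then be checked against the raw data sheet before any correction is made.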
In this method, after all the data are entered into a file, the double entry verification option is chosen in the statistical software, which masks the initial data entered. The data are then entered again. The software compares the first and second sets of data and highlights the differences, which can be resolved immediately by consulting the raw data sources. However, for this method to work correctly, care must be taken to ensure that the second data entry is done in exactly the same order as the first.

Coding of the data

Data should also be coded in a simplified and logical pattern (e.g., intubation attempt – failed, first attempt success, second attempt success, aborted – can be coded as 0 [failed], 1 [first attempt success], 2 [second attempt success] and 3 [aborted]). Similarly, coding buck teeth as absent (0) and present (1) is less error-prone than the reverse. It is vital to ensure that the units of measurement are uniform for all observations (e.g., 5.7 cm and 58 mm as two responses for thyromental distance should be converted to a uniform scale: either 5.7 and 5.8 cm, or 57 and 58 mm). Qualitative data are also coded for statistical ease (e.g., male gender [1] and female gender [2]). After ensuring that labelling/spelling errors are rectified, the relevant data can be entered into individual cells. It is advisable to maintain a codebook containing all the abbreviations and their respective full forms. Furthermore, variables should be designated according to type – integer, continuous, date or string (text). It is also advisable to list all the permitted values for each variable in the codebook, which will help identify missing or wrong data.
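A coding scheme, codebook and unit harmonisation along the lines described above might be sketched like this (the names and the helper function are illustrative, not from the article):

```python
# Numeric codes for an ordered categorical variable, as in the text.
INTUBATION_CODES = {"failed": 0, "first attempt success": 1,
                    "second attempt success": 2, "aborted": 3}

# A codebook maps each abbreviated variable name to its full meaning.
CODEBOOK = {"SBPbase": "systolic blood pressure at baseline (mmHg)",
            "TMDcm": "thyromental distance (cm)"}

def to_cm(value, unit):
    """Harmonise mixed-unit responses to centimetres."""
    if unit == "cm":
        return value
    if unit == "mm":
        return value / 10.0
    raise ValueError(f"unexpected unit: {unit}")

# Two raw thyromental-distance responses recorded in different units:
raw = [(5.7, "cm"), (58, "mm")]
print([to_cm(v, u) for v, u in raw])  # [5.7, 5.8]
```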
Missing data should be assigned a value outside the possible numeric values for that variable; this makes it clear that the datum was unavailable because of a lack of response or some reason other than oversight (e.g., dichotomous data are usually coded as 0 and 1, so a value other than 0 and 1 can be assigned as the code for missing information). The person entering these values into the statistical software should take care that 0 is not accidentally entered as the letter 'O', as this can lead to erroneous statistical results. If any outliers are found, the data entry sheet should be consulted to determine whether the value is genuine or an entry error. On occasion, two different values may be present for a single data point, in which case both should not be entered into a single cell; the correct method is to prepare two different cells for the same variable (e.g., for the loose dentition column, the researcher might have recorded upper and lower dentition as two different options; correct data entry then requires two different cells for the responses U and L, whereas entering UL, 'U and L' or 'U, L' in one cell would be incorrect).
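Listing the permitted values for each variable in the codebook makes such checks mechanical. A brief sketch (the code 9 for "missing" is a hypothetical choice, as is the variable used):

```python
# Permitted codes for a dichotomous variable: 0 = absent, 1 = present,
# 9 = missing (a value outside the possible 0/1 range, per the text).
PERMITTED = {0, 1, 9}

def invalid_positions(entries):
    """Return the positions of entries that are not permitted codes --
    this catches typos such as the letter 'O' entered in place of zero."""
    return [i for i, v in enumerate(entries) if v not in PERMITTED]

buck_teeth = [0, 1, 1, 9, "O", 1, 2]
print(invalid_positions(buck_teeth))  # [4, 6] -- the 'O' and the stray 2
```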
While cleansing the data, the highest priority should be given to the data pertaining to the primary objective of the study.[6]

Measurement of Variables

Statistically, quantitative variables are expressed as the mean (average of all the values), median (the value of the middle observation, with 50% of the data lying above and 50% below when the data are arranged in order), mode (the most commonly occurring value), standard deviation (SD, a measure of the spread of values around the mean) or interquartile range (a measure of statistical dispersion equal to the difference between the 25th and 75th percentiles, i.e., the first quartile subtracted from the third), whereas qualitative variables are usually expressed as frequencies (the number of times an event occurs), proportions or percentages. SD is a measure of variability and should be quoted when describing the distribution of sample data. The standard error (SE) is the SD of sample means (from two or more groups); it is used to calculate 95% confidence intervals and is therefore a measure of precision (of how well sample data can predict a population parameter). SE is a much smaller value than SD and is often presented (wrongly) for this reason. Percentiles divide a data set into 100 units of equal size, whereas quartiles divide the ordered data into four equal quarters. They are interrelated: the first quartile is the 25th percentile, the second quartile the 50th percentile and the third quartile the 75th percentile; the second quartile (50th percentile) is also the median of the ordered data. Thus, the first quartile is the middle number between the smallest value and the median, whereas the third quartile is the middle number between the median and the highest value.
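All the summary measures just defined are available in the standard library; a minimal sketch with hypothetical sample values:

```python
import math
import statistics

x = [4.0, 4.5, 5.0, 5.0, 5.5, 6.0, 6.5]   # e.g., thyromental distances (cm)

mean = statistics.mean(x)
median = statistics.median(x)
mode = statistics.mode(x)                  # most commonly occurring value
sd = statistics.stdev(x)                   # sample SD: spread about the mean
se = sd / math.sqrt(len(x))                # SD of the sample mean (precision)
q1, q2, q3 = statistics.quantiles(x, n=4)  # quartiles; q2 equals the median
iqr = q3 - q1                              # interquartile range

print(round(mean, 2), median, mode, iqr)   # 5.21 5.0 5.0 1.5
```

Note that `se` is smaller than `sd`, which is exactly why the SE is sometimes presented in place of the SD even though it measures precision rather than variability.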
The observed variables are assessed first for how the data are distributed and then for comparison. They can either be distributed uniformly around a central value (normal or Gaussian distribution, where the mean, median and mode assume the same value) [Figure 1], or be scattered far from the central value (non-normal distribution, where the mean is not a reliable reflection of the central tendency of the values) [Figure 2]a and [Figure 2]b. They are expressed in terms of measures of central tendency/location – a single value that attempts to describe a set of data by identifying its central position (mean, median and mode)[7] – and in terms of measures of dispersion: SD; SE (which reflects the extent to which the sample mean might reflect the mean of the whole population); range (the difference between the smallest and largest values in a data set); variance (a measure of how far a set of numbers is spread out from its average value); and percentile.

Distribution

Since the distribution of the data determines the types of tests to be applied, the data should be assessed carefully to confirm the distribution pattern. Overall, normally distributed data have better validity. If the mean, median and mode coincide within a narrow range, the distribution is normal. In a normal distribution, one SD on either side of the mean includes 68% of the total area, two SDs include 95.4% and three SDs include 99.7%; in addition, 95% of the population lies within 1.96 SDs [Figure 1]. When the mean and median differ by more than 5%–10%, there is a higher possibility that the data are skewed, and for skewed data the median is a more robust and reliable measure of central tendency than the mean. When the distribution is normal, the data extend to about two SDs on either side of the mean; in other words, for a variable that cannot take negative values, if the mean is smaller than twice the SD, the data are likely to be skewed.
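The two rules of thumb above (mean–median disagreement, and mean smaller than twice the SD for a non-negative variable) can be combined into a rough screening check; this is only a heuristic sketch with hypothetical data, not a substitute for a formal goodness-of-fit test:

```python
import statistics

def looks_skewed(data):
    """Heuristic from the text: flag possible skew if the mean and median
    differ by more than ~10%, or if the mean (of a variable that cannot
    be negative) is smaller than twice the SD."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    sd = statistics.stdev(data)
    relative_diff = abs(mean - median) / mean
    return relative_diff > 0.10 or mean < 2 * sd

symmetric = [48, 49, 50, 50, 51, 52]    # mean == median == 50
long_tail = [1, 1, 2, 2, 3, 20]         # long right tail drags the mean up
print(looks_skewed(symmetric), looks_skewed(long_tail))  # False True
```

Data flagged by such a screen would then be examined with the formal normality tests, or summarised by the median rather than the mean.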
When data from several groups are assessed for distribution, if the SD changes proportionately in the same direction as the mean, the data are quite likely to be skewed. Goodness-of-fit tests for normality (Kolmogorov–Smirnov and Shapiro–Wilk) can also be applied to confirm the type of distribution of the data. While the mean is the better measure of central tendency in a Gaussian distribution, the median is the better measure for a non-normal distribution. Asymmetrically distributed data can be positively skewed, where the skewed values lie mostly to the right of the peak [Figure 2]a, or negatively skewed, where they lie mostly to the left of the peak [Figure 2]b. When faced with skewed data, one may try to normalise them by applying square root, cube root or logarithmic functions, and so on. If the distribution normalises after such a transformation, the data can be subjected to parametric tests; if the data remain skewed, non-parametric tests should be used.

Practice pearls

- Data obtained from research are first compiled, tabulated and then cleansed/scrubbed
- Data collected may vary (variables) and can be qualitative or quantitative; based on the role they play, variables can be dependent or independent
- Measures of central location are the mean, median and mode
- Measures of dispersion are the SD, SE, range, variance and percentile
- Data distribution is assessed as uniform (Gaussian) or non-uniform (non-Gaussian) for statistical analysis
- Normally distributed data have better validity

Financial support and sponsorship

Nil.

Conflicts of interest

There are no conflicts of interest.

References


