|Year : 2019 | Volume
| Issue : 3 | Page : 120-125
Descriptive statistics: Measures of central tendency, dispersion, correlation and regression
Zulfiqar Ali1, S Bala Bhaskar2, K Sudheesh3
1 Department of Anaesthesiology, Sher-i-Kashmir Institute of Medical Sciences, Srinagar, Jammu and Kashmir, India
2 Department of Anaesthesiology, Vijayanagar Institute of Medical Sciences, Ballari, Karnataka, India
3 Department of Anaesthesiology, Bangalore Medical College and Research Institute, Bengaluru, Karnataka, India
|Date of Submission||20-Dec-2019|
|Date of Acceptance||22-Dec-2019|
|Date of Web Publication||30-Jan-2020|
Prof. S Bala Bhaskar
Vijayanagar Institute of Medical Sciences, Ballari, Karnataka
Source of Support: None, Conflict of Interest: None
Large data obtained from research are subjected to statistical analysis so that outcomes can be extrapolated to the larger population. Towards this end, such large data have to be consolidated into smaller, simpler expressions of measures, representing the outcomes of the whole sample. These form the descriptive statistics, which will later on help in inferential statistics, involving the different variables within one group and more than one group. Their distribution features are analysed and are described as sums, averages, relationships and differences. These measures are classified as those of central location and those of dispersion. Mean, median and mode are the three main measures of central tendency, and range, percentile, variance, standard deviation, standard error and confidence interval are measures of dispersion. Correlation and regression can be used to describe the relationship between two numerical variables. Correlation is a measure of association and regression is used for prediction. Regression analysis helps to assess 'influential' relationships between the data. Changes among one or more variables might affect other variables.
Keywords: Central location, confidence intervals, data, dispersion, measures, numerical, regression analysis
|How to cite this article:|
Ali Z, Bhaskar S B, Sudheesh K. Descriptive statistics: Measures of central tendency, dispersion, correlation and regression. Airway 2019;2:120-5
|How to cite this URL:|
Ali Z, Bhaskar S B, Sudheesh K. Descriptive statistics: Measures of central tendency, dispersion, correlation and regression. Airway [serial online] 2019 [cited 2020 Apr 5];2:120-5. Available from: http://www.arwy.org/text.asp?2019/2/3/120/277331
| Introduction|| |
Statistics is a branch of science that deals with the collection, organisation, summarisation and analysis of data and drawing of inferences from these samples to the whole population. Thus, there are two broad categories of statistics: descriptive statistics and inferential statistics. Descriptive statistics describes the relationship between variables in a sample or population, whereas inferential statistics makes inferences about the population based on a random sample from that population.
Descriptive statistics involves various methods that reduce large sets of data to summaries, presented in the form of tables or graphs, that characterise the features of their distribution; these are described as sums, averages, relationships and differences. They are measured in terms of central location and of dispersion. Descriptive statistics are not 'decision' oriented. Pilot studies, for example, are descriptive.
In inferential statistics, the summary data (used for descriptive statistics) are processed in order to estimate, or predict, characteristics of another (usually larger) group. That is, the tests extrapolate/infer sample data and generalise that to the larger population, usually with calculated degrees of certainty. The details of inferential statistics will be covered in the next article in this series of basic statistical considerations.
| Expression of Data in Descriptive Statistics|| |
The extent to which the descriptive observations cluster around a central location is described by the central tendency and the spread towards the extremes is described by the degree of dispersion.
| Measures of Central Tendency|| |
The measure of central tendency is a single value which best represents the characteristic of the data. Mean, median and mode are the three main measures of central tendency. The mean is the arithmetic average value (μ), median the middle value and mode the most common value in a series of observations.
The mean is highly influenced by extreme values, which are called outliers. For example, if the thyromental distance with head in maximum extension of nine patients is 8.2, 8.7, 8.4, 8.9, 8.2, 9.1, 8.5, 8.6 and 5.0 cm, the simplest approach to identifying outliers is to rank the observations from lowest to highest: 5.0, 8.2, 8.2, 8.4, 8.5, 8.6, 8.7, 8.9 and 9.1 cm. Of these values, the thyromental distance of 5.0 cm is an outlier.
The median is the middle value of a distribution in ranked data: half of the values are above it and half are below it. The mode is the most frequently occurring value in a distribution. The mean thyromental distance of the patients in the above example is 8.18 cm (73.6/9), whereas the median and mode are 8.5 and 8.2 cm, respectively; the outlier pulls the mean below the median.
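The three measures for the thyromental distance example can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and are not part of the original article.

```python
from statistics import mean, median, mode

# Thyromental distances (cm) of the nine patients in the example above
tmd_cm = [8.2, 8.7, 8.4, 8.9, 8.2, 9.1, 8.5, 8.6, 5.0]

tmd_mean = mean(tmd_cm)      # arithmetic average; pulled down by the 5.0 cm outlier
tmd_median = median(tmd_cm)  # middle value of the ranked observations
tmd_mode = mode(tmd_cm)      # most frequently occurring value

print(f"mean = {tmd_mean:.2f} cm, median = {tmd_median} cm, mode = {tmd_mode} cm")

# Recomputing without the outlier shows how much more the mean moves than the median
without_outlier = [x for x in tmd_cm if x != 5.0]
print(f"without outlier: mean = {mean(without_outlier):.3f} cm, "
      f"median = {median(without_outlier):.2f} cm")
```

Removing the single outlier shifts the mean by roughly 0.4 cm but the median by only 0.05 cm, which is why the median is usually preferred for data with extreme values.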
| Measures of Dispersion|| |
Observed data may cluster closely around the central value or be dispersed away from it. This spread is expressed in terms of measures of dispersion (range, percentile, variance, standard deviation [SD], standard error [SE] and confidence interval [CI]).
Range is the difference between the minimum and the maximum values in a sample (e.g., if thyromental distance with head in maximum extension in a sample of patients is 8.2, 8.7, 8.4, 8.9, 8.2, 9.1, 8.5, 8.6 and 5.0 cm, the range is 5.0–9.1 cm, a difference of 4.1 cm). It describes the variability of distribution in a sample. However, the range provides little information about the overall distribution of the data and is heavily affected by outliers (e.g., 5.0 cm in the above example).
| Normal Distribution or Gaussian Distribution and Measures of Dispersion|| |
Most biological variables cluster around a central value, with symmetrical positive and negative deviations about this point. The further the value of a variable deviates from the central point, the less frequently it is seen. The standard normal distribution curve is a symmetrical bell-shaped curve. In a normal distribution curve, about 68% of the values are within one SD of the mean. Around 95% of the values are within two SDs of the mean, and around 99.7% are within three SDs of the mean [Figure 1].
|Figure 1: Symmetrical distribution (mean [μ], standard deviation [SD/σ])|
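The proportions of a normal distribution lying within one, two and three SDs of the mean can be verified numerically. The sketch below assumes a standard normal distribution (mean 0, SD 1) and uses `statistics.NormalDist` from the Python standard library; the helper name `coverage_within` is illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, SD 1
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage_within(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {coverage_within(k):.1%}")
```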
Variance gives a measure of the spread of the distribution of variables in a population. It gives an indication of how closely individual observations cluster about the mean value. A large variance indicates that the data in the set are far from the mean and each other, whereas a small variance indicates that the data in the set are close to the mean.
The variance of a sample is defined by:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
where $s^2$ is the sample variance, $\bar{x}$ is the sample mean, $x_i$ is the ith element from the sample and $n$ is the number of elements in the sample.
The formula for the variance of a population has the value 'n' as the denominator. The expression 'n − 1' represents the degrees of freedom and is one less than the number of observations.
Variance is measured in squared units. However, in order to make the interpretation of the data simple, the square root of variance is used. The positive square root of the variance is denoted by the SD defined by the following formula:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}}$$
where $\sigma$ is the population SD, $\bar{X}$ is the population mean, $X_i$ is the ith element from the population and $N$ is the number of elements in the population.
The SD of a sample is defined by a slightly different formula:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
where $s$ is the sample SD, $\bar{x}$ is the sample mean, $x_i$ is the ith element from the sample and $n$ is the number of elements in the sample.
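For example, the interincisor distance of five patients undergoing laparoscopic cholecystectomy was 45, 45, 35, 35 and 40 mm.
Mean = (45 + 45 + 35 + 35 + 40)/5 = 40 mm
Sum of squared deviations from the mean = 25 + 25 + 25 + 25 + 0 = 100
Variance = 100/(5 − 1) = 25 mm²
SD = √25 = 5 mm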
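The sample variance and SD for the interincisor example can be reproduced step by step. This is a minimal sketch using only the Python standard library; it computes the sample formulas by hand and compares them with `statistics.variance` and `statistics.stdev`, and also shows the population versions (denominator n) for contrast. Variable names are illustrative.

```python
from statistics import mean, variance, stdev, pvariance, pstdev

# Interincisor distances (mm) of the five patients in the example
iid_mm = [45, 45, 35, 35, 40]

sample_mean = mean(iid_mm)
squared_deviations = [(x - sample_mean) ** 2 for x in iid_mm]

# Sample variance: sum of squared deviations divided by the degrees of freedom (n - 1)
sample_variance = sum(squared_deviations) / (len(iid_mm) - 1)
sample_sd = sample_variance ** 0.5

print(f"mean = {sample_mean} mm, variance = {sample_variance} mm^2, SD = {sample_sd} mm")

# The library functions use the same n - 1 denominator
print(f"statistics.variance = {variance(iid_mm)}, statistics.stdev = {stdev(iid_mm)}")

# Population formulas divide by n instead, giving smaller values
print(f"population variance = {pvariance(iid_mm)}, population SD = {pstdev(iid_mm):.3f}")
```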
Another measure of variability is the coefficient of variation (CV). It considers the relative size of the SD with respect to the mean. It is mainly used to describe the variability of instruments used for measuring various physiological functions.
CV = SD/mean × 100%.
A CV of <5% is taken as acceptable. Sources of variation reflected in the CV include biological variability (variation between individuals and over time, which leads to scatter), random error (due to measurement imprecision) and systematic error (mistakes or biases in measurement or recording). Random error can be reduced by using accurate measurement instruments, taking multiple measurements and having measurements taken by trained observers. Systematic error cannot be compensated for by increasing sample size.
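The CV formula can be expressed as a small helper. In the sketch below, the function name `coefficient_of_variation` is illustrative, and the interincisor data from the earlier example are reused purely to show the arithmetic (they are patient measurements, not repeated readings from one instrument).

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV as a percentage: sample SD divided by the mean, times 100."""
    return stdev(values) / mean(values) * 100

# Interincisor distances (mm), reused only to illustrate the calculation
iid_mm = [45, 45, 35, 35, 40]
cv_percent = coefficient_of_variation(iid_mm)
print(f"CV = {cv_percent:.1f}%")  # SD of 5 mm relative to a mean of 40 mm
```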
| Standard Error of the Mean|| |
Another measure of variability is the standard error (SE), or standard error of the mean. The mean of the sample data obtained from a population may not be the exact mean that would be obtained if data from the entire population were available. If we take more samples, the means of these samples may vary. The SD of the distribution of these sample means is called the SE of the mean, or SE. It is calculated from the SD and sample size (n):
$$SE = \frac{SD}{\sqrt{n}}$$
For example, the interincisor distance of five patients undergoing laparoscopic cholecystectomy was 45, 45, 35, 35 and 40 mm. The SD is 5 mm, so SE = 5/√5 = 2.236 mm.
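The SE for this example follows directly from the SD and the sample size. A minimal sketch with the Python standard library (variable names are illustrative):

```python
from math import sqrt
from statistics import stdev

# Interincisor distances (mm) of the five patients in the example
iid_mm = [45, 45, 35, 35, 40]

sample_sd = stdev(iid_mm)                       # sample SD (n - 1 denominator)
standard_error = sample_sd / sqrt(len(iid_mm))  # SE = SD / sqrt(n)

print(f"SD = {sample_sd} mm, n = {len(iid_mm)}, SE = {standard_error:.3f} mm")
```

Because the SE shrinks with the square root of n, quadrupling the sample size only halves the SE.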
| Confidence Limits (Confidence Interval)|| |
The true mean of a population may not coincide exactly with the mean of the data obtained from the sample. With an appropriately sized sample (>30), we may be able to specify within what limits the true mean lies and how confident we are that it lies within these limits. These are called confidence limits or CI [Figure 2].
Lower confidence limit (95%) = mean − 1.96 × SE
Upper confidence limit (95%) = mean + 1.96 × SE
For example, the interincisor distance of five patients undergoing laparoscopic cholecystectomy was 45, 45, 35, 35 and 40 mm.
Mean interincisor distance = 40 mm
SE = 2.236
Lower CI = 40 – 2.236 × 1.96 = 35.62
Upper CI = 40 + 2.236 × 1.96 = 44.38
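Therefore, we can be 95% confident that the true mean interincisor distance of the population from which these patients were drawn lies between 35.62 and 44.38 mm; if the sampling were repeated many times, about 95% of the intervals calculated in this way would be expected to contain the true mean. The sample of five is used here only to illustrate the arithmetic and is smaller than the sample size suggested above.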
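The confidence limits above can be reproduced as follows. This sketch keeps the multiplier of 1.96 used in the text, which comes from the normal distribution; for a sample as small as five, a multiplier from the t distribution (about 2.776 for 4 degrees of freedom) is often preferred and would give a wider interval. The constant name `Z_95` is illustrative.

```python
from math import sqrt
from statistics import mean, stdev

Z_95 = 1.96  # normal-distribution multiplier for a 95% CI, as used in the text

# Interincisor distances (mm) of the five patients in the example
iid_mm = [45, 45, 35, 35, 40]

sample_mean = mean(iid_mm)
standard_error = stdev(iid_mm) / sqrt(len(iid_mm))

lower_limit = sample_mean - Z_95 * standard_error
upper_limit = sample_mean + Z_95 * standard_error

print(f"95% CI: {lower_limit:.2f} to {upper_limit:.2f} mm")
```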
| Skewed Distribution and Measures of Dispersion|| |
Here, the distribution of data is asymmetrical about the mean. In a negatively skewed distribution (left-skewed), the mass of the distribution is concentrated on the right of the figure leading to a curve with a long left tail. In a positively skewed distribution (right-skewed), the mass of the distribution is concentrated on the left of the figure leading to a long right tail.
If we rank the data into percentiles or quartiles, we get better information about the pattern of spread of the variables. In percentiles, we rank the observations into 100 equal parts and then describe the data as being at the 25th, 50th, 75th or any other percentile. The median is taken as the 50th percentile.
An alternate expression is when the ordered data are divided into four equal parts/quarters of 25% each by three values: one at 25%, one at 50% and one at 75%. These dividing values are called quartiles (Q1, Q2 and Q3). Q1 corresponds to P25, Q2 corresponds to P50 and Q3 corresponds to P75; Q2 is the median value in a set of data. The interquartile range (IQR) covers the middle 50% of the observations about the median, between the first and third quartiles (25th–75th percentile) [Figure 3]. For example, for a median thyromental distance of 7.5 cm in a series of ordered data in which Q1 is 7.5 cm and Q3 is 8.0 cm, the representation is 7.5 (7.5–8.0) cm.
|Figure 3: The distribution of data into quarters, quartiles and percentiles (Q1, Q2 and Q3 – quartiles; P25, P50 and P75 – percentiles)|
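Quartiles and the IQR can be computed for the thyromental distance data used earlier. This is a minimal sketch with `statistics.quantiles`; the choice of `method="inclusive"` is an assumption made here, and other quartile conventions (including the function's default) can give slightly different cut points for small samples.

```python
from statistics import quantiles

# Thyromental distances (cm) from the earlier example, including the outlier
tmd_cm = [8.2, 8.7, 8.4, 8.9, 8.2, 9.1, 8.5, 8.6, 5.0]

# n=4 splits the ranked data into quarters and returns the three cut points
q1, q2, q3 = quantiles(tmd_cm, n=4, method="inclusive")
iqr_width = q3 - q1

# Conventional reporting format: median (Q1-Q3)
print(f"median (IQR) = {q2} ({q1}-{q3}) cm; IQR width = {iqr_width:.1f} cm")
```

Unlike the range, the IQR here is unaffected by the 5.0 cm outlier, since that value lies outside the middle 50% of the data.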
| Regression|| |
Regression is a statistical tool used to find relationships within a given set of data. It is a measure of how change in an independent variable will affect the dependent variable. For example, blood pressure rises linearly with advancing age/weight; here, the variable to be explained (blood pressure) is called the dependent variable and the variables that explain it (age/weight) are called independent variables. The dependent variable is the main factor that a researcher is trying to understand or predict, and an independent variable is one that may have an impact on the dependent variable. Thus, 'regression' identifies and characterises the relationship among multiple factors. It also helps in identification of the prognostically relevant risk factors among multiple identified factors and is used in calculation of risk scoring systems as well. Regression analysis describes association; the association is not to be interpreted as causation.
Some of the types of regression include linear regression, Cox regression and logistic regression. Linear regression describes a linear relationship between a dependent variable (which must be continuous, such as weight or blood pressure) and independent variables (which may be continuous, such as age and duration of laryngoscopy, or categorical, such as social status). For example, greater difficulty in intubation is associated with a longer time for intubation, and a longer time for intubation with greater increases in systolic blood pressure. For assessing the relationship between two continuous variables, a scatter plot/graph should first be examined to see whether the relationship is linear or non-linear. Linear regression is of value when the relationship is linear. When multiple independent variables each affect the dependent variable, multivariable linear regression is used. Proper use of regression coefficients and clinical experience are required for extracting the correct correlations.
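Simple linear regression with one independent variable can be sketched with the ordinary least-squares formulas. The function name `fit_line` and the data below are hypothetical and made up for illustration only; they are not taken from any study.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-products and sum of squares about the means
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical data: time taken for intubation (s) and rise in systolic BP (mmHg)
intubation_time_s = [10, 20, 30, 40, 50]
sbp_rise_mmhg = [6, 9, 16, 19, 25]

slope, intercept = fit_line(intubation_time_s, sbp_rise_mmhg)
print(f"SBP rise = {intercept:.2f} + {slope:.2f} x intubation time")

# Prediction for a new value of the independent variable
predicted_rise = intercept + slope * 35
print(f"predicted rise at 35 s = {predicted_rise:.1f} mmHg")
```

The slope is the regression coefficient: the average change in the dependent variable for a one-unit change in the independent variable. Predictions are only reasonable within the range of the observed data.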
Logistic regression is appropriate when the dependent variable is dichotomous (binary). It describes data and explains the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables, for example, relating the occurrence of lung cancer (Yes vs No) to each kilogram of weight gain and to each pack of cigarettes smoked per day.
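A fitted logistic regression model expresses the log-odds of the binary outcome as a linear combination of the independent variables. The sketch below shows only how such a model is read once it has been fitted; the intercept and coefficients are hypothetical placeholders, not estimates from any data set, and the function names are illustrative.

```python
import math

def predicted_probability(intercept, coefficients, values):
    """Convert the linear predictor (log-odds) into a probability between 0 and 1."""
    log_odds = intercept + sum(b * v for b, v in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-log_odds))

def odds_ratio(coefficient):
    """Odds ratio for a one-unit increase in an independent variable."""
    return math.exp(coefficient)

# Hypothetical model: outcome (Yes vs No) explained by weight gain (kg)
# and packs of cigarettes smoked per day. These numbers are placeholders.
intercept = -3.0
coefficients = [0.05, 0.7]

probability = predicted_probability(intercept, coefficients, [4, 1])
print(f"predicted probability = {probability:.3f}")
print(f"odds ratio per pack per day = {odds_ratio(coefficients[1]):.2f}")
```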
Cox regression is frequently applied for modelling of survival data, e.g., survival analysis with many variates.
The sample size influences the relationships assessed using regression analyses. Samples that are too small will detect only very strong relationships. As a rule of thumb, the sample size should be about 20 times the number of independent variables being studied. For example, if a researcher is studying the effect of two independent variables on a dependent variable, then at least 40 observations are desirable.
| Correlation|| |
Correlation is used to denote the association between two quantitative variables. When assessing for correlation, it is assumed that the association is linear, in that one variable increases or decreases by a fixed amount for a unit increase or decrease in the other. Correlation analysis tells us whether an association exists between two variables 'x' and 'y', whereas regression analysis predicts the value of the dependent variable based on the known value of the independent variable, assuming that an average mathematical relationship exists between two or more variables.
Examining for the presence of association requires a few steps. The collected data have to be plotted on a scatter plot [Figure 4] with the vertical axis representing the dependent variable and the horizontal axis representing the independent variable. In the example given, there is an attempt to assess if association exists between height and thyromental distance [Figure 4]. The height (independent variable) forms the horizontal axis, and the thyromental distance forms the vertical axis.
|Figure 4: Scatter plot showing correlation between height and thyromental distance (r = +0.97)|
Looking at the scatter plot, we need to see if correlation exists and then go ahead with calculation of the correlation coefficient. The correlation coefficient ('r') is represented by a value that varies from +1 through 0 to −1. Complete correlation between two variables is expressed by either +1 or −1. When one variable increases as the other increases, the correlation is positive; when one decreases as the other increases, it is negative. Complete absence of correlation is represented by 0. Values lying between 0 and +1 represent varying degrees of positive correlation, whereas values between 0 and −1 represent varying degrees of negative correlation. The strength of association can be arbitrarily graded as weak or strong based on the absolute value of the correlation coefficient. Values of r between 0 and 0.19 are regarded as very weak, 0.2–0.39 as weak, 0.40–0.59 as moderate, 0.6–0.79 as strong and 0.8–1 as very strong correlation. In the example quoted above, an 'r' value of +0.97 represents a very strong positive correlation between height and thyromental distance.
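The calculation of r and the grading of its strength can be sketched as below. The height and thyromental distance values are hypothetical (the raw data behind Figure 4 are not given in the article), and the function names `pearson_r` and `grade_strength` are illustrative.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length series."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

def grade_strength(r):
    """Grade |r| using the arbitrary cut-offs quoted in the text."""
    magnitude = abs(r)
    if magnitude < 0.2:
        return "very weak"
    if magnitude < 0.4:
        return "weak"
    if magnitude < 0.6:
        return "moderate"
    if magnitude < 0.8:
        return "strong"
    return "very strong"

# Hypothetical data: height (cm) and thyromental distance (cm)
height_cm = [150, 155, 160, 165, 170, 175, 180]
tmd_cm = [6.0, 6.4, 6.5, 6.9, 7.2, 7.3, 7.8]

r = pearson_r(height_cm, tmd_cm)
direction = "positive" if r > 0 else "negative"
print(f"r = {r:+.2f}: {grade_strength(r)} {direction} correlation")
```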
However, it is important to note that the mere establishment of correlation does not establish causation. The correlation coefficient is also tested for its occurrence by chance by applying a t-test and deriving a P value; P < 0.05 suggests that a correlation coefficient of this size is unlikely to have occurred by chance. The coefficient of determination (r², the square of the correlation coefficient) gives us an idea about the contribution of the parameter of interest to the outcome measure. In the quoted example [Figure 4], r = 0.97 gives r² ≈ 0.94, which says that about 94% of the variability in thyromental distance is accounted for by height, the remainder being due to other variables that may influence the observations.
Pearson's correlation coefficient
This test is applied when the variables are normally distributed and a linear association exists between them.
Spearman's rank correlation
This test is applied when the data show skewed distribution, the data are quantitative discrete or the data are arranged in an order (ordinal data), for example, association between laryngoscopy time and Cormack–Lehane grades, where Cormack–Lehane grades constitute ordinal data, whereas laryngoscopy time is a quantitative continuous variable.
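Spearman's rank correlation can be obtained by ranking each variable (giving tied values the average of their ranks) and then calculating Pearson's coefficient on the ranks. The sketch below takes that approach; the Cormack–Lehane grades and laryngoscopy times are hypothetical values for illustration, and the function names are illustrative.

```python
from math import sqrt
from statistics import mean

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next ranked value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of 1-based positions in the run
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length series."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: Cormack-Lehane grade (ordinal) and laryngoscopy time (s)
cl_grade = [1, 1, 2, 2, 3, 3, 4]
laryngoscopy_time_s = [12, 15, 14, 20, 25, 31, 40]

rho = spearman_rho(cl_grade, laryngoscopy_time_s)
print(f"Spearman's rho = {rho:+.2f}")
```

Because it works on ranks, the coefficient reaches +1 for any strictly increasing relationship, linear or not.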
| Conversion of One Measure to the Other|| |
At times, for analysis, some measures may need to be converted to another form as a statistical requirement, for critical appraisal of articles or for meta-analysis without access to the raw data, such as median to mean, CI to SD, range to SD or IQR to SD. Formulae have been recommended in these situations, but their discussion is beyond the scope of this article.
| Practice Pearls|| |
- Large data are simplified first in descriptive statistics, described as sums, averages, relationships and differences
- The data distribution measures are described based on symmetrical (Gaussian) and asymmetrical (non-Gaussian) distribution of data
- Measures of central tendency are mean, median and mode
- Measures of dispersion are range, percentile, variance, SD, SE and CI
- Correlation and regression can be used to describe the relationship between two numerical variables
- Correlation is a measure of association and regression is used for prediction
- Formulae are available where some measures can be converted to another form during critical appraisal of evidence.
Financial support and sponsorship
None.
Conflicts of interest
There are no conflicts of interest.
| References|| |
Winters R, Winters A, Amedee RG. Statistics: A brief overview. Ochsner J 2010;10:213-6.
Goneppanavar U, Ali Z, Bhaskar SB, Divatia JV. Types of data, methods of collection, handling and distribution. Airway 2019;2:36-40.
Kannan S, Dongare PA, Garg R, Harsoor SS. Describing and displaying numerical and categorical data. Airway 2019;2:64-70.
Manikandan S. Measures of central tendency: Median and mode. J Pharmacol Pharmacother 2011;2:214-5.
Myles PS, Gin T. Statistical Methods for Anaesthesia and Intensive Care. 1st ed. Oxford: Butterworth Heinemann; 2000. p. 8-10.
Ali Z, Bhaskar SB. Basic statistical tools in research and data analysis. Indian J Anaesth 2016;60:662-9.
Schneider A, Hommel G, Blettner M. Linear regression analysis: Part 14 of a series on evaluation of scientific publications. Dtsch Arztebl Int 2010;107:776-82.
Aggarwal R, Ranganathan P. Common pitfalls in statistical analysis: Linear regression analysis. Perspect Clin Res 2017;8:100-2.
Campbell MJ, Swinscow TD. Correlation and regression. Statistics at Square One. 11th ed. West Sussex: Wiley Blackwell Publications; 2009. p. 119-27.
Hozo SP, Djulbegovic B, Hozo I. Estimating the mean and variance from the median, range, and the size of a sample. BMC Med Res Methodol 2005;5:13.
Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, et al., editors. Cochrane Handbook for Systematic Reviews of Interventions Version 6.0. Cochrane 2019. Available from: www.training.cochrane.org/handbook. [Last accessed on 2019 Nov 30].