Reliability


Reliability relates to the extent of variability and error inherent in a measurement and can be interpreted as the extent to which measurements can be replicated1. Reliability reflects not only degree of correlation but also agreement between measurements1.

Reliability is NOT the same as validity

“in order to improve reliability, attempts must be made to remove random error”2

Significance of Reliability

  1. One cannot be confident in the results of a scale with poor reliability3.
Examples of poor reliability
  • If a person’s score is 10 on one occasion and 15 on another (assuming the person has not changed), we cannot be confident in either score3.
  • If 1 rater gives a score of 7 and another a score of 12, we wouldn’t be sure which, if either, score is correct3.
  • Similarly, if the items of a scale are not related to each other, we would not be sure just what it is that the scale is measuring (assuming that the scale is measuring a homogeneous construct)3.

Misconceptions

Reliability is sometimes used synonymously with ‘precision,’ ‘agreement,’ and ‘repeatability,’ but these terms mislabel the concept3.

Non-mathematical explanation

Reliability is made up of two parts: correlation and agreement.

:::{layout-ncol="2"}

### Correlation

  • Correlation refers to the presence of a consistent pattern between results.
  • For example, results with high correlation would mean that Test 2 is always greater than Test 1.
    • Or Test 2 is always slightly smaller than Test 1.

### Agreement

  • Agreement quantifies the magnitude of the difference between Test 1 and Test 2.
  • It does not indicate whether the results follow a similar pattern, but how much they differ from each other.

:::

Calculation

Reliability is the proportion of the total variance (\(\sigma_{Total}^2\)) in scores that is due to differences among people3.

\[ \text{Reliability} = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_e^2} = \frac{\sigma_s^2}{\sigma_{Total}^2} \]

| Symbol | Meaning |
|---|---|
| \(\sigma_s^2\) | Subject variability3,4 |
| \(\sigma_e^2\) | Measurement error3,4 |
| \(\sigma_{Total}^2\) | Total variance3,4 |
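The formula can be expressed as a small function (a minimal sketch; the variance values passed in are hypothetical):

```python
# A minimal sketch of the reliability formula: the proportion of total
# variance that is due to differences among people.
def reliability(subject_var: float, error_var: float) -> float:
    """Reliability = subject variance / (subject variance + error variance)."""
    return subject_var / (subject_var + error_var)

print(reliability(subject_var=9.0, error_var=1.0))  # 0.9
print(reliability(subject_var=1.0, error_var=1.0))  # 0.5
print(reliability(subject_var=0.0, error_var=1.0))  # 0.0 -- no variability among subjects
```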
Note
  • Terms such as ‘precision,’ ‘agreement,’ and ‘repeatability’ focus solely on the fact that the error term should be as small as possible3.
  • These terms ignore the fact that reliability also depends on the variability among people3.
Subject variability implications
  • If there is no variability among participants, then \(\sigma_s^2 = 0\) and the reliability is 03.
    • The reliability of a scale reflects its ability to differentiate among people; if it cannot, the reliability is 0 and the scale is useless3.
  • Another implication of \(\sigma_s^2\) is that reliability is not a fixed property, but rather depends on the sample being studied.
    • Reliability is very dependent on the sample in which it is determined3.
    • Example: applying a depression scale to ER patients vs. outpatient psychiatric patients
      • The ER patients would range from people with very low depression to those with extremely high (suicidal) depression3.
        • This sample has very high patient variability (\(\sigma_s^2\)) and thus a higher reliability score3.
      • The outpatient psychiatric patients would likely have moderate levels of depression (those with severe depression would be admitted to a hospital)3.
        • Thus there will be low patient variability (\(\sigma_s^2\)) and the reliability will be lower3.
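The ER vs. outpatient example can be sketched numerically (all variance values below are invented for illustration): the same instrument, with the same error variance, yields a higher reliability coefficient in the sample with the wider spread of subjects.

```python
# Hypothetical illustration of the ER vs. outpatient example: the same
# instrument (same error variance) looks more reliable in the sample
# with greater between-subject spread. All numbers are made up.
def reliability(subject_var: float, error_var: float) -> float:
    return subject_var / (subject_var + error_var)

error_var = 4.0  # the instrument's measurement error is the same in both settings

er_subject_var = 36.0         # wide spread: very low to suicidal depression
outpatient_subject_var = 4.0  # narrow spread: mostly moderate depression

print(f"ER sample:         {reliability(er_subject_var, error_var):.2f}")  # 0.90
print(f"Outpatient sample: {reliability(outpatient_subject_var, error_var):.2f}")  # 0.50
```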

Types of Reliability

Inter-rater Reliability

Inter-rater Agreement

Inter-rater agreement (IRA) measures the variation in results of a test when performed by different assessors on the same patient at the same time point.

Intra-rater Reliability

Intra-rater reliability measures the variation in results when the same assessor performs the test across multiple time points.

Test-Retest Reliability

Test-retest reliability measures the consistency of an instrument’s results when the same patient is tested across multiple time points.

Internal Consistency

Reliability Coefficients

  • Historically, the Pearson correlation coefficient, paired t test, and Bland-Altman plot have been used to evaluate reliability1.
  • These tests only measure correlation and disregard variance, and thus are nonideal measures of reliability1.

Common reliability statistics include:

Scoring

Values range from 0.00 (not reliable) to 1.00 (perfectly reliable)

“Rule of thumb” quick interpretation of reliability

| Score | Reliability |
|---|---|
| 0.00 – 0.20 | Poor |
| 0.21 – 0.40 | Fair |
| 0.41 – 0.60 | Moderate |
| 0.61 – 0.80 | Good |
| 0.81 – 1.00 | Excellent |
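The rule-of-thumb table can be encoded as a small lookup helper (the function name is ours; the cut points and labels are taken from the table above):

```python
# Encodes the rule-of-thumb interpretation table for reliability scores.
def interpret_reliability(score: float) -> str:
    if not 0.0 <= score <= 1.0:
        raise ValueError("reliability coefficients range from 0.00 to 1.00")
    if score <= 0.20:
        return "Poor"
    if score <= 0.40:
        return "Fair"
    if score <= 0.60:
        return "Moderate"
    if score <= 0.80:
        return "Good"
    return "Excellent"

print(interpret_reliability(0.85))  # Excellent
print(interpret_reliability(0.55))  # Moderate
```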

For reliability measures, the confidence interval defines a range in which the true coefficient lies with a given probability5

For a result to show that reliability is better than chance at a confidence level of 95%, the lower limit of the CI must be above 0.65.

Interpretation

How High Should Reliability be?


  • Scales used in new, underdeveloped research areas should have a minimum reliability of 0.703.
  • Scales from mature areas of research should have a minimum reliability of 0.803.
  • If the scale is to be used for clinical purposes, the minimum is 0.903.

*Poor to moderate reliability is not suitable for clinical decision making, since you are likely to get a different result every time you test your patient, regardless of their status.

References

1. Koo TK, Li MY. A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research. Journal of Chiropractic Medicine. 2016;15(2):155-163. doi:10.1016/j.jcm.2016.02.012
2. Sim J, Arnell P. Measurement validity in physical therapy research. Physical Therapy. 1993;73(2):102-110; discussion 110-115. doi:10.1093/ptj/73.2.102
3. Streiner DL. Statistics Commentary Series: Commentary #15-Reliability. Journal of Clinical Psychopharmacology. 2016;36(4):305-307. doi:10.1097/JCP.0000000000000517
4. Gisev N, Bell JS, Chen TF. Interrater agreement and interrater reliability: Key concepts, approaches, and applications. Research in Social & Administrative Pharmacy. 2013;9(3):330-338. doi:10.1016/j.sapharm.2012.04.004
5. Zapf A, Castell S, Morawietz L, Karch A. Measuring inter-rater reliability for nominal data - which coefficients and confidence intervals are appropriate? BMC Medical Research Methodology. 2016;16:93. doi:10.1186/s12874-016-0200-9
