Reliability


Reliability relates to the extent of variability and error inherent in a measurement and can be interpreted as the extent to which measurements can be replicated1. Reliability reflects not only degree of correlation but also agreement between measurements1.

Reliability is NOT the same as validity

“in order to improve reliability, attempts must be made to remove random error”2

Significance of Reliability

  1. One cannot be confident in the results of a scale with poor reliability3.
Examples of poor reliability
  • If a person’s score is 10 on one occasion and 15 on another (assuming the person has not changed), we cannot be confident in either score3.
  • If 1 rater gives a score of 7 and another a score of 12, we wouldn’t be sure which, if either, score is correct3.
  • Similarly, if the items of a scale are not related to each other, we would not be sure just what it is that the scale is measuring (assuming that the scale is measuring a homogeneous construct)3.

Misconceptions

Reliability is sometimes used synonymously with ‘precision,’ ‘agreement,’ and ‘repeatability,’ but these terms mislabel the concept3.

Non-mathematical explanation

Reliability is made up of two parts: correlation and agreement.

:::{layout-ncol="2"}

### Correlation

  • Correlation refers to the presence of a consistent pattern between results.
  • For example, results with high correlation would mean that Test 2 is always greater than Test 1.
    • Or Test 2 is always slightly smaller than Test 1.

### Agreement

  • Agreement quantifies the magnitude of the difference between Test 1 and Test 2.
  • It does not indicate whether the results follow a similar pattern, but how much they differ from each other.

:::

Calculation

Reliability is the proportion of the total variance (\(\sigma_{Total}^2\)) in scores that is due to differences among people3.

\[ \text{Reliability} = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_e^2} = \frac{\sigma_s^2}{\sigma_{Total}^2} \]

| Symbol | Meaning |
|---|---|
| \(\sigma_s^2\) | Subject variability3,4 |
| \(\sigma_e^2\) | Measurement error3,4 |
| \(\sigma_{Total}^2\) | Total variance3,4 |
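The formula can be expressed as a small function (a minimal sketch; the variance values passed in are hypothetical):

```python
# A minimal sketch of the reliability formula: the proportion of total
# variance that is due to differences among people.
def reliability(subject_var: float, error_var: float) -> float:
    """Reliability = subject variance / (subject variance + error variance)."""
    return subject_var / (subject_var + error_var)

print(reliability(subject_var=9.0, error_var=1.0))  # 0.9
print(reliability(subject_var=1.0, error_var=1.0))  # 0.5
print(reliability(subject_var=0.0, error_var=1.0))  # 0.0 -- no variability among subjects
```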
Note
  • Terms such as ‘precision,’ ‘agreement,’ and ‘repeatability’ focus solely on the fact that the error term should be as small as possible3.
  • These terms ignore the fact that reliability also depends on the variability among people3.
Subject variability implications
  • If there is no variability among participants, then \(\sigma_s^2 = 0\) and the reliability is 03.
    • The reliability of a scale reflects its ability to differentiate among people; if it cannot, the reliability is 0 and the scale is useless3.
  • Another implication of \(\sigma_s^2\) is that reliability is not a fixed property, but rather depends on the sample being studied.
    • Reliability is very dependent on the sample in which it is determined3.
    • Example: applying a depression scale to ER patients vs. outpatient psychiatric patients
      • The ER patients would range from people with very low depression to those with extremely high (suicidal) depression3.
        • This sample has very high patient variability (\(\sigma_s^2\)) and thus a higher reliability score3.
      • The outpatient psychiatric patients would likely have moderate levels of depression (those with severe depression would be admitted to a hospital)3.
        • Thus there will be low patient variability (\(\sigma_s^2\)) and the reliability will be lower3.
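The ER vs. outpatient example can be sketched numerically (all variance values below are invented for illustration): the same instrument, with the same error variance, yields a higher reliability coefficient in the sample with the wider spread of subjects.

```python
# Hypothetical illustration of the ER vs. outpatient example: the same
# instrument (same error variance) looks more reliable in the sample
# with greater between-subject spread. All numbers are made up.
def reliability(subject_var: float, error_var: float) -> float:
    return subject_var / (subject_var + error_var)

error_var = 4.0  # the instrument's measurement error is the same in both settings

er_subject_var = 36.0         # wide spread: very low to suicidal depression
outpatient_subject_var = 4.0  # narrow spread: mostly moderate depression

print(f"ER sample:         {reliability(er_subject_var, error_var):.2f}")  # 0.90
print(f"Outpatient sample: {reliability(outpatient_subject_var, error_var):.2f}")  # 0.50
```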

Types of Reliability

Inter-rater Reliability

Inter-rater Agreement

Inter-rater agreement (IRA) measures the variation in results of a test when performed by different assessors on the same patient at the same time point.

Intra-rater Reliability

Intra-rater reliability measures the variation in results when the same assessor performs the test across multiple time points.

Test-Retest Reliability

Test-retest reliability measures the consistency of an instrument’s results when the same patient is tested across multiple time points.

Internal Consistency

Reliability Coefficients

  • Historically, the Pearson correlation coefficient, paired t test, and Bland-Altman plot have been used to evaluate reliability1.
  • These tests only measure correlation and disregard variance, and thus are nonideal measures of reliability1.

Common reliability statistics include:

Scoring

Values range from 0.00 (not reliable) to 1.00 (perfectly reliable)

“Rule of thumb” quick interpretation of reliability

| Score | Reliability |
|---|---|
| 0.00 – 0.20 | Poor |
| 0.21 – 0.40 | Fair |
| 0.41 – 0.60 | Moderate |
| 0.61 – 0.80 | Good |
| 0.81 – 1.00 | Excellent |
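The rule-of-thumb table can be encoded as a small lookup helper (the function name is ours; the cut points and labels are taken from the table above):

```python
# Encodes the rule-of-thumb interpretation table for reliability scores.
def interpret_reliability(score: float) -> str:
    if not 0.0 <= score <= 1.0:
        raise ValueError("reliability coefficients range from 0.00 to 1.00")
    if score <= 0.20:
        return "Poor"
    if score <= 0.40:
        return "Fair"
    if score <= 0.60:
        return "Moderate"
    if score <= 0.80:
        return "Good"
    return "Excellent"

print(interpret_reliability(0.85))  # Excellent
print(interpret_reliability(0.55))  # Moderate
```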

For reliability measures, the confidence interval defines a range in which the true coefficient lies with a given probability5

For a result to show that reliability is better than chance at a confidence level of 95%, the lower limit of the CI must be above 0.65.

Interpretation

How High Should Reliability be?


  • Scales used in new, underdeveloped research areas should have a minimum reliability of 0.703.
  • Scales from mature areas of research should have a minimum reliability of 0.803.
  • If the scale is to be used for clinical purposes, the minimum is 0.903.

*Poor to moderate reliability is not suitable for clinical decision making, since you are likely to get a different result every time you test your patient, regardless of their status.

References

1. Koo TK, Li MY. A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research. Journal of Chiropractic Medicine. 2016;15(2):155-163. doi:10.1016/j.jcm.2016.02.012
2. Sim J, Arnell P. Measurement validity in physical therapy research. Physical Therapy. 1993;73(2):102-110; discussion 110-115. doi:10.1093/ptj/73.2.102
3. Streiner DL. Statistics Commentary Series: Commentary #15-Reliability. Journal of Clinical Psychopharmacology. 2016;36(4):305-307. doi:10.1097/JCP.0000000000000517
4. Gisev N, Bell JS, Chen TF. Interrater agreement and interrater reliability: Key concepts, approaches, and applications. Research in Social & Administrative Pharmacy. 2013;9(3):330-338. doi:10.1016/j.sapharm.2012.04.004
5. Zapf A, Castell S, Morawietz L, Karch A. Measuring inter-rater reliability for nominal data - which coefficients and confidence intervals are appropriate? BMC Medical Research Methodology. 2016;16:93. doi:10.1186/s12874-016-0200-9
