Lars is involved in a study on grip strength in patients with rheumatoid arthritis, and he is interested in finding out more about the reproducibility of the procedures he uses.

Therefore, he has designed a study in which grip strength (Nm2) is measured in 20 patients with rheumatoid arthritis at two time points, one week apart. He obtains the results shown in Table 1.

**Table 1.** Test-retest data on grip strength in 20 rheumatoid arthritis patients

Patient no. | Measurement 1 | Measurement 2 |
---|---|---|
1 | 86 | 92 |
2 | 40 | 47 |
3 | 50 | 55 |
4 | 52 | 57 |
5 | 69 | 74 |
6 | 62 | 64 |
7 | 84 | 84 |
8 | 68 | 72 |
9 | 58 | 62 |
10 | 71 | 74 |
11 | 92 | 97 |
12 | 76 | 74 |
13 | 77 | 81 |
14 | 77 | 83 |
15 | 64 | 67 |
16 | 35 | 34 |
17 | 88 | 94 |
18 | 76 | 74 |
19 | 103 | 110 |
20 | 117 | 125 |

Lars calculates Pearson’s correlation coefficient to be 0.991 (p < 0.0001).
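As a check on the reported value, the coefficient can be recomputed from the Table 1 data. The following is a minimal sketch in plain Python; the helper `pearson_r` and the variable names are illustrative and not part of the original analysis. It also recomputes the coefficient after adding a hypothetical constant offset of 20 to every second measurement, which may be useful when thinking about question 1.1.

```python
from math import sqrt

# Table 1: grip strength at the two time points (patients 1-20).
score1 = [86, 40, 50, 52, 69, 62, 84, 68, 58, 71,
          92, 76, 77, 77, 64, 35, 88, 76, 103, 117]
score2 = [92, 47, 55, 57, 74, 64, 84, 72, 62, 74,
          97, 74, 81, 83, 67, 34, 94, 74, 110, 125]


def pearson_r(x, y):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(score1, score2)

# Hypothetical variant (an assumption for illustration only):
# every second measurement is 20 units higher than observed.
shifted = [s + 20 for s in score2]
r_shifted = pearson_r(score1, shifted)

print(f"r = {r:.3f}, r with constant offset = {r_shifted:.3f}")
```

The two printed coefficients can be compared directly: a constant offset changes the mean of the second measurement but not how the two series co-vary.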

*Questions*

1.1 Discuss what the correlation coefficient tells us about the reliability of the two measurements.

1.2 Explain what the p-value means.

The numbers in Table 2 below are extracted from an SPSS analysis (General Linear Models, variance components, restricted maximum likelihood) of the data in Table 1.

**Table 2.** Variance components

Variance component | Estimate |
---|---|
Var(patients) | 424.579 |
Var(measurements) | 6.811 |
Var(error) | 4.414 |

*Questions*

2.1 Using the variance estimates, calculate ICC-consistency (model 2.1), ICC-agreement (model 2.1), SEM-consistency and SEM-agreement.

2.2 What do you think about the reliability and the measurement error?
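The calculation in question 2.1 can be sketched as follows. This is a minimal sketch that assumes the usual definitions for a two-way random-effects design: the consistency versions ignore the systematic variance between measurements, while the agreement versions include it in the error term. The Table 2 estimates are read as 424.579, 6.811 and 4.414, and the variable names are illustrative.

```python
from math import sqrt

# Variance components from Table 2 (REML estimates).
var_patients = 424.579      # between-patient variance
var_measurements = 6.811    # systematic variance between the two measurements
var_error = 4.414           # residual (random error) variance

# Consistency: systematic differences between measurements are ignored.
icc_consistency = var_patients / (var_patients + var_error)
sem_consistency = sqrt(var_error)

# Agreement: systematic differences count as measurement error.
icc_agreement = var_patients / (var_patients + var_measurements + var_error)
sem_agreement = sqrt(var_measurements + var_error)

print(f"ICC-consistency = {icc_consistency:.3f}")
print(f"ICC-agreement   = {icc_agreement:.3f}")
print(f"SEM-consistency = {sem_consistency:.2f}")
print(f"SEM-agreement   = {sem_agreement:.2f}")
```

Both SEM values are expressed in the same unit as the grip strength measurements, whereas the ICCs are unitless ratios.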

A paired t-test of measurement 1 (score1) and measurement 2 (score2) is shown in Table 3.

**Table 3.** Paired t-test

```
. ttest score1 = score2

Paired t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
  score1 |      20       72.25    4.508982    20.16478    62.81259    81.68741
  score2 |      20          76    4.750623    21.24543    66.05683    85.94317
---------+--------------------------------------------------------------------
    diff |      20       -3.75    .6644151    2.971354   -5.140637   -2.359363
------------------------------------------------------------------------------
     mean(diff) = mean(score1 - score2)                           t =  -5.6441
 Ho: mean(diff) = 0                              degrees of freedom =       19

 Ha: mean(diff) < 0           Ha: mean(diff) != 0           Ha: mean(diff) > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
```
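The `diff` row and the t statistic in Table 3 can be reproduced from the Table 1 data. The sketch below uses only the Python standard library; the variable names are illustrative, and the p-values are not recomputed because that would need a t-distribution routine outside the standard library.

```python
from math import sqrt
from statistics import mean, stdev

# Table 1: grip strength at the two time points (patients 1-20).
score1 = [86, 40, 50, 52, 69, 62, 84, 68, 58, 71,
          92, 76, 77, 77, 64, 35, 88, 76, 103, 117]
score2 = [92, 47, 55, 57, 74, 64, 84, 72, 62, 74,
          97, 74, 81, 83, 67, 34, 94, 74, 110, 125]

# Paired differences, defined as in the Stata output: score1 - score2.
diffs = [a - b for a, b in zip(score1, score2)]
n = len(diffs)

mean_diff = mean(diffs)        # systematic difference
sd_diff = stdev(diffs)         # sample SD (n - 1 in the denominator)
se_diff = sd_diff / sqrt(n)    # standard error of the mean difference
t_stat = mean_diff / se_diff   # paired t statistic, n - 1 degrees of freedom

print(f"mean(diff) = {mean_diff:.2f}, SD = {sd_diff:.4f}, "
      f"SE = {se_diff:.4f}, t = {t_stat:.4f}, df = {n - 1}")
```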

*Questions*

3.1 What is the systematic difference between measurement 1 (score1) and measurement 2 (score2)? Which measurement has the higher average score?

3.2 What are the limits of agreement (LOA)?

3.3 What will happen to ICC-consistency and ICC-agreement if measurement 2 (score2) always measures 20 Nm2 lower than measurement 1 (score1)?

3.4 What will happen to the ‘limits of agreement’?
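For question 3.2, the limits of agreement can be sketched from the `diff` row of Table 3. This assumes the common Bland-Altman definition, mean difference ± 1.96 × SD of the differences; the function name `limits_of_agreement` is illustrative.

```python
def limits_of_agreement(mean_diff, sd_diff, z=1.96):
    """Return (lower, upper) limits: mean_diff -/+ z * sd_diff."""
    half_width = z * sd_diff
    return mean_diff - half_width, mean_diff + half_width


# Mean and SD of the paired differences (score1 - score2) from Table 3.
lower, upper = limits_of_agreement(mean_diff=-3.75, sd_diff=2.971354)

print(f"LOA: {lower:.2f} to {upper:.2f}")
```

The function makes the two ingredients explicit: the mean difference sets the centre of the interval, and the SD of the differences sets its width.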

In a study by De Winter et al. (2004), two physiotherapists measured shoulder abduction (Figure 1) in 155 patients with pain in one shoulder. They used an electronic inclinometer that shows the abduction angle in degrees (Figure 2).

**Figure 1.** Shoulder abduction

**Figure 2.** Inclinometer measuring the abduction angle

Table 4 shows the results in degrees as well as other measures of reproducibility for both shoulders.

**Table 4.** Results

*Questions*

4.1 Explain the large difference between the two ICCs in light of the identical results for the 5 and 10% agreement.

4.2 Discuss which parameter is preferable.

A researcher is designing a reproducibility study. In a course on questionnaire techniques, he heard that reliability improves if the study population is more heterogeneous. He therefore designs the study to include very different patients.

*Questions*

5.1 Discuss this strategy.
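As a starting point for the discussion, the sketch below holds the error variance fixed at the Table 2 estimate (read as 4.414) and varies the between-patient variance. The patient variances are hypothetical values chosen for illustration, apart from the Table 2 estimate of 424.579, and the consistency definitions ICC = Var(patients) / (Var(patients) + Var(error)) and SEM = √Var(error) are assumed.

```python
from math import sqrt

var_error = 4.414  # residual variance from Table 2, held fixed

# Hypothetical between-patient variances, from a homogeneous to a very
# heterogeneous sample; 424.579 is the Table 2 estimate.
hypothetical_var_patients = [25.0, 100.0, 424.579, 1600.0]


def icc_consistency(var_patients, var_err):
    """ICC-consistency: patient variance over patient plus error variance."""
    return var_patients / (var_patients + var_err)


# One (patient variance, ICC, SEM) tuple per hypothetical population.
results = [
    (vp, icc_consistency(vp, var_error), sqrt(var_error))
    for vp in hypothetical_var_patients
]

for vp, icc, sem in results:
    print(f"Var(patients) = {vp:8.3f}  ICC = {icc:.3f}  SEM = {sem:.2f}")
```

Comparing the ICC and SEM columns across rows shows which of the two parameters depends on the composition of the study population.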