How is inter-rater reliability reported?

When this conservative estimate of the RCI was used, neither equal nor diverging ratings occurred significantly more often, either within the individual rating subgroups or across the study population. Thus, the probability that a child received a concordant rating did not differ from chance.

Table 2. Proportions of diverging ratings for monolingual, bilingual, and all children in the sample. In the parent—teacher rating subgroup, 21 out of 34 children received diverging ratings; in the mother—father rating subgroup, 9 out of 19 children received diverging ratings.

Binomial tests (see Table 2 for details) indicated that these absolute differences were not statistically reliable, within the limitations posed by the small sample size.

The results reported in this section consider those rating pairs that were classified as reliably different using the more conservative RCI calculation based on the test-retest reliability, which yielded a considerable number of diverging ratings.

We explored the potential influence of three different factors on the likelihood of receiving diverging ratings: rating subgroup (mother—father vs. parent—teacher), gender of the child, and bilingualism. Next, we assessed whether the likelihood of receiving diverging ratings was above chance.

We conducted these binomial tests separately for bilingual and monolingual children, as bilingual children had been shown to receive more diverging ratings than monolingual children. As only 2 out of 19 bilingual children were rated by two parents (see Table 1), we also considered the rating subgroups separately.

As summarized in Table 2, the likelihood of receiving diverging ratings exceeded chance for bilingual children only. However, conclusions about whether this also holds for bilingual children rated by two parents cannot be drawn on the basis of our data, as only two children fell into this category. Wilcoxon paired-sample tests were used to uncover possible systematic directional tendencies for different groups of raters. None of the within-subgroup comparisons (maternal vs. paternal ratings; parental vs. teacher ratings) reached significance. Thus, we did not find evidence for a systematic direction of rating divergence, either for bilingual or for monolingual children.
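For illustration, both kinds of tests can be run with SciPy. The sketch below is not the study's analysis script: the binomial counts are taken from Table 2 (21 of 34 parent—teacher pairs and 9 of 19 mother—father pairs diverging), while the paired T-scores for the Wilcoxon test are purely hypothetical.

```python
# Minimal sketch of the binomial and Wilcoxon tests described above (SciPy >= 1.7).
from scipy import stats

# Binomial test: does the proportion of diverging ratings differ from chance (p = 0.5)?
# Counts from Table 2: parent-teacher 21/34, mother-father 9/19 diverging pairs.
print(stats.binomtest(k=21, n=34, p=0.5).pvalue)
print(stats.binomtest(k=9, n=19, p=0.5).pvalue)

# Wilcoxon signed-rank test: do maternal ratings run systematically higher or
# lower than paternal ratings? (hypothetical paired T-scores)
maternal = [52, 48, 61, 55, 47, 50, 58, 44, 63, 49]
paternal = [50, 49, 59, 57, 45, 52, 60, 43, 61, 50]
stat, p = stats.wilcoxon(maternal, paternal)
print(stat, p)
```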

We therefore conclude that a similar proportion of diverging ratings occurred within the two rating subgroups. Neither the gender of the child, nor whether the expressive vocabulary was evaluated by two parents or by a teacher and a parent, increased the probability that a child would receive two diverging ratings.

The only factor that reliably increased this probability was bilingualism of the child. No systematic direction of differences was found.

In a first step, we compared the mean ratings of each rater group: mothers, fathers, parents, and teachers. T-tests did not reveal any significant differences (see Table 3).
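A paired comparison of this kind can be computed as in the minimal sketch below; the T-scores are hypothetical and only illustrate the call, not the study data.

```python
# Minimal sketch of a mean comparison between two rater groups (paired t-test).
from scipy import stats

maternal = [52, 48, 61, 55, 47, 50, 58, 44]   # hypothetical T-scores
paternal = [50, 49, 59, 57, 45, 52, 60, 43]   # hypothetical T-scores

t, p = stats.ttest_rel(maternal, paternal)    # paired: the same children rated twice
print(f"t = {t:.2f}, p = {p:.3f}")
```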

Table 3. Means and standard deviations of vocabulary ratings and comparisons of means.

Only when using the test-retest reliability provided in the manual of the ELAN was there a substantial number of diverging rating pairs (30 out of 53). The magnitude of these differences was assessed descriptively using a scatter plot (see Figure 3) and a Bland-Altman plot (also known as a Tukey mean-difference plot; see Figure 4).

First, we displayed the ratings of the individual children in a scatter plot and illustrated the two different areas of agreement.

Figure 3. Scatter plot of children's ratings. Every dot represents the two ratings provided for a child. For the parent—teacher rating subgroup, parental ratings are on the x-axis and teacher ratings on the y-axis; for the parental rating subgroup, paternal ratings are on the x-axis and maternal ratings on the y-axis.

Ratings for bilingual children are represented by gray dots, ratings for monolingual children by black dots. Dashed lines enclose statistically identical ratings as calculated on the basis of the manual-provided test-retest reliability (less than 3 T-points difference; 23 out of 53 rating pairs).

Solid lines enclose statistically identical ratings as calculated on the basis of the inter-rater reliability (ICC) in our study (less than 12 T-points difference).

Figure 4. Bland-Altman plot of T-values, corresponding to a Tukey mean-difference plot. Dots represent the 30 rating pairs diverging significantly in the study population. Diverging mother—father ratings are represented by empty dots, diverging parent—teacher ratings by filled dots.
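A plot of this kind can be produced with a few lines of matplotlib. The sketch below shows the classic Bland-Altman version, which plots signed differences together with the mean difference and 95% limits of agreement; the rating pairs are hypothetical T-scores, not the study data.

```python
# Minimal sketch of a Bland-Altman (Tukey mean-difference) plot.
import numpy as np
import matplotlib.pyplot as plt

rater_a = np.array([52, 48, 61, 55, 47, 50, 58, 44, 63, 49])  # hypothetical T-scores
rater_b = np.array([50, 49, 59, 57, 45, 52, 60, 43, 61, 50])  # hypothetical T-scores

mean = (rater_a + rater_b) / 2        # x-axis: mean of the two ratings
diff = rater_a - rater_b              # y-axis: difference between the two ratings
bias = diff.mean()
loa = 1.96 * diff.std(ddof=1)         # 95% limits of agreement

plt.scatter(mean, diff, color="black")
plt.axhline(bias, linestyle="-")
plt.axhline(bias + loa, linestyle="--")
plt.axhline(bias - loa, linestyle="--")
plt.xlabel("Mean of the two ratings (T-score)")
plt.ylabel("Difference between ratings (T-score)")
plt.show()
```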

Another way of illustrating the magnitude of differences is to display the distribution of significant differences, with mean T-values plotted against the absolute difference values, as proposed by Bland and Altman.

So far, we have reported results regarding inter-rater reliability and the number of diverging ratings within and between subgroups, using two different but equally legitimate reliability estimates.

We also explored which factors might influence the likelihood of receiving two statistically diverging ratings and described the magnitude of observed differences. These analyses focused on inter-rater reliability and agreement, as well as related measures.

In this last section, we turn to Pearson correlation coefficients in order to explore the linear relation between ratings and its strength within and between rater subgroups.
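As a minimal sketch, a Pearson correlation between two sets of paired ratings can be obtained with SciPy; the T-scores below are hypothetical and only show the call.

```python
# Minimal sketch: linear association between the two ratings obtained per child.
from scipy import stats

parental = [52, 48, 61, 55, 47, 50, 58, 44]   # hypothetical T-scores
teacher  = [50, 49, 59, 57, 45, 52, 60, 43]   # hypothetical T-scores

r, p = stats.pearsonr(parental, teacher)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```

Note that r quantifies only the strength of the linear relation; it says nothing by itself about absolute agreement between the two ratings.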

In conclusion, with regard to the correlation of ratings, strong associations were observed for ratings provided by mothers and fathers, as well as for those provided by teachers and parents, and thus across our study sample.

Figure 5. Correlations of ratings. Monolingual children are represented by black dots, bilingual children by gray dots.

In this report, a concrete data set is employed to demonstrate how a comprehensive evaluation of inter-rater reliability, inter-rater agreement (concordance), and linear correlation of ratings can be conducted and reported.

On the grounds of this example, we aim to disambiguate aspects of assessment that are frequently confused and thereby contribute to increasing the comparability of future rating analyses. By providing a tutorial, we hope to foster knowledge transfer. We analyzed two independent vocabulary ratings obtained for 53 German-speaking children at the age of 2 years with the German vocabulary scale ELAN (Bockmann and Kiese-Himmel). Using the example of assessing whether ELAN ratings can be obtained reliably from daycare teachers as well as from parents, we show that rater agreement, linear correlation, and inter-rater reliability all have to be considered.

Otherwise, an exhaustive conclusion about a rating scale's employability with different rater groups cannot be made. We also considered the gender and bilingualism of the evaluated child as factors potentially influencing the likelihood of rating agreement. First, we assessed the inter-rater reliability within and across rating subgroups. Inter-rater reliability, as expressed by intra-class correlation coefficients (ICC), measures the degree to which the instrument used is able to differentiate between participants, as indicated by two or more raters reaching similar conclusions (Liao et al.).

Hence, inter-rater reliability is a quality criterion of the assessment instrument and of the accuracy of the rating process, rather than one quantifying the agreement between raters. It can be regarded as an estimate of the instrument's reliability in a concrete study population.
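An ICC of this kind can be computed, for example, with the pingouin package. The sketch below assumes ratings in long format (one row per child and rater); the column names and values are hypothetical, and which ICC form is appropriate (absolute agreement vs. consistency, single vs. average raters) depends on the study design.

```python
# Minimal sketch of an intra-class correlation (ICC) with pingouin.
import pandas as pd
import pingouin as pg

data = pd.DataFrame({
    "child":  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "rater":  ["A", "B"] * 5,
    "tscore": [52, 50, 48, 49, 61, 59, 55, 57, 47, 45],
})

icc = pg.intraclass_corr(data=data, targets="child", raters="rater", ratings="tscore")
print(icc[["Type", "ICC", "CI95%"]])   # table of ICC variants with confidence intervals
```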

This is the first study to evaluate inter-rater reliability of the ELAN questionnaire. We report high inter-rater reliability for mother—father as well as for parent—teacher ratings and across the complete study population. No systematic differences between the subgroups of raters were found. This indicates that using the ELAN with daycare teachers does not lower its capability to differentiate between children with high and low vocabulary.

Many studies supposedly evaluating agreement of expressive vocabulary ratings rely only on measures of the strength of relations, such as linear correlations. In some studies, the raw scores are used as reference values and critical differences are disregarded.

However, absolute differences between raw scores or percentiles do not contain information about their statistical relevance. We demonstrate the use of the Reliable Change Index (RCI) to establish statistically meaningful divergences between rating pairs.

We obtained two different RCIs on the basis of two reliability measures: the test-retest reliability provided in the ELAN manual (Bockmann and Kiese-Himmel) and the inter-rater reliability, expressed as an ICC, derived from our sample. This dual approach was chosen to demonstrate the impact of more or less conservative, but equally applicable, reliability estimates on measures of rating agreement.

We determined that, when considering the reliability provided in the ELAN manual, ratings differ reliably if the absolute difference between them amounts to three or more T-points.
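The underlying arithmetic follows the Jacobson-Truax Reliable Change Index: a standard error of measurement is derived from the scale SD and a reliability estimate, and the critical difference is 1.96 times the standard error of the difference. The sketch below uses the T-scale SD of 10 and two illustrative reliability values chosen only to show how thresholds near 3 and near 12 T-points can arise; they are not the exact coefficients from the manual or from this study.

```python
# Minimal sketch of an RCI-based critical difference (Jacobson & Truax logic).
import math

def critical_difference(sd, reliability, z=1.96):
    """Smallest absolute score difference that counts as reliable at ~95%."""
    sem = sd * math.sqrt(1 - reliability)   # standard error of measurement
    s_diff = math.sqrt(2) * sem             # SE of the difference between two scores
    return z * s_diff

print(critical_difference(sd=10, reliability=0.99))  # ~2.8 -> about 3 T-points
print(critical_difference(sd=10, reliability=0.81))  # ~12.1 -> about 12 T-points
```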

With regard to the reliability estimated in our study, however, the difference necessary to establish reliable divergence between two ratings is considerably larger, i.e., 12 T-points. For both critical values we determined the absolute agreement, i.e., the proportion of concordant rating pairs. With the critical value based on our inter-rater reliability, almost all rating pairs were concordant; in contrast, absolute agreement based on the manual's test-retest reliability was substantially lower (23 out of 53 rating pairs). With this more conservative measure of absolute agreement, the probability of receiving a concordant rating did not differ from chance. This probability did not differ statistically between the two rating subgroups (parent—teacher and mother—father ratings), and thus across the study population, regardless of the chosen RCI calculation.
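Given a critical difference, absolute agreement can be read off as the proportion of rating pairs whose difference stays below that value. A minimal sketch with hypothetical T-scores and the two thresholds discussed above:

```python
# Minimal sketch: proportion of concordant rating pairs under two critical values.
import numpy as np

rater_a = np.array([52, 48, 61, 55, 47, 50, 58, 44, 63, 49])  # hypothetical T-scores
rater_b = np.array([50, 49, 59, 57, 45, 52, 60, 43, 61, 50])  # hypothetical T-scores

for critical in (3, 12):   # manual-based vs. ICC-based thresholds discussed above
    concordant = np.abs(rater_a - rater_b) < critical
    print(f"critical difference {critical}: {concordant.mean():.0%} concordant pairs")
```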

These results support the assumption that parents and daycare teachers were, in this case, similarly competent raters with regard to the children's early expressive vocabulary. Nonetheless, the RCIs obtained with different reliability estimates differ substantially with regard to the specific estimates of absolute agreement. The profoundly diverging amounts of absolute agreement obtained by using either the inter-rater reliability within a relatively small sample or the instrument's test-retest reliability, obtained with a large and more representative sample, highlight the need for caution when calculating reliable differences.

In the domain of expressive vocabulary, however, we scarcely find empirical studies reporting the proportion of absolute agreement between raters. Those studies that do report it consider agreement at the level of individual items (here: words) rather than at the level of the overall rating a child receives (de Houwer et al.).

In other domains, such as attention deficits or behavior problems, percentages of absolute agreement as the proportion of concordant rating pairs are reported more often and provide more comparable results. However, one should take into account that these studies usually evaluate the inter-rater agreement of instruments with far fewer items than the present study, in which raters had to decide on individual words. When comparing the results of our study with those of studies in other domains, it has to be considered that increasing the number of items composing a rating reduces the likelihood of two identical scores.

The difficulty of finding reliable and comparable data on rater agreement in the otherwise well-examined domain of early expressive vocabulary assessment highlights both the widespread inconsistency of reporting practices and the need to measure absolute agreement in a comparable way. In order to evaluate inter-rater agreement in more detail, the proportion of absolute agreement needs to be considered in light of the magnitude and direction of the observed differences.

These two aspects provide relevant information on how close diverging ratings tend to be and whether systematically higher or lower ratings emerge for one subgroup of raters or rated persons in comparison to another. The magnitude of difference is an important aspect of agreement evaluations, since the proportions of statistically equal ratings only reflect perfect concordance.

Such perfect concordance may, however, not always be relevant. In order to assess the magnitude of difference between raters, we employed a descriptive approach considering the distribution and the magnitude of score differences.

As reliably different ratings were only observed when calculations were based on the test-retest reliability of the ELAN, we used these results to assess the magnitude and direction of differences. The occurring differences were in an acceptable range for a screening tool, since they did not exceed one standard deviation of the norm scale used. This finding puts into perspective the relatively low proportion of absolute agreement measured on the grounds of the tool's test-retest reliability.

The analysis of the direction of differences is intended to uncover systematic rating tendencies by a group of raters or for a group of rated persons.

Some validity studies show a tendency of raters, specifically of mothers, to estimate children's language developmental status higher than the results obtained via objective testing of the child's language abilities (Deimann et al.).

Whether these effects reflect an overrating of the abilities of the children by their mothers, or the fact that objective results acquired specifically for young children might underestimate the actual ability of a child, remains uncertain. In the present study we did not assess validity and thus did not compare the acquired ratings to objective data.

This also means that our assessments cannot reveal leniency or harshness of ratings. Instead, comparisons were conducted between raters, i.e., between mothers and fathers and between parents and teachers. We did not find any systematic direction of differences under these circumstances: neither party of either rating pair rated children's vocabulary systematically higher or lower than the other.

As explained above, only with the more conservative approach to calculating the RCI did we find a substantial number of diverging ratings. We then looked at factors possibly influencing the likelihood of receiving diverging ratings. Neither the gender of the child, nor whether it was evaluated by two parents or by a parent and a teacher, influenced this likelihood systematically. Bilingualism of the evaluated child was the only examined factor that increased the likelihood of a child receiving diverging scores.

It is possible that diverging ratings for the small group of bilingual children reflected systematic differences in the vocabulary used in the two different settings: monolingual German daycare and bilingual family environments. Larger groups and more systematic variability in the characteristics of the bilingual environments are necessary to determine whether bilingualism has a systematic effect on rater agreement, as suggested by this report, and, if so, where this effect stems from.

In order to further explore the linear relation between ratings, we calculated Pearson correlation coefficients. As mentioned above, many researchers employ correlation coefficients as an indicator of agreement. However, Pearson correlation coefficients are useful for quantifying the strength of the linear association between variables.

They can also be compared to assess differences between rater groups concerning these relations. In the context of vocabulary assessment, they allow us to relate the present results to previous findings. Possible explanations for the high correlations observed here can be found in our population characteristics, specifically in the homogeneity of the children's family and educational backgrounds, as well as the high professional qualification of the teachers in the participating state-regulated daycare facilities. The high correlations could also be seen as an indication that the employed questionnaire was easy to understand and unambiguous for most of the raters.

What is more, we did not find differences in correlation coefficients when comparing rater subgroups. These results provide evidence that two parental ratings were not more strongly associated with each other than a parental rating with a teacher rating, and that, in general, the two ratings of a child's expressive vocabulary obtained with the ELAN questionnaire (Bockmann and Kiese-Himmel) were strongly associated with each other.
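One standard way to compare correlations from two independent subgroups (e.g., mother—father vs. parent—teacher) is Fisher's r-to-z transformation. Whether exactly this procedure was used in the study is not stated in this excerpt, so the sketch below, with hypothetical r values and subgroup sizes, is only illustrative.

```python
# Minimal sketch: comparing two independent Pearson correlations via Fisher's r-to-z.
import math
from scipy import stats

def compare_correlations(r1, n1, r2, n2):
    z1, z2 = math.atanh(r1), math.atanh(r2)        # Fisher r-to-z transformation
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2 * (1 - stats.norm.cdf(abs(z)))           # two-sided p-value
    return z, p

print(compare_correlations(r1=0.90, n1=19, r2=0.87, n2=34))  # hypothetical values
```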

Taking the results on agreement and those on linear correlations together, we conclude that both measures are important to report. We demonstrate that high correlations of ratings do not necessarily indicate high agreement of ratings when a conservative reliability estimate is used.

The present study is an example of low to moderate agreement of ratings combined with relatively small magnitude of differences, unsystematic direction of differences and very high linear correlations between ratings within and between rating subgroups.

In our study it would thus have been very misleading to consider only correlations as a measure of agreement, which they are not.

In summary, this study provides a comprehensive evaluation of agreement within and between two rater groups with regard to a German expressive vocabulary checklist for parents (ELAN; Bockmann and Kiese-Himmel). The inter-rater reliability of the ELAN questionnaire, assessed here for the first time, proved to be high across rater groups.

Within the limits of the population size and its homogeneity, our results indicate that the ELAN questionnaire, originally standardized for parents, can also be used reliably with qualified daycare teachers who have a sufficient amount of experience with a child.

We did not find any indication of systematically lower agreement for parent—teacher ratings compared with mother—father ratings. Also, teachers compared with parents, as well as mothers compared with fathers, did not provide systematically higher or lower ratings. The magnitude of absolute agreement depended profoundly on the reliability estimate used to calculate a statistically meaningful difference between ratings. The magnitude of rating differences was small, and the strength of association between vocabulary ratings was high.

These findings highlight that rater agreement has to be assessed in addition to correlative measures, taking into account not only the significance but also the magnitude of differences. The analytical approach employed and discussed here serves as one example of an evaluation of ratings and rating instruments applicable to a variety of developmental and behavioral characteristics.

It allows the assessment and documentation of differences and similarities between rater and rated subgroups using a combination of different statistical analyses. If future reports succeed in disambiguating the terms agreement, reliability, and linear correlation, and if the statistical approaches necessary to tackle each aspect are used appropriately, higher comparability of research results and thus improved transparency will be achieved.

Funding for this study was provided by the Zukunftskolleg of the University of Konstanz.

The ROB adjudications are categorized as follows: low risk, moderate risk, serious risk, critical risk, or no information. As ROB-NRSE is the most current, publicly available version modeled after the ROBINS-I tool, we conducted this cross-sectional study to establish ample evidence on its reliability and validity, in order to improve the consistency in its application and in how it is interpreted across various systematic reviews that include NRSE.

Inter-rater reliability (IRR) refers to the reproducibility or consistency of decisions between two reviewers and is a necessary component of validity [16, 17]. Inter-consensus reliability (ICR) refers to the comparison of consensus assessments across pairs of reviewers in the participating centers.
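For ordinal ROB judgments such as these, IRR is often quantified with a chance-corrected agreement coefficient. The protocol specifies its own statistical plan, so the weighted Cohen's kappa below is shown only as one common option, with hypothetical adjudications.

```python
# Minimal sketch: weighted Cohen's kappa for two reviewers' ordinal ROB judgments.
from sklearn.metrics import cohen_kappa_score

categories = ["low", "moderate", "serious", "critical"]
reviewer_1 = ["low", "moderate", "serious", "moderate", "low", "critical"]
reviewer_2 = ["low", "serious", "serious", "moderate", "moderate", "critical"]

kappa = cohen_kappa_score(reviewer_1, reviewer_2, labels=categories, weights="linear")
print(f"Linearly weighted kappa = {kappa:.2f}")
```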

Concurrent validity refers to the extent to which the results of the instrument or tool can be trusted [17]. Furthermore, it is important to understand the barriers to using this tool, such as evaluator burden. Using methods similar to those described previously for the evaluation of the ROBINS-I tool [18], an international team of experienced researchers from four participating centers will collaboratively undertake this study. In order to address the study objectives, we will conduct a cross-sectional analytical study on a sample of NRSE publications following this protocol.

We plan to report any protocol amendments in the final study manuscript. Our first objective is to evaluate the IRR of ROB-NRSE at the first stage, without customized training or a guidance document from the principal investigator, and then at the second stage, with customized training and guidance. At both stages, assessors will have access to the publicly available detailed guidance [22].

For the second stage, a customized guidance document will be developed in Microsoft Word. Following review and feedback by another experienced senior member of the team (MA), we will finalize the document. The guidance document will contain simplified decision rules, additional guidance for advanced concepts, and clarifications on answering signaling questions that will guide reviewers in making adjudications for each domain of the ROB-NRSE tool.

Once developed, we will send the guidance document to all reviewers to help with adjudications in the second stage of the project. Additionally, one training session via Skype will be organized by a trainer (MJ), who is a senior member of the team and the developer of the customized guidance document.

During the training session, the trainer will review the guidance document with all the reviewers and provide clarifications. We obtained the observed-agreement probability (Pa) between reviewers, required for the sample size calculation, from initial pilot testing of 10 NRSE publications. If a study does not report a primary outcome, the principal investigator will identify an important outcome reported in the study for ROB appraisal. With the help of content experts, we will identify a list of confounders and important co-exposures for the specific association of interest reported in each of the included NRSE publications.
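The observed-agreement probability mentioned above is simply the proportion of paired assessments on which two reviewers agree. A minimal sketch with hypothetical pilot adjudications:

```python
# Minimal sketch: observed-agreement probability (Pa) from paired pilot assessments.
reviewer_1 = ["low", "moderate", "serious", "low", "low",
              "moderate", "critical", "low", "serious", "moderate"]
reviewer_2 = ["low", "moderate", "moderate", "low", "low",
              "serious", "critical", "low", "serious", "moderate"]

p_a = sum(a == b for a, b in zip(reviewer_1, reviewer_2)) / len(reviewer_1)
print(f"Observed agreement Pa = {p_a:.2f}")   # 8 of 10 pairs agree -> 0.80 here
```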

We will also advise all reviewers in the participating centers to read the full report of each included NRSE prior to making assessments. Reviewers will have the list of confounders and important co-exposures available during their assessments. At the end of the assessments, the two reviewers will meet to resolve conflicts and arrive at a consensus. All studies are assessed first without guidance, before any with-guidance assessments, to prevent the possibility of the with-guidance assessment influencing the without-guidance assessment.

The principal investigator (MJ) at the coordinating center will coordinate this process among reviewers in the different participating centers. Upon completion, the collaborating center will collect, organize, and transfer the ROB assessment data from the various reviewers to an Excel workbook prior to proceeding with the data analysis. An experienced biostatistician (RR) from the collaborating center will conduct all the analyses in collaboration with the other members of the research team.

Concurrent validity refers to how well a newly developed tool correlates with similar domains of a widely used tool at the same point in time [30]. In other words, concurrent validity evaluates the extent to which there is concordance in judgment for similar domains in both tools being compared [30]. We have compared and matched the NOS and the ROB instrument for NRS of exposures (ROB-NRSE), as shown in Tables 3 and 4, to identify the items that completely overlap, partially overlap, or are unique to each tool.

We will then compare these NOS adjudications with the after-consensus adjudications of ROB-NRSE (made after customized training and guidance by two pairs of reviewers), for the same set of studies that were used for the ICR assessments.

We will calculate the correlation between the two tools for each of the domains and for the overall assessments. In addition, for any discordance observed between domains or overall assessment, we will explore the possible reasons and attempt to provide explanations.
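When the adjudications of both tools are coded on an ordinal scale, a rank correlation is one natural choice for such domain-level or overall comparisons. The protocol does not prescribe a specific coefficient in this excerpt, so the Kendall's tau sketch below, with hypothetical codings, is only illustrative.

```python
# Minimal sketch: rank correlation between the overall adjudications of two tools.
from scipy import stats

nos_overall     = [1, 2, 2, 3, 1, 3, 2, 1]   # hypothetical ordinal coding (1 = low concern)
robnrse_overall = [1, 2, 3, 3, 1, 2, 2, 1]   # hypothetical ordinal coding (1 = low risk)

tau, p = stats.kendalltau(nos_overall, robnrse_overall)
print(f"Kendall's tau = {tau:.2f}, p = {p:.3f}")
```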

We will transfer all collected data from the Excel workbook to SAS 9. It is also important to assess factors that could reduce the application time. Reviewers will record, using a digital clock, the time taken in minutes to apply the ROB-NRSE tool (time to read the article plus time to adjudicate, both without and with guidance), the time taken for consensus, and the time taken to apply the NOS tool (time to read the article plus time to adjudicate) for each included NRSE.

The reviewers will use the Excel workbook created by the principal investigator to record the start time, end time, and total time to apply ROB-NRSE at the completion of the assessment for each NRSE and after the consensus process with the second reviewer. In addition, we will calculate the time taken to resolve conflicts and arrive at a consensus, and the overall time (time to apply plus time taken to arrive at a consensus) for each pair of reviewers. The time to arrive at a consensus will start when the two reviewers convene to resolve conflicts and will end when they arrive at a consensus.

An experienced biostatistician (RR) from the coordinating center will conduct all the analyses in collaboration with the other members of the research team. We will use generalized linear models to evaluate changes in the time taken to assess ROB-NRSE with customized guidance compared with without guidance.

We will control for the correlation between reviewers using random effects, and the outcome distribution and link function will be chosen to match the distribution of the time data.

Systematic reviews including NRSE can provide valuable evidence on rare outcomes, adverse events, long-term outcomes, real-world practice, and situations where RCTs are not available [9, 33].

It is very important to appraise the ROB in the included NRSE to have a complete understanding of the strengths and weaknesses of the overall evidence, as methodological flaws in the design or conduct of the NRSE could lead to biased effect estimates [ 9 ]. As such, it is important to evaluate the usability, reliability, and concurrent validity of this tool to help identify potential barriers and facilitators in applying this tool in a real-world setting.

In this cross-sectional study protocol, we describe the methods we will use to assess the inter-rater reliability, inter-consensus reliability, and concurrent validity of ROB-NRSE. Across the world, researchers with a range of expertise conduct systematic reviews that include NRSE. The ROB-NRSE tool was designed to be used by systematic reviewers with varied academic backgrounds and experience across multiple knowledge synthesis centers.

A major strength of our study is that we will involve reviewers from multiple research teams with a range of expertise and academic backgrounds (highest degree attained) to apply and test ROB-NRSE, in order to simulate real-world settings. We will also use a sample of NRSE that were not previously evaluated by the reviewers, in order to mimic what is typically encountered in a real-world setting.

In addition, similar to what will be encountered in the real-world setting, we anticipate that the time taken to assess ROB might be longer for NRSE appraised at the beginning compared with those appraised later, due to increasing familiarity and a learning curve. We anticipate the following limitations. We will appraise ROB for only one outcome per included NRSE; this may be a limitation, as reviewers in real-world settings may need to appraise multiple outcomes for each included NRSE, and the evaluator burden might therefore differ slightly from the findings of this study.

In a real-world setting, the training and customized guidance (decision rules) developed by researchers for their own systematic reviews may differ from those developed by the principal investigator of this study, and this may pose a challenge to the generalization of the findings of this study. For feasibility, we have proposed to use the same reviewers for both stages (without and with guidance), and we anticipate that this may bias the effect of training and guidance.

However, we will address this limitation by assessing the correlations between adjudications made during the two stages, for each of the reviewers. A poor correlation between adjudications made during the two stages for a reviewer would indicate that the training and guidance have been useful. We hope that the findings of this study will contribute to an improved understanding and better application of the ROB instrument for NRS of exposures tool. Systematic reviews serve as a source of knowledge and evidence to aid in the decision-making process.

Our cross-sectional study addresses issues that may affect the quality of the evidence synthesized by systematic reviews and thus will be of great interest to all stakeholders, such as clinicians, decision-makers, patients, and the general public, for example through GRADE assessments of the quality of the evidence.
