Measuring mental health and wellbeing outcomes for children and adolescents to inform practice and policy: a review of child self-report measures

There is a growing appetite for mental health and wellbeing outcome measures that can inform clinical practice at individual and service levels, including use for local and national benchmarking. Although a varied literature examines the psychometric properties of child mental health and wellbeing outcome measures, no reviews exist that appraise both the availability of psychometric evidence and the suitability of these measures for use in routine practice in child and adolescent mental health services (CAMHS), including key implementation issues. This paper presents the findings of the first review to evaluate existing broadband measures of mental health and wellbeing outcomes against these criteria. The following steps were implemented in order to select measures suitable for use in routine practice: literature database searches, consultation with stakeholders, application of inclusion and exclusion criteria, secondary searches and filtering. Subsequently, detailed reviews of the retained measures' psychometric properties and implementation features were carried out. Eleven measures were identified as having potential for use in routine practice and meeting most of the key criteria: 1) Achenbach System of Empirically Based Assessment, 2) Beck Youth Inventories, 3) Behavior Assessment System for Children, 4) Behavioral and Emotional Rating Scale, 5) Child Health Questionnaire, 6) Child Symptom Inventories, 7) Health of the Nation Outcome Scales for Children and Adolescents, 8) Kidscreen, 9) Pediatric Symptom Checklist, 10) Strengths and Difficulties Questionnaire, 11) Youth Outcome Questionnaire. However, all existing measures identified had limitations as well as strengths. Furthermore, none had sufficient psychometric evidence available to demonstrate that it could reliably measure both severity and change over time in key groups.
The review suggests a way of rigorously evaluating the growing number of broadband self-report mental health outcome measures against standards of feasibility and psychometric credibility in relation to use for practice and policy.


Introduction
There is a growing number of children's mental health and wellbeing measures that have the potential to be used in child and adolescent mental health services (CAMHS) to inform individual clinical practice e.g. [1], to provide information to feed into service development e.g. [2], and for local or national benchmarking e.g. [3]. Some such measures have a burgeoning corpus of psychometric evidence (e.g., the Achenbach System of Empirically Based Assessment, ASEBA [4]; the Strengths and Difficulties Questionnaire, SDQ [5,6]), and a number of reviews have usefully summarized the validity and reliability of such measures [7,8]. However, it is also vital to determine which measures can be feasibly and appropriately deployed in a given setting or circumstance [8]. While some attempt has been made to identify measures that might be used in routine clinical practice [9], no reviews have evaluated in depth both the psychometric rigor and the utility of these measures.
National and international policy has focused on the importance of the voice of the child, of shared decision making for children accessing health services, and of self-defined recovery [10][11][12][13]. This policy context gives a clear rationale for the use of self-report measures for child mental health outcomes. Further rationale is provided by the costs of administration and burden for other reporters. For example, typical costs for a 30-minute instrument to be completed by a child mental health professional could be as much as £30 (clinical psychologist, £30.00; mental health nurse, £20.00; social worker, £27.00; generic CAMHS worker, £21.00; [14]). However, research has indicated that, due to their difficulties with reading and language and their tendencies to respond based on their state of mind at the moment (rather than on more general levels of adjustment), children may be less reliable in their assessments of their own mental health, and there is evidence of under-reporting of behavioral difficulties [15,16]. Yet there is increasing evidence that even children with significant mental health problems understand and have insight into their difficulties and can provide information that is unique and informative. Provided that efforts are made to ensure measures are age appropriate (in terms of presentation and reading age), young children can be accurate reporters of their own mental health [17][18][19]. Even in the case of conduct problems, which are commonly identified as problematic for child self-report, evidence suggests that the use of age-appropriate measures can yield valid and reliable self-report data [20]. In particular, a number of interactive, online self-report measures have been developed (e.g., the Dominic Interactive; and see [17,21]), which appear to elicit valid and reliable responses from children as young as eight years old.
Assessing mental health outcome measures for use in CAMHS also requires consideration of how outcomes should be compared across services. While more specific measures may provide a more detailed account of specific symptomatology, and may be more sensitive to change, they raise challenges in making comparisons across cases or across services where differences in case mix from one setting to the next are likely. Broad mental health indicators, in contrast, are designed to capture a constellation of the most commonly presented symptoms or difficulties and, therefore, are of relevance to most of the CAMHS population. They also reduce the need to isolate particular presenting problems at the outset of treatment in order to capture baseline problems against which to assess subsequent change, a difficult task in the context of changing problems or situations across therapy sessions [22,23]. Associated with breadth of the measure is the issue of brevity; even if costs associated with clinician-reported measures are avoided, long child self-report measures are likely either to erode clinical time where completed in clinical sessions or to present barriers to completion for children and young people when administered outside sessions [22].
The current study is motivated by the argument that challenges to valid and reliable measurement of child mental health outcomes for those accessing services do not simply relate to the selection of a psychometrically sound tool; issues of burden, financial cost and suitability for comparison across services are substantial barriers to successful implementation. Failure to grapple with such feasibility issues is likely to lead to distortions (based on attrition, representativeness and perverse incentives) in the yielded data. This review places particular importance on: 1) measures that cover broad symptom and age ranges, allowing comparisons between services, regions and years; 2) child self-report measures that offer a more service-user-oriented and feasible perspective on mental health outcomes; 3) measures with a range of available evidence relating to psychometric properties; and 4) the resource implications of measures (in terms of both time and financial cost).

Method
The review process to identify and filter appropriate measures consisted of four stages, summarized in Figure 1.
The review was carried out by a team of four researchers, one review coordinator and an expert advisory group (five experts in child mental health and development, two psychometricians, three educational psychology experts and one economist). The search strategy and the inclusion and exclusion criteria were developed and agreed by the expert advisory group. Searches in the respective databases and filtering were carried out by the researchers and review coordinator. Any ambiguous cases were taken to the expert advisory group for discussion.

Stage 1: Setting review parameters, literature searching and consultation
The key purpose of this review was to identify measures that could be used in routine CAMHS in order to inform service development and facilitate regional or national comparison. Because any outcome data collected for these purposes would need to be aggregated to the service level in sufficient numbers to provide reliable information, and would need to allow comparison across services and across years, only measures that cover broad symptom and age ranges were considered. The review focused on measures that included a child self-report version. This was partly because of the cost and burden implications associated with other reporters, especially clinicians, but also because of the recent emphasis on patient reported outcome measures e.g. [11] and evidence that, where measures are developed specifically to be child friendly, children can be accurate reporters of their own mental health e.g. [17,19]. The review focused on measures that had strong evidence of good psychometric properties and also took account of the resource implications associated with the measures (in terms of both time and financial cost).

Developing the search terms
For the purposes of this review, child mental health outcome measures were included if they sought to provide measurement of mental health in children and young people (up to age 18). To capture this, search terms were developed by splitting 'child mental health outcome measure' into three categories: 'measurement', 'mental health' and 'child'. A list of words and phrases reflecting each category was generated (see Table 1).

Search of key databases
Search terms were combined using 'and' statements to carry out initial searches focused on four key databases: EMBASE, ERIC, MEDLINE and PsycINFO. Searches resulting in over 200 papers were subjected to basic filtering using the following exclusion criteria: 1) the title made it clear that the paper was not related to children's mental health outcome measures; or 2) the paper was not in English.
The remaining papers were further sorted based on more specific criteria. Papers were removed if: No child mental health outcome measure was mentioned in the abstract; The measure indicated was too narrow to provide a broad assessment of mental health; They referred to a measure not used with children; They were not in English; They were a duplicate; The measure was used solely as a tool for assessment or diagnosis.
A list of identified measures was collated from the papers that were retained.

Consultation with collaborators and stakeholders
In order to identify other relevant measures, consultation with two key groups about their knowledge of other existing mental health measures was conducted: 1) the experts in child and adolescent psychology, education and psychometrics from the research group; 2) child mental health practitioners accessed via established UK networks. In order to determine which of these measures were to be considered for more in-depth review, inclusion and exclusion criteria were established.

Stage 2: Application of inclusion and exclusion criteria

Inclusion criteria
A questionnaire or measure was included if it: Provided measurement of broad mental health and/or wellbeing in children and young people (up to age 18), including measures of wellbeing and quality of life; Was completed by children; Had been validated in a child or adolescent context.

Exclusion criteria
A questionnaire or measure was excluded if it: Was not available in English; Concerned only a narrow set of specific mental disorders or difficulties; Could only be completed by a professional; Took over 30 minutes to complete; Primarily employed open-ended responses; Used an age range that was too narrow (e.g. only for preschoolers); Had not been used with a variety of populations.
Applying these criteria generated a list of 45 measures (see [24]).

Stage 3: Secondary searches
The initial searches provided preliminary information on these 45 measures. However, secondary searches on these measures were conducted in order to gather further information about: Psychometric properties; Symptoms or subscales covered; Response format; Length; Respondent; Age range covered; Number of associated published papers; Settings in which the measure has been used.
Information on specific measures was sought from the following sources (in order of priority): measure manuals, review papers, published papers (prioritizing the most recent), contact with the measure developer(s), and other web-based sources. Measures were excluded if no further information about them could be gathered from these sources.

Stage 4: Filtering of measures according to breadth and extent of research evidence
After collecting this information, the measures were filtered based on the quality of the evidence available for their psychometric properties. Measures were also removed at this stage if it transpired they were earlier versions of measures for which more recent versions had been identified. The original inclusion and exclusion criteria were also maintained. In addition, the following criteria were now applied: 1. Heterogeneity of samples: the measure was excluded if the only evidence for it was in one particular population, specifically children with one type of problem or diagnosis (e.g., only those with conduct problems or only those with eating disorders). 2. Extent of evidence: the measure was retained only if it had more than five published empirical studies that reported use with a sample, or if psychometric evidence was available from independent researchers other than the original developers. 3. Response scales: the measure was retained only if its response scale was polytomous; simple yes/no checklists or visual analogue scales (VAS) were excluded.
These relatively strict criteria were used to identify a small number of robust measures that are appropriate for gauging levels of wellbeing across populations and for evaluating service level outcomes. After these criteria were applied, the retained measures were subjected to a detailed review of implementation features (including versions, age range, response scales, length and financial costs) and psychometric properties. The range of psychometric properties considered included content validity, discriminant validity, concurrent validity, internal consistency and test-retest reliability. We also considered whether the measure had undergone analysis using item response theory (IRT) approaches (including whether it had been tested for bias or differential performance in different UK populations), had evidence of sensitivity to change, or had evidence of being successfully used to drive up performance within services.

Results
The application of the criteria outlined resulted in the retention of 11 measures. The implementation features and psychometric properties of these measures are outlined in Tables 2 and 3. In terms of acceptability for routine use (including burden and potential for dissemination), three of the measures identified (ASEBA, BASC and the full BYI), though below the stipulated half-hour completion time, were in excess of fifty items, which might limit their use for the repeated measurement needed to track change over time in the way that many services now seek to do [3]. These measures are most likely to be useful for detailed assessments and periodic reviews. In addition, the majority of the measures require license fees, introducing a potential barrier to use in clinical services. Kidscreen, CHQ, SDQ, HoNOSCA and PSC are all free to use in non-profit organizations (though some only in paper form and some only under particular circumstances).

Discussion
In terms of scale properties, all the measures identified met key psychometric standards. Each of the final measures has been well validated in terms of classical psychometric evaluation. In addition, a range of modern psychometric and statistical modelling approaches has also been applied to some of these measures, including item response theory (IRT) methods, categorical data factor analysis and differential item functioning, e.g. [51]. This is particularly true of the Kidscreen, which is less well known to mental health services than some of the other measures identified; analyses carried out for this measure include both classical and IRT methods [38].
All measures were able to provide normative data and thus the potential to establish cut-off criteria and to differentiate between clinical and non-clinical groups. However, we found no evidence of any measure being tested for bias or differential performance across ethnic, regional or socio-economic status (SES) groups in the UK. Evidence of sensitivity to change was found only for the YOQ, ASEBA and SDQ, which were found to have the capacity to be used routinely to assess change over time [52]. The other measures may have such capacity, but this was not identified by our searches. It is also worth noting that many of the measures use a three-point Likert scale (e.g., PSC, SDQ). This may result in limited variability in the data derived, possibly leading to insensitivity to change over time and/or floor or ceiling effects if used as a measure of change. In terms of the impact of using these measures, we found no evidence that any measure had been successfully used to drive up performance within services.
In terms of implications for practice, it is hoped that identifying these measures and their strengths and limitations may aid practitioners, who are under increasing pressure to identify and use child- and parent-report outcome measures to evaluate outcomes of treatment [12].
Some limitations should be acknowledged with respect to the current review. It is important to note that some measures were excluded from the current review purely because they did not fit our specific criteria. These measures may nevertheless be entirely appropriate for other purposes. In particular, all measures pertaining to specific psychological disorders or difficulties were excluded because the aim of the review was to identify broad measures of mental health. We recognize that many of these measures are psychometrically sound and practically useful in other settings or with specific groups. Furthermore, as recognized by Humphrey et al. [53] in their review of measures of social and emotional skills, we acknowledge that the publication bias associated with systematic reviews is relevant to the current study and may have affected the inclusion of measures at the final stage of the review. However, we maintain that this criterion is important to ensure the academic rigor of measure validation.
NB: no evidence was found of any measure being tested for bias or differential performance in different UK populations (e.g., ethnic, regional or SES differences), or of any measure being successfully used to drive up performance within services; these categories are not included in the tables, to aid clarity, but did form part of the initial range of considerations. In terms of future research, what is required is more research into the sensitivity to change of these and related measures [54,55], their applicability to different cultures and the impact of their use as performance measurement tools [56]. Research is also needed on the impact of these tools on clinical practice and service improvement [57]. In particular, in light of clinician and service user anxiety about the use of such tools [58][59][60], it would be helpful to undertake further exploration of their acceptability directly with these groups (Wolpert, Curtis-Tyler, & Edbrooke-Childs: A qualitative exploration of clinician and service user views on Patient Reported Outcome Measures in child mental health and diabetes services in the United Kingdom, submitted).

Conclusions
Using criteria that take account of psychometric properties and practical usability, this review was able to identify 11 child self-report mental health measures with the potential to inform individual clinical practice, to feed into service development, and to inform national benchmarking. The review identified some limitations in each measure in terms of either the time and cost associated with administration, or the strength of the psychometric evidence. The different strengths and weaknesses to some extent reflect the heterogeneity of purposes for which mental health measures have been developed (e.g., estimation of prevalence and progression in normative populations, assessment of intervention impact, individual assessment at treatment outset, tracking of treatment progress, and appraisal of service performance). While it is anticipated that, as use of such measures diversifies, the evidence base will expand, there are some gaps in current knowledge about the full range of psychometric properties of many of the shortlisted measures. However, current indications are that the 11 measures identified here provide a useful starting point for those looking to implement mental health measures in routine practice and suggest options for future research and exploration.