Confirmatory Factor Analysis for the Service-Learning Outcomes Measurement Scale (S-LOMS)



Introduction
In comparison with similar measurement instruments that have been adopted in the past, S-LOMS has several merits. First, it has been designed for the context of Hong Kong, reflecting the local culture and recent developments within the higher education sector there (Snell & Lau, 2020). Second, the set of domains included in S-LOMS comprehensively covers the desired developmental outcomes of Hong Kong-based service-learning programs. Third, the administration of S-LOMS is both standardized and flexible, such that practitioners can elect to measure developmental categories or domains, according to their needs. Fourth, S-LOMS is expected to undergo rigorous validation before its practical implementation. In a previous validation study (Snell & Lau, 2020), S-LOMS was tested with 400 Hong Kong university students, and the current study involves a further 600-plus respondents. It is intended that there will be subsequent studies of test-retest reliability and criterion validity, which will engage additional respondents. We anticipate that the conceptual relevance and scale validity of S-LOMS will encourage service-learning practitioners to use it as a tool for assessing progress in enhancing developmental outcomes for students.
The starting point for the development of S-LOMS as a measurement instrument was a review of the common student developmental domains arising from service-learning, as documented in past literature. This was followed by considering the special educational and social context for service-learning in Hong Kong. For example, within the overarching category of civic orientation and engagement, the instrument was oriented more toward moral development than participatory democracy. To further match the emerging instrument to the local context, the authors also invited local service-learning practitioners to examine the developmental domains and proposed items in the development process. As a result, 15 developmental domains under the four aforementioned overarching categories were identified.
An initial study (Snell & Lau, 2020) was then conducted to validate S-LOMS based on its administration to a sample of 400 university students. S-LOMS was found to have satisfactory internal consistency, with the underlying dimensionality uncovered through exploratory factor analysis (EFA) using the method of Principal Components with oblimin rotation. In that study, regarding reliability, S-LOMS achieved Cronbach's alpha values above .70 for its four categories, while the 15 original domains collapsed into 11, as follows. Creativity and problem solving skills combined into the higher-order domain of creative problem solving skills. A second higher-order domain, relationship and team skills, combined relationship skills with team skills. A third higher-order domain, community commitment and understanding, combined commitment to social betterment with understanding community. A fourth higher-order domain, caring and respect, combined empathy and caring for others with respecting diversity. The other domains remained discrete.
The current study continues the validation of the measurement instrument. This paper reports the results of testing S-LOMS with a new sample through confirmatory factor analysis (CFA) against the above factor structure that emerged in the previous EFA study (Snell & Lau, 2020). It is intended that subsequent research not reported here will test for other types of validity regarding S-LOMS, such as test-retest reliability, and will then use S-LOMS to measure developmental outcomes for students through before and after administration around service-learning experience. Following an EFA with a CFA on a new sample is a typical step in the scale development process (e.g., Brown, 2015; Hurley, Scandura, Schriesheim, Brannick, Seers, Vandenberg, & Williams, 1997; Tay & Jebb, 2017; Worthington & Whittaker, 2006). While EFA is used to identify the dimensionality of a set of variables, it does not force variables to load on particular factors in advance. By contrast, CFA tests whether data fit a pre-specified factor structure (Stevens, 2009). The current study tested a series of alternative models with various factor structures. Since the 11-domain factor structure discussed above had received empirical support from only one prior EFA study, the current study adopted a prudent approach by testing that structure together with the original theoretical 15-domain factor structure proposed by Snell and Lau (2020) and with other possible structures, so as to compare which one would provide the best fit with a new set of data.

Participants
The current study recruited 629 university students from four Hong Kong government universities, namely Lingnan University, The Hong Kong Polytechnic University, Hong Kong Baptist University, and The Education University of Hong Kong. Female respondents constituted a larger part of the sample (59.5%) and the average age was 20.5 (s.d. = 2.21). Broken down by major disciplines, the sample comprised engineering and science (40.9%), business (19.9%), social sciences (14.3%), arts (12.7%), and healthcare (12.2%). Among the respondents, 65.8% had previous service-learning experience or were in the process of taking service-learning programs or courses.

Instrument
The original structure with four overarching categories and 15 domains described above was employed in the construction of S-LOMS. In the 56-item instrument administered to the students (see Appendix 1), there were three to four items for each of the 15 domains, in the form of self-descriptive statements. Respondents were asked to indicate the extent of their agreement with the items on a 10-point Likert scale (from 1, "strongly disagree" to 10, "strongly agree").

Procedures
The respondents were invited to answer S-LOMS on a voluntary basis in a classroom setting, with consent from the instructors of the respective courses, which did not necessarily involve service-learning. Besides S-LOMS, the students completed some demographic items about gender, age, academic background, and prior service-learning experience. Upon completing the questionnaire, the students received a HK$50 supermarket voucher.

Statistical Analysis
CFA was employed in the analysis, using EQS version 6.4 for Windows. To decide on the estimation method, the extent of missingness in the sample data and the assumption of multivariate normality were checked. Regarding missingness, 520 of the 629 participants (82.7%) provided no missing responses. The mean percentage of missing responses per item was 0.4%. Sixty-three missing-data patterns were identified among the 109 respondents with missing responses. The sample was also tested for multivariate normality, and the related indices provided by EQS indicated violation of the assumption. Specifically, both Yuan, Lambert, and Fouladi's coefficient (1,332.76) and its normalized estimate (208.24) exceeded 5.00, implying that the sample data were nonnormally distributed (Bentler, 2006).
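The missingness checks described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure run in EQS: the function name `summarize_missingness`, the use of `None` to mark a missing response, and the toy data are all assumptions made for the example. The last line only re-derives the reported 82.7% from the counts given in the text.

```python
def summarize_missingness(rows):
    """Summarize missingness; each row is one respondent, None marks a missing item."""
    n = len(rows)
    complete = sum(1 for r in rows if None not in r)
    # A "missing pattern" is the set of item positions a respondent left blank.
    patterns = {
        tuple(i for i, v in enumerate(r) if v is None)
        for r in rows
        if None in r
    }
    n_cells = sum(len(r) for r in rows)
    n_missing = sum(v is None for r in rows for v in r)
    return {
        "n": n,
        "complete_cases": complete,
        "pct_complete": 100.0 * complete / n,
        "n_missing_patterns": len(patterns),
        "pct_missing_cells": 100.0 * n_missing / n_cells,
    }


# Hypothetical toy responses (4 respondents, 3 items); NOT the study data.
toy = [
    [7, 8, 6],
    [5, None, 9],
    [None, None, 4],
    [6, None, 8],
]
summary = summarize_missingness(toy)

# Share of complete cases implied by the counts reported in the text.
reported_pct_complete = round(100.0 * 520 / 629, 1)
```

In the toy data, two respondents share the same pattern (only the second item missing), so they count as one pattern.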
As the data were incomplete and nonnormal, the full information maximum likelihood (FIML) method with robust correction was employed in EQS for the CFA, as recommended by Bentler (2006). The scaled chi-square (Yuan-Bentler, i.e., Y-B χ²) and the other indices under the Yuan-Bentler correction were adopted for deciding the goodness of fit of the models. This approach is regarded as an effective adjustment procedure when the data violate multivariate normality and are incomplete (Blunch, 2016; Byrne, 2008; Savalei & Bentler, 2005).
In executing the analysis on the models specified below, a typical CFA parameterization was adopted, as described in steps (a)-(d). In step (a), the first path between each designated factor (whether a learning domain or an overarching category) and its first variable (whether an assigned item or a learning domain) was set at 1.0, for the sake of model identification and latent variable scaling. In step (b), all other parameters and factor variances were freely estimated. In step (c), a constant variable V999 with no variance and a mean value of 1.0 was created for each variable equation. In step (d), covariances were freely estimated between each pair of designated factors. For the sake of comparison, no modifications, such as error covariances, were made to the models.
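For a first-order model, steps (a)-(d) can be written compactly as below. The symbols are illustrative notation introduced here, not taken from the source or from EQS output; the intercept term is the coefficient on the constant V999.

```latex
% Illustrative notation (assumed): x_{ij} is the i-th item of factor j,
% \xi_j the factor, \tau_{ij} the intercept carried by the constant V999 (= 1),
% and \varepsilon_{ij} the item's error term.
\begin{align}
x_{ij} &= \tau_{ij}\cdot 1 + \lambda_{ij}\,\xi_j + \varepsilon_{ij}, \\
\lambda_{1j} &= 1 \quad \text{(step a: first loading fixed for identification and scaling)}, \\
\operatorname{Var}(\xi_j) &= \phi_{jj}, \qquad
\operatorname{Cov}(\xi_j, \xi_k) = \phi_{jk} \quad \text{(steps b and d: freely estimated)}.
\end{align}
```

In the two-layer models, the same form would apply one level up, with each domain taking the role of the indicator and its overarching category taking the role of the factor.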
As the model chi-square test, although commonly used, is subject to a number of limitations (Hooper, Coughlan, & Mullen, 2008) and tends to lead to models being rejected as not fitting (Thompson, 2004), other goodness of fit indices, including CFI, NNFI, and RMSEA, were also used for assessing the models (Tabachnick & Fidell, 2013). Since the robust correction was implemented, the values of these indices under the Yuan-Bentler correction were adopted. Acceptable model fit was defined as follows: CFI ≥ .90; NNFI ≥ .90; RMSEA ≤ .08 (Bentler, 1990; Brown, 2015; Browne & Cudeck, 1992). Since a series of models (see the next section) with different factor structures were tested, model AIC indices were employed in comparing the competing models. AIC is among the most commonly adopted indices for comparing non-nested models on the basis of chi-square values (Brown, 2015). The smallest AIC value indicates the best fitting model, under the condition that the models are non-nested.
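The acceptance criteria above amount to a simple decision rule, sketched below. The function name `acceptable_fit` and the two sets of index values are hypothetical and serve only to illustrate the cutoffs; they are not results from the study.

```python
def acceptable_fit(cfi, nnfi, rmsea, index_cutoff=0.90, rmsea_cutoff=0.08):
    """Return True when CFI and NNFI are >= .90 and RMSEA is <= .08."""
    return cfi >= index_cutoff and nnfi >= index_cutoff and rmsea <= rmsea_cutoff


# Hypothetical index values, for illustration only.
fits_well = acceptable_fit(cfi=0.93, nnfi=0.92, rmsea=0.05)
fits_poorly = acceptable_fit(cfi=0.88, nnfi=0.87, rmsea=0.09)
```

A strict rule of this kind treats a value such as .898 as failing the .90 cutoff, so borderline cases still call for judgment.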

Model Specification
Since S-LOMS is a newly established measurement instrument with only one prior EFA validation to support its internal factor structure, the current study tested, through a series of CFAs, whether the data fitted other possible factor structures for the instrument, besides the one already reported by Snell and Lau (2020). The seven models that were tested are represented in Table 1 and are explained next.
Model 1 serves as a baseline model, within which all items are loaded onto a single factor. Model 2 is theoretically grounded to the extent that the items are assumed to load directly onto their respective overarching categories identified in prior literature, of which there are four: knowledge application, personal and professional skills, civic orientation and engagement, and self-awareness. The developmental outcome domains such as relationship skills were omitted from this model. Models 3 and 4 tested whether items loaded onto their corresponding developmental outcome domains irrespective of the overarching categories (developmental domains directly to corresponding items). Model 3 was theoretically based, to the extent that it comprised the original 15 domains that S-LOMS had originally been designed to measure (Snell & Lau, 2020). For example, creativity and problem-solving skills were retained as two separate domains instead of being merged into the single domain of creative problem solving skills. However, the four overarching categories were not included in this model. By contrast, Model 4 was empirically based, to the extent that it combined some pairs among the original 15 outcomes to match the 11 domains that had been discovered in the previous EFA study (Snell & Lau, 2020).
Model 5 and Model 6 involved two layers of factors, and constituted hybrids of Model 2 with either Model 3 or Model 4. Both Model 5 and Model 6 were theoretically based, to the extent that they included the four overarching categories. In addition, Model 5 included the 15 theoretically based original outcome domains, whereas Model 6 included the 11 domains from the previous EFA study.
An additional model, Model 7, was a modification of Model 6 that was created by combining the domain of sense of social responsibility with the domain of community commitment and understanding under the overarching category of civic orientation and engagement, as is explained in the next section.
Table 2 reports the CFA results in terms of the chi-square test, goodness of fit indices, and AIC indices. The chi-square values for all these models were statistically significant, reflecting that the large sample size increased the power of the test and thus the likelihood of rejecting a model as not an exact fit. Accordingly, the goodness of fit indices were taken into consideration (Bentler, 1990), and Models 3, 4, 5, and 6 demonstrated acceptable model fit, with both NNFI and CFI at approximately .90 or above, and RMSEA and its 90% confidence interval at or lower than .05. Moreover, all absolute values of the standardized residuals were small, indicating that those models fit the data well enough.
Note: * Y-B χ² denotes the Yuan-Bentler scaled chi-square values with robust correction applied. The fit indices in the table are also the robust-corrected versions.

Model Comparison
Comparing the AIC indices for the above four models indicates that Model 3 is the best fit, followed by Models 4, 6, and 5, in order of preference. Despite being the best fit, the results for Model 3 nonetheless indicate two issues. Specifically, the factor correlations between two pairs of learning domains, namely 1) creativity and problem solving skills; and 2) commitment to social betterment and understanding community, are 1.0. These correspond to Snell and Lau's (2020) results in the earlier EFA, which led to the creation of higher-order domains, such as "creative problem solving skills". Factor correlations approaching 1.0 constitute strong grounds for combining multiple factors into a single factor, given the poor discriminant validity that is implied (Brown, 2015).
A similar issue was found with Model 5, which put the 15 domains under four overarching categories, in that the factor coefficient between understanding community and its overarching category of civic orientation and engagement was found to be 1.0. Because of these issues, Model 3 and Model 5 were dropped and only those models with a structure involving 11 domains were considered. Among the remaining models, Model 4 was preferred, given its low AIC value, acceptable goodness of fit (Y-B χ² = 3,450.80; df = 1,429; p = .00; NNFI = .902; CFI = .909; RMSEA = .047, 90% CI = .045, .049), and good factor loadings and factor correlations.
Model 6, with 11 domains under four overarching categories, was also found to have an issue, in that the domain of sense of social responsibility obtained a factor loading of 1.0 from its parent category, indicating the need for further structural simplification under civic orientation and engagement. Accordingly, Model 7 was created as a modification of Model 6 by combining the two conceptually related domains of sense of social responsibility and community commitment and understanding. Model 7 obtained acceptable overall goodness of fit (Y-B χ² = 3,631.76; df = 1,470; p = .00; NNFI = .898; CFI = .902; RMSEA = .048, 90% CI = .046, .050), and a relatively low AIC value (691.76). The 56 items and the 10 domains loaded with statistical significance on their respective domains and categories, nearly all with loadings over .60 (except two items with loadings close to .60), while the four categories were significantly yet not perfectly correlated. Although more parsimonious models (i.e., Model 4) would usually be preferred, a more complex model may also be considered if it is based on a theory that can "substantially improve understanding of the phenomenon or can substantially broaden the types of phenomena understood using that theoretical approach" (Stevens, 2009, p. 572). In our case, the results of the CFA for Model 7 imply that S-LOMS can also be understood as a 10-domain model with four overarching categories.
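The AIC value reported for Model 7 is consistent with the model AIC convention of the chi-square minus twice the degrees of freedom (3,631.76 − 2 × 1,470 = 691.76). The sketch below applies that formula; the function name `model_aic` is hypothetical, and the Model 4 value is computed here under the same assumed convention rather than quoted from the source tables.

```python
def model_aic(chi_square, df):
    """Model AIC under the assumed convention: chi-square minus 2 * df."""
    return chi_square - 2 * df


# Y-B chi-square and df as reported in the text.
aic_model_7 = model_aic(3631.76, 1470)  # matches the reported 691.76
aic_model_4 = model_aic(3450.80, 1429)  # computed here, not quoted from the source

# The smaller AIC is preferred when comparing non-nested models.
preferred = "Model 4" if aic_model_4 < aic_model_7 else "Model 7"
```

Under this convention the ordering agrees with the text, which describes Model 4 as having the lower AIC of the two.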
In summary, while the fit indices implied that Model 4 was the best model for S-LOMS, inspection of factor loadings led to the creation of Model 7, which was retained for further consideration in the next step, where Model 4 and Model 7 were examined for their stability across gender by using multi-sample analysis.

Multi-sample Analysis
Multi-sample analysis, or the factorial invariance test, is especially suitable for testing whether a particular model structure, or the relationships between factors in a model, is applicable across samples defined by different types of categorization (Schumacker & Lomax, 1996). The dataset was divided into two samples by gender (248 male and 364 female). The demographic profile, including mean age and academic backgrounds, for the two sub-samples is listed in Table 3 below. The missingness of both the male and female samples revealed an acceptable pattern, with around or over 80% of the respondents in each sample providing no missing responses. As with the previous analysis, the FIML method with the Yuan-Bentler correction was employed in model estimation, given the incomplete data with the multivariate nonnormal pattern for both samples (see Table 3). We followed the approach recommended by Tabachnick and Fidell (2013) to perform the multi-sample analysis. We began with the baseline model for the two samples, and constrained a different parameter in each round to test whether the chi-square difference for each group between the less restrictive and more restrictive model was statistically significant. In EQS, this result is presented as the overall chi-square values of the two models against their summative degrees of freedom, in accordance with Bentler's (2006) recommendation. In this procedure, if the result is not significant, the next step is to add another set of constraints followed by another test, with further steps taken until the result is significant. For our analysis, the parameters comprised, in order, factor loadings, factor coefficients, and factor covariances, but disturbance variances and error variances were not tested due to concern about the sub-group sample size. Model 4 and Model 7 were tested by means of the above method.
The results of the multi-sample analyses for Model 4 and Model 7 are given in Table 4. The multi-sample analyses indicated that both Model 4 and Model 7 were stable across the sample by gender, with acceptable goodness of fit (NNFI and CFI at .90 or above; and RMSEA < .06). Tables 5 and 6 display the reliability results of the developmental outcome domains and overarching categories of the two models. These results indicate satisfactory reliability (see Lance, Butts, & Michels, 2006), with most Cronbach's alpha values above .80 and a small number just below .80. This was the case for the entire scale (.981), for the four overarching categories (.866 to .957 for Model 7), for the 10 developmental outcome domains in Model 7 (.794 to .925) and for the 11 domains in Model 4 (.790 to .915).
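For readers who want to reproduce reliability coefficients of this kind on their own data, the standard formula for Cronbach's alpha can be sketched as follows. The function name `cronbach_alpha` and the toy responses are hypothetical; the sketch assumes complete responses (no missing values) and is not the computation used to produce Tables 5 and 6.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for complete data.

    item_scores: list of respondents, each a list of k item scores.
    """
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items.")
    # Sample variance of each item (columns) and of the summed scale score.
    item_var_sum = sum(variance(col) for col in zip(*item_scores))
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical toy data: 5 respondents answering a 3-item domain (1-10 scale).
toy_domain = [
    [8, 7, 9],
    [5, 5, 6],
    [9, 9, 10],
    [3, 4, 4],
    [6, 6, 7],
]
alpha = cronbach_alpha(toy_domain)
```

The same function can be applied to the items of a single domain, of an overarching category, or of the whole scale.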

Selected Models and Summary
Based on the above analyses, Model 4 and Model 7 were selected as potential final models, but with an inclination toward Model 4 because of its lower AIC value. The final findings for Model 4 and Model 7, in terms of factor loadings, factor coefficients, factor correlations, and reliability indices, are illustrated in Figures 1 and 2, and Tables 5 and 6. All items in Model 4 and Model 7 loaded on their designated domains and categories, except that for Model 7 the domain of sense of social responsibility was combined with that of community commitment and understanding. As a result, the constituent domains within the category of civic orientation and engagement distinguish interpersonal-level issues, i.e., caring and respect, from community-level issues. Although the structure of both models received confirmation, the high factor correlations and coefficients illustrated that a more parsimonious solution could be obtained (Brown, 2015). We will discuss this further in the next section.
Figure abbreviations (partially recovered): ... Self-efficacy; SU: Self-understanding; CSI: Commitment to Self-improvement
Note: * The cognominal category was created above the domain "Knowledge Application" for the sake of providing a clear model structure.

Conclusion
By using CFA with a relatively large sample, the current study sought to confirm the dimensionality and factor structure of S-LOMS that had been obtained through EFA in a previous study (Snell & Lau, 2020). Seven alternative models were specified and tested. The results indicated that an 11-domain model without overarching categories (Model 4) was the best fit, outperforming the single factor model (Model 1) and the four-category model (Model 2) in terms of the AIC values and goodness of fit indices. By contrast, the analysis indicated that both models that contained 15 developmental outcome domains (Model 3 and Model 5) were problematic despite acceptable fit indices, because of factor correlations and coefficients of 1.0 between particular pairs of domains. Thus, in Model 3, there was a factor correlation of 1.0 between the domains of creativity and problem solving skills, and between the domains of commitment to social betterment and understanding community; while in Model 5 a factor coefficient of 1.0 was found between the domain of understanding community and its overarching category of civic orientation and engagement. The discovery of factor correlations or coefficients approaching 1.0 indicates that there may be more parsimonious model structures (Brown, 2015), and is consistent with the EFA results in the prior study (Snell & Lau, 2020).
Model 6, with a structure of 11 developmental outcome domains under four overarching categories, was also rejected due to its factor coefficient of 1.0 between the domain of sense of social responsibility and its overarching category of civic orientation and engagement. Model 7 was therefore created as a modification of Model 6, with two domains subsumed under the overarching category of civic orientation and engagement. The first of these domains, a composite of community commitment and understanding and sense of social responsibility, reflects concern for societal-level issues. The second domain, caring and respect, reflects interpersonal-level sensitivity. Acceptable goodness of fit was found between the data and Model 7, albeit with an AIC that was larger than for Model 4. Both the 11-domain model without overarching categories (Model 4) and the 10-domain model with four overarching categories (Model 7) were found to be invariant in terms of factor structure, factor loadings, factor coefficients, and factor correlations between male and female groups in the sample, indicating the stability of both models across gender (Schumacker & Lomax, 1996).
The results, indicating a preference for 11 over 15 developmental domains, confirmed the previous EFA findings of Snell and Lau (2020). Specifically, in Model 4, creativity and problem solving skills were combined into creative problem solving skills; relationship skills and team skills were combined under relationship and team skills; understanding community and commitment to social betterment were integrated under community commitment and understanding; and empathy and caring for others along with respecting diversity were subsumed under caring and respect.
The four overarching categories confirmed in Model 7 are consistent with typologies of the major developmental outcomes of service-learning in the past literature, which include academic enhancement, personal growth, and civic learning (e.g., Driscoll et al., 1996; Eyler & Giles, 1999; Eyler et al., 2001; Felten & Clayton, 2011). Model 7 also includes self-awareness as an overarching category, which was created by Snell and Lau (2020) to capture the developmental outcomes associated with Confucian self-cultivation, which has influenced tertiary education policy in Hong Kong. At the developmental outcome domain level, Model 7 further reduces the number of domains from 11 to 10, by combining sense of social responsibility with community commitment and understanding. In summary, by comparing the 11- and 15-domain structures through CFA with a new sample, the current study confirmed that S-LOMS can be structured as an 11-domain model (Model 4) without an overarching category level, which was a better fit with the data than the alternative models. Nonetheless, the study also offered some support for a model with 10 developmental outcome domains under four overarching categories (Model 7), resembling the findings of past literature. Multi-sample analysis indicated that both Model 4 and Model 7 were stable across male and female groups in the current sample.

Practical Implications
Because of the satisfactory factor validity and internal consistency reported above, S-LOMS offers flexibility in how the developmental impacts on students engaging in service-learning can be measured, with a number of options besides using the entire 56-item scale. For example, an instructor with a specific interest only in the two developmental outcome domains of critical thinking skills, which has three items, and creative problem-solving skills, which has eight items, need only use those 11 items for measurement, thereby streamlining the data collection process. Another example is that an investigator who wishes to focus on measuring impact within the overarching category of civic orientation and engagement need only use the 18 associated items instead of the entire S-LOMS. Thus, S-LOMS can be administered flexibly in accordance with instructors' or researchers' needs. Overall scores for any particular developmental outcome domain can be derived by averaging the scores of the associated items. It is assumed that investigators would adopt a pretest-posttest research design for measuring developmental impacts.
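The scoring rule described above (a domain score is the average of its items, compared before and after the service-learning experience) can be sketched as follows. The item identifiers `CT1` to `CT3`, the function name `domain_score`, and the responses are hypothetical; averaging over answered items only is an assumption made here for handling missing responses, not a rule stated in the source.

```python
from statistics import mean


def domain_score(responses, item_ids):
    """Average the answered items of one domain.

    responses: dict mapping item id -> score (1-10), or None if unanswered.
    Assumption: unanswered items are skipped rather than imputed.
    """
    values = [responses[i] for i in item_ids if responses.get(i) is not None]
    if not values:
        raise ValueError("No answered items for this domain.")
    return mean(values)


# Hypothetical item ids for a three-item domain such as critical thinking skills.
critical_thinking_items = ["CT1", "CT2", "CT3"]

# Hypothetical pretest and posttest responses from one student.
pretest = {"CT1": 6, "CT2": 7, "CT3": 5}
posttest = {"CT1": 8, "CT2": 8, "CT3": 7}

pre_score = domain_score(pretest, critical_thinking_items)
post_score = domain_score(posttest, critical_thinking_items)
change = post_score - pre_score  # positive values suggest self-reported growth
```

The same function works for an overarching category by passing the ids of all items that belong to its domains.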

Limitations and Further Studies
The first limitation lies in the level of fit of the models in the current study. Although both Model 4 and Model 7 achieved acceptable goodness of fit indices (i.e., approximately .90 or above for NNFI and CFI), they did not meet the more stringent level of .95 for NNFI and CFI (Hu & Bentler, 1999). Further studies should apply S-LOMS to new samples to test whether consistently satisfactory goodness of fit indices can be obtained, and to discover what modifications are necessary in order to achieve this. The second limitation arises from the multivariate nonnormality of the data from the current sample, which can bias ML-based model estimation. Despite our attempt to correct for this by applying the Yuan-Bentler correction with the FIML method, other researchers have stated that better results can be achieved by adopting a two-stage robust method for non-normal missing data (e.g., Tong, Zhang, & Yuan, 2014), and further studies can consider adopting the latter approach.
Third, the numerous factor correlations and factor coefficients exceeding .85 that were found for both Model 4 and Model 7 warrant attention. They imply poor discriminant validity (Brown, 2015) and may raise questions about the unique predictive validity of the individual factors. Further research is thus required into the predictive validity of the domains and categories of S-LOMS. Further limitations, in the case of Model 7, concern the high factor coefficients between the 10 domains and their corresponding overarching categories, as well as the high factor correlations between the overarching categories. This phenomenon matches the observation by Snell and Lau (2020) that although the four overarching categories are conceptually distinct, they are empirically inter-related. The limitation of high factor correlations and coefficients suggests that S-LOMS may need further refinement, and that there is scope for testing a set of simplified models against data from new samples.
Despite the above limitations, the current study has provided empirical evidence about the construct validity of S-LOMS, from which further validation work can be done. The next steps being undertaken include validating test-retest reliability over an interval of time, and testing