Review Article | DOI: https://doi.org/10.31579/2637-8892/314
Indian Ports Association, Indian Statistical Institute, Indian Maritime University.
*Corresponding Author: Satyendra Nath Chakrabartty, Indian Ports Association, Indian Statistical Institute, Indian Maritime University.
Citation: Satyendra N. Chakrabartty, (2025), Problems of Rating Scales in Health Measurements, Psychology and Mental Health Care, 9(1): DOI:10.31579/2637-8892/314
Copyright: © 2025, Satyendra Nath Chakrabartty. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Received: 05 November 2024 | Accepted: 13 February 2025 | Published: 20 February 2025
Keywords: patient-reported scale; linear transformation; normal distribution; ability to detect changes; elasticity
Background: Patient-reported outcomes (PROs) using multi-item rating scales are not comparable due to different formats of the scales, different factors under consideration, etc.
Objectives: To discuss methodological limitations of PROs in health measurement and to provide a method for converting ordinal item-scores of PROs into scores that follow a normal distribution.
Method: Converting raw item-scores to equidistant scores (E), followed by standardization to Z-scores ~N(0,1), and converting the Z-scores to proposed scores (P_i) in the range 1 to 100. Scale scores (P_Scale), as the sum of the P_i's, and battery scores (B-scores), as the sum of scale scores, follow normal distributions.
Results: Each of the P_Scale-scores and B-scores satisfies desired properties, permits parametric analysis, facilitates comparison of patient status and the finding of equivalent scores of two PROs (with implications for classification), and allows reliability and validity to be assessed in a better fashion.
Conclusion: The suggested method, which contributes to improved scoring of PRO instruments with the additional benefits of identifying poorly performing scales and assessing progress across time, is recommended.
Often, subjective self-reported measures of illness are evaluated through rating scales to assess objective health (Bourne, 2009). Data resulting from such rating scales are categorical and at the ordinal level. A large number of clinical studies use patient-reported rating scales (PROs) to quantify clinical conditions such as intensity of disease, effects of disease or treatment, health status, quality of life (QoL), pain, sleep disorders, depression, anxiety, stress and much else as part of the patient decision-making process. The MAPI Research Trust, a nonprofit organization, provides information for all stakeholders in the field of Patient Centered Outcomes, particularly for Clinical Outcome Assessments (COAs) (https://c-path.org/programs/proc/).
PROs consist of a number of scales which vary in features such as number of items (scale length), number of levels (scale width), scoring methods, etc., and are therefore not comparable. Scale length, scale width and frequencies of levels affect differential item functioning (DIF). Analysis of ordinal data emerging from PROs without satisfying the assumptions of the statistical techniques used may distort the results. Mokkink et al. (2010) suggested prior checking of the measurement properties of PRO instruments. Self-reported rating scales consisting of multi-point items suffer from methodological limitations, including addition that is not meaningful. If addition is not meaningful, computations like standard deviation (SD), correlation, Cronbach's α, etc. are meaningless. Statistical analyses like regression, principal component analysis (PCA), factor analysis (FA) and testing equality of means by t-test or ANOVA assume normal distributions of the variables under study. Questionnaire scores with unknown distributions violate this assumption and may distort the results. Assigning equal importance to items and constituent scales in summative scoring of PROs is not justified, since contributions of items or scales to the total battery score, values of inter-item correlations, scale-battery correlations and factor loadings are different (Parkin et al. 2010). Mean, SD and Cronbach's alpha tend to increase with the number of levels and items and may influence the mean score more than the underlying variable does (Lim, 2008). There is no consensus regarding the number of levels per item in rating scales (Chakrabartty and Gupta, 2016).
Studies attempting to evaluate the effect of selenium supplementation on stroke used different definitions of stroke, either as categorical variables or as variables on a ratio scale. While investigating the dose-response correlation between dietary selenium intake and stroke risk, Shi et al. (2022) used the self-reported single question "Has a doctor ever told you that you had a stroke?" to define stroke; stroke was thus taken as a categorical variable and not on a ratio scale. Zhang et al. (2023) asked each participant whether a doctor had ever given a diagnosis of stroke (no, yes, unknown) and defined stroke as a self-reported physician diagnosis during follow-up, with follow-up time taken as the date of the first discovery of stroke. Sharifi-Razavi et al. (2022) included adults with ischemic stroke confirmed by neuroimaging within the last 72 hours, with a volume of at least one-third of the middle cerebral artery (MCA) territory, the territory most commonly affected in cerebral infarction. Different inclusion criteria for stroke and different analyses resulted in different relationships between selenium supplementation and stroke and in different conclusions; thus, the effect of dietary selenium intake on stroke risk remains controversial. Beneficial effects of selenium on stroke risk have been found (Xiao et al. 2019; Hu et al. 2019). However, selenium at high levels is toxic (Hadrup & Ravn-Haren, 2020). The Nutritional Prevention of Cancer (NPC) trial showed no benefit of selenium supplementation on the risk of stroke (Ding & Zhang, 2021; Shi et al. 2021). One possible reason for such differences could be consideration of the benefits of circulating selenium levels rather than of the quantity of selenium intake, which probably has a U-shaped relationship with stroke risk (Tan et al. 2021).
The paper suggests a method of transforming ordinal scores of the i-th item of a PRO to normally distributed proposed scores (P_i-scores), facilitating meaningful addition and deriving the scale score (P_Scale) as an arithmetic aggregation of P_i-scores satisfying desired properties, enabling assessment of progress and parametric analysis.
If the distance between two successive response-categories or levels of K-point items (K = 2, 3, 4, 5, …) is denoted by d_j, then d_j ≠ d_{j+1} for j = 1, 2, 3, 4, …, i.e. the scores are not equidistant (Rutter and Brown 2017). Thus, addition of ordinal item scores is not meaningful (Jamieson, 2004), and even comparison of two such sums by > or < is meaningless (Hand 1996). Despite this limitation, an individual's score is taken as the sum of item scores on the ordinal scale (Kyte et al. 2015). Meaningful addition of two random variables, X + Y = Z, requires X and Y to have similar probability distributions and the distribution of Z to be known for further use. In terms of probability, X + Y = Z implies P(Z = z) = ∑_x P(X = x, Y = z − x) for the discrete case and f_Z(z) = ∫ f_X(x) f_Y(z − x) dx for the continuous case. Thus, knowledge of the probability density functions (pdfs) of X and Y and of their convolution is necessary. However, if each of X and Y follows a log-normal distribution, the distribution of X + Y cannot be obtained in closed form and requires the complex Lie-Trotter operator splitting method (Lo, 2012).
Generic or disease-specific multidimensional rating scales for QoL may not consider all relevant constructs. For example, the disease-specific stroke-adapted 30-item SIP version (SA-SIP30) with 8 subscales excludes domains like recreation, energy, pain, general health perceptions, overall quality of life and stroke symptoms (Golomb et al. 2001). Multidimensional rating scales may even fail to give a global summary, like the 36-Item Short Form Health Survey questionnaire (SF-36) (http://www.webcitation.org/6cfeefPkf). A multidimensional scale covers a number of sub-scales/dimensions where scale formats differ across sub-scales. For example, SF-36 has 10 items (3-point) on physical functioning, 3 items (each 6-point) on energy/fatigue, 2 items on a 5-point scale for social functioning, 6 items (6-point) on emotional well-being, 5 items (5-point) on general health, two items on pain (one 6-point and one 5-point), seven binary items and another item regarding reported health transition over the last year. The set-up implies (i) different distributions for binary, 3-point, 5-point and 6-point items, (ii) higher mean and SD for the sub-class containing 6-point items, and (iii) different reliability and validity for different sub-classes (Preston and Colman, 2000). Two distinct concepts measured by the SF-36 are the Physical Component Summary (PCS) and the Mental Component Summary (MCS). Taft et al. (2001) found a paradoxical inverse relationship between PCS and MCS, which implies that good physical condition presupposes poor mental health and vice versa. SF-36 was found to be negatively correlated with the Patient Health Questionnaire (PHQ) and the General Anxiety Disorder questionnaire (GAD-7), probably due to the different factors measured by them (Johnson et al. 2019).
Scoring methods of PROs are different. The dimension score of the MacNew Heart Disease Health-Related Quality of Life Questionnaire (MacNew) is based on the mean of the responses to items belonging to the dimension, but the Cardiovascular Limitations and Symptoms Profile (CLASP) scores use weights to find the total for each subscale. Each dimension of the Myocardial Infarction Dimensional Assessment Scale (MIDAS) is scored separately. There is no clear understanding of the factors being measured. Against the two factors proposed in the Hospital Anxiety and Depression Scale (HADS), the factor structure of the instrument was found to be three-dimensional in a range of clinical populations (Caci et al. 2003), against a recommendation of HADS as a one-dimensional measure (Costantini et al. 1999) and statistical evidence for a three-factor structure (Strong et al. 2007). Similarly, for the Psychological General Well-Being Index (PGWBI), Lundgren-Nilsson et al. (2013) found a single construct of psychological wellbeing against the six underlying factors of the scale, raising questions about factor-analytic interpretation in the presence of local dependency.
Use of zero as an anchor value does not allow computation of expected values (value of the variable × probability of that value). Responses of zero reduce the mean and SD of the scale and the item-total correlations, and affect regression or logistic regression, etc.
If each respondent of a sub-group selects the level marked as "0" for an item, then mean = variance = 0 for the sub-group for that item, and correlation with that item is undefined. Stucki et al. (1995) found that more than 40% of the patients scored zero in 10 subscales of the Sickness Impact Profile (SIP) and in one subclass of SF-36. It is better to mark the anchor values as 1, 2, 3 … and so on, keeping the convention that a higher score ⇔ a higher value of the variable being measured. A higher score in each of the Nottingham Health Profile (NHP) and the Minnesota Living with Heart Failure (MLHF) questionnaire indicates greater health problems, unlike the Sickness Impact Profile (SIP). Thus, the directions of scores differ across scales. Rating data with floor and ceiling effects follow unknown distributions and do not satisfy the assumptions of PCA, such as bivariate normality for each pair of observed variables, normally distributed scores, etc. Test reliability by Cronbach's alpha assumes a one-dimensional scale and tau-equivalence (equality of all factor loadings). Multidimensional PROs like the Insomnia Severity Index (ISI), Pittsburgh Sleep Quality Index (PSQI) and Insomnia Symptom Questionnaire (ISQ) violate the assumption and underestimate the coefficient alpha (Daniel, 1990). The coefficient alpha is influenced by variance sources, sampling errors (Terry & Kelley, 2012), sample size (Charter, 1999) and even test length and test width (Luh, 2024).
Validity of a multidimensional scale as correlation with criterion scores raises the question of which dimension/factor is reflected by the validity. It is desirable to find the validity of the main factor for which the scale was developed and also to derive the relationship between test reliability and test validity. Vaughan (1998) found lower validity where data contained predominantly high performers. To avoid such problems, structural validity of normally distributed transformed scores by PCA was preferred (Chakrabartty, 2020).
Different cut-off scores exist for different PROs. For example, the cut-off score of the Sickness Impact Profile (SIP136), with 136 "Yes-No" type items distributed over 12 domains, is 22, and for the Stroke-Adapted Sickness Impact Profile (SA-SIP30), with 30 items covering 8 subscales, it is 33. A natural question is whether a score of 33 in SA-SIP30 is equivalent to a score of 22 in SIP136. Similarly, a score of 14 in ISI, indicating "no insomnia", is equivalent to which score in PSQI or ISQ? Thus, finding equivalent scores of two scales can allow better comparisons of the PROs and also help in classification of individuals. For QoL questionnaires, there may be no cut-off point to show better or worse QoL (Silva et al. 2014). Based on treatment status for the Cancer Core Questionnaire (EORTC QLQ-C30), four different cut-off scores were found (Lidington et al. 2022). Intra- and inter-observer reliability of ordinal scales like the Kessler Psychological Distress Scales (K6 and K10) are evaluated by Kappa and weighted Kappa, which have their own limitations in this context.
Let X_ij be the raw score of a respondent on the i-th item for choosing the j-th level, where the levels are marked as 1, 2, 3, 4, … avoiding zero, and a higher value of X_ij implies higher dysfunction or impairment. The suggested method transforms the ordinal item scores (X_ij) to equidistant scores (E) and further to proposed scores (P_i-scores) in the score range [1, 100] following a normal distribution, facilitating meaningful addition to derive the scale score (P_Scale) as the sum of P_i-scores. The method is described below.
For the i-th item, find the maximum frequency (f_max) and the minimum frequency (f_min) of the response levels.
For n respondents to a 5-point item (say), find the initial weight of the first level (ω_i1) and the common difference (d_i), determined from f_max, f_min and n, and take the other initial weights as ω_ij = ω_i1 + (j − 1)d_i for j = 2, 3, 4, 5.
Take the final weights as W_ij = ω_ij / ∑_j ω_ij. Here, ∑_j W_ij = 1 and the W_ij's form an arithmetic progression. The generated scores E_ij obtained from the final weights are continuous, monotonic and equidistant.
Standardize the equidistant scores (E) of each item as Z-scores following N(0,1) and convert the Z-scores to the proposed P_i-scores in the range [1, 100] by a linear transformation, so that each P_i follows a normal distribution.
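As a minimal illustration (not part of the original description), the pipeline can be sketched in Python as below. The exact formulas for the initial weight and the common difference are not reproduced above, so they are passed in as arguments, and the mapping of Z-scores to [1, 100] is assumed to be a simple min-max linear transformation; all function and variable names are hypothetical.

```python
import numpy as np

def equidistant_scores(raw, n_levels, omega1, d):
    """Map ordinal levels 1..K of one item to equidistant scores.

    raw    : array of raw level choices (integers 1..K) for n respondents
    omega1 : initial weight of the first level (supplied by the caller)
    d      : common difference of the arithmetic progression of initial weights
    """
    levels = np.arange(1, n_levels + 1)
    omega = omega1 + (levels - 1) * d        # initial weights form an arithmetic progression
    W = omega / omega.sum()                  # final weights; sum(W) = 1, still an AP
    return W[raw - 1]                        # equidistant score of each respondent

def proposed_scores(E):
    """Standardize equidistant scores to Z ~ N(0,1), then map linearly to [1, 100]."""
    Z = (E - E.mean()) / E.std(ddof=1)
    return 1 + 99 * (Z - Z.min()) / (Z.max() - Z.min())

# toy example: one 5-point item answered by 8 respondents
raw = np.array([1, 2, 2, 3, 4, 5, 5, 5])
E = equidistant_scores(raw, n_levels=5, omega1=1.0, d=0.5)   # omega1, d: illustrative values only
P_item = proposed_scores(E)                                  # P_i-scores for this item
# P_Scale for a respondent would be the sum of P_i over all items of the scale
print(np.round(P_item, 2))
```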
Parameters of the distributions of P_Scale-scores and B-scores can be estimated from data. Normality enables estimation of the population mean (μ) and population variance (σ²), construction of confidence intervals for μ, and testing of statistical hypotheses like H₀: μ₁ = μ₂ or H₀: σ₁² = σ₂², etc.
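As an illustration of such parametric analysis (assuming the P_Scale-scores of two hypothetical groups are available as arrays; the data below are simulated), a confidence interval and a test of H₀: μ₁ = μ₂ could be computed as:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(55, 10, 40)   # simulated P_Scale-scores, group A
group_b = rng.normal(60, 10, 40)   # simulated P_Scale-scores, group B

# 95% confidence interval for the mean of group A
mean_a = group_a.mean()
sem_a = stats.sem(group_a)
ci = stats.t.interval(0.95, df=len(group_a) - 1, loc=mean_a, scale=sem_a)

# test H0: mu_A = mu_B
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(ci, t_stat, p_value)
```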
Based on battery scores, progress of the i-th patient in the t-th period over the previous period is given by the ratio B_t / B_{t−1}. Decline is indicated if B_t / B_{t−1} < 1. For a group of patients, mean(B_t) / mean(B_{t−1}) > 1 indicates progress. Similarly, progress with respect to P_Scale-scores can be computed. Decline, if any, may be probed to find the critical scale(s) for which P_Scale(t) / P_Scale(t−1) < 1, and appropriate corrective actions in the treatment and management plan can be initiated. Statistical tests of significance of progress/deterioration can be made, since the ratio of two normally distributed variables follows a Cauchy distribution.
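A brief sketch of these progress ratios on simulated B-scores (values and names hypothetical):

```python
import numpy as np

# B-scores of the same patients in two successive periods (simulated)
B_prev = np.array([220.0, 250.0, 190.0, 300.0])
B_curr = np.array([240.0, 245.0, 210.0, 330.0])

ratio = B_curr / B_prev                        # per-patient progress; < 1 indicates decline
group_ratio = B_curr.mean() / B_prev.mean()    # > 1 indicates progress for the group
print(ratio.round(3), round(group_ratio, 3))
```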
The effect of a small change in the i-th scale (P_Scale(i)) on the battery score (B-scores) can be quantified by considering elasticity, i.e. the percentage change of B-scores due to a small change in P_Scale(i). The scales can be ranked on the basis of such elasticity. Elasticity studies in economics and reliability engineering consider models like log Q_jt = α + β log(P_jt / P_t) + ε, where Q_jt denotes the quantity demanded of the j-th industry at time t and P_jt / P_t is the industry price relative to the price index of the economy. However, for normally distributed P_Scale-scores and B-scores, logarithmic transformations are not required to fit a regression equation of the form
B = β₀ + β₁P_Scale(1) + β₂P_Scale(2) + … + β_K P_Scale(K) + ε
The coefficient β_i reflects the impact of a unit change in the independent variable (the i-th dimension) on the dependent variable (B-scores).
Policy makers can decide on appropriate actions in terms of continuing efforts towards the scales with high values of elasticity and taking corrective actions for the dimensions with lower elasticity, i.e. the areas of concern.
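A sketch of fitting such a regression and ranking the scales by elasticity, using ordinary least squares on simulated data (the point-elasticity-at-means formula β_i·mean(P_i)/mean(B) is an assumed operationalization; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 200, 4
P = rng.normal(50, 10, size=(n, K))            # P_Scale-scores of K constituent scales
B = P.sum(axis=1) + rng.normal(0, 5, size=n)   # B-scores as sum of scale scores plus noise

X = np.column_stack([np.ones(n), P])           # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, B, rcond=None)   # beta[0] = intercept, beta[1:] = slopes

# elasticity of B with respect to each scale, evaluated at the means
elasticity = beta[1:] * P.mean(axis=0) / B.mean()
ranking = np.argsort(-elasticity)              # scales ranked by elasticity, highest first
print(beta.round(3), elasticity.round(3), ranking)
```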
Normality of B-scores facilitates testing H₀: μ_B(t) = μ_B(t−1), reflecting the effectiveness of the treatment plans, and H₀: μ_B(t) − μ_B(t−1) = 0, reflecting progression.
A graph depicting the progress/decline of one patient, or of a group of patients with a similar socio-demographic profile, is analogous to a hazard function and helps to identify high-risk groups and to compare responses to treatment from the start.
For two scales X and Y with normal pdfs f_X(x) and f_Y(y) respectively, the equivalent score y₀ for a given value, say x₀, can be found by solving the equation ∫_{−∞}^{x₀} f_X(x) dx = ∫_{−∞}^{y₀} f_Y(y) dy using the standard normal table, even if the scales have different lengths and widths (Chakrabartty, 2021).
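Under normality, this amounts to matching cumulative areas, so y₀ = μ_Y + σ_Y(x₀ − μ_X)/σ_X. A minimal sketch (parameter values hypothetical):

```python
from scipy.stats import norm

def equivalent_score(x0, mu_x, sd_x, mu_y, sd_y):
    """Find y0 on scale Y whose cumulative area equals that of x0 on scale X."""
    p = norm.cdf(x0, loc=mu_x, scale=sd_x)     # area under the X-curve up to x0
    return norm.ppf(p, loc=mu_y, scale=sd_y)   # y0 with the same area under the Y-curve

# e.g. which score on scale Y corresponds to a score of 22 on scale X?
print(equivalent_score(22, mu_x=20, sd_x=5, mu_y=30, sd_y=8))
```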
P-scores and B-scores following normal distributions satisfy the assumptions of PCA and FA and enable finding the factorial validity (FV) = λ₁ / ∑_i λ_i, i.e. the proportion of total variance explained by the first principal component, where λ₁, the highest eigenvalue of the correlation matrix, indicates validity for the main factor being measured (Parkerson et al. 2013). The significance of λ₁ can be tested using the Tracy–Widom (TW) test statistic U = (λ₁ − μ_np) / σ_np, with appropriate centering and scaling constants μ_np and σ_np, which follows the TW distribution (Nadler, 2011). Such FV avoids the problems of construct validity and of selecting a criterion scale (which requires matching constructs and two administrations, of the scale and of the criterion scale).
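A sketch of computing FV from the correlation matrix of the constituent scales on simulated data (the Tracy–Widom centering and scaling constants are omitted; names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
scores = rng.normal(size=(300, 5))     # P-scores on 5 scales (simulated)
scores[:, 1:] += scores[:, [0]]        # induce a dominant common factor

R = np.corrcoef(scores, rowvar=False)  # correlation matrix of the scales
eigvals = np.linalg.eigvalsh(R)[::-1]  # eigenvalues, largest first

FV = eigvals[0] / eigvals.sum()        # proportion of variance due to the first PC
print(round(FV, 3))
```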
For standardized item scores Z_i, the score of a test with m items is Z = ∑_i Z_i and the test variance S_Z² can be written as
S_Z² = m + 2∑∑_{i<j} r_ij    (1)
Equation (1) can be used to find the theoretical reliability
r_tt = (m/(m − 1))(1 − m/S_Z²) = (m/(m − 1)) · 2∑∑_{i<j} r_ij / (m + 2∑∑_{i<j} r_ij)    (2)
Equation (2) gives the relationship between r_tt and factorial validity, which is non-linear.
Ten Berge and Hofstee (1999) suggested the maximum reliability of a test, α_max, which can be derived from the correlation matrix of the m items by
α_max = (m/(m − 1))(1 − 1/λ₁)    (3)
The relationship between FV and α_max can be derived, noting that ∑_i λ_i = m for standardized scores so that λ₁ = m·FV, as:
α_max = (m/(m − 1))(1 − 1/λ₁) = (m/(m − 1))(1 − 1/(m·FV))    (4)
As per (4), a higher value of FV increases α_max.
Cronbach's alpha of a battery consisting of K scales can be obtained as a function of the scale reliabilities by
α_Battery = 1 − ∑_{i=1}^{K} S_i²(1 − r_i) / S_B²    (5)
where r_i and S_i denote respectively the reliability and SD of the i-th scale, and S_B² denotes the variance of the battery scores.
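A sketch tying equations (3) to (5) together on simulated, illustrative values (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 6
items = rng.normal(size=(250, m)) + rng.normal(size=(250, 1))  # m items with a common factor

R = np.corrcoef(items, rowvar=False)
lam1 = np.linalg.eigvalsh(R)[-1]             # largest eigenvalue of the correlation matrix
FV = lam1 / m                                # factorial validity for standardized items
alpha_max = (m / (m - 1)) * (1 - 1 / lam1)   # equation (3); equals (m/(m-1))*(1 - 1/(m*FV))

# equation (5): battery alpha from K scale reliabilities and SDs (illustrative numbers)
r_scale = np.array([0.82, 0.78, 0.85])       # reliabilities of K = 3 scales
sd_scale = np.array([6.0, 5.2, 7.1])         # SDs of the scales
var_battery = 260.0                          # variance of battery scores
alpha_battery = 1 - np.sum(sd_scale**2 * (1 - r_scale)) / var_battery
print(round(alpha_max, 3), round(alpha_battery, 3))
```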
The suggested method defines meaningful scale scores and battery scores for each individual. Each of the P_Scale-scores and B-scores satisfies desired properties, permits parametric analysis, and facilitates comparing the status and progression of patients, including indication of the effectiveness of treatment plans and the finding of equivalent scores of two patient-reported scales (PROs), where the area under the normal curve corresponding to PRO-1 up to x₀ = the area under the normal curve corresponding to PRO-2 up to y₀. For classification of individuals, equivalent cut-off scores of class boundaries may be found satisfying the same equal-area condition, which may facilitate similar efficiency of classification in terms of within-group variance and between-group variance.
Factorial validity (FV), reflecting the main factor being measured, helps to have a clear understanding of the most important factor being measured. However, establishing clinically meaningful content validity is a vital step. The maximum value of test reliability (α_max), the relationship between r_tt and FV, and also that between α_max and FV can be used effectively to compare scales. The scales with eigenvalues exceeding unity can be retained, keeping in view that results may get distorted by wrong selection of constituent scales.
6. Conclusions:
The suggested B-scores reflecting disease severity with respect to the PRO measures are recommended, with the scales chosen as per the selection criteria mentioned above. Future empirical investigations may be undertaken to evaluate the properties of the suggested method and its clinical validation, along with the effects of socio-demographic factors.
Declarations:
Acknowledgement: Nil
Conflicts Of Interest/Competing Interests: The author has no conflicts of interest to declare
Funding: Did not receive any grant from funding agencies in the public, commercial, or not-for-profit sectors.
Ethical approval: Not applicable since the paper does not involve human participants.
Consent of the participants: Not applicable since the paper does not involve data from human participants
Data Availability Statement: The paper did not use any datasets
Code availability: No application of software package or custom code
CRediT statement: Conceptualization; Methodology; Analysis; Writing and editing the paper by the Sole Author