[Updated: Tue, Sep 10, 2024 - 12:54:23]

Acknowledgement

This tutorial is a product of a research project funded by Duolingo, Inc., through its Competitive Research Grant Program, which supports topics of interest related to the Duolingo English Test’s ongoing research agenda.

For the full paper, see:

Zopluoglu, C., & Lockwood, J. R. (2024). A Comparative Study of Item Response Theory Models for Mixed Discrete-Continuous Responses. Journal of Intelligence, 12(3), 26. https://doi.org/10.3390/jintelligence12030026

Introduction

This tutorial offers a comprehensive introduction to the analysis presented in Zopluoglu and Lockwood (2024). Our paper discusses three model families based on the Beta, Simplex, and \(S_B\) distributions, initially introduced by Molenaar et al. (2022). In our study, we extend these models by integrating auxiliary variables, allowing us to predict the latent trait through latent regression. Additionally, we explore the application of these models in an extreme scenario where 99.8% of the observed response data matrix is missing as a result of the assessment’s adaptive design.

The study’s original dataset comprises 295,157 test sessions from 222,568 distinct test takers who responded to a total of 2,738 items. Each participant in the dataset answered between 4 and 7 dictation items. Notably, the vast majority (91.6%) responded to 6 items, resulting in a collective 1,789,297 responses to the 2,738 dictation items.

The analysis of the original dataset was carried out on Talapas, the University of Oregon’s high-performance computing cluster, a choice driven by the dataset’s size and the complexity of the models. For demonstration purposes, we created smaller-scale simulated datasets so that anyone can replicate the analysis on a personal computer. This tutorial therefore uses simulated datasets that mirror the structure of the original dataset. The simulated data also give us a chance to verify that the Stan model syntax is properly written, since the true parameters are known.

Given the similarity of the code across all three models, this tutorial only provides a demonstration for the Beta IRT model. To replicate the analysis for the Simplex and \(S_B\) IRT models, you can find the code in the GitHub repository: the code for the Simplex model is here, and the code for the \(S_B\) IRT model is here.

Importing the Dataset and Basic EDA

We begin by importing the simulated dataset for the Beta IRT model, available in both long and wide formats. The code used to generate this dataset can be accessed here.

require(here)

d_wide <- read.csv(here('data/beta_wide.csv'))
d_long <- read.csv(here('data/beta_long.csv'))

head(d_long)
  id          w          s   true_theta item      resp
1  8 -0.2296408  1.0206258  1.111173406    1 0.9920787
2 11  0.6977469  0.6609727  0.305028689    1 0.9872080
3 13  0.5893113 -0.4885882 -0.641392773    1 0.9467020
4 14 -0.8038206 -0.5944645  0.067733055    1 1.0000000
5 20 -0.1666584 -0.6297235  0.003820428    1 0.8751422
6 23 -0.2445903  0.4465533 -0.517746735    1 0.8143535

Dataset Overview:

  • Test Takers: 1,000 individuals (person ids are found in column id).
  • Test Items: A total of 50 items (item ids are found in column item).
  • Responses: Each test taker answers only 10 out of the 50 items (found in column resp).
  • Auxiliary variables: Each test taker is assigned a writing score (found in column w) and a speaking score (found in column s).

For demonstration purposes, the simulated dataset has a sparsity of 80% missing data, which is less extreme than the 99.8% missing data in the original study dataset. The true person parameters are included in the simulated dataset under the column true_theta. We will reference this column later to compare the true parameters with the estimated ones.
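The sparsity can be confirmed directly from the wide-format data. This is a quick sanity check, assuming the 50 item-response columns occupy the first 50 columns of d_wide (the same assumption made by the code later in this tutorial):

mean(is.na(d_wide[,1:50]))   # proportion of missing cells; should be about 0.80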

Let’s do a quick check on the descriptive statistics for each item.

require(psych)

desc <- describeBy(d_long$resp,d_long$item,mat=TRUE)
desc <- desc[,c('item','n','mean','sd','min','max','skew','kurtosis')]
desc[,3:8] <- round(desc[,3:8],2)
rownames(desc) <- NULL
desc
   item   n mean   sd min max  skew kurtosis
1     1 206 0.82 0.34   0   1 -1.92     1.95
2     2 204 0.63 0.43   0   1 -0.69    -1.43
3     3 190 0.89 0.21   0   1 -3.41    11.31
4     4 200 0.58 0.44   0   1 -0.44    -1.66
5     5 227 0.81 0.34   0   1 -1.57     0.90
6     6 190 0.88 0.27   0   1 -2.68     5.91
7     7 201 0.83 0.28   0   1 -2.07     3.25
8     8 212 0.87 0.27   0   1 -2.71     5.84
9     9 190 0.77 0.33   0   1 -1.55     1.07
10   10 188 0.12 0.32   0   1  2.17     2.86
11   11 184 0.77 0.37   0   1 -1.52     0.49
12   12 206 0.61 0.44   0   1 -0.60    -1.53
13   13 194 0.87 0.26   0   1 -2.78     6.51
14   14 188 0.58 0.45   0   1 -0.44    -1.71
15   15 193 0.84 0.33   0   1 -1.82     1.71
16   16 205 0.87 0.30   0   1 -2.38     4.01
17   17 194 0.85 0.25   0   1 -2.45     5.41
18   18 217 0.87 0.28   0   1 -2.57     5.03
19   19 194 0.71 0.38   0   1 -1.14    -0.36
20   20 194 0.12 0.32   0   1  2.22     3.03
21   21 203 0.81 0.34   0   1 -1.83     1.68
22   22 203 0.60 0.45   0   1 -0.53    -1.65
23   23 208 0.87 0.25   0   1 -2.89     7.11
24   24 180 0.66 0.43   0   1 -0.78    -1.23
25   25 183 0.81 0.34   0   1 -1.53     0.74
26   26 216 0.90 0.24   0   1 -3.02     8.24
27   27 214 0.87 0.21   0   1 -2.78     8.15
28   28 197 0.87 0.26   0   1 -2.69     6.01
29   29 219 0.75 0.35   0   1 -1.42     0.50
30   30 200 0.13 0.31   0   1  2.13     2.69
31   31 201 0.80 0.36   0   1 -1.67     0.96
32   32 204 0.66 0.43   0   1 -0.85    -1.21
33   33 179 0.86 0.26   0   1 -2.71     6.09
34   34 217 0.64 0.42   0   1 -0.74    -1.24
35   35 205 0.82 0.32   0   1 -1.64     1.22
36   36 193 0.88 0.28   0   1 -2.63     5.49
37   37 169 0.83 0.28   0   1 -2.25     3.85
38   38 209 0.87 0.27   0   1 -2.64     5.60
39   39 205 0.76 0.34   0   1 -1.51     0.83
40   40 188 0.11 0.30   0   1  2.35     3.64
41   41 198 0.80 0.33   0   1 -1.83     1.71
42   42 188 0.61 0.45   0   1 -0.56    -1.63
43   43 198 0.86 0.26   0   1 -2.75     6.15
44   44 209 0.64 0.42   0   1 -0.71    -1.30
45   45 240 0.81 0.33   0   1 -1.55     0.89
46   46 187 0.90 0.25   0   1 -3.00     7.96
47   47 211 0.85 0.26   0   1 -2.45     4.97
48   48 179 0.85 0.29   0   1 -2.39     4.23
49   49 207 0.75 0.33   0   1 -1.42     0.64
50   50 213 0.11 0.30   0   1  2.38     3.87

The following provides a quick look at the distribution of scores for nine randomly selected items.

require(ggplot2)

set.seed(9122023)

sub <- sample(1:50,9)
sub <- d_long[d_long$item%in%sub,]

labs <- setNames(paste0('Item ',unique(sub$item)),unique(sub$item))
               
ggplot(sub, aes(x=resp)) + 
  geom_histogram(aes(y=after_stat(density))) + 
  geom_density(linewidth=0.1) + 
  theme_minimal()+
  facet_wrap(~ item,
             labeller = as_labeller(labs))+
   theme(strip.background = element_rect(fill = "#33C1FF"))

Item Network Analysis

Given the dataset’s sparsity, it’s essential to assess how items relate to one another and determine whether there is a structured interconnection that allows all items to be scaled together meaningfully. With 1,225 potential item pairs, calculated as \(\frac{50\times 49}{2}\), the number of available responses for any given pair ranges from 21 to 55, with an average of about 37 responses per pair. As the network analysis below shows, no group of items exists in isolation, ensuring that all items can be meaningfully linked and scaled together.

item_pairs <- as.data.frame(t(combn(1:50,2)))
item_pairs$count <- NA

for(i in 1:nrow(item_pairs)){
  item_pairs[i,3] <- sum(!is.na(d_wide[,item_pairs[i,1]]) & 
                           !is.na(d_wide[,item_pairs[i,2]]))

}

head(item_pairs)
  V1 V2 count
1  1  2    26
2  1  3    35
3  1  4    42
4  1  5    44
5  1  6    35
6  1  7    45
describe(item_pairs$count)
   vars    n  mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 1225 36.73 5.89     37   36.65 5.93  21  55    34 0.15    -0.09 0.17
require(igraph)

nodes <- unique(c(item_pairs[,1],item_pairs[,2]))

g <- graph_from_data_frame(d = item_pairs,
                           vertices=nodes,
                           directed = FALSE)

par(mar = c(0, 0, 0, 0))

plot(g,
     edge.arrow.size=.1,
     vertex.label=NA,
     vertex.size=3)
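Since the claim of interest is that no group of items is isolated, the connectivity of the graph object can also be checked directly with igraph:

is_connected(g)       # TRUE if every item is linked to every other item
count_components(g)   # a single connected component is expected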

Unidimensionality

Pearson correlations are probably not the ideal choice for zero-and-one inflated bounded continuous data. It’s also important to note that the item correlations are derived from a limited amount of data available for each item pair. This means they might not be estimated with high reliability. However, these correlations can still provide an initial insight into the eigenvalues and assist in generating a scree plot.

It looks like we have a very large first eigenvalue, and the remaining eigenvalues show no pattern suggesting a second or additional dimensions.

corr       <- cor(d_wide[,1:50],use='pairwise')
diag(corr) <- smc(corr)
eigens     <- eigen(corr)$values


ggplot() + 
  geom_point(aes(x=1:50,y=eigens))+
  theme_minimal()
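As a rough numerical summary to accompany the scree plot, the share of the total (reduced) variance captured by the first eigenvalue can be computed directly:

eigens[1]/sum(eigens)   # proportion attributable to the first dimension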

Model Description

This section provides a description of the Beta IRT model, which integrates auxiliary variables through latent regression to handle zero-and-one inflated bounded continuous response data. Descriptions for the Simplex and \(S_B\) IRT models are detailed in our paper.

Let \(X_{pi}\) denote the continuous bounded item scores such that \(X_{pi}\in[0,1]\) for the \(p^{th}\) person, \(p=1,\ldots,P\), on the \(i^{th}\) item, \(i=1,\ldots,I\). We can define a discrete variable \(Z_{pi}\), representing three possible conditions, as follows: \[Z_{pi}=\left\{\begin{matrix}0,&if\ X_{pi}=0\\1,&if\ 0<X_{pi}<1\\2,&if\ X_{pi}=1\\\end{matrix}\right.\] A logistic graded response model (Samejima, 1969) can be written for modeling \(Z_{pi}\) such that,

\[P\left(Z_{pi}=0\middle|\theta_p,\alpha_i,\gamma_{0i}\right)=\frac{1}{1+e^{\alpha_i\theta_p-\gamma_{0i}}}\]

\[P\left(Z_{pi}=1\middle|\theta_p,\alpha_i,\gamma_{0i},\gamma_{1i}\right)=\frac{1}{1+e^{\alpha_i\theta_p-\gamma_{1i}}}-\frac{1}{1+e^{\alpha_i\theta_p-\gamma_{0i}}}\]

\[P\left(Z_{pi}=2\middle|\theta_p,\alpha_i,\gamma_{1i}\right)=\frac{e^{\alpha_i\theta_p-\gamma_{1i}}}{1+e^{\alpha_i\theta_p-\gamma_{1i}}}\],

where \(\theta_p\in\ R\) is a latent person parameter, \(\alpha_i\ \in\ R^+\) is an item discrimination parameter, and \(\gamma_{0i}\in\ R\) and \(\gamma_{1i}\in\ R\) are category threshold parameters satisfying \(\gamma_{0i}<\gamma_{1i}\).

Then, the joint conditional density for the model, which is denoted by \(k(.)\), can be written as the following:

\[k\left(X_{pi}\middle|\theta_p,\alpha_i,\gamma_{0i}\right)=P\left(Z_{pi}=0\middle|\theta_p,\alpha_i,\gamma_{0i}\right)\]

\[k\left(X_{pi}\middle|\theta_p,\alpha_i,\gamma_{0i},\gamma_{1i},\beta_i,\omega_i\right)=P\left(Z_{pi}=1\middle|\theta_p,\alpha_i,\gamma_{0i},\gamma_{1i}\right)\times\ f\left(X_{pi}\middle|\theta_p,\alpha_i,\beta_i,\omega_i\right)\]

\[k\left(X_{pi}\middle|\theta_p,\alpha_i,\gamma_{1i}\right)=P\left(Z_{pi}=2\middle|\theta_p,\alpha_i,\gamma_{1i}\right),\]

where \(f\left(X_{pi}\middle|\theta_p,\alpha_i,\beta_i,\omega_i\right)\) corresponds to the model-specific density function with support on the open interval (0,1).

For the Beta IRT model, \[f\left(X_{pi}\middle|\theta_p,\alpha_i,\beta_i,\omega_i\right)=\frac{\Gamma\left(a_{pi}+b_{pi}\right)}{\Gamma\left(a_{pi}\right)\Gamma\left(b_{pi}\right)}X_{pi}^{a_{pi}-1}\left(1-X_{pi}\right)^{b_{pi}-1},\]

where \(\Gamma\left(\cdot\right)\) is the gamma function defined by

\[\Gamma\left(d\right)=\int_{0}^{\infty}t^{d-1}e^{-t}dt\], and

\[a_{pi}=e^\frac{\alpha_i\theta_p+\beta_i+\omega_i}{2}\] \[b_{pi}=e^\frac{-\left(\alpha_i\theta_p+\beta_i\right)+\omega_i}{2},\]

where \(\beta_i\in\ R\) is an item location parameter, and \(\omega_i\in\ R^+\) is an item dispersion parameter. Note that the probability distribution of \(X_{pi}\) is a mixture of a discrete distribution on {0,1} and a continuous distribution on the open interval (0,1).
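To make this parameterization concrete, below is a minimal R sketch of the conditional density \(k(\cdot)\) for a single response, written directly from the formulas above. The function name is hypothetical and the parameter values in the example call are for illustration only (they match the first row of the generating parameters used later in this tutorial); this is not the estimation code.

k_beta_irt <- function(x, theta, alpha, beta, omega, gamma0, gamma1){
  
  p0 <- 1/(1 + exp(alpha*theta - gamma0))       # P(Z = 0), response of exactly 0
  p2 <- 1 - 1/(1 + exp(alpha*theta - gamma1))   # P(Z = 2), response of exactly 1
  p1 <- 1 - p0 - p2                             # P(Z = 1), response in (0,1)
  
  if(x == 0) return(p0)
  if(x == 1) return(p2)
  
  a <- exp(( alpha*theta + beta + omega)/2)     # Beta shape parameter a_pi
  b <- exp((-(alpha*theta + beta) + omega)/2)   # Beta shape parameter b_pi
  
  p1*dbeta(x, a, b)
}

k_beta_irt(x = 0.85, theta = 0.5, alpha = 0.9, beta = 2.5,
           omega = 4.3, gamma0 = -2, gamma1 = 0.5)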

The model above can be extended by proposing a linear regression model of \(\theta_p\) on the auxiliary variables,

\[\theta_p=\xi_1W_p+\xi_2S_p+\epsilon_p,\]

where \(W_p\) and \(S_p\) are the observed writing and speaking assessment scores for the \(p^{th}\) examinee, \(\xi_1\) and \(\xi_2\) are the associated regression coefficients and \(\epsilon_p\) is the error term. Both writing and speaking scores were standardized prior to model fitting, so they have a mean of zero and unit variance.

When fitting the model, we set the prior for the error term in the regression model as

\[\epsilon\sim\ N\left(0,\sigma^2\right)\]

\[\sigma^2=1-\left(\xi_1^2+\xi_2^2+2\xi_1\xi_2r\right),\]

where \(r\) is the observed correlation between \(W\) and \(S\). Because \(W\) and \(S\) are both standardized prior to model fitting, this choice implies a standard normal marginal distribution for the latent trait, \(\theta \sim N(0,1)\).
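As a small illustration of this prior, the residual standard deviation implied by a given pair of coefficients can be computed directly from the data. The values of \(\xi_1\) and \(\xi_2\) below are the generating values used in the simulation (see the latent regression results later in this tutorial):

xi1 <- 0.3
xi2 <- 0.4
r   <- cor(d_wide$w, d_wide$s)                    # observed correlation between W and S
sigma <- sqrt(1 - (xi1^2 + xi2^2 + 2*xi1*xi2*r))  # residual SD so that Var(theta) = 1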

Model Estimation

Cross-validation

When fitting these models, we employed a cross-validation approach. In this method, the data is randomly divided into training and test sets. The model is trained on the training set, yielding posterior distributions for both person and item parameters. These posterior distributions of model parameters are then used to generate model-based predictions for observations in the unseen test set. The model’s effectiveness is evaluated based on its ability to accurately predict certain aspects of the unseen data.

The inspiration for this approach comes from a paper by Stenhaug and Domingue (2022). Implementing it in Stan is straightforward, provided the distribution in question has a built-in random number generator in Stan, as is the case for the Beta and \(S_B\) models. For the Simplex model, however, the predictions have to be generated externally, since Stan currently lacks a built-in random number generator for the Simplex distribution. You can find the necessary R script to do this after fitting the model here.

We create a random split of data for future use.

  set.seed(692023)

  # Use the long-format data for model fitting
  d <- d_long

  # Randomly select 10% of data for test set

  loc <- sample(1:nrow(d),nrow(d)*.10)
  
  d_test  <- d[loc,]
  d_train <- d[-loc,]
  
  dim(d_train)
  dim(d_test)
  
  head(d_train)
  head(d_test)

Stan Input Data

The following prepares the data input for the Stan model syntax. Y and Y2 refer to responses from the training and test sets, respectively. The remaining elements are the inputs required by the model syntax.

The Stan model syntax for the Beta version of the model can be found here.

  data_rt <- list(I              = length(unique(d_train$item)),
                  P              = length(unique(d_train$id)),
                  n_obs          = nrow(d_train),
                  id             = d_train$id,
                  item           = d_train$item,
                  W              = d_wide$w,
                  S              = d_wide$s,
                  r              = cor(d_wide[,'s'],d_wide[,'w']),
                  Y              = d_train$resp,
                  n_obs2         = nrow(d_test),
                  Y2             = d_test$resp,
                  item2          = d_test$item,
                  id2            = d_test$id)

Start Values

We create the objects that provide starting values for Stan. Although this is not strictly necessary, it may help with estimation under extreme conditions (like those in the original dataset).

  nit <- length(unique(d$item))
  N   <- length(unique(d$id))

  start <- list(ln_alpha= rnorm(nit,0,.2),        # log of item discriminations
                omega   = rnorm(nit,0,.2),        # item dispersions
                beta    = rnorm(nit,0,.5),        # item locations
                theta   = rnorm(N),               # person parameters
                gamma0  =-abs(rnorm(nit,1,.2)),   # lower category thresholds
                gamma1  = abs(rnorm(nit,1,.2)),   # upper category thresholds
                a = 0.25,                         # latent regression coefficient (xi1)
                b = 0.25)                         # latent regression coefficient (xi2)

Compile the model

We compile the model. This step needs to be done only once; it creates a compiled executable (e.g., beta.exe on Windows) in the respective folder. Unless you modify the model syntax file (beta.stan), you don’t have to repeat this step. If you do modify the model syntax file, the model needs to be recompiled.

require(cmdstanr)
require(rstan)

  mod  <- cmdstan_model('./script/beta_irt/beta.stan')
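Two optional sanity checks after compilation (both are standard cmdstanr methods):

  mod$print()      # print the Stan model syntax that was compiled
  mod$exe_file()   # path to the compiled executable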

Fit the model

We fit the model using four chains, each consisting of 1,000 iterations. Of these, the initial 250 warmup draws are discarded, leaving the subsequent 750 draws per chain (3,000 draws in total) for inference.

 fit <- mod$sample(data            = data_rt,
                    seed            = 1234,
                    chains          = 4,
                    parallel_chains = 4,
                    iter_warmup     = 250,
                    iter_sampling   = 750,
                    refresh         = 100,
                    init            = list(start,start,start,start))

  fit$cmdstan_summary()
  
  stanfit <- rstan::read_stan_csv(fit$output_files())
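As an optional quick check before inspecting the parameter summaries, cmdstanr’s built-in sampler diagnostics can flag problems such as divergent transitions:

  fit$diagnostic_summary()   # divergences, max treedepth hits, and E-BFMI per chain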

Save the output

For efficiency, it’s advisable to save the model output as an .RData file for future processing. Without this step, you might end up re-fitting the model just to restore the output, a task that can be notably time-consuming with extensive datasets.

We store these files in the “output” folder. However, due to their substantial size, this folder is not uploaded to the GitHub repository and is retained locally.

  rm(list=ls()[!ls()%in%c("stanfit","d_wide","d","d_train","d_test")])
  
  save.image("./output/beta.RData")
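In a later session, the saved objects can then be restored without re-fitting the model:

  load("./output/beta.RData")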

Model Evaluation

Item and Person Parameter Estimates

Given that we’re working with a simulated dataset, our initial step is to compare the true parameters used for data generation with their estimated values. It’s important to understand that these parameters might not be estimated with high precision and accuracy for two primary reasons:

  • Each item only has approximately 200 observations, and the model we’re fitting is a complex one. Such a sample size might not be sufficiently large to reliably determine the model parameters.

  • The sparse person-item structure could further complicate the estimation process.

The true item parameters used to generate the data can be found at the top of the file 0_simulate.r.

  ipar <- data.frame(gamma0 = rep(c(-2,-.9,-2.6,-.8,-3,-2.5,-2.9,-2.6,-1.8,2),5),
                     gamma1 = rep(c(0.5,1.9,1.6,1,-1,-0.5,.9,.6,.8,3),5),
                     beta   = rep(c(2.5,2.2,2.6,1.5,1.2,2.5,2.2,2.6,1.5,1.2),5),
                     alpha  = rep(c(0.9,0.9,0.7,0.6,1.2,0.9,0.9,0.7,0.6,0.7),5),
                     omega  = rep(c(4.3,3.6,4.1,2.6,2.7,4.3,3.6,4.1,2.6,2.7),5))

  head(ipar)
  gamma0 gamma1 beta alpha omega
1   -2.0    0.5  2.5   0.9   4.3
2   -0.9    1.9  2.2   0.9   3.6
3   -2.6    1.6  2.6   0.7   4.1
4   -0.8    1.0  1.5   0.6   2.6
5   -3.0   -1.0  1.2   1.2   2.7
6   -2.5   -0.5  2.5   0.9   4.3

Item location parameters (\(\beta\))

The \(\hat{R}\) values range between 0.999 and 1.004, suggesting satisfactory convergence. For all items except Item 29, the 95% credible intervals include the true parameter value. The accompanying plot indicates that the parameter estimates align closely with the true values. Note the discrete pattern in this plot, which arises because the true item parameter values repeat the same set of values ten times.

beta <- as.data.frame(summary(stanfit, 
                              pars = c("beta"), 
                              probs = c(0.025, 0.975))$summary)
beta$true_beta <- ipar$beta
round(beta,3)
          mean se_mean    sd  2.5% 97.5%    n_eff  Rhat true_beta
beta[1]  2.513   0.003 0.105 2.303 2.713 1560.433 0.999       2.5
beta[2]  2.260   0.002 0.087 2.091 2.430 1738.110 1.001       2.2
beta[3]  2.607   0.002 0.071 2.469 2.747 1430.908 1.002       2.6
beta[4]  1.532   0.002 0.114 1.307 1.760 2179.813 1.001       1.5
beta[5]  1.167   0.005 0.174 0.829 1.528 1214.619 1.000       1.2
beta[6]  2.497   0.003 0.126 2.256 2.756 1323.151 1.004       2.5
beta[7]  2.336   0.003 0.109 2.119 2.556 1404.682 1.003       2.2
beta[8]  2.705   0.002 0.077 2.557 2.857 2085.582 1.001       2.6
beta[9]  1.519   0.002 0.086 1.346 1.684 2724.344 1.000       1.5
beta[10] 0.997   0.012 0.322 0.358 1.599  777.547 1.001       1.2
beta[11] 2.620   0.002 0.090 2.443 2.795 1761.492 1.000       2.5
beta[12] 2.076   0.002 0.090 1.899 2.257 2532.362 1.000       2.2
beta[13] 2.681   0.002 0.080 2.527 2.835 1690.718 1.002       2.6
beta[14] 1.668   0.002 0.117 1.442 1.902 2928.582 1.000       1.5
beta[15] 1.073   0.005 0.177 0.730 1.426 1484.100 1.001       1.2
beta[16] 2.625   0.004 0.147 2.340 2.912 1564.001 1.000       2.5
beta[17] 2.136   0.002 0.084 1.969 2.293 1634.485 1.000       2.2
beta[18] 2.583   0.002 0.083 2.419 2.739 2343.124 0.999       2.6
beta[19] 1.617   0.002 0.097 1.428 1.811 2479.952 1.002       1.5
beta[20] 1.317   0.008 0.287 0.745 1.871 1256.285 1.004       1.2
beta[21] 2.426   0.002 0.100 2.235 2.622 1917.839 1.002       2.5
beta[22] 2.189   0.002 0.097 1.993 2.377 2556.906 1.000       2.2
beta[23] 2.625   0.002 0.074 2.479 2.766 2337.705 1.000       2.6
beta[24] 1.538   0.002 0.125 1.285 1.781 3929.291 0.999       1.5
beta[25] 1.183   0.005 0.194 0.820 1.556 1623.103 1.000       1.2
beta[26] 2.703   0.004 0.127 2.460 2.960  935.076 1.003       2.5
beta[27] 2.272   0.002 0.088 2.099 2.451 1513.121 1.001       2.2
beta[28] 2.509   0.002 0.093 2.326 2.694 1669.302 1.000       2.6
beta[29] 1.699   0.002 0.093 1.512 1.884 2869.049 0.999       1.5
beta[30] 1.097   0.004 0.249 0.592 1.573 3798.335 0.999       1.2
beta[31] 2.428   0.002 0.097 2.240 2.625 1673.594 1.001       2.5
beta[32] 2.286   0.002 0.079 2.127 2.434 2587.109 1.001       2.2
beta[33] 2.565   0.002 0.078 2.409 2.713 2251.528 1.001       2.6
beta[34] 1.578   0.002 0.098 1.385 1.769 2979.829 1.001       1.5
beta[35] 1.141   0.006 0.190 0.777 1.528 1121.054 1.001       1.2
beta[36] 2.492   0.004 0.135 2.227 2.763 1284.036 1.002       2.5
beta[37] 2.277   0.003 0.098 2.087 2.465 1369.885 1.003       2.2
beta[38] 2.641   0.002 0.088 2.471 2.817 2791.380 1.001       2.6
beta[39] 1.539   0.002 0.084 1.372 1.697 3024.733 0.999       1.5
beta[40] 1.379   0.005 0.225 0.922 1.824 1762.989 1.001       1.2
beta[41] 2.327   0.002 0.093 2.140 2.506 1430.469 1.002       2.5
beta[42] 2.387   0.002 0.099 2.190 2.573 2059.242 1.001       2.2
beta[43] 2.734   0.002 0.075 2.586 2.879 2105.289 1.001       2.6
beta[44] 1.399   0.002 0.102 1.204 1.603 3289.765 1.000       1.5
beta[45] 1.277   0.004 0.161 0.977 1.606 1653.183 1.000       1.2
beta[46] 2.374   0.003 0.111 2.159 2.590 1588.147 1.003       2.5
beta[47] 2.183   0.002 0.083 2.019 2.348 2184.960 1.001       2.2
beta[48] 2.568   0.002 0.103 2.364 2.761 1898.822 0.999       2.6
beta[49] 1.403   0.002 0.093 1.222 1.590 2838.486 1.000       1.5
beta[50] 0.863   0.008 0.313 0.237 1.470 1579.977 1.004       1.2
ggplot(beta, aes(x = true_beta, y = mean)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  theme_minimal() +
  ggtitle("Item Location Parameters") +
  xlab("True Value") +
  ylab("Posterior Mean Estimate")+
  xlim(0.5,3)+
  ylim(0.5,3)
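The coverage statement above can be verified with a quick check on the summary table (the interval columns are named 2.5% and 97.5%, as printed above); the same check applies to the other parameter blocks below:

covered <- beta$`2.5%` <= beta$true_beta & beta$true_beta <= beta$`97.5%`
sum(covered)      # 49 of 50 items covered
which(!covered)   # 29, matching Item 29 noted above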

Item discrimination parameters (\(\alpha\))

The \(\hat{R}\) values fall between 0.999 and 1.051, indicating generally satisfactory convergence, although a few items with low effective sample sizes (e.g., Items 20 and 50) converge less cleanly. For a majority of the items, specifically 46 out of 50, the 95% credible intervals include the true parameter value. The associated plot suggests that the parameter estimates correspond well with the true values.

alpha <- as.data.frame(summary(stanfit, 
                               pars = c("alpha"), 
                               probs = c(0.025, 0.975))$summary)

alpha$true_alpha <- ipar$alpha
round(alpha,3)
           mean se_mean    sd  2.5% 97.5%    n_eff  Rhat true_alpha
alpha[1]  0.843   0.003 0.106 0.640 1.058 1358.581 1.000        0.9
alpha[2]  1.121   0.003 0.108 0.921 1.347 1166.369 1.001        0.9
alpha[3]  0.822   0.002 0.078 0.672 0.975 1268.278 1.000        0.7
alpha[4]  0.794   0.003 0.116 0.572 1.028 1934.286 1.002        0.6
alpha[5]  1.314   0.005 0.163 1.001 1.649 1027.671 1.000        1.2
alpha[6]  0.842   0.003 0.113 0.627 1.072 1092.539 1.002        0.9
alpha[7]  1.121   0.003 0.115 0.902 1.352 1321.368 1.003        0.9
alpha[8]  0.609   0.002 0.074 0.464 0.758 2070.663 1.000        0.7
alpha[9]  0.589   0.002 0.088 0.413 0.763 2400.635 1.000        0.6
alpha[10] 0.531   0.021 0.325 0.000 1.168  237.703 1.006        0.7
alpha[11] 0.933   0.002 0.094 0.755 1.122 1451.750 1.004        0.9
alpha[12] 0.865   0.002 0.094 0.683 1.049 1836.322 1.001        0.9
alpha[13] 0.860   0.002 0.085 0.698 1.035 1500.334 1.000        0.7
alpha[14] 0.587   0.002 0.115 0.364 0.821 2266.946 1.000        0.6
alpha[15] 1.144   0.004 0.148 0.861 1.455 1218.010 1.003        1.2
alpha[16] 0.929   0.003 0.123 0.695 1.176 1471.775 1.001        0.9
alpha[17] 0.842   0.002 0.090 0.668 1.020 1716.525 1.000        0.9
alpha[18] 0.737   0.002 0.097 0.545 0.932 1705.945 1.005        0.7
alpha[19] 0.750   0.003 0.105 0.546 0.957 1692.985 1.002        0.6
alpha[20] 0.368   0.029 0.261 0.000 0.858   78.832 1.036        0.7
alpha[21] 0.938   0.002 0.095 0.755 1.127 1744.862 1.000        0.9
alpha[22] 0.877   0.003 0.113 0.662 1.107 1545.978 1.000        0.9
alpha[23] 0.629   0.002 0.078 0.475 0.781 2140.931 1.000        0.7
alpha[24] 0.587   0.003 0.121 0.360 0.828 2336.432 1.000        0.6
alpha[25] 1.065   0.005 0.172 0.756 1.422 1314.016 1.001        1.2
alpha[26] 1.139   0.004 0.114 0.914 1.359  885.729 1.005        0.9
alpha[27] 1.015   0.002 0.085 0.856 1.186 1401.790 1.002        0.9
alpha[28] 0.681   0.002 0.083 0.518 0.841 1436.525 1.005        0.7
alpha[29] 0.561   0.002 0.082 0.404 0.731 2759.687 1.000        0.6
alpha[30] 0.094   0.005 0.146 0.000 0.493  741.443 1.005        0.7
alpha[31] 0.895   0.003 0.104 0.697 1.103 1509.224 1.002        0.9
alpha[32] 0.810   0.002 0.096 0.629 1.008 2126.512 1.000        0.9
alpha[33] 0.711   0.002 0.076 0.563 0.860 1944.959 1.000        0.7
alpha[34] 0.746   0.002 0.112 0.525 0.968 2126.342 0.999        0.6
alpha[35] 1.187   0.006 0.188 0.845 1.567  963.064 1.000        1.2
alpha[36] 0.911   0.004 0.128 0.661 1.175 1225.312 1.003        0.9
alpha[37] 1.030   0.003 0.118 0.803 1.264 1217.695 1.002        0.9
alpha[38] 0.735   0.001 0.076 0.590 0.889 2623.830 1.001        0.7
alpha[39] 0.503   0.002 0.087 0.333 0.677 2351.746 0.999        0.6
alpha[40] 0.789   0.006 0.191 0.419 1.177  899.466 1.003        0.7
alpha[41] 0.879   0.003 0.099 0.687 1.076 1278.835 1.004        0.9
alpha[42] 0.923   0.002 0.105 0.722 1.129 2052.518 0.999        0.9
alpha[43] 0.667   0.001 0.070 0.527 0.806 2257.076 1.000        0.7
alpha[44] 0.517   0.002 0.099 0.316 0.712 2762.885 1.001        0.6
alpha[45] 1.300   0.004 0.137 1.055 1.589 1475.508 1.000        1.2
alpha[46] 0.795   0.003 0.100 0.611 0.996 1486.050 1.002        0.9
alpha[47] 0.878   0.002 0.082 0.718 1.037 2065.036 1.001        0.9
alpha[48] 0.712   0.002 0.089 0.542 0.890 1733.540 1.000        0.7
alpha[49] 0.612   0.002 0.097 0.426 0.800 2617.502 1.000        0.6
alpha[50] 0.452   0.027 0.257 0.000 0.922   91.815 1.051        0.7
ggplot(alpha, aes(x = true_alpha, y = mean)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  theme_minimal() +
  ggtitle("") +
  xlab("True Value") +
  ylab("Posterior Mean Estimate")+
  xlim(0,2)+
  ylim(0,2)

Category threshold parameters for 0s (\(\gamma_0\))

The \(\hat{R}\) values fall between 0.999 and 1.001, indicating satisfactory convergence. The 95% credible intervals include the true parameter value for 48 out of 50 items. The associated plot suggests that the parameter estimates correspond well with the true values.

gamma0 <- as.data.frame(summary(stanfit, 
                                pars = c("gamma0"), 
                                probs = c(0.025, 0.975))$summary)

gamma0$true_gamma0 <- ipar$gamma0
round(gamma0,3)
             mean se_mean    sd   2.5%  97.5%    n_eff  Rhat true_gamma0
gamma0[1]  -2.198   0.004 0.233 -2.667 -1.765 2763.966 1.000        -2.0
gamma0[2]  -0.948   0.003 0.177 -1.296 -0.615 3675.425 1.000        -0.9
gamma0[3]  -3.166   0.005 0.343 -3.863 -2.530 4133.812 1.000        -2.6
gamma0[4]  -0.784   0.003 0.172 -1.133 -0.458 2734.434 1.000        -0.8
gamma0[5]  -2.901   0.006 0.291 -3.496 -2.358 2499.605 1.000        -3.0
gamma0[6]  -2.788   0.005 0.299 -3.416 -2.251 3753.849 0.999        -2.5
gamma0[7]  -2.950   0.005 0.304 -3.561 -2.391 3528.532 1.000        -2.9
gamma0[8]  -2.490   0.004 0.255 -2.986 -2.008 3667.113 0.999        -2.6
gamma0[9]  -2.022   0.004 0.243 -2.524 -1.575 3849.667 0.999        -1.8
gamma0[10]  1.977   0.007 0.259  1.512  2.517 1217.491 0.999         2.0
gamma0[11] -1.886   0.004 0.230 -2.354 -1.459 3120.985 1.000        -2.0
gamma0[12] -0.846   0.003 0.173 -1.205 -0.524 3208.884 1.000        -0.9
gamma0[13] -2.829   0.005 0.309 -3.456 -2.254 4409.888 1.000        -2.6
gamma0[14] -0.598   0.003 0.171 -0.929 -0.265 2813.766 1.000        -0.8
gamma0[15] -2.807   0.007 0.303 -3.442 -2.262 2146.651 1.000        -3.0
gamma0[16] -2.465   0.005 0.263 -2.997 -1.972 2945.608 0.999        -2.5
gamma0[17] -2.988   0.005 0.314 -3.672 -2.406 3823.145 1.001        -2.9
gamma0[18] -2.367   0.004 0.251 -2.894 -1.899 3902.275 1.000        -2.6
gamma0[19] -1.533   0.003 0.200 -1.938 -1.162 3294.909 0.999        -1.8
gamma0[20]  2.045   0.007 0.255  1.582  2.590 1332.554 1.001         2.0
gamma0[21] -2.083   0.004 0.228 -2.532 -1.655 3512.427 0.999        -2.0
gamma0[22] -0.636   0.002 0.169 -0.972 -0.310 5306.466 1.000        -0.9
gamma0[23] -2.778   0.004 0.293 -3.416 -2.241 4290.608 1.000        -2.6
gamma0[24] -0.867   0.003 0.177 -1.209 -0.531 3440.786 1.000        -0.8
gamma0[25] -2.735   0.006 0.299 -3.322 -2.162 2332.849 1.000        -3.0
gamma0[26] -3.246   0.007 0.316 -3.925 -2.661 2243.082 1.000        -2.5
gamma0[27] -3.808   0.007 0.411 -4.672 -3.046 3912.522 1.000        -2.9
gamma0[28] -2.726   0.005 0.297 -3.323 -2.171 3704.939 1.000        -2.6
gamma0[29] -1.807   0.003 0.196 -2.217 -1.440 4147.618 1.001        -1.8
gamma0[30]  1.952   0.004 0.228  1.535  2.419 2606.711 1.000         2.0
gamma0[31] -1.790   0.004 0.215 -2.231 -1.388 3436.074 0.999        -2.0
gamma0[32] -0.995   0.003 0.173 -1.338 -0.664 3944.271 0.999        -0.9
gamma0[33] -2.601   0.004 0.292 -3.209 -2.052 4817.333 0.999        -2.6
gamma0[34] -0.989   0.003 0.175 -1.331 -0.650 3577.962 1.001        -0.8
gamma0[35] -2.688   0.007 0.281 -3.244 -2.161 1804.067 0.999        -3.0
gamma0[36] -2.754   0.006 0.289 -3.347 -2.224 2710.716 1.000        -2.5
gamma0[37] -2.783   0.006 0.326 -3.452 -2.179 3017.385 1.001        -2.9
gamma0[38] -2.958   0.005 0.312 -3.591 -2.391 4351.536 1.000        -2.6
gamma0[39] -1.995   0.004 0.232 -2.454 -1.575 4155.679 0.999        -1.8
gamma0[40]  2.114   0.005 0.261  1.629  2.637 2526.353 1.000         2.0
gamma0[41] -2.226   0.004 0.234 -2.704 -1.768 3285.618 1.001        -2.0
gamma0[42] -0.621   0.003 0.181 -0.975 -0.270 3382.318 0.999        -0.9
gamma0[43] -2.665   0.005 0.288 -3.258 -2.148 2982.102 1.000        -2.6
gamma0[44] -0.917   0.003 0.161 -1.245 -0.605 3349.697 1.001        -0.8
gamma0[45] -3.215   0.005 0.296 -3.827 -2.658 3018.075 1.000        -3.0
gamma0[46] -3.045   0.005 0.336 -3.707 -2.447 4043.815 0.999        -2.5
gamma0[47] -2.989   0.004 0.313 -3.656 -2.412 4926.150 0.999        -2.9
gamma0[48] -2.468   0.005 0.279 -3.054 -1.945 3601.212 1.000        -2.6
gamma0[49] -2.064   0.003 0.228 -2.521 -1.637 4246.423 1.000        -1.8
gamma0[50]  2.205   0.004 0.248  1.741  2.703 3755.308 1.001         2.0
ggplot(gamma0, aes(x = true_gamma0, y = mean)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  theme_minimal() +
  ggtitle("") +
  xlab("True Value") +
  ylab("Posterior Mean Estimate")+
  xlim(-4,3)+
  ylim(-4,3)

Category threshold parameters for 1s (\(\gamma_1\))

The \(\hat{R}\) values fall between 0.999 and 1.002, indicating satisfactory convergence. The 95% credible intervals include the true parameter value for 49 out of 50 items. The associated plot suggests that the parameter estimates correspond well with the true values.

gamma1 <- as.data.frame(summary(stanfit, 
                                pars = c("gamma1"), 
                                probs = c(0.025, 0.975))$summary)

gamma1$true_gamma1 <- ipar$gamma1
round(gamma1,3)
             mean se_mean    sd   2.5%  97.5%    n_eff  Rhat true_gamma1
gamma1[1]   0.035   0.003 0.158 -0.270  0.340 3624.878 0.999         0.5
gamma1[2]   2.084   0.004 0.220  1.681  2.532 3042.715 1.000         1.9
gamma1[3]   1.649   0.003 0.214  1.241  2.079 4219.022 0.999         1.6
gamma1[4]   0.946   0.003 0.179  0.603  1.301 2984.754 1.000         1.0
gamma1[5]  -1.062   0.004 0.195 -1.451 -0.675 2621.880 0.999        -1.0
gamma1[6]  -0.372   0.003 0.166 -0.699 -0.051 3289.044 0.999        -0.5
gamma1[7]   0.918   0.003 0.181  0.570  1.287 2882.369 1.001         0.9
gamma1[8]   0.499   0.002 0.153  0.207  0.806 4061.995 1.000         0.6
gamma1[9]   0.860   0.003 0.167  0.529  1.191 4337.434 1.000         0.8
gamma1[10]  2.701   0.008 0.324  2.141  3.363 1548.622 1.000         3.0
gamma1[11]  0.827   0.003 0.184  0.463  1.204 2929.529 1.000         0.5
gamma1[12]  1.980   0.004 0.220  1.563  2.423 3031.132 1.000         1.9
gamma1[13]  1.866   0.003 0.219  1.458  2.309 4510.721 0.999         1.6
gamma1[14]  0.838   0.003 0.174  0.505  1.180 3482.110 1.000         1.0
gamma1[15] -1.307   0.005 0.208 -1.718 -0.897 2080.761 1.000        -1.0
gamma1[16] -0.824   0.003 0.181 -1.178 -0.487 2998.991 1.000        -0.5
gamma1[17]  0.660   0.003 0.174  0.316  0.996 3384.687 0.999         0.9
gamma1[18]  0.560   0.003 0.155  0.255  0.872 3481.881 1.000         0.6
gamma1[19]  0.601   0.003 0.170  0.259  0.932 3390.571 1.000         0.8
gamma1[20]  2.998   0.010 0.355  2.357  3.757 1284.688 1.000         3.0
gamma1[21]  0.388   0.003 0.163  0.071  0.714 3582.093 1.000         0.5
gamma1[22]  1.877   0.003 0.215  1.458  2.302 3952.689 1.000         1.9
gamma1[23]  1.921   0.003 0.222  1.507  2.358 4060.768 1.000         1.6
gamma1[24]  0.784   0.003 0.173  0.452  1.126 3029.086 1.000         1.0
gamma1[25] -1.083   0.004 0.202 -1.481 -0.695 2570.980 1.000        -1.0
gamma1[26] -0.603   0.004 0.169 -0.943 -0.282 2155.477 1.002        -0.5
gamma1[27]  0.751   0.003 0.175  0.410  1.101 3373.155 1.000         0.9
gamma1[28]  0.499   0.002 0.161  0.192  0.819 4894.097 1.000         0.6
gamma1[29]  0.559   0.002 0.150  0.278  0.859 3739.784 1.001         0.8
gamma1[30]  3.059   0.007 0.355  2.393  3.796 2265.193 1.002         3.0
gamma1[31]  0.264   0.003 0.169 -0.062  0.595 3463.384 0.999         0.5
gamma1[32]  2.337   0.005 0.251  1.851  2.851 2771.971 1.001         1.9
gamma1[33]  1.341   0.003 0.194  0.974  1.725 4003.321 0.999         1.6
gamma1[34]  0.890   0.003 0.166  0.558  1.222 3540.347 1.000         1.0
gamma1[35] -1.016   0.004 0.192 -1.400 -0.656 2096.062 1.000        -1.0
gamma1[36] -0.541   0.003 0.171 -0.885 -0.207 3365.795 1.000        -0.5
gamma1[37]  0.823   0.004 0.186  0.482  1.196 2674.757 1.000         0.9
gamma1[38]  0.687   0.002 0.166  0.359  1.022 4398.944 0.999         0.6
gamma1[39]  0.882   0.003 0.161  0.571  1.199 3197.039 0.999         0.8
gamma1[40]  3.239   0.009 0.381  2.530  4.045 2011.782 1.001         3.0
gamma1[41]  0.641   0.003 0.169  0.308  0.965 3703.414 1.000         0.5
gamma1[42]  1.861   0.004 0.231  1.439  2.324 3588.874 0.999         1.9
gamma1[43]  1.595   0.003 0.204  1.201  1.999 3696.041 0.999         1.6
gamma1[44]  0.838   0.003 0.163  0.528  1.157 3654.328 1.000         1.0
gamma1[45] -1.032   0.003 0.189 -1.402 -0.677 3165.991 1.000        -1.0
gamma1[46] -0.567   0.003 0.174 -0.906 -0.234 4053.895 0.999        -0.5
gamma1[47]  1.006   0.003 0.176  0.662  1.362 4191.528 1.001         0.9
gamma1[48]  0.489   0.003 0.171  0.154  0.823 3609.162 1.000         0.6
gamma1[49]  0.758   0.003 0.169  0.431  1.087 4255.866 0.999         0.8
gamma1[50]  3.051   0.006 0.337  2.426  3.719 3221.403 1.001         3.0
ggplot(gamma1, aes(x = true_gamma1, y = mean)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  theme_minimal() +
  ggtitle("") +
  xlab("True Value") +
  ylab("Posterior Mean Estimate")+
  xlim(-1.5,3.5)+
  ylim(-1.5,3.5)

Dispersion Parameters (\(\omega\))

The \(\hat{R}\) values fall between 0.999 and 1.001, indicating satisfactory convergence. The 95% credible intervals include the true parameter value for all 50 items. The associated plot suggests that the parameter estimates correspond well with the true values.

omega <- as.data.frame(summary(stanfit, 
                               pars = c("omega"), 
                               probs = c(0.025, 0.975))$summary)

omega$true_omega <- ipar$omega

round(omega,3)
           mean se_mean    sd   2.5% 97.5%    n_eff  Rhat true_omega
omega[1]  4.411   0.010 0.444  3.557 5.265 1835.941 1.000        4.3
omega[2]  4.211   0.012 0.411  3.448 5.034 1251.043 1.000        3.6
omega[3]  4.263   0.007 0.301  3.688 4.855 1908.463 1.001        4.1
omega[4]  2.442   0.009 0.389  1.662 3.186 2059.323 1.000        2.6
omega[5]  3.104   0.018 0.626  1.891 4.377 1174.984 1.003        2.7
omega[6]  4.400   0.014 0.517  3.369 5.431 1294.254 1.000        4.3
omega[7]  3.208   0.010 0.350  2.515 3.881 1333.349 1.002        3.6
omega[8]  4.353   0.007 0.311  3.751 4.951 2203.044 1.000        4.1
omega[9]  2.843   0.005 0.300  2.237 3.413 3316.476 1.000        2.6
omega[10] 2.389   0.017 0.911  0.542 4.094 2769.831 0.999        2.7
omega[11] 4.777   0.011 0.431  3.946 5.665 1449.947 1.000        4.3
omega[12] 3.337   0.007 0.327  2.701 3.982 2166.427 1.000        3.6
omega[13] 3.981   0.007 0.311  3.379 4.581 1743.133 1.001        4.1
omega[14] 2.686   0.008 0.411  1.846 3.466 2453.515 0.999        2.6
omega[15] 3.303   0.018 0.708  1.915 4.722 1607.010 1.002        2.7
omega[16] 4.462   0.014 0.580  3.383 5.610 1613.494 1.001        4.3
omega[17] 4.034   0.008 0.350  3.345 4.726 1886.399 1.001        3.6
omega[18] 4.089   0.008 0.333  3.424 4.725 1721.303 1.003        4.1
omega[19] 3.304   0.007 0.374  2.565 4.026 2493.978 1.000        2.6
omega[20] 2.277   0.039 0.915  0.438 4.040  548.298 1.008        2.7
omega[21] 4.209   0.009 0.399  3.413 4.989 2155.864 1.000        4.3
omega[22] 3.320   0.007 0.341  2.636 3.976 2620.566 0.999        3.6
omega[23] 3.683   0.005 0.261  3.160 4.186 2494.843 1.000        4.1
omega[24] 2.101   0.007 0.386  1.304 2.868 3123.788 1.000        2.6
omega[25] 2.376   0.011 0.552  1.321 3.471 2370.574 0.999        2.7
omega[26] 4.986   0.018 0.575  3.907 6.178  987.429 1.003        4.3
omega[27] 3.872   0.008 0.320  3.287 4.525 1606.734 1.003        3.6
omega[28] 3.647   0.007 0.324  2.996 4.271 2145.663 0.999        4.1
omega[29] 2.760   0.005 0.286  2.197 3.316 3605.700 0.999        2.6
omega[30] 1.794   0.013 0.706  0.347 3.123 3165.487 1.001        2.7
omega[31] 4.368   0.010 0.438  3.486 5.225 1960.754 1.000        4.3
omega[32] 3.735   0.006 0.307  3.134 4.334 2785.723 1.002        3.6
omega[33] 4.067   0.006 0.311  3.455 4.669 2500.990 1.000        4.1
omega[34] 2.753   0.006 0.333  2.110 3.399 3198.443 1.000        2.6
omega[35] 2.778   0.019 0.640  1.553 4.137 1097.147 1.000        2.7
omega[36] 4.107   0.012 0.485  3.164 5.075 1774.701 1.003        4.3
omega[37] 4.088   0.013 0.429  3.273 4.923 1026.055 1.002        3.6
omega[38] 3.822   0.005 0.292  3.247 4.376 3083.202 1.000        4.1
omega[39] 2.713   0.005 0.288  2.149 3.271 3134.494 1.000        2.6
omega[40] 4.123   0.049 1.182  2.063 6.593  570.788 1.008        2.7
omega[41] 3.827   0.008 0.354  3.118 4.490 1879.021 1.000        4.3
omega[42] 3.975   0.009 0.393  3.200 4.749 1935.176 0.999        3.6
omega[43] 3.983   0.006 0.275  3.451 4.520 2447.292 1.000        4.1
omega[44] 2.223   0.005 0.320  1.587 2.863 3880.638 0.999        2.6
omega[45] 3.339   0.014 0.553  2.317 4.470 1477.088 1.000        2.7
omega[46] 4.768   0.012 0.513  3.766 5.820 1923.610 1.000        4.3
omega[47] 3.359   0.006 0.305  2.769 3.982 2458.900 1.000        3.6
omega[48] 3.455   0.006 0.335  2.768 4.105 2749.159 1.000        4.1
omega[49] 2.187   0.005 0.281  1.620 2.735 3388.890 1.000        2.6
omega[50] 1.761   0.026 0.925 -0.155 3.533 1263.098 1.005        2.7
ggplot(omega, aes(x = true_omega, y = mean)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  theme_minimal() +
  ggtitle("Scatterplot of True Values vs. Mean Estimates") +
  xlab("True Values") +
  ylab("Mean Estimates")+
  xlim(1,5)+
  ylim(1,5)

Person location parameters (\(\theta\))

The \(\hat{R}\) values for the person location estimates range from 0.999 to 1.005, signifying good convergence. The 95% credible intervals include the true parameter value for a substantial majority (94.3%) of the 1,000 test takers. With a correlation of 0.93 between the true and estimated parameters, the accompanying plot indicates that the person parameters are estimated with reasonable accuracy. A slight bias at the tails of the distribution (overestimation at lower proficiency levels and underestimation at higher ones) is expected due to shrinkage toward the mean induced by the prior distribution.

theta <- as.data.frame(summary(stanfit, 
                               pars = c("theta"), 
                               probs = c(0.025, 0.975))$summary)

theta$true_theta <- d_wide$true_theta

head(round(theta,3))
           mean se_mean    sd   2.5%  97.5%    n_eff  Rhat true_theta
theta[1] -0.494   0.004 0.228 -0.950 -0.041 3231.060 1.001     -0.243
theta[2]  0.414   0.006 0.406 -0.377  1.221 4205.017 1.000      0.698
theta[3]  0.514   0.006 0.326 -0.099  1.169 3283.535 1.000      0.548
theta[4] -0.779   0.004 0.261 -1.274 -0.252 3806.343 0.999     -0.364
theta[5]  0.918   0.007 0.438  0.082  1.779 4275.642 1.000      1.165
theta[6] -1.879   0.006 0.341 -2.584 -1.210 2841.389 1.000     -2.007
describe(theta[,c('mean','true_theta')])
           vars    n mean   sd median trimmed  mad   min  max range  skew
mean          1 1000 0.00 0.93   0.02    0.01 0.98 -2.96 2.52  5.48 -0.10
true_theta    2 1000 0.03 1.02   0.04    0.03 1.04 -3.09 3.04  6.13 -0.02
           kurtosis   se
mean          -0.41 0.03
true_theta    -0.20 0.03
cor(theta$mean,theta$true_theta)
[1] 0.9294544
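The interval coverage reported above can be computed the same way as for the item parameters:

mean(theta$`2.5%` <= theta$true_theta & theta$true_theta <= theta$`97.5%`)   # about 0.943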
ggplot(theta, aes(x = true_theta, y = mean)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  theme_minimal() +
  ggtitle("") +
  xlab("True Value") +
  ylab("Posterior Mean Estimate")+
  xlim(-5,4)+
  ylim(-5,4)

Latent regression coefficients (\(\xi_1\) and \(\xi_2\))

The \(\hat{R}\) values for both latent regression coefficients were 1.000. The coefficients were estimated at 0.221 and 0.475, compared to their true values of 0.3 and 0.4, respectively. The 95% credible intervals did not include the true parameter values for either coefficient, indicating bias in the estimates: one regression coefficient was underestimated, while the other was overestimated.

coefs2 <- summary(stanfit, pars = c("a","b"), probs = c(0.025, 0.975))$summary

round(coefs2,3)
   mean se_mean    sd  2.5% 97.5%    n_eff Rhat
a 0.221   0.001 0.033 0.158 0.286 2546.348    1
b 0.475   0.001 0.030 0.413 0.536 2373.995    1

Predictive Fit

From responses generated under the posterior distribution of the model parameters, we computed the posterior distribution of the sum of squared error (SSE) for predictions of the test data. The posterior mean of the SSE in the unseen test data was 187.46.

posteriors <- extract(stanfit)

Yhat_test  <- posteriors$Yhat_test

SSE_test <- rep(0,3000)   # 3,000 posterior draws = 4 chains x 750 retained draws

for(i in 1:3000){
  tmp          <- Yhat_test[i,]
  SSE_test[i]  <- sum((d_test$resp - tmp)^2)
}

ggplot() + 
  geom_histogram(aes(x=SSE_test),fill='white',color='black')+
  theme_minimal()

mean(SSE_test)
[1] 187.4556

While this number provides insight, it lacks context without a baseline for comparison. To establish a baseline SSE, for each observed response in the test set we randomly sampled a prediction from the item-specific distribution of observed responses in the training set. This process was repeated 3,000 times.

SSE_test_baseline <- matrix(nrow=3000,ncol=nrow(d_test))

for(i in 1:3000){
  for(j in 1:50){
    loc  <- which(d_test$item==j)              # test-set rows for item j
    sub  <- d_train[d_train$item==j,]$resp     # training responses for item j
    SSE_test_baseline[i,loc] <- sample(sub,length(loc),replace=TRUE)
  }
}

Then, we calculated the distribution of the baseline SSE and its average.

SSE_baseline <- rep(0,3000)

for(i in 1:3000){
  tmp             <-  SSE_test_baseline[i,]
  SSE_baseline[i] <- sum((d_test$resp - tmp)^2)
}

ggplot() + 
  geom_histogram(aes(x=SSE_baseline),fill='white',color='black')+
  theme_minimal()

mean(SSE_baseline)
[1] 302.4526

The SSE dropped from 302.45 to 187.46 when the model was used to predict the responses in the unseen test data. The proportional reduction in SSE can be interpreted as the variance explained by the model: \[\frac{302.45-187.46}{302.45} = 0.38.\]
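The same quantity can be computed directly from the two SSE distributions:

(mean(SSE_baseline) - mean(SSE_test))/mean(SSE_baseline)   # about 0.38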

We also used this approach in the paper to compare different models in terms of predictive utility for unseen data.

Posterior Predictive Checks

We evaluate certain aspects of model fit using the posterior predictive model-checking (PPMC) approach. Essentially, PPMC contrasts the observed data with model-generated data. If there’s a noticeable divergence between the real data and the model’s predictions, the model isn’t adequately capturing certain facets of the data. Visual representations are often the most user-friendly means of conducting these posterior predictive assessments, so we created several visualizations to check the alignment between the real data and model predictions. We primarily focused on four aspects:

  • Recovery of average item response

  • Recovery of the standard deviation of item responses

  • Recovery of the total test score distribution

  • Recovery of item specific distributions

The average item response

obs_  <- c()   # The observed average item score in the test data
pred_ <- c()   # The posterior mean of average item scores in the test data

for(item in 1:50){ # iterate over items
  
  loc1        <- which(d_test$item==item)
  obs         <- d_test$resp[loc1]
  obs_[item]  <- mean(obs)
  
  pred_[item] <- mean(rowMeans(Yhat_test[,loc1]))
}


ggplot() +
  geom_point(aes(y = obs_, x = pred_)) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  theme_minimal() +
  ggtitle("") +
  ylab("Average Item Score for Observed Responses") +
  xlab("Posterior Mean of Average Item Score for Model Generated Responses")+
  xlim(0,1)+
  ylim(0,1)

The standard deviation of item responses

obs_  <- c()   # The observed standard deviation in the test data
pred_ <- c()   # The posterior mean of standard deviation in the test data

for(item in 1:50){
  
  loc1        <- which(d_test$item==item)
  obs         <- d_test$resp[loc1]
  obs_[item]  <- sd(obs)
  
  # Var[E(Y|params)]: variance of the expected response conditional on model parameters
  comp1 <- var(colMeans(Yhat_test[,loc1]))
  
  # E[Var(Y|params)]: expected value of the response variance conditional on model parameters
  comp2 <- mean(apply(Yhat_test[,loc1],1,var))

  
  pred_[item] <- sqrt(comp1 + comp2)
  
}


ggplot() +
  geom_point(aes(y = obs_, x = pred_)) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  theme_minimal() +
  ggtitle("") +
  ylab("Standard Deviation of Observed Responses") +
  xlab("Standard Deviation of Model Generated Responses")+
  xlim(0,1)+
  ylim(0,1)

The total score distribution

For this check, we use the training data because the test data includes only one or two items per test taker, so computing a total score doesn’t make sense.

Yhat_train <- posteriors$Yhat_train

d_train_wide <- reshape(data = d_train,
                        idvar = c('id','w','s','true_theta'),
                        timevar = 'item',
                        direction= 'wide')

obs_score          <- rowMeans(d_train_wide[,5:54],na.rm=TRUE)

score_posterior    <- c()

for(p in 1:1000){
  
  loc1  <- which(d_train$id==d_train_wide$id[p])
  
  tmp <- rowMeans(Yhat_train[,loc1])
  
  score_posterior[p] <- sample(tmp,1)
}

ggplot() +
  geom_point(aes(y = obs_score, x = score_posterior)) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  theme_minimal() +
  ggtitle("") +
  ylab("Observed Average Person Score") +
  xlab("Posterior Mean of Model Predicted Average Person Score")+
  xlim(0,1)+
  ylim(0,1)

score <- data.frame(type=c(rep('Observed',1000),rep('Posterior',1000)),
                    score = c(obs_score,score_posterior))

ggplot(score, aes(x=score, linetype=type)) + 
  geom_histogram(data=subset(score, type=="Observed"), 
                 aes(y=after_stat(density)),fill='white',color='black') + 
  geom_density(linewidth=0.1) + 
  theme_minimal() +
  labs(title="", 
       x="Total Score", y="Density") +  
  theme(legend.title = element_blank(),
        legend.key = element_blank())

The item specific distribution

We again use the training data for this check because the test data has too few observations per item to form a meaningful distribution. The code below defines a function to compare the item-specific distributions of the observed data and the model-generated data.

compare_item_dist <- function(item){

  loc1        <- which(d_train$item==item)
  obs         <- d_train$resp[loc1]
  pred        <- c()
  
  for(i in 1:length(loc1)){
    pred[i]   <- sample(Yhat_train[,loc1[i]],1)
  }
  
  score <- data.frame(type=c(rep('Observed',length(loc1)),
                             rep('Posterior',length(loc1))),
                      score = c(obs,pred))

  p <- ggplot(score, aes(x=score, linetype=type)) + 
    geom_histogram(data=subset(score, type=="Observed"), 
                   aes(y=after_stat(density)),fill='white',color='black') + 
    geom_density(linewidth=0.1) + 
    theme_minimal() +
    labs(title="", 
         x="Total Score", y="Density") +  
    theme(legend.title = element_blank(),
          legend.key = element_blank()) +
    ggtitle(paste0('Item ',item))
  
  print(p)
}
compare_item_dist(item=1)

compare_item_dist(item=2)

compare_item_dist(item=3)

compare_item_dist(item=4)

References

  • Molenaar, D., Cúri, M., & Bazán, J. L. (2022). Zero and one inflated item response theory models for bounded continuous data. Journal of Educational and Behavioral Statistics, 47(6), 693-735.

  • Stenhaug, B. A., & Domingue, B. W. (2022). Predictive fit metrics for item response models. Applied Psychological Measurement, 46(2), 136-155.

  • Zopluoglu, C., & Lockwood, J. R. (2024). A comparative study of item response theory models for mixed discrete-continuous responses. Journal of Intelligence, 12(3), 26. https://doi.org/10.3390/jintelligence12030026