A simple way to model rankings with Stan

Bruno Nicenboim / 2021-03-21 15 min read
Categories: stats / Tags: stan Bayesian r /

The initial problem

I wrote what I thought it was the generative process for some modeling work, and it looked too common to not have a name, so I started to ask around in Twitter:

One useful clue was this one:

It turns out that the distribution that I was describing is in general used for rankings or ordered data and it is called an exploded logit distribution.1

In this post, I’ll show how this model can be fit in the probabilistic programming language Stan, and how it can be used to describe the underlying order of ranking data.

I’m going to load some R packages that will be useful throughout this post.

library(dplyr) # Data manipulation
library(purrr) # List manipulation
library(ggplot2) # Nice plots
library(extraDistr) # More distributions
library(rcorpora) # Get random words
library(cmdstanr) # Lightweight Stan interface
library(bayesplot) # Nice Bayesian plots

Ranking data

Ranking data appear when we care about the underlying order that certain elements have. We might want to know which are the best horses after looking at several races (Gakis et al. 2018), which is the best candidate for a job after a series of interviewers talked to several candidates. More in line with cognitive science, we might want to know which are the best possible completions for a sentence or the best exemplars of a category.

One way to get a ranking of exemplars of a category, for example, is to present them to participants and ask them to order all (or a subset) of them (see Barsalou 1985).

A ranking simulation using pizza toppings

Let’s consider the following 25 pizza toppings:

toppings <- corpora("foods/pizzaToppings")$pizzaToppings
N_toppings <- length(toppings)
toppings
 [1] "anchovies"        "artichoke"       
 [3] "bacon"            "breakfast bacon" 
 [5] "Canadian bacon"   "cheese"          
 [7] "chicken"          "chili peppers"   
 [9] "feta"             "garlic"          
[11] "green peppers"    "grilled onions"  
[13] "ground beef"      "ham"             
[15] "hot sauce"        "meatballs"       
[17] "mushrooms"        "olives"          
[19] "onions"           "pepperoni"       
[21] "pineapple"        "sausage"         
[23] "spinach"          "sun-dried tomato"
[25] "tomatoes"        

Let’s say that we want to know the underlying order of pizza toppings. For the modeling, I’m going to assume that the toppings are ordered according to an underlying value, which also represents how likely it is for each topping to be the exemplar of their category.

To get a known ground truth for the ranking, I’m going to simulate an order of pizza toppings. I assign probabilities that sum up to one to the toppings by drawing a random sample from a Dirichlet distribution. The Dirichlet distribution is the generalization of the Beta distribution. It has a concentration parameter, usually \(\boldsymbol{\alpha}\), which is a vector as long as the probabilities we are sampling (25 here). When the vector is full of ones, the distribution is uniform: All probabilities are equally likely, so on average each one is \(\frac{1}{vector \text{ } length}\) (\(\frac{1}{25}\) here). By setting all the concentration parameters below one (namely \(.2\)), I’m enforcing sparsity in the random values that I’m generating, that is, many probability values close to zero.

These is the true order that I’m assuming here:

# all elements of the vector are .5
alpha <- rep(.2, N_toppings)
# Generate one draw from a Dirichlet distribution
P_toppings <- c(rdirichlet(1, alpha)) %>%
  # Add names
  setNames(toppings) %>%
  # Sort from the best exemplar
  sort(decreasing = TRUE)
P_toppings %>%
  round(3)
 breakfast bacon          chicken             feta 
           0.294            0.241            0.087 
       anchovies sun-dried tomato           olives 
           0.087            0.077            0.057 
       pepperoni        artichoke           cheese 
           0.056            0.049            0.010 
  Canadian bacon            bacon              ham 
           0.008            0.008            0.006 
       meatballs    chili peppers           garlic 
           0.004            0.004            0.004 
     ground beef         tomatoes        hot sauce 
           0.003            0.003            0.002 
          onions          sausage        pineapple 
           0.000            0.000            0.000 
         spinach        mushrooms   grilled onions 
           0.000            0.000            0.000 
   green peppers 
           0.000 
  • Given these values, if I were to ask a participant “What’s the most appropriate topping for a pizza?” I would assume that 29.37 percent of the time, I would get breakfast bacon.

Essentially, we expect something like this to be happening.

\[\begin{equation} response \sim Categorical(\Theta_{toppings}) \tag{1} \end{equation}\]

With \(\Theta_{toppings}\) representing the different probabilities for each topping. The probability mass function of the categorical distribution is absurdly simple: It’s just the probability of the outcome.

\[\begin{equation} p(x = i) = \Theta_i \tag{2} \end{equation}\]

where \(i = \{\)breakfast bacon, chicken, feta, anchovies, sun-dried tomato, olives, pepperoni, artichoke, cheese, Canadian bacon, bacon, ham, meatballs, chili peppers, garlic, ground beef, tomatoes, hot sauce, onions, sausage, pineapple, spinach, mushrooms, grilled onions, green peppers\(\}\).

We can simulate this with 100 participants as follows:

response <- rcat(100, P_toppings, names(P_toppings))

And this should match approximately P_toppings.

table(response)/100
response
 breakfast bacon          chicken             feta 
            0.26             0.19             0.16 
       anchovies sun-dried tomato           olives 
            0.07             0.15             0.06 
       pepperoni        artichoke           cheese 
            0.01             0.08             0.00 
  Canadian bacon            bacon              ham 
            0.02             0.00             0.00 
       meatballs    chili peppers           garlic 
            0.00             0.00             0.00 
     ground beef         tomatoes        hot sauce 
            0.00             0.00             0.00 
          onions          sausage        pineapple 
            0.00             0.00             0.00 
         spinach        mushrooms   grilled onions 
            0.00             0.00             0.00 
   green peppers 
            0.00 

It seems that by only asking participants to give the best topping we could already deduce the underlying order…

True, but one motivation for considering ranking data is the amount of information that we gather with a list due to their combinatorial nature. If we ask participants to rank \(n\) items, an answer consists in making a single selection out of \(n!\) possibilities. Ordering 7 pizza toppings, for example, constitutes making a single selection out of 5040 possibilities!

If we don’t relay on lists and there is sparcity, it requires a large number of participants until we get answers of low probability. (For example, we’ll need a very large number of participants until we hear something else but hammer as an exemplar of tools).

  • Now, what happens if we ask about the second most appropriate topping for a pizza?

Now we need to exclude the first topping that was given, and draw another sample from a categorical distribution. (We don’t allow the participant to repeat toppings, that is, to say that the best topping is pineapple and the second best is also pineapple). This means that now the probability of the topping already given is zero, and that we need to normalize our original probability values by dividing them by the new total probability (which will be lower than 1).

Here, the probability of getting the element \(j\) (where \(j \neq i\)) is

\[\begin{equation} p(x = j) = \frac{\Theta_j}{\sum \Theta_{[-i]}} \tag{3} \end{equation}\]

where \(\Theta_{[-i]}\) represents the probabilities of all the outcomes except of \(i\), which was the first one.

  • We can go on with the third best topping, where we need to normalize the remaining probabilities by dividing by the new sum of probabilities.

\[\begin{equation} p(x = k) = \frac{\Theta_k}{\sum \Theta_{[-i,-j]}} \tag{4} \end{equation}\]

  • We can do this until we get to the last element, which will be drawn with probability 1.

And this is the exploded logit distribution.

This process can be simulated in R as follows:

rexploded <-  function(n, ranked = 3, prob, labels = NULL){
  #run n times
  lapply(1:n, function(nn){
    res <- rep(NA,ranked)
    if(!is.null(labels)){
      res <- factor(res, labels)
    } else {
      # if there are no labels, just 1,2,3,...
      labels <- seq_along(prob)
    }
  for(i in 1:ranked){
    # normalize the probability so that it sums to 1
    prob <- prob/sum(prob)
    res[i] <- rcat(1, prob = prob, labels = labels)
    # remove the choice from the set:
    prob[res[i]] <- 0
  }
    res
  })
}

If we would like to simulate 50 subjects creating a ranking of the best 7 toppings, we would do the following:

res <- rexploded(n = 50,
                 ranked = 7,
                 prob = P_toppings,
                 labels = names(P_toppings))
# subject 1:
res[[1]]
[1] sun-dried tomato artichoke        olives          
[4] breakfast bacon  chicken          pepperoni       
[7] anchovies       
25 Levels: breakfast bacon chicken feta ... green peppers

We have simulated ranking data of pizza toppings, can we recover the original probability values and “discover” the underlying order?

Fitting the exploded logistic distribution in Stan

To fit the model in Stan, I’m going to create a custom probability mass function that takes an array of integers, x, which represents a set of rankings, and a vector of probability values, theta, that sums up to one.

The logic of this function is that the probability mass function of a ranking \(\{i,j,k, \ldots, N \}\) can be written as a product of normalized categorical distributions (where the first one is just divided by 1).

\[\begin{equation} p(x = \{i,j,k,\ldots\}) = \frac{\Theta_i}{\sum \Theta} \cdot \frac{\Theta_j}{\sum \Theta_{[-i]}} \cdot \frac{\Theta_k}{\sum \Theta_{[-i, -j]}} \ldots \tag{5} \end{equation}\]

For Stan, we need the log-PDF. In log-space, products become sums, and divisions differences, and the log of \(\sum \Theta\) will be zero:

\[\begin{equation} \begin{aligned} log(p(x = \{i,j,k,\ldots\})) =& \log(\Theta_i) - log(\sum \Theta) \\ & + \log(\Theta_j) - \log(\sum \Theta_{[-i]}) \\ &+ \log(\Theta_k) -\log(\sum \Theta_{[-i, -j]}) \\ & + \ldots \end{aligned} \tag{5} \end{equation}\]

The following Stan custom function follows this logic but iterating over the rankings. In each iteration, it aggregates in the variable out the addends of the log probability mass function, and turns the probability of selecting again the already ranked element to zero.

 real exploded_lpmf(int[] x, vector Theta){
    real out = 0;
    vector[num_elements(Theta)] thetar = Theta;
    for(pos in x){
      out += log(thetar[pos]) - log(sum(thetar));
      thetar[pos] = 0;
      }
     return(out);
 }

The whole model named exploded.stan includes the usual data declaration, the parameter Theta declared as a simplex (i.e., it sums to one), and a uniform Dirichlet prior for Theta. (I’m assuming that I don’t know how sparse the probabilities are).

functions {
 real exploded_lpmf(int[] x, vector Theta){
    real out = 0;
    vector[num_elements(Theta)] thetar = Theta;
    for(pos in x){
      out += log(thetar[pos]) - log(sum(thetar));
      thetar[pos] = 0;
      }
     return(out);
 }
}
data{
  int N_ranking; //total times the choices were ranked
  int N_ranked; //total choices ranked
  int N_options; //total options
  int res[N_ranking, N_ranked];
}
parameters {
  simplex[N_options] Theta;
}
model {
  target += dirichlet_lpdf(Theta| rep_vector(1, N_options));
  for(r in 1:N_ranking){
    target += exploded_lpmf(res[r]|Theta);
  }
}

Let’s see if I can recover the parameter values.

# Makethe list of lists into a matrix
res_matrix <- t(sapply(res, as.numeric))
ldata <- list(res = res_matrix, 
              N_ranked = length(res[[1]]), 
              N_options = length(P_toppings), 
              N_ranking = length(res)) 

m_expl <- cmdstan_model("./exploded.stan")

f_exploded <- m_expl$sample(
  data = ldata,
  seed = 123,
  parallel_chains = 4
)

f_exploded
 variable    mean  median   sd  mad      q5     q95 rhat ess_bulk ess_tail
 lp__     -724.91 -724.59 3.59 3.49 -731.39 -719.59 1.00     1262     2290
 Theta[1]    0.29    0.29 0.04 0.04    0.23    0.35 1.00     5575     2538
 Theta[2]    0.27    0.27 0.04 0.04    0.21    0.33 1.00     5320     3114
 Theta[3]    0.08    0.08 0.01 0.01    0.06    0.10 1.00     6830     3167
 Theta[4]    0.09    0.08 0.01 0.01    0.06    0.11 1.00     5790     2746
 Theta[5]    0.07    0.07 0.01 0.01    0.05    0.09 1.00     6440     3142
 Theta[6]    0.05    0.04 0.01 0.01    0.03    0.06 1.00     6608     3177
 Theta[7]    0.04    0.04 0.01 0.01    0.03    0.06 1.00     6373     2697
 Theta[8]    0.05    0.05 0.01 0.01    0.03    0.06 1.00     5620     2863
 Theta[9]    0.01    0.01 0.00 0.00    0.00    0.01 1.00     7235     2664

 # showing 10 of 26 rows (change via 'max_rows' argument or 'cmdstanr_max_rows' option)

I plot the posterior distributions of the probability values and the true probability values below.

mcmc_recover_hist(f_exploded$draws("Theta"),
                  P_toppings,
                  facet_args =
                    list(scales = "fixed", ncol = 3)) +
  theme(legend.position="bottom")

It looks reasonable. However, if we really want to be sure that this is working, we should probably use simulation based calibration (Talts et al. 2018).

What is this good for?

This super simple example shows how to get an underlying ranking based on a set of responses from a number of subjects. It’s straightforward to adapt this model to data from participants ranking elements from different sets of the same size (e.g., 7 out of 25 toppings, 7 out of 25 tools). It’s a little less straightforward if the sets are of different sizes, e.g., rank 7 toppings out 25, but 7 tools out 50. This is just because Stan doesn’t allow ragged arrays. See here some tips for implementing the latter model.

Could this be used as a cognitive model of people’s rankings?

Maybe. And I enter here in the realm of half baked research, ideal for a blog post.

Lee, Steyvers, and Miller (2014) show the implementation of a cognitive model for rank order data from the latent knowledge of participants, which is based on Thurstonian models (Thurstone 1927, 1931) fitted with Bayesian methods in JAGS (Johnson and Kuhn 2013).

The exploded logit model seems to be closely related to the Thurstonian model. The Thurstonian model assumes that each participant assigns an underlying score to each item of a set, which is drawn from a true score with normally distributed error. The score determines the order that the participant gives. We can think about the exploded logit similarly. While I modeled the underlying ranking based on probability values, one could assume that each participant \(s\) had their own score \(mu_{is}\) for each item (or pizza topping) \(i\), which is built as a common score \(mu_i\) together with some individual deviation \(\epsilon_{is}\):

\[\begin{equation} \mu_{is} = \mu_i + \epsilon_{is} \tag{6} \end{equation}\]

If we assume that \(\epsilon_{is}\) has a Gumbel distribution, then the probability of \(\mu_{is}\) being ranked first out of N options is determined by a softmax function:

\[\begin{equation} P(i) = \frac{\exp(\mu_i)}{\sum \exp(\mu)} \tag{7} \end{equation}\]

where \(\mu\) is the vector of scores for all elements of the set.

And the probability of ordering \(j\) second is:

\[\begin{equation} P(i,j,\ldots) = \frac{\exp(\mu_j)}{\sum \exp(\mu_{[-i]} )} \tag{8} \end{equation}\]

and so forth.

These last equations are essentially the same categorical distributions that I used before in (2) and (3), but the softmax function converts the unbounded scores into probabilities first. However, with the exploded logit, the error term goes away leading to a more tractable model. This is not the case for the Thurstonian model. The Thurstonian model is more complex, but at the same time we gain more flexibility. With the error term, the Thurstonian model can incorporate the reliability of the participants’ judgments and even correlations, which, as far as I know, can’t be included in the exploded logit model.

Session info2

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=nl_NL.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=nl_NL.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bayesplot_1.8.1     cmdstanr_0.4.0.9000 rcorpora_2.0.0      extraDistr_1.9.1    ggplot2_3.3.5       purrr_0.3.4         dplyr_1.0.7        

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.1     xfun_0.26            bslib_0.3.0          reshape2_1.4.4       colorspace_2.0-2     vctrs_0.3.8          generics_0.1.1      
 [8] htmltools_0.5.2      yaml_2.2.1           utf8_1.2.2           rlang_0.4.12         jquerylib_0.1.4      pillar_1.6.4         glue_1.5.1          
[15] withr_2.4.2          DBI_1.1.1            distributional_0.2.2 matrixStats_0.60.1   lifecycle_1.0.1      plyr_1.8.6           stringr_1.4.0       
[22] posterior_1.1.0      munsell_0.5.0        blogdown_1.5.2       gtable_0.3.0         evaluate_0.14        labeling_0.4.2       papaja_0.1.0.9997   
[29] knitr_1.36           fastmap_1.1.0        ps_1.6.0             fansi_0.5.0          highr_0.9            Rcpp_1.0.7           scales_1.1.1        
[36] backports_1.2.1      checkmate_2.0.0      jsonlite_1.7.2       abind_1.4-5          farver_2.1.0         tensorA_0.36.2       digest_0.6.28       
[43] stringi_1.7.6        processx_3.5.2       bookdown_0.22        grid_4.1.1           tools_4.1.1          magrittr_2.0.1       sass_0.4.0          
[50] tibble_3.1.5         crayon_1.4.2         pkgconfig_2.0.3      ellipsis_0.3.2       data.table_1.14.2    ggridges_0.5.3       assertthat_0.2.1    
[57] rmarkdown_2.11       R6_2.5.1             compiler_4.1.1      

References

Barsalou, Lawrence W. 1985. “Ideals, Central Tendency, and Frequency of Instantiation as Determinants of Graded Structure in Categories.” Journal of Experimental Psychology: Learning, Memory, and Cognition 11 (4): 629.
Beggs, S, S Cardell, and J Hausman. 1981. “Assessing the Potential Demand for Electric Cars.” Journal of Econometrics 17 (1): 1–19. https://doi.org/https://doi.org/10.1016/0304-4076(81)90056-7.
Chang, Winston, Joe Cheng, JJ Allaire, Carson Sievert, Barret Schloerke, Yihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert, and Barbara Borges. 2021. Shiny: Web Application Framework for r. https://CRAN.R-project.org/package=shiny.
Gabry, Jonah, and Rok Češnovar. 2020. Cmdstanr: R Interface to ’CmdStan’.
Gabry, Jonah, Daniel Simpson, Aki Vehtari, Michael Betancourt, and Andrew Gelman. 2019. “Visualization in Bayesian Workflow.” J. R. Stat. Soc. A 182: 389–402. https://doi.org/10.1111/rssa.12378.
Gakis, Konstantinos, Panos Pardalos, Chang-Hwan Choi, Jae-Hyeon Park, and Jiwun Yoon. 2018. “Simulation of a Probabilistic Model for Multi-Contestant Races.” Athens Journal of Sports 5 (2): 95–114.
Henry, Lionel, and Hadley Wickham. 2020. Purrr: Functional Programming Tools. https://CRAN.R-project.org/package=purrr.
Johnson, Timothy R., and Kristine M. Kuhn. 2013. “Bayesian Thurstonian Models for Ranking Data Using JAGS.” Behavior Research Methods 45 (3): 857–72. https://doi.org/10.3758/s13428-012-0300-3.
Kazemi, Darius, Cole Willsea, Serin Delaunay, Karl Swedberg, Matthew Rothenberg, Greg Kennedy, Nathaniel Mitchell, et al. 2018. Rcorpora: A Collection of Small Text Corpora of Interesting Data. https://CRAN.R-project.org/package=rcorpora.
Lee, Michael D., Mark Steyvers, and Brent Miller. 2014. “A Cognitive Model for Aggregating People’s Rankings.” PLOS ONE 9 (5): e96431. https://doi.org/10.1371/journal.pone.0096431.
Luce, R. Duncan. 1959. Individual Choice Behavior : A Theoretical Analysis. Book. Wiley N.Y.
Plackett, R. L. 1975. “The Analysis of Permutations.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 24 (2): 193–202. http://www.jstor.org/stable/2346567.
R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Talts, Sean, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2018. “Validating Bayesian Inference Algorithms with Simulation-Based Calibration,” April. http://arxiv.org/abs/1804.06788.
Thurstone, Louis L. 1927. “A Law of Comparative Judgement.” Psychological Reviews 34: 273–86.
———. 1931. “Rank Order as a Psycho-Physical Method.” Journal of Experimental Psychology 14 (3): 187.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2021. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wolodzko, Tymoteusz. 2020. extraDistr: Additional Univariate and Multivariate Distributions. https://CRAN.R-project.org/package=extraDistr.
Yutani, Hiroaki. 2021. Mediumr: R Interface to ’Medium’ API.

  1. This model is also called the rank ordered logit model (Beggs, Cardell, and Hausman 1981) or Plackett–Luce model due to Plackett (1975) and Luce (1959), but I liked the explosion part more.↩︎

  2. I used R [Version 4.1.1; R Core Team (2020)] and the R-packages bayesplot [Version 1.8.1; Gabry et al. (2019)], cmdstanr [Version 0.4.0.9000; Gabry and Češnovar (2020)], dplyr [Version 1.0.7; Wickham et al. (2021)], extraDistr [Version 1.9.1; Wolodzko (2020)], ggplot2 [Version 3.3.5; Wickham (2016)], mediumr (Yutani 2021), purrr [Version 0.3.4; Henry and Wickham (2020)], rcorpora [Version 2.0.0; Kazemi et al. (2018)], and shiny [Version 1.6.0; Chang et al. (2021)] to generate this document.↩︎

Citation:

For attribution, please cite this work as

Bruno Nicenboim. (2021). "A simple way to model rankings with Stan". Retrieved from /2021/03/21/a-simple-way-to-model-rankings-with-stan/.

BibTex citation

  @misc{Bruno Nicenboim,
    author = { Bruno Nicenboim },
    title = {A simple way to model rankings with Stan},
    howpublished = {\url{/2021/03/21/a-simple-way-to-model-rankings-with-stan/}},
    year = {2021}
  }