Pull request overview
This PR implements a Conditional Particle Filter with Independent Backward Simulation (CPF-IBS) for the SMC2 rejuvenation step in the BayesMallowsSMC2 R package, and bumps the version to 0.3.0.
Changes:
- Added `backward_sampling` option to `set_smc_options()`, plumbed through to C++ via `Options`
- Implemented `assemble_backward_trajectory()` in C++, which samples reference trajectories by drawing independently from stored forward filtering weights
- Added tests for backward sampling across complete, partial, preference, and mixture data scenarios
Reviewed changes
Copilot reviewed 19 out of 20 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/particle.cpp | Core implementation of assemble_backward_trajectory and weight storage |
| src/particle.h | Declaration of new method and stored_weights member |
| src/rejuvenate.cpp | Integration of backward sampling into rejuvenation step |
| src/options.h / src/options.cpp | Added backward_sampling field |
| R/set_smc_options.R | New backward_sampling parameter |
| R/compute_sequentially.R | Documentation update and formatting |
| tests/testthat/test-compute_sequentially_*.R | New backward sampling tests |
| man/*.Rd | Documentation with references |
| inst/REFERENCES.bib | New bibliography entries |
| DESCRIPTION / NEWS.md / cran-comments.md | Version bump and changelog |
| agents.md | AI development prompt (should not be shipped) |
| .gitignore / .Rbuildignore | Build config updates |
Comment on lines +125 to +180
```cpp
for (int t = T; t >= 0; --t) {
  arma::vec current_weights = stored_weights[t];

  // Sample a single index b_t based on current_weights
  arma::ivec counts = resampler->resample(1, current_weights);
  unsigned int b_t =
      arma::as_scalar(arma::find(counts > 0, 1)); // The chosen index

  unsigned int num_users_at_t =
      particle_filters[b_t]
          .latent_rankings.col(t)
          .n_cols; // actually wait, .col(t) returns EXACTLY 1 column.

  if (new_reference.latent_rankings.is_empty()) {
    new_reference.latent_rankings =
        particle_filters[b_t].latent_rankings.col(t);
    if (parameters.tau.size() > 1) {
      // The total number of users up to time t in the forward pass is the
      // length of cluster_assignments
      unsigned int end_idx =
          particle_filters[b_t].cluster_assignments.n_elem - 1;
      // Since .col(t) grabbed 1 column, but what if multiple users were
      // processed? Ah! In `run_particle_filter`, `proposal.proposal` is
      // joined! Wait, `pf.latent_rankings = join_horiz(pf.latent_rankings,
      // proposal.proposal);` If `proposal.proposal` had 5 columns at time
      // `t`, then `pf.latent_rankings` grew by 5 columns! So
      // `latent_rankings` columns correspond to USERS, not timepoints! So
      // `col(t)` is completely wrong! We need to extract the columns
      // corresponding to time `t`. Let's look at `sample_latent_rankings`.
      // For complete data, 1 user = 1 row = 1 timepoint! Wait... for mixture
      // models, see test: `compute_sequentially(mixtures[1:50,])`. `mixtures`
      // has 1 row per user. So `n_timepoints` = 50. At each timepoint, 1 user
      // is processed. So `proposal.proposal.n_cols` = 1. Thus
      // `latent_rankings` has exactly 1 column per timepoint.
      // `num_users_at_t` is always 1! SO WHY DID IT SEGFAULT? Because
      // `col(t)` returns exactly 1 column, `num_users_at_t` is 1. Let's check
      // `start_idx`.
      new_reference.cluster_assignments =
          particle_filters[b_t].cluster_assignments.subvec(t, t);
      new_reference.cluster_probabilities =
          particle_filters[b_t].cluster_probabilities.cols(t, t);
      new_reference.index = uvec(T + 1, fill::zeros);
    }
  } else {
    new_reference.latent_rankings.insert_cols(
        0, particle_filters[b_t].latent_rankings.col(t));
    if (parameters.tau.size() > 1) {
      new_reference.cluster_assignments.insert_rows(
          0, particle_filters[b_t].cluster_assignments.subvec(t, t));
      new_reference.cluster_probabilities.insert_cols(
          0, particle_filters[b_t].cluster_probabilities.cols(t, t));
    }
  }

  new_reference.log_weight(t) = particle_filters[b_t].log_weight(t);
}
```
Comment on lines +1 to +26
```markdown
# Role and Identity
You are an expert C++ and RcppArmadillo developer specializing in Bayesian computation, Sequential Monte Carlo (SMC2), and Particle Markov Chain Monte Carlo (PMCMC) methods. You are working on the `BayesMallowsSMC2` R package.

# Project Context
The `BayesMallowsSMC2` package implements sequential inference for the Bayesian Mallows Model using an SMC2 algorithm. Currently, the package uses a standard Conditional Particle Filter (CPF) for the Particle Gibbs rejuvenation step. This standard approach suffers from path degeneracy because it relies on deterministic forward ancestral tracing, which severely limits the mixing of latent variables for early users.

# Objective
Your goal is to upgrade the rejuvenation step by implementing **Conditional Particle Filter with Independent Backward Simulation (CPF-IBS)**. Because cross-sectional user batches are conditionally independent given the static parameters, the standard O(S^2) backward simulation simplifies exactly to O(S) independent categorical draws from the marginal forward filtering weights.

# Mathematical Specification of CPF-IBS
1. **Forward Pass:** The conditional particle filter processes users at time t = 1 ... T. It computes normalized weights W_t and latent states x_t for particles s = 1 ... S.
   *Crucial Change:* We no longer need to track or store the forward ancestor indices (a_t).
2. **Backward Pass:** Instead of recursive pointer lookups (e.g. b_{t-1} = a_{t-1}^{b_t}), we assemble the new reference trajectory by looping backward from t = T down to 1.
3. **Sampling:** At each timestep t, sample an index b_t in {1, ..., S} independently with probabilities W_t.
4. **Extraction:** Extract the latent variables for time t from the sampled particle b_t.

# Codebase Rules & Architectural Constraints (RcppArmadillo)
1. **Memory Optimization:** Locate the T x S matrix of ancestor variables (likely an `arma::umat a` matrix inside the conditional particle filter / SMC structures) and completely remove it. This will drastically reduce the memory footprint. You only need to track the weights W_t.
2. **Weighted Sampling in C++:** When sampling the index b_t ~ W_t during trajectory assembly, utilize `arma::randi` (with the custom normalized W_t probabilities) to sample b_t independently. Ensure you are using a statistically rigorous weighted sampling function compatible with R's RNG scope so that `set.seed()` from R remains perfectly reproducible.
3. **Clean Integration:** Modify the Rejuvenation functions (Algorithm 4 / S3) so that trajectory assembly happens *after* the forward pass completes, iterating in reverse.
4. **Testing:** Ensure that the R package still compiles (`Rcpp::compileAttributes()`, `devtools::document()`, `devtools::load_all()`) and passes all existing tests (`devtools::test()`).

# Interaction Guidelines
- Before writing code, use search tools to find the exact C++ files handling the Conditional Particle Filter and Rejuvenation steps in the `src/` directory.
- Explain your planned modifications to the C++ logic before executing them.
- Keep performance in mind: cache locality and avoiding deep copies of Armadillo matrices inside the t-loop are critical.
```
```r
    resampler = "systematic")
)
alpha_hat <- weighted.mean(x = as.numeric(mod$alpha), w = mod$importance_weights)
alpha_hat <- weighted.mean(x = as.numeric(mod$alpha), w = mod$importance_weights)
```
Comment on lines +109 to 149
```cpp
if (prior.n_clusters > 1) {
  uvec cluster_assignments =
      particle_filters[conditioned_particle_filter].cluster_assignments;
  uvec cluster_frequencies =
      hist(cluster_assignments, regspace<uvec>(0, prior.n_clusters - 1));

  for (size_t cluster{}; cluster < prior.n_clusters; cluster++) {
    parameters.tau(cluster) = R::rgamma(
        cluster_frequencies(cluster) + prior.cluster_concentration, 1.0);
  }
  parameters.tau = normalise(parameters.tau, 1);
  Particle gibbs_particle(options, this->parameters, pfun);
  gibbs_particle.conditioned_particle_filter = 0;
  gibbs_particle.particle_filters[0] =
      this->particle_filters[this->conditioned_particle_filter];
  gibbs_particle.particle_filters[0].cluster_probabilities = mat{};

  // In standard CPF we trace the lineage of conditioned_particle_filter.
  // In backward sampling, we run a completely unconditioned forward particle
  // filter!
  bool requires_conditional = !options.backward_sampling;
  for (size_t t{}; t < T + 1; t++) {
    gibbs_particle.run_particle_filter(
        t, prior, data, pfun, distfun, resampler,
        options.latent_rank_proposal, requires_conditional);
  }

  this->log_incremental_likelihood =
      gibbs_particle.log_incremental_likelihood;
  this->log_normalized_particle_filter_weights =
      gibbs_particle.log_normalized_particle_filter_weights;
  this->particle_filters = gibbs_particle.particle_filters;
  this->logz = gibbs_particle.logz;
  this->stored_weights = gibbs_particle.stored_weights;

  if (options.backward_sampling) {
    this->assemble_backward_trajectory(T, resampler);
  } else {
    sample_particle_filter();
  }
}
```
Owner (Author):
@copilot can you review the last commits and see if they fix most issues?
Contributor:
@osorensen I've opened a new pull request, #82, to work on those changes. Once the pull request is ready, I'll request review from you.
Commits:
- …, duplicate alpha_hat (Co-authored-by: osorensen <21175639+osorensen@users.noreply.github.com>)
- Fix O(T²) backward trajectory, extend backward sampling to single-cluster models