Pull request overview
This PR implements a Conditional Particle Filter with Independent Backward Simulation (CPF-IBS) for the SMC2 rejuvenation step in the BayesMallowsSMC2 R package, and bumps the version to 0.3.0.
Changes:
- Added `backward_sampling` option to `set_smc_options()`, plumbed through to C++ via `Options`
- Implemented `assemble_backward_trajectory()` in C++, which samples reference trajectories by drawing independently from stored forward filtering weights
- Added tests for backward sampling across complete, partial, preference, and mixture data scenarios
Reviewed changes
Copilot reviewed 19 out of 20 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/particle.cpp | Core implementation of assemble_backward_trajectory and weight storage |
| src/particle.h | Declaration of new method and stored_weights member |
| src/rejuvenate.cpp | Integration of backward sampling into rejuvenation step |
| src/options.h / src/options.cpp | Added backward_sampling field |
| R/set_smc_options.R | New backward_sampling parameter |
| R/compute_sequentially.R | Documentation update and formatting |
| tests/testthat/test-compute_sequentially_*.R | New backward sampling tests |
| man/*.Rd | Documentation with references |
| inst/REFERENCES.bib | New bibliography entries |
| DESCRIPTION / NEWS.md / cran-comments.md | Version bump and changelog |
| agents.md | AI development prompt (should not be shipped) |
| .gitignore / .Rbuildignore | Build config updates |
Comment on lines +125 to +180
```cpp
for (int t = T; t >= 0; --t) {
  arma::vec current_weights = stored_weights[t];

  // Sample a single index b_t based on current_weights
  arma::ivec counts = resampler->resample(1, current_weights);
  unsigned int b_t =
      arma::as_scalar(arma::find(counts > 0, 1)); // The chosen index

  unsigned int num_users_at_t =
      particle_filters[b_t]
          .latent_rankings.col(t)
          .n_cols; // actually wait, .col(t) returns EXACTLY 1 column.

  if (new_reference.latent_rankings.is_empty()) {
    new_reference.latent_rankings =
        particle_filters[b_t].latent_rankings.col(t);
    if (parameters.tau.size() > 1) {
      // The total number of users up to time t in the forward pass is the
      // length of cluster_assignments
      unsigned int end_idx =
          particle_filters[b_t].cluster_assignments.n_elem - 1;
      // Since .col(t) grabbed 1 column, but what if multiple users were
      // processed? Ah! In `run_particle_filter`, `proposal.proposal` is
      // joined! Wait, `pf.latent_rankings = join_horiz(pf.latent_rankings,
      // proposal.proposal);` If `proposal.proposal` had 5 columns at time
      // `t`, then `pf.latent_rankings` grew by 5 columns! So
      // `latent_rankings` columns correspond to USERS, not timepoints! So
      // `col(t)` is completely wrong! We need to extract the columns
      // corresponding to time `t`. Let's look at `sample_latent_rankings`.
      // For complete data, 1 user = 1 row = 1 timepoint! Wait... for mixture
      // models, see test: `compute_sequentially(mixtures[1:50,])`. `mixtures`
      // has 1 row per user. So `n_timepoints` = 50. At each timepoint, 1 user
      // is processed. So `proposal.proposal.n_cols` = 1. Thus
      // `latent_rankings` has exactly 1 column per timepoint.
      // `num_users_at_t` is always 1! SO WHY DID IT SEGFAULT? Because
      // `col(t)` returns exactly 1 column, `num_users_at_t` is 1. Let's check
      // `start_idx`.
      new_reference.cluster_assignments =
          particle_filters[b_t].cluster_assignments.subvec(t, t);
      new_reference.cluster_probabilities =
          particle_filters[b_t].cluster_probabilities.cols(t, t);
      new_reference.index = uvec(T + 1, fill::zeros);
    }
  } else {
    new_reference.latent_rankings.insert_cols(
        0, particle_filters[b_t].latent_rankings.col(t));
    if (parameters.tau.size() > 1) {
      new_reference.cluster_assignments.insert_rows(
          0, particle_filters[b_t].cluster_assignments.subvec(t, t));
      new_reference.cluster_probabilities.insert_cols(
          0, particle_filters[b_t].cluster_probabilities.cols(t, t));
    }
  }

  new_reference.log_weight(t) = particle_filters[b_t].log_weight(t);
}
```
Comment on lines +1 to +26
```markdown
# Role and Identity
You are an expert C++ and RcppArmadillo developer specializing in Bayesian computation, Sequential Monte Carlo (SMC2), and Particle Markov Chain Monte Carlo (PMCMC) methods. You are working on the `BayesMallowsSMC2` R package.

# Project Context
The `BayesMallowsSMC2` package implements sequential inference for the Bayesian Mallows Model using an SMC2 algorithm. Currently, the package uses a standard Conditional Particle Filter (CPF) for the Particle Gibbs rejuvenation step. This standard approach suffers from path degeneracy because it relies on deterministic forward ancestral tracing, which severely limits the mixing of latent variables for early users.

# Objective
Your goal is to upgrade the rejuvenation step by implementing **Conditional Particle Filter with Independent Backward Simulation (CPF-IBS)**. Because cross-sectional user batches are conditionally independent given the static parameters, the standard O(S^2) backward simulation simplifies exactly to O(S) independent categorical draws from the marginal forward filtering weights.

# Mathematical Specification of CPF-IBS
1. **Forward Pass:** The conditional particle filter processes users at time t = 1 ... T. It computes normalized weights W_t and latent states x_t for particles s = 1 ... S.
   *Crucial Change:* We no longer need to track or store the forward ancestor indices (a_t).
2. **Backward Pass:** Instead of recursive pointer lookups (e.g. b_{t-1} = a_{t-1}^{b_t}), we assemble the new reference trajectory by looping backward from t = T down to 1.
3. **Sampling:** At each timestep t, sample an index b_t in {1, ..., S} independently with probabilities W_t.
4. **Extraction:** Extract the latent variables for time t from the sampled particle b_t.

# Codebase Rules & Architectural Constraints (RcppArmadillo)
1. **Memory Optimization:** Locate the T x S matrix of ancestor variables (likely an `arma::umat a` matrix inside the conditional particle filter / SMC structures) and completely remove it. This will drastically reduce the memory footprint. You only need to track the weights W_t.
2. **Weighted Sampling in C++:** When sampling the index b_t ~ W_t during trajectory assembly, utilize `arma::randi` (with the custom normalized W_t probabilities) to sample b_t independently. Ensure you are using a statistically rigorous weighted sampling function compatible with R's RNG scope so that `set.seed()` from R remains perfectly reproducible.
3. **Clean Integration:** Modify the Rejuvenation functions (Algorithm 4 / S3) so that trajectory assembly happens *after* the forward pass completes, iterating in reverse.
4. **Testing:** Ensure that the R package still compiles (`Rcpp::compileAttributes()`, `devtools::document()`, `devtools::load_all()`) and passes all existing tests (`devtools::test()`).

# Interaction Guidelines
- Before writing code, use search tools to find the exact C++ files handling the Conditional Particle Filter and Rejuvenation steps in the `src/` directory.
- Explain your planned modifications to the C++ logic before executing them.
- Keep performance in mind: cache locality and avoiding deep copies of Armadillo matrices inside the t-loop are critical.
```
```r
    resampler = "systematic")
)
alpha_hat <- weighted.mean(x = as.numeric(mod$alpha), w = mod$importance_weights)
alpha_hat <- weighted.mean(x = as.numeric(mod$alpha), w = mod$importance_weights)
```
Comment on lines +109 to 149
```cpp
if (prior.n_clusters > 1) {
  uvec cluster_assignments =
      particle_filters[conditioned_particle_filter].cluster_assignments;
  uvec cluster_frequencies =
      hist(cluster_assignments, regspace<uvec>(0, prior.n_clusters - 1));

  for (size_t cluster{}; cluster < prior.n_clusters; cluster++) {
    parameters.tau(cluster) = R::rgamma(
        cluster_frequencies(cluster) + prior.cluster_concentration, 1.0);
  }
  parameters.tau = normalise(parameters.tau, 1);
  Particle gibbs_particle(options, this->parameters, pfun);
  gibbs_particle.conditioned_particle_filter = 0;
  gibbs_particle.particle_filters[0] =
      this->particle_filters[this->conditioned_particle_filter];
  gibbs_particle.particle_filters[0].cluster_probabilities = mat{};

  // In standard CPF we trace the lineage of conditioned_particle_filter.
  // In backward sampling, we run a completely unconditioned forward particle
  // filter!
  bool requires_conditional = !options.backward_sampling;
  for (size_t t{}; t < T + 1; t++) {
    gibbs_particle.run_particle_filter(
        t, prior, data, pfun, distfun, resampler,
        options.latent_rank_proposal, requires_conditional);
  }

  this->log_incremental_likelihood =
      gibbs_particle.log_incremental_likelihood;
  this->log_normalized_particle_filter_weights =
      gibbs_particle.log_normalized_particle_filter_weights;
  this->particle_filters = gibbs_particle.particle_filters;
  this->logz = gibbs_particle.logz;
  this->stored_weights = gibbs_particle.stored_weights;

  if (options.backward_sampling) {
    this->assemble_backward_trajectory(T, resampler);
  } else {
    sample_particle_filter();
  }
}
```
Owner (Author):
@copilot can you review the last commits and see if they fix most issues?
Contributor:
@osorensen I've opened a new pull request, #82, to work on those changes. Once the pull request is ready, I'll request review from you.
Commits:
- …, duplicate alpha_hat (Co-authored-by: osorensen <21175639+osorensen@users.noreply.github.com>)
- Fix O(T²) backward trajectory, extend backward sampling to single-cluster models