LUCIDus: LUCID with Multiple Omics Data

LUCIDus implements Latent Unknown Clustering Integrating Multi-Omics Data (LUCID) for joint analysis of exposures, omics, latent clusters, and outcomes. The current package version in this repository is 3.1.0.

Compared with the original single-omics workflow, the current package supports:

  • Early integration: one latent structure for a single omics matrix.
  • Parallel integration: layer-specific latent structures for multiple omics layers.
  • Serial integration: multi-stage workflows that chain early and/or parallel submodels.
  • Model tuning over latent-cluster counts and, for early/parallel models, regularization penalties.
  • Feature selection for exposures and omics means/covariances.
  • Missing-data diagnostics and imputation for incomplete omics matrices.
  • Prediction of latent-cluster membership and outcomes, including g-computation (g_computation = TRUE).
  • Bootstrap inference for early, parallel, and serial models.
  • S3 summaries for all supported model types and Sankey-style plots for early models.

The package builds on the LUCID methods described in the publications listed under Citation.

Integration strategies

LUCIDus now supports three model families with different data structures:

| Model | Z input | K input | Typical use |
| --- | --- | --- | --- |
| "early" | One matrix | Integer or integer vector | Single integrated omics layer |
| "parallel" | List of matrices | Integer vector or list by layer | Multiple omics layers modeled in parallel |
| "serial" | Nested list of stages | Nested list matching Z | Multi-stage biology or late integration pipelines |

For serial models, each stage can itself be an early block (a matrix) or a parallel block (a list of matrices), and K must mirror that topology.
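The mirroring rule can be checked before fitting. The helper below is a minimal base-R sketch, not part of LUCIDus: `k_mirrors_z()` is a hypothetical name, and it assumes every omics block is a plain matrix (not a data frame).

```r
# Hypothetical helper (not a LUCIDus function): TRUE if K has the same
# nesting as Z. Assumes each omics block is a matrix.
k_mirrors_z <- function(Z, K) {
  if (is.matrix(Z)) {
    # Early block: one matrix pairs with a cluster count
    # (or a vector of candidate counts when tuning).
    return(is.numeric(K) && length(K) >= 1)
  }
  if (is.list(Z)) {
    # Parallel block or list of stages: compare element by element.
    if (length(K) != length(Z)) return(FALSE)
    ok <- vapply(
      seq_along(Z),
      function(i) k_mirrors_z(Z[[i]], K[[i]]),
      logical(1)
    )
    return(all(ok))
  }
  FALSE
}

m <- matrix(rnorm(20), nrow = 10)

k_mirrors_z(m, 2)                                         # early
k_mirrors_z(list(m, m, m), c(2, 2, 2))                    # parallel
k_mirrors_z(list(list(m, m), m), list(list(2, 2), 2))     # serial, mixed topology
k_mirrors_z(list(list(m, m), m), list(2, 2))              # mismatch in stage 1
```

The last call returns FALSE because the first stage is a parallel block with two layers while K supplies a single count for it.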

Installation

Install the current CRAN release with:

install.packages("LUCIDus")

Workflow overview

The comprehensive guide in vagnette/lucidus_full_functionality_guide.Rmd organizes the package around a full analysis workflow:

  1. inspect missingness and imputation quality
  2. fit early, parallel, or serial models with estimate_lucid()
  3. tune K and penalties with tune_lucid() or lucid()
  4. summarize fitted models
  5. predict latent clusters and outcomes, including g_computation
  6. quantify uncertainty with boot_lucid()
  7. visualize early-model structure with plot()
  8. use numerical-stability helpers when building robust workflows

Quick start

Early integration

library(LUCIDus)

data(sim_data)

G <- sim_data$G[1:300, , drop = FALSE]
Z <- sim_data$Z[1:300, , drop = FALSE]
Y <- sim_data$Y_normal[1:300]
CoY <- sim_data$Covariate[1:300, , drop = FALSE]

fit_early <- estimate_lucid(
  G = G,
  Z = Z,
  Y = Y,
  CoY = CoY,
  lucid_model = "early",
  family = "normal",
  K = 2,
  seed = 1008
)

summary(fit_early)
plot(fit_early)

Parallel integration

library(LUCIDus)

data(simulated_HELIX_data)

G <- simulated_HELIX_data$exposure
Y <- simulated_HELIX_data$outcome
CoY <- simulated_HELIX_data$covariate
omics <- simulated_HELIX_data$omics

Z_parallel <- list(
  methylomics = omics[, 1:10, drop = FALSE],
  transcriptomics = omics[, 11:20, drop = FALSE],
  miRNA = omics[, 21:30, drop = FALSE]
)

fit_parallel <- estimate_lucid(
  G = G,
  Z = Z_parallel,
  Y = Y,
  CoY = CoY,
  lucid_model = "parallel",
  family = "normal",
  K = c(2, 2, 2),
  seed = 1008
)

summary(fit_parallel)

Serial integration with mixed topology

library(LUCIDus)

data(sim_data)

G <- sim_data$G[1:200, , drop = FALSE]
Y <- sim_data$Y_normal[1:200]

Z_stage1 <- list(
  layer1 = sim_data$Z[1:200, 1:5, drop = FALSE],
  layer2 = sim_data$Z[1:200, 6:10, drop = FALSE]
)
Z_stage2 <- sim_data$Z[1:200, 1:4, drop = FALSE]

fit_serial <- estimate_lucid(
  G = G,
  Z = list(Z_stage1, Z_stage2),
  Y = Y,
  lucid_model = "serial",
  family = "normal",
  K = list(list(2, 2), 2),
  seed = 1008
)

summary(fit_serial)

Main functionality

| Task | Main API |
| --- | --- |
| Fit a model directly | estimate_lucid() |
| Tune K and penalties | tune_lucid() |
| Fit or auto-tune in one step | lucid() |
| Summarize fitted objects | summary(), summary_lucid() |
| Predict clusters and outcomes | predict_lucid() |
| Bootstrap confidence intervals | boot_lucid() |
| Diagnose missingness | analyze_missing_pattern(), check_na() |
| Impute missing omics values | safe_impute(), fill_data() |
| Check imputation quality | check_imputation_quality() |
| Numerical stability helpers | safe_log_sum_exp(), safe_normalize(), safe_solve(), check_and_stabilize_sigma(), check_convergence() |
| Visualize fitted early models | plot() |

The guide also demonstrates these workflows across continuous and binary outcomes, missing-data settings, and multi-stage serial topologies.

Full workflow patterns

Direct fitting with estimate_lucid()

Use estimate_lucid() when you already know the model topology and cluster counts. This is the lowest-level exported fitting API and supports:

  • supervised and unsupervised fits
  • optional CoG and CoY covariates
  • normal and binary outcomes
  • early, parallel, and serial model structures
  • missing-data handling during fitting

For development and diagnostics, the guide also uses verbose = TRUE to print iteration-level fitting traces:

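# Using G, Z, and Y from the early integration example above.
# The small iteration limits keep the demo fast; increase them for real analyses.
fit_verbose <- estimate_lucid(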
  G = G,
  Z = Z,
  Y = Y,
  lucid_model = "early",
  family = "normal",
  K = 2,
  max_itr = 2,
  max_tot.itr = 8,
  verbose = TRUE
)

Tuning with tune_lucid() and lucid()

The guide treats tuning as a first-class workflow rather than an optional extra:

  • tune_lucid() explicitly evaluates a grid of K and penalty values
  • lucid() is the higher-level wrapper that fits directly or auto-tunes depending on whether K and penalty inputs are scalars or vectors
  • penalty tuning is available for "early" and "parallel"
  • serial models currently accept scalar penalties, while still supporting topology selection through K

Tuning, prediction, and inference

Tune over K and penalties

# Using G, Z, and Y from the early integration example above
tuned <- tune_lucid(
  G = G,
  Z = Z,
  Y = Y,
  lucid_model = "early",
  family = "normal",
  K = 2:3,
  Rho_G = c(0, 0.1),
  Rho_Z_Mu = c(0, 5),
  Rho_Z_Cov = c(0, 0.1),
  seed = 1008
)

The lucid() wrapper uses the same inputs and will automatically dispatch to tuning when K or penalty arguments are vectors. Penalty tuning is currently supported for "early" and "parallel"; "serial" accepts scalar penalty inputs.
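To get a feel for the cost of a tuning run, it can help to count the candidate settings first. The base-R sketch below assumes that tuning evaluates every combination of the supplied K and penalty values (a full factorial grid); that is an assumption about how the grid is formed, not a statement of the package's internals.

```r
# Candidate values copied from the tune_lucid() call above.
# Assumption: every combination is evaluated (full factorial grid).
grid <- expand.grid(
  K = 2:3,
  Rho_G = c(0, 0.1),
  Rho_Z_Mu = c(0, 5),
  Rho_Z_Cov = c(0, 0.1)
)

nrow(grid)   # number of candidate settings under this assumption
head(grid)
```

Under this assumption the example grid contains 2 x 2 x 2 x 2 = 16 settings, so each extra candidate value for one argument multiplies the number of model fits.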

Structured summaries

The summary methods are richer than a simple coefficient printout. As shown in the guide, summary() returns model information, fit statistics, feature-selection summaries, parameter tables, missing-data summaries, and optionally bootstrap confidence intervals.

s <- summary(fit_early)

After running boot_lucid(), you can also attach bootstrap confidence intervals with summary(fit_early, boot.se = boot_fit).

Prediction and g-computation

# Using fit_early, G, Z, and Y from the early integration example above
pred <- predict_lucid(
  model = fit_early,
  G = G,
  Z = Z,
  Y = Y,
  lucid_model = "early"
)

pred_g <- predict_lucid(
  model = fit_early,
  G = G,
  Z = NULL,
  Y = NULL,
  lucid_model = "early",
  g_computation = TRUE
)

predict_lucid() supports early, parallel, and serial models. In g_computation = TRUE mode, it uses the fitted G -> X path to generate cluster and outcome predictions under modified exposures.
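The idea behind the G -> X path can be illustrated without the package. In LUCID, cluster membership given exposures follows a multinomial logistic (softmax) model. The base-R sketch below uses made-up coefficients to show how cluster probabilities shift when an exposure is modified; `beta`, `cluster_prob()`, and the toy exposure matrices are hypothetical and are not extracted from a fitted LUCIDus object.

```r
# Hypothetical coefficients for K = 2 clusters and 2 exposures.
# Cluster 1 is the reference, so its row is zero. Illustrative numbers only.
beta <- rbind(
  cluster1 = c(intercept = 0,    g1 = 0,   g2 = 0),
  cluster2 = c(intercept = -0.5, g1 = 1.2, g2 = 0.4)
)

# Softmax of the linear predictor: one row per subject, one column per cluster.
cluster_prob <- function(G, beta) {
  eta <- cbind(1, G) %*% t(beta)
  eta <- eta - apply(eta, 1, max)   # subtract each row's maximum for stability
  exp(eta) / rowSums(exp(eta))
}

G_obs <- rbind(c(0, 0), c(1, 0))    # observed exposures for two subjects
G_mod <- G_obs
G_mod[, 1] <- 1                     # modified scenario: first exposure set to 1

p_obs <- cluster_prob(G_obs, beta)
p_mod <- cluster_prob(G_mod, beta)

p_obs
p_mod
```

Comparing `p_obs` with `p_mod` shows the shift in predicted cluster membership that is attributable to the exposure change alone, which is the quantity g-computation propagates to the outcome.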

Bootstrap inference

# Using fit_early, G, Z, and Y from the early integration example above
boot_fit <- boot_lucid(
  G = G,
  Z = Z,
  Y = Y,
  lucid_model = "early",
  model = fit_early,
  R = 20,
  conf = 0.95
)

summary(fit_early, boot.se = boot_fit)

Bootstrap inference is available for early, parallel, and serial models. If a fitted model uses nonzero penalties, boot_lucid() will refit a zero-penalty version internally because bootstrap CIs are derived from unpenalized fits. If feature selection has removed variables, refit the reduced model before bootstrapping.
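For readers new to the bootstrap, the roles of R and conf can be seen in a generic base-R sketch. It resamples a toy vector and forms a percentile interval for the mean; this illustrates the general resampling idea only and makes no claim about the statistic or interval type that boot_lucid() computes.

```r
set.seed(1008)
y <- rnorm(100, mean = 1)   # toy data, not from LUCIDus

R <- 200      # number of bootstrap replicates
conf <- 0.95  # confidence level

# Re-estimate the statistic on R resamples drawn with replacement.
boot_est <- replicate(R, mean(sample(y, replace = TRUE)))

# Percentile interval: cut off (1 - conf) / 2 in each tail.
alpha <- 1 - conf
ci <- quantile(boot_est, probs = c(alpha / 2, 1 - alpha / 2))

mean(y)
ci
```

With few replicates the tail quantiles are estimated from only a handful of resamples, which is one reason to use a larger R in real analyses than the small value used for the quick example above.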

Missing-data utilities

The package now includes dedicated helpers for incomplete omics data:

# Using Z from the early integration example above
Z_miss <- Z
Z_miss[1, ] <- NA
Z_miss[2:4, 1] <- NA

miss_pattern <- analyze_missing_pattern(Z_miss)
na_summary <- check_na(Z_miss, lucid_model = "early")
Z_imp <- safe_impute(Z_miss, method = "mean")
imp_quality <- check_imputation_quality(Z_miss, Z_imp)

These helpers are useful both before fitting and when validating simple imputations outside the model-based missing-data workflow.

The guide uses them in three roles:

  • pre-fit diagnostics with analyze_missing_pattern() and check_na()
  • simple standalone imputation with safe_impute()
  • low-level likelihood-based filling via fill_data()
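What the diagnostics and a simple mean imputation amount to can be shown on a toy matrix in base R. This is a sketch of the general idea, assuming that mean imputation fills each missing entry with its column mean; it does not reproduce the package helpers' implementation or return values.

```r
# Toy omics matrix: 4 samples, 2 features.
Z_toy <- matrix(c(1, 2, 3, 4,
                  10, 20, 30, 40), ncol = 2)
Z_toy[1, ] <- NA     # sample 1 has no omics measurements at all
Z_toy[2, 1] <- NA    # sample 2 is missing a single feature

# Per-sample missing fraction: 1 = fully missing row, in (0, 1) = sporadic.
row_missing <- rowMeans(is.na(Z_toy))

# Column-mean imputation of every missing entry.
col_means <- colMeans(Z_toy, na.rm = TRUE)
Z_filled <- Z_toy
idx <- which(is.na(Z_toy), arr.ind = TRUE)
Z_filled[idx] <- col_means[idx[, "col"]]

row_missing
Z_filled
```

Note that a fully missing row is filled with the column means alone, so it carries no sample-specific information; distinguishing such rows from sporadic missingness before fitting is the point of the pre-fit diagnostics.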

Binary outcomes

The guide includes a dedicated binary-outcome example. LUCIDus supports:

  • family = "normal" for continuous outcomes
  • family = "binary" for 0/1 outcomes
  • predict_lucid(..., response = TRUE) for class labels
  • predict_lucid(..., response = FALSE) for predicted probabilities

# Using G and Z from the early integration example above
fit_early_binary <- estimate_lucid(
  G = G,
  Z = Z,
  Y = sim_data$Y_binary[1:300],
  lucid_model = "early",
  family = "binary",
  K = 2,
  seed = 1008
)

pred_binary_prob <- predict_lucid(
  model = fit_early_binary,
  G = G,
  Z = Z,
  Y = sim_data$Y_binary[1:300],
  lucid_model = "early",
  response = FALSE
)

Visualization and stability helpers

The guide covers both plotting and the lower-level numerical helpers that support more stable workflows.

  • plot() currently provides Sankey-style visualization for early models
  • parallel and serial plotting methods are present but still under development
  • safe_log_sum_exp(), safe_normalize(), safe_solve(), check_and_stabilize_sigma(), and check_convergence() help with numerically fragile workflows

# Using fit_early from the early integration example above
plot(fit_early)

safe_log_sum_exp(c(-1000, -1001, -999))
safe_normalize(c(1e-300, 2, 3))
sigma_stable <- check_and_stabilize_sigma(matrix(c(1, 0.999999, 0.999999, 0.999998), 2))
sigma_inv <- safe_solve(sigma_stable)
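Why a helper is needed for inputs such as c(-1000, -1001, -999) can be seen in base R. The sketch below contrasts the naive computation with the standard max-shift log-sum-exp trick; `log_sum_exp()` is a hypothetical stand-in that illustrates the general technique and is not the implementation of safe_log_sum_exp().

```r
x <- c(-1000, -1001, -999)

# Naive version: exp() underflows to 0 for every element, so the log is -Inf.
naive <- log(sum(exp(x)))

# Max-shift trick: factor out the largest term before exponentiating.
# Hypothetical helper for illustration only.
log_sum_exp <- function(x) {
  m <- max(x)
  m + log(sum(exp(x - m)))
}

stable <- log_sum_exp(x)

naive    # -Inf
stable   # finite, about -998.59
```

The shifted form is mathematically identical to the naive one, but the largest exponent becomes exp(0) = 1, so the sum can no longer underflow to zero.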

Practical notes from the full guide

The full functionality guide is intentionally lightweight so examples run quickly. For real analyses, it recommends increasing:

  • max_itr and max_tot.itr
  • bootstrap replicates R
  • the size of the tuning grid

For serial pipelines in particular, the guide recommends closely reviewing stage-wise missingness summaries and fitted submodels.

Documentation and tutorials

For fuller, end-to-end examples, see the comprehensive guide in vagnette/lucidus_full_functionality_guide.Rmd. It is the best reference for the current package surface and covers:

  • missing-data diagnostics and imputation helpers
  • direct fitting with estimate_lucid()
  • wrapper workflows with tune_lucid() and lucid()
  • verbose fitting demos
  • summaries, prediction, and g_computation
  • bootstrap inference
  • plotting and numerical stability helpers
  • continuous and binary outcome examples

Citation

If you use LUCIDus, please cite:

  1. Peng C., Wang J., Asante I., Louie S., Jin R., Chatzi L., Casey G., Thomas D. C., Conti D. V. (2020). A latent unknown clustering integrating multi-omics data (LUCID) with phenotypic traits. Bioinformatics.
  2. Zhao Y., Jia Q., Goodrich J. A., Conti D. V. (2024). LUCIDus: An R Package for Implementing Latent Unknown Clustering by Integrating Multi-omics Data (LUCID) With Phenotypic Traits. The R Journal.
  3. Zhao Y., Jia Q., Goodrich J., Darst B., Conti D. V. (2024). An extension of latent unknown clustering integrating multi-omics data (LUCID) incorporating incomplete omics data. Bioinformatics Advances.