% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/calc_one_v_rest_auc.R
\name{calc_one_v_rest_auc}
\alias{calc_one_v_rest_auc}
\title{Calculating area under the Precision-Recall curve (PRC) and
Receiver Operating Characteristic curve (ROC) for all one-vs-rest
comparisons in the fitted model}
\usage{
calc_one_v_rest_auc(
fit = NULL,
Xnew = NULL,
Ynew = NULL,
normalize_rows = NULL,
measure = c("PRC", "ROC"),
fitted_prob = NULL,
include_baseline = TRUE,
...
)
}
\arguments{
\item{fit}{fitted hidden genome classifier object. Experimental: may be NULL, in which case
both \code{fitted_prob} and \code{Ynew} must be provided.}
\item{Xnew, Ynew}{New predictor design matrix and corresponding cancer site labels. If provided,
the trained hidden genome model (supplied through \code{fit}) is used to obtain
predicted probabilities for \code{Xnew}, and the resulting
probabilities are used as \code{fitted_prob}, together with \code{Ynew}, to
calculate the AUCs. If \code{Xnew} is supplied, then \code{Ynew} must also
be supplied. If \code{fitted_prob} is supplied, then \code{Xnew} is ignored.}
\item{normalize_rows}{vector of the same length as \code{nrow(Xnew)} to be used
to normalize the rows of \code{Xnew}. If NULL (default), no normalization is performed.}
\item{measure}{Type of curve to use. Options are "PRC" (Precision-Recall curve) and
"ROC" (Receiver Operating Characteristic curve). Can be a vector containing both.}
\item{fitted_prob}{an n_tumor x n_cancer matrix of predicted classification probabilities
(corresponding to the "true" class labels provided in \code{Ynew}, if supplied, or
to the original training Y labels stored in the trained model) to use for calculating ROC/PRC AUCs.
Here n_tumor denotes the number of tumor/sample units,
and n_cancer the number of cancer sites in the fitted hidden genome model (supplied
through \code{fit}). Row names and column names must
be identical to the tumor/sample names and cancer labels in \code{Ynew} (if supplied) or
as used in the fitted model. If \code{NULL}
(default), the fitted probabilities are obtained from the model itself, either by
extracting pre-validated
predictive probabilities (only available for mlogit models)
or by simply using the fitted model to
make predictions on the training set.}
\item{include_baseline}{logical. Should the null baseline value(s) of the measure(s)
be returned alongside the computed \emph{observed} value(s)? Here the null baseline refers to
the expected value of the corresponding measure for a "baseline" classifier that assigns
class labels to the sample units uniformly at random.}
}
\value{
Returns a data.table with \code{length(measure) + 1} columns
("Class" plus one column per measure; \code{2 * length(measure) + 1} columns if
\code{include_baseline = TRUE}) and n_class + 1 rows, where n_class
denotes the number of cancer types present in the fitted model; the
final row provides the Macro (average) metrics.
}
\description{
Calculating area under the Precision-Recall curve (PRC) and
Receiver Operating Characteristic curve (ROC) for all one-vs-rest
comparisons in the fitted model
}
\details{
Under the hood, the function uses several functions from the R package \code{precrec}
to compute the performance
metrics. The argument \code{fitted_prob}, when supplied, should ideally
contain predictive probabilities for training set tumors evaluated under a
cross-validation framework. If it is not supplied, pre-validated
prediction probabilities are extracted from mlogit models; for all other models,
over-optimistic prediction probabilities (obtained by simply applying the fitted
model to the training data) are used.
}
\note{
The function uses the package \pkg{precrec} under the hood to compute the AUCs.
Please install \pkg{precrec} before using \code{calc_one_v_rest_auc}.
}
\examples{
data("impact")
top_v <- variant_screen_mi(
maf = impact,
variant_col = "Variant",
cancer_col = "CANCER_SITE",
sample_id_col = "patient_id",
mi_rank_thresh = 50,
return_prob_mi = FALSE
)
var_design <- extract_design(
maf = impact,
variant_col = "Variant",
sample_id_col = "patient_id",
variant_subset = top_v
)
canc_resp <- extract_cancer_response(
maf = impact,
cancer_col = "CANCER_SITE",
sample_id_col = "patient_id"
)
pid <- names(canc_resp)
# create five stratified random folds
# based on the response cancer categories
set.seed(42)
folds <- data.table::data.table(
resp = canc_resp
)[,
foldid := sample(rep(1:5, length.out = .N)),
by = resp
]$foldid
# 80\%-20\% stratified separation of training and
# test set tumors
idx_train <- pid[folds != 5]
idx_test <- pid[folds == 5]
# train a classifier on the training set
# using only variants (will have low accuracy --
# no meta-feature information is used)
fit0 <- fit_mlogit(
X = var_design[idx_train, ],
Y = canc_resp[idx_train]
)
calc_one_v_rest_auc(fit0)
calc_one_v_rest_auc(fit0, measure = "PRC")
calc_one_v_rest_auc(fit0, measure = "ROC")
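# A sketch of held-out evaluation: supplying Xnew and Ynew
# scores the model on the stratified test set created above
# instead of the (over-optimistic) training data; passing
# both measures returns PRC and ROC AUC columns together
calc_one_v_rest_auc(
  fit0,
  Xnew = var_design[idx_test, ],
  Ynew = canc_resp[idx_test],
  measure = c("PRC", "ROC")
)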
}