❓ what make a topic model (i.e. Latent Dirichlet Allocation)
📚 its general application
🎗️ Its use in cancer research
📊 data, modeling, software
🏥 medical implications
Signatures of Mutational Processes in Human Cancer
[0, 1]
sample | chr | position | ref | alt |
---|---|---|---|---|
sample1 | chr1 | 100 | A | C |
sample1 | chr2 | 100 | G | T |
sample2 | chr1 | 300 | T | C |
sample3 | chr3 | 400 | T | C |
sample | chr | position | ref | alt |
---|---|---|---|---|
sample1 | chr1 | 100 | A | C |
sample1 | chr2 | 100 | G | T |
sample2 | chr1 | 300 | T | C |
sample3 | chr3 | 400 | T | C |
Yang et al. HiLDA: a statistical approach to investigate differences in mutational signatures, 2019
Are they different?
fractions pk and concentration parameters αk p1,⋯,pK∼Dirichlet(α1,⋯,αK) μk=αk∑kαk
To capture the group difference for signature i, Δi=μ(1)i−μ(2)i What we'd like to know, H0:Δi=0,i=1,2,...,K
Why don't we just compare the estimated values?
Δi=^μ(1)i−^μ(2)i
My Journey From Frequentist to Bayesian Statistics
When there is WinBUGS already, why is JAGS necessary?
install.packages("R2jags")library(R2jags)
The focus of the analysis presented here is on accurately estimating the mean of IQ using simulated data.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4158865/
https://towardsdatascience.com/visualizing-beta-distribution-7391c18031f1
We hold meetings every month and post it on meetup.com, which tells us how many people've signed up, n. But not everyone show up (i.e., traffic). So, we'd like to know what percentage of people, θ, will show up to order enough pizzas.
Yi|θ∼Binom(ni,θ)
θ∼Beta(a,b)
Weak prior:
Beta(1,1)↔Unif(0,1)
Y <- c(77, 62, 30)N <- c(39, 29, 16)
But as you know, someone (@lukanegoita) said
Yi|θ∼Binom(ni,θ)
θ∼Beta(a,b)
In R
model{for(i in 1:3){ Y[i] <- rbinom(1, N[i], theta) } theta <- rbeta(1, 1, 1)}
In JAGS
model{for(i in 1:3){ Y[i] ~ dbinom(theta, N[i])} theta ~ dbeta(1, 1)}
dist
instead of rdist
for random samplinglibrary(R2jags)model_string<-"model{# Likelihoodfor(i in 1:3){ Y[i] ~ dbinom(theta, N[i])}# Prior theta ~ dbeta(1, 1)}"# DataN <- c(77, 62, 30)Y <- c(39, 29, 16)# specify parameters to be moniteredparams <- c("theta")
R2jags
and runjags
are just wrappers of rjagsrunjags
makes it easy to run chains in parallel on multiple processesR2jags
allows running until converged and paralleling MCMC chainsR2jags
output <- jags(model.file = textConnection(model_string), data = list(Y = Y, N = N), parameters.to.save = params, n.chains = 2, n.iter = 2000)
## Compiling model graph## Resolving undeclared variables## Allocating nodes## Graph information:## Observed stochastic nodes: 3## Unobserved stochastic nodes: 1## Total graph size: 8## ## Initializing model## ## | | | 0% | |++ | 4% | |++++ | 8% | |++++++ | 12% | |++++++++ | 16% | |++++++++++ | 20% | |++++++++++++ | 24% | |++++++++++++++ | 28% | |++++++++++++++++ | 32% | |++++++++++++++++++ | 36% | |++++++++++++++++++++ | 40% | |++++++++++++++++++++++ | 44% | |++++++++++++++++++++++++ | 48% | |++++++++++++++++++++++++++ | 52% | |++++++++++++++++++++++++++++ | 56% | |++++++++++++++++++++++++++++++ | 60% | |++++++++++++++++++++++++++++++++ | 64% | |++++++++++++++++++++++++++++++++++ | 68% | |++++++++++++++++++++++++++++++++++++ | 72% | |++++++++++++++++++++++++++++++++++++++ | 76% | |++++++++++++++++++++++++++++++++++++++++ | 80% | |++++++++++++++++++++++++++++++++++++++++++ | 84% | |++++++++++++++++++++++++++++++++++++++++++++ | 88% | |++++++++++++++++++++++++++++++++++++++++++++++ | 92% | |++++++++++++++++++++++++++++++++++++++++++++++++ | 96% | |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%## | | | 0% | |** | 4% | |**** | 8% | |****** | 12% | |******** | 16% | |********** | 20% | |************ | 24% | |************** | 28% | |**************** | 32% | |****************** | 36% | |******************** | 40% | |********************** | 44% | |************************ | 48% | |************************** | 52% | |**************************** | 56% | |****************************** | 60% | |******************************** | 64% | |********************************** | 68% | |************************************ | 72% | |************************************** | 76% | |**************************************** | 80% | |****************************************** | 84% | |******************************************** | 88% | |********************************************** | 92% | |************************************************ | 96% | |**************************************************| 100%
## Compiling model graph## Resolving undeclared variables## Allocating nodes## Graph information:## Observed stochastic nodes: 3## Unobserved stochastic nodes: 1## Total graph size: 9## ## Initializing model## ## | | | 0% | |++ | 4% | |++++ | 8% | |++++++ | 12% | |++++++++ | 16% | |++++++++++ | 20% | |++++++++++++ | 24% | |++++++++++++++ | 28% | |++++++++++++++++ | 32% | |++++++++++++++++++ | 36% | |++++++++++++++++++++ | 40% | |++++++++++++++++++++++ | 44% | |++++++++++++++++++++++++ | 48% | |++++++++++++++++++++++++++ | 52% | |++++++++++++++++++++++++++++ | 56% | |++++++++++++++++++++++++++++++ | 60% | |++++++++++++++++++++++++++++++++ | 64% | |++++++++++++++++++++++++++++++++++ | 68% | |++++++++++++++++++++++++++++++++++++ | 72% | |++++++++++++++++++++++++++++++++++++++ | 76% | |++++++++++++++++++++++++++++++++++++++++ | 80% | |++++++++++++++++++++++++++++++++++++++++++ | 84% | |++++++++++++++++++++++++++++++++++++++++++++ | 88% | |++++++++++++++++++++++++++++++++++++++++++++++ | 92% | |++++++++++++++++++++++++++++++++++++++++++++++++ | 96% | |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%## | | | 0% | |** | 4% | |**** | 8% | |****** | 12% | |******** | 16% | |********** | 20% | |************ | 24% | |************** | 28% | |**************** | 32% | |****************** | 36% | |******************** | 40% | |********************** | 44% | |************************ | 48% | |************************** | 52% | |**************************** | 56% | |****************************** | 60% | |******************************** | 64% | |********************************** | 68% | |************************************ | 72% | |************************************** | 76% | |**************************************** | 80% | |****************************************** | 84% | |******************************************** | 88% | |********************************************** | 92% | |************************************************ | 96% | |**************************************************| 100%
plot(as.mcmc(output))
library(mcmcplots)mcmcplot(as.mcmc(output),c("theta"))denplot(as.mcmc(output),c("theta"))
So it might take a longer time for run
output$BUGSoutput$summary
## mean sd 2.5% 25% 50% 75%## deviance 14.6755846 1.3992791 13.6488245 13.7614902 14.1370965 15.0345916## theta 0.4973384 0.0388237 0.4187856 0.4710636 0.4970235 0.5249785## 97.5% Rhat n.eff## deviance 18.7558006 1.017801 220## theta 0.5720824 1.003696 2000
If it fails to converge, we can do the following,
update(output, n.iter = 1000)autojags(output, n.iter = 1000, Rhat = 1.05, n.update = 2)
output$BUGSoutput$summary
## mean sd 2.5% 25% 50% 75%## deviance 14.6755846 1.3992791 13.6488245 13.7614902 14.1370965 15.0345916## theta 0.4973384 0.0388237 0.4187856 0.4710636 0.4970235 0.5249785## 97.5% Rhat n.eff## deviance 18.7558006 1.017801 220## theta 0.5720824 1.003696 2000
output2$BUGSoutput$summary
## mean sd 2.5% 25% 50% 75%## deviance 15.2841452 1.77163845 13.6509996 13.9603596 14.6893500 16.0268236## theta 0.5357841 0.02994589 0.4754806 0.5161242 0.5361084 0.5560317## 97.5% Rhat n.eff## deviance 20.0407688 1.005639 490## theta 0.5934454 1.002685 690
denplot(as.mcmc(output), parms = "theta", xlim = c(0.3, 0.7))
denplot(as.mcmc(output2), parms = "theta", xlim = c(0.3, 0.7))
In JAGS
model<-"model{# Likelihoodfor(i in 1:N){ y[i] ~ dnorm(mu[i], tau) mu[i] <- beta0 + beta1*x[i]}# Prior # intercept beta0 ~ dnorm(0, 0.0001) # slope beta1 ~ dnorm(0, 0.0001) # error term tau ~ dgamma(0.01, 0.01) sigma <- 1/sqrt(tau)}"lm_output <- jags(model.file = textConnection(model), data = list(y = y, x = x, N=15), parameters.to.save = c("beta0","beta1", "sigma"), n.chains = 2, n.iter = 2000, progress.bar ="none")
In R
lm(y~x)
In statistics 101
Yi=b0+b1Xi+σ Y∼N(b0+b1X,σ2)
## Compiling model graph## Resolving undeclared variables## Allocating nodes## Graph information:## Observed stochastic nodes: 15## Unobserved stochastic nodes: 3## Total graph size: 70## ## Initializing model
In JAGS
lm_output$BUGSoutput$summary[1:2, c(3,7)]
## 2.5% 97.5%## beta0 -8.853992 52.131691## beta1 -16.496277 3.192873
95% credible interval
In R
confint(lm(y~x))
## 2.5 % 97.5 %## (Intercept) -9.32782 51.322335## x -16.19329 3.486528
95% confidence interval
In runjags
model <- template.jags( y ~ x, data, n.chains=2, family='gaussian')
In R
lm(y~x)
RJAGS: how to get started by Rens van de Schoot, retweeted by Martyn Plummer
"perky ears" + "floppy ears" ?= "half-up ears"
Slides: https://lda-mutations.netlify.com/
@zhiiiyang zhiiiyang zyang895@gmail.com
Slides created via the R package xaringan and xaringanthemer
❓ what make a topic model (i.e. Latent Dirichlet Allocation)
📚 its general application
🎗️ Its use in cancer research
📊 data, modeling, software
🏥 medical implications
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |