+ - 0:00:00
Notes for current slide
Notes for next slide

Extracting topics 📚 from

🎗cancer patients’ mutational profiles

Zhi Yang

2019-07-30

1 / 56

About this talk

  • For those who've never heard of it

    ❓ what make a topic model (i.e. Latent Dirichlet Allocation)

    📚 its general application

  • For those who know about it

    🎗️ Its use in cancer research

    📊 data, modeling, software

    🏥 medical implications

2 / 56

We usually extract topics from:

books, customer reviews, tweets, ...

scientific journals, medical records, ...

what if they can't be "read"?

3 / 56

We usually extract topics from:

books, customer reviews, tweets, ...

scientific journals, medical records, ...

what if they can't be "read"?

3 / 56

Why studying somatic mutations?

4 / 56

What do we need computers for?

Volume

Velocity

Variety

Variability

5 / 56

Data sources

10,952 exomes

1,048 whole-genomes

40 distinct types of human cancer

The Cancer Genome Atlas (TCGA),the International Cancer Genome Consortium (ICGC), data from peer-reviewed journals

Signatures of Mutational Processes in Human Cancer

6 / 56

What are in a topic model?

documents

topics

words

7 / 56

What are in a topic model?

documents

topics --> latent

words

8 / 56

In a single document

  • every document is unique

  • each topic has different fractions, [0, 1]

9 / 56

In a topic

  • Topics are consistent across documents

  • choosing the number of topics could be a headache 😟

10 / 56

The hierarchical structure

  • a single document contains topics with weights

  • a topic is a unique distribution of words

11 / 56

LDA Topic Modeling on emojis

https://lettier.com/projects/lda-topic-modeling/

12 / 56

Data collection


Tissue ➡Sequencing ➡ input

13 / 56

Data collection


Tissue ➡Sequencing ➡ input

sample chr position ref alt
sample1 chr1 100 A C
sample1 chr2 100 G T
sample2 chr1 300 T C
sample3 chr3 400 T C
13 / 56

Data collection


Tissue ➡Sequencing ➡ input

sample chr position ref alt
sample1 chr1 100 A C
sample1 chr2 100 G T
sample2 chr1 300 T C
sample3 chr3 400 T C

? ? A>C ? ?

G T A>C T C

13 / 56

Where are those "words"?

15 / 56

Hierarchical topic model

documents

topics --> weights

words

22 / 56

After applying the topic model,

Are they different?

23 / 56

Statistical inference

24 / 56

Group differences

fractions pk and concentration parameters αk p1,,pKDirichlet(α1,,αK) μk=αkkαk

To capture the group difference for signature i, Δi=μi(1)μi(2) What we'd like to know, H0:Δi=0,i=1,2,...,K


Why don't we just compare the estimated values?

Δi=μ^i(1)μ^i(2)

25 / 56

A mini GOTB family tree

27 / 56

Why Bayesian?

28 / 56

Why Bayesian?

You learn from the data

28 / 56

Why Bayesian?

You learn from the data

You know the uncertainty of the estimated parameters

My Journey From Frequentist to Bayesian Statistics

28 / 56

JAGS

stands for

Just Another Gibbs Sampler

developed by Martyn Plummer

29 / 56

Why JAGS?

why not WinBUGS 🐞 or Stan?

30 / 56

JAGS vs. WinBUGS

WinBUGS is historically very important with slow development

31 / 56

JAGS vs. WinBUGS

WinBUGS is historically very important with slow development

JAGS has very similar syntax to WinBUGS and it allows cross-platform

31 / 56

JAGS vs. WinBUGS

WinBUGS is historically very important with slow development

JAGS has very similar syntax to WinBUGS and it allows cross-platform

Great interface to R! (rjags, R2jags,runjags)

When there is WinBUGS already, why is JAGS necessary?

31 / 56

JAGS

  • longer history with tons of resources 📕📝🎓
  • easy to learn and run it
  • need to recompile model
  • fewer developers
  • less viz tools
  • various MCMC sampling methods

Stan

  • A newer tool with great team/active community
  • steeper learning curve
  • no need to recompile
  • Syntax check in R 😱
  • ShinyStan (applicable to JAGS)
  • require fewer iterations to converge

Which one's faster? Inconclusive

My situation Stan doesn't support discrete sampling

32 / 56

Why me?

I wrote a part of my dissertation with JAGS

I've repeatedly read through JAGS, WinBUGS manual and numerous journals

I've four years' experiences with searching JAGS examples in Google and GitHub

33 / 56

Installation

Install R

Install RStudio

Install the JAGS from https://sourceforge.net/projects/mcmc-jags/

Install the R2jags package from CRAN:

install.packages("R2jags")
library(R2jags)
34 / 56

Shiny Demo

The focus of the analysis presented here is on accurately estimating the mean of IQ using simulated data.

FBI: First Bayesian Inference

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4158865/

35 / 56

Let's get started!

We hold meetings every month and post it on meetup.com, which tells us how many people've signed up, n. But not everyone show up (i.e., traffic). So, we'd like to know what percentage of people, θ, will show up to order enough pizzas.

Yi|θBinom(ni,θ)

θBeta(a,b)

Weak prior:

Beta(1,1)Unif(0,1)

Y <- c(77, 62, 30)
N <- c(39, 29, 16)
37 / 56

Write the model

But as you know, someone (@lukanegoita) said

38 / 56

Write the model

Yi|θBinom(ni,θ)

θBeta(a,b)

In R

model{
for(i in 1:3){
Y[i] <- rbinom(1, N[i], theta)
}
theta <- rbeta(1, 1, 1)
}

In JAGS

model{
for(i in 1:3){
Y[i] ~ dbinom(theta, N[i])
}
theta ~ dbeta(1, 1)
}
  • A variable should appear only once on the left hand side
  • A variable may appear multiple times on the right hand side
  • variables that only appear on the right must be supplied as data
39 / 56

Tips on syntax

No need for variable declaration

No need for semi-colons at end of statements

use ddist instead of rdist for random sampling

dnorm(mean, precision) not dnorm(mean, sd)

It allows matrix and array

It is illegal to offer definitions of the variable twice

40 / 56

Write the model

library(R2jags)
model_string<-"model{
# Likelihood
for(i in 1:3){
Y[i] ~ dbinom(theta, N[i])
}
# Prior
theta ~ dbeta(1, 1)
}
"
# Data
N <- c(77, 62, 30)
Y <- c(39, 29, 16)
# specify parameters to be monitered
params <- c("theta")
41 / 56

rjags vs. runjags vs. R2jags

R2jags and runjags are just wrappers of rjags

runjags makes it easy to run chains in parallel on multiple processes

R2jags allows running until converged and paralleling MCMC chains

Otherwise, they all have very similar syntax and pre-processing steps

42 / 56

Choosing R2jags

output <- jags(model.file = textConnection(model_string),
data = list(Y = Y, N = N),
parameters.to.save = params,
n.chains = 2, n.iter = 2000)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 3
## Unobserved stochastic nodes: 1
## Total graph size: 8
##
## Initializing model
##
##
|
| | 0%
|
|++ | 4%
|
|++++ | 8%
|
|++++++ | 12%
|
|++++++++ | 16%
|
|++++++++++ | 20%
|
|++++++++++++ | 24%
|
|++++++++++++++ | 28%
|
|++++++++++++++++ | 32%
|
|++++++++++++++++++ | 36%
|
|++++++++++++++++++++ | 40%
|
|++++++++++++++++++++++ | 44%
|
|++++++++++++++++++++++++ | 48%
|
|++++++++++++++++++++++++++ | 52%
|
|++++++++++++++++++++++++++++ | 56%
|
|++++++++++++++++++++++++++++++ | 60%
|
|++++++++++++++++++++++++++++++++ | 64%
|
|++++++++++++++++++++++++++++++++++ | 68%
|
|++++++++++++++++++++++++++++++++++++ | 72%
|
|++++++++++++++++++++++++++++++++++++++ | 76%
|
|++++++++++++++++++++++++++++++++++++++++ | 80%
|
|++++++++++++++++++++++++++++++++++++++++++ | 84%
|
|++++++++++++++++++++++++++++++++++++++++++++ | 88%
|
|++++++++++++++++++++++++++++++++++++++++++++++ | 92%
|
|++++++++++++++++++++++++++++++++++++++++++++++++ | 96%
|
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
##
|
| | 0%
|
|** | 4%
|
|**** | 8%
|
|****** | 12%
|
|******** | 16%
|
|********** | 20%
|
|************ | 24%
|
|************** | 28%
|
|**************** | 32%
|
|****************** | 36%
|
|******************** | 40%
|
|********************** | 44%
|
|************************ | 48%
|
|************************** | 52%
|
|**************************** | 56%
|
|****************************** | 60%
|
|******************************** | 64%
|
|********************************** | 68%
|
|************************************ | 72%
|
|************************************** | 76%
|
|**************************************** | 80%
|
|****************************************** | 84%
|
|******************************************** | 88%
|
|********************************************** | 92%
|
|************************************************ | 96%
|
|**************************************************| 100%
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 3
## Unobserved stochastic nodes: 1
## Total graph size: 9
##
## Initializing model
##
##
|
| | 0%
|
|++ | 4%
|
|++++ | 8%
|
|++++++ | 12%
|
|++++++++ | 16%
|
|++++++++++ | 20%
|
|++++++++++++ | 24%
|
|++++++++++++++ | 28%
|
|++++++++++++++++ | 32%
|
|++++++++++++++++++ | 36%
|
|++++++++++++++++++++ | 40%
|
|++++++++++++++++++++++ | 44%
|
|++++++++++++++++++++++++ | 48%
|
|++++++++++++++++++++++++++ | 52%
|
|++++++++++++++++++++++++++++ | 56%
|
|++++++++++++++++++++++++++++++ | 60%
|
|++++++++++++++++++++++++++++++++ | 64%
|
|++++++++++++++++++++++++++++++++++ | 68%
|
|++++++++++++++++++++++++++++++++++++ | 72%
|
|++++++++++++++++++++++++++++++++++++++ | 76%
|
|++++++++++++++++++++++++++++++++++++++++ | 80%
|
|++++++++++++++++++++++++++++++++++++++++++ | 84%
|
|++++++++++++++++++++++++++++++++++++++++++++ | 88%
|
|++++++++++++++++++++++++++++++++++++++++++++++ | 92%
|
|++++++++++++++++++++++++++++++++++++++++++++++++ | 96%
|
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
##
|
| | 0%
|
|** | 4%
|
|**** | 8%
|
|****** | 12%
|
|******** | 16%
|
|********** | 20%
|
|************ | 24%
|
|************** | 28%
|
|**************** | 32%
|
|****************** | 36%
|
|******************** | 40%
|
|********************** | 44%
|
|************************ | 48%
|
|************************** | 52%
|
|**************************** | 56%
|
|****************************** | 60%
|
|******************************** | 64%
|
|********************************** | 68%
|
|************************************ | 72%
|
|************************************** | 76%
|
|**************************************** | 80%
|
|****************************************** | 84%
|
|******************************************** | 88%
|
|********************************************** | 92%
|
|************************************************ | 96%
|
|**************************************************| 100%
43 / 56

Visualizing output

plot(as.mcmc(output))

44 / 56

Visualizing output

library(mcmcplots)
mcmcplot(as.mcmc(output),c("theta"))
denplot(as.mcmc(output),c("theta"))

45 / 56

Model diagnostic

So it might take a longer time for run

46 / 56

Model diagnostic and convergence

output$BUGSoutput$summary
## mean sd 2.5% 25% 50% 75%
## deviance 14.6755846 1.3992791 13.6488245 13.7614902 14.1370965 15.0345916
## theta 0.4973384 0.0388237 0.4187856 0.4710636 0.4970235 0.5249785
## 97.5% Rhat n.eff
## deviance 18.7558006 1.017801 220
## theta 0.5720824 1.003696 2000

If it fails to converge, we can do the following,

update(output, n.iter = 1000)
autojags(output, n.iter = 1000, Rhat = 1.05, n.update = 2)
47 / 56

weak prior vs. strong prior

output$BUGSoutput$summary
## mean sd 2.5% 25% 50% 75%
## deviance 14.6755846 1.3992791 13.6488245 13.7614902 14.1370965 15.0345916
## theta 0.4973384 0.0388237 0.4187856 0.4710636 0.4970235 0.5249785
## 97.5% Rhat n.eff
## deviance 18.7558006 1.017801 220
## theta 0.5720824 1.003696 2000
output2$BUGSoutput$summary
## mean sd 2.5% 25% 50% 75%
## deviance 15.2841452 1.77163845 13.6509996 13.9603596 14.6893500 16.0268236
## theta 0.5357841 0.02994589 0.4754806 0.5161242 0.5361084 0.5560317
## 97.5% Rhat n.eff
## deviance 20.0407688 1.005639 490
## theta 0.5934454 1.002685 690
48 / 56

weak prior vs. strong prior

denplot(as.mcmc(output),
parms = "theta",
xlim = c(0.3, 0.7))

denplot(as.mcmc(output2),
parms = "theta",
xlim = c(0.3, 0.7))

49 / 56

Linear regression

In JAGS

model<-"model{
# Likelihood
for(i in 1:N){
y[i] ~ dnorm(mu[i], tau)
mu[i] <- beta0 + beta1*x[i]
}
# Prior
# intercept
beta0 ~ dnorm(0, 0.0001)
# slope
beta1 ~ dnorm(0, 0.0001)
# error term
tau ~ dgamma(0.01, 0.01)
sigma <- 1/sqrt(tau)
}"
lm_output <-
jags(model.file = textConnection(model),
data = list(y = y, x = x, N=15),
parameters.to.save = c("beta0","beta1", "sigma"),
n.chains = 2, n.iter = 2000, progress.bar ="none")


In R

lm(y~x)

In statistics 101

Yi=b0+b1Xi+σ YN(b0+b1X,σ2)

## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 15
## Unobserved stochastic nodes: 3
## Total graph size: 70
##
## Initializing model
50 / 56

Interpretation

In JAGS

lm_output$BUGSoutput$summary[1:2,
c(3,7)]
## 2.5% 97.5%
## beta0 -8.853992 52.131691
## beta1 -16.496277 3.192873

95% credible interval

In R

confint(lm(y~x))
## 2.5 % 97.5 %
## (Intercept) -9.32782 51.322335
## x -16.19329 3.486528

95% confidence interval

51 / 56

An alternative way

In runjags

model <- template.jags(
y ~ x,
data, n.chains=2,
family='gaussian')

In R

lm(y~x)
52 / 56

About sources 📕+💻

If you've got 2hrs:

  • useR! International R User 2017 Conference JAGS workshop by Martyn Plummer <-- 👨 of JAGS

If you've got 5 mins:

RJAGS: how to get started by Rens van de Schoot, retweeted by Martyn Plummer

If you're a book lover:

53 / 56

What about puppies?

"perky ears" + "floppy ears" ?= "half-up ears"

54 / 56

More ...

55 / 56

Thanks! and Keep in touch

Slides: https://lda-mutations.netlify.com/

@zhiiiyang zhiiiyang zyang895@gmail.com

Slides created via the R package xaringan and xaringanthemer

56 / 56

About this talk

  • For those who've never heard of it

    ❓ what make a topic model (i.e. Latent Dirichlet Allocation)

    📚 its general application

  • For those who know about it

    🎗️ Its use in cancer research

    📊 data, modeling, software

    🏥 medical implications

2 / 56
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow