Extracting topics 📚 from🎗cancer patients’ mutational profilesZhi Yang2019-07-301 / 56

About this talk

For those who've never heard of it

❓ what make a topic model (i.e. Latent Dirichlet Allocation)

📚 its general application
For those who know about it

🎗️ Its use in cancer research

📊 data, modeling, software

🏥 medical implications

2 / 56

We usually extract topics from:books, customer reviews, tweets, ...scientific journals, medical records, ...what if they can't be "read"?3 / 56

We usually extract topics from:

books, customer reviews, tweets, ...

scientific journals, medical records, ...

what if they can't be "read"?

3 / 56

Why studying somatic mutations?

4 / 56

What do we need computers for?

Volume

Velocity

Variety

Variability

Understanding big data themes from scientific biomedical literature through topic modeling

5 / 56

Data sources

10,952 exomes

1,048 whole-genomes

40 distinct types of human cancer

The Cancer Genome Atlas (TCGA)，the International Cancer Genome Consortium (ICGC), data from peer-reviewed journals

Signatures of Mutational Processes in Human Cancer

6 / 56

What are in a topic model?

documents

topics

words

7 / 56

What are in a topic model?

documents

topics --> latent

words

8 / 56

In a single document

every document is unique
each topic has different fractions, [0, 1]

9 / 56

In a topic

Topics are consistent across documents
choosing the number of topics could be a headache 😟

10 / 56

The hierarchical structure

a single document contains topics with weights
a topic is a unique distribution of words

11 / 56

LDA Topic Modeling on emojis

https://lettier.com/projects/lda-topic-modeling/

12 / 56

Data collection

Tissue ➡Sequencing ➡ input

13 / 56

Data collection

Tissue ➡Sequencing ➡ input

sample	chr	position	ref	alt
sample1	chr1	100	A	C
sample1	chr2	100	G	T
sample2	chr1	300	T	C
sample3	chr3	400	T	C

13 / 56

Data collection

Tissue ➡Sequencing ➡ input

sample	chr	position	ref	alt
sample1	chr1	100	A	C
sample1	chr2	100	G	T
sample2	chr1	300	T	C
sample3	chr3	400	T	C

? ? A>C ? ?

G T A>C T C

13 / 56

A glimpse of mutational profiles

Yang et al. HiLDA: a statistical approach to investigate differences in mutational signatures, 2019

14 / 56

Where are those "words"?

15 / 56

Topics made of words

Introduction to the Amazon SageMaker Neural Topic Model

16 / 56

Topics made of mutations

Shiraishi et al. A simple model-based approach to inferring and visualizing cancer mutation signatures, PLoS Genetics, 2015

17 / 56

Topics made of mutations

Signatures of Mutational Processes in Human Cancer

18 / 56

Shiraishi et al. A simple model-based approach to inferring and visualizing cancer mutation signatures, PLoS Genetics, 2015

19 / 56

Association with known risk factors

Shiraishi et al. A simple model-based approach to inferring and visualizing cancer mutation signatures, PLoS Genetics, 2015

20 / 56

Software implementation

R package and Shiny app: pmsignature

Shiraishi et al. A simple model-based approach to inferring and visualizing cancer mutation signatures, PLoS Genetics, 2015

21 / 56

Hierarchical topic model

documents

topics --> weights

words

22 / 56

After applying the topic model,

Are they different?

23 / 56

Statistical inference

24 / 56

Group differences

fractions $p_{k}$ and concentration parameters $α_{k}$ $p_{1}, \dots, p_{K} \sim D i r i c h l e t (α_{1}, \dots, α_{K})$ $μ_{k} = \frac{α_{k}}{\sum_{k} α_{k}}$

To capture the group difference for signature $i$ , $Δ_{i} = μ_{i}^{(1)} - μ_{i}^{(2)}$ What we'd like to know, $H_{0} : Δ_{i} = 0, i = 1, 2, . . ., K$

Why don't we just compare the estimated values?

$Δ_{i} = {\hat{μ}}_{i}^{(1)} - {\hat{μ}}_{i}^{(2)}$

25 / 56

Potentiaized treatment

Proposed subclassification of EAC based on mutational signatures

26 / 56

A mini GOTB family tree

27 / 56

Why Bayesian?28 / 56

Why Bayesian?You learn from the data28 / 56

Why Bayesian?

You learn from the data

You know the uncertainty of the estimated parameters

My Journey From Frequentist to Bayesian Statistics

28 / 56

JAGS

stands for

Just Another Gibbs Sampler

developed by Martyn Plummer

29 / 56

Why JAGS?why not WinBUGS 🐞 or Stan?30 / 56

JAGS vs. WinBUGSWinBUGS is historically very important with slow development31 / 56

JAGS vs. WinBUGSWinBUGS is historically very important with slow developmentJAGS has very similar syntax to WinBUGS and it allows cross-platform31 / 56

JAGS vs. WinBUGS

WinBUGS is historically very important with slow development

JAGS has very similar syntax to WinBUGS and it allows cross-platform

Great interface to R! (rjags, R2jags,runjags)

When there is WinBUGS already, why is JAGS necessary?

31 / 56

JAGS

longer history with tons of resources 📕📝🎓
easy to learn and run it
need to recompile model
fewer developers
less viz tools
various MCMC sampling methods

Stan

A newer tool with great team/active community
steeper learning curve
no need to recompile
Syntax check in R 😱
ShinyStan (applicable to JAGS)
require fewer iterations to converge

Which one's faster? Inconclusive

My situation Stan doesn't support discrete sampling

Stan vs. WinBugs: A search for informed opinions

32 / 56

Why me?I wrote a part of my dissertation with JAGSI've repeatedly read through JAGS, WinBUGS manual and numerous journalsI've four years' experiences with searching JAGS examples in Google and GitHub33 / 56

Installation

Install R

Install RStudio

Install the JAGS from https://sourceforge.net/projects/mcmc-jags/

Install the R2jags package from CRAN:

install.packages("R2jags")
library(R2jags)

34 / 56

Shiny Demo

The focus of the analysis presented here is on accurately estimating the mean of IQ using simulated data.

FBI: First Bayesian Inference

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4158865/

35 / 56

Strong prior vs. weak prior

https://towardsdatascience.com/visualizing-beta-distribution-7391c18031f1

36 / 56

Let's get started!

We hold meetings every month and post it on meetup.com, which tells us how many people've signed up, $n$ . But not everyone show up (i.e., traffic). So, we'd like to know what percentage of people, $θ$ , will show up to order enough pizzas.

$Y_{i} | θ \sim B i n o m (n_{i}, θ)$

$θ \sim B e t a (a, b)$

Weak prior:

$B e t a (1, 1) \leftrightarrow U n i f (0, 1)$

Y <- c(77, 62, 30)
N <- c(39, 29, 16)

37 / 56

Write the model

But as you know, someone (@lukanegoita) said

38 / 56

Write the model

$Y_{i} | θ \sim B i n o m (n_{i}, θ)$

$θ \sim B e t a (a, b)$

In R

model{
for(i in 1:3){
  Y[i] <- rbinom(1, N[i], theta) 
}
  theta <- rbeta(1, 1, 1)
}

In JAGS

model{
for(i in 1:3){
  Y[i] ~ dbinom(theta, N[i])
}
  theta ~ dbeta(1, 1)
}

A variable should appear only once on the left hand side
A variable may appear multiple times on the right hand side
variables that only appear on the right must be supplied as data

39 / 56

Tips on syntaxNo need for variable declarationNo need for semi-colons at end of statementsuse ddist instead of rdist for random samplingdnorm(mean, precision) not dnorm(mean, sd)It allows matrix and arrayIt is illegal to offer definitions of the variable twice40 / 56

Write the model

library(R2jags)
model_string<-"model{
# Likelihood
for(i in 1:3){
  Y[i] ~ dbinom(theta, N[i])
}
# Prior
  theta ~ dbeta(1, 1)
}
"
# Data
N <- c(77, 62, 30)
Y <- c(39, 29, 16)
# specify parameters to be monitered
params <- c("theta")

41 / 56

rjags vs. runjags vs. R2jagsR2jags and runjags are just wrappers of rjagsrunjags makes it easy to run chains in parallel on multiple processesR2jags allows running until converged and paralleling MCMC chainsOtherwise, they all have very similar syntax and pre-processing steps42 / 56

Choosing `R2jags`

output <- jags(model.file = textConnection(model_string), 
               data = list(Y = Y, N = N),
               parameters.to.save = params,
               n.chains = 2, n.iter = 2000)

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 3
##    Unobserved stochastic nodes: 1
##    Total graph size: 8
## 
## Initializing model
## 
## 
  |                                                        
  |                                                  |   0%
  |                                                        
  |++                                                |   4%
  |                                                        
  |++++                                              |   8%
  |                                                        
  |++++++                                            |  12%
  |                                                        
  |++++++++                                          |  16%
  |                                                        
  |++++++++++                                        |  20%
  |                                                        
  |++++++++++++                                      |  24%
  |                                                        
  |++++++++++++++                                    |  28%
  |                                                        
  |++++++++++++++++                                  |  32%
  |                                                        
  |++++++++++++++++++                                |  36%
  |                                                        
  |++++++++++++++++++++                              |  40%
  |                                                        
  |++++++++++++++++++++++                            |  44%
  |                                                        
  |++++++++++++++++++++++++                          |  48%
  |                                                        
  |++++++++++++++++++++++++++                        |  52%
  |                                                        
  |++++++++++++++++++++++++++++                      |  56%
  |                                                        
  |++++++++++++++++++++++++++++++                    |  60%
  |                                                        
  |++++++++++++++++++++++++++++++++                  |  64%
  |                                                        
  |++++++++++++++++++++++++++++++++++                |  68%
  |                                                        
  |++++++++++++++++++++++++++++++++++++              |  72%
  |                                                        
  |++++++++++++++++++++++++++++++++++++++            |  76%
  |                                                        
  |++++++++++++++++++++++++++++++++++++++++          |  80%
  |                                                        
  |++++++++++++++++++++++++++++++++++++++++++        |  84%
  |                                                        
  |++++++++++++++++++++++++++++++++++++++++++++      |  88%
  |                                                        
  |++++++++++++++++++++++++++++++++++++++++++++++    |  92%
  |                                                        
  |++++++++++++++++++++++++++++++++++++++++++++++++  |  96%
  |                                                        
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
## 
  |                                                        
  |                                                  |   0%
  |                                                        
  |**                                                |   4%
  |                                                        
  |****                                              |   8%
  |                                                        
  |******                                            |  12%
  |                                                        
  |********                                          |  16%
  |                                                        
  |**********                                        |  20%
  |                                                        
  |************                                      |  24%
  |                                                        
  |**************                                    |  28%
  |                                                        
  |****************                                  |  32%
  |                                                        
  |******************                                |  36%
  |                                                        
  |********************                              |  40%
  |                                                        
  |**********************                            |  44%
  |                                                        
  |************************                          |  48%
  |                                                        
  |**************************                        |  52%
  |                                                        
  |****************************                      |  56%
  |                                                        
  |******************************                    |  60%
  |                                                        
  |********************************                  |  64%
  |                                                        
  |**********************************                |  68%
  |                                                        
  |************************************              |  72%
  |                                                        
  |**************************************            |  76%
  |                                                        
  |****************************************          |  80%
  |                                                        
  |******************************************        |  84%
  |                                                        
  |********************************************      |  88%
  |                                                        
  |**********************************************    |  92%
  |                                                        
  |************************************************  |  96%
  |                                                        
  |**************************************************| 100%

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 3
##    Unobserved stochastic nodes: 1
##    Total graph size: 9
## 
## Initializing model
## 
## 
  |                                                        
  |                                                  |   0%
  |                                                        
  |++                                                |   4%
  |                                                        
  |++++                                              |   8%
  |                                                        
  |++++++                                            |  12%
  |                                                        
  |++++++++                                          |  16%
  |                                                        
  |++++++++++                                        |  20%
  |                                                        
  |++++++++++++                                      |  24%
  |                                                        
  |++++++++++++++                                    |  28%
  |                                                        
  |++++++++++++++++                                  |  32%
  |                                                        
  |++++++++++++++++++                                |  36%
  |                                                        
  |++++++++++++++++++++                              |  40%
  |                                                        
  |++++++++++++++++++++++                            |  44%
  |                                                        
  |++++++++++++++++++++++++                          |  48%
  |                                                        
  |++++++++++++++++++++++++++                        |  52%
  |                                                        
  |++++++++++++++++++++++++++++                      |  56%
  |                                                        
  |++++++++++++++++++++++++++++++                    |  60%
  |                                                        
  |++++++++++++++++++++++++++++++++                  |  64%
  |                                                        
  |++++++++++++++++++++++++++++++++++                |  68%
  |                                                        
  |++++++++++++++++++++++++++++++++++++              |  72%
  |                                                        
  |++++++++++++++++++++++++++++++++++++++            |  76%
  |                                                        
  |++++++++++++++++++++++++++++++++++++++++          |  80%
  |                                                        
  |++++++++++++++++++++++++++++++++++++++++++        |  84%
  |                                                        
  |++++++++++++++++++++++++++++++++++++++++++++      |  88%
  |                                                        
  |++++++++++++++++++++++++++++++++++++++++++++++    |  92%
  |                                                        
  |++++++++++++++++++++++++++++++++++++++++++++++++  |  96%
  |                                                        
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
## 
  |                                                        
  |                                                  |   0%
  |                                                        
  |**                                                |   4%
  |                                                        
  |****                                              |   8%
  |                                                        
  |******                                            |  12%
  |                                                        
  |********                                          |  16%
  |                                                        
  |**********                                        |  20%
  |                                                        
  |************                                      |  24%
  |                                                        
  |**************                                    |  28%
  |                                                        
  |****************                                  |  32%
  |                                                        
  |******************                                |  36%
  |                                                        
  |********************                              |  40%
  |                                                        
  |**********************                            |  44%
  |                                                        
  |************************                          |  48%
  |                                                        
  |**************************                        |  52%
  |                                                        
  |****************************                      |  56%
  |                                                        
  |******************************                    |  60%
  |                                                        
  |********************************                  |  64%
  |                                                        
  |**********************************                |  68%
  |                                                        
  |************************************              |  72%
  |                                                        
  |**************************************            |  76%
  |                                                        
  |****************************************          |  80%
  |                                                        
  |******************************************        |  84%
  |                                                        
  |********************************************      |  88%
  |                                                        
  |**********************************************    |  92%
  |                                                        
  |************************************************  |  96%
  |                                                        
  |**************************************************| 100%

43 / 56

Visualizing output

plot(as.mcmc(output))

44 / 56

Visualizing output

library(mcmcplots)
mcmcplot(as.mcmc(output),c("theta"))
denplot(as.mcmc(output),c("theta"))

45 / 56

Model diagnostic

So it might take a longer time for run

46 / 56

Model diagnostic and convergence

output$BUGSoutput$summary

##                mean        sd       2.5%        25%        50%        75%
## deviance 14.6755846 1.3992791 13.6488245 13.7614902 14.1370965 15.0345916
## theta     0.4973384 0.0388237  0.4187856  0.4710636  0.4970235  0.5249785
##               97.5%     Rhat n.eff
## deviance 18.7558006 1.017801   220
## theta     0.5720824 1.003696  2000

If it fails to converge, we can do the following,

update(output, n.iter = 1000)
autojags(output, n.iter = 1000, Rhat = 1.05, n.update = 2)

47 / 56

weak prior vs. strong prior

output$BUGSoutput$summary

##                mean        sd       2.5%        25%        50%        75%
## deviance 14.6755846 1.3992791 13.6488245 13.7614902 14.1370965 15.0345916
## theta     0.4973384 0.0388237  0.4187856  0.4710636  0.4970235  0.5249785
##               97.5%     Rhat n.eff
## deviance 18.7558006 1.017801   220
## theta     0.5720824 1.003696  2000

output2$BUGSoutput$summary

##                mean         sd       2.5%        25%        50%        75%
## deviance 15.2841452 1.77163845 13.6509996 13.9603596 14.6893500 16.0268236
## theta     0.5357841 0.02994589  0.4754806  0.5161242  0.5361084  0.5560317
##               97.5%     Rhat n.eff
## deviance 20.0407688 1.005639   490
## theta     0.5934454 1.002685   690

48 / 56

weak prior vs. strong prior

denplot(as.mcmc(output), 
         parms = "theta", 
         xlim = c(0.3, 0.7))

denplot(as.mcmc(output2), 
        parms = "theta", 
        xlim = c(0.3, 0.7))

49 / 56

Linear regression

In JAGS

model<-"model{
# Likelihood
for(i in 1:N){
  y[i] ~ dnorm(mu[i], tau)
  mu[i] <- beta0 + beta1*x[i]
}
# Prior
  # intercept
  beta0 ~ dnorm(0, 0.0001)
  # slope
  beta1 ~ dnorm(0, 0.0001)
  # error term
  tau ~ dgamma(0.01, 0.01)
  sigma <- 1/sqrt(tau)
}"
lm_output <- 
jags(model.file = textConnection(model), 
     data = list(y = y, x = x, N=15),
     parameters.to.save = c("beta0","beta1", "sigma"),
     n.chains = 2, n.iter = 2000, progress.bar ="none")

In R

lm(y~x)

In statistics 101

$Y_{i} = b_{0} + b_{1} X_{i} + σ$ $Y \sim N (b_{0} + b_{1} X, σ^{2})$

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 15
##    Unobserved stochastic nodes: 3
##    Total graph size: 70
## 
## Initializing model

50 / 56

Interpretation

In JAGS

lm_output$BUGSoutput$summary[1:2, 
                             c(3,7)]

##             2.5%     97.5%
## beta0  -8.853992 52.131691
## beta1 -16.496277  3.192873

95% credible interval

In R

confint(lm(y~x))

##                 2.5 %    97.5 %
## (Intercept)  -9.32782 51.322335
## x           -16.19329  3.486528

95% confidence interval

51 / 56

An alternative way

In runjags

model <- template.jags(
  y ~ x, 
  data, n.chains=2, 
  family='gaussian')

In R

lm(y~x)

52 / 56

About sources 📕+💻

If you've got 2hrs:

useR! International R User 2017 Conference JAGS workshop by Martyn Plummer <-- 👨 of JAGS

If you've got 5 mins:

RJAGS: how to get started by Rens van de Schoot, retweeted by Martyn Plummer

If you're a book lover:

53 / 56

What about puppies?

"perky ears" + "floppy ears" ?= "half-up ears"

54 / 56

More ...

55 / 56

Thanks! and Keep in touch

Slides: https://lda-mutations.netlify.com/

@zhiiiyang zhiiiyang zyang895@gmail.com

Slides created via the R package xaringan and xaringanthemer

56 / 56

Help

Keyboard shortcuts

↑, ←, Pg Up, k

Go to previous slide

↓, →, Pg Dn, Space, j

Go to next slide

Home

Go to first slide

End

Go to last slide

Number + Return

Go to specific slide

b / m / f

Toggle blackout / mirrored / fullscreen mode

Clone slideshow

Toggle presenter mode

Restart the presentation timer

?, h

Toggle this help

Extracting topics 📚 from

🎗cancer patients’ mutational profiles

Zhi Yang

2019-07-30

About this talk

For those who've never heard of it

For those who know about it

We usually extract topics from:

books, customer reviews, tweets, ...

scientific journals, medical records, ...

what if they can't be "read"?

We usually extract topics from:

books, customer reviews, tweets, ...

scientific journals, medical records, ...

what if they can't be "read"?

Why studying somatic mutations?

What do we need computers for?

Volume

Velocity

Variety

Variability

Data sources

10,952 exomes

1,048 whole-genomes

40 distinct types of human cancer

The Cancer Genome Atlas (TCGA)，the International Cancer Genome Consortium (ICGC), data from peer-reviewed journals

What are in a topic model?

documents

topics

words

What are in a topic model?

documents

topics --> latent

words

In a single document

every document is unique

each topic has different fractions, [0, 1]

In a topic

Topics are consistent across documents

choosing the number of topics could be a headache 😟

The hierarchical structure

a single document contains topics with weights

a topic is a unique distribution of words

LDA Topic Modeling on emojis

Data collection

Tissue ➡Sequencing ➡ input

Data collection

Tissue ➡Sequencing ➡ input

Data collection

Tissue ➡Sequencing ➡ input

? ? A>C ? ?

G T A>C T C

A glimpse of mutational profiles

Where are those "words"?

Topics made of words

Topics made of mutations

Topics made of mutations

Association with known risk factors

Software implementation

Hierarchical topic model

documents

topics --> weights

words

After applying the topic model,

Statistical inference

Group differences

Potentiaized treatment

A mini GOTB family tree

Why Bayesian?

Why Bayesian?

You learn from the data

Why Bayesian?

You learn from the data

You know the uncertainty of the estimated parameters

JAGS

stands for

Just Another Gibbs Sampler

developed by Martyn Plummer

Why JAGS?

why not WinBUGS 🐞 or Stan?

each topic has different fractions, `[0, 1]`

use d`dist` instead of r`dist` for random sampling

`R2jags` and `runjags` are just wrappers of rjags

`runjags` makes it easy to run chains in parallel on multiple processes

`R2jags` allows running until converged and paralleling MCMC chains

Choosing `R2jags`