Another way to explore this data is to use supervised learning to build a classifier. Now, there are two uses of a classifier, and one tends to be emphasized much more than the other.
1. A classifier can be used to generate an automatic procedure for labeling points. This is useful if, for example, you want a self-driving car to decide automatically whether the object in front of it is another car or a pedestrian.
2. A classifier can be used to identify features that are good predictors of the labels.
I suppose that a central part of my philosophy is that point (2) is much more important in biological research than point (1). This has some implications downstream (for example, we are slightly less worried about overfitting if we can test selected features via an orthogonal experimental approach).
Here, we want to build a classifier that uses gene expression levels to predict a binary outcome, luminal A or luminal B. However, we would like our machine to identify a small number of features that can accurately predict the outcome.
Typically, binary outcome prediction can be performed using logistic regression. However, if we input $n$ features into a logistic regression algorithm, then in general all $n$ features will end up with non-zero coefficients: every gene will participate in the classification problem. While this could be useful in principle, in reality we often want to interpret the biomarkers, so we would like an algorithm that discards most gene expression levels in favor of a select few (there are additional problems with using 20,000 covariates, most notably overfitting).
How can we modify logistic regression to do this? Well, an easy way is to exchange the commonly used $l_2$ penalty for an $l_1$ penalty. That is, instead of penalizing the sum of the squared model coefficients, we penalize the sum of their absolute values. This has profound consequences, because the $l_1$ penalty naturally drives coefficients that would merely be small under an $l_2$ penalty all the way to zero.
Note: Some people would use LASSO to address this problem. LASSO is a modified linear regression procedure similar to least-squares: whereas ordinary least-squares minimizes only the squared residuals, LASSO performs the regression under an additional $l_1$ constraint on the coefficients. However, since our outcome is categorical, I find $l_1$-regularized logistic regression a more suitable solution.
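To see the sparsifying effect concretely, here is a minimal synthetic sketch; the data set and parameter values are invented for illustration and are not part of the analysis below.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
# toy data: 200 samples, 50 features, only 5 of which are informative
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)
l1_model = LogisticRegression(penalty='l1', solver='saga', C=0.1,
                              max_iter=5000).fit(X, y)
l2_model = LogisticRegression(penalty='l2', solver='saga', C=0.1,
                              max_iter=5000).fit(X, y)
# expect far fewer non-zero coefficients under l1 than under l2
print('non-zero coefficients, l1:', (l1_model.coef_[0] != 0).sum())
print('non-zero coefficients, l2:', (l2_model.coef_[0] != 0).sum())
At the same regularization strength, the $l_1$ fit typically retains only a handful of features, while the $l_2$ fit merely shrinks the coefficients and keeps essentially all of them non-zero.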
import pandas as pd
import numpy as np
import scanpy as sc
import umap
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rc
from sklearn.linear_model import LogisticRegression
rc('text', usetex=True)
rc('text.latex', preamble=r'\usepackage{cmbright}')
rc('font', **{'family': 'sans-serif', 'sans-serif': ['Helvetica']})
%matplotlib inline
# This enables high-resolution PNG graphics inline.
%config InlineBackend.figure_formats = {'png', 'retina'}
# plotting style defaults (named rc_params to avoid shadowing matplotlib's rc)
rc_params = {'lines.linewidth': 2,
             'axes.labelsize': 18,
             'axes.titlesize': 18,
             'axes.facecolor': 'DFDFE5'}
sns.set_context('notebook', rc=rc_params)
sns.set_style("dark")
mpl.rcParams['xtick.labelsize'] = 16
mpl.rcParams['ytick.labelsize'] = 16
mpl.rcParams['legend.fontsize'] = 14
ensembl = sc.queries.biomart_annotations(org='hsapiens',
                                         attrs=['ensembl_gene_id',
                                                'gene_biotype',
                                                'external_gene_name',
                                                'description'])
Load data
df = pd.read_csv('../luminalAB/mrna/trimmedTMM.csv', index_col='geneid')
logdf = pd.read_csv('../luminalAB/mrna/logTMM.csv', index_col='geneid')
normed = pd.read_csv('../luminalAB/mrna/NormalizedlogTMM.csv', index_col='geneid')
meta = pd.read_csv('../luminalAB/mrna/lumAB_id_cancer.txt', sep='\t')
meta.ID = meta.ID.astype('category')
# order the samples to match the columns of the expression matrix:
meta.ID = meta.ID.cat.set_categories(df.columns)
meta.sort_values("ID", inplace=True)
ensembl = ensembl[ensembl.ensembl_gene_id.isin(df.index)]
Subset data into training sets:
train_idx = np.random.choice(normed.T.index.values,
                             size=int(np.floor(len(normed.columns) / 3 * 2)),
                             replace=False).tolist()
test_idx = [n for n in normed.T.index if n not in train_idx]
train_exp = normed.T[normed.T.index.isin(train_idx)]
train_labels = meta[meta.ID.isin(train_idx)].cancer
test_exp = normed.T[normed.T.index.isin(test_idx)]
test_labels = meta[meta.ID.isin(test_idx)].cancer
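As an aside, sklearn's train_test_split can produce a near-equivalent split in fewer lines. A minimal sketch, assuming the IDs in meta.ID match the columns of normed; the variable names and the stratify option (which keeps the class balance equal across the two sets, something the manual split above does not guarantee) are my additions.
from sklearn.model_selection import train_test_split
# labels indexed by sample ID, so they align with the rows of normed.T
labels = meta.set_index('ID').cancer
train_exp2, test_exp2, train_labels2, test_labels2 = train_test_split(
    normed.T, labels.loc[normed.T.index],
    test_size=1/3, stratify=labels.loc[normed.T.index],
    random_state=42)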
For this exercise, I use the sklearn implementation of logistic regression. In this implementation, we can, and do, select the $l_1$ norm to naturally sparsify the coefficients. We can also modulate the regularization penalty, $C$ -- a number that makes the effect of the $l_1$ penalty larger or smaller. Note that in sklearn, $C$ is the inverse of the regularization strength: smaller values of $C$ regularize more heavily and therefore yield sparser models. This parameter ought to be chosen carefully. To do this, I tile over values of $C$ in the interval $(0, 1]$ and repeatedly train the machine and retrieve its total score. Next, we define a loss function to identify an optimal sparsity value. Our loss function will be:

$$\mathcal{L}(C) = C^2 - S_{logit}(C),$$

where $C$ is the sparsity constraint, and $S_{logit}$ is the fraction of all labels in the training set that were correctly assigned by the machine. The quadratic term favors small $C$ (sparser models), while the score term rewards accuracy, so the minimum balances the two.
c = np.linspace(0.01, .99, 20)
models = {}
for ci in c:
    model = LogisticRegression(
        penalty='l1',    # the l1 norm sparsifies the coefficients
        solver='saga',   # i like saga
        C=ci,            # inverse regularization strength; lower C, sparser model
        max_iter=5000)   # maximum number of iterations to run
    model.fit(train_exp, train_labels)
    models[ci] = model
# define the loss function (quadratic penalty on C minus training accuracy):
loss = lambda x: x**2 - models[x].score(train_exp, train_labels)
# calculate loss function
l = np.array([loss(ci) for ci in c])
# find minimizing c value for loss:
cval = c[np.argmin(l)]
# plot:
plt.plot(c, l)
plt.title('Loss Function = $c^2$ - score')
plt.xlabel('c')
plt.ylabel('Loss')
plt.xscale('log')
print('Loss is minimized at {0:.3g}'.format(cval))
print('Out of the box performance is: {0:.2g}'.format(models[cval].score(test_exp, test_labels)))
print('Fraction of Luminal_A in test set: {0:.2g}'.format((test_labels == 'Luminal_A').sum() / len(test_labels)))
It looks like our loss is minimized at $C=0.165$. At this value, the out-of-the-box performance is 0.86 -- almost 20 percentage points greater than the accuracy we would achieve by simply naming everything Luminal A. This looks very promising. Next, let's view the distribution of model coefficients for all gene expression levels. Remember: we expect the great majority to be centered around zero, and the non-zero coefficients probably ought to be distributed symmetrically around 0.
plt.hist(models[cval].coef_[0])
plt.yscale('log')
plt.xlabel('Model coefficients, C={0:.3g}'.format(cval))
_ = plt.ylabel('Frequency')
markers = df.index[np.where(models[cval].coef_[0] != 0)[0]].values
ens_mark = ensembl[ensembl.ensembl_gene_id.isin(markers)].copy()
ens_mark.ensembl_gene_id = ens_mark.ensembl_gene_id.astype('category')
ens_mark.ensembl_gene_id = ens_mark.ensembl_gene_id.cat.set_categories(markers)
ens_mark.sort_values("ensembl_gene_id", inplace=True)
x = models[cval].coef_[0][models[cval].coef_[0] != 0]
sort = np.argsort(x)
fig, ax = plt.subplots(figsize=(6, 16))
# plot the negative and positive coefficients as two separate series:
for sel in [x[sort] < 0, x[sort] > 0]:
    plt.plot(x[sort][sel],
             ens_mark.external_gene_name.values[sort][sel],
             'o')
_ = plt.xlabel('Model Coefficients')
Next, we can perform pathway enrichment analyses on the selected genes. Here, I perform the test three ways: on all the coefficients simultaneously, and then on the positive and negative coefficients separately. Intuitively, we might expect that splitting should yield better results, since the two cancer types may be driven by different molecular etiologies.
sc.queries.enrich(ens_mark.ensembl_gene_id)
Let's see what enrichment we get from the genes with negative coefficients:
markers = df.index[np.where(models[cval].coef_[0] < 0)[0]].values
ens_mark = ensembl[ensembl.ensembl_gene_id.isin(markers)]
sc.queries.enrich(ens_mark.ensembl_gene_id)
Not a lot. Nothing interpretable, at least!
markers = df.index[np.where(models[cval].coef_[0] > 0)[0]].values
ens_mark = ensembl[ensembl.ensembl_gene_id.isin(markers)]
sc.queries.enrich(ens_mark.ensembl_gene_id)
It looks like the positive coefficients are enriched for genes that function in cell division and replication! Oh, that's a neat result!
ens_mark
Does the set of selected genes depend on the random seed we train with? It doesn't, but in case you don't believe me, here's the analysis. All runs are performed with the same training set and parameter settings, just different random seeds.
# test feature stability across random seeds:
reps = []
for i in np.arange(1, 20):
    model = LogisticRegression(
        penalty='l1',
        solver='saga',
        C=cval,
        max_iter=5000,
        random_state=i)
    model.fit(train_exp, train_labels)
    reps += [model]
def get_features(data, cond=None):
    """Return annotations for the features selected by cond (non-zero coefficients by default)."""
    if cond is None:
        cond = np.where(models[cval].coef_[0] != 0)[0]
    markers = data.index.values[cond]
    ens_mark = ensembl[ensembl.ensembl_gene_id.isin(markers)].copy()
    ens_mark.ensembl_gene_id = ens_mark.ensembl_gene_id.astype('category')
    ens_mark.ensembl_gene_id = ens_mark.ensembl_gene_id.cat.set_categories(markers)
    ens_mark.sort_values("ensembl_gene_id", inplace=True)
    # 1 if the marker has an ensembl annotation, 0 otherwise:
    missing = np.array([0 if m not in ensembl.ensembl_gene_id.values else 1
                        for m in markers])
    return ens_mark, missing
genes = []
for r in reps:
    f, miss = get_features(normed, np.where(r.coef_[0] != 0)[0])
    genes += f.external_gene_name.values.tolist()
uniq, counts = np.unique(genes, return_counts=True)
# if counts.min() equals the number of runs, every gene was selected every time:
print(len(uniq), counts.min(), counts.max())
np.savetxt('../luminalAB/logistic_genes.txt', uniq, delimiter=',', fmt='%s')
The same set of 56 genes shows up each time. That's a bit surprising, at least to me.