Preparing data for totalVI

Hello,

I have a question about totalVI. I have followed the steps in the tutorial; however, when I run vae.train() I get the following error, which I do not understand:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/core/datamodule.py:424: LightningDeprecationWarning: DataModule.setup has already been called, so it will not be called again. In v1.6 this behavior will change to always call DataModule.setup.
  f"DataModule.{name} has already been called, so it will not be called again. "
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/data_loading.py:323: UserWarning: The number of training samples (34) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  f"The number of training samples ({self.num_training_batches}) is smaller than the logging interval"
Epoch 1/400:  -0%|          | -1/400 [00:00<?, ?it/s]/usr/local/lib/python3.7/dist-packages/tqdm/std.py:538: TqdmWarning: clamping frac to range [0, 1]
  colour=colour)
Epoch 1/400:  -0%|          | -1/400 [00:00<?, ?it/s]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-70-511a56042b55> in <module>()
----> 1 vae.train()

15 frames
/usr/local/lib/python3.7/dist-packages/scvi/train/_trainingplans.py in training_epoch_end(self, outputs)
    133         n_obs, elbo, rec_loss, kl_local = 0, 0, 0, 0
    134         for tensors in outputs:
--> 135             elbo += tensors["reconstruction_loss_sum"] + tensors["kl_local_sum"]
    136             rec_loss += tensors["reconstruction_loss_sum"]
    137             kl_local += tensors["kl_local_sum"]

KeyError: 'reconstruction_loss_sum'

I’ve pretty much followed the instructions to the letter, so I’m not really sure why I get this error, or indeed what it means. I wonder if you could please advise? Many thanks in advance.

I should probably add that this is what I get when I run scvi.data.setup_anndata:

INFO     Using batches from adata.obs["batch"]                                               
INFO     No label_key inputted, assuming all cells have same label                           
INFO     Using data from adata.layers["counts"]                                              
INFO     Computing library size prior per batch                                              
INFO     Using protein expression from adata.obsm['protein_expression']                      
INFO     Using protein names from columns of adata.obsm['protein_expression']                
INFO     Found batches with missing protein expression                                       
INFO     Successfully registered anndata object containing 9564 cells, 4000 vars, 2 batches, 
         1 labels, and 137 proteins. Also registered 0 extra categorical covariates and 0    
         extra continuous covariates.                                                        
INFO     Please do not further modify adata until model is trained.

UPDATE…

Running scvi.data.organize_cite_seq_10x(adata) AFTER concatenating my two objects seems to make it work, even though in the tutorial it is applied BEFORE concatenation.

Doing it before leaves the full repertoire of 137 proteins; doing it after reduces this to 95.

If anyone knows why this is, that information would be much appreciated.

It’s difficult for me to tell exactly what steps are being taken for data organization and preprocessing, but with that information this could make sense to me.

Thanks for the reply. The steps undertaken so far are:

  1. Importing the data
  2. Scanpy preprocessing/QC metrics
  3. Scrublet

These steps were performed separately for two samples/objects. I then did:

  1. Concatenating the two objects
  2. Filtering min genes/cells/counts/mito
  3. Scanpy tools pca/nearest neighbour/umap
  4. Then the rest is as per the totalVI tutorial (normalise, log1p, adata.raw, highly variable genes, scvi.data.organize_cite_seq_10x, scvi.data.setup_anndata)

It’s not really clear why the protein count drops off with this method, though, compared to running scvi.data.organize_cite_seq_10x on the two samples separately before concatenating them. Surely we are not losing cells, because the QC steps are otherwise the same?

A script showing your exact steps would be helpful to diagnose the issue. For example, Scrublet should probably not be run on an anndata object containing both RNA and protein in adata.X, which is how it’s loaded from a 10x output.
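
Something along these lines (a rough sketch, assuming the standard feature_types column that sc.read_10x_h5 writes into adata.var) would restrict Scrublet to the RNA counts:

# Sketch: run Scrublet on the RNA features only; "Gene Expression" and
# "Antibody Capture" are the standard 10x feature_types labels
rna = ad20[:, ad20.var["feature_types"] == "Gene Expression"]
scrub = scr.Scrublet(rna.X)
ad20.obs["doublet_scores"], ad20.obs["predicted_doublets"] = scrub.scrub_doublets()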

Thanks, Adam. That may well be part of the problem. Apologies, I had not realised Scrublet should not be run on adata with RNA and protein.

It looks like I cannot upload an ipynb file here, but this is the script I’m working from:

Install required packages

!pip install scvi-tools -U
!pip install scanpy
!pip install scrublet
!pip install leidenalg

import sys

# if branch is "stable", install via PyPI; else install from source
branch = "stable"
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB and branch == "stable":
    !pip install --quiet scvi-tools[tutorials]
elif IN_COLAB and branch != "stable":
    !pip install --quiet --upgrade jsonschema
    !pip install --quiet git+https://github.com/yoseflab/scvi-tools@$branch#egg=scvi-tools[tutorials]

Import packages

import scanpy as sc
import scvi
import matplotlib.pyplot as plt
import scrublet as scr

# Scanpy set up
sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)

# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

# Set working directory
import os
os.chdir('/content/gdrive/MyDrive/Adenoid_pilot')

Importing data from Ad20

ad20 = sc.read_10x_h5('/content/gdrive/MyDrive/Adenoid_pilot/filtered_feature_bc_ad20_matrix.h5',
                      genome=None,
                      gex_only=False,
                      backup_url=None)

# Organise and check the data
ad20.var_names_make_unique()
ad20.raw = ad20
ad20.var["mito"] = ad20.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(ad20, qc_vars=["mito"], inplace=True)

# Scrublet
scrub = scr.Scrublet(ad20.raw.X)
ad20.obs["doublet_scores"], ad20.obs["predicted_doublets"] = scrub.scrub_doublets()
scrub.plot_histogram()

sum(ad20.obs["predicted_doublets"])

ad20.obs["doublet_info"] = ad20.obs["predicted_doublets"].astype(str)

sc.pl.violin(ad20, "n_genes_by_counts",
             jitter=0.4, groupby="doublet_info", rotation=45)

ad20.obs["sample"] = "20"  # sample label is per-cell, so obs rather than var

Importing data from Ad19

ad19 = sc.read_10x_h5('/content/gdrive/MyDrive/Adenoid_pilot/filtered_feature_bc_ad19_matrix.h5',
                      genome=None,
                      gex_only=False,
                      backup_url=None)

# Organise and check the data
ad19.var_names_make_unique()
ad19.raw = ad19
ad19.var["mito"] = ad19.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(ad19, qc_vars=["mito"], inplace=True)

# Scrublet
scrub = scr.Scrublet(ad19.raw.X)
ad19.obs["doublet_scores"], ad19.obs["predicted_doublets"] = scrub.scrub_doublets()
scrub.plot_histogram()

sum(ad19.obs["predicted_doublets"])

ad19.obs["doublet_info"] = ad19.obs["predicted_doublets"].astype(str)

sc.pl.violin(ad19, "n_genes_by_counts",
             jitter=0.4, groupby="doublet_info", rotation=45)

ad19.obs["sample"] = "19"

Combine objects

adata = ad19.concatenate(ad20)
adata.raw = adata
adata.layers["counts"] = adata.X.copy()

Filter for min cells/genes/counts

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.filter_genes(adata, min_counts=1)

# Examine mito genes
sc.pp.calculate_qc_metrics(adata, qc_vars=["mito"], inplace=True)

# Plot QC metrics
sc.pl.violin(adata, ["n_genes_by_counts", "total_counts", "pct_counts_mito"],
             jitter=0.4, multi_panel=True)

# Filter cells on mito percentage
adata = adata[adata.obs.pct_counts_mito < 10, :]

sc.tl.pca(adata, svd_solver="arpack")
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)
sc.pp.neighbors(adata, n_pcs=50, n_neighbors=10)
sc.tl.umap(adata)

sc.pl.umap(adata, color="doublet_info")

adata = adata[adata.obs["doublet_info"] == "False"]

Prepare for SCVI modelling

adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata

sc.pp.highly_variable_genes(
    adata,
    n_top_genes=4000,
    flavor="seurat_v3",
    batch_key="batch",
    subset=True,
    layer="counts",
)

scvi.data.organize_cite_seq_10x(adata)  # moves the protein counts into adata.obsm

scvi.data.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",
    protein_expression_obsm_key="protein_expression",
)

scvi.data.view_anndata_setup(adata)

# Training the model
vae = scvi.model.TOTALVI(adata, latent_distribution="normal")

vae.train()

plt.plot(vae.history["elbo_train"], label="train")
plt.plot(vae.history["elbo_validation"], label="validation")
plt.title("Negative ELBO over training epochs")
plt.ylim(0, 1400)
plt.legend()

Looking into this more, it could be a problem with one of our dependencies having a recent update. I’ll get back to you soon.

We just pushed a new version of scvi-tools that fixes part of these issues.
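
In Colab it can be picked up by upgrading before importing scvi, e.g.:

!pip install --quiet --upgrade scvi-tools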

The concatenation before or after the organize function shouldn’t really matter… but I use concat instead of concatenate.

Thanks, Adam. I’ve just tried to re-run it, but we’re still seeing a fair bit of protein dropout (137 → 95), and I’m not really sure why that is happening. Presumably the hope was that the new version would correct that?

The update fixes this:

/usr/local/lib/python3.7/dist-packages/scvi/train/_trainingplans.py in training_epoch_end(self, outputs)
    133         n_obs, elbo, rec_loss, kl_local = 0, 0, 0, 0
    134         for tensors in outputs:
--> 135             elbo += tensors["reconstruction_loss_sum"] + tensors["kl_local_sum"]
    136             rec_loss += tensors["reconstruction_loss_sum"]
    137             kl_local += tensors["kl_local_sum"]

There could be several other reasons the number of proteins is dropping, and it’s probably related to the data manipulation/filtering. This could include the scanpy filtering steps (the proteins are still in adata.X when you run the filters), or it could be as simple as the proteins having slightly different names in the two objects. As a first step, I would recommend pulling the protein data out into separate protein anndata objects; then you can filter the RNA objects and do

protein_adata = protein_adata[rna_adata.obs_names]
rna_adata.obsm["protein"] = protein_adata.to_df()
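
For the first step, the split could look something like this (a sketch, again assuming the standard 10x feature_types labels; the variable names are just illustrative):

# Sketch: split the mixed object into RNA and protein AnnDatas
# using the feature_types column from the 10x h5
is_protein = adata.var["feature_types"] == "Antibody Capture"
protein_adata = adata[:, is_protein].copy()
rna_adata = adata[:, ~is_protein].copy()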

Many thanks. I think that’s exactly right: I’ve gone back through the script, run adata.var["feature_types"].value_counts() after each step, and it is the sc.pp.highly_variable_genes call that is causing the drop.
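
In case it helps anyone else, moving the organize step to just after concatenation, while adata.X still holds the raw counts, should keep the proteins out of the HVG ranking entirely. A sketch of the reordered steps:

# Sketch: move proteins to obsm before any gene-level selection,
# so subset=True in highly_variable_genes can only drop genes
adata = ad19.concatenate(ad20)
scvi.data.organize_cite_seq_10x(adata)  # proteins -> adata.obsm["protein_expression"]
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(
    adata, n_top_genes=4000, flavor="seurat_v3",
    batch_key="batch", subset=True, layer="counts",
)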

Thanks for the advice! Much appreciated.