totalVI workflow

I was following this site to learn scvi-tools:
https://docs.scvi-tools.org/en/stable/user_guide/notebooks/totalVI.html

I have exactly followed their steps to get adata from concatenate pbmc3k and pbmc5k (CITE-seq data). I got an error message when running this step:

vae = scvi.model.TOTALVI(adata, latent_distribution=“normal”)

ValueError: No protein data found, please setup or transfer anndata

Can you provide a script showing how you’re assembling the dataset?

pbmc3k = scvi.data.read_h5ad("./data/pbmc3k_raw.h5ad")
pbmc3k
AnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'

pbmc5k = sc.read_10x_h5(
    "./data/5k_pbmc_protein_v3_filtered_feature_bc_matrix.h5",
    gex_only=False
)
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
pbmc5k.var_names_make_unique()
pbmc5k
AnnData object with n_obs × n_vars = 5247 × 33570
    var: 'gene_ids', 'feature_types', 'genome'

scvi.data.organize_cite_seq_10x(pbmc5k)
pbmc5k
AnnData object with n_obs × n_vars = 5247 × 33538
    var: 'gene_ids', 'feature_types', 'genome'
    obsm: 'protein_expression'

adata = pbmc5k.concatenate(pbmc3k)
adata
AnnData object with n_obs × n_vars = 7947 × 20453
    obs: 'batch'
    var: 'gene_ids-0', 'feature_types-0', 'genome-0', 'gene_ids-1'
sc.pp.filter_genes(adata, min_counts=3)
sc.pp.filter_cells(adata, min_counts=3)
adata.layers["counts"] = adata.X.copy()
adata
AnnData object with n_obs × n_vars = 7947 × 14309
    obs: 'batch', 'n_counts'
    var: 'gene_ids-0', 'feature_types-0', 'genome-0', 'gene_ids-1', 'n_counts'
    layers: 'counts'
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata
scvi.data.setup_anndata(adata, layer="counts", batch_key="batch")
INFO     Using batches from adata.obs["batch"]                                               
INFO     No label_key inputted, assuming all cells have same label                           
INFO     Using data from adata.layers["counts"]                                              
INFO     Computing library size prior per batch                                              
INFO     Successfully registered anndata object containing 7947 cells, 14309 vars, 2 batches,
         1 labels, and 0 proteins. Also registered 0 extra categorical covariates and 0 extra
         continuous covariates.                                                              
INFO     Please do not further modify adata until model is trained.                          

scvi.data.setup_anndata(pbmc5k, protein_expression_obsm_key="protein_expression")
INFO     No batch_key inputted, assuming all cells are same batch                            
INFO     No label_key inputted, assuming all cells have same label                           
INFO     Using data from adata.X                                                             
INFO     Computing library size prior per batch                                              
INFO     Using protein expression from adata.obsm['protein_expression']                      
INFO     Using protein names from columns of adata.obsm['protein_expression']                
INFO     Successfully registered anndata object containing 5247 cells, 33538 vars, 1 batches,
         1 labels, and 32 proteins. Also registered 0 extra categorical covariates and 0     
         extra continuous covariates.                                                        
INFO     Please do not further modify adata until model is trained.                          

scvi.data.view_anndata_setup(pbmc5k)
Anndata setup with scvi-tools version 0.12.1.
              Data Summary              
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃             Data             ┃ Count ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│            Cells             │ 5247  │
│             Vars             │ 33538 │
│            Labels            │   1   │
│           Batches            │   1   │
│           Proteins           │  32   │
│ Extra Categorical Covariates │   0   │
│ Extra Continuous Covariates  │   0   │
└──────────────────────────────┴───────┘
                   SCVI Data Registry                    
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Data        ┃       scvi-tools Location        ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│         X          │             adata.X              │
│   batch_indices    │     adata.obs['_scvi_batch']     │
│    local_l_mean    │ adata.obs['_scvi_local_l_mean']  │
│    local_l_var     │  adata.obs['_scvi_local_l_var']  │
│       labels       │    adata.obs['_scvi_labels']     │
│ protein_expression │ adata.obsm['protein_expression'] │
└────────────────────┴──────────────────────────────────┘
                        Label Categories                        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃      Source Location      ┃ Categories ┃ scvi-tools Encoding ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ adata.obs['_scvi_labels'] │     0      │          0          │
└───────────────────────────┴────────────┴─────────────────────┘
                       Batch Categories                        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃     Source Location      ┃ Categories ┃ scvi-tools Encoding ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ adata.obs['_scvi_batch'] │     0      │          0          │
└──────────────────────────┴────────────┴─────────────────────┘

adata
AnnData object with n_obs × n_vars = 7947 × 14309
    obs: 'batch', 'n_counts', '_scvi_batch', '_scvi_labels', '_scvi_local_l_mean', '_scvi_local_l_var'
    var: 'gene_ids-0', 'feature_types-0', 'genome-0', 'gene_ids-1', 'n_counts'
    uns: 'log1p', '_scvi'
    layers: 'counts'


vae = scvi.model.TOTALVI(pbmc5k, latent_distribution="normal")

I got the error message when running:
vae = scvi.model.TOTALVI(pbmc5k, latent_distribution=“normal”)

I think adata is from concatenation of pbmc3k + pbmc5k but only 5k has protein expression data.

It’s a bit difficult for me to follow what you’re trying to do. Can you check that pbmc5k.obsm["protein_expression"] works?

It returns the table of 5247 rows × 32 columns

As I said, when adata was created by concatenation of pbmc3k with 5k, there is obsm[protein_expression]!

I am curious, what is the typical preprocessing steps in Scanpy before the scvi modeling? This seems to be very memory intensive? I am using my laptop to run Scanpy and I am not sure if this is feasible with scvi?

Is it possible to reduce the amount of genes for modeling step?

sc.pp.highly_variable_genes(
adata1,
n_top_genes=4000,
flavor=“seurat_v3”,
subset=True,
layer=“counts”
)

Can we do like 2000 genes?

It shouldn’t require much memory.

I can’t reproduce this error. As this is publicly available data, could you perhaps create an example on Google Colab reproducing the issue?

this is the copy of my Jupyter scripts running scvi step: vae:train for pbmc5k (5247 cells for 4000 genes). It has been running 7 mins and it is only 24% finished. Is this normal?

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
Epoch 99/400: 24%|██▍ | 98/400 [17:17<51:15, 10.19s/it, loss=1.35e+03, v_num=1]

Yes this would be normal I believe. Everything is much faster if you use a GPU, which Colab provides for free.

Thx and I will try to use Colab to learn scvi using pbmc5k dataset.

I am still reading the tutorial carefully to understand how TOTALVI works:

for this part:

Analyze outputs

We use Scanpy for clustering and visualization after running totalVI. It’s also possible to save totalVI outputs for an R-based workflow. First, we store the totalVI outputs in the appropriate slots in AnnData.

The code is copied here:

adata.obsm[“X_totalVI”] = vae.get_latent_representation()

rna, protein = vae.get_normalized_expression(
n_samples=25,
return_mean=True,
transform_batch=[“PBMC10k”, “PBMC5k”]
)

adata.layers[“denoised_rna”], adata.obsm[“denoised_protein”] = rna, protein

adata.obsm[“protein_foreground_prob”] = vae.get_protein_foreground_probability(
n_samples=25,
return_mean=True,
transform_batch=[“PBMC10k”, “PBMC5k”]
)
parsed_protein_names = [p.split("_")[0] for p in adata.obsm[“protein_expression”].columns]
adata.obsm[“protein_foreground_prob”].columns = parsed_protein_names

I have two questions for these codes and hope you can help me!
n_samples = 25, what does it mean?
If I only have one sample such as pbmc5k, can I skip ‘’ transform_batch=[“PBMC10k”, “PBMC5k”]? "