totalVI workflow

taopeng1100 · July 30, 2021, 6:33pm

I was following this site to learn scvi-tools:
https://docs.scvi-tools.org/en/stable/user_guide/notebooks/totalVI.html

I have exactly followed their steps to get adata from concatenate pbmc3k and pbmc5k (CITE-seq data). I got an error message when running this step:

vae = scvi.model.TOTALVI(adata, latent_distribution=“normal”)

ValueError: No protein data found, please setup or transfer anndata

adamgayoso · July 30, 2021, 6:43pm

Can you provide a script showing how you’re assembling the dataset?

taopeng1100 · July 30, 2021, 7:02pm

pbmc3k = scvi.data.read_h5ad("./data/pbmc3k_raw.h5ad")
pbmc3k
AnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'

pbmc5k = sc.read_10x_h5(
    "./data/5k_pbmc_protein_v3_filtered_feature_bc_matrix.h5",
    gex_only=False
)
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
pbmc5k.var_names_make_unique()
pbmc5k
AnnData object with n_obs × n_vars = 5247 × 33570
    var: 'gene_ids', 'feature_types', 'genome'

scvi.data.organize_cite_seq_10x(pbmc5k)
pbmc5k
AnnData object with n_obs × n_vars = 5247 × 33538
    var: 'gene_ids', 'feature_types', 'genome'
    obsm: 'protein_expression'

adata = pbmc5k.concatenate(pbmc3k)
adata
AnnData object with n_obs × n_vars = 7947 × 20453
    obs: 'batch'
    var: 'gene_ids-0', 'feature_types-0', 'genome-0', 'gene_ids-1'
sc.pp.filter_genes(adata, min_counts=3)
sc.pp.filter_cells(adata, min_counts=3)
adata.layers["counts"] = adata.X.copy()
adata
AnnData object with n_obs × n_vars = 7947 × 14309
    obs: 'batch', 'n_counts'
    var: 'gene_ids-0', 'feature_types-0', 'genome-0', 'gene_ids-1', 'n_counts'
    layers: 'counts'
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata
scvi.data.setup_anndata(adata, layer="counts", batch_key="batch")
INFO     Using batches from adata.obs["batch"]                                               
INFO     No label_key inputted, assuming all cells have same label                           
INFO     Using data from adata.layers["counts"]                                              
INFO     Computing library size prior per batch                                              
INFO     Successfully registered anndata object containing 7947 cells, 14309 vars, 2 batches,
         1 labels, and 0 proteins. Also registered 0 extra categorical covariates and 0 extra
         continuous covariates.                                                              
INFO     Please do not further modify adata until model is trained.                          

scvi.data.setup_anndata(pbmc5k, protein_expression_obsm_key="protein_expression")
INFO     No batch_key inputted, assuming all cells are same batch                            
INFO     No label_key inputted, assuming all cells have same label                           
INFO     Using data from adata.X                                                             
INFO     Computing library size prior per batch                                              
INFO     Using protein expression from adata.obsm['protein_expression']                      
INFO     Using protein names from columns of adata.obsm['protein_expression']                
INFO     Successfully registered anndata object containing 5247 cells, 33538 vars, 1 batches,
         1 labels, and 32 proteins. Also registered 0 extra categorical covariates and 0     
         extra continuous covariates.                                                        
INFO     Please do not further modify adata until model is trained.                          

scvi.data.view_anndata_setup(pbmc5k)
Anndata setup with scvi-tools version 0.12.1.
              Data Summary              
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃             Data             ┃ Count ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│            Cells             │ 5247  │
│             Vars             │ 33538 │
│            Labels            │   1   │
│           Batches            │   1   │
│           Proteins           │  32   │
│ Extra Categorical Covariates │   0   │
│ Extra Continuous Covariates  │   0   │
└──────────────────────────────┴───────┘
                   SCVI Data Registry                    
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Data        ┃       scvi-tools Location        ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│         X          │             adata.X              │
│   batch_indices    │     adata.obs['_scvi_batch']     │
│    local_l_mean    │ adata.obs['_scvi_local_l_mean']  │
│    local_l_var     │  adata.obs['_scvi_local_l_var']  │
│       labels       │    adata.obs['_scvi_labels']     │
│ protein_expression │ adata.obsm['protein_expression'] │
└────────────────────┴──────────────────────────────────┘
                        Label Categories                        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃      Source Location      ┃ Categories ┃ scvi-tools Encoding ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ adata.obs['_scvi_labels'] │     0      │          0          │
└───────────────────────────┴────────────┴─────────────────────┘
                       Batch Categories                        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃     Source Location      ┃ Categories ┃ scvi-tools Encoding ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ adata.obs['_scvi_batch'] │     0      │          0          │
└──────────────────────────┴────────────┴─────────────────────┘

adata
AnnData object with n_obs × n_vars = 7947 × 14309
    obs: 'batch', 'n_counts', '_scvi_batch', '_scvi_labels', '_scvi_local_l_mean', '_scvi_local_l_var'
    var: 'gene_ids-0', 'feature_types-0', 'genome-0', 'gene_ids-1', 'n_counts'
    uns: 'log1p', '_scvi'
    layers: 'counts'


vae = scvi.model.TOTALVI(pbmc5k, latent_distribution="normal")

taopeng1100 · July 30, 2021, 7:08pm

I got the error message when running:
vae = scvi.model.TOTALVI(pbmc5k, latent_distribution=“normal”)

I think adata is from concatenation of pbmc3k + pbmc5k but only 5k has protein expression data.

adamgayoso · July 30, 2021, 8:08pm

It’s a bit difficult for me to follow what you’re trying to do. Can you check that pbmc5k.obsm["protein_expression"] works?

taopeng1100 · July 30, 2021, 8:20pm

It returns the table of 5247 rows × 32 columns

As I said, when adata was created by concatenation of pbmc3k with 5k, there is obsm[protein_expression]!

taopeng1100 · July 30, 2021, 8:22pm

I am curious, what is the typical preprocessing steps in Scanpy before the scvi modeling? This seems to be very memory intensive? I am using my laptop to run Scanpy and I am not sure if this is feasible with scvi?

taopeng1100 · July 30, 2021, 8:44pm

Is it possible to reduce the amount of genes for modeling step?

sc.pp.highly_variable_genes(
adata1,
n_top_genes=4000,
flavor=“seurat_v3”,
subset=True,
layer=“counts”
)

Can we do like 2000 genes?

adamgayoso · July 30, 2021, 8:49pm

It shouldn’t require much memory.

I can’t reproduce this error. As this is publicly available data, could you perhaps create an example on Google Colab reproducing the issue?

taopeng1100 · July 30, 2021, 8:53pm

this is the copy of my Jupyter scripts running scvi step: vae:train for pbmc5k (5247 cells for 4000 genes). It has been running 7 mins and it is only 24% finished. Is this normal?

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
Epoch 99/400: 24%|██▍ | 98/400 [17:17<51:15, 10.19s/it, loss=1.35e+03, v_num=1]

adamgayoso · July 30, 2021, 9:00pm

Yes this would be normal I believe. Everything is much faster if you use a GPU, which Colab provides for free.

taopeng1100 · July 30, 2021, 9:03pm

Thx and I will try to use Colab to learn scvi using pbmc5k dataset.

taopeng1100 · August 1, 2021, 4:31am

I am still reading the tutorial carefully to understand how TOTALVI works:

for this part:

Analyze outputs

We use Scanpy for clustering and visualization after running totalVI. It’s also possible to save totalVI outputs for an R-based workflow. First, we store the totalVI outputs in the appropriate slots in AnnData.

The code is copied here:

adata.obsm[“X_totalVI”] = vae.get_latent_representation()

rna, protein = vae.get_normalized_expression(
n_samples=25,
return_mean=True,
transform_batch=[“PBMC10k”, “PBMC5k”]
)

adata.layers[“denoised_rna”], adata.obsm[“denoised_protein”] = rna, protein

adata.obsm[“protein_foreground_prob”] = vae.get_protein_foreground_probability(
n_samples=25,
return_mean=True,
transform_batch=[“PBMC10k”, “PBMC5k”]
)
parsed_protein_names = [p.split("_")[0] for p in adata.obsm[“protein_expression”].columns]
adata.obsm[“protein_foreground_prob”].columns = parsed_protein_names

I have two questions for these codes and hope you can help me!
n_samples = 25, what does it mean?
If I only have one sample such as pbmc5k, can I skip ‘’ transform_batch=[“PBMC10k”, “PBMC5k”]? "

Topic		Replies	Views
Error in scvi.model.TOTALVI.setup_anndata when loading protein-only data scvi-tools totalvi	5	514	August 25, 2022
totalVI NaN loss with few proteins scvi-tools	7	871	August 3, 2022
CellAssign keyword error: After Integration scvi-tools scvi	3	638	May 26, 2022
Sparse matrix error using totalVI integration scvi-tools integration	1	249	August 4, 2023
Cannot transfer setup without `extend_categories = True` scvi-tools	2	668	December 15, 2022

totalVI workflow

Analyze outputs

Related Topics