Usage of HVG in scVI

Question copied from GitHub

The question is, does scVI use highly variable genes selected by scanpy?

I ran scvi 1) with no highly variable genes selected, 2) with highly variable genes selected, 3) with highly variable genes selected but manually removed some genes from HVG.

The results are different for 1) and 2) but the same for 2) and 3).

This is confusing. Anyone has any idea?

Thanks!
Ling

Hi Ling,

scVI uses whatever genes are present in the adata object. Empirically, we tend to see better dimensionality reduction results when adata only contains highly variable genes, which is what we do in our tutorials.

This is also the case for most if not all integration/batch-correction methods. E.g., this paper shows using HVG leads to better integration

You would likely see the same phenomenon with a simple PCA.

Dear Adam,

I wonder whether there is a possibility to choose which genes will be used for the model. Let’s assume I compute HVGs using scanpy but don’t subset my anndata object (that is, HVGs will be tagged instead of the removal of non-variable genes) the model will use all genes instead of the HVGs. Therefore, it is necessary to subset the anndata before model setup if one wishes to use the HVGs for training. However, it would be desirable not to subset the data and keep the original anndata object intact.

I think if there was a possiblity to make scvi only use the tagged HVGs, that would increase compatibility with scanpy when using multiple layers for raw counts, normalized counts and simplify later sub-clustering approaches which would require to re-compute the HVGs (which is not possible when only keeping the HVGs from level 1 clustering).

Hi,

The strategy I use to maintain the full dataset after selecting the HVGs is to store all genes in to the adata.raw field: anndata.AnnData.raw — anndata 0.7.7.dev7+g9620645 documentation

If you put all genes the .raw, and do analysis using a subset of genes that results in information stored in .obs or .obsm, then you can recover all genes using adata.raw.to_adata(). This will leave all the new obs and obsm in the anndata.

Dear Valentine,

this is useful information, thank you. I never fully grasped the concept of the .raw in scanpy (I must say the description is not very intuitive in my eyes though).

Will this also recover the layer of raw counts (not only log-normalized)?

Kind regards

Following this pattern that is in our tutorials and suggested by @Valentine_Svensson

adata.layers["counts"] = adata.X.copy() # preserve counts
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata # freeze the state in `.raw`

leads to adata.raw.X being log normalized. The counts can’t be retrieved from .raw. raw is invariant to adata being subset at the gene level.

We considered this pattern, but it proved to be difficult to implement as we then needed to keep a gene-level mask around. It also was slower as every minibatch of data that goes through the model needed to be subset. While there is some inconvenience on the user-end for managing data, we felt that a what you see is what you get approach is also more straightforward to use in a pipeline.