HVG selection with multiple batches


First, I wanted to say thank you very much for this suite of tools - it looks very powerful and I’m very happy to be analysing my data with it. I had a question about highly variable gene (HVG) selection when it comes to having multiple batches.

In the manuscript, it’s mentioned that:
“In the case where the PBMC datasets are integrated, the 4,000 HVGs are selected by merging HVGs computed on each dataset separately as in the Seurat v3 method.”

However, in the tutorials e.g. the one for “Atlas-level integration and label transfer” (or others), the datasets are concatenated before Seurat v3 HVG selection is performed. Can I clarify what the exact process is that’s recommended?

For example, when I’m analysing my own data, after doing HVG selection on my concatenated anndata object and then run scvi.data.setup_anndata, it registers my anndata object with 4000 vars (I chose to select 4000 HVGs), but looking at the adata.var table, those genes are not necessarily all highly variable for all the batches. Is that what is nevertheless recommended, or is it actually suggested to only use the HVGs which are highly variable in all the batches?


In the mentioned tutorial we see


Where the concatenation happens beforehand, and the “batch” key is given to the HVG function. The way this function merges genes is by looking at how many batches it was considered HVG. If there are 2 batches and only 1000 genes that are HVG in both, this function will then look at genes that are HVG in one batch.

In practice this has typically worked out for us, but you could run this function without subset=True and then inspect the results and filter yourself. We do know that having batch-specific genes can worsen integration.

Thanks for the quick reply. That makes sense. I’ll proceed that way in that case and see how it goes!