First, I wanted to say thank you very much for this suite of tools - it looks very powerful and I’m very happy to be analysing my data with it. I had a question about highly variable gene (HVG) selection when it comes to having multiple batches.
In the manuscript, it’s mentioned that:
“In the case where the PBMC datasets are integrated, the 4,000 HVGs are selected by merging HVGs computed on each dataset separately as in the Seurat v3 method.”
However, in the tutorials, e.g. the one for “Atlas-level integration and label transfer”, the datasets are concatenated before Seurat v3 HVG selection is performed. Could you clarify which of these two processes is recommended?
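To make the question concrete, here is a minimal sketch of how I understand the "merge HVGs computed on each dataset separately" option. It is only an illustration under my own assumptions: the function name `merge_per_batch_hvgs` is hypothetical, the variance-to-mean ratio is a simple stand-in for the Seurat v3 standardized variance, and ordering genes by the number of batches in which they are top-ranked (ties broken by median rank) is my guess at what "merging" means, not a confirmed description of the manuscript's procedure.

```python
import numpy as np
import pandas as pd


def merge_per_batch_hvgs(counts, batches, n_top):
    """Rank genes within each batch, then merge the per-batch top lists.

    counts  : DataFrame of shape (cells, genes)
    batches : one batch label per cell
    n_top   : number of genes to keep, both per batch and overall
    """
    batches = np.asarray(batches)
    ranks = {}
    for b in np.unique(batches):
        sub = counts.loc[batches == b]
        mean = sub.mean(axis=0)
        # Stand-in variability score (NOT the Seurat v3 statistic).
        score = sub.var(axis=0) / mean.replace(0, np.nan)
        ranks[b] = score.rank(ascending=False, method="first")
    ranks = pd.DataFrame(ranks)

    is_top = ranks <= n_top  # gene is an HVG within that batch
    summary = pd.DataFrame(
        {
            "n_batches": is_top.sum(axis=1),
            "median_rank": ranks.where(is_top).median(axis=1),
        }
    )
    # Prefer genes that are variable in many batches, then better-ranked ones.
    ordered = summary.sort_values(
        ["n_batches", "median_rank"], ascending=[False, True]
    )
    return ordered.head(n_top)


# Toy data: 2 batches x 4 cells, 3 genes (made-up values).
counts = pd.DataFrame(
    {
        "g1": [0, 10, 0, 10, 0, 10, 0, 10],  # variable in both batches
        "g2": [5, 5, 5, 5, 1, 9, 1, 9],      # variable only in batch B
        "g3": [4, 6, 4, 6, 5, 5, 5, 5],      # mildly variable only in batch A
    }
)
batches = ["A"] * 4 + ["B"] * 4
merged = merge_per_batch_hvgs(counts, batches, n_top=2)
print(merged)
```

With this kind of merge, a gene that is variable in only one batch can still end up in the final list, which is the behaviour I am asking about.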
For example, when analysing my own data, I do HVG selection on my concatenated anndata object and then run scvi.data.setup_anndata. It registers my anndata object with 4000 vars (I chose to select 4000 HVGs), but looking at the adata.var table, those genes are not necessarily highly variable in all the batches. Is that nevertheless what is recommended, or is it suggested to use only the genes that are highly variable in all the batches?
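For reference, this is roughly how I am checking the per-batch status of the selected genes. The sketch assumes HVG selection was run with a batch key, so that the var table carries a `highly_variable_nbatches` column next to `highly_variable`; the small table below is a toy stand-in for adata.var with made-up gene names and values, and `n_batches = 3` is an assumed batch count.

```python
import pandas as pd

# Toy stand-in for adata.var after batch-aware HVG selection (hypothetical values).
var = pd.DataFrame(
    {
        "highly_variable": [True, True, True, False],
        "highly_variable_nbatches": [3, 2, 1, 0],
    },
    index=["geneA", "geneB", "geneC", "geneD"],
)
n_batches = 3  # assumed number of batches in the concatenated object

# Genes that made it into the final HVG set.
selected = var[var["highly_variable"]]

# How many selected genes are highly variable in 1, 2, ... batches.
print(selected["highly_variable_nbatches"].value_counts().sort_index())

# The stricter alternative: keep only genes that are HVGs in every batch.
hvgs_all_batches = selected.index[
    selected["highly_variable_nbatches"] == n_batches
].tolist()
print(hvgs_all_batches)
```

On my real data the stricter list is much shorter than 4000 genes, hence the question of which set should be passed to the model.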