What are your thoughts on whether gene filtering should be performed before highly variable gene (HVG) selection and scVI training?
Publicly available datasets can differ greatly in the total number of genes captured, which can influence HVG selection. With other, non-VAE integration methods, I have noticed that datasets can fail to integrate even when the dataset-level batch IDs are specified as a parameter. If you remove dataset-specific genes, they do integrate, but this is not ideal if cellular composition changes from dataset to dataset.
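To make the "remove dataset-specific genes" step concrete, here is a minimal sketch of the kind of filter I mean. The helper name `shared_genes`, the `min_datasets` threshold, and the toy gene lists are all made up for illustration; in practice the gene lists would come from each dataset's `var_names`.

```python
from collections import Counter


def shared_genes(gene_lists, min_datasets=None):
    """Return the sorted genes detected in at least `min_datasets` datasets.

    With min_datasets=None, a gene must be present in every dataset
    (a strict intersection).
    """
    gene_lists = list(gene_lists)
    if min_datasets is None:
        min_datasets = len(gene_lists)
    # Count each gene once per dataset, even if it is listed twice.
    counts = Counter(g for genes in gene_lists for g in set(genes))
    return sorted(g for g, c in counts.items() if c >= min_datasets)


# Toy example: three hypothetical datasets with different captured genes.
datasets = {
    "dataset_a": ["CD3E", "CD19", "MS4A1", "XIST"],
    "dataset_b": ["CD3E", "CD19", "MS4A1"],
    "dataset_c": ["CD3E", "CD19", "LYZ"],
}

strict = shared_genes(datasets.values())                   # genes in all three
relaxed = shared_genes(datasets.values(), min_datasets=2)  # genes in at least two
print(strict)
print(relaxed)
```

The strict intersection is what made the datasets integrate for me, and the relaxed threshold is one possible compromise, but either way a gene dropped here might be a marker for a cell type that only some datasets contain.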
For scVI, would the best approach be to specify the dataset IDs as the batch key and the within-dataset sample_IDs as a categorical covariate?
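For reference, the setup I have in mind would look roughly like the sketch below. It assumes `adata.X` holds raw counts and that `adata.obs` has columns named `dataset` and `sample_id`. Those column names, the function name `integrate_with_scvi`, and the choice of 2000 HVGs are placeholders of mine, not a recommendation.

```python
def integrate_with_scvi(adata, dataset_key="dataset", sample_key="sample_id",
                        n_top_genes=2000):
    """Sketch: batch-aware HVG selection, then scVI with dataset as batch
    and within-dataset sample as a categorical covariate."""
    # Imported here so the function can be defined without the packages installed.
    import scanpy as sc
    import scvi

    # Select HVGs per dataset so no single dataset dominates the selection.
    # The "seurat_v3" flavor expects raw counts (and needs scikit-misc installed).
    sc.pp.highly_variable_genes(
        adata,
        n_top_genes=n_top_genes,
        flavor="seurat_v3",
        batch_key=dataset_key,
        subset=True,
    )

    # Register dataset as the batch and sample as an extra categorical covariate.
    scvi.model.SCVI.setup_anndata(
        adata,
        batch_key=dataset_key,
        categorical_covariate_keys=[sample_key],
    )

    model = scvi.model.SCVI(adata)
    model.train()

    # Store the integrated latent space for downstream neighbors/UMAP.
    adata.obsm["X_scVI"] = model.get_latent_representation()
    return model
```

Calling `integrate_with_scvi(adata)` on a concatenated AnnData object is the configuration I am asking about; whether the gene filtering should happen before the HVG step here is the open question.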