Merging data from multiple cohorts and many donors with scVI

Hi there,
First of all, thanks so much for creating this amazing set of tools! I’m a new user, so my apologies if I’ve missed relevant docs that would answer this question.

I’m interested in creating a large “atlas” from multiple (~20) independent cohorts which collectively span ~1000 donors. I have two related questions:

  1. Is there a recommended minimum number of cells per donor, i.e. per batch (I am using the donor as the “batch” identifier)? I understand that there are suggestions to keep the number of cells greater than the number of genes, but I’m not sure if this applies within individual batches as well.
  2. It looks like there are systematic technical differences in expression between the cohorts. Is there a way to include the cohort as an additional “batch” covariate for the model fitting? Does that even make sense, given that each donor has a cohort membership so the model already has freedom to fit those differences on a per-donor basis?

Any other thoughts/suggestions you might have on integrating very many batches would be much appreciated! For example, would it be better to integrate a few big batches and then bring the rest into that latent space using something like scArches? Are there SCVI flags (use_layer_norm, use_batch_norm, etc.) that might be appropriate for handling many cells/batches?
Thanks in advance for any advice you can offer!
Take care,
Phil

Hi Phil,

It’s hard to say. What you’re attempting to do is beyond what we have explicitly tested (and very cool!).
We are currently working on an extension related to this post:

Yes, you can register both the donor and the cohort as keys using categorical_covariate_keys; see scvi.data.setup_anndata in the scvi-tools docs.
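
As a rough illustration of that suggestion, the sketch below registers donor and cohort as categorical covariates. The column names "donor" and "cohort" and both helper names are assumptions for illustration only. Recent scvi-tools releases expose the call as scvi.model.SCVI.setup_anndata, while older releases used scvi.data.setup_anndata as referenced in this thread, so the sketch tries both.

```python
def missing_covariates(obs_columns, covariate_keys):
    """Return the covariate keys that are absent from the given column names."""
    present = set(obs_columns)
    return [key for key in covariate_keys if key not in present]


def register_donor_and_cohort(adata, covariate_keys=("donor", "cohort")):
    """Register donor and cohort as categorical covariates (hypothetical helper).

    Assumes adata.obs has one column per key; the column names are placeholders.
    """
    keys = list(covariate_keys)
    missing = missing_covariates(adata.obs.columns, keys)
    if missing:
        raise KeyError(f"adata.obs is missing columns: {missing}")

    import scvi  # imported lazily so the column check works without scvi-tools

    if hasattr(scvi.model.SCVI, "setup_anndata"):
        # Newer releases: setup is a classmethod on the model.
        scvi.model.SCVI.setup_anndata(adata, categorical_covariate_keys=keys)
    else:
        # Older releases: the module-level function referenced in this thread.
        scvi.data.setup_anndata(adata, categorical_covariate_keys=keys)
    return adata
```

After registration, the model would be constructed from the same AnnData object as usual, e.g. scvi.model.SCVI(adata).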

I would be interested to learn a bit more about what you’re trying to do. What you stated are reasonable things to try though. Please feel free to email me (firstlast at berkeley dot edu) if you’d like to schedule a meeting!
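
Since there is no tested minimum number of cells per donor here, it may help to tabulate cells per donor before fitting to see how many donors are very small. A minimal sketch using only the standard library follows; the threshold of 50 cells is an arbitrary placeholder and not a recommendation from scvi-tools, and the function name is hypothetical.

```python
from collections import Counter


def small_donors(donor_labels, min_cells=50):
    """Return {donor: n_cells} for donors with fewer than min_cells cells."""
    counts = Counter(donor_labels)
    return {donor: n for donor, n in counts.items() if n < min_cells}


# Toy per-cell donor labels; in practice this might be adata.obs["donor"]
# (the column name is an assumption).
labels = ["d1"] * 120 + ["d2"] * 8 + ["d3"] * 60
flagged = small_donors(labels, min_cells=50)
print(flagged)  # {'d2': 8}
```

Whether flagged donors should be dropped, pooled, or kept is a judgment call that depends on the atlas; the tally only makes the distribution visible.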

Hi Adam,
Thanks for your kind reply and the pointer to that other relevant post. I will learn more about the categorical_covariate_keys argument.
I’m also very happy to chat more about this specific application. I’ll email you.
Take care,
Phil

PS. Sorry for the delay in replying; I closed the tab and missed the notification of your reply.