Great tool. I use it and love it.
I wanted to use the differential gene expression module. The data set I am using is already processed (accounted for batch variables, neighborhoods and UMAP, Leiden calculated).
If I understood correctly, the DGE module requires the model. What would the next steps be then? Running the model on the data without any batch key specified? Or just reprocessing the data with the batch variable specified, I guess, and then doing the DGE?
Thanks & best wishes,
If you have cell group names stored that you arrived at through some external process (say experimental design, or based on some computational analysis), then either of your strategies should work for the DGE.
Whether you condition on batches or just train the model without informing it about batches, the inferred gene expression used by the DGE should work. Adjusting for batches would help with the representation of the cells that you would typically use to learn cell types or states. If you don’t adjust for batches, the embedding might also contain batch-to-batch variation in addition to cell-to-cell variation. But for the DE, where you compare groups of cells, it shouldn’t matter.
For what you want to do, I would run it without batch adjustment etc. Just fit the model, provide the groups you want to compare, and run the DGE. (Though I would probably get curious and spend a few hours looking at how the scVI embedding/batch correction compares with the other analysis…)
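The "fit the model, provide the groups, run the DGE" route could look roughly like the sketch below. It assumes the current scvi-tools interface (`scvi.model.SCVI`), which may differ from the version discussed in this thread, and it assumes `adata` holds raw UMI counts in `.X` plus a column of group labels in `adata.obs`. The function name `run_scvi_dge` and the example column name "leiden" are made up for illustration.

```python
def run_scvi_dge(adata, groupby, group1, group2, batch_key=None):
    """Fit scVI on raw counts, then compare two groups of cells.

    Hypothetical helper: `adata` is assumed to be an AnnData object with
    raw UMI counts in .X and group labels stored in adata.obs[groupby].
    """
    import scvi  # imported lazily; requires scvi-tools to be installed

    # Register the data with scVI. batch_key=None means the model is
    # trained without being told about batches.
    scvi.model.SCVI.setup_anndata(adata, batch_key=batch_key)

    # Build and train the model with default settings.
    model = scvi.model.SCVI(adata)
    model.train()

    # Compare the two groups using labels from an external analysis,
    # for example Leiden clusters computed elsewhere.
    return model.differential_expression(
        groupby=groupby, group1=group1, group2=group2
    )
```

A call might then be `run_scvi_dge(adata, "leiden", "0", "1")`, which returns the table of per-gene DE statistics produced by the model.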
I just want to add to the response of @Valentine_Svensson.
You can either:
1. Give the model your processed dataset, with the consideration that it’s probably not ideal to feed the output of one computational batch correction tool, like Seurat v3, into scVI.
2. Give scVI the UMI count data and do not specify the batch key.
3. Give scVI the UMI count data and specify the batch key.
In any of cases 1, 2, or 3 (I could be misinterpreting your post, if 1 is what you mean), you can always use the Leiden labels you derived in your other analysis.
When you give scVI the batch information (3), as Valentine mentioned, it will create an integrated latent space. Keep in mind, the decoder of scVI takes as input the latent representation and the batch information, so in some sense, the decoded output contains the batch effect (even for 3).
If you’re using scVI just for DGE, it won’t matter much if you provide the batch information or not; however, we do have a batch_correction argument in the function, which can effectively perform DGE while accounting for the batch effect, by integrating over all batches (or the batches specified in batchid2). The default is for no batch correction (batch_correction=False) and in general this can only be effectively set to True for case 3 (not case 2).
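To make "integrating over all batches" concrete, here is a toy sketch in plain Python. It is not scVI's implementation: the decoder, the batch offsets, and all names are invented for illustration. It only shows the idea that a decoder which takes the batch as input produces batch-dependent expression, and that averaging the decoded values over every batch removes that dependence from the comparison.

```python
import math

# Hypothetical per-batch offsets on the log scale, standing in for the
# batch input of a real decoder.
BATCH_OFFSET = {"b0": 0.0, "b1": 1.0}


def decode(z, batch):
    """Toy decoder: expression depends on the latent value and the batch."""
    return math.exp(z + BATCH_OFFSET[batch])


def mean_expression(cells, batch_correction=False):
    """Average decoded expression for a list of (z, observed_batch) cells."""
    values = []
    for z, batch in cells:
        if batch_correction:
            # Decode the cell under every batch and average the results.
            decoded = [decode(z, b) for b in BATCH_OFFSET]
            values.append(sum(decoded) / len(decoded))
        else:
            # Decode the cell under the batch it was observed in.
            values.append(decode(z, batch))
    return sum(values) / len(values)


def log_fold_change(group1, group2, batch_correction=False):
    """log2 ratio of the mean decoded expression of two groups."""
    m1 = mean_expression(group1, batch_correction)
    m2 = mean_expression(group2, batch_correction)
    return math.log2(m1 / m2)


# Two groups with identical latent values that differ only in batch.
group1 = [(0.5, "b0"), (0.5, "b0")]
group2 = [(0.5, "b1"), (0.5, "b1")]

lfc_raw = log_fold_change(group1, group2, batch_correction=False)
lfc_corrected = log_fold_change(group1, group2, batch_correction=True)
```

In this toy setup the uncorrected log fold change is nonzero purely because of the batch offset, while the batch-averaged one is zero. The averaging only has something to work with when the model was trained with batch information, which matches the remark that the option is only meaningful for case 3.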