Hello, I routinely train scVI models with up to 30 batches without a hitch. In my case I’m integrating all publicly available C. elegans data, which will continue to grow and is likely to reach hundreds of batches within the next year.
I’d like to keep integrating all datasets as they come out, but as far as I understand each additional batch grows the one-hot batch encoding that gets fed into the networks, so I’m not sure how memory scales as the number of batches increases with the current model.
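Just to make the scaling concern concrete, here is a rough back-of-envelope (my own assumption, not the actual scvi-tools internals): if the one-hot batch vector is concatenated to the input of each hidden layer it is injected into, every extra batch adds roughly one extra weight column per injected layer.

```python
# Back-of-envelope only -- assumes the one-hot batch vector is concatenated to
# the input of each injected hidden layer; n_hidden and the number of injected
# layers are made-up values here, not scvi-tools defaults.
def extra_weights_per_batch(n_hidden=128, n_injected_layers=2):
    """Approximate additional weights contributed by one more batch column."""
    return n_hidden * n_injected_layers

for n_batches in (30, 100, 500):
    print(f"{n_batches} batches -> ~{n_batches * extra_weights_per_batch():,} extra weights")
```

So under that assumption the growth is linear in the number of batches, but I’d love confirmation of how it actually behaves.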
Romain suggested a really interesting solution: a hierarchical model that, in addition to cell embeddings, also learns a batch embedding, instead of one-hot encoding the batches as is currently done.
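For what it’s worth, here is the kind of thing I imagine (a minimal PyTorch sketch of the idea, not Romain’s design or the scvi-tools implementation): replace the one-hot vector with a learned `nn.Embedding` over batches, so the batch representation has a fixed size no matter how many batches there are.

```python
# Minimal sketch of a learned batch embedding -- illustrative only, not scVI.
import torch
import torch.nn as nn

class BatchAwareEncoder(nn.Module):
    def __init__(self, n_genes, n_batches, batch_dim=16, n_hidden=128, n_latent=10):
        super().__init__()
        # One learned vector per batch instead of a growing one-hot encoding.
        self.batch_embedding = nn.Embedding(n_batches, batch_dim)
        self.net = nn.Sequential(
            nn.Linear(n_genes + batch_dim, n_hidden),
            nn.ReLU(),
        )
        self.latent_mean = nn.Linear(n_hidden, n_latent)

    def forward(self, x, batch_index):
        b = self.batch_embedding(batch_index)        # (cells, batch_dim)
        h = self.net(torch.cat([x, b], dim=-1))      # condition on the batch vector
        return self.latent_mean(h)
```

With a fixed `batch_dim`, each new batch only adds one embedding row, which is what makes the idea feel like it could scale to hundreds of batches.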
In addition to being a cool concept, that would allow unlimited batch integration. Imagine integrating all human or mouse data in a single model! That would make it really easy for people to compare arbitrary groups of cells from ANY experiment. Such a “mega model” could then be updated on a regular basis as new data becomes available, the same way that genome releases are regularly made.
PS: First post (: