Models for handling hundreds of batches

Hello, I routinely train scVI models with up to 30 batches without a hitch. In my case I’m integrating all publicly available C. elegans data, which will continue to grow and is likely to reach hundreds of batches within the next year.

I’d like to continue integrating all datasets as they come out, but I know that each additional batch is another neural network, so I’m not sure how the memory scaling works as I increase the number of batches with the current model.

Romain suggested a really interesting solution: a hierarchical model that, in addition to cell embeddings, also learns a batch embedding instead of one-hot encoding the batches as is currently done.
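To make the difference concrete, here is a minimal NumPy sketch of the two ways of feeding the batch label into a layer. It is not scVI's actual code: the sizes (`n_batches`, `n_latent`, `n_hidden`, `d_batch`) and all variable names are hypothetical, and the weights are random stand-ins for parameters that would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_batches = 300  # number of batches to integrate
n_latent = 10    # size of the cell embedding z
n_hidden = 128   # width of the layer that receives the batch information
d_batch = 5      # size of the learned batch embedding
n_cells = 8

batch_index = rng.integers(0, n_batches, size=n_cells)  # batch label per cell
z = rng.normal(size=(n_cells, n_latent))                # stand-in cell embeddings

# Option 1: one-hot batch covariate.
# The layer's input width, and so its weight matrix, grows with n_batches.
one_hot = np.eye(n_batches)[batch_index]
x_one_hot = np.concatenate([z, one_hot], axis=1)
W_one_hot = rng.normal(size=(n_latent + n_batches, n_hidden))
h_one_hot = x_one_hot @ W_one_hot

# Option 2: batch embedding.
# Each batch is a row of a small table; the layer's input width stays fixed.
batch_table = rng.normal(size=(n_batches, d_batch))
batch_vec = batch_table[batch_index]
x_embed = np.concatenate([z, batch_vec], axis=1)
W_embed = rng.normal(size=(n_latent + d_batch, n_hidden))
h_embed = x_embed @ W_embed

# An embedding lookup is the same as multiplying the one-hot vector by the table.
assert np.allclose(one_hot @ batch_table, batch_vec)

# Parameters tied to this single layer under each option.
params_one_hot = W_one_hot.size
params_embed = batch_table.size + W_embed.size
print(params_one_hot, params_embed)
```

In this toy setup every extra batch costs `n_hidden` weights in each layer that receives the one-hot vector, versus `d_batch` values in the embedding table. The table still grows with the number of batches, but far more slowly, and the layer shapes no longer depend on it.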

In addition to being a cool concept, that would allow unlimited batch integration. Imagine integrating all human or mouse data in a single model! That would make it really easy to let people make comparisons between arbitrary groups of cells from ANY experiment! Such a “mega model” could then be updated on a regular basis as new data becomes available, the same way that genome releases are regularly made :exploding_head:

-Eduardo

PS: First post (:

I think the hierarchical model concept that Romain suggested is extremely exciting!

It would be very interesting to explore this idea on the Human Cell Landscape dataset: https://www.nature.com/articles/s41586-020-2157-4

This dataset consists of ~600,000 cells from 104 batches representing a large number of cell types from a large number of human organs.

You can see an exploratory analysis of this data that I did with scVI here: https://gist.github.com/vals/7232c1ec808cd67eb67fe3cc99c87e18

I’m hosting an optimized copy of the data here in case you want to experiment with it (converting the published data to a sparse matrix took about 8 hours): https://storage.googleapis.com/h5ad/10.1038-s41586-020-2157-4/HCL_combined.h5ad

For the record @romain_lopez: Ren et al. just published a dataset with 1.4M (primarily blood) cells and 196 COVID patients. I have the data as a 26 GB .h5ad file and can share an S3 link if you’re interested:

https://doi.org/10.1016/j.cell.2021.01.053