Models for handling hundreds of batches

munfred · October 21, 2020, 9:55pm

Hello, I routinely train scVI models with up to 30 batches without a hitch. In my case I’m integrating all publicly available C. elegans data, which will continue to grow and is likely to reach hundreds of batches within the next year.

I’d like to continue integrating all datasets as they come out, but I know that each additional batch is another neural network, so I’m not sure how the memory scaling works as I increase the number of batches with the current model.

Romain suggested a really interesting solution, which is to make a hierarchical model that in addition to cell embeddings, also learns a batch embedding, instead of one-hot encoding the batches as is currently done.
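
To make the scaling question concrete, here is a rough sketch (not the scvi-tools implementation) that counts the weights in a single dense layer whose input is the latent vector concatenated with the batch representation. The layer sizes (`n_latent=10`, `n_hidden=128`, `embed_dim=5`) and all function names are assumptions picked for illustration only.

```python
def first_layer_weights(n_latent, n_batch_features, n_hidden):
    """Weights in a dense layer whose input is [latent, batch features]."""
    return (n_latent + n_batch_features) * n_hidden


def one_hot_cost(n_batches, n_latent=10, n_hidden=128):
    # One-hot: the layer input grows by one column per batch.
    return first_layer_weights(n_latent, n_batches, n_hidden)


def embedding_cost(n_batches, n_latent=10, n_hidden=128, embed_dim=5):
    # Embedding: a lookup table of n_batches x embed_dim, plus a layer
    # whose input width no longer depends on the number of batches.
    table = n_batches * embed_dim
    return table + first_layer_weights(n_latent, embed_dim, n_hidden)


if __name__ == "__main__":
    # Compare the two encodings as the number of batches grows.
    for n in (30, 100, 1000):
        print(n, one_hot_cost(n), embedding_cost(n))
```

Under these assumptions both encodings grow linearly with the number of batches, but one-hot adds `n_hidden` weights per batch for each layer it is fed into, while the embedding adds only `embed_dim` weights per batch in total.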

In addition to being a cool concept, that would allow unlimited batch integration. Imagine integrating all human or mouse data in a single model! That would make it really easy to let people make comparisons between arbitrary groups of cells from ANY experiment! Such a “mega model” could then be updated on a regular basis as new data becomes available, the same way that genome releases are regularly made.

-Eduardo

PS: First post (:

Valentine_Svensson · October 22, 2020, 4:24pm

I think the hierarchical model concept that Romain suggested is extremely exciting!

It would be very interesting to explore this idea on the Human Cell Landscape dataset: https://www.nature.com/articles/s41586-020-2157-4

This dataset consists of ~600,000 cells from 104 batches representing a large number of cell types from a large number of human organs.

You can see an exploratory analysis of this data that I did with scVI here: https://gist.github.com/vals/7232c1ec808cd67eb67fe3cc99c87e18

I’m hosting an optimized copy of the data here in case you want to experiment with it (converting the published data to a sparse matrix took about 8 hours): https://storage.googleapis.com/h5ad/10.1038-s41586-020-2157-4/HCL_combined.h5ad

munfred · June 13, 2021, 7:53pm

For the record @romain_lopez: Ren et al. just published a dataset with 1.4M (primarily blood) cells from 196 COVID patients. I have the data in a 26GB .h5ad file and can share an S3 link if you’re interested:

https://doi.org/10.1016/j.cell.2021.01.053

Justin_Hong · September 7, 2021, 7:05pm

Hi @munfred, I’m a recent hire on the scvi-tools team, and I’d be interested in exploring the dataset you linked and experimenting with the batch embedding idea. Would you mind sending me the S3 link if it’s still available?

maarten-hifibio · September 8, 2021, 9:05am

I think this is a great idea! Very curious to know what you find.
