scANVI relabels cells with known types incorrectly

Hi scvi-tools Team,

I have been trying out scVI and scANVI and am evaluating how well label transfer works. For my current test setup, I have 4 different datasets (human liver), which I have manually labeled. To test scANVI, I overwrite the labels for one dataset with “Unknown”. Otherwise I pretty much just follow your tutorial for scANVI.

The label transfer works pretty well for the dataset with the “Unknown” cell types. However, when I use the SCANVI.predict() function, it also generates wrong labels for cells whose types are actually known (i.e. not marked as “Unknown”), and the errors are substantial.

Now my question: is that expected, i.e. is scANVI supposed to predict labels for these cells as well? And the more difficult question: any idea why it behaves this way? I’m still assuming that I’m just doing something wrong, but I can’t figure out what. I have tried both starting from a pre-trained scVI model and training a scANVI model from scratch. Any help would be highly appreciated!

I’ll put code and a figure below. The problem is clearest in the NKT cluster: compare it in celltype_scanvi with the same cluster in C_scANVI. This cluster doesn’t contain unlabeled cells.

Cheers,
Kevin

import numpy as np
import scanpy as sc
import scvi

# Start with every cell marked "Unknown"
adata.obs["celltype_scanvi"] = "Unknown"

# Copy the manual labels for datasets 0, 1, 2; the remaining dataset stays "Unknown".
# Using .loc avoids the chained assignment of the form obs[col][mask] = ...
labeled = adata.obs["batch"].isin(["0", "1", "2"])
adata.obs.loc[labeled, "celltype_scanvi"] = adata.obs.loc[labeled, "celltype"].astype(str)

adata.obs["celltype_scanvi"] = adata.obs["celltype_scanvi"].astype("str")

np.unique(adata.obs["celltype_scanvi"], return_counts=True)

scvi.data.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",
    labels_key="celltype_scanvi",
)

lvae = scvi.model.SCANVI(adata, "Unknown", n_latent=30, n_layers=2)

lvae.train(n_samples_per_label=100)

adata.obs["C_scANVI"] = lvae.predict(adata)
adata.obsm["X_scANVI"] = lvae.get_latent_representation(adata)
sc.pp.neighbors(adata, use_rep="X_scANVI")
sc.tl.umap(adata)

sc.pl.umap(adata, color=["celltype_scanvi", "C_scANVI", "batch"], ncols=1, frameon=False)
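To put a number on what the UMAP shows, one option is to compute, per known cell type, the fraction of cells whose prediction agrees with the manual label. The following is a minimal sketch on a toy table, not part of the original script: the helper name agreement_by_type is hypothetical, and it only assumes the two obs columns used above (celltype_scanvi and C_scANVI).

```python
import pandas as pd

def agreement_by_type(obs, known_col="celltype_scanvi", pred_col="C_scANVI", unknown="Unknown"):
    """Per known cell type, the fraction of cells whose prediction matches the label."""
    # Only cells that carried a real label during training are evaluated
    known = obs[obs[known_col] != unknown]
    labels = known[known_col].astype(str)
    match = labels == known[pred_col].astype(str)
    # Mean of a boolean series is the agreement fraction; worst types come first
    return match.groupby(labels).mean().sort_values()

# Toy stand-in for adata.obs (values are made up for illustration)
obs = pd.DataFrame({
    "celltype_scanvi": ["NKT", "NKT", "NKT", "Hepatocyte", "Hepatocyte", "Unknown"],
    "C_scANVI": ["NKT", "T cell", "T cell", "Hepatocyte", "Hepatocyte", "NKT"],
})
agreement = agreement_by_type(obs)
print(agreement)
```

On real data, agreement_by_type(adata.obs) would list the cell types that scANVI relabels most often at the top.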

I’ll have to look at this again more closely later, but a few quick comments:

  1. What version of scvi-tools are you using? In the latest version, the workflow is to first pretrain an SCVI model and then initialize SCANVI from it. This pretraining used to happen implicitly, but we separated it out for code reasons.
# Pretrain an SCVI model, then initialize SCANVI from it
vae = scvi.model.SCVI(adata, n_layers=2, n_latent=30)
vae.train()
scanvi_model = scvi.model.SCANVI.from_scvi_model(vae, 'Unknown')
scanvi_model.train(max_epochs=25)

Though I see now that the tutorial was not properly updated with this workflow.

  2. In your workflow, for how many epochs is scANVI trained?

As for whether scANVI is supposed to predict labels for the known cells as well: yes, though the accuracy should be higher.

Thanks for the answer!

I’m using version 0.10.0

I have tried both: training an SCVI model first and then starting from that, or training a SCANVI model directly.
In the former case, SCVI trains for ~60 epochs and then SCANVI for ~7 epochs. In the latter case, SCANVI alone trains for ~60 epochs.

Okay. Given that I wouldn’t want to re-label cells which I have already labeled, I could of course extract the predictions and use them only for the unknown cells. Maybe having this as an option would be helpful (i.e. fixing the labels of cells with known labels)?
Of course this still doesn’t answer the question of why it behaves so weirdly for some clusters :thinking:

This should probably be a default. We will make a note of it.
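Until something like that exists as an option, the workaround Kevin describes can be done on the obs columns directly. A minimal sketch, assuming the known labels and the predictions are aligned cell by cell; the helper name keep_known_labels is hypothetical and not part of scvi-tools:

```python
import numpy as np
import pandas as pd

def keep_known_labels(known, predicted, unknown="Unknown"):
    """Use the prediction only where the original label is `unknown`."""
    known = known.astype(str)
    # Align predictions to the cells by position, not by index label
    predicted = pd.Series(np.asarray(predicted), index=known.index).astype(str)
    # Keep the prediction where the cell was unlabeled, otherwise the known label
    return predicted.where(known == unknown, known)

# Toy example (values are made up for illustration)
known = pd.Series(["NKT", "Unknown", "Hepatocyte"], index=["c1", "c2", "c3"])
predicted = ["T cell", "Kupffer", "Hepatocyte"]
final = keep_known_labels(known, predicted)
print(final.tolist())
```

With the columns from this thread, that would be keep_known_labels(adata.obs["celltype_scanvi"], adata.obs["C_scANVI"]).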

This seems like a small number of epochs. How many cells do you have?

It’s about 140k cells. I will simply try more epochs then. I’ll let you know if it helped.
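For context on why the cell count matters: when max_epochs is not set, the default number of epochs shrinks as the dataset grows. The sketch below assumes the default is min(round(20000 / n_cells * 400), 400). That formula is an assumption about scvi-tools' heuristic and should be checked against the installed version; default_max_epochs is a hypothetical helper, not a library function.

```python
def default_max_epochs(n_cells, cap=400):
    """Assumed default-epoch heuristic; verify against the installed scvi-tools."""
    # Roughly `cap` epochs for 20k cells, scaled down inversely with
    # dataset size and never exceeding the cap.
    return min(round(20000 / n_cells * 400), cap)

epochs_140k = default_max_epochs(140_000)
epochs_small = default_max_epochs(1_000)
print(epochs_140k, epochs_small)
```

Under that assumption, 140k cells give 57 epochs, which is in line with the ~60 epochs reported above. Passing max_epochs explicitly to train() sidesteps the default.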

Hey Kevin,

scANVI predicting known celltypes incorrectly is something I’ve also observed – but haven’t extensively tested.

A few more suggestions to potentially improve results:

  1. If your smallest cell type has more than 100 cells, I would set the n_samples_per_label arg in lvae.train() to that number. (This way you’ll train on more cells each epoch.)
  2. I agree with Adam on increasing the number of scANVI epochs. I would even train for around 50 epochs, since with the n_samples_per_label param you’re subsampling the train set.
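To find the number for the first suggestion, the size of the smallest labeled cell type can be read off the label column. A minimal sketch on toy labels; the helper name smallest_labeled_count is hypothetical:

```python
import pandas as pd

def smallest_labeled_count(labels, unknown="Unknown"):
    """Number of cells in the least frequent labeled cell type."""
    # Cast to str so unused categories of a categorical column are not counted as 0
    labels = labels.astype(str)
    counts = labels[labels != unknown].value_counts()
    return int(counts.min())

# Toy labels (counts are made up for illustration)
labels = pd.Series(["Hepatocyte"] * 5 + ["NKT"] * 3 + ["Unknown"] * 2)
n_smallest = smallest_labeled_count(labels)
print(n_smallest)
```

With the column from this thread, smallest_labeled_count(adata.obs["celltype_scanvi"]) would give the value to pass as n_samples_per_label.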

Note, you’ll need to install the latest version of scvi-tools off of master. I just fixed a bug in max_epochs for scANVI: see Pull Request #1079, “fix scANVI max_epochs bug when pretrained” by galenxing, in YosefLab/scvi-tools on GitHub.

@KevinMenden We’d like to further troubleshoot this. Are you able to share your data with us?

Hi both,

sorry for not responding, I was on vacation.

I will try out your ideas. Yes the data are public so I can share them with you. I can basically send you the datasets as processed by me and the scripts I use.

Any preference about how to share the data with you?

I think a Google colab notebook (like our tutorials) that reproduces the issue is easiest, but even sharing the data and script is sufficient (dropbox, google, etc.)

Thanks!

Alright, I’ll send you something tomorrow!

Okay I’ve uploaded the labeled datasets (in .h5ad format) and the script I used here:

You should be able to just run the notebook from within that folder. I’ll install the patched scANVI version now and try to set max_epochs higher. Setting it didn’t work with the current version.

Quick update from my side:

  • increasing the scANVI epochs to 50 didn’t really help
  • additionally removing the subsampling did help

Without the subsampling and training scANVI for 50 epochs, it looks much better now and basically all cells are labeled correctly. A few labels have changed but those probably make sense.

Hey,

Having the exact same problem here. I’m wondering what Kevin means by removing the subsampling? Just setting n_samples_per_label to the number of cells in the least frequent cell type?

Thanks

Hi @nrclaudio, sorry for the late reply. This would mean setting n_samples_per_label=None, which is the default option.
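To make the difference concrete, here is a toy illustration of what capping the number of labeled cells per type does compared with no cap. It shows the idea only and is not scvi-tools' actual sampler; subsample_per_label is a hypothetical helper and the cell counts are made up.

```python
import numpy as np

def subsample_per_label(labels, n_samples_per_label, unknown="Unknown", seed=0):
    """Indices of labeled cells kept when each label is capped at n_samples_per_label.

    n_samples_per_label=None means no cap: every labeled cell is kept.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    keep = []
    for label in np.unique(labels):
        if label == unknown:
            continue  # unlabeled cells are not part of the labeled set
        idx = np.flatnonzero(labels == label)
        if n_samples_per_label is not None and len(idx) > n_samples_per_label:
            # Draw a random subset without replacement for over-represented labels
            idx = rng.choice(idx, size=n_samples_per_label, replace=False)
        keep.append(idx)
    return np.sort(np.concatenate(keep))

labels = ["Hepatocyte"] * 5000 + ["NKT"] * 300 + ["Unknown"] * 1000
capped = subsample_per_label(labels, n_samples_per_label=100)
uncapped = subsample_per_label(labels, n_samples_per_label=None)
print(len(capped), len(uncapped))
```

In this toy case the cap of 100 leaves 200 labeled cells, while None keeps all 5300, which matches the intuition that removing the subsampling lets the classifier see far more labeled cells.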