Having some difficulties with CellAssign, help please =)

Hey everyone. Historically I have done my cell annotations either manually or with SingleR and reference databases (I come from the Bioconductor world originally).

I wanted to give CellAssign a try for a recent project because some of the cell types aren’t in the databases, and this could be a great way to “automate” the manual annotation since I get to assign my own markers with it.

I followed this guide → Annotation with CellAssign — scvi-tools and get no errors, but cells get dumped into incorrect categories. The default 400 epochs puts everything into the last two categories, and when I try 50 epochs everything gets dumped into the “other” category… (I have 210,000 cells, 10X platform, 5′ kits)

I have an RTX GPU, so re-running model.train() takes very little time and I am happy to try other approaches if people have suggestions…

My ideas are:

  1. After bdata = adata[:, marker_gene_mat.index].copy() I am going to have a lot of cells with zero counts across all marker genes. Are these all-zero cells confusing the model? T cells are tricky because CD4 and CD8A transcripts won’t be detected in all of them. If I remove them, how hard is it to extrapolate their labels later from the cells where the markers were detected? EDIT: I attempted to control for this (see first comment); it didn’t fix my problem.

  2. The tutorial doesn’t want log-transformed data, but perhaps I need to format the counts differently than I am. I have tried raw counts and normalized raw counts…

  3. There also seems to be a VERY old R cellassign package that uses TensorFlow. Is it the same thing as this? I think I would be better at troubleshooting an SCE object than an AnnData one because of my background, but I have no idea if the projects are linked or just have the same name.

Oh, and 41 genes are being used in the celltype.csv.

Thanks in advance for troubleshooting help!

Okay, here are some updates… After setting sc.pp.filter_cells(bdata, min_genes=1) only 14 cells were removed out of the 200K, so I don’t think that was my problem. I also removed one subset of cells and 2 genes to try and simplify things, and set min_genes to 2 out of my remaining 39 genes; 2 is the size of my smallest marker set (my CD4 dump) and 7 is my largest (Tfh)…

This approach didn’t give me any different results compared to before, at either 400 or 50 epochs… =(

Moving on to test hypothesis number 2.
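For anyone wanting to run the same check without filtering in place, here is a minimal sketch of counting how many marker genes are detected per cell. It uses plain NumPy on a made-up toy matrix; the function name marker_detection_summary and the numbers are hypothetical, and the "at least min_genes genes with a nonzero count" rule is my understanding of what sc.pp.filter_cells(..., min_genes=...) does.

```python
import numpy as np

def marker_detection_summary(counts, min_genes=1):
    """Count detected marker genes per cell and flag cells passing a threshold.

    counts: 2-D array-like, cells x marker genes, raw UMI counts.
    """
    counts = np.asarray(counts)
    # Number of marker genes with at least one UMI in each cell
    n_detected = (counts > 0).sum(axis=1)
    # Cells that would survive a min_genes-style filter
    keep = n_detected >= min_genes
    return n_detected, keep

# Toy matrix: 4 cells x 3 marker genes (made-up numbers)
toy_counts = np.array([
    [0, 0, 0],
    [2, 0, 0],
    [1, 3, 0],
    [4, 1, 2],
])
n_detected, keep = marker_detection_summary(toy_counts, min_genes=1)
n_removed = int((~keep).sum())
print(n_detected, keep, n_removed)
```

On a real object the input would be something like the dense form of bdata.X; looking at the distribution of n_detected before filtering shows how many cells sit at zero or one detected marker.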

They won’t confuse the model, but you might consider adding more markers to the marker matrix.

Just the plain UMI counts are the input
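A quick sanity check along these lines can help confirm that an exported matrix really is raw counts before training. This is only a sketch: looks_like_raw_counts is a hypothetical helper, not part of scvi-tools or scanpy, and it assumes a dense array (for a SciPy sparse matrix, passing its .data attribute should be enough, since the implicit zeros are integers anyway).

```python
import numpy as np

def looks_like_raw_counts(x):
    """Return True if every value is a non-negative whole number."""
    x = np.asarray(x, dtype=float)
    return bool(np.all(x >= 0) and np.all(x == np.round(x)))

# Toy raw UMI counts: 2 cells x 3 genes (made-up numbers)
raw = np.array([[0, 3, 1],
                [5, 0, 2]])

# The same data after a typical library-size normalization and log1p
log_normalized = np.log1p(raw / raw.sum(axis=1, keepdims=True) * 1e4)

print(looks_like_raw_counts(raw))             # raw counts pass
print(looks_like_raw_counts(log_normalized))  # transformed values do not
```

If the check fails on the matrix you exported from R, the export probably picked up a normalized or log-transformed assay rather than the counts assay.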

This is a reimplementation of the R version. The training is a bit different and it should be much more scalable than the original version.

What are you currently using for the size factors? In the tutorial, I believe it’s using size factors computed originally with scran.

They won’t confuse the model, but you might consider adding more markers to the marker matrix.

Awesome Possum. I removed them anyway and learned they weren’t the problem, but it is good to know for the future and I will leave them in next time.

Just the plain UMI counts are the input

Great!

What are you currently using for the size factors? In the tutorial, I believe it’s using size factors computed originally with scran.

I will double-check this tonight. I am pretty sure I was just using computeSumFactors, but I should go back to my pipeline to verify.

I have mostly solved it. I was able to improve my results by changing how I exported the count matrix, as well as moving my “dump” other category to the end. I don’t know if the order actually makes a difference, but I went from 160,000 cells in “other” to none, so I am happy about that.

I have a bunch of stuff on my plate currently, but maybe at the end of the summer I can write up a little R tutorial for formatting and exporting data properly for use with scvi-tools CellAssign, for the non-scanpy / Bioconductor people… I am sure my teething pains have more to do with not being as proficient with scvi-tools & scanpy than with anything else.

The order of the categories shouldn’t make a difference… it would be great if you could confirm this.

By the way, I realize we have a bug in the tutorial regarding size factors. Using scran is ideal, but if using sum of UMI counts (library size), it needs to be normalized by the mean library size:

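import numpy as np
# Library size (total UMI count) per cell, flattened to 1-D since adata.X may be sparse
lib_size = np.asarray(adata.X.sum(1)).ravel()
adata.obs["size_factor"] = lib_size / np.mean(lib_size)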
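To make the normalization concrete, here is a small worked sketch of the same calculation on a toy matrix, using NumPy only. The helper name library_size_factors and the toy numbers are made up for illustration; the formula is the one given above (library size divided by the mean library size).

```python
import numpy as np

def library_size_factors(counts):
    """Per-cell size factors: library size divided by the mean library size.

    counts: cells x genes matrix of raw UMI counts (dense array, or anything
    whose .sum(axis=1) can be converted to a NumPy array).
    """
    lib_size = np.asarray(counts.sum(axis=1), dtype=float).ravel()
    return lib_size / lib_size.mean()

# Toy counts: 3 cells x 2 genes, with library sizes 100, 200 and 300
toy_counts = np.array([[60, 40],
                       [150, 50],
                       [100, 200]])
size_factors = library_size_factors(toy_counts)
print(size_factors)         # [0.5 1.  1.5]
print(size_factors.mean())  # 1.0
```

The mean library size here is 200, so the factors come out as 0.5, 1.0 and 1.5 and average to 1. Unnormalized library sizes (100, 200, 300) would be on a completely different scale, which is the tutorial bug described above.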