SOLO usage - batch, training, predicting

Hello - great set of tools, extremely useful.

I was hoping to try SOLO for doublet detection and can see it is now incorporated into scvi-tools.
I’m not finding its usage particularly clear.
It seems the guidance is that it should be run on a single droplet sequencing lane at a time, and that it’s not possible to run it after training an scvi model with batch correction. Is that correct?

After training and then predicting, I get a numpy array that is a little difficult to interpret: its dimensions are larger than the original anndata object, so perhaps it contains the simulated doublets and the native cells together. It’s just a little unclear.
Any chance you could clarify how best to use this tool?


Thank you for raising this. Indeed, it looks like the predict() method returns predictions for both the real cells and the simulated doublets.

On running one lane at a time: yes, this is true. This is how Solo was designed, as doublets are generated within a specific lane.

The workflow in the documentation examples should be followed. In the meantime, here is a workaround to process the output of the predict function.

import pandas as pd

def process_predict_output(output, solo_model):
    # Rows of `output` cover real cells plus simulated doublets; keep the real cells only.
    label = solo_model.adata.obs["_solo_doub_sim"].values.ravel()
    preds = output[label == "singlet"]
    # Name the columns using the label mapping stored by scvi-tools.
    cols = solo_model.adata.uns["_scvi"]["categorical_mappings"]["_scvi_labels"]["mapping"]
    return pd.DataFrame(preds, columns=cols)

This gives you the predictions for the real cells only, in the same order as the input anndata, with named columns. It assumes you ran solo.predict(soft=True).
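To make the filtering step concrete, here is a minimal sketch that runs the same logic on a stand-in object rather than a trained Solo model. The mock model, the toy probabilities, and the column order ("doublet", "singlet") are all assumptions for illustration; a real model stores these fields itself, and the order may differ.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd


def process_predict_output(output, solo_model):
    # Keep only the rows belonging to real cells (labelled "singlet" in this column).
    label = solo_model.adata.obs["_solo_doub_sim"].values.ravel()
    preds = output[label == "singlet"]
    cols = solo_model.adata.uns["_scvi"]["categorical_mappings"]["_scvi_labels"]["mapping"]
    return pd.DataFrame(preds, columns=cols)


# Hypothetical stand-in for a Solo model: 3 real cells followed by 2 simulated doublets.
obs = pd.DataFrame({"_solo_doub_sim": ["singlet"] * 3 + ["doublet"] * 2})
uns = {
    "_scvi": {
        "categorical_mappings": {
            # Assumed column order; check the mapping on your own model.
            "_scvi_labels": {"mapping": np.array(["doublet", "singlet"])}
        }
    }
}
mock_model = SimpleNamespace(adata=SimpleNamespace(obs=obs, uns=uns))

# Toy soft predictions: one row per real cell and per simulated doublet.
output = np.array(
    [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.9, 0.1], [0.95, 0.05]]
)

preds_df = process_predict_output(output, mock_model)
# Hard call per real cell: the column with the highest score.
calls = preds_df.idxmax(axis=1)
```

Taking the per-row maximum is only one way to turn soft scores into calls; a stricter threshold on the doublet column may suit some datasets better.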

We will update the code so this is more straightforward.

Thanks for your quick and helpful response Adam.
I’m just conscious that with multiple droplet lanes, this requires recomputing a model for each lane which would be very time consuming for big experiments.
I wonder if it would be possible to compute a model overall for the experiment and then run solo off that for subsets corresponding to individual lanes? Or do you think this would break some important assumptions of the tool?

Yes, it’s totally fine to fit one scVI model and then run Solo independently for each lane, seeded from that model. The --seed option in our CLI demonstrates that workflow: solo/ at master · calico/solo · GitHub
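The per-lane pattern described here could be organised roughly as follows. This is only a sketch of the loop structure: `run_solo_per_lane` and `fit_solo_for_lane` are hypothetical names, and the callable is a placeholder for whatever per-lane Solo call your scvi-tools version provides, seeded from the single shared scVI model.

```python
from collections import defaultdict


def run_solo_per_lane(lane_labels, fit_solo_for_lane):
    """Group cell indices by lane and run a per-lane doublet caller on each group."""
    cells_by_lane = defaultdict(list)
    for cell_idx, lane in enumerate(lane_labels):
        cells_by_lane[lane].append(cell_idx)

    results = {}
    for lane, cell_indices in cells_by_lane.items():
        # Hypothetical hook: in practice this would train and predict with Solo
        # on the cells of this lane only, reusing the shared scVI model.
        results[lane] = fit_solo_for_lane(lane, cell_indices)
    return results


def stub_fit(lane, cell_indices):
    # Stub used only to make the sketch runnable; it calls every cell a singlet.
    return {idx: "singlet" for idx in cell_indices}


lane_labels = ["lane1", "lane2", "lane1", "lane2", "lane1"]
per_lane = run_solo_per_lane(lane_labels, stub_fit)
```

The scVI model is fit once on all lanes, so only the much smaller Solo classifier is retrained per lane.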

Hmmm I think we might need to change the scvi-tools Solo API a bit to allow this. I can do this relatively soon.

Thanks for your thoughts on this @davek44 & @adamgayoso.
I think modifying the API to allow for this approach would be extremely useful.

PR is up.