gimVI seq and spatial mixing

Thank you for this great set of tools! I’ve been following the gimVI tutorial to integrate my scRNA-seq data with spatial data, but the seq and spatial data don’t mix well when I plot the model’s latent representations on a UMAP.

Could you offer some tips on parameters to tweak during model training that might improve the seq/spatial mixing?

I would start by increasing the number of encoder and decoder layers. Could you post the script you’re using?
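As a rough sketch of what deeper encoders and decoders could look like: GIMVI forwards extra keyword arguments to its underlying module, so the layer counts can be collected in a dict and passed at construction. The keyword names below are assumed from the JVAE module in scvi-tools and may differ between versions, so check them against your installed release. The specific values are illustrative only, not recommendations.

```python
# Layer-count keyword names assumed from scvi-tools' JVAE module;
# verify them against the version you have installed.
deeper_kwargs = {
    "n_layers_encoder_individual": 2,  # per-modality encoder layers
    "n_layers_encoder_shared": 2,      # encoder layers shared by both modalities
    "n_layers_decoder_individual": 1,  # per-modality decoder layers
    "n_layers_decoder_shared": 1,      # decoder layers shared by both modalities
}

# Hypothetical usage with the objects from your own script:
# model = GIMVI(seq_data, spatial_data_partial, **deeper_kwargs)
```

Changing one of these at a time makes it easier to see which change, if any, affects the mixing.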

Thanks, here is my script:

import os
import copy
import numpy as np
import pandas as pd
import scanpy as sc
import anndata as ad
from scvi.data import setup_anndata
from scvi.model import GIMVI

home_path = 'drive/My Drive/_output'
in_path = os.path.join(home_path, '121120_spatial_data.h5ad')
spatial_data = sc.read_h5ad(in_path)
in_path = os.path.join(home_path, '121120_seq_data.h5ad')
seq_data = sc.read_h5ad(in_path)

seq_data = seq_data[:, spatial_data.var_names].copy()
seq_gene_names = seq_data.var_names

train_size = 0.8
n_genes = seq_data.n_vars
n_train_genes = int(n_genes * train_size)

rand_train_gene_idx = np.random.choice(range(n_genes), n_train_genes, replace=False)
rand_test_gene_idx = sorted(set(range(n_genes)) - set(rand_train_gene_idx))
rand_train_genes = seq_gene_names[rand_train_gene_idx]
rand_test_genes = seq_gene_names[rand_test_gene_idx]

spatial_data_partial = spatial_data[:, rand_train_genes].copy()

sc.pp.filter_cells(spatial_data_partial, min_counts=1)
sc.pp.filter_cells(seq_data, min_counts=1)

setup_anndata(spatial_data_partial, labels_key='ClusterName', batch_key='TMA_12')
setup_anndata(seq_data, labels_key='cl_CellType', batch_key='cType')

spatial_data = spatial_data[spatial_data_partial.obs_names, :]

model = GIMVI(seq_data, spatial_data_partial)
model.train()  # fit the model before extracting latent representations

latent_seq, latent_spatial = model.get_latent_representation()

n = 150000
seq_idxs = np.random.choice(latent_seq.shape[0], n, replace=False)
spatial_idxs = np.random.choice(latent_spatial.shape[0], n, replace=False)

latent_representation = np.concatenate([latent_seq[seq_idxs, :], latent_spatial[spatial_idxs, :]])
latent_adata = ad.AnnData(latent_representation)
latent_labels = (['seq'] * n) + (['spatial'] * n)
latent_adata.obs['labels'] = latent_labels
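sc.pp.neighbors(latent_adata)
sc.tl.umap(latent_adata)
# Plot cells in random order so neither modality is drawn on top of the other
sc.pl.umap(latent_adata[np.random.permutation(np.arange(latent_adata.obs.shape[0])), :], color='labels')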

You might try increasing the weight of the adversarial loss, which is the kappa parameter of the .train() method.
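To compare several kappa values with a number rather than by eye on a UMAP, one option is a simple neighbour-based mixing score. This is a minimal sketch, not part of scvi-tools: `mixing_score` and `sweep_kappa` are hypothetical helper names, and `train_and_embed` is a callable you would supply yourself. The score is the average fraction of each cell's k nearest latent neighbours that come from the other modality. For equal-sized samples it sits near 0.5 when the two are well mixed and near 0 when they are separated. The search is brute force, so it is only meant for a subsample of a few hundred cells per modality.

```python
from typing import Callable, Dict, Iterable, Sequence, Tuple

Points = Sequence[Sequence[float]]


def mixing_score(seq_points: Points, spatial_points: Points, k: int = 5) -> float:
    """Mean fraction of each point's k nearest neighbours from the other modality."""
    # Tag every point with its modality: 0 = seq, 1 = spatial.
    points = [(tuple(p), 0) for p in seq_points]
    points += [(tuple(p), 1) for p in spatial_points]
    if len(points) < 2:
        raise ValueError("need at least two points to define neighbours")

    total = 0.0
    for i, (p, label) in enumerate(points):
        # Squared Euclidean distance to every other point (brute force).
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), other_label)
            for j, (q, other_label) in enumerate(points)
            if j != i
        )
        neighbours = dists[:k]
        total += sum(1 for _, lab in neighbours if lab != label) / len(neighbours)
    return total / len(points)


def sweep_kappa(
    train_and_embed: Callable[[float], Tuple[Points, Points]],
    kappas: Iterable[float],
    k: int = 5,
) -> Dict[float, float]:
    """Train once per kappa value and record the mixing score of the latents."""
    scores = {}
    for kappa in kappas:
        latent_seq, latent_spatial = train_and_embed(kappa)
        scores[kappa] = mixing_score(latent_seq, latent_spatial, k=k)
    return scores


# Hypothetical train_and_embed for the script above (not run here):
# def train_and_embed(kappa):
#     model = GIMVI(seq_data, spatial_data_partial)
#     model.train(kappa=kappa)
#     latent_seq, latent_spatial = model.get_latent_representation()
#     # subsample equal numbers of rows from each before returning
#     return latent_seq[:500], latent_spatial[:500]
```

Calling `sweep_kappa(train_and_embed, [5, 10, 20])` would then return one score per kappa value. A higher score only says the modalities overlap more in latent space. It says nothing about whether cell types are still separated, so it is worth checking the cluster structure alongside it.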

Thank you! I will try higher values of kappa. I also noticed that the spatial_data clusters don’t separate well on the UMAP, even though the seq_data clusters do separate well. Are there any parameters that would influence one but not the other?