
An image analysis pipeline to quantify the spatial distribution of cell markers in stroma-rich tumors
Authors: Antoine A. Ruzette1, Nina Kozlova2,3, Kayla A. Cruz2,4, Taru Muranen2,3, and Simon F. Nørrelykke1*
Affiliations:
1 Department of Systems Biology, Harvard Medical School, Boston, MA, USA
2 Department of Genetics, Cancer Research Institute, Beth Israel Deaconess Medical Center, Boston, MA, USA
3 Harvard Medical School, Boston, MA, USA
4 Biological and Biomedical Sciences PhD Program, Harvard University, Boston, MA, USA
* Correspondence: simon@hms.harvard.edu
Abstract
Aggressive cancers, such as pancreatic ductal adenocarcinoma (PDAC), are often characterized by a complex and desmoplastic tumor microenvironment rich in stroma, a supportive connective tissue composed primarily of extracellular matrix (ECM) and non-cancerous cells. Desmoplasia, which is a dense deposition of stroma, is a major reason for therapy resistance, acting both as a physical barrier that interferes with drug penetration and as a supportive niche that protects cancer cells through diverse mechanisms. Precise understanding of spatial cell interactions in stroma-rich tumors is essential for optimizing therapeutic responses. It enables detailed mapping of stromal-tumor interfaces, comprehensive cell phenotyping, and insights into changes in tissue architecture, improving assessment of drug responses. Recent advances in multiplexed immunofluorescence imaging have enabled the acquisition of large batches of whole-slide tumor images, but scalable and reproducible methods to analyze the spatial distribution of cell states relative to stromal regions remain limited. To address this gap, we developed an open-source computational pipeline that integrates QuPath, StarDist, and custom Python scripts to quantify biomarker expression at a single- and sub-cellular resolution across entire tumor sections. Our workflow includes: (i) automated nuclei segmentation using StarDist, (ii) machine learning-based cell classification using multiplexed marker expression, (iii) modeling of stromal regions based on fibronectin staining, (iv) sensitivity analyses on classification thresholds to ensure robustness across heterogeneous datasets, and (v) distance-based quantification of the proximity of each cell to the stromal border. To improve consistency across slides with variable staining intensities, we introduce a statistical strategy that translates classification thresholds by propagating a chosen reference percentile across the distribution of marker-related cell measurement in each image. We apply this approach to quantify spatial patterns of distribution of the phosphorylated form of the N-Myc downregulated gene 1 (NDRG1), a novel DNA repair protein that conveys signals from the ECM to the nucleus to maintain replication fork homeostasis, and a known cell proliferation marker Ki67 in fibronectin-defined stromal regions in PDAC xenografts. The pipeline is applicable for the analysis of various stroma-rich tissues and is publicly available.