MIT AI Model Project | My Data Science Website


Completed

MIT Koch Institute

Machine Learning in Cancer Research – Building an AI Model

Building an AI model that predicts the age, diet, and treatment of mice used in colon cancer research from their recorded gene expression data. This page currently covers part of my data filtering and categorization process.

TIMELINE: 2026-present

FIELD: Data science & computational biology

ROLE: Student Researcher

STATUS: In-progress

OVERVIEW

The data manipulation and analysis below represent one step in the cleaning process for AI training. Arguably the most challenging part of fine-tuning a model is getting the data into a digestible format. In this example, RNA-sequencing results for intestinal epithelial cells are cleaned and categorized. The cells fall into seven general types (EEC, Enterocyte, Goblet, Paneth, Stem, TA, and Tuft) whose gene expression changes with age, diet, and treatment. By preparing datasets like this from a variety of sources, I can potentially create an AI model that predicts age, diet, and treatment from RNA-seq data.

Note: the analysis below is for demonstration purposes. In practice, multiple datasets are considered together and there is less emphasis on visual understanding.

WHAT I DID

My goal, as described above, is to process data to feed the model. For this dataset, however, I also added visuals and biological context to show the reliability and overall setup of the procedure, which involves:
1. Setup: importing the appropriate libraries and reading in the raw data files
2. Creating violin plots to show dataset health, that is, how much of the data is low-quality or erroneous
3. Eliminating biologically uninformative or erroneous data
4. Normalizing and clustering the data to prepare it for visualization
5. Plotting individual marker genes to see where they are expressed on the plot
6. Classifying regions of the cell data based on those gene clusters, while watching for expression of genes like Ptprc, an immune cell marker that signals contamination by cells other than the intestinal cells we are focused on


Figure 1

Figure 1 helps demonstrate the classification part of the process. The larger plot on the left shows the cells that passed filtering, with regions numbered according to their similarity in gene expression. The three plots on the right are examples of genes that tell us which types of cells appear in each numbered region. For example, Olfm4, an intestinal stem cell marker, matches up relatively well with region 1.

Figure 2

This is a more finalized version of Figure 1, in which the regions of the plot have been fully classified into their cell groups.

Figure 3

This figure contains the violin plots described in the process. It displays three features of the data that reflect the set's overall health: the number of unique genes detected per cell, the total number of RNA counts recorded per cell, and the percentage of mitochondrial content (high = poor cell quality).
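As a rough illustration of what those three quality metrics measure, here is a minimal base-R sketch on a tiny made-up count matrix. The matrix values, the cell names, and the variable names `n_feature`, `n_count`, and `percent_mt` are all invented for this example; the real analysis computes the same quantities with Seurat on the full dataset.

```r
# Toy count matrix: rows are genes, columns are cells (all values are made up).
counts <- matrix(
  c(8, 0, 2,
    0, 3, 1,
    1, 0, 0,
    1, 7, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("Epcam", "Lgr5", "mt-Nd1", "mt-Co1"),
                  c("cell_a", "cell_b", "cell_c"))
)

# Number of unique genes detected in each cell.
n_feature <- colSums(counts > 0)

# Total number of counts recorded in each cell.
n_count <- colSums(counts)

# Percentage of each cell's counts that come from mitochondrial genes
# (mouse mitochondrial gene symbols start with "mt-").
is_mito <- grepl("^mt-", rownames(counts))
percent_mt <- 100 * colSums(counts[is_mito, , drop = FALSE]) / n_count

qc <- data.frame(n_feature, n_count, percent_mt)
print(qc)

# Flag cells under an illustrative mitochondrial cutoff of 25%.
keep <- qc$percent_mt < 25
print(colnames(counts)[keep])
```

In this toy case, cell_b is dominated by mitochondrial counts and would be filtered out, which is the kind of cell the upper tail of the percent.mt violin represents.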

TAKEAWAYS

This dataset lends itself to high-quality interpretation because it consists mostly of healthy intestinal cells that separate cleanly into clusters. More generally, epithelial cells work well for AI-training purposes because they renew quickly, so recent lifestyle changes and factors like age, diet, and treatment show up in their expression within a short time span. In contrast, an organ like the heart might take weeks, months, or years to show cellular changes from the same factors.

Even so, there are a few important factors I need to consider before feeding the model this data.
1. Variations in laboratory procedure or practice can strongly affect results (the classification that the AI model outputs). Collecting not just a lot of data but a variety of it is critical to ensure that the model does not pick up on unwanted patterns. Such patterns are known as batch effects, where factors like the researcher, time of day, or specific procedure introduce a systematic bias into the results. To minimize this problem, I must extend my data pool beyond the Yilmaz lab, MIT, and maybe even Massachusetts. That way, the AI model can learn the difference between old and young rather than between procedure A and procedure B.

2. Analyzing every dataset (or mouse, in this case) independently, as in the example above, is not an optimal strategy in the long run. Doing so can further contribute to batch effects, where procedural differences between mice are reflected instead of biological ones. I am currently following a pipeline that analyzes many mice together, not just one at a time. That means the code I am currently working with is not identical to the code shown on this page, though the two are very similar in concept.
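To make the batch-effect concern concrete, here is a small base-R sketch that checks whether a label (such as age) is completely tied to a batch variable (such as the lab that produced the data). The sample sheet, the column names, and the helper `is_confounded` are hypothetical and exist only for this illustration; this is one simple sanity check, not the full pipeline.

```r
# Hypothetical sample sheet: one row per mouse (all values invented).
samples <- data.frame(
  mouse = paste0("M", 1:6),
  age   = c("young", "young", "young", "old", "old", "old"),
  lab   = c("lab_A", "lab_A", "lab_B", "lab_B", "lab_B", "lab_A")
)

# TRUE when every batch contains only one label value, meaning a model
# could separate the labels purely by recognizing the batch.
is_confounded <- function(label, batch) {
  design <- table(label, batch)
  all(colSums(design > 0) == 1)
}

# Both labs contributed young and old mice, so age is not tied to lab.
print(table(age = samples$age, lab = samples$lab))
print(is_confounded(samples$age, samples$lab))

# Worst case: all young mice come from one lab and all old mice from another.
print(is_confounded(c("young", "young", "old", "old"),
                    c("lab_A", "lab_A", "lab_B", "lab_B")))
```

When the check returns TRUE, adding data from other sources for the under-represented label is the kind of fix point 1 describes.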

I will continue to update this page throughout the year as I make progress on the model.

COMPLETE CODE

# Single-cell analysis and plotting libraries
library(Seurat)
library(patchwork)
library(ggplot2)

# Marker genes used to identify each intestinal epithelial cell type
markers <- list(
  Epithelium = c("Epcam"),
  EEC = c("Chga"),
  Goblet = c("Muc2", "Tff3", "Agr2"),
  Paneth = c("Lyz1", "Mmp7"),
  Enterocyte = c("Alpi", "Apoa1"),
  Tuft = c("Dclk1", "Trpm5"),
  Stem = c("Lgr5", "Ascl2", "Olfm4")
)

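# Read the filtered 10x count matrix and build a Seurat object,
# keeping genes seen in at least 3 cells and cells with at least 200 genes
data <- Read10X_h5("/Users/seciluluderya/Library/CloudStorage/Dropbox/Secil's project/data/sequencing/scRNA/GSE210669/raw/GSM6435266_M1B.filtered_feature_bc_matrix.h5")
data <- CreateSeuratObject(data, min.features = 200, min.cells = 3)

# Percentage of counts from mitochondrial genes (mouse symbols start with "mt-")
data[["percent.mt"]] <- PercentageFeatureSet(data, pattern = "^mt-")
VlnPlot(data, c("nFeature_RNA", "nCount_RNA", "percent.mt"))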

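# Drop low-quality cells and outliers based on the QC metrics
data <- subset(data, subset = nFeature_RNA > 700 & nFeature_RNA < 6000 & nCount_RNA < 50000 & percent.mt < 25)

# Normalize, select variable genes, scale, and compute 50 principal components
data <- NormalizeData(data)
data <- FindVariableFeatures(data)
data <- ScaleData(data)
data <- RunPCA(data, npcs = 50)

# Cluster cells on the PCA embedding and project to 2D with UMAP
data <- FindNeighbors(data, dims = 1:50)
data <- FindClusters(data, resolution = 0.5)
data <- RunUMAP(data, dims = 1:50)
DimPlot(data, label = TRUE) + FeaturePlot(data, markers$Stem)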

# Inspect marker genes on the UMAP (Ptprc flags immune cells)
FeaturePlot(data, "Epcam")
FeaturePlot(data, "Ptprc")
FeaturePlot(data, "Lgr5")
FeaturePlot(data, "Fgfbp1")
FeaturePlot(data, "Lyz1")
FeaturePlot(data, markers$EEC)
FeaturePlot(data, markers$Paneth)
FeaturePlot(data, markers$Goblet)
FeaturePlot(data, markers$Tuft)
FeaturePlot(data, markers$Enterocyte)

# Map each cluster number to a cell type, based on the marker plots above
celltype_annotations <- c(
  "0" = "TA",
  "1" = "Stem",
  "2" = "Enterocyte",
  "3" = "TA",
  "4" = "Enterocyte",
  "5" = "Goblet",
  "6" = "Enterocyte",
  "7" = "Enterocyte",
  "8" = "EEC",
  "9" = "Tuft",
  "10" = "Goblet",
  "11" = "Paneth"
)

# Look up each cell's cluster in the map and store the label as metadata
data@meta.data$celltype <- celltype_annotations[as.character(data$seurat_clusters)]
Idents(data) <- "celltype"

# Annotated UMAP beside a grid of marker plots with legends and axes removed
DimPlot(data, reduction = "umap", pt.size = 0.5, label = TRUE, group.by = "celltype") |
  (
    FeaturePlot(
      data,
      features = c("Epcam", "Ptprc", "Lgr5", "Fgfbp1", "Lyz1", "Muc2", "Dclk1", "Chga", "Apoa1"),
      max.cutoff = "q95"
    ) &
      theme(
        legend.position = "none",
        axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.line = element_blank(),
        panel.border = element_blank()
      )
  )
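
One detail of the annotation step is worth a small sketch: looking up a cluster that has no entry in a named vector returns NA without any warning. The base-R snippet below uses toy stand-ins (`clusters` and `annotations` are made up here, not the real objects) to show a quick check that could be run before assigning labels, for example after re-clustering at a different resolution changes the number of clusters.

```r
# Toy stand-ins: a cluster factor and a label map that misses cluster "3".
clusters <- factor(c("0", "1", "2", "1", "3"), levels = c("0", "1", "2", "3"))
annotations <- c("0" = "TA", "1" = "Stem", "2" = "Enterocyte")

# Clusters that have no entry in the map.
unmapped <- setdiff(levels(clusters), names(annotations))
print(unmapped)

# The lookup itself does not fail; unmapped clusters simply become NA.
celltype <- unname(annotations[as.character(clusters)])
print(celltype)

# A guard that stops early when the map is incomplete (commented out here
# because the toy map is incomplete on purpose):
# stopifnot(length(unmapped) == 0)
```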

Secil Uluderya · © 2026


datanet.blog
