MIT Koch Institute
TIMELINE: 2026-present
FIELD: Data science & Computational biology
ROLE: Student Researcher
STATUS: In-progress
OVERVIEW
WHAT I DID

My goal, as described above, is to process data to feed the model. For this dataset, however, I also added visuals and biological context to show the reliability and overall setup of the procedure, which involves:
1. Setup: importing the appropriate libraries and reading in the raw data files
2. Creating violin plots to show dataset health, meaning how much low-quality or erroneous data is present
3. Filtering out low-quality cells (too few or too many detected genes, very high total counts, or a high mitochondrial percentage)
4. Normalizing and scaling the data, then reducing it with PCA and UMAP so it can be visualized
5. Plotting individual marker genes to see where they are expressed in the UMAP plot
6. Classifying the clusters by the marker genes they express, while looking out for genes like Ptprc, which signals immune cell contamination (not the intestinal cells we are focused on)

TAKEAWAYS

This dataset is well suited for interpretation because it contains mostly healthy intestinal cells that separate cleanly into clusters. Overall, epithelial cells work well for AI-training purposes because they renew quickly, meaning that recent lifestyle changes or factors like age, diet, and treatment show up within a short time span. In contrast, an organ like the heart might take weeks, months, or years to show cellular changes from the same factors.

Even so, there are a few important factors that I need to consider before attempting to feed the model this data.
     1. Variations in biological procedures or practice can strongly affect results (that is, the classification the AI model outputs). Collecting a large amount of data is not enough; the data must also be varied, so that the model does not pick up on unwanted patterns. Such unwanted patterns are called batch effects, where factors like the researcher, time of day, or specific procedure introduce a systematic bias into the results. To minimize this problem, I must extend my data pool beyond the Yilmaz lab, MIT, and maybe even Massachusetts. That way, the AI model can learn the difference between old and young, not between procedure A and procedure B.

     2. Analyzing every dataset (or mouse, in this case) independently, like the example on this page, is not an optimal strategy in the long run. Doing so can add to batch effects, because procedural differences between mice get reflected instead of biological ones. I am currently following a pipeline that analyzes many mice together, not just one at a time. That means the code I am currently working with is not identical to the code shown on this page, though the two are very similar in concept.
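
As a rough illustration of point 2, here is a minimal sketch in base R of pooling several mice and filtering them with one shared set of thresholds. The mouse names and QC numbers are made up for illustration; only the thresholds are taken from the single-mouse code on this page. It is not the actual multi-mouse pipeline.

```r
# Hypothetical per-cell QC metrics for two mice (made-up numbers, not real data).
qc_by_mouse <- list(
  mouse_A = data.frame(
    nFeature_RNA = c(500, 1200, 3500, 6500),
    nCount_RNA   = c(900, 4000, 21000, 60000),
    percent.mt   = c(30, 5, 12, 8)
  ),
  mouse_B = data.frame(
    nFeature_RNA = c(800, 2500, 4100),
    nCount_RNA   = c(2000, 9000, 30000),
    percent.mt   = c(10, 26, 3)
  )
)

# Stack the tables and record which mouse each cell came from.
pooled <- do.call(rbind, lapply(names(qc_by_mouse), function(id) {
  df <- qc_by_mouse[[id]]
  df$mouse <- id
  df
}))

# Apply one shared set of thresholds to every mouse
# (same cutoffs as the single-mouse subset() call).
keep <- with(
  pooled,
  nFeature_RNA > 700 & nFeature_RNA < 6000 & nCount_RNA < 50000 & percent.mt < 25
)
filtered <- pooled[keep, ]

# Cells kept per mouse; a very uneven loss between mice would deserve a closer look.
kept_per_mouse <- table(filtered$mouse)
print(kept_per_mouse)
```

Keeping the mouse label on every cell is what later makes it possible to ask whether a pattern follows biology or simply follows the sample it came from.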

I will continue to update this page throughout the year as I make progress on the model.

 

library(Seurat)
library(patchwork)
library(ggplot2)

markers <- list(
  Epithelium = c("Epcam"),
  EEC = c("Chga"),
  Goblet = c("Muc2", "Tff3", "Agr2"),
  Paneth = c("Lyz1","Mmp7"),
  Enterocyte = c("Alpi", "Apoa1"),
  Tuft = c("Dclk1","Trpm5"),
  Stem = c("Lgr5","Ascl2","Olfm4")
)

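# Read the filtered 10x count matrix for one mouse (GEO series GSE210669, sample GSM6435266)
data <- Read10X_h5("/Users/seciluluderya/Library/CloudStorage/Dropbox/Secil's project/data/sequencing/scRNA/GSE210669/raw/GSM6435266_M1B.filtered_feature_bc_matrix.h5")
# Keep genes seen in at least 3 cells and cells with at least 200 detected genes
data <- CreateSeuratObject(data, min.features = 200, min.cells = 3)
# Percentage of counts coming from mitochondrial genes (mouse symbols start with "mt-")
data[["percent.mt"]] <- PercentageFeatureSet(data, pattern = "^mt-")
# QC violin plots: genes per cell, counts per cell, mitochondrial percentage
VlnPlot(data, c("nFeature_RNA", "nCount_RNA", "percent.mt"))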

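# Drop low-quality cells: too few or too many genes, very high counts, or high mitochondrial percentage
data <- subset(data, subset = nFeature_RNA > 700 & nFeature_RNA < 6000 & nCount_RNA < 50000 & percent.mt < 25)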

# Normalize, pick highly variable genes, scale, and compute 50 principal components
data <- NormalizeData(data)
data <- FindVariableFeatures(data)
data <- ScaleData(data)
data <- RunPCA(data, npcs = 50)

# Build the neighbor graph, cluster, and embed in 2D with UMAP, all on the same 50 PCs
data <- FindNeighbors(data, dims = 1:50)
data <- FindClusters(data, resolution = 0.5)
data <- RunUMAP(data, dims = 1:50)
DimPlot(data, label = TRUE) + FeaturePlot(data, markers$Stem)

# Marker checks on the UMAP. Epcam marks epithelium; Ptprc marks immune cells.
FeaturePlot(data, "Epcam")
FeaturePlot(data, "Ptprc")
FeaturePlot(data, "Lgr5")
FeaturePlot(data, "Fgfbp1")
FeaturePlot(data, "Lyz1")
FeaturePlot(data, markers$EEC)
FeaturePlot(data, markers$Paneth)
FeaturePlot(data, markers$Goblet)
FeaturePlot(data, markers$Tuft)
FeaturePlot(data, markers$Enterocyte)

# Cluster-to-cell-type labels, chosen from the marker plots above (TA = transit-amplifying)
celltype_annotations <- c(
  "0" = "TA",
  "1" = "Stem",
  "2" = "Enterocyte",
  "3" = "TA",
  "4" = "Enterocyte",
  "5" = "Goblet",
  "6" = "Enterocyte",
  "7" = "Enterocyte",
  "8" = "EEC",
  "9" = "Tuft",
  "10" = "Goblet",
  "11" = "Paneth"
)

# Map each cell's cluster number to its cell-type label, then make that the active identity
data@meta.data$celltype <- celltype_annotations[as.character(data$seurat_clusters)]
Idents(data) <- "celltype"
                    
# Annotated UMAP on the left, marker-gene panels with stripped axes and legends on the right
DimPlot(data, reduction = "umap", pt.size = 0.5, label = TRUE, group.by = "celltype") |
  (
    FeaturePlot(
      data,
      features = c("Epcam", "Ptprc", "Lgr5", "Fgfbp1", "Lyz1", "Muc2", "Dclk1", "Chga", "Apoa1"),
      max.cutoff = "q95"
    ) &
      theme(
        legend.position = "none",
        axis.title  = element_blank(),
        axis.text   = element_blank(),
        axis.ticks  = element_blank(),
        axis.line   = element_blank(),
        panel.border = element_blank()
      )
  )
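
One thing to watch in the annotation step: cluster numbers come from FindClusters and can change if the data or the resolution changes, so a cluster with no entry in the lookup table would silently get an NA label. Below is a minimal base R sketch of a check for that. The cluster IDs and the shortened lookup table are hypothetical stand-ins, not values from this dataset.

```r
# Hypothetical cluster IDs, standing in for as.character(data$seurat_clusters).
cluster_ids <- c("0", "1", "5", "11", "12")

# Shortened, illustrative lookup table in the same shape as celltype_annotations.
annotations <- c("0" = "TA", "1" = "Stem", "5" = "Goblet", "11" = "Paneth")

# Clusters that appear in the data but have no label in the lookup table.
unmapped <- setdiff(unique(cluster_ids), names(annotations))

# Indexing a named vector by a missing name returns NA,
# so cells in an unmapped cluster would end up unlabeled.
labels <- unname(annotations[cluster_ids])

if (length(unmapped) > 0) {
  message("Clusters without an annotation: ", paste(unmapped, collapse = ", "))
}
```

Running a check like this right after clustering, and before assigning labels, makes a mismatch visible instead of leaving unlabeled cells in the final plot.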


© 2035 by Train of Thoughts.
