An Interdisciplinary Journey
A Brief Look at Coding and Machine-Learning in Genetics
Melba Nuzen, Scripps Ranch High School
In today’s world of technology, the road to discovery is paved
with complex, multifaceted problems. To begin looking for solutions, we must
find equally complex methods to tackle these challenges.
Our road was paved in the 1950s when people first discovered DNA
as the blueprint for our human systems—a design that remained unchanged
throughout our lives. However, as research progressed, we discovered that our
characteristics rely on more than just the nucleotides of DNA; there are
certain chemical compounds and proteins that can modify the expression of DNA,
collectively referred to as the epigenome.
The epigenome can increase the production of specific proteins or
turn certain genes on or off when necessary [1]. All of this occurs without
altering the actual DNA code itself; epigenetic proteins instead interact with
DNA. Recently, several institutions have begun exploring epigenetics in
the field of cancer research.
The connection between cancer and epigenetics is fairly simple:
the epigenome includes proteins called transcription factors that can inhibit
gene expression by blocking DNA transcription. This can stop cells from
multiplying by altering gene expression—if certain genes aren’t expressed, the
cell cannot divide. Modulation of transcription factors is essential to the
proliferation of cancer cells, formation of tumors, and tumor metastasis to
other organs, which produces secondary tumors [2]. In a study done in mice,
researchers sampled various epigenomes and found a group of gene enhancers
called metastatic variant enhancer loci (Met-VELs) that are frequently located
near bone cancer genes [3]. The activation of these enhancers was required
for the formation of secondary tumors, while inhibiting transcription factors
that coordinated with Met-VELs interrupted metastasis. Ultimately, this
decreased the growth of cancerous tumors and prevented relapse in mice.
Of course, there are many more variables to test before such
research can be extended to humans. But the fundamental takeaway from this example
is clear: a new scientific discovery leads to a better understanding of
genetics, which inspires solutions to challenges that have major impacts on
humanity.
So how do these discoveries, understandings, and solutions come
about? A variety of fields, such as artificial intelligence, mathematical
statistics, and computer programming, are combined in careers like biostatistics
and bioinformatics to address some of these challenges.
Let’s take a closer look at the previously mentioned study of bone
cancer in mice. The activation of Met-VELs by transcription factors was just
one of thousands of interactions found between epigenetic proteins and
enhancers. So how do we begin discovering what each protein does when it
binds to its respective gene? And before we tackle that question, how do we
even map out DNA strands and their epigenetic counterparts?
To sort through the roughly three billion base pairs in the human
genome, which amounts to gigabytes of data when each base is stored as one
character, scientists turn to computers, or more specifically, programming
languages. For example, take R, a powerful language designed for data analysis.
Counting the nucleotides in a string of DNA would look something like this:
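library(stringr)                  # provides str_count()
seq1 <- "TCTTGGATCA"              # the DNA string to analyze
count1A <- str_count(seq1, "A")   # number of As in seq1
count1C <- str_count(seq1, "C")   # number of Cs in seq1
count1G <- str_count(seq1, "G")   # number of Gs in seq1
count1T <- str_count(seq1, "T")   # number of Ts in seq1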
Six lines of code tell the computer to read through a string of
characters, seq1, and count all of the As, Cs, Gs, and Ts. Using the stringr
package and its function str_count, this code creates four variables that hold
the number of times their respective letter appears in seq1.
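The four separate counts can also be collected in a single pass. The following is a minimal sketch rather than a definitive implementation: it assumes base R only (so it runs without installing stringr), and the helper name count_bases is hypothetical, introduced just for this illustration.

```r
# Hypothetical helper (not from the original article): count every base
# in one pass, using only base R.
count_bases <- function(seq) {
  bases <- strsplit(seq, "")[[1]]   # split the string into single characters
  # Fixing the factor levels guarantees all four bases appear, even with
  # a count of zero; characters outside A/C/G/T are silently ignored.
  counts <- table(factor(bases, levels = c("A", "C", "G", "T")))
  setNames(as.integer(counts), names(counts))   # plain named integer vector
}

seq1 <- "TCTTGGATCA"
counts1 <- count_bases(seq1)
print(counts1)
```

Returning one named vector instead of four variables makes it easier to compare whole sequences later, since the counts for two strands can be subtracted directly.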
To compare DNA before and after a mutation, the code would
resemble this:
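library(stringr)                  # provides str_count()
seq1 <- "TCTTGGATCA"              # strand before the mutation
count1A <- str_count(seq1, "A")   # number of As in seq1
seq2 <- "TCATGGATCA"              # strand after the mutation
count2A <- str_count(seq2, "A")   # number of As in seq2
if (count1A == count2A) {         # compare the two counts
  print("true")
}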
This program compares two strands of DNA, seq1 and seq2. Using the
process described above, the computer generates two variables that
hold the number of “A” characters found in seq1 and seq2. Then, the code
compares those two variables, printing “true” if both sequences contain an
equal number of “A” characters. Here seq2 has one more “A” than seq1, so
nothing is printed. This idea can be extended to all four bases to compare
much longer DNA strands and determine whether or not the strands contain the
same number of each base.
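Extending the comparison to all four bases might look like the sketch below. It is an illustration under stated assumptions, not the article's own method: it uses base R only, the helper count_bases is hypothetical, and the position-by-position check assumes both strands have the same length (a substitution, not an insertion or deletion).

```r
# Hypothetical helper (not from the original article): per-base counts
# as a named integer vector, using base R only.
count_bases <- function(seq) {
  bases <- strsplit(seq, "")[[1]]
  counts <- table(factor(bases, levels = c("A", "C", "G", "T")))
  setNames(as.integer(counts), names(counts))
}

seq1 <- "TCTTGGATCA"   # strand before the mutation
seq2 <- "TCATGGATCA"   # strand after the mutation

# Change in count for each base: positive = gained, negative = lost.
count_diff <- count_bases(seq2) - count_bases(seq1)
print(count_diff)

# Equal counts do not guarantee identical strands, so also compare
# position by position (assumes equal lengths, checked here).
stopifnot(nchar(seq1) == nchar(seq2))
chars1 <- strsplit(seq1, "")[[1]]
chars2 <- strsplit(seq2, "")[[1]]
mismatch_pos <- which(chars1 != chars2)   # indices where the strands differ
print(mismatch_pos)
```

Counting alone would miss a mutation that swaps two bases within a strand, which is why the sketch adds the positional check.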
Of course, these are simple examples to illustrate how code
can be put to use. With a few lines of code, written in any of several
programming languages, computers can analyze millions of strands of DNA. Now, the question
to answer is how DNA interacts with proteins, and what overall effect that has
on a biological system. For this complex problem, we venture off the beaten
path to a more complex solution: artificial intelligence.
When AI is mentioned, images of self-driving cars and evil robots
often come to mind. However, artificial intelligence can play a large role in
the field of bioinformatics, particularly in genetics. Machine learning is one
branch of artificial intelligence, in which algorithms learn to analyze data
on their own rather than following hand-written rules. This is useful for
looking at transcription factors and their roles in cell development [4].
Figure 1: Supervised Learning in recognizing transcription start sites in DNA [6]
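In supervised learning, a model is first trained on examples that already
carry labels, such as DNA segments marked as containing or lacking a
transcription start site, and then predicts labels for new data [6]. With
regard to epigenetics, such a model would sift through megabytes
of DNA to pick out notable genes of interest, leaving more time for
analyzing how the DNA interacts with transcription factors [6].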
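To make the idea of learning from labeled examples concrete, here is a small sketch in base R. Everything in it is an assumption made for illustration: the sequences and their labels are invented, the features are simply the fraction of each base, and a nearest-centroid classifier stands in as one of the simplest supervised learners. It is not the method described in [6], and real transcription start site recognition relies on far richer features and models.

```r
# Toy training data: sequences and labels are invented for illustration only.
train_seqs <- c("TATAAATA", "TTATATAA", "ATATATTA",
                "GCGCCGGC", "CCGGCGCG", "GGCCGCGC")
train_labels <- c("site", "site", "site",
                  "background", "background", "background")

# Feature extraction: fraction of A, C, G, and T in a sequence.
base_fractions <- function(seq) {
  bases <- factor(strsplit(seq, "")[[1]], levels = c("A", "C", "G", "T"))
  as.numeric(table(bases)) / nchar(seq)
}

# One row of features per training sequence.
features <- t(sapply(train_seqs, base_fractions))

# "Training": average the feature vectors of each labeled class (its centroid).
centroids <- t(sapply(split(as.data.frame(features), train_labels), colMeans))

# Prediction: assign a new sequence the label of the closest centroid.
predict_label <- function(seq) {
  x <- base_fractions(seq)
  dists <- apply(centroids, 1, function(centroid) sqrt(sum((x - centroid)^2)))
  names(which.min(dists))
}

print(predict_label("TTATAAAT"))   # an unseen A/T-rich sequence
print(predict_label("GCGGCCGC"))   # an unseen G/C-rich sequence
```

The labels do the real work here: the model only knows what a "site" looks like because the training examples say so, which is the defining trait of supervised learning.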
On the other hand, unsupervised learning comes into play when it’s
preferable to avoid giving a model pre-determined labels or groups. An
application of this type of learning could be determining the functions and
effects of specific transcription initiation complexes. Given enhancers and
their respective proteins along with their impact on associated functions, a
machine learning model can group proteins together based on similar effects.
This occurs in one of two ways: generative or discriminative modeling. The
former type of modeling groups data based on similar characteristics, whereas
the latter draws a boundary between data points [6]. When dealing with unknown
variables, such as the functions of proteins, generative modeling is the more
natural fit, since scientists have few predetermined groups to classify proteins
into and a boundary can only be drawn between groups that are already defined.
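Grouping without labels can be sketched in a few lines of base R. The example below is hedged accordingly: the protein names and effect scores are invented, and hierarchical clustering (hclust) is used as one simple, deterministic unsupervised technique. It illustrates grouping by similarity in general and is not a specific method taken from [6].

```r
# Invented effect scores: rows are hypothetical proteins, columns are
# hypothetical measurements (change in expression of two target genes).
effects <- rbind(
  TF_a = c( 0.9,  0.8),
  TF_b = c( 1.0,  0.7),
  TF_c = c( 0.8,  0.9),
  TF_d = c(-0.7, -0.9),
  TF_e = c(-0.9, -0.8),
  TF_f = c(-0.8, -1.0)
)
colnames(effects) <- c("gene1_change", "gene2_change")

# No labels are supplied: hierarchical clustering merges the rows that
# lie closest together, based only on the distances between them.
tree <- hclust(dist(effects), method = "average")

# Cut the tree into two groups and show which protein landed where.
groups <- cutree(tree, k = 2)
print(groups)
```

The proteins that raise expression end up in one group and those that lower it in the other, even though the code was never told that such categories exist.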
Figure 2: Unsupervised Learning in grouping data [6]
With these methods, a model can flag a group of enhancers and their
epigenetic counterparts whose inhibition halts the spread of
cancer cells, as seen with Met-VELs.
Though our journey, filled with pit-stops at various science
disciplines, took us on a winding and tangled road, the combination of coding,
machine-learning, and genetics has led us to a fascinating discovery full of
potential. But this explanation covers only the basics of such a revelation; in
reality, studying epigenetics and cancer cells is just one application of the
interdisciplinary study. At the moment, combining fields of interest is the
road leading us toward the future. Our journey will continue as we synthesize a
variety of concepts to take on complex, ever-diversifying problems and explore
new solutions that will impact humanity for years to come.
References
[1] Epigenomics Fact Sheet. National Human Genome Research Institute website. https://www.genome.gov/27532724/epigenomics-fact-sheet/. Accessed February 8, 2018.
[2] Davis CP. Understanding Cancer: Metastasis, Stages of Cancer, and More. OnHealth. https://www.onhealth.com/content/1/cancer_types_treatments. Accessed February 8, 2018.
[3] Researchers Inhibit Cancer Metastases via Novel Steps - Blocking Action of Gene Enhancers Halts Spread of Tumor Cells. Case Western Reserve University School of Medicine website. http://casemed.case.edu/cwrumed360/news-releases/release.cfm?news_id=1026&news_category=8. Accessed February 12, 2018.
[4] Marr B. What Is The Difference Between Artificial Intelligence And Machine Learning? Forbes. https://www.forbes.com/sites/bernardmarr/2016/12/06/what-is-the-difference-between-artificial-intelligence-and-machine-learning/#4dc79f282742. Published September 15, 2017. Accessed February 18, 2018.
[5] Marr B. Supervised V Unsupervised Machine Learning -- What's The Difference? Forbes. https://www.forbes.com/sites/bernardmarr/2017/03/16/supervised-v-unsupervised-machine-learning-whats-the-difference/#5d786d6a485d. Published March 16, 2017. Accessed February 18, 2018.
[6] Libbrecht MW, Noble WS. Machine learning applications in genetics and genomics. Nature Reviews Genetics. 2015;16(6):321-332. doi:10.1038/nrg3920.