the inner workings: An Interdisciplinary Journey: A Brief Look at Coding and Machine-Learning in Genetics

An Interdisciplinary Journey

A Brief Look at Coding and Machine-Learning in Genetics

Melba Nuzen, Scripps Ranch High School

In today’s world of technology, the road to discovery is paved with complex, multifaceted problems. To begin looking for solutions, we must find equally complex methods to tackle these challenges.

Our road was paved in the 1950s when people first discovered DNA as the blueprint for our human systems—a design that remained unchanged throughout our lives. However, as research progressed, we discovered that our characteristics rely on more than just the nucleotides of DNA; there are certain chemical compounds and proteins that can modify the expression of DNA, collectively referred to as the epigenome.

The epigenome can increase the production of specific proteins or turn certain genes on or off when necessary [1]. All of this occurs without altering the actual DNA code itself; epigenetic proteins instead interact with DNA. Recently, a number of institutions have begun exploring epigenetics in the field of cancer research.

The connection between cancer and epigenetics is fairly simple: the epigenome includes proteins called transcription factors that can inhibit gene expression by blocking DNA transcription. This can stop cells from multiplying by altering gene expression—if certain genes aren’t expressed, the cell cannot divide. Modulation of transcription factors is essential to the proliferation of cancer cells, formation of tumors, and tumor metastasis to other organs, which produces secondary tumors [2]. In a study done in mice, researchers sampled various epigenomes and found a group of enhancer genes called metastatic variant enhancer loci (Met-VELs) that are frequently located near bone cancer genes [3]. The activation of these enhancer genes was required for the formation of secondary tumors, while inhibiting transcription factors that coordinated with Met-VELs interrupted metastasis. Ultimately, this decreased the growth of cancerous tumors and prevented relapse in mice.

Of course, there are many more variables to test before such research can be extended to humans. But the fundamental takeaway from this example is clear: a new scientific discovery leads to a better understanding of genetics, which inspires solutions to challenges that have major impacts on humanity.

So how do these discoveries, understandings, and solutions come about? A variety of fields, such as artificial intelligence, mathematical statistics, and computer programming, are combined in careers like biostatistics and bioinformatics to address some of these challenges.

Let’s take a closer look at the previously mentioned study of bone cancer in mice. The activation of Met-VELs by transcription factors was just one of thousands of interactions found when epigenetic proteins interacted with enhancer genes. So how do we begin discovering what each protein does when it binds to its respective gene? And before we tackle that question, how do we even map out DNA strands and their epigenetic counterparts?

To sort through the roughly three billion base pairs in the human genome, which amount to gigabytes of data when stored as plain text, scientists turn to computers, or more specifically, programming languages. For example, take R, a powerful language designed for data analysis. Counting how many times each nucleotide appears in a string of DNA would look something like this:

library(stringr)

# The DNA sequence to analyze
seq1 <- "TCTTGGATCA"

# Count how many times each base appears in seq1
count1A <- str_count(seq1, "A")

count1C <- str_count(seq1, "C")

count1G <- str_count(seq1, "G")

count1T <- str_count(seq1, "T")

Six lines of code tell the computer to read through a string of characters, seq1, and count all of the As, Cs, Gs, and Ts. Using the stringr library and its function str_count, this code creates four variables, each holding the number of times its respective letter appears in seq1.
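The same tally can also be sketched without any extra library. The version below is only an illustration, not part of the original example: the helper name count_bases is made up here, and it relies solely on functions that ship with base R.

```r
# Hypothetical helper (an assumption for illustration, not from the article):
# count every base in a DNA string using only base R.
count_bases <- function(dna) {
  # Split the string into single characters
  bases <- strsplit(dna, "")[[1]]
  # Fixed factor levels keep a zero count for any base that is absent
  counts <- table(factor(bases, levels = c("A", "C", "G", "T")))
  # Return a plain named integer vector instead of a table
  setNames(as.integer(counts), names(counts))
}

seq1 <- "TCTTGGATCA"
base_counts <- count_bases(seq1)
print(base_counts)
```

Returning one named vector rather than four separate variables makes it easier to reuse the result, for example base_counts["A"] gives the number of As.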

To compare DNA before and after a mutation, the code would resemble this:

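library(stringr)

# Original sequence and its count of "A" bases
seq1 <- "TCTTGGATCA"

count1A <- str_count(seq1, "A")

# Mutated sequence (third base changed from T to A) and its count of "A" bases
seq2 <- "TCATGGATCA"

count2A <- str_count(seq2, "A")

# Print "true" only when both sequences contain the same number of "A" bases
if (count1A == count2A) {

print("true")

}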

This program compares two strands of DNA, seq1 and seq2. Using the process described above, the computer generates two variables that hold the number of “A” characters found in seq1 and seq2. Then, the code compares those two variables and prints “true” if both sequences contain the same number of “A” characters. Here seq1 contains two and seq2 contains three, so nothing is printed, which signals that the sequences differ. The same idea can be applied to all four bases to compare much longer DNA strands and determine whether they contain the same number of each base.
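Extending the comparison to all four bases might look like the sketch below. It is an illustration under stated assumptions rather than a standard method: the helper names count_bases and mismatch_positions are invented for this example, and the position check assumes both sequences have the same length. Equal base counts do not guarantee identical sequences, so the sketch also reports where the two strands differ.

```r
# Hypothetical helpers (assumptions for illustration, not from the article)

# Tally A, C, G, and T in a DNA string
count_bases <- function(dna) {
  bases <- strsplit(dna, "")[[1]]
  table(factor(bases, levels = c("A", "C", "G", "T")))
}

# Positions at which two equal-length sequences differ
mismatch_positions <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  which(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

seq1 <- "TCTTGGATCA"
seq2 <- "TCATGGATCA"

# TRUE only if all four base counts match
same_composition <- all(count_bases(seq1) == count_bases(seq2))

# Index of every base that changed between seq1 and seq2
changed_at <- mismatch_positions(seq1, seq2)

print(same_composition)
print(changed_at)
```

For these two sequences the counts differ and the only changed position is the third base.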

Of course, these are simple examples to illustrate how algorithms can be put to use. With a few lines of code, written in any of a number of programming languages, computers can analyze millions of strands of DNA. Now, the question to answer is how DNA interacts with proteins, and what overall effect that has on a biological system. For this complex problem, we venture off the beaten path to a more complex solution: artificial intelligence.

When AI is mentioned, images of self-driving cars and evil robots often come to mind. However, artificial intelligence can play a large role in the field of bioinformatics, particularly in genetics. Machine learning is one such application of artificial intelligence that specializes in the independent analysis of data by algorithms. This will be useful for looking at transcription factors and their roles in cell development [4].

Within the subfield of machine learning, there are two general methods for addressing problems: supervised and unsupervised learning. As the name suggests, supervised learning teaches the machine how to analyze data by feeding it annotated data points, training it to recognize an expected output. In the case of epigenetics, this means training and testing a machine learning model to recognize enhancer genes by inputting a series of known enhancer genes and non-enhancer genes; this way, the model can make an educated guess as to whether a new piece of data is an enhancer gene [5]. Likewise, if we give our model examples of DNA that contain transcription start sites (TSSs) as well as DNA that does not, the algorithm will theoretically be able to recognize a pattern and then find TSSs itself.

Figure 1: Supervised Learning in recognizing transcription start sites in DNA [6]

With regard to epigenetics, the model would sift through megabytes of DNA to pick out genes of interest, leaving more time for analyzing how that DNA interacts with transcription factors [6].
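To make the train-then-predict idea concrete, here is a deliberately tiny sketch in base R. Everything in it is an assumption made for illustration: the sequences and their labels are invented, the "site" examples were made A/T-rich only so the toy data are easy to separate, and the nearest-centroid rule is a stand-in for the far richer models used in real studies. It is not the method from the cited work, and real transcription start sites cannot be found from base composition alone.

```r
# Feature extraction: fraction of A, C, G, and T in a DNA string
base_fractions <- function(dna) {
  bases <- strsplit(dna, "")[[1]]
  counts <- table(factor(bases, levels = c("A", "C", "G", "T")))
  as.numeric(counts) / length(bases)
}

# Invented, labeled training examples (toy data, not real biology)
train_seqs <- c("TATAAATA", "ATATTAAA", "TTATAAAT",
                "GCGCCGGC", "CCGGGCGC", "GGCCGCCG")
train_labels <- c("site", "site", "site",
                  "no_site", "no_site", "no_site")

# Training: average the feature vectors of each class (one centroid per label)
features <- t(sapply(train_seqs, base_fractions))
centroids <- rowsum(features, train_labels) / as.vector(table(train_labels))

# Prediction: assign a new sequence the label of the closest centroid
predict_label <- function(dna) {
  x <- base_fractions(dna)
  # Squared Euclidean distance from x to each class centroid
  distances <- rowSums(sweep(centroids, 2, x)^2)
  names(which.min(distances))
}

print(predict_label("AATATTTA"))
print(predict_label("CGCGGGCC"))
```

The key point is the workflow rather than the classifier: labeled examples go in, a summary of each class is learned, and unlabeled sequences are then sorted by whichever class they resemble most.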

On the other hand, unsupervised learning comes into play when it’s preferable to avoid giving a model predetermined labels or groups. An application of this type of learning could be determining the functions and effects of specific transcription initiation complexes. Given enhancers and their respective proteins along with their impact on associated functions, a machine learning model can group proteins together based on similar effects. This occurs in one of two ways: generative or discriminative modeling. The former type of modeling groups data based on similar characteristics, whereas the latter draws a boundary between data points [6]. When dealing with unknown variables, such as the functions of proteins, discriminative modeling is used more often, since scientists have few predetermined groups to classify proteins into.

Figure 2: Unsupervised Learning in grouping data [6]

With these methods, the machine can then conclude that a certain group of enhancers and their epigenetic counterparts drives the spread of cancer cells, as seen with Met-VELs, which flags that group as a target whose inhibition might halt metastasis.
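Grouping without labels can be sketched in a similar spirit. The example below is an assumption-laden illustration: the protein names and effect scores are invented, and hierarchical clustering (base R's hclust) was chosen simply because it is built in and deterministic, not because the cited review prescribes it.

```r
# Invented measurements (toy data): each row is a hypothetical protein, each
# column a made-up score for how strongly it changes some cell behavior
effects <- matrix(
  c( 0.9,  0.8,
     1.0,  0.7,
     0.8,  0.9,
    -0.9, -0.8,
    -1.0, -0.7,
    -0.8, -1.0),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("protein_", 1:6),
                  c("growth_change", "expression_change"))
)

# No labels are supplied: proteins are grouped purely by how similar their
# rows are, using pairwise distances and average-linkage clustering
tree <- hclust(dist(effects), method = "average")

# Cut the tree into two groups and show which protein landed where
groups <- cutree(tree, k = 2)
print(groups)
```

With these toy numbers, the first three proteins fall into one group and the last three into another. Deciding what each group means biologically is still left to the researcher.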

Though our journey, filled with pit-stops at various science disciplines, took us on a winding and tangled road, the combination of coding, machine-learning, and genetics has led us to a fascinating discovery full of potential. But this explanation covers only the basics of such a revelation; in reality, studying epigenetics and cancer cells is just one application of the interdisciplinary study. At the moment, combining fields of interest is the road leading us toward the future. Our journey will continue as we synthesize a variety of concepts to take on complex, ever-diversifying problems and explore new solutions that will impact humanity for years to come.

References

[1] Epigenomics Fact Sheet. National Human Genome Research Institute website. https://www.genome.gov/27532724/epigenomics-fact-sheet/. Accessed February 8, 2018.

[2] Davis CP. Understanding Cancer: Metastasis, Stages of Cancer, and More. OnHealth. https://www.onhealth.com/content/1/cancer_types_treatments. Accessed February 8, 2018.

[3] Researchers Inhibit Cancer Metastases via Novel Steps - Blocking Action of Gene Enhancers Halts Spread of Tumor Cells. Case Western Reserve University School of Medicine website. http://casemed.case.edu/cwrumed360/news-releases/release.cfm?news_id=1026&news_category=8. Accessed February 12, 2018.

[4] Marr B. What Is The Difference Between Artificial Intelligence And Machine Learning? Forbes. https://www.forbes.com/sites/bernardmarr/2016/12/06/what-is-the-difference-between-artificial-intelligence-and-machine-learning/#4dc79f282742. Published September 15, 2017. Accessed February 18, 2018.

[5] Marr B. Supervised V Unsupervised Machine Learning -- What's The Difference? Forbes. https://www.forbes.com/sites/bernardmarr/2017/03/16/supervised-v-unsupervised-machine-learning-whats-the-difference/#5d786d6a485d. Published March 16, 2017. Accessed February 18, 2018.

[6] Libbrecht MW, Noble WS. Machine learning applications in genetics and genomics. Nature Reviews Genetics. 2015;16(6):321-332. doi:10.1038/nrg3920.

the inner workings

Sunday, May 6, 2018
