Doctoral candidate devises genetic prediction algorithm

OSC's Glenn Cluster. Courtesy: Ohio Supercomputer Center

Shepard leveraged OSC’s Glenn Cluster supercomputer (above) to devise an algorithm to predict introns and exons in mid-length sequences of nucleotides found in DNA strands (below).

Double Helix. Courtesy: National Human Genome Research Institute

Samuel S. Shepard with Advisory Committee. Courtesy: University of Toledo

Shepard (above, center) pleased his advisory committee with his dissertation on genomic prediction techniques.

Study looks at mid-range inhomogeneous nucleotide sequences

Columbus, Ohio (June 29, 2010) – A University of Toledo doctoral candidate in biomedical sciences recently combined the inspiration he received from his grandfather, values learned from his mother, insights gleaned from his mentors and processing power tapped from a supercomputer to unlock a few of the many secrets of the human genome.

Samuel S. Shepard, a native of northwest Ohio, recently presented his doctoral dissertation, “The Characterization and Utilization of Middle-range Sequence Patterns within the Human Genome.” Leveraging the high-performance resources of the Ohio Supercomputer Center, Shepard was able to run in a matter of days complex, optimized algorithms that he estimated would have taken more than three and a half years on a typical desktop computer.

Shepard’s research introduces a novel algorithm for the prediction of certain genomic sequences, known as exons and introns, using mid-range sequence patterns of 20 to 50 nucleotides in length. These genomic patterns are said to display a non-random clustering of bases referred to as “mid-range inhomogeneity,” or MRI.

“We based our approach on Markov chain models, which are the basis for many gene prediction programs,” Shepard explained. “During the project, our algorithm read 12 million nucleotides of exons and introns each, and three million each were used to test the predictions.”

Markov models are built using the analysis of short DNA “words.” However, recent research showed multiple types of non-random associations of nucleotides, within genomic regions 30 to 1,000 nucleotides long, that form specific sequence patterns. Shepard and his team hypothesized that the MRI patterns were different for exons and introns and would serve as a reliable predictor.
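To illustrate how a conventional Markov chain model separates two classes of sequence, here is a minimal Python sketch. It is not Shepard's program: the function names, the first-order model, the add-one smoothing and the tiny training strings are all assumptions made up for illustration, and real exons and introns are not distinguished this simply.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(sequences, order=1, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def prob(context, base):
        # Add-one (pseudocount) smoothing so unseen transitions keep nonzero probability.
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return (row.get(base, 0.0) + pseudocount) / total

    return prob

def log_likelihood(seq, prob, order=1):
    """Sum of log transition probabilities along the sequence."""
    return sum(math.log(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))

def classify(seq, exon_prob, intron_prob, order=1):
    """Label a sequence by whichever model gives it the higher likelihood."""
    score = (log_likelihood(seq, exon_prob, order)
             - log_likelihood(seq, intron_prob, order))
    return "exon" if score > 0 else "intron"

# Invented toy data: NOT real exon or intron sequences.
exon_model = train_markov(["GCGCGGCGCC", "GGCGCGCCGG"])
intron_model = train_markov(["ATTTATAAAT", "TTATATTTAA"])

print(classify("GCGGCGCG", exon_model, intron_model))
print(classify("ATATTTAT", exon_model, intron_model))
```

Because each transition probability depends only on the preceding base or bases, a model of this kind sees only short "words," which is the limitation the mid-range approach aims to address.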

To circumvent the limitations of traditional Markov models, Shepard developed a technique known as binary-abstracted Markov modeling (BAMM). The procedure involves creating rules that reduce mountains of nucleotide information into a much smaller binary code, based upon word length and the nucleotide bases found within those words. For instance, if looking for a sequence rich in guanine, Shepard might break the sequence into three-letter words and assign the binary code of “1” to each word containing 2 or more guanines, and “0” to each word that doesn’t.
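The guanine example above can be sketched in a few lines of Python. The function name and parameters are hypothetical, and two details are assumptions the article does not spell out: words are taken as non-overlapping, and any incomplete word at the end of the sequence is dropped.

```python
def abstract_sequence(seq, word_len=3, base="G", threshold=2):
    """Reduce a nucleotide sequence to a binary string, one bit per word.

    A word maps to "1" if it contains at least `threshold` copies of `base`,
    otherwise to "0". Words are non-overlapping (an assumption), and a
    trailing fragment shorter than `word_len` is ignored (also an assumption).
    """
    seq = seq.upper()
    bits = []
    for start in range(0, len(seq) - word_len + 1, word_len):
        word = seq[start:start + word_len]
        bits.append("1" if word.count(base) >= threshold else "0")
    return "".join(bits)

# Words: GGA TCG GAG TTA GGG
print(abstract_sequence("GGATCGGAGTTAGGG"))  # prints 10101
```

The 15 nucleotides collapse to five binary digits, which gives a sense of how such a rule shrinks the data a Markov model has to handle.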

Shepard was able to test his abstraction rules for words of one or two nucleotides locally at the University of Toledo. As more bases are used to create each binary digit, however, the number of possible abstraction outcomes increases exponentially, requiring far more computational horsepower. To test rules for longer word lengths, Shepard turned to the Ohio Supercomputer Center (OSC) and its flagship system, the 9500-node IBM Cluster 1350. Shepard and fellow student Andrew McSweeny accessed the “Glenn Cluster” to optimize the abstraction process by using “hill-climbing” techniques that determine a single, maximal value for each abstraction space, rather than each of its possible values.
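For readers unfamiliar with hill climbing, the following Python sketch shows the general idea on a toy problem. The article does not describe how the abstraction rules were encoded or scored, so everything here is an assumption: a rule is modeled as one bit per possible dinucleotide, and the scoring function is an invented stand-in, not the objective used in the study.

```python
import itertools
import random

def all_words(word_len):
    """Every possible nucleotide word of the given length, e.g. 16 dinucleotides."""
    return ["".join(p) for p in itertools.product("ACGT", repeat=word_len)]

def hill_climb(score, n_bits, rng, max_steps=1000):
    """Steepest-ascent hill climbing over bit vectors.

    Starting from a random rule, repeatedly move to the best single-bit flip
    and stop when no flip improves the score (a local maximum).
    """
    current = [rng.randint(0, 1) for _ in range(n_bits)]
    current_score = score(current)
    for _ in range(max_steps):
        best_neighbor, best_score = None, current_score
        for i in range(n_bits):
            neighbor = current.copy()
            neighbor[i] ^= 1  # flip one bit of the rule
            s = score(neighbor)
            if s > best_score:
                best_neighbor, best_score = neighbor, s
        if best_neighbor is None:
            break
        current, current_score = best_neighbor, best_score
    return current, current_score

# Hypothetical objective: agreement with a hidden rule "word contains a G".
words = all_words(2)
target = [1 if "G" in w else 0 for w in words]

def toy_score(rule):
    return sum(1 for r, t in zip(rule, target) if r == t)

best_rule, best_score = hill_climb(toy_score, len(words), random.Random(0))
print(best_score, "of", len(words))
```

The appeal of the technique is that it examines only a rule's immediate neighbors at each step instead of scoring every rule in the space, at the cost of possibly settling on a local rather than global maximum when the objective is less well behaved than this toy one.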

“The trials required approximately 116 individual supercomputer jobs, each using 128 computer cores (32 physical nodes) and taking a little over two hours of wall time per round,” Shepard said. “Total optimization for the tetranucleotide abstraction rule took more than ten-and-a-half days for 324 million abstraction rules.”
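The figures Shepard quotes can be checked with a little arithmetic, shown here as a short Python calculation. It assumes the 116 jobs ran one after another and that the ten-and-a-half days refers to their combined wall time; since he said "more than" ten-and-a-half days, the results are lower bounds.

```python
jobs = 116
cores_per_job = 128
nodes_per_job = 32
total_days = 10.5  # quoted as "more than ten-and-a-half days", so a lower bound

total_hours = total_days * 24                    # combined wall time in hours
hours_per_job = total_hours / jobs               # average wall time per job
cores_per_node = cores_per_job // nodes_per_job  # cores used on each node
core_hours = total_hours * cores_per_job         # total compute consumed

print(f"total wall time: {total_hours:.0f} hours")
print(f"average per job: {hours_per_job:.2f} hours")
print(f"cores per node:  {cores_per_node}")
print(f"core-hours:      {core_hours:.0f}")
```

Under these assumptions each job averages about 2.17 hours, which agrees with the "little over two hours" per round, and the run consumed roughly 32,000 core-hours in total.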

“Researchers at Ohio universities are fortunate to have at their disposal the resources of the Center when their investigations require computational resources beyond those found on their campuses,” said Yuan Zhang, client and technology support engineer at OSC. “Beyond the big hardware, OSC also offers researchers the expertise to prepare their jobs to run efficiently on parallel systems.”

Shepard and his colleagues then combined different abstraction models to improve accuracy. Using support vector machine technology, they achieved a prediction accuracy of greater than 95 percent. Based upon his research, Shepard is preparing a scholarly paper for publication in a professional journal – the sixth he’s authored or co-authored while working under advisor Alexei Fedorov, associate professor of medicine and director of the Bioinformatics Lab.

“In his three years, Sam has been involved in no less than a dozen projects in my lab, in different areas of mathematics, genomics and proteomics,” said Fedorov. “With his technical expertise, Sam has co-taught the Biomedical Databases summer course for three years, has given a number of lab sessions and lectures for the Perl programming course, and has assisted in the maintenance of the department’s cluster where student data is housed.”

Shepard had to miss graduation exercises and asked university officials to place his Doctor of Philosophy in Biomedical Sciences diploma and the program’s Outstanding Student award in the mail. He had received an Austrian Marshall Plan Foundation scholarship and departed to study in Europe several days before the commencement ceremonies.

A quick primer on genomics:

Within each cell's nucleus, deoxyribonucleic acid, or DNA, carries the information needed to create and sustain most living organisms. Most DNA is made up of a pair of twisted strands built from paired subunits called nucleotides, each carrying one of four bases, the "letters" of the genetic alphabet: adenine (A), thymine (T), guanine (G) and cytosine (C). Just as the order of letters determines the meaning of a word, the order of the bases determines the meaning of the information encoded in that part of the DNA. Some sequences or "words," called exons, are translated into proteins that express the genetic instructions, while other sequences, known as introns, serve as intervening markers between exons and contain various genomic signals.