Though the human genome was fully sequenced in 2001, the most promising work in genomics has just begun and not even in the study of human DNA. Human cells are outnumbered by bacterial cells by a factor of ten to one, and, as the rest of this site alludes to ad nauseam, there is strong reason to believe that bacteria are to blame for many of the chronic diseases from which humans suffer. Genetically speaking, we know relatively little about bacteria that persist in humans. The field is ripe for advances.

Colorful representations of sequenced genomes adorn the walls at JCVI.

You may wonder how a researcher can view and understand a particular bacterial genome. On their own, they cannot. Progress in genetics is a group effort, and requires partnering with one of the handful of heavyweight institutions in the world that have developed resources allowing for genome interpretation. Several such institutions exist in the US. The NIH has bacterial protein sequencing tools at its disposal. The Broad Institute at MIT as well as the Washington University Genome Sequencing Center have also developed tools that allow for genome sequencing.

Many would argue, though, that the institution most on the bleeding edge when it comes to genome sequencing technology is the J. Craig Venter Institute, formerly known as TIGR. Headed by transformative iconoclast and entrepreneur J. Craig Venter, the Institute is a non-profit research center that was founded in 2006. It has facilities in Rockville, Maryland and La Jolla, California and employs over 400 people, including Nobel laureate Hamilton Smith.

You can imagine how happy I was to get an email from my former advisor at Georgetown (where I earned my undergraduate degree), asking if I wanted to attend a training session in bacterial sequencing technology at the Rockville branch of the J. Craig Venter Institute (JCVI). He was keenly aware of my thirst to gain hands-on experience with sequencing technology.

The training was called the “Prokaryotic Annotation and Analysis Workshop.” (As some may know, prokaryotes are single-celled organisms without a nucleus, a group that includes bacteria and archaea.) This experience marked my first exposure to sequencing technology, and I had little idea what to expect. Would I be able to follow the procedures used to identify protein sequences? Three days isn’t much time, but I was cautiously optimistic.

First Impressions

Last Monday, I boarded a train to Washington DC, took a quick cab over to Georgetown to say hello to some of my old professors, and proceeded to take the Metro up to Rockville. After a solid night’s sleep at the “Sleep Inn,” I took the hotel’s shuttle to the door of the Venter Institute.

The entrance of JCVI has an aura of science and progress. The walls of the lobby and hallways are covered with neatly framed images of sequenced genomes. The individual proteins in such pictures are illuminated in different colors, evoking modern art. The head of educational outreach programs gave us a tour of the grounds, which concluded at the space right in front of Venter’s office (he was traveling at the time and unfortunately not in his office!). Employees of JCVI refer to the space as “the museum” as several objects deeply rooted in scientific history are on display. A glass case on the left side of the room stores letters exchanged between Watson and Crick. A very early model of a sequencing machine used by Rosalind Franklin is on the right. A large statue of a tiger seemingly prowls in the middle of the room – one of several tiger statues that used to be at the building’s entrance when the Institute bore its previous name. If the tiger is the unofficial mascot of JCVI, it’s certainly an appropriate one. This is not the place for the ambivalent.

Do the rules say, “No riding the tiger”? I hope not.

It’s clear that the staff at JCVI take great pride in their accomplishments, and with good reason! Copies of Science and other prestigious journals containing studies published by JCVI or reports of efforts led by Venter are displayed on tables in several locations. The walls of the hallway leading to Venter’s office are covered with framed newspaper and magazine articles featuring Venter – articles in Wired, People Magazine, and the New York Times. Venter has been named one of the top 100 most influential people in the world by Time Magazine for the last two years.

Before the training began, I had the opportunity to chat with some of the twelve other people in my class. I had already met Dr. Anne Rosenwald, a professor at Georgetown whose research focuses on understanding the genetics of various yeast forms. She also teaches biochemistry. Dr. Rosenwald was attending the session in the hopes of working out a deal with JCVI in which the Institute could provide her with genomes that have been analyzed by computers but are still in need of human annotation. Annotation refers to the process of using clues in a DNA sequence in order to name and identify protein coding regions. If such an exchange of information is possible, it would allow her undergraduate students to map a bacterial genome as their thesis project. I hope the partnership works out because I think that, while challenging, using JCVI’s annotation technology would provide any undergraduate with excellent preparation for microbiology and molecular biology graduate programs. I certainly wish I could have learned how to sequence a genome as an undergraduate!

Let the learning begin

Two members of our group had travelled to JCVI all the way from South Africa. Researchers at the University of the Free State, they were already using several of JCVI’s programs to sequence and thus better understand the genomes of bacteria isolated from several African caves – bacteria that have never before been classified. I spoke with them about the challenges of mapping completely new genomes. Soon enough, I aspire to study new genomes myself, especially those pertaining to the great mass of unclassified species of bacteria in the human body. I figured their feedback could clue me in to the challenges particular to dealing with unknown organisms.

The South African duo were pleased with what they have been able to learn thus far about their cave-dwelling species. When it comes to JCVI’s sequencing technology, they were old pros and suggested improvements to the software throughout the class. Why had they come to JCVI? I sensed an eagerness on their part to see the hub of progress in person and personally get to know some of the people working with and developing the technology they are using. At the end of the session they kindly invited me to visit South Africa and spend time in their lab. Who knows, I might take them up on the offer at some point as South Africa is one of my top travel destinations.

Site of the learning.

Our classroom provided a comfortable atmosphere in which to learn with shiny new laptops for each of us. Access to the laptop allowed us all to get a chance to navigate our way through a program as the instructor described its features in a lecture. Snacks, coffee, hot chocolate and tea were available at all times and during the class we would break every hour or so to refill our cups and chat.

The first day was spent learning about the process by which JCVI’s technology allows unknown proteins to be named and characterized (annotated). Our teacher, Ramana Madupu, is a full-time employee at JCVI who uses the technology discussed in the lecture in the course of her job.

Let’s discover a new species of bacteria!

Let’s say you are picking your nose. First of all, shame on you. But, let’s say that in spite of your flagrant disregard for common decency, you nobly want to contribute to human progress by determining what kind of bacteria are in your booger. After conducting several basic experiments on the bacterial DNA in your lab, you decide that a bacterial species may be new and unique, so you decide to contact JCVI.

JCVI has you send them a sample of the bacteria in question. A non-profit institution, JCVI will run your genome through its sequencing machine at no charge to you. This service is largely automated and it’s becoming cheaper and cheaper. JCVI’s mandate is to sequence as many genomes as possible and freely share that data with researchers. However, JCVI’s offer to freely sequence and interpret your genome comes with an expectation, namely that upon receiving your results, you will review and manually correct any of the sequence errors. At last check, JCVI’s computer annotation programs claim a 95% accuracy rate.

As we know from high school biology, DNA consists of four nucleotides: adenine, thymine, cytosine, and guanine. A gene is nothing more than a sequence of those As, Ts, Cs, and Gs, one which codes for a particular protein. Genes are the blueprints for making proteins, and in fact a new gene is often referred to, at JCVI at least, as a “putative protein.” (The word putative is used until one has sufficiently conclusive evidence to remove that label.)

Sequencing and beginning to interpret a genome

At the risk of gross oversimplification or misstatement, let me dare to explain the technical process of how a genome is sequenced and interpreted. You start with a genome. The DNA is processed with fluorescent dye. Each of the four bases (nucleotides) emits a different color. Those colors are read by a machine and interpreted as a sequence of nucleotides. The result is an exceedingly long sequence of As, Ts, Cs, and Gs.

Now the fun really begins. The goal is to take this enormous sequence and begin to determine which base pair sequence codes for which protein and in which biological category. The Prokaryotic Annotation Pipeline aka “the pipeline” to the rescue!

The pipeline is an algorithm-based workflow which automatically predicts, to a good but certainly not perfect degree of reliability, the name, location, and function of a gene. The pipeline does this by comparing your base pair sequences to sequences of previously identified proteins that exist in a variety of databases.

But how does the pipeline know which segments of your base pair sequence to check against known protein sequences in the hopes of finding a match? There are apparently a lot of fancy statistical algorithms at work here, but one way is to look at the codons, or pieces of genetic code which mark the start and end of a protein sequence. The base pair sequences ATG, GTG, and TTG almost always code for the start of a protein, while the sequences TAA, TAG and TGA almost always indicate the end of a protein. Of course there are always exceptions to the rule, which is why every sequence to come through the pipeline should be checked by a human being. That’s the ideal anyway. By identifying these start and stop codons, the pipeline has a pretty good idea of where one protein coding sequence ends and another begins. At this point, each potential protein coding sequence is referred to as an ORF or “open reading frame.”
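To make the start and stop codon idea concrete, here is a toy sketch in Python. It is emphatically not the pipeline's actual algorithm (which, as noted, relies on much fancier statistics); it only scans the forward strand of a sequence in its three reading frames and reports every stretch that runs from one of the start codons named above to the next in-frame stop codon. The function name `find_orfs` and the `min_codons` cutoff are my own inventions for illustration.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, sequence) tuples for candidate ORFs.

    Only the forward strand is scanned. `start` is 0-based, `end` is
    exclusive, and the reported sequence includes the stop codon.
    `min_codons` is a hypothetical cutoff that discards very short hits.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first start codon seen in this frame
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, dna[start:end]))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTTAGGG"):
        print(orf)
```

A real gene finder also has to consider the reverse strand, overlapping candidates, and the rare exceptions mentioned above, which is exactly why a human is supposed to check the output.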

The algorithm matching a gene to an existing sequence applies greater weight to matches derived by certain databases. For example, one database frequently used for comparison is Swiss-Prot. Swiss-Prot relies exclusively on manual annotation by humans. At present, humans make fewer annotation errors and are, therefore, more reliable than software. For this reason (and perhaps due to the fact that, at least according to stereotype, the Swiss are highly precise), Swiss-Prot is arguably the gold standard. If a sequence from your bacterial genome matches a Swiss-Prot sequence, the confidence level is high that the match is correct.

During the pipeline comparison process, the software will also run your protein sequences against what are known as “Hidden Markov Models” or HMMs. HMMs are essentially statistical models of the patterns of amino acids in a multiple alignment of proteins which share sequence and functional similarity. Proteins run against HMMs receive a score as to how well they match the model. If the score is high enough you can reasonably expect your protein to have the same function that the HMM represents. For example, if your protein has a high-scoring match to an HMM for a protein involved in sugar transport, you can be pretty sure that the matching protein from your genome has the same role.
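For the curious, the scoring idea can be sketched with something much simpler than a real HMM. The Python below builds a position-by-position log-odds profile from a tiny, made-up alignment and scores a protein against it. This is a deliberately stripped-down stand-in: a genuine profile HMM also models insertions and deletions, and the alignment, the uniform background frequencies, and the function names here are all assumptions of mine rather than anything JCVI uses.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Build per-column log-odds scores from an ungapped alignment.

    Every sequence in `alignment` must have the same length. The
    background model is uniform over the 20 amino acids, which is a
    simplifying assumption.
    """
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, protein):
    """Sum the log-odds of each residue; higher means a better match."""
    if len(protein) != len(profile):
        raise ValueError("protein length must match the profile length")
    return sum(column[aa] for column, aa in zip(profile, protein))


if __name__ == "__main__":
    # Hypothetical five-residue "family" used only for illustration.
    family = ["MKTLV", "MKSLV", "MRTLV", "MKTLI"]
    profile = build_profile(family)
    print(score(profile, "MKTLV"), score(profile, "WWWWW"))
```

A sequence resembling the family scores above zero and an unrelated one scores below it; the pipeline's "high enough" threshold plays the same role as picking a cutoff on this score.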

After a particular protein sequence is run against HMM models, the software assigns it a putative name and role, based on how much information it believes it has to support such a label. The process of comparing sections of your base pair sequence to as many existing protein databases as possible is also referred to as BLASTING. Depending on the level of evidence at hand, the protein is also given a gene symbol, role information, and sometimes numbers that pertain to its classification.

For example, after one of your proteins is run through the pipeline the pipeline might come up with the following result:
TIGR00433

Name: biotin synthase

Gene symbol: bioB

TIGR role: 77 biotin synthesis

Now, what does this mean? Let’s break this down. The fact that the gene has what is called a TIGRFAM ID (TIGR00433) means that it had a high-scoring match to a protein family previously annotated at JCVI. Since JCVI obviously believes their genomes have been well annotated, a TIGRFAM match that exceeds the minimum threshold for reliability is generally regarded as a sign that the computer has made the correct match. The name and protein role associated with the highest HMM and other database matches for your protein are also displayed, along with the symbol for the putative gene. In this example, it appears that it is the software’s best guess that your protein is involved in the synthesis of biotin (also known as vitamin B7).

JCVI repeats this process for every Open Reading Frame sequence it detects, and the number of sequences often runs into the thousands.

How then does one access the proposed annotations generated by the pipeline? After each of your protein sequences has been run through the pipeline, JCVI software condenses them into a digital file that is sent to you. At this point you need to use a web-based program that allows you to manually modify the results. The program is called Manatee, Manual Annotation Tool Etc. Etc. An open source project, it was also created by JCVI software programmers. A bit intimidating for the uninitiated, Manatee is a powerful and exquisite program, which allows a person to assign each putative protein the correct name and function.

Annotating a DNA sequence

Your goal in using Manatee is to make sure that the protein matches made by the pipeline are grounded in supporting evidence. For example, you can check for “gene model curation” which provides information necessary to ensure that your genes have the correct coordinates and that your set of predicted genes is complete. Other features allow you to look at the raw base pair sequence of your genome in order to identify rare start and stop codons that the computer may have missed, or screenshots that allow you to note if the software accidentally annotated overlapping genes.

The labs at JCVI are first-class.

As the human annotators using Manatee use their good old-fashioned brain power to identify where the JCVI computers may have made mistakes, they alter the names of certain proteins in accordance with such findings. When naming a protein, the goal is always to err on the side of conservatism.

Let’s say, for example, that based on a strong HMM hit, the computer has decided that one of your proteins is a ribose ABC transporter protein (ribose is a sugar). But after further examining the protein using Manatee’s tools you decide that there really isn’t enough evidence to support the conclusion that the sugar transported by your protein is ribose. You then manually change the protein’s name in Manatee so that it is less definitive by calling it only a “sugar ABC transporter.” Then, after using even more of Manatee’s features, you decide that you can’t put together sufficient evidence that the gene in question really transports any form of sugar. Under such circumstances, you make its name even less specific, calling it simply an “ABC transporter.”
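The "err on the side of conservatism" rule is easy to express as a little ladder of names, walked from most to least specific. The Python sketch below is only my own way of picturing it: in reality the evidence is weighed by a human looking at Manatee's many views, and the evidence keys, the `NAME_LADDER` list, and the fallback name are hypothetical placeholders.

```python
# Names ordered from most specific to least specific, each paired with the
# (hypothetical) piece of evidence the annotator must be satisfied with.
NAME_LADDER = [
    ("ribose ABC transporter", "substrate_is_ribose"),
    ("sugar ABC transporter", "substrate_is_sugar"),
    ("ABC transporter", "is_abc_transporter"),
]


def conservative_name(evidence, ladder=NAME_LADDER,
                      fallback="hypothetical protein"):
    """Return the most specific name that the evidence actually supports.

    `evidence` maps evidence keys to True/False judgments made by the
    annotator. If nothing on the ladder is supported, fall back to the
    least committal name.
    """
    for name, required in ladder:
        if evidence.get(required, False):
            return name
    return fallback


if __name__ == "__main__":
    # The annotator accepts that it is a sugar transporter, but not that
    # the sugar is ribose.
    print(conservative_name({"substrate_is_sugar": True,
                             "is_abc_transporter": True}))
```

Each time a piece of evidence is withdrawn, the name drops one rung, which mirrors the ribose, then sugar, then plain ABC transporter sequence described above.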

As you can see, Manatee is a tool which enables researchers to better make judgments about the role and function of genes by assigning characteristics to those genes. Often enough, evidence to make these determinations is insufficient and attributes are characterized by how reliable the best evidence is. One expands and contracts the attributed qualities as the evidence warrants. When the evidence is equivocal, you say that a protein is “putative.” This apparently is the nature of genetic research, one which requires scientists to pick up on indeterminacy and do their best to fill in the gaps as they go.

Every DNA sequence which emerges from the pipeline is putative. Certain sequences remain so because they fall short of the reliability threshold that would allow the software to give them an existing name. Under such circumstances, no name is assigned and no role is attributed. The protein is simply named “hypothetical protein.” If a hypothetical protein from one species matches a hypothetical protein from another, each is given the name “conserved hypothetical protein.” Since the hypothetical protein has been found in two different species, the corresponding sequence clearly exists. But the sequence’s series of base pairs is so different from known sequences that, at this point in time, neither the software nor a human annotator is able to give it a name or role. Ramana (my instructor) commented that as the Human Microbiome Project presses on ahead, she expects to see many more “hypothetical proteins” show up in genomes. In fact, she had just recently finished sequencing about eight Microbiome genomes and was surprised at how few of their DNA sequences matched known proteins. This suggests that the majority of the yet unknown bacteria that inhabit the human body are quite different from those species we have already become familiar with, such as E. coli or Mycobacterium tuberculosis.

One might ask, “Isn’t the human annotation process open to error and bias?” The answer is yes. It’s up to the human annotators to decide if they can find enough information to support a software derived match and every human has different tendencies when it comes to such decisions. Annotators like Ramana say that after working with a sufficient number of genomes they usually learn to trust their gut feelings and standardize the process by which they make naming decisions. Even the best human annotators would admit that 100% consistency, from one day to the next, between one annotator and the next, is unattainable.

The Institute had a modern aesthetic.

So what happens when an annotator has finished going over all the proteins in a particular genome? Genomes in which all protein sequences have been given a name and function are considered “closed” and are ultimately made available through GenBank. Any genome with loose ends is considered “open,” with the hope that future researchers will be able to confidently determine the names and roles of current hypothetical proteins. One way to determine the role of a “hypothetical protein” is to study the protein coding sequence in the laboratory using techniques such as the creation of gene knockouts. Given this, it should be clear to my readers that sequencing technology does not obviate the need for laboratory research. There remains a lot of work to be done in this field.

About the Comprehensive Microbial Resource (CMR)

JCVI enters as many genomes as possible into a database called the Comprehensive Microbial Resource (CMR). CMR is a free, open-source website that allows access to the sequence and annotation of all completed prokaryotic genomes. CMR is a seemingly invaluable resource. Before genomes are entered into the database they are standardized in a manner that makes them much easier to compare. Researchers from over 200 sequencing centers currently put sequenced genomes into a database called GenBank. GenBank contains about 600 complete prokaryotic genomes with about 10 new genomes released each month. One of the significant problems with GenBank is that the annotation process differs so much from one submitting center to the next that many of the genomes in GenBank have been named using different conventions. Often, they have also been assigned gene symbols and role names that differ depending on where they were sequenced.

The goal of the CMR is to take the genomes from GenBank and create common data types with the same nomenclature, sequence elements, and annotation methodology. When this has been done, individual genomes can be compared much more easily and accurately. There are currently about 400 organisms in the CMR but the project’s leaders have ambitiously committed themselves to adding several hundred more genomes to the database in the coming months. One reason that the CMR contains fewer genomes than GenBank is that the project is, thus far, unfunded. Apparently, JCVI has been working on the CMR without grant money for the last two years. The program is so well-designed and useful that it’s hard to believe it could go unfunded. I was told that JCVI has just applied for a new grant that might allow the project to be funded and project leaders should hear back about the decision in a week or two. Fingers crossed!

Tanja Davidson, one of the main directors of the CMR, was our teacher on day three. In fact, our entire third day was spent learning about the CMR, which, at first glance, contains a daunting but well-organized number of features. CMR allows the researcher to compare multiple genomes using what are called “cross genome analysis pages.” These tools allow two or more genomes to be compared so that the elements they have in common (or the elements that make them different) can easily be analyzed.

Imagine that doctors report an outbreak of a stomach disease and a bacterial species is isolated from people with the illness. The genome of the disease-causing pathogen is put through Glimmer (a gene-finding program) and the pipeline, annotated by humans, and found to be part of the E. coli family. By using CMR tools, researchers can compare the genome of the new E. coli variant to the genomes of other E. coli strains that have not been tied to stomach disease. Most of the genes between the different forms of E. coli should be the same because they are of the same family. But those genes that differ between the recently isolated variant and those already in the database are the best candidates for coding the proteins that endow the new variant with the ability to cause disease. In this case, the CMR comparison tools would greatly narrow down what would otherwise have been a veritable “needle in a haystack” situation.
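Stripped to its bones, this kind of cross-genome comparison is a set difference: take the genes in the new variant and subtract everything seen in the harmless relatives. The Python below is only a sketch of that reasoning. The real CMR tools compare sequences rather than gene names, and the strain names and gene lists here are illustrative inputs I made up, not actual data.

```python
def genes_unique_to(query_genes, reference_genomes):
    """Return genes present in the query but absent from every reference.

    `query_genes` is an iterable of gene identifiers. `reference_genomes`
    maps a (hypothetical) strain name to its collection of gene identifiers.
    """
    seen_in_references = set().union(*reference_genomes.values())
    return set(query_genes) - seen_in_references


if __name__ == "__main__":
    # Made-up gene lists, for illustration only.
    outbreak_variant = {"recA", "lacZ", "gyrB", "stx2", "eae"}
    harmless_strains = {
        "strain_A": {"recA", "lacZ", "gyrB"},
        "strain_B": {"recA", "lacZ", "gyrB", "fimH"},
    }
    print(sorted(genes_unique_to(outbreak_variant, harmless_strains)))
```

The handful of genes left over is the shortlist a researcher would then take back to the lab, since being unique to the variant suggests, but does not prove, a role in causing disease.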

Other nice features of the CMR include the ability to access a “Role Category Graph” which displays the different roles of all the proteins in a genome in a colorful pie chart. A tool called “Restriction Digest” allows users to cut genes of interest with various enzymes – a procedure that takes a long time to complete in the lab but only minutes to complete using the CMR. A “Pseudo 2-D Gel” allows users to get an idea of what a genome of interest looks like in another dimension. Each dot of a 2D gel represents a single protein whose location can be compared to others. The comparative tools even allow for the creation of a scatter plot in which two genomes are compared on a two-dimensional plane. MUMmer (Maximum Unique Match) compares genomes at the nucleotide level, allowing scientists to detect even single nucleotide differences between DNA sequences.
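To give a feel for why an in-silico restriction digest takes minutes rather than days, here is a minimal Python sketch. It cuts a linear DNA string wherever a recognition site occurs, defaulting to the well-known enzyme EcoRI, which recognizes GAATTC and cuts between the G and the first A. I have no idea how the CMR implements its tool; the function name and parameters are my own assumptions, and the sketch ignores circular chromosomes altogether.

```python
def digest(dna, site="GAATTC", cut_offset=1):
    """Cut a linear DNA sequence at every occurrence of `site`.

    `cut_offset` is how many bases into the site the cut falls; the
    default of 1 matches EcoRI (G^AATTC). Returns the list of fragments
    in order.
    """
    dna = dna.upper()
    fragments = []
    previous_cut = 0
    position = dna.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(dna[previous_cut:cut])
        previous_cut = cut
        position = dna.find(site, position + 1)
    fragments.append(dna[previous_cut:])  # whatever follows the last cut
    return fragments


if __name__ == "__main__":
    print(digest("AAGAATTCTTGAATTCGG"))
```

Joining the fragments back together always reproduces the original sequence, which is a handy sanity check on any digest.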

Ready to learn about human annotation!

When it comes to the CMR, Tanja and other JCVI employees welcome feedback from scientists other than those at JCVI. In fact, while we were doing some practice CMR tutorials in class, the pair from South Africa and Dr. Rosenwald came across a few minor glitches in the system. Tanja was quick to write them down and most of them were already fixed by the time we got back from lunch. Rosenwald and others also offered feedback about new features they might like to see in the CMR and Tanja was again quick to record their suggestions and insights. I could tell she was definitely not just humoring people but actually planning to pass every suggestion by her development team.

The reality is that Manatee and the Annotation Engine project are part of the Institute’s open source initiative, the goal of which is to provide high quality software and services to the genomic community. External involvement and feedback are strongly encouraged because it’s such feedback that drives development and continual improvement of the software. In fact, JCVI doesn’t actually have employees who test their software, so they fully depend on user feedback. Some of us joked that because we were testing the CMR as part of our class exercises we should have been paid to attend the training session rather than vice versa.

Improving the accuracy of the pipeline

Human annotation is a lengthy and laborious process. One of the foremost goals at JCVI is to perfect the pipeline and the computer annotation process such that human annotation is no longer necessary. One idea currently being tossed around at JCVI when it comes to perfecting the output of the pipeline is a concept referred to as something like “humanitization.” (I’ve searched my notes but can’t find the exact name!) The annotators at JCVI are currently being asked to report exactly how they go about using Manatee in order to annotate a genome. As previously discussed, since there are so many databases to compare and analyze in Manatee, each employee using the program has settled into a pattern of evaluating database information in a certain methodical fashion. The hope is that if some of the best human annotation regimens are recorded and analyzed, they can be translated into logic which software could duplicate.

If these extra steps do indeed increase the accuracy of the protein matches made by the pipeline, there may no longer be a need for humans to check the pipeline’s output in Manatee. So it’s possible that in the coming years genome annotation may be a completely automated process. At the current moment, the pipeline’s protein matches are accurate about 95% of the time. The stated goal is to get that level of accuracy into the 99-100% range. So, as Tanja commented, the human annotators at JCVI who are currently helping programmers understand how they navigate Manatee may, by doing so, actually be putting themselves out of a job.

But at least for now, human employees are still an integral part of the annotation system. Four recently hired JCVI employees were attending the teaching session. During a discussion about perfecting the pipeline, our instructor confided that one of our classmates had just been hired with the expectation that he would create the technology to make the pipeline more accurate. What a daunting job! The rest of us regarded him with a certain level of awe over the next two days. Every so often our practice sets would reveal a flaw in pipeline output and the instructor would turn to this particular employee and say something like, “Of course now, you’ll be fixing this problem.” Such comments reflect what seems to be the prevailing attitude at JCVI. Most of their projects are extremely ambitious and half the time I’m not sure they even know whether success is possible when a task is initiated. But the mindset is “No matter how hard this goal seems we will simply have to find a way to get it done!” This type of determined thinking does seem to generate results, as there is little doubt that such an attitude was the driving force behind the Institute’s ability to sequence the human genome in record time.

Tanja, our instructor, and Phil, a JCVI employee who was hired to improve the pipeline.

As implied by the above paragraph, there are a lot of situations at JCVI that end up pitting humans against computers. As Ramana described, it would be ideal if every genome sent to JCVI could be manually annotated from the outset. At least for now, a well-trained human is able to pick up on subtleties of database comparisons that the computer can miss. But such a scenario, at least over the long term, simply isn’t sustainable. Since genome mapping will only grow in popularity over the coming years, humans alone cannot keep up with the number of genomes requiring mapping. Although using computers to annotate genomes slightly compromises accuracy, the technology must be used in order to keep up with demand. Ideally genomes are manually checked with Manatee but there are definitely JCVI/TIGR annotations that are never checked by a human annotator at all.

Sequencing bacteria and the future of medicine

In recent years, mapping genomes has grown in popularity. Scientists working on efforts related to the Human Microbiome Project currently want to map the genomes of every single bacterial species capable of inhabiting the human body, and such pathogens may number in the thousands. But large groups of other scientists are set on better understanding the massive number of bacteria that inhabit our oceans. Since little is known about many regions of the ocean, who knows how many microbes these efforts may turn up? Then, like the two scientists in our group, other research teams seek to map the genome of bacteria that live in obscure land locations such as caves, volcanoes, mines etc. So, the JCVI computers and those at other sequencing centers are relentlessly accumulating DNA data.

Perhaps because they have each personally annotated so many hypothetical proteins about which we currently know nothing, the staff at JCVI are very open to the idea that we are only on the brink, if that, of truly understanding the bacteria capable of making us ill. This correlates with the Marshall Pathogenesis in which essentially all inflammatory diseases are attributed to infection with chronic intraphagocytic metagenomic bacteria that, for the most part, have yet to be clearly named and sequenced. One study I often invoke was conducted by Dempsey and team. This Glasgow-based group found human tissue taken from prosthetic hip joints contained protein sequences corresponding to those of hydrothermal heat vent bacteria. Most of the time when I discuss the study, other scientists are skeptical of the results. The average response is that they would like to see the results repeated or that the sample was contaminated. Ramana had no such reaction. In her opinion, there can definitely be hydrothermal heat bacteria in the human body and she’s confident the sample was not tainted. When we discussed the findings she suggested that the bacteria are probably not killed at high temperatures which, interestingly, was one of Dr. Marshall’s first inferences when analyzing the data.

A JCVI sequencing machine

The organizers made an admirable effort to serve us savory lunches which we ate in one of JCVI’s cafeterias. All our teachers attended lunch and sat among us, meaning that I was able to easily pepper them with questions. Alex Richter, one of the program heads, was great about answering my questions in detail. Thanks to his anecdotes, I got a much better impression of what microbiology labs will be doing in the coming years and the tools I will likely need to master as a potential microbiology PhD student. Before attending the training I had wondered if I would be able to understand JCVI’s sequencing technology without a background in computer science. But Richter didn’t seem to think that my lack of computer training was an issue and it’s true that I certainly seemed able to follow the discussions in class. I was encouraged by Richter’s comment that someone good at scientific reasoning (such as, ahem, myself) is also likely to be good at working systematically with computer programs. I’m sure he’s right, but even so I won’t be contributing to the Linux codebase any time soon.

It felt pretty darn good to be in a place where I personally believe that government funding is going towards research that is really going to have an impact on our ability to better understand chronic disease. As the Marshall Pathogenesis continues to spread, it’s clear that bacteria will eventually receive all the scrutiny they are due. At that point, scientists, doctors and patients alike are going to demand a more thorough understanding of bacteria implicated in chronic disease, right down to the level of the genome.

It’s great that JCVI is already starting to collect data on never before sequenced bacteria. It’s also good that the Institute is striving to perfect bacterial sequencing technology now, so that by the time the Marshall Pathogenesis gains hold, sequencing results should not only be more accurate but also easier to use. Just as our ability to sequence genes has improved exponentially, so, I believe, will our ability to interpret the data. The tools are just getting better and better. As someone who has the inside scoop about the fact that bacteria are headed for the big time, I feel we’re closer than ever to characterizing the genomes of the pathogens that are capable of making us so ill.