Consider, for example, the starting and ending time points of a fecal transplantation series, where it is obvious that the clusters are statistically significantly different, but not obvious what direction the difference is in or what it means. However, when we perform a meta-analysis and place these samples in the context of the Human Microbiome Project data, one of the most important abstract maps in human microbiome science, we see immediately that the difference between start and end point is much greater than the difference between healthy and diseased samples, and when we add the intermediate time points we see that the transition occurs very rapidly. Examples of this kind motivate similar data collection and visualization techniques in metabolomics, so that we can identify a desirable metabolomic state and guide an undesirable state into a desirable one by optimizing the trajectory toward the desired state through a series of perturbations. Only the existence of a map allows rational hypotheses about what to try, especially in the context of n=1 studies or in cases where response heterogeneity among individuals is extreme. We have seen, quite literally, the value of maps. But how do we build them? The key to acquiring high-resolution data, whether spatially or temporally resolved, or dense enough in an abstract space, is to make sampling fast, cheap, and sufficiently precise. Unfortunately, the trade-offs among these requirements are typically not well understood.
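To make the notion of a map concrete, the short sketch below computes Bray-Curtis distances over a combined table of reference and new samples and runs a classical principal coordinates analysis (PCoA). It is a minimal illustration with random stand-in count tables; in a real meta-analysis the reference block would be, for example, HMP samples and the new rows the fecal transplantation time points.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
reference = rng.integers(0, 50, size=(100, 200))   # 100 reference samples x 200 taxa (stand-in)
new_samples = rng.integers(0, 50, size=(5, 200))   # e.g. FMT time points (stand-in)
counts = np.vstack([reference, new_samples]).astype(float)
rel = counts / counts.sum(axis=1, keepdims=True)   # convert to relative abundances

def pcoa(condensed_distances, n_axes=2):
    """Classical multidimensional scaling (PCoA) from a condensed distance vector."""
    d = squareform(condensed_distances)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering    # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(b)
    top = np.argsort(eigvals)[::-1][:n_axes]       # keep the largest axes
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0, None))

coords = pcoa(pdist(rel, metric="braycurtis"))
print("New samples on PC1/PC2:")
print(coords[-5:])
```

Where the new points fall relative to the reference cloud, and how they move across time points, is precisely the kind of information a map supplies.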
In DNA sequencing, a common question is whether, given a fixed sequencing budget, it is better to have more sequences per sample or more samples. In general the answer depends on the hypothesis to be tested. But, as noted above, all too frequently the “hypothesis” is retrofitted to an arbitrarily collected dataset. In our experience with amplicon sequencing, the value of having more samples has always outweighed the value of having more sequences per sample, down to surprisingly low thresholds. For example, Figure 1.1.3 shows the Earth Microbiome Project dataset sampled at 500,000 sequences per sample, 1,000 sequences per sample, and just 200 sequences per sample. The overall patterns, e.g. the host/non-host split and the saline/non-saline split, are much clearer with more samples than with more precision about the location of each sample in PCoA space. Multinomial sampling considerations make it immediately clear why this is true: with 100 sequences per sample, the standard error in estimating the proportion of a taxon at 5% frequency is sqrt(0.05 × 0.95 / 100) ≈ 0.022, or about 2.2 percentage points, roughly half the true proportion; for a taxon at 1% frequency the standard error is about ±1 percentage point, or about 100% relative error. Consequently, even though low-abundance taxa are estimated with large relative error, surprisingly few sequences suffice to place a sample in the context of an overall map. Logically, this must be true, or all ordination diagrams in microbial ecology before the advent of next-generation sequencing would have been useless, yet many revealed biologically interesting principles. The goal for better amplicon maps should therefore be to process vast numbers of additional samples inexpensively, exploiting the power of modern sequencers.
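The multinomial argument can be checked with a few lines of arithmetic. The sketch below (plain Python; depths chosen to bracket the numbers above) computes the binomial standard error sqrt(p(1 − p)/n) of an estimated taxon proportion and the corresponding relative error.

```python
import math

def proportion_se(p, n):
    """Standard error of a proportion p estimated from n sequences."""
    return math.sqrt(p * (1 - p) / n)

for depth in (100, 200, 1_000, 500_000):
    for p in (0.05, 0.01):
        se = proportion_se(p, depth)
        print(f"depth={depth:>7,}  true p={p:.2f}  "
              f"SE={se:.4f}  relative error={se / p:.0%}")
```

Even at a few hundred sequences per sample, the abundant taxa that dominate ordination distances are pinned down to within a couple of percentage points, consistent with the observation that extra samples usually buy more than extra depth.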
Shotgun metagenomics, however, poses a different challenge, because typically only a small fraction of the sequences can be confidently assigned to known taxonomy or function. Further, the goals are often different because of the value of genome assembly in identifying biosynthetic pathways, allowing taxonomic resolution at the species or strain level, and generating high-resolution single-nucleotide polymorphism profiles to characterize novel strains and confirm functional variants. As a result, although the same sampling principles as for amplicon data apply if the goal is a high-level taxonomic profile, far more sequences must be collected to reach the same level of confidence in the result. Consequently, the most important areas for tool development in shotgun metagenomics include a further drop of several orders of magnitude in sequencing cost, more comprehensive and less biased reference databases, and more efficient and accurate algorithms for read alignment, genome assembly, and genome separation (binning). In particular, methods that can identify genetic variation from lower-coverage data, and methods for estimating features of interest from less data or with efficient target capture, are especially needed. Another important consideration in shotgun metagenomics is host DNA depletion, both experimental and computational, because total DNA extracts from biological specimens can be dominated by host DNA that is not picked up by standard PCR primers for bacterial/archaeal amplicon sequencing.
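Host-read removal is conceptually simple on the computational side: discard reads that look like the host reference and keep the rest. The toy sketch below screens reads against a set of host k-mers; it is only an illustration (production pipelines typically align reads to a full host genome with a dedicated aligner and retain the unmapped fraction), and the k-mer length and threshold here are arbitrary.

```python
from typing import Iterable, List, Set

K = 31  # k-mer length; a common choice, but arbitrary for this toy example

def kmers(seq: str, k: int = K) -> Set[str]:
    """All overlapping k-mers of a sequence (no reverse-complement handling, for brevity)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_host_index(host_sequences: Iterable[str]) -> Set[str]:
    """Pool k-mers from host reference sequences into a crude screening set."""
    index: Set[str] = set()
    for seq in host_sequences:
        index |= kmers(seq)
    return index

def deplete_host(reads: Iterable[str], host_index: Set[str],
                 max_shared_kmers: int = 1) -> List[str]:
    """Keep reads that share fewer than `max_shared_kmers` k-mers with the host index."""
    return [read for read in reads
            if len(kmers(read) & host_index) < max_shared_kmers]
```

The retained, non-host fraction is what then proceeds to assembly and profiling.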
The challenges in metabolomics are somewhat different. The cost of sequencing per unit of data has dropped by nine orders of magnitude, whereas over the same period the cost of collecting mass spectrometry data has dropped by only about two orders of magnitude. However, the main limitation is the enormous diversity in chemistry. Unlike sequencing, where only four bases have to be “identified”, metabolomics requires identifying hundreds to thousands of molecules from a list of millions, if the molecule is known to exist at all. This chemical diversity also affects the choice of extraction solvents during sample preparation, the type of separation method, the type of instrument used, and the data analysis approach. Further, because the multiplexing strategies that are successful in both amplicon- and shotgun-based sequencing are not available in mass spectrometry, instrument time is directly proportional to the number of samples. Consequently, although it is easy to slip a few more samples into a mass spectrometry run, instrument time is limiting for large-scale projects. As was the case with sequencing a decade ago, the vast majority of molecular features found in a sample are currently unidentified, and many are likely technical artifacts of various steps in the process, e.g. adducts formed in the gas phase, solvent artifacts, and multimers of the same compound. Better methods and incentives for aggregating community knowledge and for automatically assigning unknown mass peaks and fragmentation spectra to molecules, with estimated error rates rather than heuristics subject to personal interpretation, are urgently needed.

Global Natural Products Social Molecular Networking (GNPS) offers alternative solutions for computational mass spectrometry infrastructure. Spectral datasets can be publicly deposited with a unique identifier and transformed into “living data”, because they are continuously searched against reference libraries to update users on new identifications. Furthermore, annotations can be made by the scientific community within GNPS and propagated to all other datasets in the public domain, with subscribers notified of new annotations. This living-data concept is a crucial way to ensure that collected metabolomics data remain useful over time. Other examples include automated species metabolome references and the Molecular Explorer for cross-searching annotated MS/MS spectra between datasets. Connections between several datasets, within the same knowledge base or between different spectral repositories such as MetaboLights and Metabolomics Workbench, can be made to highlight annotated compounds found in several datasets. Such analysis is a trivial task in sequencing but still novel in mass spectrometry.

Integration of taxonomic, genomic, and metabolomic data remains an important unsolved challenge. Although genome mining is successful for identifying the sources of individual natural products, matching the overall taxonomic or functional profile to a molecular profile remains challenging because of procedural and analytical differences in data acquisition. In particular, the likelihood of time lags in chemical production or in genomic response to environmental changes, which may appear on different timescales, makes integrated analysis of snapshot data extremely challenging. In cases where microbial and molecular composition is driven by a dominant effect, the molecular and metagenomic datasets will appear concordant by Procrustes analysis, which measures the fit of one ordination space to another.
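Running the Procrustes comparison itself is straightforward once both ordinations are in hand. The sketch below uses random stand-in coordinates; in practice the two matrices would be the metagenomic and metabolomic PCoA coordinates of the same samples, in the same row order and with the same number of retained axes.

```python
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(0)
metagenomic_pcoa = rng.normal(size=(50, 3))                                 # 50 samples x 3 axes (stand-in)
metabolomic_pcoa = metagenomic_pcoa + rng.normal(scale=0.1, size=(50, 3))   # noisy copy (stand-in)

# procrustes() translates, scales, and rotates the second matrix onto the first
# and reports the residual sum of squared differences ("disparity", M^2);
# lower disparity means the two ordinations are more concordant.
m1, m2, disparity = procrustes(metagenomic_pcoa, metabolomic_pcoa)
print(f"Procrustes disparity (M^2): {disparity:.4f}")
```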
It is likely that an integrated systems biology approach that maps all data layers onto common pathways will be needed. This task is complicated at present not only because most genes, pathways, and molecules are unknown, but also because, even for the known components of the system, we still lack coherent ontological conventions across databases that would aid in connecting these data layers. Integrating this extended universe of possible molecules and their transformations across space, time, and species in complex ecologies will require fundamentally new approaches, and orders of magnitude more computing power than is available today.

Another branch of non-hypothesis-driven research, but one critically important to framing precise hypotheses, is the development of standards.
In microbiome science, these standards broadly take three tracks: procedural standards for sample collection and handling, analytical standards for determining the accuracy and fidelity of readouts, and annotation standards for integrating results across studies. The lack of agreed-on standards stems from the origin of much of microbiome science in the discipline of ecology, where the fundamental questions revolved around finding new kinds of organisms to fill out the phylogenetic tree of life, and around finding statistically significant differences in microbial diversity or composition among sets of samples within an individual study. Because the goal was to test whether any difference existed in the microbiome or metabolome as a function of disease, physiological, or environmental state, biases were not terribly important as long as a difference could be discovered. However, that context diverges radically from the present situation, where physicians and engineers expect to measure the correct, absolute abundances of all microbes or molecules in a given sample simultaneously. The realities of nucleic acid or organic extraction, detection methods for sequences and molecules, and downstream data processing simply do not support this important goal; indeed, in general we do not even know how far we are from it, or what the specific blind spots are. Consequently, without consistent and well-defined measurements underpinned by a mechanistic causal model of change, the state of microbiome-based predictions is much more like astrology than like astronomy.

To move from pre-science to science in predicting microbiome changes, we need known reference standards that can be spiked into samples at different stages, from the original specimen to extracted DNA or molecules, and that are agreed on, widely used in the field, and available in inexhaustible supply. Previous efforts, such as the HMP standards, have been limited by insufficient availability of materials, taxonomic complexity, or both. KatharoSeq in particular benefits from having different spike-in standards at the level of the primary sample and at the level of DNA, allowing different sources of contamination to be tracked down. Comparable developments in mass spectrometry, perhaps with isotope-labeled molecules or molecules otherwise unlikely to occur in biological specimens that can be introduced at different steps, would be of tremendous value.
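As one concrete illustration of what such controls enable, the sketch below uses positive-control samples of known spike-in composition to choose a minimum read count below which samples are discarded. This is a deliberately simplified stand-in rather than the published KatharoSeq procedure (which fits a model to a control dilution series); the column names and the 0.9 cutoff are assumptions for illustration.

```python
import pandas as pd

def minimum_read_threshold(controls: pd.DataFrame,
                           expected_fraction: float = 0.9) -> int:
    """controls: one row per positive-control sample, with columns 'total_reads'
    and 'spikein_reads' (reads assigned to the spiked-in taxa). Returns the
    smallest observed depth at which the spike-in already dominates the profile."""
    frac = controls["spikein_reads"] / controls["total_reads"]
    passing = controls.loc[frac >= expected_fraction, "total_reads"]
    if passing.empty:
        raise ValueError("No control reaches the expected spike-in fraction")
    return int(passing.min())

# Toy control table: a dilution-like series of positive controls.
controls = pd.DataFrame({
    "total_reads":   [50, 200, 1_000, 5_000, 20_000],
    "spikein_reads": [20, 150,   950, 4_900, 19_800],
})
print("Keep samples with at least", minimum_read_threshold(controls), "reads")
```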
Sample collection and storage can introduce biases of varying degrees into specimen readout, but for most sample types the precise implications of different forms of degradation are unknown. Consequently, the conservative recommendation is always to collect pristine samples at considerable expense, even when more practical methods would often suffice. For a few sample types, such as amplicon processing of stool, considerable data are now available across a range of conditions, and researchers can make more informed decisions about which methods to use. However, we know much less about the implications of sample degradation for most other types of biospecimens, or about its implications for reading out different molecular fractions with mass spectrometry. Understanding these principles would greatly expand the accessibility of these techniques to field, clinical, and self-collected specimens, as the American Gut Project is already doing for amplicon collection from stool.

Finally, integrating samples from different studies remains extremely challenging because of differences in annotation. For example, different studies may refer to “stool”, “feces”, “gut”, or other synonyms, or may rely on different units of measurement. Efforts such as the Genomic Standards Consortium MIxS family of standards, the Earth Microbiome Project Ontology (EMPO), and other annotation schemes assist considerably in these tasks, but have been applied to relatively few datasets to date. The potential for natural language processing and/or data-driven methods for automatically applying annotations, perhaps semi-supervised with human guidance, is considerable. Such strategies have already been used in Qiita to infer EMPO annotations for tens of thousands of samples, based primarily on the researcher-reported “sample_type” field. Resources like Qiita, which allow researchers to deposit microbiome studies, provide mechanisms that help researchers use standards-compliant metadata. However, further development is needed to let researchers “discover” the types of variables and controlled vocabularies that are shared across the resource.
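A first step toward that kind of discovery can be as simple as normalizing free-text values onto a controlled vocabulary. The sketch below is illustrative only: the synonym table is invented rather than drawn from EMPO or MIxS, and a real system would combine such lookups with natural language processing or a trained classifier plus human review.

```python
import difflib

# Illustrative controlled vocabulary: canonical term -> known synonyms.
CONTROLLED_VOCAB = {
    "feces": ["stool", "faeces", "fecal", "gut", "poop"],
    "saliva": ["oral", "spit", "mouth swab"],
    "soil": ["dirt", "topsoil"],
}

# Flatten to synonym -> canonical term for exact lookups.
SYNONYMS = {syn: term for term, syns in CONTROLLED_VOCAB.items() for syn in syns}
SYNONYMS.update({term: term for term in CONTROLLED_VOCAB})

def normalize(sample_type: str) -> str:
    """Map a free-text sample_type to the controlled vocabulary, or flag it."""
    key = sample_type.strip().lower()
    if key in SYNONYMS:
        return SYNONYMS[key]
    # Fall back to fuzzy matching against known synonyms; flag if nothing is close.
    close = difflib.get_close_matches(key, SYNONYMS.keys(), n=1, cutoff=0.8)
    return SYNONYMS[close[0]] if close else "UNRESOLVED: " + sample_type

for raw in ["Stool", "faeces ", "mouth swab", "rhizosphere"]:
    print(f"{raw!r:15} -> {normalize(raw)}")
```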