The PRC matrices were produced using a custom Lisp function.

A third use of PMN is in genome annotation. Gupta et al. used RNA-seq data from blueberry to annotate a draft genome sequence for the plant. Gene models were BLASTed against metabolic genes from AraCyc and other species-specific pathway genome databases, and the results were used to improve the annotations. The annotations were then used to examine blueberry metabolism. Similarly, Najafabadi et al. took transcriptomes of Ferula gummosa Boiss., a relative of carrot that is the source of the aromatic resin galbanum, and used BLASTx against enzyme-coding genes from PMN to annotate enzyme-coding genes in Ferula.

PMN provides an important resource for organizing plant metabolism information and making it accessible. The study of plant metabolism enables improvement of the productivity, nutrition, and resilience of crop plants, and furthers understanding of how wild plants function in their ecosystems. PMN data and tools have been used by researchers to answer a broad range of biological questions from development to physiology to evolution. The latest release of PMN, PMN 15, has the breadth and depth of metabolic information that should enable an even wider spectrum of questions to be pursued in plant biology.

New plant databases introduced in each version of PMN are Tier 3 BioCyc databases, which indicates that the information is based mostly on automated prediction from the genome.

Any experimentally supported enzymes and pathways in MetaCyc or PlantCyc that are annotated as belonging to the organism are also imported into the database, along with their citations and codes for the type of evidence the cited papers present. The plant’s remaining complement of enzymes is predicted, and its metabolites and pathways are in turn predicted based on the enzymes. Bringing a new species or subspecies into PMN begins with the sequenced and annotated genome with predicted protein sequences. To be considered for inclusion, a genome must pass a quality metric in the form of BUSCO, which assesses genome completeness using a database of proteins expected to be present in all eukaryotes, with matches assessed using HMMER. A score of at least 75% “complete” is required for inclusion in PMN. If a genome passes this metric, it can then be run through the PGDB creation pipeline. First, splice variants are removed, leaving one protein sequence per gene, with the longest variant being retained. The sequences are classified as enzymes or non-enzymes, and enzymatic functions are predicted, using the Ensemble Enzyme Prediction Pipeline (E2P2) software. E2P2 uses BLAST and PRIAM to assign enzyme function based on sequence similarity to proteins with previously known enzymatic functions, drawing on functional annotations from several sources including MetaCyc, SwissProt, and BRENDA. The genomes included in PMN 15 were checked using BUSCO v3.0.2 with the Eukaryota ODB9 dataset. Enzyme prediction for PMN 15 was done using E2P2 v4.0 and RPSD v4.2, which was generated using data from PlantCyc 12.5, MetaCyc 21.5, BRENDA, SwissProt, TAIR, Gene Ontology, and Expasy. Once enzymes are predicted, they must be assembled into pathways by the PathoLogic function of Pathway Tools.
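The first two pipeline steps described above (the BUSCO completeness cutoff and the reduction to one protein per gene) can be illustrated with a minimal Python sketch. The function names, the tuple layout of the input, and the tie-breaking rule (keep the first variant encountered when lengths are equal) are assumptions made for illustration; they are not taken from the PMN pipeline code.

```python
def passes_busco(complete_percent, threshold=75.0):
    """Return True if the BUSCO 'complete' score meets the inclusion cutoff."""
    return complete_percent >= threshold


def longest_variant_per_gene(proteins):
    """Keep one protein per gene, retaining the longest splice variant.

    `proteins` is an iterable of (gene_id, protein_id, sequence) tuples
    (a hypothetical layout). Ties keep the first variant seen (assumption).
    """
    best = {}
    for gene_id, protein_id, sequence in proteins:
        current = best.get(gene_id)
        if current is None or len(sequence) > len(current[1]):
            best[gene_id] = (protein_id, sequence)
    return best


# Toy example with made-up identifiers and sequences.
proteins = [
    ("G1", "G1.1", "MKT"),
    ("G1", "G1.2", "MKTAYIAK"),
    ("G2", "G2.1", "MSL"),
]
selected = longest_variant_per_gene(proteins)
```

In this toy input, gene G1 has two variants and the longer one (G1.2) is retained, while G2 keeps its only variant.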

The set of predicted pathways is then further refined using the Semi-Automated Validation Infrastructure (SAVI) software. SAVI is used to automatically apply broad curation decisions to the pathways predicted for each species. It can be used, for example, to specify particular pathways that are universal among plants and should therefore be included in all species’ databases even if not predicted by PathoLogic. SAVI can also be used to specify that a particular pathway is known to be present only within a specific plant clade; if the pathway is predicted in a species outside of that clade, it is considered a false prediction and removed. PMN 15 was generated using Pathway Tools 24.0 and SAVI 3.1. The final parts of the pipeline are grouped into three stages: refine-a, refine-b, and refine-c. In refine-a, the changes recommended by SAVI are applied to the database, and pathways added or approved by SAVI have SAVI citations added. In refine-b, pathways and enzymes with experimental evidence of presence in a plant species are added to that PGDB if they were not predicted, and appropriate experimental evidence codes are added. In refine-c, authorship information is added to the PGDB, the cellular overview is generated, and various automated data consistency checks are run.

The convention for PGDB versions was updated in PMN 15. Taking SorghumbicolorCyc 7.0.1 as an example, the first number, 7, is incremented when the PGDB is regenerated de novo from a new version of MetaCyc and/or a new genome assembly. The second, 0, is incremented when there are error corrections or other fixes to the content of the database. A third, 1 in the example, may be added when the database is converted to a new version of Pathway Tools without being regenerated, a process that does not alter the database contents.

Since its initial 1.0 release, some changes in curation policy have been made to PMN and PlantCyc.
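The two kinds of SAVI decisions described above (adding universal pathways and removing clade-restricted pathways predicted outside their clade) can be sketched in Python as follows. This is an illustration of the logic only, not SAVI's actual interface: the function name, the rule containers, and all pathway and clade identifiers are hypothetical.

```python
def apply_curation_rules(predicted, species_clades, universal, clade_restricted):
    """Apply two SAVI-style rule types to one species' predicted pathways.

    predicted        -- set of pathway IDs predicted for the species
    species_clades   -- set of clade names the species belongs to
    universal        -- set of pathway IDs to include in every species
    clade_restricted -- dict mapping pathway ID -> the only clade it occurs in
    All four structures are hypothetical stand-ins for SAVI's input files.
    """
    # Universal pathways are added even if they were not predicted.
    curated = set(predicted) | set(universal)
    # Clade-restricted pathways are removed outside their clade.
    for pathway, clade in clade_restricted.items():
        if pathway in curated and clade not in species_clades:
            curated.discard(pathway)
    return curated


# Toy example: a grass species with one false-positive prediction.
curated = apply_curation_rules(
    predicted={"PWY-A", "PWY-B"},
    species_clades={"Viridiplantae", "Poaceae"},
    universal={"PWY-C"},
    clade_restricted={"PWY-B": "Brassicaceae"},
)
```

Here PWY-C is added because it is listed as universal, and PWY-B is dropped because the example species lies outside the clade to which that pathway is restricted.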

In 2013, the Arabidopsis-specific database, AraCyc, switched from identifying proteins by locus ID to using the gene model ID. This eliminates ambiguity when multiple splice variants exist for a single locus. In PMN 10, the policy for all species was switched from using the first splice variant to the longest. This was done because a longer splice variant is likely to have more domains, making it easier to determine its function. In PMN 10, the database narrowed its focus strictly to small-molecule metabolism, and pathways involved solely in macromolecule metabolism were removed. Macromolecules have never been the focus of PMN, and provision of information about them is a role better served by other databases with tools specifically suited to large heteropolymers like proteins and DNA/RNA. In version 13 of PMN, the PlantCyc database was limited to only include pathways and enzymes with experimental evidence to support them. The original purpose of including all information, experimental and computational, in PlantCyc was to allow cross-species comparison, a function now served by the virtual data integration and display functionality recently introduced in Pathway Tools. PlantCyc now serves as a repository of all experimentally supported compounds, reactions, and pathways for plants.

One hundred and twenty PMN pathways were randomly selected to manually assess pathway prediction accuracy. The 126 organism-specific PGDBs were then regenerated using E2P2 and PathoLogic alone: PathoLogic was set to ignore the expected phylogenetic range of each pathway and to call pathway presence or absence based only on the presence of enzymes, SAVI was not applied, and the step of importing pathways with experimental evidence in a species into that species’ database when the pathway was not predicted was skipped. This resulted in a set of PGDBs based purely on computational prediction that we refer to as “naïve prediction PGDBs”.
Biocurators evaluated the accuracy of the prediction of each of the 120 pathways across all 126 organisms in PMN in the naïve prediction PGDBs and, separately, in the released version of PMN. Specifically, we evaluated whether pathway assignments to the PGDBs reflected the taxonomic range of the pathway as expected from the literature. Each pathway’s assignment to the naïve prediction PGDBs and released PGDBs was classified with respect to the expected taxonomic range as either “Expected”, “Broader”, or “Narrower”, or it was identified as a non-plant or non-algal pathway and therefore classified as a non-PMN pathway. The improved accuracy in pathway prediction by incorporating phylogenetic information and manual curation was statistically quantified in R version 3.6.3 with Fisher’s exact test using the fisher.test function.

In order to analyze the pathways, reactions, and compounds (PRCs) in each species’ database, presence-absence matrices were generated for each of these three data types. Each is a binary matrix containing the list of PMN organisms as its rows and a list of PRCs of one type as its columns. Each matrix element is equal to 1 if the organism contains the PRC and 0 if it does not.
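As an illustration of the matrix layout just described, the following Python sketch builds a presence-absence matrix from a mapping of organisms to the PRCs they contain. The function name, the dictionary input, and the species and pathway identifiers are hypothetical; the matrices in this work were produced by a custom Lisp function, which this sketch does not reproduce.

```python
def presence_absence_matrix(organism_to_prcs):
    """Build a binary organism-by-PRC matrix.

    organism_to_prcs -- dict mapping organism name -> set of PRC IDs
                        (hypothetical input layout).
    Returns (row labels, column labels, matrix as a list of lists).
    """
    organisms = sorted(organism_to_prcs)
    prcs = sorted(set().union(*organism_to_prcs.values()))
    matrix = [
        [1 if prc in organism_to_prcs[org] else 0 for prc in prcs]
        for org in organisms
    ]
    return organisms, prcs, matrix


# Toy example with made-up species and pathway IDs.
organisms, prcs, matrix = presence_absence_matrix({
    "SpeciesA": {"PWY-1", "PWY-2"},
    "SpeciesB": {"PWY-2"},
    "SpeciesC": {"PWY-2", "PWY-3"},
})
```

Rows follow the sorted organism names and columns the sorted PRC identifiers, so the matrix can be passed to downstream steps together with its labels.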

Reactions were only marked as present in a species if the species had at least one enzyme annotated to the reaction, whether predicted or from experimental evidence. Since PRCs that are present in either only one organism or all organisms are not useful in differentiating plant groups, we excluded these PRCs from further analysis. Separately, a table was generated that maps the species to one of several pre-defined taxonomic groups. The groups were selected manually to best represent the diversity of species in PMN and included monophyletic and paraphyletic groups, as well as a polyphyletic “catch-all” group. The PRC matrices and the plant group table were used to investigate relationships among the species through the lens of metabolism. The PRC matrices were used to perform multiple correspondence analysis (MCA). This is a technique similar to principal component analysis (PCA) but is frequently used with categorical data. It differs from PCA in that a complete disjunctive table (CDT) is first produced from the input matrix. In a CDT, each multinomial variable i is split into Ji columns, where Ji is the number of levels of variable i. In this analysis, the variables are the pathways, reactions, or compounds (PRCs), and there are two levels for each, present and absent. Each CDT column ji therefore corresponds to one level of one variable and is initially set equal to 1 for species in which that level applies and 0 otherwise. Each group of Ji columns therefore contains, in each row, one column equal to 1 and Ji–1 columns equal to 0. In this analysis, therefore, each pathway results in two columns in the CDT, set to 1 0 if the pathway is present and 0 1 if the pathway is absent. MCA then scales the values of each column in the CDT according to the rarity of that level of that variable, so that each CDT column sums to 1. The remainder of the procedure is the same as in PCA.
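The filtering and CDT construction described above can be sketched in plain Python. This follows the verbal description in the text (two columns per PRC, each column rescaled to sum to 1) and is not a reimplementation of FactoMineR's internals; all function names and the toy matrix are illustrative assumptions.

```python
def drop_uninformative(matrix):
    """Drop PRC columns present in only one organism or in all organisms.

    Columns present in no organism are dropped as well (assumption; such
    columns should not occur in a presence-absence matrix).
    Returns (filtered matrix, indices of the kept columns).
    """
    n = len(matrix)
    keep = [
        j for j in range(len(matrix[0]))
        if 1 < sum(row[j] for row in matrix) < n
    ]
    return [[row[j] for j in keep] for row in matrix], keep


def disjunctive_table(matrix):
    """Expand each binary variable into two columns: present, absent."""
    return [[v for x in row for v in (x, 1 - x)] for row in matrix]


def scale_columns(cdt):
    """Divide each CDT column by its sum so that every column sums to 1."""
    sums = [sum(col) for col in zip(*cdt)]
    return [[v / s if s else 0.0 for v, s in zip(row, sums)] for row in cdt]


# Toy matrix: 4 organisms x 4 PRCs (made-up values).
toy = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 0],
]
filtered, kept = drop_uninformative(toy)  # third and fourth PRCs are dropped
cdt = disjunctive_table(filtered)
scaled = scale_columns(cdt)
```

In the toy data, the second organism is the only one lacking the second PRC, so its "absent" entry for that PRC carries the full column weight of 1.0. This is the sense in which lacking a common PRC moves a species away from the others.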
Because of the scaling, a species will be further from the origin in the MCA scatter plot if it possesses uncommon PRCs or lacks common ones. The MCA was performed using the MCA function of the R package FactoMineR v2.3. The MCA scatter plots were colored using the columns of the plant group table to elucidate relationships between the MCA clusters and plant groups. The scatter plots were generated using ggplot2 v3.3.4. To examine the pathways associated with each MCA axis, the percentage of variance explained by the presence or absence of each pathway, found in pwy.mca$var$contrib, was exported to a tab-delimited text file. To determine which metabolic domains, if any, were overrepresented in the set of pathways describing the variance of MCA dimensions 1 and 3, we ran an enrichment analysis of the set of pathways explaining the 95th percentile of the variance. Pathways were mapped to a metabolic domain using supplementary information from . Pathways left unmatched were manually assigned to a metabolic domain by expert curators, and a new pathway-metabolic domain mapping file, version 2.0, was created. The enrichment background was set as all pathways from PMN’s 126 organism-specific databases, all of which were assigned to metabolic domains. Enrichment was calculated using the phyper function from the R stats package, and p-values were corrected for multiple hypothesis testing at a false discovery rate of 5%.

We downloaded and integrated datasets from 5 existing Arabidopsis root single-cell RNA-seq studies. Briefly, raw fastq files for 21 datasets derived from these five studies were downloaded, trimmed, and mapped using the STARsolo tool v.2.7.1a. Whitelists for each dataset were obtained either from the 10X Cellranger software tool v. 2.0 for the 10X-Chromium samples or, for the Drop-seq samples, by following the Drop-seq computational pipeline and extracting error-corrected barcodes from the final output.
Valid cells within the digital gene expression matrices computed by STARsolo were then determined as those having total unique molecular identifier (UMI) counts greater than 10% of the 1st percentile cell, after filtering for cells with very high UMIs. Cells containing greater than 10% mammalian reads, greater than 10% organellar reads, or cells having transcripts from fewer than 200 genes were filtered out. Filtered digital gene expression matrices were then preprocessed with the SCTransform method of the Seurat package, after removing protoplast-inducible genes.
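The three explicit per-cell filters stated above (mammalian read fraction, organellar read fraction, and number of detected genes) can be expressed as a small Python predicate. The dictionary field names and the example cells are hypothetical, and the UMI-based validity threshold is deliberately left out because the sketch does not attempt to interpret it; the original analysis was carried out with STARsolo and Seurat, not with this code.

```python
def keep_cell(cell, max_mammalian=0.10, max_organellar=0.10, min_genes=200):
    """Return True if a cell passes the three stated quality filters.

    `cell` is a dict with hypothetical keys:
      mammalian_fraction  -- fraction of reads mapping to mammalian sequence
      organellar_fraction -- fraction of reads mapping to organellar genomes
      n_genes             -- number of genes with at least one transcript
    """
    return (
        cell["mammalian_fraction"] <= max_mammalian
        and cell["organellar_fraction"] <= max_organellar
        and cell["n_genes"] >= min_genes
    )


# Toy cells with made-up values.
cells = [
    {"id": "c1", "mammalian_fraction": 0.01, "organellar_fraction": 0.05, "n_genes": 1500},
    {"id": "c2", "mammalian_fraction": 0.20, "organellar_fraction": 0.02, "n_genes": 900},
    {"id": "c3", "mammalian_fraction": 0.00, "organellar_fraction": 0.03, "n_genes": 150},
]
kept_ids = [c["id"] for c in cells if keep_cell(c)]
```

Only the first toy cell passes: the second exceeds the mammalian read cutoff and the third has too few detected genes.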