We now apply this method to analyze the intrinsic geometry of gene expression data.

The optimal Rmax values were as follows: 1.6, 1.9, 1.8, 1.7, 1.9, and 3.0. We used these values to compute the second Betti curve and determine how well it could account for the second integrated Betti value. All reported P values for comparison with experimentally generated Betti curves were obtained by creating, for each candidate geometry, 300 statistically equivalent models. Points for each model were selected randomly according to the density specific to that geometry. The number of points was matched to the number of points in the corresponding experimental data set. On the basis of these simulated point distributions, we computed 300 different Betti curves. These curves were then used to generate a distribution of integrated Betti values or to compute the L1 distance of these curves from the mean Betti curve of this model. The reported P values reflect two-tailed percentiles for where experimental Betti curves fall within the model-generated distributions. We report P values as < 0.003 when none of the samples generated values further from the mean than the observed data point.

The non-metric MDS algorithm embeds a set of points within a pre-specified space while attempting to preserve rank-ordered distances between points. We modified the Euclidean-based non-metric MDS algorithm in MATLAB version 2017a by replacing the Euclidean distance with the hyperbolic distance in Eq. S2. The initial positions of points were uniformly sampled in the optimal 3D hyperbolic space determined in Fig. 1. The radial coordinates were fixed because their range was small and points were approximately positioned on the surface of a sphere.
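To make the distance substitution concrete, the following is a minimal Python sketch rather than the MATLAB code used here. Eq. S2 is not reproduced in this section, so the sketch assumes the standard hyperbolic law of cosines for a space of curvature -1, with each point given by a radial coordinate and a unit direction vector. The function names, the radius value, and the number of points are illustrative assumptions.

```python
import numpy as np

def hyperbolic_distance(r1, u1, r2, u2):
    """Distance between two points in hyperbolic space of curvature -1.

    Each point is described by a radial coordinate r >= 0 and a unit
    direction vector u (its angular position). Assumed form:
    cosh(d) = cosh(r1) cosh(r2) - sinh(r1) sinh(r2) cos(angle).
    """
    cos_angle = np.clip(np.dot(u1, u2), -1.0, 1.0)
    cosh_d = np.cosh(r1) * np.cosh(r2) - np.sinh(r1) * np.sinh(r2) * cos_angle
    # Guard against rounding slightly below 1, where arccosh is undefined.
    return np.arccosh(np.maximum(cosh_d, 1.0))

def random_directions(n, rng):
    """Draw n directions uniformly on the unit sphere in 3D."""
    v = rng.normal(size=(n, 3))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

rng = np.random.default_rng(0)
R = 1.6  # illustrative fixed radius; all points share it, as on a sphere
dirs = random_directions(4, rng)

# Pairwise hyperbolic distances when only the angular coordinates vary.
D = np.array([[hyperbolic_distance(R, a, R, b) for b in dirs] for a in dirs])
```

With the radial coordinates held fixed, only the direction vectors would change during optimization, which is consistent with the description above of points lying approximately on the surface of a sphere.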

The algorithm updated the angular coordinates to minimize the mismatch in the rank order of distances. The iterations ended when this error fell below a threshold of 0.001. Because the MDS algorithm can return an arbitrary rotation of the space, in Fig. 3 we used the Procrustes algorithm to align the positions of odors between the strawberry and tomato data sets, using the strawberry data set, which had the most odors, as an anchor. The Procrustes alignment was carried out with the Procrustes function in MATLAB, and the scale and translation components were set to 1 and 0, respectively.

One of the great challenges of modern biology is to understand how the genotype of an organism impacts its phenotype, such as disease risk. The difficulty of this problem stems from the complexity of this relationship, where thousands of genes can affect a phenotype of interest through nonlinear interactions. In the past 15 years, genome-wide association studies have demonstrated that a range of traits, including those related to metabolic and mental health disorders, are potentially linked to thousands of genes, with each gene explaining only a small fraction of the expected heritability. At the same time, correlations between genes are widespread. These observations raise the possibility that genetic variation and gene expression can be described by a low-dimensional geometry. Identifying this geometry would make it easier to find relevant gene combinations and to determine how they impact a given trait. Traditional approaches to finding low-dimensional spaces, such as principal component analysis, assume that the space is “flat” and evaluate distances between points according to the Euclidean metric.

Recently, hyperbolic spaces have attracted a lot of attention both for the analysis of biological data and in computer science. The reason for this interest is that the hyperbolic metric approximates the exponential expansion of possible states of a system described by a hierarchical tree-like process. Hierarchical representations, such as phylogenetic trees and clustering clades, have long been used to characterize differences between cells, proteins, the activity of metabolic networks within cells, and human brain functional networks. This suggests that the hyperbolic metric should be considered as one of the possibilities when searching for the low-dimensional geometry in biological data. At the same time, any hyperbolic geometry can locally be approximated using Euclidean geometry. Therefore, in this work we focus on comparing the signatures of Euclidean and hyperbolic geometry. For completeness, we also include results from spherical geometry, which has positive curvature and represents the last of the three possible geometries with constant curvature.

In this work we pursue two goals. The first goal is to develop a quantitative test for distinguishing the curvature of the underlying low-dimensional geometry. We show that this can be achieved by performing non-metric multi-dimensional scaling using both the Euclidean and the hyperbolic metric and comparing the results. Our second goal is to develop visualization tools for data that exhibit a low-dimensional hyperbolic geometry. Many of the current state-of-the-art visualization tools, such as k-means clustering, local linear embedding, t-distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection, use the Euclidean metric. We propose a method for incorporating the hyperbolic metric into the t-SNE method and show that this leads to improved visualization across a range of datasets.

To demonstrate the utility of both the diagnostic method and the hyperbolic t-SNE, we apply these methods to a range of gene expression data sets from mouse and human. These datasets uniformly show that gene expression data across different cell types exhibit a low-dimensional hyperbolic geometry. The curvature of this space, which is related to the branching ratio of the corresponding tree-like process, was systematically higher in differentiated cell types compared to embryonic cells, and took even larger values for brain cells. These results demonstrate that gene expression data can be effectively described using a small number of coordinates under the hyperbolic metric. Visualizations using the hyperbolic metric consistently showed more accurate representations, in terms of both local and large-scale structure, including more consistent estimates of developmental states in datasets where pseudo-time trajectories could be constructed.

Multi-dimensional scaling (MDS) has been widely used to embed a set of data points into a geometric space in a way that attempts to best preserve the distances between points in the original space. Metric MDS tries to make the embedding distances proportional to the input distances, while non-metric MDS preserves only the ordinal values, allowing a monotonic nonlinear transformation between the distances. Both metric and non-metric MDS in high-dimensional Euclidean space have been well studied during the past few decades. However, MDS in hyperbolic space has not yet been fully developed. Several metric MDS algorithms have been proposed recently for embedding data into hyperbolic space, offering advantages over Euclidean visualizations in terms of distance preservation, space capacity, trajectory inference, and prediction for unseen data. However, we find that metric MDS does not correctly distinguish between Euclidean and hyperbolic geometry of input data, whereas non-metric MDS does.
The reason for this is that non-metric MDS matches the rank order instead of the exact values of the data distances. The resulting nonlinear distortions in embedding distances can be used as indicators of a geometry mismatch between data and embedding points. When using non-metric MDS, we illustrate that as soon as there is a mismatch between native and embedding geometry, a nonlinear distortion appears in the scatter plots of embedding distances versus input data distances. These scatter plots are known as Shepard diagrams. When Euclidean data are embedded into a hyperbolic space, the Shepard diagram has negative convexity. When hyperbolic data are embedded into Euclidean space, the Shepard diagram has positive convexity. Thus, the convexity of the Shepard diagram can indicate the difference in geometric properties between the embedding and native spaces, and in particular the difference in curvature. When using metric MDS, the Shepard diagram shows increased spread but does not yield a nonlinear relationship upon embedding Euclidean data into hyperbolic space. The reason is that Euclidean distances can be fully embedded into the faster-expanding hyperbolic space, which masks the distortion of distances; this does not happen in non-metric MDS. In what follows we apply non-metric MDS to synthetic data and to several real gene expression datasets to detect their hidden geometry, and we refer to non-metric MDS simply as MDS for brevity.

When cells are characterized according to the expression of thousands of genes, the number of genes represents the nominal dimension of the representation space. However, the real dimension of the gene expression space might be much lower. Furthermore, the true geometry of the hidden space is not necessarily Euclidean.
Therefore, in this section we analyze the signatures of low-dimensional geometry of constant curvature in the situation where each data point is described with respect to a large number of variables. In the synthetic examples below, the points are first sampled from a low-dimensional geometry and then embedded into a high-dimensional Euclidean space. This step is included to mimic the analysis of experimental data, where each data point is evaluated according to a large number of measurements. After this, the data points are embedded into spaces of different curvatures to determine indicators through which the properties of the original low-dimensional space can become apparent.
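The convexity of a Shepard diagram, used above as the indicator of a geometry mismatch, can be summarized numerically in several ways. The Python sketch below uses one illustrative choice that is not taken from the text: the quadratic coefficient of a second-order polynomial fit of embedding distance against input distance. It also replaces the Euclidean MDS lifting step with a random isometric linear map from 5D to 50D, an assumption made only to keep the example short. All function and variable names are hypothetical.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance matrix for an (n, d) array of points."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def upper_triangle(dist):
    """Flatten the strict upper triangle of a square distance matrix."""
    i, j = np.triu_indices(dist.shape[0], k=1)
    return dist[i, j]

def shepard_convexity(d_input, d_embed):
    """Quadratic coefficient of embedding distance vs. input distance.

    Both vectors are rescaled to [0, 1], so mainly the sign is informative:
    positive for a convex Shepard diagram, negative for a concave one.
    """
    x = d_input / d_input.max()
    y = d_embed / d_embed.max()
    quad, _lin, _const = np.polyfit(x, y, deg=2)
    return quad

rng = np.random.default_rng(0)
low = rng.uniform(-1.0, 1.0, size=(100, 5))   # 100 points sampled in 5D

# Assumed stand-in for the lifting step: a matrix with orthonormal columns
# maps 5D into 50D without changing any pairwise distance.
q, _ = np.linalg.qr(rng.normal(size=(50, 5)))
high = low @ q.T                              # shape (100, 50)

d_low = upper_triangle(pairwise_distances(low))
d_high = upper_triangle(pairwise_distances(high))
convexity_flat = shepard_convexity(d_low, d_high)  # close to zero here
```

Because the lift is an isometry, the Shepard diagram is a straight line and the fitted quadratic coefficient is numerically zero. A nonzero value of either sign would only be expected once the embedding geometry differs from the native one.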

In the examples below, we focus primarily on hyperbolic and Euclidean geometries, because hyperbolic geometry describes hierarchically organized data, whereas the Euclidean metric is often the only feasible metric for computing distances between high-dimensional vectors. Comparison with the results for spherical spaces is provided in Figure S1. First, we analyze the case where the data have a 5D Euclidean underlying geometry. To simulate this case we randomly sample 100 points from a 5D Euclidean space and use Euclidean MDS (EMDS) to embed the points into 5D, 10D, 50D, and 100D spaces, respectively. This step emulates the representation of real data, where each cell is characterized by a large number of measurements and the distances between points are measured according to a Euclidean metric. The embeddings with different numbers of dimensions correspond to cases where measurements are taken with respect to different numbers of genes. As expected, the distances of synthetic 5D Euclidean points can be preserved without distortion when embedding data into Euclidean spaces of higher dimensions. This is evidenced by the linearity of the Shepard diagrams in the left column of Figure 7A. Next, we apply EMDS and hyperbolic MDS (HMDS) to the points in the Euclidean representation space, as we did in Figure 6A-B. As one can see in Figure 7A, Euclidean embeddings of these data do not generate distortions in the Shepard diagrams, but hyperbolic embeddings yield Shepard diagrams with negative convexity that is largely independent of the embedding dimension. This indicates that the data have an underlying Euclidean geometry.

We first analyze a discrete gene expression data set from Lukk et al. In that paper, the authors integrated microarray data from 5372 human samples representing 369 different cell and tissue types, disease states, and cell lines, yielding a data set with a complex global structure.
They constructed a global gene expression map by performing principal component analysis and found that the first two principal axes described variation in biological variables corresponding to hematopoietic and malignancy properties. However, the presence and properties of the underlying low-dimensional geometry, and how the samples are organized in the space, remain to be investigated. Several previous studies showed that gene expression was stochastic both at the single-cell level and at the population level, and that the expression profiles of samples within the same cluster were dominated by intrinsic noise. This would imply either Euclidean geometry, at least locally, or a lack of geometric structure altogether. On a global scale, biological systems usually show a hierarchical structure, which would imply hyperbolic geometry. Therefore, we separately probe the geometry of gene expression data at the local and global scales.

To probe local geometry, we apply the k-means method to cluster the whole data set and randomly select 100 samples from a single cluster. Similarly to Figure 7, we use increasing subsets of genes to represent samples and then perform EMDS and HMDS embeddings for geometry detection. Increasing the number of probes with respect to which samples were characterized corresponds to increasing the dimensionality of the initial Euclidean embedding, as in Figure 7. We find that this does not significantly change the convexity of the Shepard diagram in either EMDS or HMDS. These results match the fitting in Figure 7A and indicate that the samples taken from the same cluster have Euclidean structure, even when all the probes are used. Additional analyses show that the Euclidean structure is indeed caused by the stochastic Gaussian expression of genes among the samples within a cluster.
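The local-geometry probe described above (pick one cluster, draw 100 samples, and represent them with increasing subsets of probes) can be sketched in Python as follows. This is an illustrative outline, not the analysis code: the expression matrix and cluster labels are random placeholders standing in for the real data and for the output of a k-means run, and the probe counts, array sizes, and function names are hypothetical. The resulting distance matrices would be the inputs to the EMDS and HMDS embeddings.

```python
import numpy as np

def sample_cluster(labels, cluster_id, n_samples, rng):
    """Randomly pick n_samples distinct sample indices from one cluster."""
    members = np.flatnonzero(labels == cluster_id)
    return rng.choice(members, size=n_samples, replace=False)

def distance_matrix(x):
    """Pairwise Euclidean distances between the rows of x."""
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    d = np.sqrt(np.maximum(d2, 0.0))   # clip tiny negatives from rounding
    np.fill_diagonal(d, 0.0)
    return d

rng = np.random.default_rng(0)

# Placeholder data: rows are samples, columns are probes.
expression = rng.normal(size=(1000, 2000))
# Placeholder labels; in practice these would come from k-means clustering.
labels = rng.integers(0, 5, size=1000)

chosen = sample_cluster(labels, cluster_id=0, n_samples=100, rng=rng)

# Represent the same 100 samples with increasing random subsets of probes.
distances = {}
for n_probes in (50, 200, 1000, 2000):
    probes = rng.choice(expression.shape[1], size=n_probes, replace=False)
    distances[n_probes] = distance_matrix(expression[chosen][:, probes])
```

Each entry of `distances` is a 100 x 100 matrix computed from a different number of probes, which mirrors the way the dimensionality of the initial Euclidean representation is varied in the text.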