
Ultimately, a broader benchmarking effort would be needed to understand when each approach should be used, similar to benchmarking efforts in the environmental DNA sequencing community. In addition to providing a framework for chemically informed sample comparisons within a dataset, Qemistree also provides a framework for comparing independently processed datasets. In the Qemistree workflow, we represent chemical features as their molecular fingerprints; this representation is largely independent of technical variation such as chromatography shifts across mass spectrometry experiments. Therefore, the chemical content of samples from different experiments can be compared using a fingerprint-based representation without the need to repeat feature detection and feature alignment. This workflow is similar to how large-scale sample comparisons are made possible in sequence-based analyses, where datasets are processed upfront and rapidly co-analyzed according to the users’ requirements. Extending these applications to mass spectrometry data would allow metabolomics investigations at the scale of the Earth Microbiome Project and the American Gut Project to find global biochemical patterns. However, there is a need to benchmark experimental protocol comparability, as well as to establish community-adopted standards that facilitate the global reuse of data. While these problems are substantial, we have seen examples of communities coming together to solve these issues for systematic and global data comparability.
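To make the fingerprint-based comparison concrete, the following is a minimal, hypothetical sketch (not the Qemistree implementation), assuming chemical features are already encoded as binary molecular fingerprint vectors: pairwise Jaccard distances (1 minus Tanimoto similarity) between fingerprints are clustered hierarchically into a tree that downstream tree-based tools can consume.

```python
# Minimal illustrative sketch (not the Qemistree implementation): given binary
# molecular fingerprints for chemical features from independently processed
# experiments, build a similarity tree without repeating feature detection.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, to_tree

# Hypothetical fingerprints: rows = chemical features, columns = fingerprint bits.
fingerprints = np.random.default_rng(0).integers(0, 2, size=(6, 128)).astype(bool)

# Jaccard distance on binary fingerprints, i.e. 1 - Tanimoto similarity.
distances = pdist(fingerprints, metric="jaccard")

# Average-linkage hierarchical clustering yields a tree relating features by
# chemical similarity, which tree-based downstream tools can consume.
tree = to_tree(linkage(distances, method="average"))
print("tree with", tree.count, "leaf features")
```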

In summary, we introduce a new tree-based approach for computing and representing chemical features detected in tandem mass spectrometry-based untargeted metabolomics studies. A hierarchy enables us to leverage existing tree-based tools, and can be augmented with structural and environmental annotations, greatly facilitating analysis and interpretation. We anticipate that Qemistree, as a data organization and comparison strategy, will be broadly applicable across fields that perform global chemical analysis, from medicine to environmental microbiology to food science, and well beyond the examples shown here.

As the global population continues to expand, “it is estimated that food production will need to increase by 60% by 2050 to feed the estimated 10 billion people expected on Earth. An increase in production along with a reduction in food loss due to pests and pathogens and food waste will be needed to meet demand”. Crop loss resulting from plant diseases and pests poses a formidable challenge for crop growers worldwide. Plant diseases and pests lower the product quality or shelf life of crops, decrease the nutritional value of vegetables and fruits, and reduce crop yield. Diseases caused by fungal pathogens alone can cause crop losses of 10% to 20% each year. The Food and Agriculture Organization of the United Nations estimates that 20% to 40% of global crop production is lost to pests annually. Each year, plant diseases cost the global economy around $220 billion, and invasive insects around $70 billion. A challenge crop growers face is accurately identifying the disease responsible for their crop losses.

The identification process is particularly challenging because some plant diseases exhibit similar symptoms, especially during the early stages of infection. Consequently, discerning these nuanced distinctions becomes a daunting task for the human eye. Often, crop growers recognize a disease only after it has significantly affected their crops, or when the infection or infestation has persisted over a prolonged period of time, leading to observable alterations in leaf appearance or crop loss. Proper disease identification is crucial, because employing the wrong treatment wastes time and financial resources and can cause further crop loss or damage.

To facilitate the identification of plant diseases, Mohanty et al. proposed a novel approach in their paper “Using Deep Learning for Image-Based Plant Disease Detection”. The researchers explored the use of deep learning convolutional neural network models to discern various types of plant diseases. The data set in their study was obtained from the PlantVillage project and encompasses 54,306 color images depicting 14 distinct crop species that are either healthy or afflicted with one of 20 different disease types. The authors conducted an extensive investigation, comparing the effectiveness of color images versus gray-scale and segmented images, exploring various training-validation-testing splits, comparing the outcomes of training models from scratch versus using pre-trained models, and evaluating the performance of GoogLeNet and AlexNet, two different deep convolutional neural network architectures.

Through the systematic exploration of these factors, they conducted a total of 60 experiments to ascertain the optimal combination of architectural configurations. While the study by Mohanty et al. primarily focused on deep convolutional neural networks, subsequent research has demonstrated that convolutional neural networks are not the sole approach to achieving excellent performance in image classification tasks. These claims come from the paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” by Alexey Dosovitskiy et al. Their paper finds that a pure transformer applied directly to sequences of image patches, when pre-trained on substantial volumes of data and transferred to multiple mid-sized or small image recognition benchmarks such as ImageNet, CIFAR-100, or VTAB, can yield highly competitive outcomes [DBK21]. The Vision Transformer architecture has shown remarkable performance compared to state-of-the-art convolutional networks while also significantly reducing the computational resources required for training. Consequently, the motivation for this project is to loosely follow the framework outlined in the study by Mohanty et al.; however, instead of employing a deep convolutional network architecture, a Vision Transformer model pre-trained on the ImageNet-21k data set will be used for transfer learning to train a disease classification model on the PlantVillage data set.

The data for this project is the PlantVillage data set, which was found through the paper “Using Deep Learning for Image-Based Plant Disease Detection”. The data consists of 54,306 color images of healthy and diseased crop leaves; in a machine learning sense, our data set of 54,306 images is considered small. Each image is 256 × 256 pixels, has three color channels (RGB), and is categorized under a crop-disease classification label. Each label combines a crop species name with either a plant disease name or “healthy”. There are 14 different crop species and 20 different crop diseases, which together form the 38 distinct crop-disease classification labels in this data set. See Table 2.1 for a detailed list of all classification labels, the total number of images each label contains, and the corresponding percentage contribution to the data set. Classification labels with over 5,000 images have the largest percentage contributions: Orange-Haunglongbing with 10.1%, Tomato-Yellow Leaf Curl with 9.9%, and Soybean-Healthy with 9.4%. The classification labels with fewer than 500 images, and therefore the smallest percentage contributions, are Peach-Healthy, Raspberry-Healthy, and Tomato-Mosaic Virus (0.7% each), Apple-Cedar Apple Rust (0.5%), and Potato-Healthy (0.3%).

The top three diseases in this data set are Bacterial Spot, Haunglongbing, and Yellow Leaf Curl. Bacterial Spot, which makes up 10% of the data, is a bacterial disease that affects many crops, causing their leaves to develop yellow spots that turn brown in the middle and their fruits to develop black or brown spots of rot. Haunglongbing makes up 10% of the data set; this bacterial disease affects citrus trees, causing their fruits to stay green and fall to the ground early, before ripening.
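The per-label counts in Table 2.1 and the diseased-versus-healthy split discussed below (Table 2.4) can be tallied with a short script. The following is a minimal sketch, assuming a hypothetical local copy of the images organized as one folder per crop-disease classification label; the path and folder layout are assumptions, not part of the original data release.

```python
# Minimal sketch: count images per classification label and their percentage
# contribution, then aggregate a healthy vs. diseased split. Assumes a hypothetical
# layout of one folder per label, e.g. data/Orange-Haunglongbing/ (an assumption).
from pathlib import Path

data_dir = Path("data")  # hypothetical location of the 54,306 images
counts = {
    folder.name: sum(1 for f in folder.iterdir() if f.is_file())
    for folder in sorted(data_dir.iterdir())
    if folder.is_dir()
}
total = sum(counts.values())

for label, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print(f"{label:40s} {n:6d} images  {100 * n / total:4.1f}%")

healthy = sum(n for label, n in counts.items() if label.endswith("Healthy"))
print(f"healthy: {100 * healthy / total:.1f}%  diseased: {100 * (total - healthy) / total:.1f}%")
```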

Haunglongbing is common among citrus crops, but keep in mind that this data set only contains images of this disease affecting oranges. Yellow Leaf Curl composes 9.9% of the data set and is a viral infection that only affects tomatoes. “Yellow leaf curl virus is undoubtedly one of the most damaging pathogens of tomatoes, and it limits the production of tomatoes in many tropical and subtropical areas of the world. It is also a problem in many countries that have a Mediterranean climate, such as California. Thus, the spread of the virus throughout California must be considered a serious potential threat to the tomato industry”. Note that diseased images make up 72.2% of this data set; the remaining 27.8% are healthy crop images. See Table 2.4 for a comparison of the numbers of diseased and healthy crop images. The crops that contributed diseased images are Apple, Bell Pepper, Cherry, Grape, Maize, Orange, Peach, Potato, Strawberry, Squash, and Tomato. The crops that contributed healthy images are Apple, Bell Pepper, Blueberry, Cherry, Grape, Maize, Peach, Potato, Raspberry, Soybean, Strawberry, and Tomato.

For a visual representation of the diseased and healthy crops, see Figure 2.1, which shows nine different crop images with their classification labels above them to identify the crop name and the disease or healthy status. The images in Figure 2.1 are crop images of the classification labels Apple-Apple Scab, Apple-Healthy, Peach-Healthy, Grape-Healthy, Raspberry-Healthy, Soybean-Healthy, Grape-Black Rot, and Peach-Bacterial Spot.

While the Transformer architecture has become the de facto standard for natural language processing tasks, its application to computer vision has remained limited. This was the motivation for Dosovitskiy et al. to investigate implementations of the transformer model for image classification tasks, which led to the creation of the vision transformer. The vision transformer works similarly to the transformer architecture used for natural language processing. In natural language processing, sentences are broken down into words, and each word is treated as a sub-token of the original sentence. Similarly, the vision transformer breaks an image down into smaller patches, each patch representing a small sub-section of the original image. To see how sentences are broken down into word tokens and images into patches, see Figure 2.2. Keep in mind that the position of each image patch is important: if the image patches are out of order, then the original image is also out of order.

“After turning the image into a sequence of 256 image patches, the linear projection layer attempts to transform the patch arrays into vectors while maintaining their physical dimensions, meaning that similar image patches should be mapped to similar patch embeddings”. The vision transformer is designed to start with an extra learnable class embedding, prepended at position 0, which represents the start of the image and the sequence of patches to come. This extra learnable class embedding allows the model to learn embeddings specific to each classification label. “The pre-training function of the vision transformer is based solely on the classification label given; therefore, the learnable class embedding is even more important for successfully pre-training the vision transformer model”. Without the learnable class embedding, the transformer will not understand the classification labels attached to each image.
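The patch-and-project step described above can be sketched in a few lines of PyTorch. This is an illustrative toy example, not the exact ViT implementation, and the 768-dimensional embedding size is assumed from ViT-Base; a 256 × 256 RGB image cut into 16 × 16 patches yields (256 / 16)² = 256 patches, matching the sequence of 256 image patches quoted above.

```python
# Illustrative sketch of patch tokenization and linear projection (not the exact
# ViT code): a 256 x 256 RGB image becomes 256 flattened 16 x 16 patches, each
# mapped by a linear layer to a patch embedding vector.
import torch

image = torch.rand(3, 256, 256)        # channels x height x width
patch_size, embed_dim = 16, 768        # 768 assumed from ViT-Base

# Cut the image into a grid of non-overlapping patches and flatten each patch.
patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch_size * patch_size)
print(patches.shape)                   # torch.Size([256, 768]) -> 256 patch vectors

# The linear projection maps each flattened patch to a patch embedding; the class
# embedding and positional embeddings are added afterwards, as described next.
projection = torch.nn.Linear(3 * patch_size * patch_size, embed_dim)
patch_embeddings = projection(patches) # shape: (256, embed_dim)
```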
To keep the order of the sequence of patches that make up the image, the patches are supplemented with positional embeddings. “For the vision transformer, these positional embeddings are learned vectors with the same dimensionality as our patch embeddings. Positional embeddings are learned during pre-training and sometimes during fine-tuning. After creating the patch embeddings and prepending the classification label embedding, they are summed with the positional embeddings”. Finally, the summed embeddings are passed to the transformer encoder. After the entirety of the image has been shown to the transformer encoder, the model has learned that image under the given classification label. For a clearer visual example, see Figure 2.4, which illustrates how the image patches are linearly projected and embedded, how the sequence receives a learnable class embedding, and how the image is finally shown to the transformer encoder so that the model learns the given image for the given classification label.

This project will implement the vision transformer developed by Dosovitskiy et al. and described in their paper. The framework of their ViT model will be accessed through the Hugging Face platform and its transformers package in Python. The vision transformer model comes pre-trained on ImageNet-21k, a benchmark data set consisting of 14 million images and 21,000 classes. The model has been pre-trained on images of pixel size 224 × 224; therefore, any data to be further trained on this model must also be resized to 224 × 224 pixels.

“Data augmentation is the process of transforming images to create new ones for training machine learning models”. “Data augmentation increases the number of examples in the training set while also introducing more variety in what the model sees and learns from.”
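The model setup described above can be sketched as follows: a minimal, hypothetical fine-tuning setup using the Hugging Face transformers package, loading the ImageNet-21k ViT checkpoint with a fresh 38-label classification head. The checkpoint name and example image path are assumptions, and the training loop and augmentation pipeline are omitted.

```python
# Minimal sketch of loading the ImageNet-21k pre-trained ViT for fine-tuning on the
# 38 PlantVillage crop-disease labels (checkpoint name and image path are assumptions).
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image

checkpoint = "google/vit-base-patch16-224-in21k"
processor = ViTImageProcessor.from_pretrained(checkpoint)
model = ViTForImageClassification.from_pretrained(checkpoint, num_labels=38)

# The processor resizes each 256 x 256 PlantVillage image to the 224 x 224 input
# size the model was pre-trained on; the class token and positional embeddings are
# handled internally by the model.
image = Image.open("data/Apple-Apple Scab/example.jpg")  # hypothetical file
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)                              # (1, 38) class scores
```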

