Application of neural networks to the prediction of a phenotypic trait of pacific lampreys based on single nucleotide polymorphism (SNP) genetic markers

The relationship between single nucleotide polymorphisms (SNPs) and phenotypes is noisy and cryptic due to the abundance of genetic factors and the influence of environmental factors on complex traits, which makes the idea of applying artificial neural networks (ANNs) as universal approximates of complex functions promising. In this study, we compared different ANN architectures and input parameters to predict the adult length of Pacific lampreys, which is the primary indicator of their total migratory distance. Feedforward and simple recurrent network architectures with a different range of input parameters and different sizes of hidden layers were compared. Results indicate that the highest performing ANN had an accuracy of 67.5% in discriminating between long and short specimens. Sensitivity and specificity were 62.16% and 70.73%, respectively. Our results imply that feedforward ANN architecture with a single hidden neuron is enough to solve the problem of specimen classification. Nonetheless, while ANNs are useful at approximating functions with unknown relationships in the case of SNP data, additional work needs to be performed to ensure that the chosen SNP markers are related to functional regions related to the examined trait, as the use of non-specific markers will result in the introduction of noise into the dataset.


Background
Migratory behavior is the long-distance movement of individuals, which mostly occurs on a seasonal basis. As one of the most well-known and studied phenomena in behavioral biology, migratory behavior can be observed in most animal species. Animals with unpredictable migratory patterns represent a challenge to ecologists working on effective population management and conservation, as their movement patterns are influenced by multiple events spread across a wide geographic range, often encompassing international borders [1]. The signals that start migratory behavior are largely environmental and are usually related to the length of day in bird species, or the water temperature in fish migration. However, there is evidence that genetics plays an important role in the migratory predisposition of individuals [2]. The genetic molecular mechanisms regulating migratory behavior have already been studied in birds, and genes that control such behavior have been discovered [3].
Single nucleotide polymorphisms (SNPs) have previously been used to predict a variety of traits in numerous species, ranging from quantitative [4,5], to discrete traits, such as eye color [6]. SNPs are single base sequence variations between individuals at a specific position in the genome. They are abundant in the genome of humans and animals, and are commonly used to differentiate between individuals of a species [7].
The Pacific lamprey (Entosphenus tridentatus) has recently been studied [8] regarding SNP markers that could be used to predict the migratory behavior of an individual. In that study, three SNP markers that can be used as efficient predictors of migratory behavior in this species have been elucidated. The main characteristic found to be indicative of the migratory behavior of such individuals was the total body length, as it was noticed that shorter fish are less likely to exhibit long distance migratory behavior [8]. Pacific lampreys have an important role in the ecosystem, serving as a buffer for salmon from predators and acting as an important sustenance food and cultural symbol for many tribes living along the Pacific coast [9]. Pacific lampreys are a highly dispersive, anadromous type of fish, which lacks a strict homing site. Instead, Pacific lampreys seem to locate their spawning sites based on pheromonal cues [10,11]. This makes the ability to predict their movements and migratory behavior both challenging and important from a conservatory perspective. The population of Pacific lampreys is on the decline due to environmental issues, inadequate dam design impeding their spawning migration, and prejudice caused by the popular opinion that lampreys are an invasive parasitic species, despite being indigenous to the Pacific coastal area [12,13]. Their migratory behavior has previously been studied using passive integrated transponder (PIT) tagging, where the correlation with migratory distance and length was observed in adult specimens [14].
Artificial neural networks are attractive for applications in genetics [15][16][17][18], as the relationship between SNPs and phenotypes is noisy and cryptic due to the abundance of genetic factors and the influence of environmental factors on complex traits [19,20]. They are applicable to various biological problems for categorization, such as discriminating between wild and domesticated populations of salmon and trout, as well as regression problems, such as predicting the sulphur removal by Acidithiobacillus species [21,22].
This study compares the predictive ability of two different neural network architecture types, a varying number of hidden nodes, as well as different input parameters and training data distributions. The target parameter is the total adult body length of individual Pacific lampreys, and is based on a previously published [8].

Methods
For problems of this type, the most frequently used ANNs are linear feedforward, and recurrent (feedback architecture), such as the Elman neural network [23][24][25]. Therefore, we have made a direct comparison of the neural network types on identical datasets to determine which is most suitable for predicting a phenotypical trait based on SNP data.

Feedforward networks
Single layer feedforward networks are frequently used for regression problems and forecasting. A linear feedforward neural network is often sufficient to properly perform classification tasks and is also applicable to regression tasks [26][27][28]. They are models in which information travels in one direction without any loops or cycles between the input and output. Neurons are assigned random weights at the beginning, and the sum of the products (linear combination) of the weights and inputs is calculated at each neuron. If the value obtained is greater than a given threshold value, the neuron "fires" and assumes the activated value. If the threshold is not reached, it assumes a deactivated value. The training of a network depends on outputs obtained. In the case of using the delta rule, the error is calculated between the predicted and target data, and the weights of the neurons are adjusted based on the error. This "backpropagation" process is repeated until a sufficiently low level of error is reached, or until a predefined cutoff point is reached [29]. A representation of a feedforward neural network with one neuron in the hidden layer, and nine input neurons is shown in Figure 1.
In this study, the input to each hidden neuron is a linear combination of a vector of weights, input SNP variants and a "bias" weight for the feedforward networks. The input to each neuron is obtained as represented in Equation 1. The result is then transformed via the sigmoid activation function f t (Equation 2) to produce the hidden neurons output value.
The output layer consists of neurons as well. The inputs to neurons in the output layer is a linear combination of outputs of neurons in the hidden layer, weights of the output layer q, and an output layer bias neuron b. The value obtained is transformed by the linear transformation function p t (.) to generate the value of the predicted adult length of an individual, as presented in Equation 3.
During the training of the neural network, the optimal weights are established using the Levenberg-Marquardt algorithm (LMA), which is commonly used for training ANNs [30,31], and which minimizes the error between the predicted and the actual weight [32]. This is performed by using a process of backpropagation that continues until an optimal mean error squared level is reached or stopping criteria was fulfilled.

Elman neural networks
Elman neural networks have feedback architecture and are also referred to as recurrent neural networks. This architecture, in addition to the layers found in the feedforward network architecture, also has a "context" layer which saves the unweighted outputs of the previous iterations hidden layer, thus giving the neural network a sort of short term memory, or context that feedforward networks do not possess [33]. Elman networks are identical to feedforward ones in their first iteration, due to no context layer being present. After the first iteration, the context layer is formed by the previous iterations of the hidden layer, thus resembling a three-layer feedforward network, with one layer being a copy of the previous iterations hidden layer. A representation of an Elman network with one hidden neuron, and nine input neurons is shown in Figure 2. Equation 4 was used to calculate the hidden neuron values in the Elman networks; it is very similar to Equation 1 with the only difference being the addition of the context layer inputs.  The Elman network was trained using the Gradient Descent with Momentum & Adaptive Learning Rate training method, which is a backpropagation algorithm commonly used for training recurrent ANNs [34]. The training continued until a minimal level of error was reached, or stopping criteria was fulfilled.

Dataset creation
A dataset containing 797 individuals that were genotyped for 94 total markers has been collected [8], with additional data describing their size, weight and migratory distance travelled. Three of those markers were determined to have a correlation with total adult length, and were found to be linked to genes with association to morphological traits (Table 1).
In order to compare the effects of SNP markers that have no direct correlation to the predicted trait, we have constructed three separate datasets. One dataset contained only the SNP markers that were determined to be related to adult length (dataset S3). A second dataset containing the S3 SNPs and seven arbitrarily chosen SNPs (S10), and a dataset containing all SNPs included in the study (S94). Since SNPs cannot be represented as continuous variables, and since each SNP has multiple possible variants in which it appears, each variant was used as a flag value, totaling in the neural networks having three times the number of SNPs as input neurons. An exception was the S94 dataset, which contained two SNPs that had two variants instead of three, resulting in a total of 280 input neurons. An example of this is shown in Table 2. The target value in the datasets was the total length of the individual.
In order to investigate what effect the distribution of the test data has on the training of the neural network, we have devised three different schemes of splitting the data, H, T and Q, described in Table 3. In total, this resulted in nine different input and test datasets.
The output value of the ANNs is the length of the individual expressed in millimeters. This output value is normalized to values between 0 and 1 (Equation 5), as this is a standard procedure done in order to obtain better initial weights and make the training faster [29].

Neural network training and performance measurement
We examined the accuracy and performance of neural networks that employed the feedforward architecture with the Levenberg-Marquardt training method, and the recurrent neural network architecture, also known as Elman architecture, with the Gradient Descent with Momentum and Adaptive Learning Rate training algorithm. Both are popular optimization algorithms used in the ANN domain, and are the default algorithms used in MATLAB for feedforward, and Elman networks, respectively [35]. The number of hidden neurons, as well as different input values, and training dataset distributions were examined. The number of neurons in the hidden layer was repeatedly increased by a single neuron, starting at one and continuing to twenty hidden neurons, which we have chosen as an arbitrary stopping point.
Mean absolute error (MAE) and Pearson's correlation coefficients were used to measure the ANN predictive performance. The Mean absolute error was calculated based on equations (6) and (7).
f i -predicted value, y i -actual value, e i -error value, n -number of samples.
Pearson's correlation coefficient of the predicted and actual values (r) was calculated as a measure of linearity between input and output values using Equation 8, where n is the total number of samples (Table 4).
Each neural network was trained using 600 samples, and a 70:30 data split, with 70% of the data being used for training, and 30% being used for validation. The sampling was random in order to avoid any selection bias in the dataset. Performance testing was done with 197 samples that were not used during the training phase. This is a common data splitting scheme, frequently used in ANN applications [36].
This process was repeated for each network type using the three datasets, and three different data splits for each dataset, resulting in a total of 360 neural networks analyzed in this study. Data preprocessing and analysis were performed using the R programming language [37], while the construction and training of neural networks was performed using MATLAB [38].

Results
The combined results of this study can be seen in Figure 3. It plots the correlation of the target and predicted parameter for all datasets and data splits. The individual plots (a-f) represent the correlations of the target and predicted parameters for the different data splits. The first row (a-c) represents the results of using the Elman architecture, while the second row (d-f) represents the results of using the feedforward architecture.

Predictive ability of different ANN architectures
In the case of Elman networks, no improvement was observed with the increase of the number of neurons in the hidden layer. Their performance decreased rapidly once the hidden layer exceeded three neurons, especially in the case of noisy datasets, while in the case of noise-free ones, the performance stays consistent (Figure 3a-3c). The explanation for such a fall in accuracy is that the increase in hidden nodes resulted in overfitting, that is, the neural network memorized the training data, but did not learn the underlying rules that would enable it to predict the length of new samples [35].
Feedforward networks exhibited no great changes in their performance with the increase of neurons in the hidden layer in the S3 dataset. However, in the noisy datasets S10 and S94, the increase of the number of neurons served to handle the noisy inputs and increased performance. This is most evident in S94, where the highest performing feedforward neural networks had two, seven, and four neurons in the hidden layer, respectively (Figure 3d-3f).
In a general comparison of Elman and feedforward neural network architectures, Elman shows more consistency in its predictive ability when the number of neurons is changed in the hidden layer. However, it should be noted that the highest correlation was obtained by a single neuron feedforward network utilizing the S3 dataset with the Q data splitting scheme.

Predictive ability according to training data
The S3 dataset consistently provided the best results as the SNPs used in this scenario have been previously correlated to the target variable, thus providing a noise-free dataset, and being ideal input variables. The S10 dataset performed comparably to the S3 dataset in neural networks with a small hidden layer, while the performance of the S94 dataset was erratic at best. The Q data splitting scheme (Table 3) provided the best results, despite the performance of neural networks worsening as the size of the hidden layer started to exceed ten neurons. The H data split was a close second, while the T data split exhibited the worst performance of the three, regardless of number of hidden neurons or network architecture used.

Etr_5317
Localizes to DYM gene which encodes a protein associated with normal skeletal development and brain function.

Etr_4281
aligned to the human homologue PCDH15 that encodes a membrane protein which functions to mediate calcium-dependent cell-cell adhesion.

Etr_1806
Does not appear to localize within any described gene region. Table 1. Association of SNP markers to morphological traits.

Etr_1806
Etr_1806_aa Etr_1806_ag Etr_1806_gg  The first row (a-c) represents the Elman networks, the second row (d-f) represents feedforward networks. The blank squares are the S3 dataset networks, S10 is represented by blank circles, while S94 is represented by blank triangles. The first column is split according to the H scheme, second column is the T scheme, while the third column is the Q scheme. The vertical axis is the Pearsons r coefficient, while the horizontal axis represents the number of neurons in the hidden layer.   Table 5. Performance of the expert system.

Highest performing feedforward ANN testing
The testing of the highest performing ANN was performed with 197 samples that were not included in the training dataset ( Table 3). The performance of the trained feedforward ANN for the test dataset was measured by mean absolute error (Equations 6 and 7). A mean absolute error of 30.162 mm was achieved. When converted to percentage values, the mean absolute error was 5.03% (Equation 2, Table 4).
The Pearson correlation coefficient was calculated to be 0.68 for the test dataset, which leads to the conclusion that a relatively high level of correlation between the true and predicted lengths of the samples was achieved. The network had an accuracy of 67.5% in discriminating between long and short specimens. The accuracy was calculated by transforming the true and predicted values from the test dataset into categorical ones, those being either long (≥ 660mm) or short. Out of 197 samples in the test set, 74 were from group long and 123 were from group short. Out of 74 long samples, 46 samples were correctly predicted giving the sensitivity of 62.16%. Out of 123 short samples, 87 were correctly predicted giving the specificity of 70.73% (Table 5).

Discussion
Changing environmental circumstances influences the natural migratory instincts in Pacific lampreys and causes them to travel great distances in order to reach their natural spawning sites. In this paper, we present a comparison between simple recurrent (Elman) neural networks and feedforward networks for the prediction of the adult body size of Pacific lamprey individuals according to their genotypes. The feedforward architecture proved to be efficient in classifying the phenotype of individuals according to the SNP variations of three markers.
A three-centimeter average difference between the actual and the predicted length of an individual obtained in the results is satisfactory for a species where the average size of an individual is about 55 cm. However, as promising as the results appear to be, one must take into consideration the inherent difficulty of predicting a complex trait that is largely influenced by environmental factors, not just genetic ones. The size of Pacific lampreys at adulthood is heavily dependent on environmental factors such as water temperature [39] which have not been accounted for in this study, as the source dataset did not delve into such elements.
The results were compared to similar studies, where regression model and artificial neural networks were used with SNP data. A constructed multinomial logistic regression model is designed using 24 SNPs from eight genes. The proposed model revealed the accuracy for predicting intermediate eye color of 0.73 [40]. On the other hand, prediction of eye color using multinomial regression model based on six IrisPlex SNPs shown the accuracy of 0.796 for intermediate eye color [6]. Also, regression model found its application in trait prediction using whole-genome sequencing data. The results shown the reidentification accuracy of different pool sizes range from 0.075 to 0.85 for eye color trait [41].
However, in ANN it was found that they are in line, or even outperform certain other studies by a small margin, depending on the dataset. SNP data were used to predict childhood allergic asthma in humans, and the obtained accuracy was 74.4%, which is comparable to the results obtained in the present study after transforming the output to a categorical value where the accuracy of predicting whether the individual is a large fish (length >66 cm) was 67.5% [42].
An ANN was used to predict various complex traits in cattle, and the obtained predictive correlation ranged from 0.47, to the best-case scenario of 0.67, whereas the correlation coefficient in the present study matched theirs being 0.68. These results being in line with previous research give the authors confidence in the test design and execution of the ANN in this study, and serve as another set of evidence as to the effectiveness of using ANN in combination with SNPs to predict complex traits [15].
The performance of different architectures of neural networks was compared with the task of predicting phenotypes of cattle and wheat, and the obtained conclusion was that nonlinear ANN outperform linear architectures in that scenario, as they had higher predictive correlations. Our results outperformed their predictive values, possibly due to our use of SNPs known to be involved in the targeted trait, while they used a large SNP panel, which might have had the unwanted sideeffect of introducing noise into the dataset [16].
A multitude of ANN models was explored for the prediction of marbling score in Angus cattle. The authors used different training algorithms, different activation functions and different numbers of neurons in hidden layers, and obtained a high correlation in their training set ranging from 0.776 to 0.858, depending on the algorithm and input dataset used. As they used SNP panels of 3,000 and 700 markers, it remains to be explored whether their results could have been improved by limiting the number of input SNPs to only the most relevant ones, and the application of their methods to the dataset used in our study would be a good topic for further research [17].
ANNs with relatively small numbers of hidden neurons showed good results in our study, which is not uncommon, as even single hidden neuron ANNs have the ability to learn complex rules [43,44]. The increase in the number of hidden neurons only became necessary in the case of noisy datasets, where it served to handle the noisy data. The greatest influence on the performance of the neural network came from the choice of input values, and distribution of values in the input dataset, as the best performance was achieved by a training data split where the lengths of the individual samples were evenly distributed.

Conclusions
We have compared a number of ANN models for the prediction of a phenotypic trait, based on SNP data. We have investigated the effects of network architecture, hidden layer sizes, inputs, and training data splits in order to obtain the highest performing neural network model for the prediction of adult Pacific lamprey length. The results indicate that using a minimal number of inputs in the dataset (three) with a one neuron in the hidden layer and the feedforward neural network architecture provides the most accurate predictive performance. These results correlate with previous findings in this area.
While artificial neural networks are great at approximating unknown relationships, they work much better in the absence of noise in the dataset in the case of a SNP panel, and any such further studies must be initiated with an exploration of the relevancy of the chosen inputs to the output traits in order to avoid noisy data.

Availability of data and materials
The dataset(s) supporting the conclusions of this article is (are) available in the Data Dryad repository, http://datadryad.org/resource/ doi:10.5061/dryad.t0391.