Nonlinear analysis of random walks: a tool to analyze nucleic acid sequences

We compared computer-generated random sequences with tRNA nucleotide sequences from Bacteria, both represented as random walks. Nucleotide sequence data of tRNA genes were obtained from the Institute for Genomic Research (TIGR) and the GenBank library. Random sequence data (white noise) were obtained with the algorithm by Press and Teukolsky. Random walks of nucleotide sequences were obtained by letting the orbit walk a unit step in one of four directions (down, left, right, and up), depending on the next base (A, C, G, or T) in the sequence, and the distances from the origin were calculated. The Visual Basic routines used to perform the analysis are presented. Relative Lempel-Ziv complexity, Entropy (sum of the positive Lyapunov exponents) and Hurst indexes of nucleotide sequences and of computer-generated random data were evaluated over the distances of their random walks. Our data show that the values of the nonlinear parameters obtained from the bacterial tRNAs are lower than those of randomly generated sequences (p<0.01, p<0.05, p<0.01), meaning that the tRNA sequence is more ordered than purely unstructured random data and has a “memory”. The observed deviation from pure randomness may arise from constraints such as the secondary structure of this biological macromolecule and/or from its peculiar origin from repeated subunits. These data indicate that evolution chose nonrandom “alphabets” early: order together with randomness was present at the dawn of life. Our method, presented and described here, provides an efficient tool to assess the amount of order/disorder in the primary structure of nucleic acid sequences.


Introduction
Billions of years ago, according to the exon theory of genes [1], small RNAs were translated into peptides of 15-20 amino acids: minigenes of pre-tRNAs encoding RNA hairpin structures. The dimerization of two identical RNA hairpin structures may have led to the formation of the cruciform structure of the tRNA molecule; tRNA thus reflects the primordial genes of that era. Prokaryotic cells lived in the earliest times (3.5 billion years ago), and these microorganisms retain ancestral signatures in their tRNAs [2,3]. To test the hypothesis that the nucleotide sequences of the primitive informational polymers (tRNAs) were not chosen randomly, we compare computer-generated random sequences with tRNA nucleotide sequences present in bacterial genomes, describing nucleic acid sequences as random walks and analyzing the distances from the origin with nonlinear approaches.

Random data
Random data (white noise) were obtained with the algorithm by Press and Teukolsky [4], generating uniformly and randomly distributed data points over the unit interval (0 to 1). Following the graphical approaches of Hamori and Ruskin and of Mizrahi and Ninio [5,6], we analyzed the nucleotide sequences of nonintronic tRNAs and computer-generated random data by describing them as random walks [7], using software developed in Visual Basic by the first author of the paper (Figures 1 and 2).
To obtain the random walks of random data, the unit interval was divided into 4 intervals, to each of which one of the letters A, C, G, T was assigned (Figure 2), and the distances from the origin over the random walk were obtained as for the nucleotide sequences (Figure 1). Fifty random sequences (length, n = 80) were evaluated.
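The construction of the random walk and of its distance series can be illustrated with a short Python sketch. It is not the Visual Basic routine of Figures 1 and 2. The base-to-direction mapping (A: down, C: left, G: right, T: up), the use of the Euclidean distance, the function names and the seed are assumptions made for illustration, and Python's built-in generator stands in for the Press and Teukolsky algorithm.

```python
import math
import random

# Assumed mapping of bases to unit steps (A: down, C: left, G: right, T: up).
STEPS = {"A": (0, -1), "C": (-1, 0), "G": (1, 0), "T": (0, 1)}


def uniform_to_bases(values):
    """Map numbers in [0, 1) to A, C, G, T using four equal intervals."""
    return "".join("ACGT"[min(int(v * 4), 3)] for v in values)


def walk_distances(sequence):
    """Return the Euclidean distance from the origin after each step."""
    x = y = 0
    distances = []
    for base in sequence.upper():
        dx, dy = STEPS[base]
        x += dx
        y += dy
        distances.append(math.hypot(x, y))
    return distances


# One random sequence of length n = 80, as used for the random data.
rng = random.Random(0)  # hypothetical seed, for reproducibility only
random_seq = uniform_to_bases(rng.random() for _ in range(80))
random_dist = walk_distances(random_seq)
```

The same `walk_distances` function can be applied to a tRNA gene sequence written with the letters A, C, G and T; the resulting list of distances is the series on which the nonlinear parameters are evaluated.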

Nonlinear parameters
Relative LZ complexity and Entropy (sum of the positive Lyapunov exponents) were calculated with Chaos Data Analyzer Pro v. 2.0 [8] from the distances computed over the random walks.
Relative LZ complexity, LZ:
Relative LZ complexity is a measure of the algorithmic complexity of a time series. According to the Kaspar and Schuster algorithm [9], applied in the present paper, each data point is converted to a single binary digit according to whether its value is less than, or greater than, the median value of the set of data points.
White noise (a purely random signal, common in physical systems, that exhibits equal power across all the component frequencies of the signal) has an LZ value that is close to 1.0. Pink noise (flicker noise or 1/f noise) exhibits decreasing power as frequency increases and is associated with a relatively low LZ value; it is common in biological systems (e.g. heart rate). A sine function with 10% superimposed Gaussian white noise yields an LZ value that is close to zero.
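The binarization around the median and the counting of Lempel-Ziv components can be sketched in Python as follows. This is a minimal illustration of the Kaspar and Schuster counting scheme, not the code of Chaos Data Analyzer. The treatment of values equal to the median (coded as 0) and the normalization by n / log2(n) are assumptions, and the function names are hypothetical.

```python
import math
import statistics


def binarize(values):
    """Code each value as '1' if above the median, else '0' (ties: assumed '0')."""
    med = statistics.median(values)
    return "".join("1" if v > med else "0" for v in values)


def lz76_complexity(s):
    """Count the Lempel-Ziv (1976) components of s, after Kaspar and Schuster."""
    n = len(s)
    if n < 2:
        return n
    i, k, l = 0, 1, 1      # i: search start, k: match length, l: parsed prefix length
    c, k_max = 1, 1        # c: number of components found so far
    while True:
        if s[i + k - 1] == s[l + k - 1]:
            k += 1
            if l + k > n:          # reached the end while still matching
                c += 1
                break
        else:
            k_max = max(k_max, k)
            i += 1
            if i == l:             # all start positions tried: new component
                c += 1
                l += k_max
                if l + 1 > n:
                    break
                i, k, k_max = 0, 1, 1
            else:
                k = 1
    return c


def relative_lz(values):
    """Component count normalized by n / log2(n) (assumed normalization)."""
    bits = binarize(values)
    n = len(bits)
    return lz76_complexity(bits) * math.log2(n) / n
```

With this normalization long random series tend toward values near 1, while short series such as n = 80 can deviate noticeably, so the comparison between groups of equal length is more informative than the absolute value.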

Entropy, K:
The entropy index chosen here [10] is a measure of the disorder in a data set and was calculated as the sum of the positive Lyapunov exponents.
Randomness is indicated by numerically high values of entropy. Ordered series like the sine function exhibit values that are close to 0.

Hurst exponent, H:
The Hurst exponent is the slope of the root-mean-square displacement of each data point versus time (on a log-log scale). The H value for white noise is equal to 0. If H ≠ 0.5, then correlation exists, the noise is "colored" and the process exhibits a "memory": if the exponent is greater than 0.5, persistence occurs (past trends will statistically persist in the future, see sine function); conversely, if H is less than 0.5, anti-persistence occurs (past trends tend to reverse in the future, e.g., pink noise and biological signals such as heart rate). For Brownian motion, a random process in which, on average, each point moves away from its initial condition by an amount that is proportional to the square root of time, the Hurst exponent exhibits a value which is close to 0.5 (no memory) [11].

Figure 1. Visual Basic subroutine: tRNA random walk ("orbit" that collects the distances from the origin for each base). The base ratio, RT, of the nucleotide sequence data was also calculated (trna.txt).
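The definition of the Hurst exponent as the slope of the root-mean-square displacement versus time can be sketched in Python. The estimator below fits a least-squares line to log(RMS displacement) against log(lag); the choice of lags (1 to n/4 by default), the function name and the seed are assumptions, and the result need not coincide numerically with the value reported by Chaos Data Analyzer.

```python
import math
import random


def hurst_rms(series, max_lag=None):
    """Slope of log(RMS displacement) versus log(lag), by least squares."""
    n = len(series)
    if max_lag is None:
        max_lag = n // 4           # assumed default range of lags
    log_lag, log_rms = [], []
    for lag in range(1, max_lag + 1):
        diffs = [series[t + lag] - series[t] for t in range(n - lag)]
        rms = math.sqrt(sum(d * d for d in diffs) / len(diffs))
        if rms > 0:                # skip lags with zero displacement
            log_lag.append(math.log(lag))
            log_rms.append(math.log(rms))
    if len(log_lag) < 2:
        raise ValueError("need at least two usable lags")
    mean_x = sum(log_lag) / len(log_lag)
    mean_y = sum(log_rms) / len(log_rms)
    sxx = sum((x - mean_x) ** 2 for x in log_lag)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_lag, log_rms))
    return sxy / sxx


rng = random.Random(1)             # hypothetical seed
white = [rng.gauss(0.0, 1.0) for _ in range(2000)]
brown = []
total = 0.0
for w in white:                    # Brownian motion as the running sum of white noise
    total += w
    brown.append(total)

h_white = hurst_rms(white, max_lag=20)   # expected to be near 0
h_brown = hurst_rms(brown, max_lag=20)   # expected to be near 0.5
```

Under this definition the displacement of white noise does not grow with the lag, which gives a slope near 0, whereas for Brownian motion it grows as the square root of the lag, which gives a slope near 0.5.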

Statistical tests
The Mann-Whitney U test was used to assess differences between the groups.
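As an illustration of the test, the U statistic and an approximate two-sided p value can be computed in Python as sketched below. This is a simplified version using the normal approximation, without tie or continuity corrections; the statistical software actually used for the paper is not specified, and the function name is hypothetical.

```python
import math


def mann_whitney_u(a, b):
    """Return (U, approximate two-sided p) for two independent samples."""
    n1, n2 = len(a), len(b)
    # Count pairs in which the value from a exceeds the value from b (ties count 0.5).
    u1 = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    u = min(u1, n1 * n2 - u1)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail probability
    return u, p
```

In the present setting, `a` and `b` would be the lists of one nonlinear index (for example relative LZ) computed for the tRNA sequences and for the random sequences, respectively.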

Results
Two-dimensional random walks of bacterial tRNAs appear less "diffused" than those of computer-generated random data (Figure 3). Relative Lempel-Ziv, Entropy and Hurst indexes calculated over the distances of the random walks in bacterial tRNAs are significantly lower than those of computer-generated random data (p<0.01, p<0.05). Table 1 summarizes the results.
Figure 3. Graphical representations of computer-generated random data (top) and of a tRNA (bottom) as random walks. The sequence begins at the origin in the lower right corner of the graph. The tRNA random walk appears less "dispersed" than that of the computer-generated random sequences.
Relative Lempel-Ziv, Entropy and Hurst indexes calculated over the orbit walk (distances) of bacterial tRNAs show significantly lower values than those of random sequence data (**1, p<0.01; *1, p<0.05), indicating a more ordered structure in the nucleic acid sequences than in unstructured random data. Moreover, the tRNA sequence has a "memory" (H ≠ 0.5).

Discussion
A number of studies have examined the structure of nucleic acid sequences subjected to a variety of mathematical transforms in order to uncover pattern irregularities in DNA, which often result from constraints and are therefore frequently associated with function [12-19], also using graphical approaches [5,6]. With our approach, ancient informational polymers, old bacterial tRNAs [20], show significantly lower values of LZ complexity, Entropy and Hurst indexes than random sequence data (white noise). These data reveal a significant shift from unstructured randomness (white noise), i.e. a more ordered structure than a purely random sequence, confirming our previous results obtained with different nonlinear indexes [21].
The observed deviation from pure randomness may arise from constraints such as the secondary structure of this biological macromolecule and/or from its peculiar origin from repeated subunits [1].
Gayle and Freeland [22] showed that the 20 amino acids present in the LUCA were not chosen randomly, and Weiss et al. [23] showed a small but significant reduction of the Shannon entropy (-1%) in protein sequences compared to random polypeptides. Together with our results, these data seem to indicate that evolution chose nonrandom "alphabets" early: order together with randomness was present at the dawn of life.

Conclusion
It has been shown that tRNA sequences are more ordered than unstructured random ones (white noise). Lempel-Ziv and Entropy evaluations of nucleic acid sequences described as random walks provide an efficient tool to assess the amount of order/disorder in the primary structure of nucleic acid sequences.