Dysregulated signal propagation in a MYC-associated Boolean gene network in B-cell lymphoma

By the modern molecular biological approaches that exploit the availability of high quality gene expression data, it is made clear that flexible and robust responses of cellular programs are encoded in the relations between gene expression values. These relations naturally define a network where they stand for edges between the nodes that stand for the genes. The wiring of these networks often found to be dysregulated in cancer. Different system biological approaches that rely on correlations, differential equations and logical analysis are used to probe these relations in gene expression data especially. In our work we investigated selected biological functions in aggressive germinal center B-cell lymphoma in terms of a logical analysis of gene-regulation in Boolean space and a signal propagation algorithm considering network topology based on gene expression data. We especially aimed at studying the activity of the MYC gene as a key player. It is shown that the functional output of a gene network is affected by the states of the genes and also by the wirings between them. Our results support the key function of MYC in lymphoma biology. In addition, we showed that genes can alter functional output of the network by alternative mechanisms like reducing the variance in propagating signal and locking it to a certain level. Correspondence to: Hans Binder, Interdisciplinary Centre for Bioinformatics, Leipzig University, Leipzig, Germany; E-mail: binder@izbi.uni-leipzig.de


Introduction
Organisms have a great variety of biological responses on molecular level to continuously changing intra/inter-cellular and environmental signals. These flexible yet robust response mechanisms to ever changing signals are rendered possible by different organizational levels and by complex gene networks [1][2][3][4]. They are found to be dysregulated in many diseases, especially and particularly in cancer. Thus, understanding the biological functions entails constructing the links between genetic entities, gene regulatory networks and cellular responses and understanding the disease mechanisms requires investigating the wirings in the networks.
In eukaryotes each component of the genetic network is often regulated by 4-5 other components of the network that results in a highly integrated regulatory control [5]. These regulatory relations are manifested in the expression of the genes that can be considered as continuous real variables in an interval bounded by min/max expression values of the respective gene. This kind of formalism is used in differential equation related analysis [6,7]. However differential equation methods suffer heavily from the need to a great number of kinetic parameters that usually are not known to a satisfying degree of precision, especially for regulatory and signaling networks. On the other hand, gene regulatory relations are based on simple building blocks such as promoters, transcription factors and transcription factor binding sites on DNA. These basic components are essentially molecular switches that either turn genes ON or OFF and result in switching modes of the groups of genes in the gene regulatory machinery [8]. Consequently, these modes and the essence of regulatory relations can be captured by Boolean variables and binary logic functions [5]. Apart from its use in various fields like electrical circuits, systems biology approach to whole genome gene expression data, binary logic is a distinguished recent technique that stands due to its simplicity and robustness. Logical analysis methods for biology are pioneered firstly by S.A.Kauffman [9][10][11][12]. In general sense logic based methods aim to describe sigmoidal functions as a Boolean step function. Since then logical formalism and methods are used effectively in biology in addition to differential equation methods, correlation based methods and stochastic models (see for a review [13]). The base, formalism and biological reasoning for the method used in this work was developed and applied first by Sahoo et al. [14]. We supplied this method with a machine learning algorithm (SOM) and a pathway signal flow algorithm (PSF) to analyze gene expression networks [15]. The method predicts logical relations between pairs of genes from their expression profiles.
Since years one of the most visited diseases by modern molecular biology and medicine is cancer. Transcriptional regulation is heterogeneous and associates typically to a series of molecular subtypes [16][17][18]. Understanding of the molecular mechanisms of subtypes of a cancer is particularly of great interest since it has a significant impact on diagnosis and prognosis [19][20][21][22].
In this study we address the biology of aggressive germinal center B-cell lymphoma with the focus of Burkitt's lymphoma (BL) which is probed in relation to MYC and other highly mutated key players (Table  1) with regards to gene expression, gene-regulation, network topology and network organization using a Boolean approach. Our aim is to study the activity of MYC in connection to the other key players of the network and their manifestation onto the significant biological functions. Using gene-gene relations in previous studies on aggressive B-cell lymphomas (referred to as 'Reference Network' -RN) [23][24][25], we identified the genes of the network that act in concert or contrary fashion compared to a data-driven implication network (Implication Network -IN) of B-cell lymphomas. Furthermore we use PSF algorithm to calculate signal flow through two networks and finally we employ previous biological knowledge using biological function nodes in the network. Throughout the study we made use of subtype classification of Hummel et al. that consists of three classes called mBL, non-mBL and intermediate to discuss the results [26].

Results
Our study organizes in three levels. Firstly we analyze gene expression data using a unsupervised machine learning method called SOMs to identify, organize and cluster similarly behaving genes and to reveal the general organization of the data. Then comes the main step of the study that sets the base for network level analysis: we identify logical relations between clusters of genes by using so called Boolean Implication method, which is convenient for our purpose since it provides six different relation classes and also does not assume co-expresion of the genes. These logical relations give us the wirings between the key players of lymphoma and thus constitute an implication network. At this point we carry out a comparison between identified wirings from the data and expected wirings from the literature (topological comparison). The differences in wirings were investigated considering mutations of MYC, TCF3 and some other potential key players [23][24][25] (Table 1). Lastly we investigate the effect of changing wirings in the network over signal propagation and biological functions such as cell proliferation, cell survival, apoptosis and cell cycle control. These biological functions are required for the formation of Germinal centers and proper maturation of B-cells ( Figure 1).

The reference network of MYC regulation in B-cells
We constructed a network that consists of genes that have significant roles and that are frequently dysregulated in aggressive B-cell lymhphomas. Key connections in the gene regulatory network were investigated using mutations in critical actors like MYC and TCF3 [23][24][25]. MYC encodes a transcription factor that is estimated to have the ability to bind a very significant proportion of genomic sites (~25,000 sites), which gives a clue of its master role [27,28]. In lymphoma studies, and in cancer studies in general, a special attention is given to MYC gene for its "master regulator" properties [29][30][31]. MYC is noted as an exclusive effector both due to the number of targets that are in relation with MYC and due to the crucial pathways and processes that encapsulates nearly all hallmarks of cancer (cell cycle control [32], cellular transformation [33], proliferation and apoptosis [34], vasculogenesis and angiogenesis [35], cell adhesion [36], metastasis [31], genetic instability [37] and tumorigenesis through miRNAs [38,39]) and that are affected by MCY. Consequently dysfunctional state of this non-linear hub is prevalently associated with bad prognosis of cancer. In addition to MYC, some other key genes are found frequently mutated in B-cell lymphomas; pro-survival associated TCF3, ID3 and cell progression associated CCND3 ( Table 1).
The constructed a network from relevant literature ( [23,24,40]) where MCY constitutes a highly variant network in aggressive B-cell lymphomas is given in Figure 1.

Clusters of coregulated genes in lymphoma reflect alterations of the MYC network
We used SOM machine learning algorithm to cluster, visualize and profile gene expression data of a cohort of 220 lymphoma patients previously classified as mBL, non-mBL and intermediate cases [26]. This 'SOM space' visualizes the data landscape under study. It is an important property of the SOM clustering method that it 'selforganizes' meta-genes (and genes) with similar profiles together into neighboring positions. In consequence, the SOM expression landscape can be segmented into spot-like areas that represent clusters of covariant genes. These spot-clusters can be interpreted as intrinsic regulatory modes in the multidimensional expression data [41].

Implication analysis identifies relations between genes
Next, we carried out Boolean implication analysis for the 19 genes that are assumed to be crucial regarding B-cell function collected in Figure 1 to judge the type of mutual interactions between them as described in the methodical section. One of six possible Boolean implications has been assigned to each of the pairwise combination of genes of our network. We used alternatively gene (GL), metagene (MGL) and spot-level (SL) expression profiles as provided by the SOM analysis. In our analysis we primarily used metagene level results since they are more stable, less noisy compared to gene level data and more diverse compared to spot level data. The comparison between different levels of data aggregation and their properties are discussed in a previous publication [15]. Both implications and correlation coefficients for the set of genes are given in Table 2. Pairwise scatter plots of interactions are shown in figures Figure 3 and Figure 4. Figure 5 compares gene clustering of the network considered using either Boolean implications or Pearsons correlation coefficients between their expression profiles. Both heatmaps split into two main clusters 1 and 2 that were identified also in the SOM map (Compare Figure 2 and Figure 5). Cluster 3 genes are found at intermediate positions between clusters 1 and 2 in the heatmaps (genes with green font in Figure 5). It turned out that implications provide a better contrast for the different types of interactions compared with correlations. It is also clearly seen that the two clusters in the implication heatmap perfectly match cluster 1 and cluster 2 on SOM map. Moreover, the relations between the genes within the clusters are resolved in terms of HH, LL, EQV, HL, LH and OPP implications (Figure 5b).  Table 2. Boolean implications and Pearson correlation coefficients of the network genes. Genes of reported positive and negative relations are listed in the left and the right panel, respectively. Bold font is used for correlation coefficients with absolute values over 0.5. Brackets indicates correlations/implications that are found to be different from the RN. Gene names that are possible centers for mutation in lymphoma are shown by underlined text.

Topological comparison between reference net and implication net
In our analysis we consider two variants of the network shown in Figure 1 and Figure 6, respectively: (i) The reference network (RN) assumes the interactions between the genes in its 'canonical' form. It describes a collection of regulatory paths in healthy B-cells. (ii) The 'data-driven' implication network (IN) considers the implications derived from the expression data. Hence, the IN describes the dysfunctional state identified from the lymphoma data, and particularly the differential interactions observed between the lymphoma subtypes. The IN thus describes B-cell lymphoma with potentially differentially dysregulated interactions between the genes. Topological comparison Every pixel in the map represents a metagene and encloses a set of genes that show similar expression profiles. Light grey areas represents the spots that enclose a set of metagenes that have similar expression profiles. There are two additional genes shown: p53 and BIM owing to their possible regulatory effect on key genes of the network in some special cases and they will be discussed separately in the text. The genes can be grouped into three clusters colored in red, blue and green, respectively. b)-d) Boxplots of selected spot profiles taken from each of the three clusters (selected spot profiles are the light grey pixel groups that lie in dashed rectangles in figure part a). Note that cluster 1 and 2 collect genes up-regulated in BL and DLBCL, respectively. Consequently Cluster1 genes are in an opposite type relationship with cluster2 genes ( Table 2). There are some outliers like PTEN and BCL6 having a probe located close to bottom right corner and ALK, P16 and CDK6 that seem to be located in between three clusters. identified two possible differences: (i) A link that is present between two genes in RN is absent in the other network (broken link) or (ii) A link that is present in both networks can change the type of regulations it implies to another type in one network compared to other (type changing).
Five interactions (one activating and four repressing) changed their sign compared with the expected interactions between RN and IN. For instance, MYC is expected to be up-regulated indirectly by the oncogenic effect of STAT3. However, we identified the opposite STAT3 OPP MYC implication between them (Table 2, Figure 3g, Figure 6). Other type changing regulations include BCL6 and MYC, PTEN and CCND3 (Table 2, Figure 4d,e,g).

Signal flow in the MYC network
We calculated pathway signal flows (PSF) through the MYCnetwork for RN and for IN versions separately to estimate the effect of altered interactions on selected biological functions as described in the methodical section. We considered five functional biological sink nodes, related to germinal center formation, plasma cell differentiation, cell survival, proliferation, and cell growth (Figure 1, Figure 6).
The collection of metagene level signal results for RN and IN for the biological sink nodes are given in (Figure 7). The greatest differences between both networks are found for germinal center formation and survival signals for mBL subtype. Signal values for non-mBL subtype seem to stay almost unchanged between RN and IN except slightly increased plasma cell differentiation signal and slightly decreased proliferation and cell growth signals. On the other hand signals for intermediate cases seem to always decrease slightly from RN to IN.
In addition one finds that proliferation and cell growth signals show similar mean signal in RN and IN however the distribution is much narrower IN (Figure 7d,e). Another interesting effect is observed when we remove the survival repressing effect of MYC, which implies its apoptosis. This change is simulated by simply removing the MYC/ Survival negative connection in the network (Figure 1). Surprisingly after applying this change we have the slightest change in the results -a slight increase in mBL subtype, a widening of distribution in intermediate subtype and a slight decrease in non-mBL subtype -that underlines the cooperative role of the genes on the sink signals. We conclude that changing a single edge sole is not sufficient to change the signal significantly in this regime (Figure 7c, f).

Discussion
In this study we examined a gene regulation network composed of selected key genes of lymphoma biology directly related to MYC. We constructed a data driven network via Boolean implication analysis (IN) using gene expression data and compared it to a reference network (RN) that is constructed by using the literature knowledge. For network synthesis we used Boolean implications and we compared them to correlations. One of the fundamental advantages of Boolean implications is that implications have the ability to capture relations between genes that have poor correlation and consequently cover a greater range of possible relations between the genes. Another benefit of implications is the better contrast that they provide by classifying relations. The consequences of network-wirings are examined in terms of network activity using the signal flow approach, which assesses the signal propagation through the network towards functional sink nodes. The examined network is composed of 20 genes and 5 functional outputs.

Germinal Center Formation
The germinal centers (GCs) are the main sites for antigendriven somatic hypermutation (SHM) of the genes encoding the immunoglobulin variable region (IgV) and they hold clue regarding the immunological state of the cells. BCL6 and BCL2 are two of the actors of this process. The distribution of the reference and implication network signals at the sink node GCF is shown in Figure 7. Firstly it is clearly visible that there is a sharp increase in GCF signal of mBL subtype in IN compared to RN. On the other hand, for non-mBL and IML subtypes there is essentially no change between RN and IN. There are two crucial aspects to be discussed: the difference of the signals between subtypes and the difference of the signals between reference and implication networks. Regarding subtype differences with respect to GCF signals, mBL subtype is enriched in high expression zones of BCL6 and TCF3, which results in amplification of the GCF signal compared to intermediate and non-mBL subtype (Figure 4c  7a). For the second point, by comparing Figure 1 and Figure 6 it can be seen that the increase in GCF signal in IN compared to RN is particularly caused by absence of negative control of ID3 over TCF3. This fact is shown to be very common in BL and consistent with our results.

Plasma cell differentiation
Plasma cells (PCs) are one of the key players in adaptive immunity as final moderators of primary and secondary humoral response that are dedicated to supply soluble immunoglobulin (Ig). Molecular characteristics of plasma cells in comparison with B-cell (BC) characteristics are discussed extensively in the review by Kathryn L.Calame [42]. Before discussing plasma cell differentiation signal we should denote that in our restricted networks plasma cell differentiation signal is merely controlled by BLIMP1 gene that is not highly informative and that does not provide different results comparing different subtypes and RN and IN. Thus depending on the negative relation between MYC and BLIMP1 it is assumed that Figure 6. Gene network obtained from the data by implication analysis. Activating and repressive interactions of the genes to functional outputs are shown by blue links (In the signal analysis CCND3/CDK6 and CD79A/CD79B are considered as single nodes since they behave as complexes. The edges between them and the edge between TCF3/PI3K are given as supporting information).  Table 2 metagene level correlation and implication results). Furthermore a better contrast is presented by implication analysis. Additionally one must be aware of the fact that poorly correlated genes can exhibit significant implications. This advantage of Boolean implication analysis is discussed in [15,54]. doi: 10.15761/BEM.1000115 Volume 2(2): 8-11 plasma cell differentiation signal is also directly and negatively regulated by MYC. This assumption implicitly incorporates BLIMP1 and BCL6 into the signal results by their control over MYC. It is seen in Figure 7b that IML and non-mBL subtypes have higher signal values than mBL in both RN and IN. On the other hand when we compare the signals between two networks it is seen that mBL and IML signals decrease while non-mBL signal increases from RN to IN. Lower signal for mBL subtype is basically due to the negative regulating effect of MYC over plasma differentiation signal (note that MYC is highly expressed in mBL subtype). On the other hand, the difference between RN and IN is due to the BCL6/MYC positive relation in IN as opposed to RN. So the state of MYC plays an important role for the outcome of this signal.

Survival
There are two crucial paths controlling the survival in BL designated by two genes: MYC and PI3K. Cell fate regarding survival is basically controlled by the cross-talk between MYC and PI3K/AKT/mTOR pathway i.e. pro-apoptotic properties of MYC is being counterbalanced by deliverance of survival signals via PI3K. The most prominent change in survival signal appears as a sharp increase for mBL subtype while there is a slight decrease for IML and almost no change for non-mBL (Figure 7c). This is caused by escape of TCF3 from the negative control of ID3 and its support on PI3K path to survival. The differences between the subtypes are caused by the differential expression of the genes of PI3K path. This is supported by the survival signal distribution when we remove the link between MYC and survival sink node to simulate the BL cases where MYC-mediated apoptosis is suppressed [43]. In this case the signal does not change significantly for any subtype, which emphasizes the prominence of PI3K path and also the cooperative activity on survival (Figure 7f).

Proliferation
Proliferation essentially involves the similar paths as in survival: ID3/TCF3 and PI3K/AKT/MTOR path regulated by BCR and PTEN [44], [45] ( Figure 6). However additionally it incorporates another important effector CCND3, which pairs with CDK6. Considering the signal distributions in two networks there is no dramatic difference between the median values. However we observe another interesting property, in IN the signal distribution seems to be locked to a narrower region enabling a smaller variance in the signal. This is caused by removal of variance in the signal by the missing links PI3K/AKT/ MTOR in IN. Regarding the subtype differences, the higher signal in IN for mBL signature is basically due to the positive relations between TCF3/ID3 and p16/CCND3 as opposed to RN (Figure 7). The signal difference between the subtypes is basically caused by the differential expression of mentioned key players; e.g. CCND3, TCF3.

Cell growth
As it is the case for most of the sink nodes, MYC stands out as a critical component together with PI3K/AKT/mTOR regulation via PTEN in cell growth (Figure 7). In a broad biological sense MYC controls cell size and proliferation through amino acid/protein synthesis, lipid metabolism, protein turnover/folding, nucleotide/DNA synthesis, transport, nucleolus function/RNA binding, transcription and splicing, oxidative stress and signal transduction [46]. Additionally MYC regulates cell growth related metabolic pathways such as glucose and iron homeostasis [47]. Comparing subtype and network differences, cell growth signal has a similar profile to that of proliferation (decrease in the signal from mBL to non-mBL and loss in variance in IN), which is logical since cell growth is controlled by similar set of genes and in cells cell growth and proliferation is carefully orchestrated. The slight decrease in signal in IN is caused by the negative regulation of STAT3 on MYC in IN (Figure 6). The positive connection between MYC and STAT3 in RN is replaced by a negative one in IN and consequently the overall signal for all subtypes is decreased slightly depending on the expression of STAT3.

Summary
The biological/functional output of the gene networks are not only composed of the states of the genes but also composed of the wirings / interactions between them. Furthermore altered wirings in the network reflect the basic dysfunctional modes of the network. It is shown in our analysis using five biological outputs of a network and the change in their activity with respect to altered wirings (broken links, and wirings whose activity is reversed, i.e. from activating/repressive to repressive/ activating). It is also shown that genes act cooperatively and it is possible that change in a single gene state is not enough to alter the macroscopic behavior of the network. Thus it is vital to incorporate the dynamics and topology of the network in modern -omics analysis.
Regarding lymphoma biology (in particular BL) the central hub behavior of MYC is supported by our analysis; the expression state of MYC and its wirings to surrounding key paths are responsible for changes in major biological responses like survival, proliferation and germinal center formation. Despite its major significance, it is discussed that to be effective the activity of MYC must be orchestrated by other pathways and key players (as shown in survival). In addition, the behavior of the system is not simply characterized by its activity level, there can be other effects like lost of variance in activity i.e. locking down of the activity to a tight interval (measured as signal in the respective sink node in our analysis of cell growth and proliferation).

Data
We analyzed a microarray data on mature aggressive B-cell lymphomas available under the GEO accession number GSE4475 [26]. In the study authors used Affymetrix HG-U133A microarrays to assess biopsy specimens from 220 patients by assessing the expression level of 22,283 gene-related probesets are. A transcriptional signature is defined by Hummel et al. to distinguish the subtypes molecular Burkitt's lymphoma (mBL) and non-molecular Burkitt's lymphoma (non-mBL) [26]. In the data set 44 cases are assigned to the mBL, 128 are assigned to the non-mBL signature and 48 cases could not be assigned unambiguously to one of the two groups and flagged as intermediate [26], [48][49][50][51].

Preprocessing
Raw probe intensity values collected from each of the 221 arrays are calibrated and summarized by hook method into expression values in logarithmic scale and subsequently quantile normalized [52,53]. The expression data is represented as numerical matrix of dimension N×M such that rows of the matrix, e_(i,j) with i=const, will be termed 'expression profile' of the respective gene i while columns, e_(i,j) with j=const will be termed 'states' referring to sample j under consideration.

SOM analysis
Machine learning using self-organizing maps (SOMs) is applied to the expression data to cluster genes of similar profiles on so-called metagene and spot levels. We used SOM portraying method that visualizes the expression state of each sample by a color-coded twodimensional map of 50x50 pixels, called metagenes according to their expression values in the respective sample. Spot modules were defined as clusters of neighboring over-(red) or under-(blue) expressed metagenes as described in [41]. Each spot represents an expression mode of a group of metagenes showing concerted expression.

Boolean implication and correlation analysis
Boolean implication analysis method transfers genes' expression profiles into binary data space according to their high and low expression values by applying to each gene's expression a threshold. It then identifies logical implications between pairwise combinations of these dichotomized values. Details of the method and of its application to SOM data are presented in the previous publication [15]. Originally the algorithm to identify Boolean relations for gene expression is proposed by Sahoo et al. [54]. Detection of Boolean implications between pairs of gene clusters, i.e. metagenes and spot modules, is performed similarly as described for the genes using expression profiles of metagenes or spot modules instead of those of single genes [15]. Identified implications correspond either of the six Boolean implication classes, High-High (HH), Low-Low (LL), High-Low (HL), Low-High (LH), Equivalent (EQV) and Opposite (OPP).
In correlation analysis, Pearson correlation coefficients for all probe expression couples are calculated and represented as a hierarchically clustered heatmap in comparison with the corresponding implication heatmap. Implication analysis provides six different type of relation classes where three correspond to positive (HH, LL, and EQV) and negative (HL, LH and OPP) correlations each.
If a gene is not located within any spot, we use the profiles of the closest spot in the map for spot level correlation and implication analysis. For the genes with multiple probesets the best scoring correlation and implication is taken into account.Generating implication networks from reference networks Sink Figure 8. Schematic representation of PSF algorithm workflow on hypothetic pathway graph.Mapping of relative expressions R_i on to the pathway nodes and calculation of pathway signal flow for the outcome of the network. Green arrows stands for activation relations and red lines with round ends stands for inhibition relations respectively. S_i ,i=1,2,…6 represents the signal flows at corresponding nodes.