Reconsideration of cohort study and case-control study

A cohort study is a factor-referent study (called a factor-control study by some scholars) with a time span. In essence, both the cohort study and the case-control study (better called the case-referent study) are correlation analyses between exposure factors and disease outcomes. Relative risk (RR) and odds ratio (OR) can be regarded as different transformations of the correlation coefficient (r). Cohort studies are applicable to natural populations, while case-control studies are also applicable to non-natural populations. A case-control study can have a cohort and use the full information of the entire source population during a risk period; a case-control study based on sampling from a cohort can be considered a more efficient form of cohort study. The results of cohort studies and case-control studies can be compared and evaluated by OR values and their confidence limits.
*Correspondence to: Ma Junling, Department of Cancer Epidemiology, Peking University Cancer Hospital and Institute, Beijing 100142, China, E-mail: 13806491812@139.com


Introduction
The case-control study and the cohort study are two classic epidemiological methods. Exploring, understanding, and analyzing the relationship between them is essential to both the teaching and the practice of epidemiology. Many scholars have examined these methods from different perspectives [1][2][3][4][5][6][7][8][9][10][11][12]. Based on many years of epidemiological practice and teaching experience, this paper further explores the nature, internal relationship and scope of application of the cohort study and the case-control study by analyzing examples and interpreting relevant discussions.

Examples
A case-control study groups subjects by disease status (result-to-cause), while a cohort study groups them by exposure (cause-to-result). Superficially they seem to be opposites, but they are in fact internally unified: by the time the relative risk (RR) [13,14] has been calculated from a cohort study, the grouping required for a case-control study has already been completed, so the odds ratio (OR) [13,14] can be calculated at the same time. If the exposure factors and diseases (outcomes) can be subdivided further and further, both the case-control study and the cohort study can be transformed into a correlation analysis at the individual level. The examples below illustrate the intrinsic relationship between case-control studies and cohort studies (the data in the examples are hypothetical).

Example 1
To explore the potential intrinsic relationship between the case-control study and the cohort study, we examined the height, weight, and blood pressure of all adults aged 26-54 years in a village. A body mass index (BMI) ≥ 25 kg/m² was defined as the exposure, and subjects with diastolic blood pressure (DBP) ≥ 90 mmHg were defined as hypertension cases [13,14]. A total of 150 subjects randomly selected from those with DBP < 90 mmHg were included in the cohort and were grouped according to whether or not they were exposed.
After one year, the prevalence of hypertension was compared between the exposed group and the non-exposed group, and the RR of hypertension with high BMI was calculated (Table 1). It can be seen from Table 1 that, although the cohort study was initially conducted to calculate RR, the grouping for a case-control study has been completed at the same time. Therefore, OR can also be calculated (OR can be the result of either a case-control study or a cohort study). The result is as follows:
$OR = ad/bc = (31 \times 70)/(28 \times 21) = 3.69$ (P = 0.0003)
Initially, there was no need to group the 150 subjects at the start; it was only necessary to record their height and weight. After one year, the subjects were grouped by hypertension status, and the prevalence of high BMI one year earlier was compared between the case group and the non-case group. In this way, a case-control study was carried out directly, and the results were the same.
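As a rough check of the arithmetic in Example 1, the sketch below recomputes the OR from the four cell counts quoted in the text. The function name `odds_ratio_wald` is hypothetical, and the Wald test and 95% confidence limits on log(OR) are assumptions: the text does not state which test produced P = 0.0003, although a Wald test gives a value that rounds to the same figure. The assignment of a, b, c, d to table cells is likewise assumed, since only the products ad and bc appear in the text.

```python
import math

def odds_ratio_wald(a, b, c, d):
    """Return (OR, lower, upper, p) for a 2x2 table with cells a, b, c, d.

    Assumes OR = ad/bc and a Wald test on log(OR); this is an illustrative
    choice, not necessarily the test used in the source.
    """
    odds_ratio = (a * d) / (b * c)
    log_or = math.log(odds_ratio)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    z = log_or / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    lower = math.exp(log_or - 1.96 * se)
    upper = math.exp(log_or + 1.96 * se)
    return odds_ratio, lower, upper, p

# Cell counts taken from the OR formula in Example 1.
or_value, ci_low, ci_high, p_value = odds_ratio_wald(31, 28, 21, 70)
print(f"OR = {or_value:.2f}, 95% CI {ci_low:.2f}-{ci_high:.2f}, P = {p_value:.4f}")
```

Reporting the confidence limits alongside the point estimate is what allows ORs from different designs to be compared later in the discussion.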
The RR and OR can also be calculated using the generalized linear model (equation 1) [20]. The data were analyzed using SAS 9.4 statistical software.

Example 2
The exposure factor, BMI, was further divided into six grades from low to high. After one year, this formed a cohort comprising multiple groups, in which the prevalence of hypertension could be compared among the groups (analyzing the dose-response relationship between BMI and hypertension [13,16,18]). The results are shown in Table 2. Using the generalized linear model, the following is obtained: $RR = e^{\beta} = e^{0.4178} = 1.519$ (P < 0.0001). The data in Table 2 can also be analyzed as "case-control study hierarchical exposure data" [16,21,22]. Using the generalized linear model, the following is obtained: $OR = e^{\beta} = e^{0.8504} = 2.341$ (P < 0.0001). Here, RR and OR are the average RR and OR of hypertension (DBP ≥ 90 mmHg) after one year for each one-grade increase in BMI over the preceding grade.
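The conversion from the reported coefficients to RR and OR in Example 2 can be reproduced as below. This is a minimal sketch that assumes each coefficient is on the log scale (a log link for RR and a logit link for OR, which the text does not state explicitly). `ratio_per_grades` is a hypothetical helper, and the two-grade extrapolation follows from the assumed log-linear form rather than from anything reported in the source.

```python
import math

def ratio_per_grades(beta, grades=1):
    """Ratio (RR or OR) implied by a log-scale coefficient over `grades` steps."""
    return math.exp(beta * grades)

rr_per_grade = ratio_per_grades(0.4178)  # coefficient reported for RR
or_per_grade = ratio_per_grades(0.8504)  # coefficient reported for OR
print(f"RR per BMI grade = {rr_per_grade:.3f}")
print(f"OR per BMI grade = {or_per_grade:.3f}")

# Under the log-linear assumption, a two-grade difference squares the ratio.
print(f"RR over two grades = {ratio_per_grades(0.4178, 2):.3f}")
```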

Example 3
Both body mass index and diastolic blood pressure were divided into six grades from low to high, and a rank correlation analysis was conducted [13]. At this point, the forms of the cohort study and the case-control study have disappeared (Table 3). Using the generalized linear model, a regression equation with coefficient β = 0.4891 is obtained. The standard deviations of X (body mass index) and Y (diastolic blood pressure) over the six grades are $\sigma_x = 1.069$ and $\sigma_y = 1.179$, which gives the rank correlation coefficient: $r = \beta \sigma_x / \sigma_y = 0.4891 \times 1.069 / 1.179 = 0.443$ (P < 0.0001)
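The step from the regression coefficient to the correlation coefficient in Example 3 can be sketched as follows. The function name `correlation_from_slope` is hypothetical; the relation r = β·σx/σy is the one used in the text and holds for a simple linear regression of Y on X.

```python
def correlation_from_slope(beta, sd_x, sd_y):
    """Correlation coefficient from a simple regression slope: r = beta * sd_x / sd_y."""
    return beta * sd_x / sd_y

# Values reported in Example 3.
r = correlation_from_slope(beta=0.4891, sd_x=1.069, sd_y=1.179)
print(f"r = {r:.3f}")
```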

Example 4
Since body mass index and diastolic blood pressure are both quantitative data, the 150 people in Example 3 can be further subdivided into 150 groups (everyone serves as an internal control for everyone else [16,18]). A correlation and regression analysis of body mass index and blood pressure was performed (Figure 1), and a regression equation and a correlation coefficient were obtained using the generalized linear model. If the exposure and outcome variables can be continuously subdivided, then any form of cohort study (including retrospective cohort studies [13]) and case-control study (including hospital-based case-control studies [18]) can be transformed into a correlation analysis of individual measurements.

Discussion
From the examples above and a comprehensive analysis of the existing literature on the two methods [13][14][15][16][17][18][19][22][23][24][25][26][27][28][29][30][31][32], it can be seen that the cohort study and the case-control study are essentially correlation analyses between exposure factors and diseases or other outcomes. They use different forms of correlation analysis because of limitations of the survey data or according to practical demands. RR and OR can be regarded as different forms of the correlation coefficient (r). As shown in Examples 1-4, RR, OR and r can all be derived from the regression coefficient (β) of a generalized linear model (some scholars have discussed different methods of calculating r from OR [33]). Whether the direction is "cause-to-result" or "result-to-cause", exposure factors are the independent variables and diseases (outcomes) are the dependent variables. An RR or OR greater than 1 indicates a positive correlation, less than 1 a negative correlation, and equal to 1 no correlation; the larger the RR or OR (or, when RR or OR is less than 1, the larger 1/RR or 1/OR), the stronger the correlation. This is consistent with the meaning of r (greater than 0 indicates a positive correlation, less than 0 a negative correlation, and equal to 0 no correlation; the larger the absolute value of r, the stronger the correlation). This interpretation of the cohort study and the case-control study can help readers understand the intrinsic relationship and nature of the two methods.
Although the values of RR and OR are generally not equal, in a cohort study they always lie on the same side of 1: both are greater than 1, both are less than 1, or both are equal to 1. The significance tests also give the same result (see the significance test results for RR and OR in Examples 1 and 2). Taking RR and OR as different forms of r does not affect the conclusion that RR can be interpreted as the ratio of one rate to another (and OR as the ratio of one odds to another). OR approximates RR when the incidence is low [13,22].
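That RR and OR fall on the same side of 1 can be checked with a short derivation. The symbols $p_1$ and $p_0$ (the proportions developing the outcome in the exposed and non-exposed groups, both assumed to lie strictly between 0 and 1) are introduced here for illustration and do not appear in the original text.

```latex
\begin{align*}
RR &= \frac{p_1}{p_0}, \qquad
OR = \frac{p_1/(1-p_1)}{p_0/(1-p_0)} = RR \cdot \frac{1-p_0}{1-p_1}, \\
p_1 > p_0 &\;\Rightarrow\; \frac{1-p_0}{1-p_1} > 1 \;\Rightarrow\; OR > RR > 1, \\
p_1 < p_0 &\;\Rightarrow\; \frac{1-p_0}{1-p_1} < 1 \;\Rightarrow\; OR < RR < 1, \\
p_1 = p_0 &\;\Rightarrow\; OR = RR = 1.
\end{align*}
```

When both $p_1$ and $p_0$ are small, the factor $(1-p_0)/(1-p_1)$ is close to 1, which is why OR approximates RR at low incidence.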
By the time the relative risk (RR) [13,14] has been calculated from a cohort study, the grouping required for a case-control study has already been completed. This means that all data suitable for calculating RR in a cohort study can also be used for a case-control study (RR is always accompanied by OR; see Examples 1 and 2). A case-control study can also use the full information of the entire source population during a risk period. In this case, the two methods demonstrate correlation or causality with the same strength and quality. RR is a result of cohort studies as well as of case-control studies that use natural population data. However, in the past, when prospective surveys were conducted using natural population data, people used to carry out cohort studies to calculate RR and even considered that RR could only be the result of a cohort study, ignoring or denying the objective fact that a case-control study exists at the same time.
Table 3. The relationship between body mass index and blood pressure (body mass index and diastolic blood pressure both graded)
It is inappropriate to say that a cohort study is objectively superior to a case-control study. On the contrary, many data that are unsuitable for a cohort study because of incompleteness (non-natural population) can still be analyzed in a case-control study (OR can be independent of RR), for example in a hospital-based case-control study. This is the advantage of the case-control study (flexible and widely applicable). In such situations, limitations of the survey data or poor implementation may reduce the strength and quality of a case-control study, and the misconception that OR is not as good as RR, or that a case-control study is not as good as a cohort study, may stem from this. People are accustomed to comparing prospective cohort studies in natural populations with retrospective case-control studies in non-natural populations: the advantages of natural populations and prospective data are credited to cohort studies, while the disadvantages of non-natural populations and retrospective data are imposed on case-control studies. Some current discussions of the shortcomings of case-control studies concern limitations of the survey data or poor implementation [16][17][18][29], not an inherent flaw in case-control studies themselves. In fact, a cohort study is a factor-referent study (called a factor-control study by some scholars [19]) in which survey data are collected over a certain period. Factor-control studies, case-control studies (case-referent studies [2]) and the like belong to one classification system, while prospective, retrospective and cross-sectional studies belong to another [15]. That is, both factor-control studies and case-control studies can be prospective, retrospective or cross-sectional. The factor-control study and the case-control study stand side by side, and the cohort study is one part of the factor-control study [19].
It is evident that cohort studies and case-control studies are both correlation analysis studies. However, cohort studies, grouped by exposure factors, are applicable only to natural populations (the denominator of the cases is known; some scholars call this the "primary source population" [18]), whereas case-control studies, grouped by disease, are applicable not only to natural populations but also to non-natural populations (the denominator of the cases is unknown; some scholars call this the "secondary source population" [18]).
A case-control study coexists within any cohort study that calculates RR. In case-control studies using such data, however, sampling is often applied as a further step. The basic characteristic of such a case-control study is that the case group consists of all the cases in a cohort, while the reference group is a sample of the non-cases of the same cohort. The reference group can be drawn from survivors (cumulative sampling), the source population (case-cohort sampling), person-years (density sampling), etc. [11,18]. Since all cases are used, as long as the control group is selected according to statistical requirements, the sampling error and the change in the OR confidence limits relative to the cohort study are very small (the OR confidence limits may widen slightly; some scholars describe this as "with only a slight reduction in precision" [2,18]), while the sample size can be reduced drastically (greatly improving efficiency) and RR can still be estimated. The rarer the disease, the more pronounced the reduction in sample size. Therefore, case-control studies based on sampling from a cohort, including nested case-control studies [5,13,14], can be considered more efficient forms of cohort study. The "cohort" is not exclusive to cohort studies (or factor-control studies); case-control studies can also establish a "cohort".
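The trade-off between sample size and precision described above can be illustrated with a small calculation. This is a sketch under stated assumptions, not the authors' analysis: the cohort counts are invented for a rare disease, controls are drawn by cumulative sampling of non-cases at four per case, expected (non-integer) control counts stand in for a random draw, and `log_or_se` is a hypothetical helper using the usual large-sample standard error of log(OR).

```python
import math

def log_or_se(a, b, c, d):
    """Large-sample standard error of log(OR) for a 2x2 table."""
    return math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# Hypothetical cohort (not from the source): a, c = exposed/unexposed cases,
# b, d = exposed/unexposed non-cases.
a, b, c, d = 30, 9970, 10, 9990

# Keep all cases; sample 4 controls per case from the non-cases.
n_controls = 4 * (a + c)
fraction = n_controls / (b + d)
b_s, d_s = b * fraction, d * fraction  # expected control counts, not a random draw

or_full = (a * d) / (b * c)
or_sampled = (a * d_s) / (b_s * c)
se_ratio = log_or_se(a, b_s, c, d_s) / log_or_se(a, b, c, d)

print(f"subjects: {a + b + c + d} -> {a + c + n_controls}")
print(f"OR full = {or_full:.3f}, OR sampled = {or_sampled:.3f}")
print(f"SE(log OR) ratio, sampled/full = {se_ratio:.3f}")
```

With these made-up counts the expected OR is unchanged, because the exposure odds among the sampled controls match those among all non-cases, while the number of subjects drops from 20,000 to 200 and the standard error of log(OR) grows by less than 10%.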
Generally, for diseases with low incidence, case-control studies sampled from a natural population need smaller sample sizes than cohort studies with the same precision requirements, and the rarer the disease, the more pronounced this is. Some scholars have conceptualized case-control studies as streamlined versions of cohort studies [2].
When the same population data are used for a cohort study or a case-control study and the statistical requirements are met, the ORs are essentially the same, differing only by sampling error. The results of cohort studies and case-control studies can therefore be compared and evaluated by OR values and their confidence limits (cohort studies and case-control studies can be included in the same meta-analysis [33]).
Which survey indicators can be obtained is determined by the survey data. Case-control studies using natural population data can still obtain or estimate incidence and RR. Case-control studies with incomplete data (non-natural population data) can yield only OR and are not suitable for calculating RR; cohort studies cannot help in such conditions either (and may not even be feasible). The strength and quality with which RR and OR demonstrate correlation or causality also depend on the nature of the survey data and on how they were acquired: for the same disease and exposure factor, the OR and estimated RR obtained from a prospective case-control study carried out strictly according to its statistical design are of higher strength and quality than the RR and OR obtained from a poorly implemented retrospective cohort study.

Conclusions
To sum up, a cohort study is a factor-referent study (factor-control study) with a time span. In essence, both the cohort study and the case-control study (better called the case-referent study) are correlation analyses between exposure factors and disease outcomes. RR and OR can be regarded as different transformations of r. Cohort studies are applicable to natural populations, while case-control studies are also applicable to non-natural populations. A case-control study can have a cohort and use the full information of the entire source population during a risk period; a case-control study based on sampling from a cohort can be considered a more efficient form of cohort study. The results of