NOVEMBER 1961    BULLETIN 336

A Systematic Procedure for Determining Potent Independent Variables in Multiple Regression and Discriminant Analysis

AGRICULTURAL EXPERIMENT STATION, AUBURN UNIVERSITY
E. V. Smith, Director    Auburn, Alabama

CONTENTS

SUMMARY ... 3
INTRODUCTION ... 5
REVIEW OF LITERATURE ... 7
SIMILARITIES AND DIFFERENCES IN REGRESSIONS AND DISCRIMINANTS ... 8
PROCEDURES FOR SOLVING REGRESSIONS ... 10
    OBTAINING THE SUMS OF SQUARES AND PRODUCTS ... 10
    CODING THE SUMS OF SQUARES AND PRODUCTS ... 12
    THE SOLUTION PROPER ... 13
    DELETION OF A VARIABLE ... 17
IDENTIFYING THE MOST POTENT VARIABLES IN REGRESSION ... 19
    SOME CONSIDERATIONS IN CHOOSING A METHOD ... 19
    GENERAL PROCEDURE ... 22
        To Find the Most Potent Single Variable ... 22
        To Find the Second Most Potent Variable ... 23
        To Find the Third Most Potent Variable ... 24
        Construction of a Mask to Aid in Computations ... 24
        To Find the Fourth and Other Most Potent Variables ... 25
TESTING SIGNIFICANCE OF THE VARIABLES ... 25
USING REGRESSION PROCEDURES IN DISCRIMINANT ANALYSIS ... 26
NUMERICAL EXAMPLES ... 30
    NUMERICAL EXAMPLE OF REGRESSION ... 30
        Finding the Most Potent Single Variable ... 31
        Coding ... 32
        Finding Other Potent Variables after the Most Potent ... 33
    NUMERICAL EXAMPLE OF A DISCRIMINANT ... 39
OTHER TESTS OF SIGNIFICANCE ... 49
LITERATURE CITED ... 56
TABLES 1-18 ... 58

FIRST PRINTING, NOVEMBER 1961

SUMMARY

A method is presented for finding which few of a large number of independent variables are the most potent predictors of some dependent variable Y in the case of a multiple regression, or are the most potent discriminators in the case of a discriminant function. The most potent variable is defined as that independent variable most closely related to the dependent variable; the second most potent variable as that variable which, together with the most potent variable, makes the pair of independent variables most closely related to the dependent variable; the third most potent variable as that independent variable which, together with the most potent pair, makes the trio of variables most closely related to the dependent variable; and so on.

The success of the method rests on the intuitively reasonable idea that variables chosen according to the above definition cannot form a much poorer set than the absolutely most potent set; and, on the basis of some practical experience, no sets have yet been uncovered that were much better than those thus chosen. The practicability of the method rests primarily on the fact that, under the proposed scheme, it is necessary to examine only n regressions with one independent variable, n - 1 regressions with two independent variables, n - 2 regressions with three independent variables, and so on, up to a maximum of usually not over n - 5 or n - 6 regressions with five or six independent variables. On the other hand, to obtain the absolutely most potent set would require examination of n!/1!(n - 1)! regressions with one independent variable, n!/2!(n - 2)! regressions with two independent variables, n!/3!(n - 3)! regressions with three independent variables, and so on to n!/5!(n - 5)! or n!/6!(n - 6)!.
A further advantage is that only those sums of products involving Y and/or the independent variables actually chosen have to be computed, whereas finding the absolutely most potent set requires that all sums of products among the variables be computed.

The only other process that could be proposed as leading to the desired set of variables would be to work the regression with all n independent variables, identify and discard the weakest; work the regression with the n - 1 remaining independent variables, identify and discard the weakest; work the regression with n - 2 variables; and so on. This process does not yield the set of absolutely most potent variables either, yet it requires that all sums of squares and products be computed. Furthermore, its computing load is exceedingly heavy compared with that of the proposed method if there are more than about 10 independent variables. Experience indicates that rarely are as many as five or six potent variables finally selected.

A systematic procedure with computational checks and some devices for reducing duplication of computations is described in detail and illustrated with worked examples. Computations and tests of reliability for many of the summarizing statistics of multiple regression are also described.

A Systematic PROCEDURE for DETERMINING POTENT INDEPENDENT VARIABLES in MULTIPLE REGRESSION and DISCRIMINANT ANALYSIS*

E. FRED SCHULTZ, Biometrician**
JAMES F. GOGGANS, JR., Associate Forester

* The investigations leading to this report were supported by Hatch and State Funds and carried out cooperatively by the Departments of Forestry and Botany and Plant Pathology. The data were gathered as a part of Alabama Project 509.
** Resigned.

Researchers and their statistical advisors are often confronted with the problem of determining the relative potency of a large number of variables in accounting for the behavior of some dependent variable or in discriminating between two discrete groups. Their problem usually is not to assess the potency of all the variables singly, though this may be a beginning, but to find some satisfactorily small number of variables that will explain some satisfactorily large portion of the variability in the dependent variable, or discriminate satisfactorily between the groups. One would like the chosen set to explain more variability, or discriminate more certainly, than any other set with this many or fewer variables. Such a set can be described as the absolutely most potent set.

To ensure finding the absolutely most potent set of r variables out of n would require that all possible multiple regressions or discriminants with r predicting or independent variables be evaluated, with the set accounting for the most variability being chosen. The number of such sets is

C(n,r) = n!/[r!(n - r)!],

the number of combinations of n things taken r at a time, where n is the total number of variables to be examined and r is the number of variables to be allowed as predictors in the multiple regression or discriminant at any one time. Even the procedure of examining all the C(n,r) sets with r variables does not ensure, however, that some set with r - 1 or r - 2 variables would not do substantially as well. To ensure that this situation does not arise would require evaluating all the possible regressions or discriminants with r or fewer variables. Another possibility, that inclusion of one more predictor variable would yield a considerably better prediction equation, can be examined only by evaluating all possible regressions or discriminants with r + 1 variables, for all possible values of r. This amounts to evaluating all possible regressions, of which there are

Σ (r = 1 to n) C(n,r) = 2^n - 1.

If the number of variables of possible predicting value is at all large, say above 10, and especially if the final multiple regression or discriminant is to be allowed as many as 4 or more independent variables (if shown to be necessary or desirable by examination of all regressions or discriminants with fewer independent variables), it is apparent that the number of multiple regressions or discriminants to be evaluated would be so large as to make such a search economically prohibitive in many studies.
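The scale of this contrast is easy to tabulate. The following sketch (Python; the function names and the illustrative values of n and r are ours, not the bulletin's) counts the regressions examined under the stepwise scheme of this report and under the exhaustive search:

```python
from math import comb

def stepwise_count(n, r):
    """Regressions examined by the proposed method: n with one variable,
    n - 1 with two, ..., n - (r - 1) with r variables."""
    return sum(n - k for k in range(r))

def exhaustive_count(n, r):
    """Regressions needed to guarantee the absolutely most potent set of
    every size up to r: C(n,1) + C(n,2) + ... + C(n,r)."""
    return sum(comb(n, k) for k in range(1, r + 1))

# For example, 20 candidate predictors with at most 6 finally retained:
print(stepwise_count(20, 6))    # 105 regressions
print(exhaustive_count(20, 6))  # 60,459 regressions
```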
However, it is possible, and with much less labor, to find a set of variables that, though not guaranteed to be the absolutely most potent set, does have some probability of being so. In any event, the set may be regarded as most potent in the following sense: The absolutely most potent single predicting or discriminating variable is identified and selected. Following this selection, the most potent pair of variables of which one is the previously chosen most potent single variable is identified and selected. In the next selection, the most potent trio of variables is identified and chosen, two of which are the previously chosen pair. In the case of regression, this procedure continues until a satisfactorily large portion of the variability of the dependent variable is accounted for, or until additional variables do not account for a significant amount of the remaining variability. In the case of a discriminant, the procedure is followed until a satisfactory discriminating equation is obtained or until further variables do not significantly improve the discriminant function.

When it is realized from the start that such a search is to be made, it is possible to reduce the work further by systematization and short cuts. The purpose of this bulletin is: (1) to describe such a systematic search for potent predicting or discriminating variables, (2) to emphasize that a single computing procedure serves for both regression and discriminant, and (3) to bring together in one place directions for all the computations, operations, and tests necessary for such a search. It is intended that these directions be in sufficient detail to serve as computing instructions to a group whose members are not highly trained in statistics, but who in the aggregate account for much of the practical use made of statistical procedures. Examples of such persons are researchers with some but limited experience in statistics and mathematics, graduate students in fields other than statistics and mathematics, and clerks who must sometimes function with only very sketchy directions from the researcher whose data they process. The directions are specifically for use with desk calculators, although they could perhaps be adapted to other types of calculators.¹

REVIEW OF LITERATURE

The literature concerning multiple regressions is voluminous and scattered throughout journals and textbooks in many fields of science. For this reason the authors make no pretense of having made a thorough search of all the literature to determine whether the procedure outlined in this bulletin, or a similar procedure, has been previously proposed.
Since this report is directed primarily toward users desiring computing instructions rather than development of theory, most references are to textbooks rather than journal articles. In most textbooks the discussion of multiple regression and discriminant function analysis is limited to finding the regression equation or discriminant function, testing significance, and interpreting results, assuming that there is no uncertainty about the choice of independent variables to be used. Usually there is little or no discussion of the problem of finding the best possible predicting variables. The reader is left to assume that determining the variables to be put in the regression is not a statistical matter. Some awareness of the larger problem of choosing the best predictor variables is acknowledged, however, as in the discussions on net and standard partial regression coefficients, Ezekiel (7), Croxton and Cowden (4), Mills (15), and Snedecor (19); on partitioning the total "determination" or sums of squares due to regression for several variables, Hendricks (12), Anderson and Bancroft (1), Goulden (11), Wert, Neidt, and Ahmann (23), Croxton and Cowden (4), Mills (15), and Snedecor (19); and on deleting or omitting variables, Villars (22), Anderson and Bancroft (1), Rao (18), Goulden (11), Wert, Neidt, and Ahmann (23), Friedman and Foote (9), and Snedecor (19).

¹ As this bulletin was being prepared for publication, a computer program (Multiple Regression by Stepwise Procedure) that would make the procedures of this report applicable to electronic computers was listed by Leone (13a).

The finding of a regression equation or discriminant function requires the solution of simultaneous equations. There are several methods of solution, with many minor variations extant in the literature and textbooks. The abbreviated Doolittle solution is of particular interest. It has been described by Dwyer (5, 6), Peach (16), Anderson and Bancroft (1), Goulden (11), and Friedman and Foote (9).² Procedures in discriminant analysis are discussed by Cox and Martin (3), Mather (14), Fisher (8), Rao (18), Goulden (11), Quenouille (17), Tippett (21), Wert, Neidt, and Ahmann (23), and Bennett and Franklin (2).

² Since the first draft of this manuscript, the authors have become aware of an exposition by Kramer (13) embodying some of the same computational features of this report.

SIMILARITIES AND DIFFERENCES IN REGRESSIONS AND DISCRIMINANTS

Regression analysis is widely known and used in research. Many research people have operating knowledge of some portion of the technique and associated computational procedures. This does not hold, at least to the same degree, for discriminant analysis, even though practically identical computational procedures may be used for the two techniques. For this reason it seems desirable to give a brief general description of a discriminant function and its use. This is done by comparing two situations, one suitable for a regression and the other suitable for a discriminant.

An educator might wish to know what items of information about students entering college would be useful in predicting degree of success or achievement during the freshman year, the degree of success in this situation being commonly measured by overall average grade for the year. A typical regression study might call for information on such potential independent or predictor variables as average high school grade, IQ, age, college entrance examination grades, and education of parents in order to investigate their effectiveness as predictors of average grade during the freshman year.
Many state universities are required by law to admit all applicants with a diploma from an accredited high school within the state. In such cases 10 to 20 per cent of enrolling freshmen may not finish the year and, thus, would not have an average grade. In such universities this early attrition is a serious problem; thus, other educators might wish to know what items of information about entering college students would be useful in discriminating between those students who will drop out and those who will remain. The researcher might investigate the same set of independent variables as for the regression study (high school grade, IQ, age, college entrance examination grades, and education of parents) in order to determine their effectiveness in discriminating between the two types of students.

The distinction between the two cases lies wholly in the nature of the dependent variable, Y. In the former, or regression, case the dependent variable, success, is a continuous variable taking infinitely many values. In the latter, or discriminant, case the dependent variable is a discrete variable taking two forms only; the student either finishes or does not finish the year.

It is quite possible that the two investigators might each decide on the same set of r predictor variables. It might turn out that each investigator has records on p students, and it could even be that some of the students are common to both studies. Suppose that the number of independent variables, r, is 4, as here listed: high school grade = X_1, IQ = X_2, college entrance exam grade = X_3, and parents' education = X_4.

The problem of regression is to obtain the constants and coefficients of the regression or prediction equation,

Ŷ = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4,

such that the sum of squares of deviations of actual average grade, Y, from the predicted grade, Ŷ, is minimized; i.e., Σ(Y - Ŷ)² is less than with any other prediction equation that might be suggested or used.

The problem of the discriminant function is to find the coefficients of the discriminant or discriminating equation,

Z = λ_1 X_1 + λ_2 X_2 + λ_3 X_3 + λ_4 X_4,

such that, if a value of Z is computed for every student, the average difference between the Z values of the two groups, D = Z̄_I - Z̄_II, is maximized. If this is done,

D² / [Σ_I (Z - Z̄_I)² + Σ_II (Z - Z̄_II)²]

is a maximum. Equally, the t or F of a test of significance of a difference between the two groups of Z-values is greater than with any other equation that might be suggested or used.

Changing the forms of the equations, squaring, summing over all individuals, partially differentiating, and equating to zero yields in each case a set of simultaneous equations that must be solved to obtain the needed coefficients; see Goulden (11) and Bennett and Franklin (2). The simultaneous equations in regression analysis are:

(Σx_1²)b_1 + (Σx_1x_2)b_2 + (Σx_1x_3)b_3 + (Σx_1x_4)b_4 = Σx_1y
(Σx_2x_1)b_1 + (Σx_2²)b_2 + (Σx_2x_3)b_3 + (Σx_2x_4)b_4 = Σx_2y
(Σx_3x_1)b_1 + (Σx_3x_2)b_2 + (Σx_3²)b_3 + (Σx_3x_4)b_4 = Σx_3y
(Σx_4x_1)b_1 + (Σx_4x_2)b_2 + (Σx_4x_3)b_3 + (Σx_4²)b_4 = Σx_4y.

The simultaneous equations in discriminant analysis are:
(Σx_1²)λ_1 + (Σx_1x_2)λ_2 + (Σx_1x_3)λ_3 + (Σx_1x_4)λ_4 = d_1
(Σx_2x_1)λ_1 + (Σx_2²)λ_2 + (Σx_2x_3)λ_3 + (Σx_2x_4)λ_4 = d_2
(Σx_3x_1)λ_1 + (Σx_3x_2)λ_2 + (Σx_3²)λ_3 + (Σx_3x_4)λ_4 = d_3
(Σx_4x_1)λ_1 + (Σx_4x_2)λ_2 + (Σx_4x_3)λ_3 + (Σx_4²)λ_4 = d_4.

One may observe that the two sets of equations are identical in form. The left hand sides are identical except for differences in symbols: the unknowns b_1, b_2, b_3, and b_4 of the simultaneous equations for regression occupy the same positions as the unknowns λ_1, λ_2, λ_3, and λ_4 of the simultaneous equations for discriminants. The only difference is on the right hand side, and even this difference does not affect the solution procedure once the correct quantities are entered. This means that the same procedures may be used in discriminant analysis and a search for potent discriminators as in regression analysis and a search for potent predictors, with certain minor modifications.

PROCEDURES FOR SOLVING REGRESSIONS

OBTAINING THE SUMS OF SQUARES AND PRODUCTS

Since the development of a multiple regression or discriminant function demands the solution of simultaneous equations, and since the practicability of the method to be presented for finding potent variables depends in part upon the method used for solving the simultaneous equations, the particular modification of the abbreviated Doolittle solution used at this station is given in Tables 1 and 2 (pages 58-61) for a multiple regression with four independent variables, X_1, X_2, X_3, and X_4, and one dependent variable, Y, called X_5 here for convenience and for relating these instructions to those for multiple correlation as given in many statistical textbooks. The procedure may readily be extended or reduced for the cases of more or fewer variables.

Table 1 indicates the steps in obtaining and coding the sums of squares of deviations and sums of products of deviations that are the coefficients of the b-values in the four simultaneous equations. The variable X_6 (used for checking purposes) is the sum of the values of the four possible predictor variables plus the dependent variable. When the sums of squares and products are arranged as in Table 1, the sums of squares lie along the diagonal and the sums of products are symmetrically distributed about the diagonal; that is, Σx_1x_3 = Σx_3x_1, Σx_2x_5 = Σx_5x_2, etc. Thus it is possible to effect some sizable savings in work by listing only one side of the diagonal. This may lead to some confusion in operations that call for the sum over columns of all the values in a particular row. The confusion may be abated somewhat by adding the values from right to left, remembering that when one reaches the diagonal, the remaining values were omitted from the row because they have already appeared in the column above the diagonal value.

The following definitions hold throughout this report: X is an observation; X̄ is a mean value; x = X - X̄ is a deviation; and C_ij is a correction factor to subtract from a sum of products or squares of observations, ΣX_iX_j, in order to yield the desired sum of products or squares of deviations, Σx_ix_j. From the foregoing definitions, Table 1 should be self-explanatory to persons with a little experience in statistical analysis, except for the column of X_i code, the row of X_j code, and the check values in the last column.
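For readers working with a computer rather than a desk calculator, the construction of Table 1, including the check variable X_6 and the primary check described next, can be sketched in a few lines. This is a minimal illustration (Python with NumPy; the function name and array layout are our own, not part of the bulletin's procedure):

```python
import numpy as np

def sums_of_products(X):
    """Build the body of Table 1 for an N x 5 data array (X1..X4 and Y as
    X5): append the check variable X6 = X1 + ... + X5, take deviations
    from the column means, and form S[i, j] = sum of x_i * x_j."""
    X6 = X.sum(axis=1, keepdims=True)        # check variable
    d = np.hstack([X, X6]).astype(float)
    d = d - d.mean(axis=0)                   # deviations from the means
    S = d.T @ d                              # every sum of products at once
    # deviation form of the primary check: each sum x_i x_6 equals the
    # sum of row i over the five original variables
    assert np.allclose(S[:5, 5], S[:5, :5].sum(axis=1))
    return S
```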
To obtain the values in the check column, one must create for each and every set of values X_1, X_2, X_3, X_4, and X_5 a sixth value, X_6 = X_1 + X_2 + X_3 + X_4 + X_5. Thereafter this value is treated as an additional variable. The primary check is afforded in that each ΣX_iX_6, as calculated from the data and entered in Table 1, is equal to the sum over all the different values of j (columns) of those ΣX_iX_j in the ith row of products of observations. Remember that missing values in the ith row can be found in the jth column, i = j for rows and columns that meet at the diagonal. Thus:

ΣX_iX_6 = Σ_j ΣX_iX_j, [1]

and as an example, if i = 3,

ΣX_3X_6 = ΣX_1X_3 + ΣX_2X_3 + ΣX_3² + ΣX_3X_4 + ΣX_3X_5. [1a]

The next check operation checks all three lines together, as follows:

C_i6 + Σx_ix_6 = Σ_j (C_ij + Σx_ix_j) = ΣX_iX_6. [2]

As an example, if i = 3,

C_36 + Σx_3x_6 = (C_13 + Σx_1x_3) + (C_23 + Σx_2x_3) + (C_33 + Σx_3²) + (C_34 + Σx_3x_4) + (C_35 + Σx_3x_5) = ΣX_3X_6. [2a]

CODING THE SUMS OF SQUARES AND PRODUCTS

The sums of squares and products should be coded by powers of 10, which is merely a matter of shifting decimal points, with the object of bringing the diagonal terms to values between 0.1 and 10.0 and other terms to values close to this range. This procedure gives the advantage of a uniform number of decimal places in the use of the calculating machines without losing significant figures. Coding by dividing each Σx_ix_j by √(Σx_i²) √(Σx_j²) will also accomplish these same objectives, bringing the diagonal terms to unity and other terms to values lying between one and minus one (simple correlation coefficients or r values), and will also facilitate calculation of partial correlation coefficients and partial regression coefficients. However, coding by powers of 10 is very much quicker and easier; partial coefficients are needed for only a very few of all the multiple regressions solved, and when needed they can still be found fairly easily, even after coding by powers of 10.

After subtraction of the C_ij term from ΣX_iX_j to yield Σx_ix_j, the coding factors m_i and m_j are applied to yield coded values of Σx_ix_j lying near the range 0.1 to 10.0. These are designated as a_ij. They are the elements of the information matrix that will be used in the abbreviated Doolittle solution proper. The code values m_i and m_j are determined by the size of the Σx_ix_j along the diagonal, where Σx_ix_j = Σx_i² because i = j. The determination is made in the following manner: For each Σx_i², choose that even power of 10 which, when multiplied by Σx_i², will yield a value between 0.1 and 10.0. Take the square root of this even power of 10 and designate it as m_i. The coding factor m_j is the same as m_i when i = j. Enter this value in the appropriate row (i) and in the appropriate column (j) at the margins. The coding values are used as multipliers for the Σx_ix_j, as explained in the stub for each row of a_ij values, Table 1. For example, if Σx_3² = 51,273.6, the even power of 10 which, when multiplied by 51,273.6, yields a value between 0.1 and 10.0 is 10^-4, or 0.0001. The square root of this number is 10^-2, or 0.01; thus m_3 = 0.01 in the X_i code and m_3 = 0.01 in the X_j code. This type of coding is equivalent to coding the original X_3-values by multiplying by the appropriate m_i; in this example, equivalent results could be obtained by multiplying each X_3 by m_3, or 0.01.
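The choice of m_i can be mechanized. The sketch below (Python; the function name and the exponent rule's arithmetic form are our own rendering of the rule just stated) picks, for each diagonal term, the even power of 10 that brings it into the working range and takes its square root as the coding factor:

```python
import numpy as np

def coding_factors(S):
    """For each diagonal term of S (a sum of squares), choose the even
    power of 10 bringing it between 0.1 and 10.0; m_i is the square root
    of that power.  Returns the coded matrix a_ij = m_i * m_j * S_ij."""
    m = np.empty(len(S))
    for i, ss in enumerate(np.diag(S)):
        e = 2 * np.ceil(-np.log10(ss) / 2)   # even exponent: 0.1 <= ss*10**e < 10
        m[i] = 10.0 ** (e / 2)               # m_i = sqrt(10**e)
    return np.outer(m, m) * S, m

# Example from the text: if S[2, 2] = 51273.6 (the sum x_3^2), the even
# power chosen is 10**-4, so m_3 = 10**-2 = 0.01 and a_33 = 5.12736.
```

Borderline values that code just outside the range can be adjusted by hand to another even power of 10, as the text notes later for whole rows.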
THE SOLUTION PROPER

Table 2 outlines the procedural steps of an abbreviated Doolittle solution. The quantities a_ij entered in the first 5 rows and columns of Table 2 are the coded sums of squares and products of deviations as calculated in Table 1. The symbols in all other cells are directions for computing the values belonging in those cells. The h values in the check column are, as indicated, the sums of the a_ij values in that row, remembering that values not in a row may be found in the column turning up at the diagonal value.

The "forward" solution results in as many pairs of rows of values, A_ij and B_ij, as there are variables; thus, in Table 2 there are five pairs of rows. It is one of the labor-saving features of this solution that the A and B values can be calculated in pairs. This feature exists because any B_ij is equal to the corresponding A_ij divided by the leading (diagonal) A_ii of that row; hence the division may be made before the A_ij is cleared from the machine, without having to re-enter the number in the machine later. There is also somewhat less rounding error introduced by this procedure than by copying the number and later re-entering it in rounded form.

Since many of the A_ij values are negative, it should be pointed out that this result can be identified by the string of nines appearing in the add result dial of the calculator. The negative value desired is the complement of this number (the number which, when added to the result, causes all numbers, including the nines, to change to zeros). If this complement is entered on the keyboard and the keyboard locked so that the keys do not clear on depressing the add bar (or multiplying by unity), then a single depression of the add bar (or accumulative multiplication by one) will show all zeros in the result dial, verifying the complement. A second depression (or accumulative multiplication) will show the complement itself in the result dials, at which time it may be copied as the desired A_ij and given its negative sign. Since it is in the proper dial of the machine for division, it may now be divided by the leading A_ii of that row and the result recorded as the desired B_ij. As checks on accuracy it may be noted that every value of A_ij on the diagonal must be positive; consequently every pair of values, A_ij and B_ij, must have the same sign.

The cells of Table 2 show that all the A values, except those of the first row, A_1j, must be computed by subtracting one or more products, A_ii' B_ij, from some a_i'j. Both factors of any such product have the same first subscript, i, because they belong to the same pair of lines. The second subscript of B is the number of the column for which the particular A_i'j is being computed, and the second subscript of A is the number of the row for which the A_i'j is being computed, which is also the number of the column at the diagonal value of the row from which the original a_i'j was obtained. Thus any A_i'j is given by

A_i'j = a_i'j - Σ_i A_ii' B_ij, [3]

where i < i'. As an example,

A_34 = a_34 - A_13 B_14 - A_23 B_24, [3a]

where i = 1, 2. In computing values of A_i'j on the diagonal, the products A_ii' B_ij must come from values in the same column, since the column for which A_i'j is being computed and the column at the diagonal of this row are the same column.

Except for rounding discrepancies, each of the products A_ii' B_ij may be calculated equally well as the product of the B corresponding to the A actually used and the A corresponding to the B actually used. As examples: A_23 B_24 = B_23 A_24 and A_34 B_35 = B_34 A_35. The pair nearer each other in size, disregarding sign, will yield the result with less rounding error. This leads to a rule: Examine both ways of calculating each product and subtract the one obtained from the A and B values nearer each other in magnitude, disregarding sign.

The operations are carried forward in the machine with nothing being entered on paper except the final results, A_ij and B_ij. To find A_3g, for example, if working to 8 decimal places, a_3g = g_3 is entered with the decimal at the 16th place, followed by subtraction of the products A_13 B_1g (or B_13 A_1g) and A_23 B_2g (or B_23 A_2g), each entered with 8 decimal places. This manipulation can be done by use of the cumulative negative multiplication procedure. Whether the values will actually be cumulatively subtracted or cumulatively added is decided by considering the signs of all processes and quantities and following the algebraic rules for handling signs.
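As a cross-check on the hand routine, the forward solution can be expressed compactly in code. This is a minimal sketch (Python/NumPy; the arrangement and names are ours): it forms the A and B rows in pairs by [3], carrying Y's coded column g along as a final column, just as in Table 2 (the check column h is omitted here):

```python
import numpy as np

def doolittle_forward(a, g):
    """Forward abbreviated Doolittle solution.  a: k x k coded matrix of
    sums of squares and products; g: coded Y column.  Returns the paired
    rows A and B, with the Y column appended as column k."""
    k = len(a)
    M = np.hstack([a, np.reshape(g, (-1, 1))]).astype(float)
    A = np.zeros_like(M)
    B = np.zeros_like(M)
    for i in range(k):
        for j in range(i, k + 1):
            # A_{i'j} = a_{i'j} - sum over i < i' of A_{ii'} B_{ij}   [3]
            A[i, j] = M[i, j] - sum(A[p, i] * B[p, j] for p in range(i))
        B[i] = A[i] / A[i, i]   # the whole B row follows from its A row
    return A, B
```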
The pair nearer each other in size, disregarding sign, will yield the result with less rounding error. This leads to a rule: Examine both ways of calculating each product, A iBi;, and subtract the one that is obtained by the A and B values nearer each other in magnitude disregarding sign. The operations are carried forward in the machine with nothing being entered on paper except the final results, Ai. and B1 ;. To find A 3,, for example, if working to 8 decimal places, a30 = g3 is entered with decimal at 16th place followed by subtracting the products of A13B 10 (or B13 A 13) and A2 3B20 (or B 23A 2 0 ) each entered with 8 decimal places. This manipulation can be done by use of the cumulative negative multiplication procedure. Whether the values will actually be cumulatively subtracted or cumulatively added will be decided by considering the signs of all processes and quantities and following the algebraic rules for handling signs. DETERMINING POTENT INDEPENDENT VARIABLES 15 As each pair of lines is completed, there is a check on arithmetic accuracy-each A,h = Ai 0 + E A1 and each B, = Bg + E B,,. [4] The vacant cells of these rows are omitted in summing, not because of symmetry, but because each A1 ; or B,; cell below the diagonal has the value zero. For example: A 3,, = A3 A + A 34 +A 33 . [4a] As soon as all the pairs of lines Ai and B 1 (there is a pair for each variable) have been computed, it is possible to evaluate the success of the multiple regression in explaining variability (or of the discriminant in discriminating). The residual sum of squares in Y not explained by the independent variables Xi is the quantity A,g in the X5 = Y column. A,, is the remainder from the original coded sum of squares of Y or gg, after subtracting the sum of squares of deviations in Y accounted for by the regression. This latter quantity, called the regression or reduction sum of squares, is represented by the symbol f2 and is given by S 2 = Z AB,, = A1 B, + A2gB 2 g + A 3gB 3g + A 4gB, 42 [5] where A 10B 1 is the sum of squares due to X ; A 20B 20 is the sum of squares due to X 2 independent of X 1; A3 B 3 is the sum of squares due to X 3 independent of X, and X 2; and A 4gB 4g is the sum of squares due to X 4 independent of X1 , X 2, and X 3. The coefficient of multiple determination R 2, the proportion of the sum of squares of deviations in Y accounted for by the regression, is given by 1 R - 2 2 Decoded E=2 E 2 [6] [6] The coefficient of multiple regression R is the square root of this value. If one's interest is in these measures only, then just this much of the solution (the so called "forward" solution) is needed. It is possible now to perform a "back" solution for the bi values (regression coefficients) and also a "back" solution for the table of c 1 values, which are actually an inverse matrix of the matrix of ai; values. Ordinarily one would not do both. After the b values have been found (either directly as in column 16 ALABAMA AGRICULTURAL EXPERIMENT STATION X, = Y Table 2, or by means of the c , bottom of Table 2), the regression equation can be written, Y = Y + bixi,[7] where Y is the predicted value of Y, i is the average observed value of Y and xi is the deviation in X;, x2 = X - X. This equation can be rewritten as ? = f7Z bx, + b1X, + b2 X 2 + b 3X 3 + b4X 4 . [8] It is also possible now to find the sum of squares due to regression in another way, S2= big, = bg, + b2g + b3 g3 + b4 g . 
Ordinarily the c matrix is calculated if one is interested in tests of significance other than the R test or F test of the total reduction due to the several independent variables; in such a case the b values would be calculated from the c values. If one desires the regression coefficients but does not care to make any statistical tests other than R or F tests of reduction due to regression, the simplest method of calculation is that given in column X_5 = Y. A check on these computations is afforded at the corresponding positions in the column X_6 = check: each b computed from the check column equals b_i + 1; thus the check value for b_3 is b_3 + 1. The back solutions start at the bottom and work up; thus b_4 is calculated before b_3, and c_44 is calculated before either c_33 or c_34.

Customary tests of significance may be made without decoding; treat all values as if they were derived from uncoded sums of squares and products, in which case t and F will be the same as if decoded values were used. In calculating Ŷ, an estimated value of Y for a particular set of X values, and its confidence limits, it is probably best to code the X values to be used by multiplying by the appropriate power of 10, m_i, and then use the coded values of b_i and c_ij. The final results can be very easily decoded by dividing them by m_y, the Y code. It is possible, of course, to first decode the b_i and c_ij values and then use actual X and Y values. To decode a b value:

decoded b_i = b_i* = b_i m_i / m_y, thus [10]

decoded b_3 = b_3* = b_3 m_3 / m_y. [10a]

The superscript asterisk is used to denote a decoded value. To decode a c value:

decoded c_ij = c_ij* = c_ij m_i m_j, thus [11]

decoded c_34 = c_34* = c_34 m_3 m_4. [11a]

To decode a sum of squares in Y:

decoded Σŷ² = Σŷ² / m_y². [12]

If it is also desired to calculate the various two-factor standard partial correlation and regression coefficients, it is necessary to have the c or inverse matrix of the matrix of correlation coefficients that would have resulted from coding the sums of squares and products of deviations by division of each Σx_ix_j by √(Σx_i²) √(Σx_j²). This matrix may be easily had from the matrix at hand, since any element, c_ij**, of the inverse of the matrix of correlation coefficients is given by

c_ij** = c_ij m_i m_j √(Σx_i²) √(Σx_j²). [13]

A specific example is

c_13** = c_13 m_1 m_3 √(Σx_1²) √(Σx_3²). [13a]

Some of the quantities that may be calculated and tested using the c values are summarized in a later section. After decoding the b values, there is one more check,

Σ_i b_i* (Σx_ix_j) = Σx_jy, [14]

which merely says that the original simultaneous equations should be satisfied when the solution results are substituted. Thus, remembering that values missing from the ith row are in the column of that same number, we find the check for X_2 in Table 1 to be:

Σx_2y = b_4* Σx_2x_4 + b_3* Σx_2x_3 + b_2* Σx_2² + b_1* Σx_1x_2. [14a]

DELETION OF A VARIABLE

After a regression has been completed, including calculation of the regression coefficients b and the c or inverse matrix, it is possible to determine which independent variable is contributing least to the total regression. This is done by determining for each variable the additional sum of squares that it adds to the regression sum of squares over and above the amount already attributable to the other independent variables. This quantity can be called the sum of squares due to deleting or adding a variable, depending on viewpoint. It is a measure of the variability in the dependent variable explained by the variable in question after all the variability that can be explained by the other variables is discounted. If expressed as a proportion of the variation not explained by the other independent variables, it is the coefficient of partial determination. It may be calculated as

Σŷ²_i.all others = b_i² / c_ii, [15]

which may be read as "the estimated reduction in sum of squares of dependent variable Y due to independent variable X_i when all other X's are held constant." This value may be decoded:

decoded Σŷ²_i.all others = Σŷ²_i.all others / m_y². [16]

The variable contributing least to the regression, of course, is that variable with the smallest sum of squares, Σŷ²_i.all others.

If a variable is deleted, the regression coefficients b change, as do the elements of the c matrix. It is possible to recompute these elements, but it is also possible to obtain them somewhat more rapidly by the following formulas:

b_i after deleting X_k = b_i - (c_ik / c_kk) b_k, [17]

and

c_ij after deleting X_k = c_ij - c_ik c_jk / c_kk, [18]

where the subscripts i, j, k refer to the tabled values existing before deletion (not after). Remember that the table of c values is symmetrical, so that c_ij = c_ji. If it should be desirable to drop, say, X_2, then k = 2, and as examples:

b_1 after deleting X_2 = b_1 - (c_12 / c_22) b_2, [17a]

and

c_34 after deleting X_2 = c_34 - c_23 c_24 / c_22. [18a]
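Formulas [15] through [18] lend themselves to the same treatment. A minimal sketch (Python/NumPy; naming ours) of deleting variable k without re-solving the whole system:

```python
import numpy as np

def delete_variable(b, c, k):
    """Update b and the inverse matrix c after deleting variable k,
    by [17] and [18].  The contribution of X_k given all the others,
    [15], is b[k]**2 / c[k, k]."""
    b_new = b - c[:, k] * (b[k] / c[k, k])             # [17]
    c_new = c - np.outer(c[:, k], c[k, :]) / c[k, k]   # [18]
    keep = np.arange(len(b)) != k
    return b_new[keep], c_new[np.ix_(keep, keep)]
```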
IDENTIFYING THE MOST POTENT VARIABLES IN REGRESSION

SOME CONSIDERATIONS IN CHOOSING A METHOD

With the necessary calculating procedures explained, the primary objective of describing a method for finding a set of potent independent variables can be undertaken. If there should be no more than 6 to 8 independent variables to be examined, it would not be too illogical to work the regression with all independent variables, find and discard the weakest; then work the regression with the weakest omitted, identify and discard the second weakest independent variable; and so on until omission of another variable would significantly reduce the information obtained. This system of dropping nonsignificant variables would result eventually in a set of potent variables. These variables would not necessarily be either the absolutely most potent set or the set that would be found by the method described in this bulletin. As mentioned before, the only sure way of finding the r absolutely most potent variables out of n is to work all the C(n,r) regressions having r independent variables and then choose the best. Any systematic method for finding the variable that is strongest, next strongest, etc., or weakest, next weakest, etc., makes the assumption that all the r most potent variables will be present in the list of r + 1 most potent variables. This does not necessarily happen. Practical experience indicates, however, that sets decidedly better than those discovered by the procedure outlined in this bulletin are rare.

The decision whether to choose the strongest, next strongest, etc., or to eliminate the weakest, next weakest, etc., depends primarily upon the amount of work each will require, the usefulness of the intermediate results, and the psychological attitude engendered by the process. The process of choosing the strongest single variable and then the strongest pair (including the strongest single), etc., has the desirable characteristic that the variable selected as the strongest single variable really is the strongest single variable. The work could be stopped at this stage with this useful piece of information.
However, if the study can afford to introduce a second variable, then the best variable to use with the first one chosen is the one that the proposed method will yield. The process is continued for subsequent variables. The attitude toward a process as straightforward as this should be good. On the other hand, the process of eliminating the weakest variables must proceed through the full discarding process before much information of value is obtained. Further, the most potent single variable might not even be in the list of potent variables retained.

The amount of work involved in the solutions depends upon the number of independent variables to be examined. If there are no more than six or eight variables, with the likelihood that three, four, or five may be accepted, there may be little difference between the two methods. It is the experience of the authors that the number of independent variables to be considered will not be 6 to 8, but 12 to 20, or even more, and that the number of independent variables in the final regression will often be no more than 3 or 4, hardly ever more than 5 or 6. The reason for the large number of independent variables to be examined is that from those variables actually measured, additional variables are created to account for possible curvilinearity and interaction. As an example, consider a study of volume of tree product produced per acre in which only three independent variables were measured: X_1 = height, X_2 = age, and X_3 = number of trees per unit area. Since all three variables might show the phenomenon of diminishing returns, and since age, or X_2, might have cubic effects as well as interactions with both height and number of trees, it is apparent that the independent variables to be investigated are not three but nine in number: X_1, X_1², X_2, X_2², X_2³, X_3, X_3², X_2X_1, and X_2X_3. Such a study was actually made, and the number of independent variables after all such considerations was 12, Goggans and Schultz (10).

It is apparent, especially in a case in which 20 to 40 independent variables are to be investigated with considerable likelihood that not more than 6 will be retained, that the appropriate method is not that of working the regression with all variables, deleting the weakest, and reworking and deleting until only the few best variables remain. In this situation it is much shorter to work all simple regressions, thereby identifying the most potent single variable, and then to work all regressions with two independent variables involving the variable previously identified as most potent. This procedure identifies the most potent pair of variables, subject to the condition that one of them is the previously identified most potent single variable. Extending the procedure involves working all the regressions with three independent variables in which two of the variables are those previously found to be the most potent pair. This sequence results in finding the most potent trio of variables, subject, of course, to the condition that two of these are the pair previously chosen, of which one is the most potent single variable.

The process described can be extended until all the independent variables have been used and ordered. However, it is usually stopped
For this purpose significance might be set at a level of chance, say 0.10, rather than the more conventional levels of 0.05 and 0.01. If there are n independent variables with r finally selected, there are n simple regressions to be evaluated, n - 1 regressions with two independent variables, n - 2 regressions with three independent variables and so on to n - (r - 1) regressions with r independent variables. In contrast with the method of eliminating weakest variables, which requires that the inverse or c matrix be computed so that the weakest variable can be identified, the only information needed about a set of regressions in order to decide which is the most potent is ?, the reduction in E y 2 due to regression. When solving for the variables in order of most potent, next most potent, etc., it is only necessary to carry the abbreviated Doolittle solutions through that part of the solution designated as the "forward" solution, or down to the second horizontal ruling of Table 2. At this stage the reduction due to regression may be calculated by [5] as S = Aigig. After the potent set is chosen, it may be desirable to make various tests of significance and perhaps find confidence limits, but on the chosen set only. The factors that make feasible or practicable such a search as here described are: (1) The end point can be recognized; either a satisfactorily large portion of the variability in Y is explained or further variables do not explain a significant amount of the variability in Y. (2) The number of regressions with r independent variables to be solved is n (r 1) rather than n!/r! (n - r)!. (3) The matrices of simultaneous equations may be solved by the abbreviated Doolittle method and need not be carried farther than the "forward" part of the solution-followed, of course, by calculating the sum of squares in Y attributable to the regression, [5]. (4) It seems from some experience that the number of potent variables will usually not exceed five or six. Thus, the heavy computational load of evaluating regressions with more than five or six independent variables does not seem likely to exist. (5) Only sums of squares, sums of products involving Y, and sums of products between those independent variables finally selected as potent will have to be computed. 22 ALABAMA AGRICULTURAL EXPERIMENT STATION (6) Since it is known from the start that all simple regressions will be examined followed by examination of all two-variable regressions involving some most potent single variable, etc., it is possible to organize and systematize the work to save duplication and, further, to use some mechanical tricks to reduce computations and copying. GENERAL PROCEDURE To Find the Most Potent Single Variable For purposes of illustration assume that there are some definite number of independent variables, say 10, rather than the more general case of n variables. Prepare the outline of a table, such as Table 1, but extended to the case of 10 independent variables, Table 3 (page 62). In this case X,, = Y and X 1 2 = X 1 + X 2 + ... + X 10 + Y. Calculate only E Xi, E x2 in the cells of the diagonal, E xiy in column X 11 = Y, and the single cell X 11 X 12 in the X 12 or check column-see cells indicated by (1), Table 3. The E X, may be checked by EX1 = , EX;. 
To Find the Second Most Potent Variable

If, for example, it turns out that the variable X_6 is the most potent variable, the next step is to complete in Table 3 all cells involving X_6, both in the row for X_6 and in the column for X_6. See values indicated by (2). This row (column) of calculations may be checked by use of [1] and [2]:

ΣX_6X_12 = Σ_j ΣX_6X_j, and

C_6,12 + Σx_6x_12 = Σ_j (C_6j + Σx_6x_j) = ΣX_6X_12.

Coding values, m_i and m_j, should now be established from the diagonal values. Other a_ij values are then calculated for the cells indicated as either (1) or (2) by the relationship

a_ij = (Σx_ix_j) m_i m_j. [20]

If the code fairly consistently yields values in a particular row (column) that are either greater than 10.0 or less than 0.1, the m_i and m_j of this row and column may be changed. Any change should be to the square root of some other even power of 10.

The a_ij values in cells indicated by (1) and (2) are sufficient to solve every multiple regression of two independent variables in which one of the variables is the singly most potent variable, X_6. The regressions are solved by transferring the a_ij to information matrices, carrying the abbreviated Doolittle solution through the "forward" solution, as described in Table 2, and then calculating the reduction due to regression, [5]:

Σŷ² = A_1g B_1g + A_2g B_2g.

This is done for each pair of independent variables that includes X_6. The largest reduction signifies the most potent pair, and the variable other than the most potent is designated as next most potent, subject to the definition of this report, which allows that the chosen pair may
not be the absolutely most potent pair. These solutions can be completed in 6 to 8 minutes by clerks who are sufficiently familiar with the process that they have no uncertainty about the next step.

To Find the Third Most Potent Variable

To continue the example, if it turns out that the variables X_6 and X_3 are the most potent pair, X_3 is designated as the second most potent variable. The next step is to complete in Table 3 all cells involving X_3. See values indicated by (3). Accuracy may be checked by use of the check column. After coding, the necessary values of a_ij are available to solve every multiple regression of three independent variables in which two of the variables are the most potent pair, X_6 and X_3.

Construction of a Mask to Aid in Computations

It is still necessary to transfer the appropriate a_ij to information matrices and perform abbreviated Doolittle solutions through the "forward" solution. In being systematic about the work, however, it seems logical to let the most potent variable become X_1' and the second chosen variable become X_2', where the "prime" indicates that the subscript of X may not be the original subscript. It then turns out that each of the eight three-variable regressions that must be solved has certain parts that are identical. Table 4 (page 63) indicates by identifying symbols those portions of the solutions that are the same in every solution and indicates by leaders (...) those values that vary as the third variable varies. The proportion of the solution that is constant is not large with just two out of three independent variables constant, but it increases as the process is extended to solving several regressions of four, five, or six independent variables with all but one constant.

Since the only information that will ever be wanted from most of these regressions is Σŷ² (the reduction in sum of squares of deviations in Y that may be attributed to the regression), any device to save reworking or even recopying the constant part of these solutions is worthwhile. The device used at this laboratory is a mask consisting of the constant values, with the columns and cells in which new values are to be entered or computed cut out, as shown by dotted lines in Table 4. To work a regression, the mask is laid over a fresh sheet of paper, the values of a_ij are entered according to the variable being evaluated as the third predictor, the necessary remaining calculations are made, and the reduction in sum of squares is calculated, [5]:

Σŷ² = A_1g B_1g + A_2g B_2g + A_3g B_3g.

The most potent trio of variables, of course, is the one that has the largest sum of squares attributable to regression, or the largest A_3g B_3g. Since two of these three variables are already designated as most potent and second most potent, the third is designated as third most potent. It must be remembered that this list might not include all (or indeed even any) of the three absolutely most potent variables, though experience indicates that this is unlikely, or that, if it did happen, the difference in Σŷ² by which the absolutely most potent variables would displace these variables would not be large.

The values appearing on the mask for aiding in the search for the most potent trio of variables are copied directly from the regression of the most potent pair of variables. (The sturdiness of the mask is increased if the single cell cut out of the Y column, the cell for a_3g = g_3 of Table 4, is covered front and back with Scotch tape to provide a "window.") Each regression of three variables requires 8 to 10 minutes to compute, assuming that all necessary values in Table 3 have been computed and are ready for use.
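The masked sheets amount to re-solving the same small system with one new row and column at a time. In code the whole selection step collapses to a few lines; here a general linear solve stands in for the masked Doolittle sheets (a sketch with our own naming; S is the uncoded NumPy matrix of sums of squares and products, y indexes Y, and chosen is the list of variables already selected):

```python
import numpy as np

def next_most_potent(S, y, chosen):
    """One step of the search: for each remaining candidate c, evaluate
    the regression on chosen + [c] and keep the largest reduction [5]."""
    best, best_red = None, -np.inf
    for c in range(len(S)):
        if c == y or c in chosen:
            continue
        v = chosen + [c]
        a = S[np.ix_(v, v)]                # the constant block plus one new row/column
        g = S[v, y]
        red = g @ np.linalg.solve(a, g)    # reduction due to regression
        if red > best_red:
            best, best_red = c, red
    return best, best_red

# Starting from chosen = [], repeated calls (appending each result) order
# the variables as most potent, second most potent, and so on.
```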
To Find the Fourth and Other Most Potent Variables

If the most potent trio of variables should be X_6, X_3, and X_7, then to proceed further it will be necessary to complete in Table 3 all cells involving X_7, labeled (4) in Table 3, and to solve all regressions of Y on four independent variables in which X_6, X_3, and X_7 are present. The mask may be prepared from the regression of the most potent trio of variables. Solution of each regression will require from 12 to 15 minutes of computational time. The process can be extended indefinitely. Regressions with five independent variables require from 20 to 25 minutes of computational time after the mask has been prepared and the necessary values entered in Table 3.

TESTING SIGNIFICANCE OF THE VARIABLES

The question of when to stop is related to how well the independent variables account for the variation in the dependent variable. In many fields of work, it is quite probable that if the investigator could find one or two variables that would account for 95 per cent of the variability in Y, he would be quite willing to stop. This would be a satisfactorily large amount of variation explained by satisfactorily few variables. Usually the investigator is not this fortunate but must continue until all the potent variables are identified, that is, until further variables added to the potent list do not account for significant portions of the variability. The significance of the variability accounted for by additional variables can be tested by a series of F tests, as in Table 5 (page 64). Significant variables may be regarded as potent variables.³

³ While the authors know of no formal investigations on the matter, there is some intuitive feeling that the significance level here should not be too stringent, say 0.10.

USING REGRESSION PROCEDURES IN DISCRIMINANT ANALYSIS

If the problem is one of finding a discriminant function rather than a multiple regression, there will be certain changes, but the procedures are essentially the same. First the data will be divided into the two groups on the basis of the discrete variable Y. Again assuming a case in which four independent variables were measured, one will have to calculate the triangular array of Σx_ix_j as in Table 1 for each of the two groups, I and II, which are to be discriminated. If a check column is to be used, there is a check column for each group, based on adding together the 4 X's (but no Y). After computing the Σx_ix_j for both groups, corresponding Σx_ix_j are added together and summarized in a table similar to Table 1. Instead of a column of Σx_iy, there is in this table a column listing the mean differences, d_i, between groups I and II for the several X-variables:

d_i = X̄_i,I - X̄_i,II. [21]

Call this column d in discriminant analysis. The first difference is listed in line 1, d_1 = X̄_1,I - X̄_1,II; the second in line 2, d_2 = X̄_2,I - X̄_2,II; and so on, remembering always to take the differences in the same direction. There is no value of d in discriminant analysis corresponding to the Σy², or g_g, of regression analysis.

After the Σx_ix_j have been added together and summarized with the d-values, they may be coded by powers of 10 to a_ij values, after which the operations of Table 2, the abbreviated Doolittle solution, are in order. The coded values of the d_i serve exactly as the coded values of the Σx_iy in Table 2 and are given the same symbol, g_i, as in regression. The computations are exactly the same as in regression, and the discussion about "forward" and "back" solutions and decoding still holds. Since there is no value corresponding to the Σy², or g_g, of regression analysis, there is no A_gg to be calculated in discriminant analysis.

The coefficients of the X_i in the discriminant function are calculated in identically the same manner as the b's of multiple regression, though they are usually designated λ or L rather than b, and the quantity
maximized is the average difference between the Z-values of the two groups, I and II, rather than a sum of squares due to regression, Σŷ². The difference can be computed as

D = Z̄_I - Z̄_II, [22]

but is more usually calculated in the same manner as a sum of squares due to regression, after either [5] or [9]. If after [5],

D = Σ A_ig B_ig, [23]

or if after [9],

D = Σ λ_i g_i. [24]

After the λ-values have been found (by either of the methods described for b-values), the discriminant function may be written as

Z = λ_1 X_1 + λ_2 X_2 + λ_3 X_3 + λ_4 X_4, [25]

where either the λ_i must be decoded or the X_i must be coded.

In the same manner as for a regression, the significance of the discriminant may be tested by means of an F-test. The sum of squares attributable to the independent variables is given by

SS due to variables = [n_I n_II / (n_I + n_II)] D², [26]

where n_I and n_II are the numbers of sets of observations in groups I and II, respectively. D is itself the sum of squares for residual and has degrees of freedom n_I + n_II - (1 + the number of independent variables), as indicated in the following analysis of variance:

Source of variation    Degrees of freedom                Sum of squares                Mean square
Variables              no. of variables                  n_I n_II D² / (n_I + n_II)    n_I n_II D² / [(n_I + n_II)(no. of var.)]
Residual               n_I + n_II - 1 - no. of var.      D                             D / (n_I + n_II - 1 - no. of var.)

The residual is used for testing variables, and a significant F-value for variables indicates a significant discriminant.

The discussion of considerations in choosing a method for determining potent variables holds as certainly for discriminants as for regression. If the problem with 10 independent variables should be one of discriminant analysis rather than multiple regression, there must be a table like Table 3 (omitting the a_ij) for each of the two groups to be discriminated. The corresponding Σx_ix_j from groups I and II are added together and listed in a third table. The d_i = X̄_i,I - X̄_i,II are also listed in this third table, in the column for X_11 = d, the same column as X_11 = Y in regression. As with the Σŷ² in simple regression, the quantities D_i for the case of single-variable discriminants are best calculated without coding. Thus, similarly to [19] and [19a],

D_i = d_i² / Σx_i²,

and if i = 3,

D_3 = d_3² / Σx_3².

The variable yielding the largest D_i is the most potent single variable to use as a discriminator.

If the most potent variable in the discriminant analysis of the problem with 10 independent variables should turn out to be X_6, it will be necessary, as in regression, to complete through Σx_ix_j all the cells involving X_6. This must be done for both groups, I and II, adding corresponding Σx_ix_j together in the third table and then coding to a_ij. The necessary values of a_ij (coded Σx_ix_j) are now available to solve every discriminant function of two independent variables in which one of the variables is the singly most potent variable, X_6. The pertinent a_ij are removed to matrices as in Table 2, and the discriminants are solved through the "forward" solution and the evaluation of D by means of [23],

D = A_1g B_1g + A_2g B_2g.

The largest D signifies the most potent pair of variables to be used in a discriminant function. The resulting function [25] is Z = λ_1 X_1 + λ_2 X_2.
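For checking purposes, the F test just tabulated can be computed in a few lines (a sketch with our own naming; D as from [23], n1 and n2 the numbers of observations in the two groups, r the number of variables in the discriminant):

```python
def discriminant_F(D, n1, n2, r):
    """F test of a discriminant with r variables, from the tabulated
    analysis of variance: SS(variables) from [26], SS(residual) = D,
    with r and n1 + n2 - 1 - r degrees of freedom."""
    ss_var = n1 * n2 * D ** 2 / (n1 + n2)   # [26]
    df_res = n1 + n2 - 1 - r
    return (ss_var / r) / (D / df_res)
```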
To look for the third most potent variable, all cells in Table 3 involving X_3 (see the values in Table 3 indicated by "3") must be completed for both groups. After combining and coding, the a_ij can be removed to individual matrices for solution just as in regression. A mask is just as valuable here as in regression analysis. The search for potent variables may continue to be extended in discriminants just as in regressions, except for the necessary modifications in first finding the Σ x_ix_j of two separate groups and using the average group differences, d_i, of the X_i rather than the sums of products, Σ x_iy.

The problem of when to stop a search for potent variables in discrimination is subject to the same considerations as in regression. The significance of additional variables can be tested by a series of F-tests as illustrated in Table 5 for regression, although the complexity of the testing process is increased because the sums of squares do not exist as such but must be computed from the values of D. The D calculated with any number of variables may be regarded as the residual sum of squares remaining after the sum of squares due to those variables has been subtracted from the total sum of squares. The sum of squares due to the variables is not known directly, but may be computed from [26] as

    SS due to variables = [n_I n_II/(n_I + n_II)] D²,

where n_I and n_II are the numbers of sets of observations in the two groups, I and II, and D = Σ λ_i g_i. The total sum of squares is given by

    Total SS = [n_I n_II/(n_I + n_II)] D² + D.    [27]

There is still a difficulty in that the basic variable, D = Z̄_I - Z̄_II, changes as variables are added to or deleted from the discriminant, so that the total sum of squares with a two-variable discriminant is, for example, different from the total sum of squares with a three-variable discriminant. However, since the F-ratio of the mean square for variables to the mean square for residual is valid for each discriminant, the mean squares may all be made comparable to one another by applying the necessary factors to bring every total sum of squares (and its component parts) to some constant total sum of squares. Since the proposed analysis tests a single variable first, then a second variable, a third, and so on, it is proposed that the total sum of squares for the single most potent variable be accepted as the constant total sum of squares to which the total sums of squares of other discriminants are to be made equal. In most cases the only sum of squares of interest from the discriminant with r variables is the sum of squares due to the r variables. This may be adjusted to the scale of the single most potent variable as follows:

    Adj. SS, r variables = [n_I n_II/(n_I + n_II)] D_r² × (Total SS, 1 variable)/(Total SS, r variables),    [28]

where D_1 and D_r are the largest D's for 1 and r variables, respectively, and the total sums of squares are computed from them by [27].

Significance of additional variables in discriminant analysis is tested in the same manner as for regression analysis in Table 5. The entries in the first three lines of the sum of squares column of the table resembling Table 5 are: line (2) = the sum of squares due to the most potent variable = n_I n_II D_1²/(n_I + n_II); line (3) = the residual sum of squares, that among Z-values of the same group, = D_1; and line (1) = the total sum of squares = (2) + (3).
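The adjustment [28] is easily mechanized. A sketch (our function, with D_r and D_1 taken as already computed):

    def adjusted_ss(D_r, D_1, n_I, n_II):
        """Adjust the SS due to r variables to the total-SS scale of the
        single most potent variable, per [26], [27], and [28]."""
        k = n_I * n_II / (n_I + n_II)
        total_1 = k * D_1**2 + D_1          # total SS, 1 variable   [27]
        total_r = k * D_r**2 + D_r          # total SS, r variables  [27]
        return (k * D_r**2) * total_1 / total_r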
Lines (4) and (7) are the sums of squares due to the two and the three most potent variables, respectively, calculated and adjusted to the total sum of squares of the most potent single variable as in [28]. Sums of squares for all other lines are calculated as indicated in Table 5.

NUMERICAL EXAMPLES

Because members of the group to whom this account is directed often feel a little uncertain as to whether their applications of symbolic representations are correctly made, worked numerical examples of both regression and discriminant analysis are added against which they may check themselves.

NUMERICAL EXAMPLE OF REGRESSION

The data for the regression example, set forth in Table 6 (page 65), consist of 40 sets of observations on one-tenth acre plots of planted longleaf pines. The dependent variable, Y, is the average height in feet of dominant and codominant trees. The independent variables or predictors, X_i, are defined as follows: X_1 = silt plus clay content of topsoil in per cent; X_2 = imbibitional water value of the most impervious soil horizon; X_3 = silt plus clay content of B horizon in per cent; and X_4 = age of planting in years. The variables created to allow for curvilinearity and interaction of effects are, respectively, X_5 = X_4² = (age)² and X_6 = X_1X_4 = (silt + clay of topsoil)(age). As a matter of record, the original analysis studied 19 predictor variables, but for purposes of illustration only 6 are included here. The results of the study have been reported by Goggans and Schultz (10).

Finding the Most Potent Single Variable

Table 7 (page 66) is prepared in the manner of Table 3, filling in first the sums and means of the X's, including X_8 for the check column, then the quantities in the diagonal cells, the values in the column X_7 = Y, and the value in the single cell X_7X_8 of the check column. Note the checks on computation: 18,231.8 = 1,139.2 + 9,190.6 + ... + 820.8. Also, by [1] and [1a], 550,444.53 = 34,171.52 + 279,200.63 + ... + 24,070.97, and, by [2] and [2a], 519,241.66 + 31,202.87 = (32,444.42 + 261,748.29 + ... + 23,376.38) + (694.59 + ... + 17,452.34 + 1,727.10) = 550,444.53. Values in diagonal cells must be checked by recomputing.

The sums of squares due to regression when regarding each variable as a simple predictor are calculated by [19] and [19a], which for X_3 is

    Σ ŷ²_3 = (Σ x_3y)²/Σ x_3² = (489.75)²/5,522.85 = 43.43.

The reductions for the six variables, listed in order of decreasing magnitude, are:

    Σ ŷ² due to age, X_4                                          = 1,200.53
    Σ ŷ² due to (age)², X_5                                       = 1,147.85
    Σ ŷ² due to (age)(silt + clay of topsoil), X_6                =   641.79
    Σ ŷ² due to silt + clay of topsoil, X_1                       =   204.24
    Σ ŷ² due to silt + clay of B horizon, X_3                     =    43.43
    Σ ŷ² due to imbibitional water value of the most impervious
         soil horizon, X_2                                        =     0.39

Note that no quantities have been coded in these computations. The greatest reduction is 1,200.53, due to X_4; thus X_4, or age, is the most potent single predictor of height. The significance of this reduction is tested in the first three lines of Table 8 (page 67), which is prepared in the manner of Table 5. F = 86.62 with 1 and 38 degrees of freedom occurs in not more than 0.001 of cases due to chance; hence the effect of age on height is to be regarded as very highly significant. Since it has turned out that X_4, or age, is the most potent single variable, it is necessary to complete Table 7 with regard to column X_4 and row X_4.
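The screening of single predictors by [19] is the simplest piece of the whole procedure to program. A sketch (our code; X is assumed to hold one column per candidate variable):

    import numpy as np

    def single_variable_reductions(X, y):
        """Reduction in SS of Y due to each X alone,
        (sum x_i y)**2 / sum x_i**2, i.e. [19] without coding."""
        xd = X - X.mean(axis=0)
        yd = y - y.mean()
        sxy = xd.T @ yd               # sums of products of deviations
        sxx = (xd**2).sum(axis=0)     # sums of squares of deviations
        return sxy**2 / sxx

    # For X3 of the worked example: 489.75**2 / 5522.85 = 43.43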
As computational checks, by [1] and [1a],

    211,216.30 = 12,949.1 + 107,160.4 + ... + 4,985.0 + 15,579.0,

and, by [2] and [2a],

    199,182.42 + 12,033.88 = (12,445.8 + 100,407.3 + ... + 8,967.2) + (503.3 + 6,753.1 + ... + 223.4 + 9,190.6) = 211,216.30.

Coding

Before evaluating the several regressions with two independent variables, the Σ x_ix_j should be coded to a_ij. As an example of the coding,

    (Σ x_3²)(10⁻⁴) = (5,522.85)(0.000,1) = 0.552,285,

a value between 0.1 and 10.0, so the coding factor for X_3 is √0.000,1 = 0.01 = 10⁻². Enter this value as m_3 for both the X_i code and X_j code factors of X_3. As another example,

    (Σ x_6²)(10⁻⁶) = (474,585.41)(0.000,001) = 0.474,585,41,

a value between 0.1 and 10.0, so the code for X_6 is √0.000,001 = 0.001 = 10⁻³. Enter this value as m_6 for both the X_i and X_j code of X_6. Other code values are computed similarly, with the exception now noted. If the X_7 code is chosen by the above directions, the X_7 code equals the Y code, which equals 0.01, and a_17 = g_1 = 0.069,459, a_27 = g_2 = -0.014,96, and a_37 = g_3 = 0.048,975, which are all values lying on the low side of the recommended range of 0.1 to 10.0, ignoring signs. It will probably increase agreement between the check column and the terms that should check with it to use 0.1 rather than 0.01 for the X_7 = Y code. This change in coding is introduced and will be used. The diagonal value and one other value then exceed 10.0, but this result is preferable to values beginning 0.0.

Each a_ij is computed from its corresponding Σ x_ix_j by [20],

    a_ij = (Σ x_ix_j)(X_i code)(X_j code).

As examples:

    a_14 = (Σ x_1x_4)(X_1 code)(X_4 code) = (223.4)(0.01)(0.1) = 0.223,4,
    a_44 = (Σ x_4²)(X_4 code)² = (211)(0.1)² = 2.11.

Other values of a_ij are computed similarly and entered in Table 7.

Finding Other Potent Variables after the Most Potent

The quantities a_ij, or coded Σ x_ix_j, necessary for evaluating the five regressions of height on two independent factors when one of the factors is age are now available. As an example, consider X_4, age, and X_1, silt plus clay content of topsoil; transfer the a_ij involving X_1, X_4, and Y to a table of the same form as Table 2, and perform the "forward" part of the solution. This manipulation is shown in Table 9 (page 68).

The values in the first three rows of the first three columns of Table 9 are a_ij values entered from Table 7. The "primes" are reminders that the row and column subscripts of the X's in Table 9 may not be the original subscripts of Tables 6 and 7. The first three values in the check column are the sums of the quantities in the row involved, and are so obtained. Thus the value at row X_2' of the check column is given by 1.154,230,00 = 0.694,590,00 + 0.236,240,00 + 0.223,400,00. All values in A_1 are a_1j values copied from row X_1'. All values in B_1 are the corresponding A_1 values divided by the leading A, A_11; thus,

    B_13 = A_13/A_11 = 5.033,000,00/2.110,000,00 = 2.385,308,06

and

    B_14 = A_14/A_11 = 7.366,400,00/2.110,000,00 = 3.491,184,83.

The computational check in A_1 is 7.366,400,00 = 5.033,000,00 + 0.223,400,00 + 2.110,000,00, and in B_1 it is 3.491,184,83 = 2.385,308,06 + 0.105,876,78 + 1.0. These results should agree within rounding errors. Following the guide provided in [3], [3a], and Table 2,

    A_22 = 0.236,240,00 - (0.223,400,00)(0.105,876,78) = 0.212,587,13.

This value, divided by the leading A (itself), gives B_22 = 1.0.
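The rule for choosing the powers of 10 can be written out exactly. A sketch (our formulation of the rule just described):

    import math

    def coding_factor(ss):
        """Power-of-10 code m for a variable whose corrected sum of
        squares is ss, so that the coded diagonal value ss * m**2 falls
        in the recommended range of roughly 0.1 to 10.0."""
        p = math.ceil((math.log10(ss) - 1.0) / 2.0)
        return 10.0 ** (-p)

    # coding_factor(5522.85) -> 0.01, and 5522.85 * 0.01**2 = 0.552285
    # coding_factor(474585.41) -> 0.001; coding_factor(211) -> 0.1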
Now,

    A_23 = 0.694,590,00 - (0.223,400,00)(2.385,308,06) = 0.161,712,18

is entered in the table and is divided by the leading A, or A_22 = 0.212,587,13, before being removed from the machine; that is,

    B_23 = 0.161,712,18/0.212,587,13 = 0.760,686,59.

The quantity (0.223,400,00)(2.385,308,06), which is subtracted from 0.694,590,00, could also have been obtained from the product (0.105,876,78)(5.033,000,00), but the rounding error is smaller on the average when the pair more nearly equal in absolute value is chosen. Further,

    A_24 = 1.154,230,00 - (0.223,400,00)(3.491,184,83) = 0.374,299,31

and

    B_24 = A_24/0.212,587,13 = 1.760,686,59, before clearing the machine.

Checks: 0.374,299,31 = 0.161,712,18 + 0.212,587,13 and 1.760,686,59 = 0.760,686,59 + 1.0. These results should agree within rounding errors.

The sum of squares due to regression is given by [5]:

    Σ ŷ² = Σ A_ig B_ig.

This calculation is indicated under Table 9, and the result is decoded there by dividing by the square of the Y code. Note that the decoded A_1g B_1g = 12.005,255,47(1/0.1)² = 1,200.525,547 is the sum of squares due to X_4 as calculated by [19].

Similar solutions must be made for the four other two-factor regressions involving X_4, or age. The results of all five solutions are given in Table 10 (page 69), where it may be observed that the greatest Σ ŷ², or reduction due to regression, is that due to X_1' and X_2', which are equal to X_4 and X_2, respectively, or age and imbibitional water value of the most impervious soil horizon. Thus it is concluded that the second most potent predictor (subject to the condition that one of the predictors is X_4, or age) is X_2, or imbibitional water value of the most impervious soil horizon. The significance of this result may be tested by adding three more lines to Table 8. The sum of squares due to imbibitional water value independent of the most potent variable, age, is the difference in sums of squares for age plus imbibitional water value and for age alone; that is, 1,241.49 - 1,200.53 = 40.96 with 1 degree of freedom. The probability of F = 3.12 with 1 and 37 degrees of freedom due to chance alone is less than 0.1. At the 10 per cent significance level, it may be concluded that the height that young longleaf pine trees attain is related to the imbibitional water value of the most impervious soil horizon.

At this stage one could compute, on the worksheet for X_1' = X_4 and X_2' = X_2, the partial regression coefficients b_2' = b_Y2.1 and b_1' = b_Y1.2 for imbibitional water and age, respectively, to verify that their signs and sizes are reasonable. This computation is not necessary and is not done in this example.

Since X_2 is the second most potent variable, it is necessary to complete Table 7 with respect to column X_2 and row X_2 and then to solve the four regressions of Y on three independent variables when two of them are X_4, or age, and X_2, or imbibitional water. The usual checks [1], [1a], [2], and [2a] are made on the accuracy of the entries in Table 7. The new entries are coded to a_ij by the existing coding factors. The mask to reduce the duplication and copy work of solving several multiple regressions of three independent variables, when two of the independent variables are always the same, may be copied from the solution of the multiple regression on the two independent variables chosen to be constant.
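The "forward" Doolittle reduction that Tables 9, 11, and 12 carry out by hand can be sketched directly. This is our translation, assuming a symmetric matrix A of coded a_ij and a right-hand column g of coded Σ x_iy:

    import numpy as np

    def doolittle_forward(A, g):
        """Forward part of the abbreviated Doolittle solution; returns
        the regression sum of squares sum(A_ig * B_ig) of eq. [5]."""
        n = len(g)
        T = np.column_stack([np.asarray(A, float), np.asarray(g, float)])
        ss = 0.0
        for i in range(n):
            for k in range(i):                    # strike out earlier rows
                T[i] -= T[k, i] / T[k, k] * T[k]  # B_ki times reduced row k
            ss += T[i, n] * (T[i, n] / T[i, i])   # A_ig * B_ig
        return ss

    # Table 9 (X1' = X4, X2' = X1):
    # doolittle_forward([[2.11, 0.2234], [0.2234, 0.23624]], [5.033, 0.69459])
    # -> 12.12826776, which decodes to 1,212.83 on division by (0.1)**2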
The two constant variables are the two most potent ones or, in this study, X_1'' = X_4, age, and X_2'' = X_2, imbibitional water value of the most impervious horizon. The solution when X_1' = X_4 and X_2' = X_2 has not been shown, but it is of the same form as Table 9. The columns for X_1' = X_4 and X_2' = X_2 are copied from their solution table into columns X_1'' = X_4 and X_2'' = X_2 of Table 11 (page 69), where the double "primes" indicate possible further changes in subscripts. The next column is left open for X_3'' = ?, and the column X_3' = Y of that solution is copied into X_4'' = Y of Table 11. The check column, X_5'', whose check sum includes the values entered in X_3'', is different for each variable and is also left open. Table 11 represents the mask to be used for these variables. The dotted lines show how the mask would be cut out to allow several different X_i to be evaluated as X_3''.

To solve the regression of height on X_1'' = X_4, X_2'' = X_2, and X_3'' = X_1, place the mask on a fresh sheet of paper, copy in the needed values of a_ij, and solve. The final results will appear as in Table 12 (page 70). The dotted line divides the portion on the mask from that copied directly to the new sheet from Table 7 and from that which must be worked out for the particular solution. The first four values in the check column of Table 12 are obtained by adding the quantities in the row in which the value is to be entered; that is,

    5.438,860,00 = -0.149,600,00 + 0.535,760,00 + 5.745,700,00 - 0.693,000,00,

etc. Line A_1j is line X_1'' copied. Line B_1j is line A_1j divided by the leading A of that line, A_11; thus,

    B_13 = 0.223,400,00/2.110,000,00 = 0.105,876,78.

A_2h of the check column is given by [3] and [3a],

    5.438,860,00 - (-0.693,000,00)(3.162,748,82) = 7.630,644,93,

which is divided by 5.518,093,84 before clearing the machine to yield B_2h, whose value is 1.382,840,73. Remember that (-0.693,000,00)(3.162,748,82) was chosen rather than (-0.328,436,02)(6.673,400,00) because the absolute magnitudes of the factors in the first product are nearer each other in size. Further,

    A_33 = 0.236,240,00 - (0.223,400,00)(0.105,876,78) - (0.609,132,61)(0.110,388,23) = 0.145,346,06.

Computational checks are available in that each value in the check column should be the same (within rounding) as the sum of the other quantities in the same line, [4] and [4a]. In the lines for A_ij and B_ij, missing entries are zero, so that there is no need to turn up the column at the diagonal; that is,

    6.673,400,00 = 5.033,000,00 + 0.223,400,00 - 0.693,000,00 + 2.110,000,00

and

    1.382,840,73 = 0.272,452,50 + 0.110,388,23 + 1.0.

When working to eight decimal places, as in this example, rounding errors preventing checking in the seventh place may occur if a diagonal value A_ii becomes less in absolute value than 0.010,000,00.

The results of the four regressions with three independent variables are summarized in Table 13 (page 71). Even the trio of independent variables with the largest regression sum of squares does not add a significant amount to the reduction in the sum of squares of Y attributable to regression (see Table 8). It is probably safe to conclude at this stage that only age and imbibitional water value of the most impervious soil horizon are potent variables.
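What the mask accomplishes, namely holding the two most potent variables fixed while each remaining variable is tried in turn, can be expressed compactly. A sketch (our construction; S is the full matrix of coded a_ij, g the coded Σ x_iy column, and numpy's solver stands in for the Doolittle routine):

    import numpy as np

    def reduction_with_candidate(S, g, fixed, cand):
        """SS due to regression on the fixed variables plus one candidate,
        equal to sum(b_i g_i) of eq. [9]."""
        idx = list(fixed) + [cand]
        S, g = np.asarray(S, float), np.asarray(g, float)
        A = S[np.ix_(idx, idx)]
        return float(g[idx] @ np.linalg.solve(A, g[idx]))

    # best third variable, given the most potent pair in `fixed`:
    # best = max(candidates, key=lambda j: reduction_with_candidate(S, g, fixed, j))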
Attention is called to the fact that in Table 8 the sum of squares "reduction due to age," which equals 1,200.53, is the same as

    A_1g B_1g (1/m_y)² = (5.033,000,00)(2.385,308,06)(1/0.1)² = 1,200.526

of Table 12; and the sum of squares "reduction due to imbibitional water value independent of age," which equals 40.96 in Table 8, is the same as

    A_2g B_2g (1/m_y)² = (1.503,418,48)(0.272,452,50)(1/0.1)² = 40.961

of Table 12, both of which are found on the mask. The sum of squares "reduction due to (age)² independent of the others" is equal to 6.53, and the "residual" sum of squares after all the variables are accounted for is 479.08. These are calculated by A_3g B_3g (1/m_y)² and by A_gg (1/m_y)², respectively, of a table like Table 12 but including (age)² as the third independent variable rather than silt plus clay content of topsoil.

The results of this analysis fall into a pattern that seems to be quite common; that is, there are only one, two, or at most a few variables that are sufficiently strongly related to the dependent variable to be useful as predictors. Although it is not necessarily so, it is usual that, as new variables are added to the list in order of potency as assessed by the method described above, later-added variables contribute smaller sums of squares to regression and are of less significance. The exceptions arise from occasional interrelationships among the variables such that two or three variables jointly contribute sizably to the sum of squares of regression with very small increments from any one or two of the variables used without completing the set. For this reason it may be desirable to ascertain that two successive most potent variables are both nonsignificant before abandoning the search.

If one should wish to know whether the fourth most potent variable in this study is significant, it will be necessary to identify and test this variable by extending the process already described: fill in the missing values in row and column X_5, solve the three regressions of Y on four independent variables when three of them are X_4, X_2, and X_5, and test the one with the greatest Σ ŷ² by means of three more lines in Table 8. For those who are curious, silt plus clay content of B horizon is the next most potent variable. However, the total sum of squares due to regression on these four variables is 1,251.29, which is only 3.27 more than the reduction due to three variables, and actually less than the mean square of residual.

It is concluded that, of this list of six independent variables, only two have genuine worth as predictors of height of dominant trees in young longleaf pine plantations. These variables are age and imbibitional water value of the most impervious soil horizon. The solution for these particular two independent variables has not been shown as such; however, it has been reproduced on the mask portions of Tables 11 and 12. From these values, using Table 2 as a guide, it can be determined that

    b_2' = b_2 = 0.272,452,50

and

    b_1' = b_4 = 2.385,308,06 - (0.272,452,50)(-0.328,436,02) = 2.474,791,27;

hence, using [10],

    b_2* = (0.272,452,50)(0.1/0.1) = 0.272,5

and

    b_4* = (2.474,791,27)(0.1/0.1) = 2.475.

From the foregoing results, the means of Tables 6 or 7, and equations [7] and [8], the regression equation,

    Ŷ = Ȳ + Σ b_ix_i = Ȳ - Σ b_iX̄_i + b_4X_4 + b_2X_2,

is found to be

    Ŷ = 28.48 - (2.475)(10.92) - (0.272,5)(5.88) + 2.475X_4 + 0.272,5X_2
      = -0.15 + (2.475)(age) + (0.272,5)(imbibitional water value).
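The back solution and decoding just performed can be checked in a few lines. A sketch with the worked values (our function; the codes m_1, m_2, m_y and the means are those of Tables 7 and 6):

    def decode_equation(B_1g, B_2g, B_12, codes, means):
        """Back solution for two variables (Table 2), decoding by [10],
        and the intercept of [8]."""
        b2 = B_2g                      # coded b of the variable in row 2
        b1 = B_1g - b2 * B_12          # coded b of the variable in row 1
        m1, m2, my = codes
        b1s, b2s = b1 * m1 / my, b2 * m2 / my    # decoding, eq. [10]
        ybar, x1bar, x2bar = means
        return ybar - b1s * x1bar - b2s * x2bar, b1s, b2s

    # decode_equation(2.38530806, 0.27245250, -0.32843602,
    #                 (0.1, 0.1, 0.1), (28.48, 10.92, 5.88))
    # -> (-0.15, 2.475, 0.2725), i.e. Y-hat = -0.15 + 2.475 X4 + 0.2725 X2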
Note, as checks on computation, that every B_ij in Table 12 agreed with its A_ij in sign and that all values on the diagonal of the solution were positive. These conditions must always be true.

NUMERICAL EXAMPLE OF A DISCRIMINANT

The data for illustrating the discriminant function, presented in Table 14 (page 71), are taken from Goulden (11), page 352, who abstracted them from the study by Cox and Martin (3) to determine a discriminant function for differentiating soils with and without Azotobacter. The sums of squares and products of deviations for each group are set forth in the upper part of Table 15 (page 72) and then added together in the lower part of the table. This is just one of several ways these sums can be obtained, since any procedure yielding the sums of squares and products "within groups" would give the same results. The differences, d_i, between the average values of the two groups for the several variables, X_i, are calculated by [21] and listed in a column, X_4 = d, of the lower part of Table 15. This column takes the place of the column X_5 = Y of Table 1, in which the Σ x_iy values of regression are recorded. The values in the lower part of Table 15 are coded by powers of 10 to the quantities a_ij, which are then removed to Table 16, where the abbreviated Doolittle solution is performed, including the back solution for the c_ij values and the calculation of the discriminant coefficients, λ_i.

The upper part of Table 15 should not require much explanation. The sums of squares and products of the possible discriminating variables are obtained by the usual processes. The column of the check sum is used in the same way as illustrated for regression in Table 1. Thus, for example, in Group II, using [1], [1a], [2], and [2a] to check X_2:

    64,677.8 = 20,928 + 37,979 + 5,770.8

and also

    61,189.7333 + 3,488.0667 = (5,728.0667 + 34,347.0000 + 21,114.6667) + (42.7333 + 3,632.0000 - 186.6667) = 64,677.8000.

In the lower part of Table 15, by [21], for example,

    d_2 = X̄_2,I - X̄_2,II = 87.8400 - 35.6667 = 52.1733.

The powers of 10 for the coding factors are chosen as the square roots of the even powers of 10 that reduce the diagonal values Σ x_ix_i to values between 0.1 and 10.0, and they are written in as the X_i code and X_j code. For example,

    (0.0001)(Σ x_2x_2) = (0.0001)(88,897.3600) = 8.889,736,00,

a value between 0.1 and 10.0; so the X_i code for X_2 is √0.0001 = 0.01, and, since the X_i code is equal to the X_j code when i = j, the X_j code for X_2 is also 0.01. The values in the lower part of Table 15 are obtained by [20] from the Σ x_ix_j values by multiplying each Σ x_ix_j by the appropriate code factors. For example,

    a_13 = (Σ x_1x_3)(X_1 code)(X_3 code) = (148.2403)(0.1)(0.01) = 0.148,240,30.

After transferring the a_ij to Table 16 (page 73), the "forward" portion of the Doolittle solution is performed as already illustrated several times (see Table 2 for directions). There is no value in the column X_4 = d corresponding to Σ y² or g_g, so that there is no A_gg value to be computed. Using the directions of Table 2, the λ_i may now be calculated in the column X_4 = d exactly as the b_i are calculated in regression; that is,

    λ_3 = B_34 = B_3g = 0.054,935,95

and

    λ_2 = B_2g - λ_3B_23 = 0.028,634,99 - (0.054,935,95)(0.076,497,57) = 0.024,432,52,

etc. Checks are carried in column X_5, which is the Check Σ.
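In matrix terms, the λ_i solve the pooled within-group normal equations with the d-column as the right-hand side, just as the b_i solve them with the Σ x_iy column. A sketch (our code; S and d are as in the earlier pooled-products sketch):

    import numpy as np

    def discriminant_coefficients(S, d):
        """Coefficients lambda_i of the discriminant and the difference
        D = sum(lambda_i d_i) of eq. [24]."""
        lam = np.linalg.solve(np.asarray(S, float), np.asarray(d, float))
        return lam, float(np.asarray(d) @ lam)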
The c_ij are calculated exactly as indicated in the appropriate cells of Table 2. As examples,

    c_12 = -B_13c_23 - B_12c_22 = -(0.707,354,58)(-0.167,357,12) - (2.013,252,37)(0.137,175,72) = -0.157,788,52

and

    c_11 = 1/A_11 - B_13c_13 - B_12c_12
         = 1/0.209,570,00 - (0.707,354,58)(-1.210,578,80) - (2.013,252,37)(-0.157,788,52) = 5.945,651,91.

Remember, in determining the c_ij, to start at the bottom with

    c_33 = 1/A_33 = 1/0.457,091,82 = 2.187,744,25.

A check may be established on the calculation of the c-values: the sum of all the a_ij values in some particular row, i, each multiplied by the c_ij at the same position of the c_ij table, is unity; that is, Σ_j a_ij c_ij = 1.0. For example,

    Σ_j a_2j c_2j = a_23c_23 + a_22c_22 + a_12c_12
                  = (0.913,509,33)(-0.167,357,12) + (8.889,736,00)(0.137,175,72) + (0.421,917,30)(-0.157,788,52) = 0.999,999,94,

which is sufficiently close. These checks are outlined in Table 2.

The λ_i may also be computed from the c_ij in exactly the same manner as the b_i were, letting the column X_4 = d serve in the place of X_5 = Y. As an example, b_2 = c_23g_3 + c_22g_2 + c_12g_1, and likewise λ_2 = c_23g_3 + c_22g_2 + c_12g_1, so

    λ_2 = (-0.167,357,12)(0.145,141,00) + (0.137,175,72)(0.521,733,00) + (-0.157,788,52)(0.144,790,00) = 0.024,432,52.

The λ_i are decoded in the same way as the b_i; thus, similarly to [10],

    λ_i* = λ_i m_i.    [29]

As an example, decoded λ_3 = λ_3* = λ_3 m_3 = (0.054,935,95)(0.01)/1.0 = 0.000,549,36.

With the λ_i computed, it is possible to write the discriminant function [25]:

    Z = λ_1X_1 + λ_2X_2 + λ_3X_3,

and, using the decoded λ_i*,

    Z = 0.060,284X_1 + 0.000,244X_2 + 0.000,549X_3.

But the units of λ are arbitrarily chosen, so divide each λ_i by the smallest λ_i, which gives

    Z = 246.7X_1 + X_2 + 2.248X_3.

(This is identically the answer Goulden would have obtained, except for what seems to have been an error in rounding off the divisor.) It might better bring out the interrelationships and relative contributions of the variables to make the coefficient, λ, of the most potent variable unity. To do this, divide each coefficient by the λ of the most potent variable. This gives

    Z = X_1 + 0.004,053X_2 + 0.009,113X_3.

The λ_i are really coefficients such that the difference, D, between the average Z-values of the two groups, as given by [22], [23], or [24],

    D = Z̄_I - Z̄_II = Σ A_ig B_ig = Σ λ_i g_i,

is maximized relative to the variability within the groups. Thus

    D = A_1gB_1g + A_2gB_2g + A_3gB_3g
      = (0.144,790,00)(0.690,890,87) + (0.230,234,19)(0.028,634,99) + (0.025,110,27)(0.054,935,95)
      = 0.100,034,09 + 0.006,592,75 + 0.001,379,48 = 0.108,006,32,

or, using the coded (not yet decoded) results,

    D = λ_1g_1 + λ_2g_2 + λ_3g_3
      = (0.602,842,84)(0.144,790,00) + (0.024,432,52)(0.521,733,00) + (0.054,935,95)(0.145,141,00) = 0.108,006,33.

From the calculation of D = A_1gB_1g + A_2gB_2g + A_3gB_3g, it can be seen that D is made up of three parts: 0.100,034,09 + 0.006,592,75 + 0.001,379,48. These are due to X_1, to X_2 independent of X_1, and to X_3 independent of X_1 and X_2, respectively. Using the same method as in multiple regression [15], it is possible to estimate the contribution of each variable to D independent of the other two, despite the fact that X_3 is the only one occurring in the last position; that is,

    D_i.all others = (λ_i)²/c_ii.    [30]

For the three variables these values are:

    D_1.23 = (λ_1)²/c_11 = (0.602,842,84)²/5.945,651,91 = 0.061,123,70,
    D_2.13 = (λ_2)²/c_22 = (0.024,432,52)²/0.137,175,72 = 0.004,351,70,
    D_3.12 = (λ_3)²/c_33 = (0.054,935,95)²/2.187,744,25 = 0.001,379,48.
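Equation [30] needs only the diagonal of the inverse matrix. A sketch (ours; it also reproduces the check Σ_j a_ij c_ij = 1.0 described above, which holds row by row because the matrix is symmetric):

    import numpy as np

    def contributions_independent(A, g):
        """D_i.all others = lambda_i**2 / c_ii, eq. [30], together with
        the row checks sum_j a_ij c_ij (each should be close to 1.0)."""
        A = np.asarray(A, float)
        C = np.linalg.inv(A)               # the c_ij values
        lam = C @ np.asarray(g, float)     # the lambda_i
        return lam**2 / np.diag(C), (A * C).sum(axis=1)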
Note particularly that D_3.12 = 0.001,379,48 is the same as was estimated for X_3 independent of X_1 and X_2 by the term A_3gB_3g. It is apparent that X_3 contributes least to the discriminant, so that, if a two-variable discriminant should be desired, the appropriate two variables (if practical matters of measurement do not intervene) would be X_1 and X_2, or pH and available phosphate, respectively.

The potency of individual single-variable discriminants may be quickly estimated from quantities available in Table 16. The procedure is analogous to [19] and [19a] for estimating the several simple regressions in regression analysis, but, instead of computing (Σ x_iy)²/Σ x_i², calculate

    D_i = (d_i)²/Σ x_i².    [31]

For the three variables these values are:

    D_1 = (d_1)²/Σ x_1² = (0.144,790,00)²/0.209,570,00 = 0.100,034,09,
    D_2 = (d_2)²/Σ x_2² = (0.521,733,00)²/8.889,736,00 = 0.030,620,18,
    D_3 = (d_3)²/Σ x_3² = (0.145,141,00)²/0.609,001,19 = 0.034,590,92.

The largest D_i is due to X_1; hence the most potent single variable to use as a discriminator would be X_1, or pH.

When there are only three independent variables (in either regression or discriminant analysis), a combination of operations consisting of finding the most potent single variable and then finding the weakest of the three variables in the three-variable relationship results in complete knowledge of the order of potency, since the most potent pair comes from dropping the weakest. In this study the most potent variable is X_1, the next most potent is X_2, and the least potent is X_3. This is the same order in which the variables were tabulated, but order had nothing to do with this result, a fact one may verify by arranging the variables in some other order and solving again.

Having ordered the potency of these three variables as discriminators, it is possible to make F tests of the significance of the amounts by which D is increased by the addition of each successive variable. For this purpose, use the arrangement of Table 5 and equations [26], [27], and [28]. The results are tabulated in Table 17 (page 74). The value of D_1 is 0.100,034,09 and, being the largest such D, indicates that X_1 is the most potent single discriminator. The sum of squares due to this variable as a discriminator may be calculated by [26]:

    SS due to 1 variable = [n_I n_II/(n_I + n_II)] D_1² = [(25)(27)/(25 + 27)](0.100,034,09)² = 0.129,896,23.

This is entered in Table 17 at line (2). The residual or error sum of squares, = D_1, is entered at line (3). The total sum of squares may be calculated by (2) + (3) in Table 17 or by [27]:

    Total SS = [n_I n_II/(n_I + n_II)] D² + D = 0.129,896,23 + 0.100,034,09 = 0.229,930,32.

This is entered in Table 17 at line (1).

To find the sum of squares for the two most potent discriminators, it is first necessary to find D for the two most potent variables. This value can be found by solving the discriminant for these two variables, but it may also be had, in this special case in which there are only three independent variables, as D for three variables minus D for the weakest variable independent of the other two, in this example D_3.12; thus,

    D omitting X_3 = D, 2 variables = D, 3 variables - D_3.12 = 0.108,006,33 - 0.001,379,48 = 0.106,626,85.

From this value, the sum of squares due to two variables may be calculated from [26] as follows:

    SS due to 2 variables = [n_I n_II/(n_I + n_II)] D² = [(25)(27)/(25 + 27)](0.106,626,85)² = 0.147,581,75.
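The arithmetic of these entries is quickly verified. A short check in Python, with the values transcribed from the text:

    k = 25 * 27 / (25 + 27)           # n_I n_II / (n_I + n_II)
    D1 = 0.10003409
    ss1 = k * D1**2                   # 0.129,896...  line (2), eq. [26]
    total1 = ss1 + D1                 # 0.229,930...  line (1), eq. [27]
    D2 = 0.10800633 - 0.00137948      # 0.106,626,85, omitting X3
    ss2 = k * D2**2                   # 0.147,58...   SS due to 2 variables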
The total sum of squares may be calculated by [27]:

    Total SS = [n_I n_II/(n_I + n_II)] D² + D = 0.147,581,75 + 0.106,626,85 = 0.254,208,60.

To be comparable with the sums of squares already calculated for the single most potent variable, X_1, this sum of squares due to the two variables, X_1 and X_2, must be adjusted for the fact that its total sum of squares is not the same total sum of squares as for X_1 only. From [28] this adjustment is

    Adj. SS, 2 variables = (SS due to 2 variables)(Total SS, 1 variable)/(Total SS, 2 variables)
                         = (0.147,581,75)(0.229,930,32)/(0.254,208,60) = 0.133,486,90.

This value is entered in Table 17 at line (4). The sum of squares due to 3 variables is

    SS due to 3 variables = [n_I n_II/(n_I + n_II)] D² = [(25)(27)/(25 + 27)](0.108,006,33)² = 0.151,425,48.

The total sum of squares is

    Total SS = 0.151,425,48 + 0.108,006,33 = 0.259,431,81.

The adjusted SS due to three variables is

    Adj. SS, 3 variables = (SS due to 3 variables)(Total SS, 1 variable)/(Total SS, 3 variables)
                         = (0.151,425,48)(0.229,930,32)/(0.259,431,81) = 0.134,206,01.

This value is entered in Table 17 at line (7). After entering the foregoing values in Table 17, the remaining quantities may be calculated as indicated therein. From Table 17 it is evident that the discriminant based on the three variables, X_1 = pH, X_2 = phosphate, and X_3 = nitrogen content, is no better than the one based on two, pH and phosphate. The logical interpretation is to ignore nitrogen and compute the discriminant function based on the two most potent variables only.

The λ_i could be recomputed leaving out X_3, but they can be much more quickly computed from data in Table 16 using [17], the formula for computing some b_i after deleting some variable, X_k:

    λ_i after deleting X_k = λ_i - c_ik λ_k/c_kk.

In this case X_k = X_3, so that

    λ_1 after deleting X_3 = λ_1 - c_13λ_3/c_33 = 0.602,842,84 - (-1.210,578,80)(0.054,935,95)/2.187,744,25 = 0.633,241,41

and

    λ_2 after deleting X_3 = λ_2 - c_23λ_3/c_33 = 0.024,432,52 - (-0.167,357,12)(0.054,935,95)/2.187,744,25 = 0.028,634,99.

Since these are the values of λ that would be obtained considering just X_1 and X_2 as predictor variables, they may be used to obtain the discriminant function based on these two variables alone. Therefore, by [25] and [29], the decoded discriminant function is

    Z = λ_1*X_1 + λ_2*X_2 = (0.633,241,41)(0.1)X_1 + (0.028,634,99)(0.01)X_2 = 0.063,324,14X_1 + 0.000,286,35X_2.

However, since these units are arbitrary, divide through by the smallest λ, or 0.000,286,35, to obtain

    Z = 221.1X_1 + X_2,

or divide through by the λ of the most potent variable to obtain

    Z = X_1 + 0.004,522X_2,

which is the discriminant function for the two most potent variables, pH and soil phosphate content. Division by the λ of the most potent variable gives the other λ values as proportions of the most potent.

Working the discriminant with three variables, as just concluded, has served to give numerical examples of many of the computational procedures used in regression and discriminant analysis. It has served also to show that these procedures are for the most part identical. The demonstrated procedures were used to find the order of potency of three independent variables. The ordering of three independent variables according to potency, however, is a special case that does not call for use of the procedures proposed earlier in this bulletin for finding potent variables. However, the proposed procedures may be used.
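The deletion formula [17] avoids re-solving the system when a variable is dropped. A sketch (our code; lam and the c_ij matrix C come from the three-variable solution):

    import numpy as np

    def delete_variable(lam, C, k):
        """Coefficients after deleting variable k:
        lambda_i' = lambda_i - c_ik * lambda_k / c_kk, eq. [17]."""
        lam2 = lam - C[:, k] * lam[k] / C[k, k]
        return np.delete(lam2, k)

    # With lam = [0.60284284, 0.02443252, 0.05493595] and the c_i3 column
    # (-1.21057880, -0.16735712, 2.18774425), deleting k = 2 returns
    # [0.63324..., 0.02863...], agreeing with the worked values.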
To demonstrate their operation in discriminant analysis, this same problem will be re-examined using the proposed procedure. As before, the first step would be the preparation of Table 15, but with this difference: only the Σ x_i² (those quantities on the diagonal) and the d_i would be computed. From these, one would determine the potency of all individual single-variable discriminants by [31],

    D_i = (d_i)²/Σ x_i².

These have already been computed and will not be duplicated here. The result, it will be recalled, is that X_1 is the most potent variable, with D_1 = 0.100,034,09. The significance of this discriminant may be tested by means of the first three lines of Table 17, using [26] and [27]. Since D_1 is the same as in the previous analysis, the results in Table 17 will be the same.

To determine the next most potent variable, it would be necessary to complete in Table 15 all rows and columns associated with the most potent variable, X_1. This includes the values in the Check Σ column to be used with [1], [1a], [2], and [2a] in checking the work. The results would be sufficient to evaluate every two-variable discriminant in which one of the variables is the most potent variable, X_1. The entries of Table 15 would now be coded by powers of 10 and removed to individual matrices for the abbreviated Doolittle solution. The matrix for the discriminant using X_1 and X_2 is given in Table 18 (page 75). The result, D = 0.106,626,84, compares with D = 0.103,654,62 for the discriminant with X_1 and X_3. Thus the two-variable discriminant with the most potency is the one with X_1 and X_2; hence X_2 is designated as the second most potent variable. Using [26], [27], and [28], three more lines may be added to Table 17, testing the significance of X_2 as the second most potent variable. Since the value, 0.106,626,84, obtained for the discriminant with X_1 and X_2 is the same (within rounding) as that obtained by deleting the weakest variable of the three, the results of testing significance are the same as before.

To evaluate further independent variables, it is necessary to complete all rows and columns involving X_2 in Table 15, thus making possible the evaluation of all discriminants with three independent variables in which two of the variables are X_1 and X_2. In this particular example there is only one such discriminant (which has already been evaluated), but the procedure as laid out is general; therefore, it applies no matter how many variables are in the study. Also, if there were more variables, the procedure could be extended until either sufficient discriminatory power is obtained or until additional variables added to the discriminant do not significantly increase the discriminatory power. The mask for reducing computational work in solving matrices of three or more independent variables, when there are large numbers of variables, is just as useful in discriminant analysis as in regression analysis. In the present numerical example, the discriminant with three independent variables would turn out as previously worked, so that Table 17 would be completed without change. This indicates that the proposed procedure for determining potent variables and the special-case procedure used for three independent variables give the same results.
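The whole proposed search can now be stated compactly for either regression or discriminant analysis. A sketch (our construction; numpy's solver again stands in for the Doolittle routine, S is the full coded matrix, and g is the coded Σ x_iy column, or the coded d column for a discriminant):

    import numpy as np

    def forward_search(S, g, max_vars=6):
        """Greedy search: the best single variable, then the best partner
        for it, and so on; yields each chosen index with the cumulative
        reduction (or D).  Stopping on significance is the caller's task."""
        S, g = np.asarray(S, float), np.asarray(g, float)
        chosen, remaining = [], list(range(len(g)))
        while remaining and len(chosen) < max_vars:
            def reduction(j):
                idx = chosen + [j]
                A = S[np.ix_(idx, idx)]
                return float(g[idx] @ np.linalg.solve(A, g[idx]))
            best = max(remaining, key=reduction)
            yield best, reduction(best)
            chosen.append(best)
            remaining.remove(best)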
OTHER TESTS OF SIGNIFICANCE

Having identified the most potent predicting (or discriminating) variables in a regression (or discriminant), one would probably wish to complete the regression (or discriminant). This would probably involve calculating the regression coefficients, b_i (or discriminant coefficients, λ_i), the inverse matrix of c_ij values, and the regression formula, Ŷ (or discriminant function, Z). A summary of calculation procedures for finding such quantities and testing the significance of various propositions follows.

The sum of squares of deviations in Y, called Σ y², is calculated in the usual manner for a sum of squares in cell X_5 = Y of column X_5 = Y in Table 1. The coded value of Σ y² is a_55, or g_g.

The reduction in the sum of squares of deviations in Y attributable to regression may be calculated as either

    Σ ŷ² = Σ A_ig B_ig    [5]

or

    Σ ŷ² = Σ b_i g_i,    [9]

and, similarly, the value of the discriminant may be calculated by either

    D = Σ A_ig B_ig    [23]

or

    D = Σ λ_i g_i,    [24]

where i specifies a particular independent variable, X_i, and the other symbols are used in the same way as in Tables 1 and 2. The regression sum of squares may be decoded as

    Σ ŷ²* = Σ ŷ² (1/m_y)²,    [12]

where m_y is the power of 10 used for the Y code in Table 1, and the superscript asterisk denotes a decoded value. Since the units of D are perfectly arbitrary, there is no need to decode D.

The total sum of squares due to regression on any number of variables may be divided into (1) that due to X_1 alone, or Σ ŷ²_1 = A_1gB_1g; (2) that due to X_2 independent of X_1, or Σ ŷ²_2.1 = A_2gB_2g; (3) that due to X_3 independent of X_1 and X_2, or Σ ŷ²_3.12 = A_3gB_3g; (4) etc. Each of these values may be decoded by multiplying by (1/m_y)².

The sum of squares due to adding or deleting any variable to or from a set may be calculated as

    Σ ŷ²_i.all others = b_i²/c_ii.    [15]

This may be decoded by multiplying by (1/m_y)².

The proportion of Σ y² due to regression is

    R² = Σ ŷ²/Σ y² = decoded Σ ŷ²/decoded Σ y²;    [6]

neither form needs to be decoded. The multiple correlation coefficient, R = √(Σ ŷ²/Σ y²), likewise does not need to be decoded.

In general, the sum of squares of deviations from regression (the residual sum of squares, error sum of squares, or sum of squares of deviations in Y independent of X_1, X_2, X_3, ..., X_r) is

    Σ d²_Y.123...r = Σ y² - Σ ŷ² = A_gg.

This may be decoded as

    Σ d²*_Y.123...r = Σ d²_Y.123...r (1/m_y)²,

where d is a deviation, Y - Ŷ, and r is the number of independent variables. The mean square deviation from regression, or sample variance of error in Y independent of X_1, X_2, X_3, ..., X_r, is

    s²_Y.123...r = Σ d²_Y.123...r/[n - (r + 1)] = A_gg/[n - (r + 1)].

This may be decoded as

    s²*_Y.123...r = s²_Y.123...r (1/m_y)².

The regression coefficients, b_i, more properly called the partial regression coefficients, b_Yi.other x's, and also the discriminant coefficients, λ_i, of discriminant analysis, are calculated in two different ways in Table 2 (column X_5 = Y, and bottom section). The values of b_i and λ_i may be decoded as

    b_i* = b_i m_i/m_y    [10]

and

    λ_i* = λ_i m_i,    [29]

where the m-values are the powers of 10 used for coding. Values of λ_i may be further scaled by dividing each λ_i by the smallest λ_i or, more meaningfully perhaps, by the λ of the most potent variable.

The standard partial regression coefficient is

    b_i' = b_i √(a_jj/g_g), where j = i.

This does not need to be decoded.
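The summarizing statistics of this section can be assembled in one routine. A sketch (ours; everything is in coded units, so R² needs no decoding, while the residual mean square would be decoded by multiplying by (1/m_y)²):

    import numpy as np

    def summary_statistics(S, g, a_yy, n):
        """R-squared [6], the residual mean square, and the per-variable
        reductions b_i**2 / c_ii of eq. [15]."""
        C = np.linalg.inv(np.asarray(S, float))     # the c_ij
        b = C @ np.asarray(g, float)                # coded b_i
        ss_reg = float(np.asarray(g) @ b)           # eq. [9]
        r = len(b)
        resid_ms = (a_yy - ss_reg) / (n - (r + 1))
        return ss_reg / a_yy, resid_ms, b**2 / np.diag(C)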
The regression function, or estimated value of Y for any particular set of X-values, is

    Ŷ = Ȳ + Σ b_ix_i,    [7]

where x_i is the deviation of some specified X, say X_i, from its mean, X̄_i. This equation may be rewritten as

    Ŷ = Ȳ - Σ b_iX̄_i + b_1X_1 + b_2X_2 + b_3X_3 + ... + b_rX_r,    [8]

where Ŷ is the estimated value, Ȳ is the mean of Y, X̄_i is the mean of the ith X, and X_1, X_2, X_3, ..., X_r are the particular set of X's of interest. This result may be decoded by using the decoded b_i* with the uncoded X_i, as was done in the numerical example. The discriminant function is

    Z = λ_1X_1 + λ_2X_2 + λ_3X_3 + ...,    [25]

where either the λ_i must be decoded or the X_i must be coded.

The Gauss multipliers, c-values, or elements of the inverse matrix of sums of squares and products, c_ij, are calculated in the lower portion of Table 2. These may be decoded as

    c_ij* = c_ij m_i m_j.    [11]

The estimated variance of Ȳ is given by

    s²_Ȳ = s²_Y.123...r /n.

This may be decoded as

    s²*_Ȳ = s²_Ȳ (1/m_y)².

The estimated variance of any b_i is given by

    s²_b_i = s²_Y.123...r c_ii, where j = i.

This may be decoded as

    s²*_b_i* = s²*_Y.123...r c_ii*.

The estimated variance of a quantity such as Ŷ = Ȳ + b_ix_i, if the x_i is a population characteristic, is the sum of the variances of the two terms, Ȳ and b_ix_i. These values are s²_Y.123...r /n and s²_Y.123...r c_ii x_i², so that

    s²_Ŷ = s²_Y.123...r (1/n + c_ii x_i²).

This may be decoded as

    s²*_Ŷ = s²_Y.123...r (1/n + c_ii x_i² m_i²)(1/m_y)².

The inverse of a matrix with just one independent variable is c_11 = 1/Σ x_1², so that the estimated variance of Ȳ + bx is given by

    s²_Y.1 (1/n + x²/Σ x²),

which is the formula often given for the variance of an estimated Y in simple regression.

The estimated variance of some predicted value, such as Ŷ = Ȳ + b_ix_i, where Ŷ is not a population characteristic but is the result of a single observation, will have all the variability of the estimating procedure as just outlined for the case of a population, plus the variability of an individual Y value, s²_Y.123...r. Thus the variance of the prediction for a single individual is given by

    s²_Y.123...r (1 + 1/n + c_ii x_i²),

which may be decoded as

    s²_Y.123...r (1 + 1/n + c_ii x_i² m_i²)(1/m_y)².

The estimated variance of a quantity such as Ŷ = Ȳ + Σ b_ix_i, if the x_i are population characteristics, is the sum of the variances of the two terms, Ȳ and Σ b_ix_i. The term Σ b_ix_i is itself a linear combination of b's with their coefficients. The variance of any linear combination of b's with coefficients, as b_1x_1 + b_2x_2 + b_3x_3 + ... + b_rx_r, is given as

    s²_Y.123...r {x_1²c_11 + 2x_1x_2c_12 + 2x_1x_3c_13 + ... + 2x_1x_rc_1r
                  + x_2²c_22 + 2x_2x_3c_23 + ... + 2x_2x_rc_2r
                  + x_3²c_33 + ... + 2x_3x_rc_3r
                  + ... + x_r²c_rr}.

One must be careful to observe all signs in the above. Decoding would be a matter of decoding each term. From the foregoing discussion, the variance of a multiple regression estimate, Ŷ = Ȳ + Σ b_ix_i, is (s²_Y.123...r)(1/n) + the variance of the linear combination of b's. The variance of a prediction based on a single individual is (s²_Y.123...r)(1 + 1/n) + the variance of the linear combination of b's.

All of the foregoing variances may be used in the customary manner of using variances to establish confidence limits within which it is believed that the true or population value lies. Thus,

    population parameter = sample estimate ± (t_α)(√variance of the estimate),

where t_α is Student's t at the α probability level of chance occurrence with the degrees of freedom of the error or residual mean square from which the variance was estimated. If it is desired to test a hypothesis rather than establish a confidence interval, then calculate the test statistic

    t = (estimate - hypothetical value)/√(variance of the estimate).
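The quadratic form written out term by term above is x'Cx in matrix notation, so the two prediction variances can be sketched briefly (our code; x_dev holds deviations of the chosen X's from their means, C the decoded inverse matrix, s2 the decoded residual mean square):

    import numpy as np

    def prediction_variance(s2, n, x_dev, C, individual=False):
        """Variance of Y-hat = Y-bar + sum(b_i x_i); add s2 itself for a
        prediction about a single individual."""
        x = np.asarray(x_dev, float)
        var = s2 * (1.0 / n + float(x @ np.asarray(C, float) @ x))
        return var + s2 if individual else var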
Compare this calculated t-value with the distribution of Student's t in standard t tables, using the degrees of freedom of the mean square from which the variance estimate was obtained. The interpretation of such t-values is made in the same manner as for a t calculated for any other statistic.

Some texts give directions for calculating matrices of partial regression coefficients and partial correlation coefficients. Most of these texts start by coding the sums of squares and products, dividing each Σ x_ix_j by √(Σ x_i²)√(Σ x_j²) and thus yielding a matrix of simple correlation coefficients, and then solve for the matrix of c_ij values inverse to this coefficient matrix. The matrix of c_ij in this bulletin may be converted to exactly that matrix, since any c_ij of the matrix inverse to the correlation coefficients is given by

    c_ij* = c_ij m_i m_j √(Σ x_i²)√(Σ x_j²).    [13]

The mean square for error from such a matrix of correlation coefficients can be computed from the results of this analysis as

    s² = s²_Y.123...r /Σ y².

LITERATURE CITED

(1) ANDERSON, R. L. AND BANCROFT, T. A. Statistical Theory in Research. McGraw-Hill Book Co., Inc., N. Y. 1952.
(2) BENNETT, C. A. AND FRANKLIN, N. L. Statistical Analysis in Chemistry and the Chemical Industry. John Wiley and Sons, Inc., N. Y. 1954.
(3) COX, G. M. AND MARTIN, W. P. Use of a Discriminant Function for Differentiating Soils with Different Azotobacter Populations. Iowa State College Jour. Sci. VII: 323-31. 1937.
(4) CROXTON, F. E. AND COWDEN, D. J. Applied General Statistics. Prentice-Hall, Inc., Englewood Cliffs, N. J. 1955.
(5) DWYER, P. S. The Solution of Simultaneous Equations. Psychometrika 6: 101-129. 1941.
(6) DWYER, P. S. Linear Computations. John Wiley and Sons, Inc., N. Y. 1951.
(7) EZEKIEL, MORDECAI. Methods of Correlation Analysis, Revised. John Wiley and Sons, Inc., N. Y. 1941.
(8) FISHER, R. A. Statistical Methods for Research Workers, 11th Ed. Oliver and Boyd, Edinburgh, Scotland. 1950.
(9) FRIEDMAN, JOAN AND FOOTE, R. J. Computational Methods for Handling Systems of Simultaneous Equations with Applications to Agriculture. U. S. Dept. Agr. Agricultural Handbook 94. 1955.
(10) GOGGANS, JAMES F. AND SCHULTZ, E. FRED, JR. Growth of Pine Plantations in Alabama's Coastal Plain. Ala. Agr. Expt. Sta. Bul. 313. 1958.
(11) GOULDEN, C. H. Methods of Statistical Analysis, 2nd Ed. John Wiley and Sons, Inc., N. Y. 1952.
(12) HENDRICKS, W. A. The Theory of Sampling with Special Reference to the Collection and Interpretation of Agricultural Statistics. N. C. Inst. Statis. Mimeo. Series No. 1. 1942.
(13) KRAMER, CLYDE Y. Simplified Computations for Multiple Regression. Ind. Quality Control Vol. 13. Feb. 1957.
(13a) LEONE, FRED C. Statistical Programs for High Speed Computers. Technometrics Vol. 3: 123. 1961.
(14) MATHER, K. Statistical Analysis in Biology. Interscience Publishers, Inc., N. Y. 1946.
(15) MILLS, F. S. Statistical Methods. Henry Holt and Co., N. Y. 1955.
(16) PEACH, PAUL. Curve Fitting and Analysis of Variance. N. C. Inst. Statis. Mimeo. Series No. 3. ca. 1947.
(17) QUENOUILLE, M. H. Associated Measurements. Academic Press, Inc., N. Y. 1952.
(18) RAO, C. R. Advanced Statistical Methods in Biometric Research. John Wiley and Sons, Inc., N. Y. 1952.
(19) SNEDECOR, G. W. Statistical Methods, 5th Ed. The Iowa State University Press, Ames, Ia. 1956.
(20) TIPPETT, L. H. C. Technological Application of Statistics. John Wiley and Sons, Inc., N. Y. 1950.
(21) TIPPETT, L. H. C. The Methods of Statistics, 4th Ed. John Wiley and Sons, Inc., N. Y. 1952.
(22) VILLARS, D. S. Statistical Design and Analysis of Experiments for Development Research. Wm. C. Brown Co., Dubuque, Ia. 1951.
(23) WERT, J. E., NEIDT, C. O., AND AHMANN, I. S. Statistical Methods in Educational and Psychological Research. Appleton-Century-Crofts, Inc., N. Y. 1954.

TABLE 1. CALCULATING, CHECKING AND CODING THE SUMS OF SQUARES AND PRODUCTS OF DEVIATIONS FOR MULTIPLE REGRESSION WITH FOUR INDEPENDENT VARIABLES AND ONE DEPENDENT VARIABLE

[Worksheet. For every pair of variables, including X_5 = Y and the check column X_6 = Σ X_j, each cell carries four entries: the uncorrected sum Σ X_iX_j; the correction (Σ X_i)(Σ X_j)/n; the corrected sum of products Σ x_ix_j = Σ X_iX_j - (Σ X_i)(Σ X_j)/n; and the coded value a_ij = (Σ x_ix_j)(m_i m_j), where the m's are the power-of-10 codes entered at the heads of the rows and columns. The X_5 = Y column carries the g_i, and the cell X_5X_5 carries Σ y² and its coded value g_g. The detailed cell entries are not reproducible from this copy.]

TABLE 2. ABBREVIATED DOOLITTLE SOLUTION OF THE SIMULTANEOUS EQUATIONS FOR A MULTIPLE REGRESSION WITH FOUR INDEPENDENT VARIABLES AND ONE DEPENDENT VARIABLE

[Worksheet of directions. The upper section carries the a_ij with their row sums h_i in a check column. The middle section carries the "forward" solution rows A_1, B_1, A_2, B_2, ..., each B-row being its A-row divided by the leading (diagonal) A, and ends in the reduced Y row, from which Σ ŷ² = Σ A_ig B_ig is obtained. The lower section carries the "back" solution for the b_i; the Gauss multipliers c_ij, starting from c_44 = 1/A_44 at the bottom; the row checks Σ_j a_ij c_ij = 1.0; the alternative computation b_i = Σ_j c_ij g_j; and the predicted value Ŷ = Ȳ - Σ b_iX̄_i + b_1X_1 + b_2X_2 + b_3X_3 + b_4X_4. The detailed cell entries are not reproducible from this copy.]
TABLE 3. FORM FOR CALCULATING SUMS OF SQUARES AND PRODUCTS REQUIRED IN SEARCH FOR POTENT VARIABLES IF THERE ARE 10 INDEPENDENT VARIABLES

[Worksheet of the same form as Table 1, extended to ten independent variables, with X_11 = Y and X_12 = Check Σ X_j. Each cell again carries Σ X_iX_j, the correction (Σ X_i)(Σ X_j)/n, the corrected Σ x_ix_j, and the coded a_ij. The cells are marked (1), (2), (3), (4), ... to show the stage of the search at which they must first be completed: (1) the diagonal cells, the Y column, and the check cell, filled at the start; (2) the remaining cells of the row and column of the most potent single variable; (3) those of the second most potent variable; and so on. The detailed cell entries are not reproducible from this copy.]

TABLE 4. ABBREVIATED DOOLITTLE SOLUTION SHOWING MASK AND PARTS THAT ARE IDENTICAL IN EACH OF THE EIGHT REGRESSIONS OF Y ON THREE INDEPENDENT VARIABLES (OTHER PARTS VARY AS X_3' VARIES)(a)

[Worksheet of the same form as Table 2 for three independent variables, with the rows and columns belonging to the two constant (most potent) variables lying on the mask and the column for X_3' left open. The detailed cell entries are not reproducible from this copy.]

(a) "Primes" indicate that the subscripts of this table are not the original subscripts of Table 3.

TABLE 5. F TESTS OF SIGNIFICANCE OF ADDITIONAL VARIABLES IN MULTIPLE REGRESSION(a)

    Source of variation                           Degrees of freedom(c)   Sum of squares(b)
    (1) Total                                     n - 1                   Σ y²
    (2) Reduction due to most potent variable     1                       Σ ŷ², 1 variable
    (3) Residual = (1) - (2)                      n - 1 - 1
    (4) Reduction due to most potent pair
        of variables                              2                       Σ ŷ², 2 variables
    (5) Second variable independent of
        first = (4) - (2)                         1
    (6) Residual = (1) - (4)                      n - 1 - 2
    (7) Reduction due to most potent trio
        of variables                              3                       Σ ŷ², 3 variables
    (8) Third variable independent of first
        two = (7) - (4)                           1
    (9) Residual = (1) - (7)                      n - 1 - 3

    Mean squares are the sums of squares divided by their degrees of freedom; the F for each added variable is the mean square of line (2), (5), or (8) divided by the residual mean square of line (3), (6), or (9), and is referred to tables of F for the chance probability.

(a) The testing process can be extended to as many variables as desired.
(b) All sums of squares are sums of squares of deviations in Y and must all be either coded or decoded, but not mixed. It would seem that decoded values would be better, since this would allow for changes in code as the problem proceeds, if desirable.
(c) n = the number of sets of observations of the X_i and Y.
TABLE 6. AVERAGE HEIGHTS, Y, OF DOMINANT TREES TOGETHER WITH MEASUREMENTS OF SIX POSSIBLE PREDICTORS, X_i, OF HEIGHT FOR 40 PLANTINGS OF LONGLEAF PINE

    Planting  X_1(a)  X_2(b)  X_3(c)  X_4(d)  X_5=X_4²  X_6=X_1X_4  X_7=Y   X_8=Check Σ
     1        16.8    10.6    31.5    11      121       184.8       32.2     407.9
     2        30.4    17.5    63.3     7       49       212.8       26.0     406.0
     3        14.6     5.0    29.0    11      121       160.6       29.5     370.7
     4        42.8    19.9    69.2    11      121       470.8       30.6     765.3
     5        19.7     3.7    30.9    12      144       236.4       32.6     479.3
     6        15.1     5.7    35.4    11      121       166.1       27.9     382.2
     7        18.3     5.8    28.4    10      100       183.0       29.0     374.5
     8        11.7     5.6    28.2     7       49        81.9       15.9     199.3
     9        13.5     2.6    21.3    11      121       148.5       24.4     342.3
    10         8.5     9.1    33.5    10      100        85.0       31.3     277.4
    11         7.8     4.1    16.9    10      100        78.0       31.6     248.4
    12         9.5     2.3    21.0    11      121       104.5       26.6     295.9
    13        14.6     3.6    27.8    11      121       160.6       26.3     364.9
    14        14.6     6.5    37.6     7       49       102.2       14.9     231.8
    15        13.6     3.7    25.1    11      121       149.6       24.0     348.0
    16        15.8     4.5    31.4     6       36        94.8       16.0     204.5
    17        19.0     4.8    28.1    12      144       228.0       28.0     463.9
    18        23.8     6.1    29.5    12      144       285.6       33.5     534.5
    19        36.2     6.6    51.6    12      144       434.4       34.9     719.7
    20        28.0     6.1    43.2    12      144       336.0       33.8     603.1
    21        19.3     5.3    48.4    14      196       270.2       40.3     593.5
    22        26.2     5.8    34.8    14      196       366.8       42.6     686.2
    23        15.9     2.8    23.8    13      169       206.7       31.5     462.7
    24        13.5     3.0    25.3    10      100       135.0       31.1     317.9
    25        13.2     2.4    26.3     8       64       105.6       20.2     239.7
    26        18.4     2.9    37.6     8       64       147.2       20.5     298.6
    27        20.6     4.4    34.5    15      225       309.0       41.5     650.0
    28        32.0     5.8    44.6    12      144       384.0       31.0     653.4
    29        21.2     7.0    38.0    11      121       233.2       24.7     456.1
    30        23.8     3.5    26.9    15      225       357.0       29.7     680.9
    31        29.3     5.5    33.7    15      225       439.5       33.2     781.2
    32        26.4     4.2    33.4    12      144       316.8       28.0     564.8
    33        22.8     3.4    29.7    11      121       250.8       27.0     465.7
    34        34.0     5.9    45.7    12      144       408.0       30.9     680.5
    35        19.5     5.3    33.6    12      144       234.0       26.5     474.9
    36        18.2     5.2    52.5    12      144       218.4       31.8     482.1
    37        27.3     4.1    36.6    12      144       327.6       34.9     586.5
    38        24.7     5.7    57.8    12      144       296.4       30.5     571.1
    39        19.8    15.7    54.0     8       64       158.4       18.0     337.9
    40        20.4     3.4    24.0     6       36       122.4       16.3     228.5

    Σ X      820.8   235.1  1,424.1  437.0  4,985.0   9,190.6    1,139.2  18,231.8
    Mean     20.52    5.88   35.60   10.92   124.62    229.76      28.48

(a) X_1 = silt plus clay content of topsoil in per cent. (b) X_2 = imbibitional water value of the most impervious soil horizon. (c) X_3 = silt plus clay content of B horizon in per cent. (d) X_4 = age of planting in years.

TABLE 7. CALCULATION OF CODED SUMS OF SQUARES AND PRODUCTS FOR MULTIPLE REGRESSION OF HEIGHT OF LONGLEAF PINE ON SIX OTHER VARIABLES

    Codes: X_1, 0.01; X_2, 0.1; X_3, 0.01; X_4, 0.1; X_5, 0.01; X_6, 0.001; X_7 = Y, 0.1.

    Corrected sums of squares and products, Σ x_ix_j (each cell of the original also carries the uncorrected Σ X_iX_j, the correction (Σ X_i)(Σ X_j)/n, and the coded a_ij of [20]; cells that the search never required were never computed and stand blank):

            X_1       X_2       X_3       X_4     X_5       X_6         X_7 = Y    Check Σ
    X_1     2,362.24  535.76              223.4                           694.59
    X_2               574.57    1,307.64  -69.3   -1,493.0  3,885.64      -14.96    4,726.45
    X_3                         5,522.85  20.7                            489.75
    X_4                                   211.    4,392.    6,753.1       503.3    12,033.88
    X_5                                           93,337.               10,350.7
    X_6                                                     474,585.41  17,452.34
    X_7=Y                                                                1,727.10  31,202.87
TABLE 7. CALCULATION OF CODED SUMS OF SQUARES AND PRODUCTS FOR MULTIPLE REGRESSION OF HEIGHT OF LONGLEAF PINE ON SIX OTHER VARIABLES

Variable    ΣXi       Mean    Code mi
X1          820.8     20.5    0.01
X2          235.1     5.9     0.1
X3          1,424.1   35.6    0.01
X4          437.0     10.9    0.1
X5          4,985.0   124.6   0.01
X6          9,190.6   229.8   0.001
X7 = Y      1,139.2   28.5    0.1
X8 = Check  18,231.8

Pair      ΣXiXj         Cij=(ΣXi)(ΣXj)/n  Σxixj        aij=(Σxixj)(mi)(mj)
X1X1      19,205.06     16,842.82         2,362.24     0.236,24
X1X2      5,360.01      4,824.25          535.76       0.535,76
X1X4      9,190.6       8,967.2           223.4        0.223,4
X1Y       24,070.97     23,376.38         694.59       0.694,59
X2X2      1,956.37      1,381.80          574.57       5.745,7
X2X3      9,677.79      8,370.15          1,307.64     1.307,64
X2X4      2,499.2       2,568.5           -69.3        -0.693
X2X5      27,806.4      29,299.4          -1,493.0     -1.493,0
X2X6      57,903.39     54,017.75         3,885.64     0.388,564
X2Y       6,680.69      6,695.65          -14.96       -0.149,6
X2 Check  111,883.85    107,157.40        4,726.45
X3X3      56,224.37     50,701.52         5,522.85     0.552,285
X3X4      15,579.0      15,558.3          20.7         0.020,7
X3Y       41,048.12     40,558.37         489.75       0.489,75
X4X4      4,985.        4,774.            211.         2.110
X4X5      58,853.       54,461.           4,392.       4.392
X4X6      107,160.4     100,407.3         6,753.1      0.675,31
X4Y       12,949.1      12,445.8          503.3        5.033
X4 Check  211,216.30    199,182.42        12,033.88
X5X5      714,593.      621,256.          93,337.      9.333,7
X5Y       152,323.5     141,972.8         10,350.7     10.350,7
X6X6      2,586,263.62  2,111,678.21      474,585.41   0.474,585,41
X6Y       279,200.63    261,748.29        17,452.34    1.745,234
YY        34,171.52     32,444.42         1,727.10     17.271,0
Y Check   550,444.53    519,241.66        31,202.87

(Only the sums of products actually required in the search were computed; the remaining cells of the full table were never needed.)

TABLE 8. TESTS OF SIGNIFICANCE OF ADDITIONAL VARIABLES IN PREDICTING HEIGHT OF LONGLEAF PINE(a)

Source of variation                                    Degrees of  Sum of    Mean      F      Chance
                                                       freedom     squares   square           probability
(1) Total                                              39          1,727.10
(2) Reduction due to age                               1           1,200.53  1,200.53  86.62  <0.001
(3) Residual = (1) - (2)                               38          526.57    13.86
(4) Reduction due to age and imbibitional
    water value                                        2           1,241.49
(5) Reduction due to imbibitional water value
    independent of age = (4) - (2)                     1           40.96     40.96     3.12   <0.1
(6) Residual = (1) - (4)                               37          485.61    13.12
(7) Reduction due to age, imbibitional water
    value, and (age)²                                  3           1,248.02
(8) Reduction due to (age)² independent of
    others = (7) - (4)                                 1           6.53      6.53      0.49   >0.3
(9) Residual = (1) - (7)                               36          479.08    13.31

(a) Decoded sums of squares.

TABLE 9. "FORWARD" PORTION OF THE ABBREVIATED DOOLITTLE SOLUTION FOR MULTIPLE REGRESSION OF HEIGHT ON X'1 = X4, AGE, AND X'2 = X1, SILT AND CLAY OF TOPSOIL(a)

Row        X'1=X4        X'2=X1        X'3=Y          X'4=Check Σ
X'1=X4     2.110,000,00  0.223,400,00  5.033,000,00   7.366,400,00
X'2=X1                   0.236,240,00  0.694,590,00   1.154,230,00
X'3=Y                                  17.271,000,00  22.998,590,00
A1j        2.110,000,00  0.223,400,00  5.033,000,00   7.366,400,00
B1j        1.0           0.105,876,78  2.385,308,06   3.491,184,83
A2j                      0.212,587,13  0.161,712,18   0.374,299,31
B2j                      1.0           0.760,686,59   1.760,686,59
A3j                                    5.142,732,25   5.142,732,28

Σŷ'² = Σ(Aj,Y)(Bj,Y) = (5.033,000,00)(2.385,308,06) + (0.161,712,18)(0.760,686,59)
     = 12.005,255,47 + 0.123,012,29 = 12.128,267,76
Decoded: Σŷ² = 12.128,267,76 ÷ (0.1)² = 1,212.83

(a) "Primes" indicate that the subscripts of this table are not the original subscripts of Tables 6 and 7.

TABLE 10. REDUCTIONS IN SUM OF SQUARES OF Y DUE TO THE SPECIFIED TWO VARIABLES

Variables   Σŷ² (coded)     Decoded Σŷ²
X4 X2       12.414,865,59   1,241.49
X4 X3       12.356,524,65   1,235.65
X4 X1       12.128,267,75   1,212.83
X4 X5       12.087,520,95   1,208.75
X4 X6       12.075,158,38   1,207.52

TABLE 11. MASK TO AID IN SOLUTIONS OF REGRESSIONS OF HEIGHT ON X''1 = X4 = AGE, X''2 = X2 = IMBIBITIONAL WATER VALUE, AND X''3 = ?(a)

Row         X''1=X4        X''2=X2        X''3=?   X''4=Y         Check Σ
X''1=X4     2.110,000,00   -0.693,000,00  ----     5.033,000,00   ----
X''2=X2                    5.745,700,00   ----     -0.149,600,00  ----
X''4=Y                                    ----     17.271,000,00  ----
A1j         2.110,000,00   -0.693,000,00  ----     5.033,000,00   ----
B1j         1.0            -0.328,436,02  ----     2.385,308,06   ----
A2j                        5.518,093,84   ----     1.503,418,48   ----
B2j                        1.0            ----     0.272,452,50   ----

Computational note: Since Σŷ² = Σ(Aj,Y)(Bj,Y) from the Y column, each and every Σŷ² will have (5.033,000,00)(2.385,308,06) + (1.503,418,48)(0.272,452,50) as part of the result; this quantity may be calculated once and entered on the mask to be used as needed.

(a) Double "primes" indicate possible changes in subscripts beyond those denoted by "primes."
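The "forward" Doolittle arithmetic of Tables 9 and 12, and the saving that the mask of Table 11 exploits, can be restated compactly in modern notation. The Python sketch below is our own, not a procedure given in the bulletin, and the names are invented:

def doolittle_reduction(a, y_col):
    """Forward abbreviated Doolittle solution on the symmetric matrix a of
    coded sums of squares and products, with Y in column (and row) y_col.
    Returns the coded reduction in the sum of squares of Y."""
    k = y_col                      # independent variables occupy columns 0 .. k-1
    A = [row[:] for row in a]      # A[i] becomes the i-th "A" row
    B = [[0.0] * len(a) for _ in a]
    reduction = 0.0
    for i in range(k):
        for j in range(i, len(a)):
            A[i][j] = a[i][j] - sum(A[m][i] * B[m][j] for m in range(i))
        for j in range(i, len(a)):
            B[i][j] = A[i][j] / A[i][i]          # divide the A row by its pivot
        reduction += A[i][y_col] * B[i][y_col]   # sum of products of the Y entries
    return reduction

# Regression of height on X4 (age) and X1 (silt and clay of topsoil),
# coded sums of squares and products from Table 7:
a = [[2.110,  0.2234,  5.033],     # X4 row
     [0.2234, 0.23624, 0.69459],   # X1 row
     [5.033,  0.69459, 17.2710]]   # Y row
print(doolittle_reduction(a, 2))   # ~12.128, cf. Table 9 (decoded: 1,212.83)

Because the first A and B rows depend only on the variables already chosen and on Y, they are the same for every trial third variable; that is exactly the part the mask of Table 11 records once for all eight candidate regressions.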
TABLE 12. "FORWARD" PORTION OF THE ABBREVIATED DOOLITTLE SOLUTION FOR MULTIPLE REGRESSION OF HEIGHT ON X''1 = X4, AGE; X''2 = X2, IMBIBITIONAL WATER VALUE; AND X''3 = X1, SILT AND CLAY CONTENT OF TOPSOIL, SHOWING MASK IN POSITION(a)

Row        X''1=X4        X''2=X2        X''3=X1        X''4=Y         X''5=Check Σ
X''1=X4    2.110,000,00   -0.693,000,00  0.223,400,00   5.033,000,00   6.673,400,00
X''2=X2                   5.745,700,00   0.535,760,00   -0.149,600,00  5.438,860,00
X''3=X1                                  0.236,240,00   0.694,590,00   1.689,990,00
X''4=Y                                                  17.271,000,00  22.848,990,00
A1j        2.110,000,00   -0.693,000,00  0.223,400,00   5.033,000,00   6.673,400,00
B1j        1.0            -0.328,436,02  0.105,876,78   2.385,308,06   3.162,748,82
A2j                       5.518,093,84   0.609,132,61   1.503,418,48   7.630,644,93
B2j                       1.0            0.110,388,23   0.272,452,50   1.382,840,73
A3j                                      0.145,346,06   -0.004,247,52  0.141,098,53
B3j                                      1.0            -0.029,223,52  0.970,776,44
A4j                                                     4.856,010,28   4.856,010,28

Σŷ''² = (5.033,000,00)(2.385,308,06) + (1.503,418,48)(0.272,452,50) + (-0.004,247,52)(-0.029,223,52)
      = 12.005,255,47 + 0.409,610,12 + 0.000,124,13 = 12.414,989,72
Decoded: Σŷ² = 12.414,989,72 ÷ (0.1)² = 1,241.50

(a) Double "primes" indicate possible changes in subscripts beyond those denoted by "primes."

TABLE 13. REDUCTION IN SUM OF SQUARES OF Y DUE TO THE SPECIFIED THREE VARIABLES

Variables    Σŷ² (coded)     Decoded Σŷ²
X4 X2 X5     12.480,246,12   1,248.02
X4 X2 X6     12.443,175,56   1,244.32
X4 X2 X3     12.420,189,26   1,242.02
X4 X2 X1     12.414,989,72   1,241.50

TABLE 14. MEASUREMENTS OF 3 POSSIBLE DISCRIMINATORS OF THE PRESENCE OF Azotobacter IN SOILS (FROM GOULDEN, FROM COX AND MARTIN)(a)

Group I (n1 = 25), soils containing Azotobacter:

No.   X1      X2       X3      Check
 1    6.0     46       24      76.0
 2    7.0     35       17      59.0
 3    8.4     115      28      151.4
 4    5.8     35       17      57.8
 5    6.9     55       25      86.9
 6    7.8     52       29      88.8
 7    7.8     52       29      88.8
 8    6.9     208      58      272.9
 9    7.0     70       13      90.0
10    6.7     35       16      57.7
11    6.2     27       44      77.2
12    6.9     52       27      85.9
13    8.0     60       58      126.0
14    8.0     156      68      232.0
15    8.0     90       37      135.0
16    6.1     44       27      77.1
17    7.4     207      31      245.4
18    7.4     120      32      159.4
19    8.4     65       43      116.4
20    8.1     237      45      290.1
21    8.3     57       60      125.3
22    7.0     94       43      144.0
23    8.5     86       40      134.5
24    8.4     52       48      108.4
25    7.9     146      52      205.9
Σ     184.9   2196     911     3,291.9
Mean  7.3960  87.8400  36.4400

Group II (n2 = 27), soils without Azotobacter:

No.   X1      X2       X3       Check
 1    6.2     49       30       85.2
 2    5.6     31       23       59.6
 3    5.8     42       22       69.8
 4    5.7     42       14       61.7
 5    6.2     40       23       69.2
 6    6.4     49       18       73.4
 7    5.8     31       17       53.8
 8    6.4     31       19       56.4
 9    5.4     62       26       93.4
10    5.4     42       16       63.4
11    5.7     35       22       62.7
12    5.6     33       24       62.6
13    5.8     24       15       44.8
14    7.3     70       14       91.3
15    6.1     21       21       48.1
16    6.2     36       26       68.2
17    6.7     35       26       67.7
18    5.9     33       21       59.9
19    5.6     25       32       62.6
20    5.8     31       30       66.8
21    6.1     30       24       60.1
22    6.1     21       25       52.1
23    5.7     35       22       62.7
24    5.8     37       24       66.8
25    5.8     28       19       52.8
26    5.7     34       20       59.7
27    5.8     16       19       40.8
Σ     160.6   963      592      1,715.6
Mean  5.9481  35.6667  21.9259

(a) X1 = pH, X2 = available phosphate content, X3 = total nitrogen content.
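Before following Table 15 below, it may help to see its pooling arithmetic in program form. A minimal Python sketch (ours; the names are invented), using the pH column of Table 14:

def pooled_sum_of_products(u1, v1, u2, v2):
    """Pooled within-group corrected sum of products of two variables,
    observed as lists u, v in each of the two groups."""
    total = 0.0
    for u, v in ((u1, v1), (u2, v2)):
        n = len(u)
        total += sum(a * b for a, b in zip(u, v)) - sum(u) * sum(v) / n
    return total

g1_x1 = [6.0, 7.0, 8.4, 5.8, 6.9, 7.8, 7.8, 6.9, 7.0, 6.7, 6.2, 6.9, 8.0,
         8.0, 8.0, 6.1, 7.4, 7.4, 8.4, 8.1, 8.3, 7.0, 8.5, 8.4, 7.9]
g2_x1 = [6.2, 5.6, 5.8, 5.7, 6.2, 6.4, 5.8, 6.4, 5.4, 5.4, 5.7, 5.6, 5.8,
         7.3, 6.1, 6.2, 6.7, 5.9, 5.6, 5.8, 6.1, 6.1, 5.7, 5.8, 5.8, 5.7, 5.8]

print(pooled_sum_of_products(g1_x1, g1_x1, g2_x1, g2_x1))
# ~20.957, cf. 16.5296 + 4.4274 = 20.9570 in Table 15
print(sum(g1_x1) / len(g1_x1) - sum(g2_x1) / len(g2_x1))
# ~1.4479, the mean difference d1 of Table 15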
TABLE 15. CALCULATION OF CODED SUMS OF SQUARES AND PRODUCTS FOR DISCRIMINATION OF PRESENCE OF Azotobacter IN SOIL

Group I, containing Azotobacter (n1 = 25); ΣX1 = 184.9, ΣX2 = 2196, ΣX3 = 911; means 7.3960, 87.8400, 36.4400:

Pair      ΣXiXj      Cij           Σxixj
X1X1      1,384.05   1,367.5204    16.5296
X1X2      16,620.8   16,241.6160   379.1840
X1X3      6,892.8    6,737.7560    155.0440
X1 Check  24,897.65  24,346.8924   550.7576
X2X2      278,162.   192,896.6400  85,265.3600
X2X3      89,344.    80,022.2400   9,321.7600
X2 Check  384,126.8  289,160.4960  94,966.3040
X3X3      38,701.    33,196.8400   5,504.1600
X3 Check  134,937.8  119,956.8360  14,980.9640

Group II, without Azotobacter (n2 = 27); ΣX1 = 160.6, ΣX2 = 963, ΣX3 = 592; means 5.9481, 35.6667, 21.9259:

Pair      ΣXiXj      Cij          Σxixj
X1X1      959.70     955.2726     4.4274
X1X2      5,770.8    5,728.0667   42.7333
X1X3      3,514.5    3,521.3037   -6.8037
X1 Check  10,245.00  10,204.6430  40.3570
X2X2      37,979.    34,347.0000  3,632.0000
X2X3      20,928.    21,114.6667  -186.6667
X2 Check  64,677.8   61,189.7333  3,488.0667
X3X3      13,566.    12,980.1481  585.8519
X3 Check  38,008.5   37,616.1185  392.3815

Both groups pooled, with codes m1 = 0.1, m2 = 0.01, m3 = 0.01, and the d column (X4 = d, the difference between group means, code 1.0):

Item               Σxixj or di (pooled)  Coded aij or hi
x1x1               20.9570               0.209,570,00
x1x2               421.9173              0.421,917,30
x1x3               148.2403              0.148,240,30
x2x2               88,897.3600           8.889,736,00
x2x3               9,135.0933            0.913,509,33
x3x3               6,090.0119            0.609,001,19
d1                 1.4479                0.144,790,00
d2                 52.1733               0.521,733,00
d3                 14.5141               0.145,141,00

TABLE 16. ABBREVIATED DOOLITTLE SOLUTION FOR DISCRIMINANT FUNCTION FOR PRESENCE OF Azotobacter BASED ON X1, pH; X2, AVAILABLE PHOSPHATE CONTENT; AND X3, TOTAL NITROGEN CONTENT(a)(b)

Row    X1            X2            X3            X4=d           X5=Check Σ
X1     0.209,570,00  0.421,917,30  0.148,240,30  0.144,790,00   0.924,517,60
X2                   8.889,736,00  0.913,509,33  0.521,733,00   10.746,895,63
X3                                 0.609,001,19  0.145,141,00   1.815,891,82
A1j    0.209,570,00  0.421,917,30  0.148,240,30  0.144,790,00   0.924,517,60
B1j    1.0           2.013,252,37  0.707,354,58  0.690,890,87   4.411,497,83
A2j                  8.040,310,00  0.615,064,20  0.230,234,19   8.885,608,38
B2j                  1.0           0.076,497,57  0.028,634,99   1.105,132,57
A3j                                0.457,091,82  0.025,110,77   0.482,202,58
B3j                                1.0           0.054,935,95   1.054,935,92

Back solution: λ1 = 0.602,842,85; λ2 = 0.024,432,52; λ3 = 0.054,935,95. Back-solving the check column gives λi + 1 as a check: 1.602,842,84; 1.024,432,54; 1.054,935,92; the differences reproduce 1.000,000,00 and 0.999,999,94 within rounding.

Gauss multipliers (elements of the inverse matrix): c11 = 5.945,651,91; c12 = -0.157,788,52; c13 = -1.210,578,80; c22 = 0.137,175,72; c23 = -0.167,357,12; c33 = 2.187,744,25.

D = Σ(Aj,d)(Bj,d) = 0.100,034,09 + 0.006,592,75 + 0.001,379,48 = 0.108,006,32

Decoded λi (each coded λi times its code mi): 0.060,28; 0.000,244,3; 0.000,549,4.
Discriminant: Z = 0.060,28X1 + 0.000,244,3X2 + 0.000,549,4X3, or, dividing through by the smallest coefficient, Z = 246.7X1 + X2 + 2.248X3.

(a) The λi must be decoded; or the Xi must be coded.
(b) Divide each λi by the smallest λ-value.

TABLE 17. TESTS OF SIGNIFICANCE OF ADDITIONAL VARIABLES IN THE DISCRIMINANT FUNCTION FOR PRESENCE OF Azotobacter IN SOILS(a)

Source of variation                               Degrees of  Sum of     Mean       F      Chance
                                                  freedom     squares    square            probability
(1) Total = (2) + (3)                             51          0.229,930
(2) SS due to most potent variable, X1 = pH       1           0.129,896  0.129,896  64.92  <0.001
(3) Residual                                      50          0.100,034  0.002,001
(4) SS due to 2 most potent variables, X1 and X2
    = pH and phosphate (adjusted to SS for X1)    2           0.133,487
(5) Phosphate independent of pH = (4) - (2)       1           0.003,591  0.003,591  1.82   >0.10
(6) Residual = (1) - (4)                          49          0.096,443  0.001,968
(7) SS due to all 3 variables (adjusted to SS
    for X1)                                       3           0.134,206
(8) Nitrogen independent of other variables
    = (7) - (4)                                   1           0.000,719  0.000,719  0.36   >0.5
(9) Residual = (1) - (7)                          48          0.095,724  0.001,994

(a) All sums of squares were adjusted so that every discriminant would have the same total sum of squares as the most potent single discriminator, pH.

TABLE 18. "FORWARD" PORTION OF THE ABBREVIATED DOOLITTLE SOLUTION FOR THE DISCRIMINANT BASED ON X'1 = X1, pH, AND X'2 = X2, SOIL PHOSPHATE CONTENT(a)

Row      X'1=X1        X'2=X2        X'3=d          X'4=Check Σ
X'1=X1   0.209,570,00  0.421,917,30  0.144,790,00   0.776,277,30
X'2=X2                 8.889,736,00  0.521,733,00   9.833,386,30
A1j      0.209,570,00  0.421,917,30  0.144,790,00   0.776,277,30
B1j      1.0           2.013,252,37  0.690,890,87   3.704,143,25
A2j                    8.040,310,00  0.230,234,19   8.270,544,19
B2j                    1.0           0.028,634,99   1.028,634,99

D = Σ(Aj,d)(Bj,d) = (0.144,790,00)(0.690,890,87) + (0.230,234,19)(0.028,634,99)
  = 0.100,034,09 + 0.006,592,75 = 0.106,626,84

(a) "Primes" indicate that the subscripts of this table may not be the original subscripts of Tables 14, 15, and 16.
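In matrix terms, Tables 16 and 18 solve the normal equations Σj aij λj = hi for the coded discriminant coefficients and accumulate D = Σ hi λi. The Python sketch below is ours; a plain Gaussian elimination stands in for the Doolittle layout, and the names are invented. It reproduces the figures of Table 16:

def solve_linear(a, h):
    """Gaussian elimination for the small system a * x = h."""
    n = len(h)
    m = [row[:] + [hi] for row, hi in zip(a, h)]   # augmented matrix
    for i in range(n):                             # forward elimination
        for j in range(i + 1, n):
            f = m[j][i] / m[i][i]
            for k in range(i, n + 1):
                m[j][k] -= f * m[i][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                 # back solution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

# Coded sums of squares and products and the d column, from Table 15:
a = [[0.209570, 0.421917, 0.148240],
     [0.421917, 8.889736, 0.913509],
     [0.148240, 0.913509, 0.609001]]
h = [0.144790, 0.521733, 0.145141]

lam = solve_linear(a, h)                    # ~0.6028, 0.02443, 0.05494, cf. Table 16
D = sum(hi * li for hi, li in zip(h, lam))  # ~0.1080, cf. D = 0.108,006,32
codes = (0.1, 0.01, 0.01)                   # decode: multiply each coded lambda by its code
decoded = [li * mi for li, mi in zip(lam, codes)]  # ~0.06028, 0.0002443, 0.0005494
print([v / min(decoded) for v in decoded])         # ~246.7, 1.0, 2.248, cf. Table 16

Dividing by the smallest decoded coefficient reproduces the working form of the discriminant, 246.7X1 + X2 + 2.248X3.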