# SAS for Statistical Procedures

I.A.S.R.I., Library Avenue, New Delhi-110 012
rajender@iasri.res.in
1. Introduction
SAS (Statistical Analysis System) is a comprehensive software system that handles many tasks
related to statistical analysis, spreadsheets, data creation, graphics, etc. It has a layered,
multivendor architecture: regardless of differences in hardware, operating systems, etc., SAS
applications look the same and produce the same results. The three components of the SAS
System are Host, Portable Applications and Data. The Host component provides all the required
interfaces between the SAS System and the operating environment. Functionalities and
applications reside in the Portable component, and the user supplies the Data. In this course we
deal with the part of the software used for statistical analysis of data.
Windows of SAS
1. Program Editor : All the instructions are given here.
2. Log : Displays SAS statements submitted for execution and messages
3. Output : Gives the output generated
Rules for SAS Statements
1. A SAS program communicates with the computer through SAS statements.
2. Each statement of a SAS program must end with a semicolon (;).
3. Each program must end with a RUN statement.
4. Statements can be started from any column.
5. One can use upper case letters, lower case letters or the combination of the two.
Basic Sections of SAS Program
1. DATA section
2. CARDS section
3. PROCEDURE section
Data Section
We shall discuss some facts regarding data before we give the syntax for this section.
Data value: A single unit of information, such as the name of the species to which a tree
belongs, the height of one tree, etc.
Variable: A set of values that describe a specific data characteristic, e.g. the diameters of all
trees in a group. A variable name can have up to 8 characters in older releases of SAS (up to 32
characters from Version 7 onwards) and must begin with a letter or an underscore. Variables are
of two types:
Character Variable: A combination of letters of the alphabet, numbers and special characters
or symbols.
Numeric Variable: Numbers with or without decimal points and with or without a + or - sign.
Observation: A set of data values for the same item, e.g. all measurements on a tree.
Data section starts with Data statements as
DATA NAME (it has to be supplied by the user);
Input Statements
The INPUT statement is part of the DATA section. It gives the SAS System the names of the
variables, together with their formats if the data are formatted.
List Directed Input
Data are read in the order of variables given in input statement.
Data values are separated by one or more spaces.
Missing values are represented by period (.).
A character variable is indicated by a $ (dollar sign) placed after its name.
Example
Data A;
INPUT ID SEX $ AGE HEIGHT WEIGHT;
CARDS;
1 M 23 68 155
2 F . 61 102
3 M 55 70 202
;
Column Input
The columns occupied by each variable can be indicated in the INPUT statement, for example:
INPUT ID 1-3 SEX $ 4 HEIGHT 5-6 WEIGHT 7-11;
CARDS;
001M68155.5
  2F61   99
  3M53 33.5
;
Alternatively, the starting column of each variable can be indicated along with its length as
INPUT @1 ID 3.
@4 SEX $1.
@9 AGE 2.
@11 HEIGHT 2.
@16 V_DATE MMDDYY6.
;
Reading More than One Line Per Observation for One Record of Input Variables
INPUT #1 ID 1-3 AGE 5-6 HEIGHT 10-11
#2 SBP 5-7 DBP 8-10;
CARDS;
001 56   72
    140 80
;
Reading the Variable More than Once
Suppose the ID variable is read from six columns, and the state code is given in the last two
columns of the ID, for example:
INPUT @ 1 ID 6. @ 5 STATE 2.;
OR
INPUT ID 1-6 STATE 5-6;
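The re-reading of columns can be tried out with a small DATA step. The following is only a sketch: the data set name REREAD and the two ID values are hypothetical and are used purely for illustration.

```sas
/* Hypothetical data: a six-digit ID whose last two digits are the state code */
data reread;
  input id 1-6 state 5-6;   /* columns 5-6 are read twice: as part of ID and as STATE */
  datalines;
110027
120031
;
run;

proc print data=reread;
run;
```

Each printed observation should show the full six-digit ID together with its last two digits as STATE.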
Formatted Lists
DATA B;
INPUT ID @1(X1-X2)(1.)
@4(Y1-Y2)(3.);
CARDS;
11 563789
22 567987
;
PROC PRINT;
RUN;
Output
Obs. ID x1 x2 y1 y2
1 11 1 1 563 789
2 22 2 2 567 987
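A single trailing @ holds the current data line only until the end of the current iteration of
the DATA step, so only the first three values of each line are read:
DATA C;
INPUT X Y Z @;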
CARDS;
1 1 1 2 2 2 5 5 5 6 6 6
1 2 3 4 5 6 3 3 3 4 4 4
;
PROC PRINT;
RUN;
Output
Obs. X Y Z
1 1 1 1
2 1 2 3
A double trailing @@ holds the data line across iterations of the DATA step, so several
observations can be read from each line:
DATA D;
INPUT X Y Z @@;
CARDS;
1 1 1 2 2 2 5 5 5 6 6 6
1 2 3 4 5 6 3 3 3 4 4 4
;
PROC PRINT;
RUN;
Output:
Obs. X Y Z
1 1 1 1
2 2 2 2
3 5 5 5
4 6 6 6
5 1 2 3
6 4 5 6
7 3 3 3
8 4 4 4
DATA FILES
SAS System Can Read and Write
A. Simple ASCII files are read with input and infile statements
B. Output Data files
Creation of SAS Data Set
DATA EX1;
INPUT GROUP $ X Y Z;
CARDS;
T1 12 17 19
T2 23 56 45
T3 19 28 12
T4 22 23 36
T5 34 23 56
;
Creation of SAS File From An External (ASCII) File
DATA EX2;
INFILE 'B:MYDATA';
INPUT GROUP $ X Y Z;
RUN;
OR
FILENAME ABC 'B:MYDATA';
DATA EX2A;
INFILE ABC;
INPUT GROUP $ X Y Z;
RUN;
Creation of A SAS Data Set and An Output ASCII File Using an External File
FILENAME IN 'C:MYDATA';
FILENAME OUT 'A:NEWDATA';
DATA EX3;
INFILE IN;
FILE OUT;
INPUT GROUP $ X Y Z;
TOTAL = SUM(X, Y, Z); /* sum of X, Y and Z; missing values are ignored */
PUT GROUP $ 1-10 @12 (X Y Z TOTAL)(5.);
RUN;
The above program reads the raw data file 'C:MYDATA', creates a new variable TOTAL and
writes the output to the file 'A:NEWDATA'.
Creation of SAS File from an External (*.csv) File
data EX4;
/*give the exact path of the file in place of '...'; file should not have column headings*/
infile '...' dlm=',' dsd;
input sn loc $ year season $ crop $ rep trt gyield syield return kcal; /*give
the variables in the order in which they appear in the file*/
/*if the first row contains the names of the columns, then add firstobs=2 to the INFILE
statement so that data is read from row 2 onwards*/
biomass=gyield+syield; /*generates a new variable*/
run;
proc print data=EX4;
run;
Note: To create a SAS File from a *.txt file, only change csv to txt and define delimiter as per
file created.
Creation of SAS File from an External (*.xls) File
Note: it is always better to copy the name of the variables as comment line before Proc Import.
/* name of the variables in Excel File provided the first row contains variable name*/
proc import datafile = 'C:\Users\Desktop\DATA_EXERCISE\descriptive_stats.xls'
/*give the exact path of the file*/
out = descriptive_stats replace; /*give output file name*/
proc print;
run;
If we want to make some transformations, then we may use the following statements:
data a1;
set descriptive_stats;
x = fs45+fw;
run;
Here PROC IMPORT allows the SAS user to import data from an Excel spreadsheet into SAS.
The DATAFILE= option gives the location of the file. The OUT= option names the SAS data set
created by the import procedure. The PRINT procedure has been used to view the contents of the
SAS data set descriptive_stats. When we run the above code, the output is the same as that
obtained earlier, because the same data are used.
Creating a Permanent SAS Data Set
LIBNAME XYZ 'C:\SASDATA';
DATA XYZ.EXAMPLE;
INPUT GROUP $ X Y Z;
CARDS;
.....
.....
.....
RUN;
This program reads data following the cards statement and creates a permanent SAS data set in a
subdirectory named \SASDATA on the C: drive.
Using Permanent SAS File
LIBNAME XYZ 'C:\SASDATA';
PROC MEANS DATA=XYZ.EXAMPLE;
RUN;
TITLES
Up to 10 titles can be placed at the top of the output using TITLE statements in a procedure.
PROC PRINT;
TITLE 'HEIGHT-DIA STUDY';
TITLE3 '1999 STATISTICS';
RUN;
Comments can be added to a SAS program using /* comment text */.
FOOTNOTES
Up to 10 footnotes can be placed at the bottom of the output.
PROC PRINT DATA=DIAHT;
FOOTNOTE '1999';
FOOTNOTE5 'STUDY RESULTS';
RUN;
To obtain the output as an RTF file, place the procedure statements between the following two
statements:
ods rtf file='xyz.rtf' style=journal;
ods rtf close;
To obtain the output as a PDF or HTML file, replace rtf with pdf or html in the above statements.
To get the output in continuous format, we may use
ods rtf file='xyz.rtf' style=journal bodytitle startpage=no;
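As an illustration of how the ODS statements enclose a procedure, the following sketch routes the output of PROC MEANS to an RTF file. It assumes the sample data set SASHELP.CLASS that is shipped with SAS; this data set is not part of these notes and is used here only so that the example is self-contained.

```sas
/* Sketch: send the output of one procedure to an RTF file.
   SASHELP.CLASS is a sample data set shipped with SAS (assumption of this example). */
ods rtf file='xyz.rtf' style=journal bodytitle startpage=no;

proc means data=sashelp.class mean std maxdec=2;
  var height weight;
run;

ods rtf close;   /* closes the file so that it can be opened in a word processor */
```

Everything printed between the opening ODS statement and ods rtf close; goes to the file.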
LABELLING THE VARIABLES
Data dose;
title 'yield with factors N P K';
input N P K Yield;
Label N = "Nitrogen";
Label P = "Phosphorus";
Label K = "Potassium";
cards;
...
...
...
;
Proc print;
run;
The line size of the output can be set with the OPTIONS statement. For example, for a line size
(number of columns in a line) of 72, put Options linesize=72; at the beginning of the
program.
2. Statistical Procedure
SAS/STAT has many capabilities using different procedures with many options. There are a
total of 73 PROCS in SAS/STAT. SAS/STAT is capable of performing a wide range of
statistical analysis that includes:
1. Elementary / Basic Statistics
2. Graphs/Plots
3. Regression and Correlation Analysis
4. Analysis of Variance
5. Experimental Data Analysis
6. Multivariate Analysis
7. Principal Component Analysis
8. Discriminant Analysis
9. Cluster Analysis
10. Survey Data Analysis
11. Mixed model analysis
12. Variance Components Estimation
13. Probit Analysis
and many more…
A brief on SAS/STAT Procedures is available at
http://support.sas.com/rnd/app/da/stat/procedures/Procedures.html
Example 2.1: To Calculate the Means and Standard Deviation:
DATA TESTMEAN;
INPUT GROUP $ X Y Z;
CARDS;
CONTROL 12 17 19
TREAT1 23 25 29
TREAT2 19 18 16
TREAT3 22 24 29
CONTROL 13 16 17
TREAT1 20 24 28
TREAT2 16 19 15
TREAT3 24 26 30
CONTROL 14 19 21
TREAT1 23 25 29
TREAT2 18 19 17
TREAT3 23 25 30
;
PROC MEANS;
VAR X Y Z;
RUN;
The default output displays the number of observations, mean, standard deviation, minimum and
maximum of each variable listed. The required statistics can be chosen through the options of
PROC MEANS. For example, if we require the mean, standard deviation, median, coefficient of
variation, coefficient of skewness, coefficient of kurtosis, etc., then we can write
PROC MEANS mean std median cv skewness kurtosis;
VAR X Y Z;
RUN;
By default the output is printed to 6 decimal places; the desired number of decimal places can
be set with the option maxdec=. For example, for an output with three decimal places, we may write
PROC MEANS mean std median cv skewness kurtosis maxdec=3;
VAR X Y Z;
RUN;
To obtain the means group-wise, first sort the data by group using
Proc sort;
By group;
Run;
and then use
PROC MEANS;
VAR X Y Z;
by group;
RUN;
Or alternatively, we may use
PROC MEANS;
CLASS GROUP;
VAR X Y Z;
RUN;
Descriptive statistics for a given data set can also be obtained using PROC SUMMARY. In the
above example, to obtain the mean, standard deviation, coefficient of variation and coefficients
of skewness and kurtosis, one may use the following:
PROC SUMMARY PRINT MEAN STD CV SKEWNESS KURTOSIS;
CLASS GROUP;
VAR X Y Z;
RUN;
Most statistical procedures assume that the data are normally distributed. PROC UNIVARIATE
may be used to test the normality of the data.
PROC UNIVARIATE NORMAL;
VAR X Y Z;
RUN;
If different plots are required, then one may use:
PROC UNIVARIATE DATA=TESTMEAN NORMAL PLOT;
/*PLOT option displays stem-and-leaf plot, box plot and normal probability plot*/
VAR X Y Z;
/*BY statement creates side-by-side box plots group-wise. To use it, first sort the file on the
BY variable*/
BY GROUP;
HISTOGRAM/KERNEL NORMAL; /*displays kernel density along with normal curve*/
PROBPLOT; /*plots probability plot*/
QQPLOT X/NORMAL SQUARE; /*plots quantile-quantile (Q-Q) plot*/
CDFPLOT X/NORMAL; /*plots CDF plot*/
/*PPPLOT plots the P-P plot, which compares the empirical cumulative distribution function
(ecdf) of a variable with a specified theoretical cumulative distribution function. The beta,
exponential, gamma, lognormal, normal, and Weibull distributions are available.*/
PPPLOT X/NORMAL;
RUN;
Example 2.2: To Create Frequency Tables
DATA TESTFREQ;
INPUT AGE $ ECG CHD $ CAT $ WT;
CARDS;
<55 0 YES YES 1
<55 0 YES YES 17
<55 0 NO YES 7
<55 1 YES NO 257
<55 1 YES YES 3
<55 1 YES NO 7
<55 1 NO YES 1
55+ 0 YES YES 9
55+ 0 YES NO 15
55+ 0 NO YES 30
55+ 1 NO NO 107
55+ 1 YES YES 14
55+ 1 YES NO 5
55+ 1 NO YES 44
55+ 1 NO NO 27
;
PROC FREQ DATA=TESTFREQ;
TABLES AGE*ECG/MISSING CHISQ;
TABLES AGE*CAT/LIST;
RUN;
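If the variable WT in TESTFREQ holds the number of subjects represented by each data line (an assumption; the notes do not say so explicitly), the counts can be taken into account with a WEIGHT statement. A minimal sketch under that assumption, using the TESTFREQ data set created above:

```sas
/* Sketch, assuming WT in TESTFREQ is the number of subjects in each row */
proc freq data=testfreq;
  weight wt;                    /* each row is counted WT times */
  tables age*chd / chisq;       /* two-way table with chi-square test */
  tables age*ecg*chd / list;    /* three-way classification in list form */
run;
```

Without the WEIGHT statement, PROC FREQ counts each data line once, whatever the value of WT.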
SCATTER PLOT
PROC PLOT DATA = DIAHT;
PLOT HT*DIA = '*';
/*HT=VERTICAL AXIS DIA = HORIZONTAL AXIS.*/
RUN;
CHART
PROC CHART DATA = DIAHT;
VBAR HT;
RUN;
PROC CHART DATA = DIAHT;
HBAR DIA;
RUN;
PROC CHART DATA = DIAHT;
PIE HT;
RUN;
Example 2.3: To Create A Permanent SAS DATASET and use that for Regression
LIBNAME FILEX 'C:\SAS\RPLIB';
DATA FILEX.RP;
INPUT X1-X5;
CARDS;
1 0 0 0 5.2
.75 .25 0 0 7.2
.75 0 .25 0 5.8
.5 .25 .25 0 6.3
.75 0 0 .25 5.5
.5 0 .25 .25 5.7
.5 .25 0 .25 5.8
.25 .25 .25 .25 5.7
;
RUN;
LIBNAME FILEX 'C:\SAS\RPLIB';
PROC REG DATA=FILEX.RP;
MODEL X5 = X1 X2/P;
MODEL X5 = X1 X2 X3 X4 / SELECTION = STEPWISE;
TEST: TEST X1-X2=0;
RUN;
Other commonly used procedures include PROC ANOVA, PROC GLM, PROC CORR, PROC NESTED,
PROC MIXED, PROC RSREG, PROC IML, PROC PRINCOMP, PROC VARCOMP, PROC FACTOR,
PROC CANCORR, PROC DISCRIM, etc. Some of these are described in the sequel.
PROC TTEST is the procedure used for comparing the mean of a given sample with a specified
value. It is also used for comparing the means of two independent samples. The paired
observations t test compares the mean of the differences in the observations with a given number.
The underlying assumption of the t test in all three cases is that the observations are random
samples drawn from normally distributed populations. This assumption can be checked using the
UNIVARIATE procedure; if the normality assumptions for the t test are not satisfied, one should
analyze the data using the NPAR1WAY procedure. PROC TTEST computes the group
comparison t statistic based on the assumption that the variances of the two groups are equal. It
also computes an approximate t based on the assumption that the variances are unequal (the
Behrens-Fisher problem). The following statements are available in PROC TTEST.
PROC TTEST <options>;
CLASS variable;
PAIRED variables;
BY variables;
VAR variables;
FREQ variable;
WEIGHT variable;
No statement can be used more than once. There is no restriction on the order of the statements
after the PROC statement. The following options can appear in the PROC TTEST statement.
ALPHA= p: option specifies that confidence intervals are to be 100(1-p)% confidence intervals,
where 0<p<1. By default, PROC TTEST uses ALPHA=0.05. If p is 0 or less, or 1 or more, an
error message is printed.
COCHRAN: option requests the Cochran and Cox approximation of the probability level of the
approximate t statistic for the unequal variances situation.
H0=m: option requests tests against m instead of 0 in all three situations (one-sample,
two-sample, and paired observation t tests). By default, PROC TTEST uses H0=0.
A CLASS statement giving the name of the classification (or grouping) variable must
accompany the PROC TTEST statement in the two independent sample cases. It should be
omitted for the one sample or paired comparison situations. The class variable must have two,
and only two, levels. PROC TTEST divides the observations into the two groups for the t test
using the levels of this variable. One can use either a numeric or a character variable in the
CLASS statement.
In the statement PAIRED PairLists, the PairLists identify the variables to be compared in paired
comparisons. One or more PairLists can be used. Variables or lists of variables are separated by
an asterisk (*) or a colon (:). Examples of the use of the asterisk and the colon are shown in the
following table.
PAIRED statement               Comparisons made
PAIRED A*B;                    A-B
PAIRED A*B C*D;                A-B and C-D
PAIRED (A B)*(C B);            A-C, A-B and B-C
PAIRED (A1-A2)*(B1-B2);        A1-B1, A1-B2, A2-B1 and A2-B2
PAIRED (A1-A2):(B1-B2);        A1-B1 and A2-B2
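The two-sample and paired situations described above can be sketched as follows. The data sets GAIN and SCORES, their variables and all the values are hypothetical and serve only to make the example runnable.

```sas
/* Hypothetical data: weight gain under two diets (two independent samples) */
data gain;
  input diet $ gain @@;
  datalines;
A 12 A 15 A 11 A 14 A 13
B 16 B 18 B 14 B 17 B 19
;
run;

proc ttest data=gain alpha=0.05;
  class diet;          /* the class variable has exactly two levels, A and B */
  var gain;
run;

/* Hypothetical data: scores of the same five units before and after a treatment */
data scores;
  input before after @@;
  datalines;
68 72 75 74 80 86 71 77 64 70
;
run;

proc ttest data=scores h0=0;
  paired before*after;   /* tests the mean of the differences before - after */
run;
```

The first call uses the CLASS statement because two independent samples are compared; the second omits it, as required for paired comparisons.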
PROC ANOVA performs analysis of variance for balanced data only from a wide variety of
experimental designs whereas PROC GLM can analyze both balanced and unbalanced data. As
ANOVA takes into account the special features of a balanced design, it is faster and uses less
storage than PROC GLM for balanced data. The basic syntax of the ANOVA procedure is as
given:
PROC ANOVA <options>;
CLASS variables;
MODEL dependents = independent variables (or effects) / options;
MEANS effects / options;
ABSORB variables;
FREQ variable;
TEST H = effects E = effect;
MANOVA H = effects E = effect
M = equations / options;
REPEATED factor-name levels / options;
BY variables;
The PROC ANOVA, CLASS and MODEL statements are required; the other statements are
optional. The CLASS statement defines the variables for classification (numeric or character
variables, maximum characters = 16).
The MODEL statement names the dependent variables and independent variables or effects. If
no effects are specified in the MODEL statement, ANOVA fits only the intercept. Included in
the ANOVA output are F-tests of all effects in the MODEL statement. All of these F-tests use
the residual mean square as the error term. The MEANS statement produces tables of the means
corresponding to the list of effects. Among the options available in the MEANS statement are
several multiple comparison procedures, viz. Least Significant Difference (LSD), Duncan's New
Multiple Range Test (DUNCAN), the Waller-Duncan test (WALLER) and Tukey's Honest
Significant Difference (TUKEY). The LSD, DUNCAN and TUKEY options use the level of
significance ALPHA = 5% unless the ALPHA= option is specified. Only ALPHA = 1%, 5% and
10% are allowed with Duncan's test. 95% confidence intervals about means can be obtained using
the CLM option under the MEANS statement.
The TEST statement tests effects for which the residual mean square is not the appropriate error
term, such as main-plot effects in a split-plot experiment. There can be multiple MEANS and
TEST statements (in PROC GLM as well), but only one MODEL statement before the RUN
statement. The ABSORB statement implements the technique of absorption, which saves time
and reduces storage requirements for certain types of models. The FREQ statement is used when
each observation in a data set represents n observations, where n is the value of the FREQ
variable. The MANOVA statement is used for multivariate analysis of variance. The
REPEATED statement is useful for analyzing repeated measurement designs, and the BY
statement specifies that separate analyses are performed on observations in groups defined by the
BY variables.
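As a sketch of these statements, a one-way analysis of variance of X on GROUP can be run on the TESTMEAN data set of Example 2.1, which is balanced (three observations per group). The choice of X as the dependent variable and of the comparison options is only illustrative.

```sas
/* Sketch: one-way ANOVA using the TESTMEAN data set created in Example 2.1 */
proc anova data=testmean;
  class group;               /* classification variable */
  model x = group;           /* F-test of GROUP against the residual mean square */
  means group / tukey;       /* Tukey's HSD at the default ALPHA = 0.05 */
  means group / lsd clm;     /* LSD results shown as confidence limits for the means */
run;
quit;
```

Two MEANS statements are used here, which is allowed as noted above, so that each set of options produces its own table.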
Using PROC GLM for analysis of variance is similar to using PROC ANOVA. The statements
listed for PROC ANOVA are also used for PROC GLM. In addition, the following statements
can be used with PROC GLM:
CONTRAST 'label' effect-name < ... effect coefficients > </ options>;
ESTIMATE 'label' effect-name < ... effect coefficients > </ options>;
ID variables;
LSMEANS effects </ options>;
OUTPUT <OUT = SAS-data-set> keyword=names < ... keyword = names>;
RANDOM effects </ options>;
WEIGHT variable;
Multiple comparisons, as used in the options under the MEANS statement, are useful when there
are no particular comparisons of special interest. But there are situations where preplanned
comparisons are required. Using the CONTRAST and LSMEANS statements, we can test
specific hypotheses regarding preplanned comparisons. The basic form of the CONTRAST
statement is as described above, where label is a character string used for labeling output, effect
name is a class variable (which is independent) and effect coefficients is a list of numbers that
specifies the linear combination of parameters in the null hypothesis. The contrast is a linear
function such that the elements of the coefficient vector sum to 0 for each effect. While using the
CONTRAST statement, the following points should be kept in mind.
The number of levels (classes) of the effect must be known. If there are more levels of that
effect in the data than the number of coefficients specified in the CONTRAST statement,
PROC GLM adds trailing zeros. Suppose there are 5 treatments in a completely randomized
design denoted as T1, T2, T3, T4, T5 and the null hypothesis to be tested is
H0: T2+T3 = 2T1 or -2T1+T2+T3 = 0
Suppose in the data the treatments are classified using TRT as the class variable; then the effect
name is TRT and the statement is
CONTRAST 'T1 VS 2&3' TRT -2 1 1 0 0;
If the last 2 zeros are not given, the trailing zeros are added automatically. This statement gives
a sum of squares with 1 degree of freedom (d.f.) and an F-value against the residual mean square
as error, unless specified otherwise. The name or label of the contrast must be 20 characters or
less.
The available CONTRAST statement options are
E: prints the entire vector of coefficients in the linear function, i.e., contrast.
E = effect: specifies an effect in the model that can be used as an error term
ETYPE = n: specifies the types (1, 2, 3 or 4) of the E effect.
Multiple-degrees-of-freedom contrasts can be specified by repeating the effect name and
coefficients as needed, separated by commas. Thus, for the above example, the statement
CONTRAST 'All' TRT -2 1 1 0 0, TRT 0 1 -1 0 0;
produces a two d.f. sum of squares due to both the contrasts. This feature can be used to obtain
partial sums of squares for effects through the reduction principle, using sums of squares from
multiple-degrees-of-freedom contrasts that include and exclude the desired contrasts. Although
only t-1 linearly independent contrasts exist for t classes, any number of contrasts can be
specified.
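The contrasts discussed above can be tried on a small completely randomized design. In the sketch below the data set CRD and its yields are hypothetical; the coefficients assume that the hypothesis compares T1 with the average of T2 and T3, so that they sum to zero.

```sas
/* Hypothetical yields from a completely randomized design:
   5 treatments (TRT = 1 to 5) with 3 replications each */
data crd;
  input trt yield @@;
  datalines;
1 20 1 22 1 21  2 25 2 27 2 26  3 24 3 23 3 25
4 30 4 29 4 31  5 18 5 19 5 17
;
run;

proc glm data=crd;
  class trt;
  model yield = trt;
  contrast 'T1 VS 2&3' trt -2 1 1 0 0;                   /* 1 d.f. contrast */
  contrast 'All'       trt -2 1 1 0 0, trt 0 1 -1 0 0;   /* 2 d.f. contrast */
  estimate 'T1 VS 2&3' trt -2 1 1 0 0 / divisor=2;       /* (T2+T3)/2 - T1 */
  lsmeans trt / stderr pdiff adjust=tukey;
run;
quit;
```

The ESTIMATE statement with DIVISOR=2 reports the difference between the average of treatments 2 and 3 and treatment 1, together with its standard error and t-statistic.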
The ESTIMATE statement can be used to estimate linear functions of parameters that may or
may not be obtained by using the CONTRAST or LSMEANS statement. The statement is
specified in the same way as the CONTRAST statement, with the word CONTRAST replaced
by ESTIMATE. Fractions in the effect coefficients can be avoided by using DIVISOR = common
denominator as an option. This statement provides the value of an estimate, its standard error
and a t-statistic for testing whether the estimate is significantly different from zero.
The LSMEANS statement produces the least squares estimates of CLASS variable means, i.e.
adjusted means. For a one-way structure, these are simply the ordinary means. The least squares
means for the five treatments for all dependent variables in the MODEL statement can be
obtained using the statement
LSMEANS TRT / options;
Various options available with this statement are:
STDERR: gives the standard errors of each of the estimated least square mean and the t-statistic
for a test of hypothesis that the mean is zero.
PDIFF: Prints the p - values for the tests of equality of all pairs of CLASS means.
SINGULAR: tunes the estimability checking. The options E, E= and ETYPE= are similar to those
discussed under the CONTRAST statement.
ADJUST=T: gives the probabilities of significance of pairwise comparisons based on the t-test.
ADJUST=TUKEY: gives the probabilities of significance of pairwise comparisons based on
Tukey's test.
LINES: gives the letters on treatments showing significant and non-significant groups.
When the predicted values are requested as a MODEL statement option, values of the variables
specified in the ID statement are printed for identification beside each observed, predicted and
residual value. The OUTPUT statement produces an output data set that contains the original
data set values along with the predicted and residual values.
Among the other options that can be given under the MODEL statement in PROC GLM are:
1. SOLUTION 2. XPX (the X'X matrix) 3. I (the g-inverse)
PROC GLM recognizes different theoretical approaches to ANOVA by providing four types of
sums of squares and associated statistics. The four types of sums of squares in PROC GLM are
called Type I, Type II, Type III and Type IV.
The Type I sums of squares are the classical sequential sums of squares obtained by adding the
terms to the model in some logical sequence. The sum of squares for each class of effects is
adjusted for only those effects that precede it in the model. Thus the sums of squares and their
expectations are dependent on the order in which the model is specified.
The Type II, III and IV sums of squares are 'partial sums of squares' in the sense that each is
adjusted for all other classes of effects in the model, but each is adjusted according to different
rules. One general rule applies to all three types: the estimable functions that generate the sums
of squares for one class of effects will not involve any other classes of effects except those that
"contain" the class of effects in question.
For example, the estimable functions that generate SS(A*B) in a three-factor factorial will have
zero coefficients on the main effects and on the A*C and B*C interaction effects. They will
contain non-zero coefficients on the A*B*C interaction effects, because the A*B*C
interaction "contains" the A*B interaction.
Type II, III and IV sums of squares differ from each other in how the coefficients are determined
for the classes of effects that do not have zero coefficients - those that contain the class of effects
in question. The estimable functions for the Type II sum of squares impose no restriction on the
values of the non-zero coefficients on the remaining effects; they are allowed to take whatever
values result from the computations adjusting for effects that are required to have zero
coefficients. Thus, the coefficients on the higher-order interaction effects and higher level
nesting effects are functions of the number of observations in the data. In general, the Type II
sums of squares do not possess the equitable distribution property and orthogonality
characteristic of balanced data.
The Type III and IV sums of squares differ from the Type II sums of squares in the sense that the
coefficients on the higher order interaction or nested effects that contain the effects in question
are also adjusted so as to satisfy either the orthogonality condition (Type III) or the equitable
distribution property (Type IV).
The coefficients on these effects are no longer functions of the nij and consequently, are the same
for all designs with the same general form of estimable functions. If there are no empty cells (no
nij = 0) both conditions can be satisfied at the same time and Type III and Type IV sums of
squares are equal. The hypothesis being tested is the same as when the data is balanced.
When there are empty cells, the hypotheses being tested by the Type III and Type IV sums of
squares may differ. The Type III criterion of orthogonality reproduces the same hypotheses one
obtains if effects are assumed to add to zero. When there are empty cells this is modified to “the
effects that are present are assumed to be zero”. The Type IV hypotheses utilize balanced
subsets of non-empty cells and may not be unique. For illustration, for a 2x3 factorial with the
terms added to the model in the order A, B, AB, the various types of sums of squares can be
explained as follows:
Effect         Type I          Type II         Type III         Type IV
General Mean   R(μ)            R(μ)
A              R(A/μ)          R(A/μ,B)        R(A/μ,B,AB)
B              R(B/μ,A)        R(B/μ,A)        R(B/μ,A,AB)
A*B            R(A*B/μ,A,B)    R(A*B/μ,A,B)    R(AB/μ,A,B)
R(A/μ) is the sum of squares for A adjusted for μ, and so on.
Thus, in brief, the four sets of sums of squares, Type I, II, III and IV, can be thought of
respectively as sequential, each-after-all-others, Σ-restrictions and hypotheses.
There is a relationship between the four types of sums of squares and four types of data
structures (balanced and orthogonal, unbalanced and orthogonal, unbalanced and non-orthogonal
(all cells filled), unbalanced and non-orthogonal (empty cells)). For illustration, let nij denote
the number of observations in level i of factor A and level j of factor B. The following table
explains the relationship between data structures and types of sums of squares in two-way
classified data.
                         Data Structure Type
         1              2                3                    4
Effect   Equal nij      Proportionate    Disproportionate     Empty cells
                        nij              non-zero nij
A        I=II=III=IV    I=II, III=IV     III=IV
B        I=II=III=IV    I=II, III=IV     I=II, III=IV         I=II
A*B      I=II=III=IV    I=II=III=IV      I=II=III=IV          I=II=III=IV
In general,
I=II=III=IV (balanced data); II=III=IV (no interaction models)
I=II, III=IV (orthogonal data); III=IV (all cells filled data).
Proper Error terms: In general, F-tests of hypotheses in ANOVA use the residual mean square
as the error term; in some situations, however, other terms are to be used as error terms. For such
situations PROC GLM provides the TEST statement, which is identical to the TEST statement
available in PROC ANOVA. PROC GLM also allows specification of appropriate error terms in
the MEANS, LSMEANS and CONTRAST statements. To illustrate, consider a split-plot
experiment on yield, with irrigation (IRRIG) treatments applied to main plots and cultivars
(CULT) applied to subplots. The data so obtained can be analysed using the following statements.
data splitplot;
input REP IRRIG CULT YIELD;
cards;
. . .
. . .
. . .
;
PROC print; run;
PROC GLM;
class rep irrig cult;
model yield = rep irrig rep*irrig cult irrig*cult;
test h = irrig e = rep*irrig;
contrast 'IRRIG1 Vs IRRIG2' irrig 1 -1 / e = rep*irrig;
run;
Here the irrigation effects are tested using error (A), which is the sum of squares due to
rep*irrig, as specified in the TEST and CONTRAST statements.
In the TEST statement, H = numerator source of variation and
E = denominator source of variation.
It may be noted here that the PROC GLM can be used to perform analysis of covariance as well.
For analysis of covariance, the covariate should be defined in the model without specifying
under CLASS statement.
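A minimal sketch of an analysis of covariance along these lines is given below. It assumes the sample data set SASHELP.CLASS shipped with SAS, which is not part of these notes: SEX is taken as the classification factor and HEIGHT as the covariate.

```sas
/* Sketch: analysis of covariance with the sample data set SASHELP.CLASS (assumed available) */
proc glm data=sashelp.class;
  class sex;                             /* classification factor */
  model weight = sex height / solution;  /* HEIGHT is the covariate: it is not listed in CLASS */
  lsmeans sex / stderr pdiff;            /* means of WEIGHT adjusted for HEIGHT */
run;
quit;
```

Because HEIGHT does not appear in the CLASS statement, PROC GLM treats it as a continuous regressor.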
PROC RSREG fits the parameters of a complete quadratic response surface and analyses the
fitted surface to determine the factor levels of optimum response and performs a ridge analysis to
search for the region of optimum response.
PROC RSREG < options >;
MODEL responses = independents / <options >;
RIDGE < options >;
WEIGHT variable;
ID variable;
BY variables;
run;
The PROC RSREG and MODEL statements are required. The BY, ID, MODEL, RIDGE and
WEIGHT statements follow the PROC RSREG statement and can appear in any order.
The PROC RSREG statement invokes the procedure and following options are allowed with the
PROC RSREG:
DATA = SAS-data-set : specifies the data to be analysed.
NOPRINT : suppresses all printed results when only the output data set is required.
OUT = SAS-data-set : creates an output data set.
The MODEL statement without any options transforms the independent variables to coded data.
By default, PROC RSREG codes each independent variable by a linear transformation: the
average of the highest and lowest values of the variable is subtracted from the original value, and
the result is divided by half of their difference. Canonical and ridge analyses are performed on
the model fitted to the coded data. The important options available with the MODEL statement
are:
NOCODE : analyses the original data.
ACTUAL : specifies the actual values from the input data set.
COVAR = n : declares that the first n variables on the independent side of the model are
simple linear regressors (covariates) rather than factors in the quadratic response surface.
LACKFIT : performs a lack-of-fit test. For this, the repeated observations must appear together.
NOANOVA : suppresses the printing of the analysis of variance and parameter estimates from
the model fit.
NOOPTIMAL (NOOPT) : suppresses the printing of the canonical analysis for the quadratic
response surface.
NOPRINT : suppresses both the ANOVA and the canonical analysis.
PREDICT : specifies the values predicted by the model.
RESIDUAL : specifies the residuals.
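The default coding used by PROC RSREG, described in words before the list of options, can also be written as a formula. The symbols below are introduced here only for illustration: x is the original value of an independent variable, and x_max and x_min are its highest and lowest values in the data.

```latex
x_{\text{coded}} \;=\; \frac{x - \tfrac{1}{2}\,(x_{\max} + x_{\min})}{\tfrac{1}{2}\,(x_{\max} - x_{\min})}
```

For example, a factor with levels 10, 20 and 30 has midpoint 20 and half-range 10, so its coded levels are -1, 0 and 1.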
A RIDGE statement computes the ridge of optimum response. The important options available
with the RIDGE statement are
MAX: computes the ridge of maximum response.
MIN: computes the ridge of minimum response.
At least one of the two options must be specified.
NOPRINT: suppresses printing of the ridge analysis when only an output data set is required.
OUTR = SAS-data-set: creates an output data set containing the computed optimum ridge.
RADIUS = coded-radii: gives the distances from the ridge starting point at which to compute the
optimum.
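The statements above can be put together as in the following sketch. The data set RS and its values are hypothetical: yield (Y) is recorded at three levels each of two factors (N and P), with the centre point repeated once so that a lack-of-fit test is possible.

```sas
/* Hypothetical data: response Y at three levels each of factors N and P;
   the centre point (40, 30) is observed twice */
data rs;
  input n p y @@;
  datalines;
0 0 20   0 30 24   0 60 25
40 0 28  40 30 35  40 30 34  40 60 33
80 0 30  80 30 36  80 60 32
;
run;

proc sort data=rs;
  by n p;            /* keeps repeated observations together, as LACKFIT requires */
run;

proc rsreg data=rs;
  model y = n p / lackfit;   /* full quadratic surface in N and P */
  ridge max;                 /* ridge of maximum response */
run;
```

Since no NOCODE option is given, the canonical and ridge analyses in this sketch are carried out on the coded factors.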
PROC REG is the primary SAS procedure for performing the computations for a statistical
analysis of data based on a linear regression model. The basic statements for performing such an
analysis are
PROC REG;
MODEL list of dependent variable = list of independent variables/ model options;
RUN;
PROC REG with a MODEL statement without any options gives the ANOVA table, root mean
square error, R-square, adjusted R-square, coefficient of variation, etc.
The options under model statement are
P: It gives predicted values corresponding to each observation in the data set. The estimated
standard errors are also given by using this option.
CLM: It yields upper and lower 95% confidence limits for the mean of subpopulation
corresponding to specific values of the independent variables.
CLI : It yields a prediction interval for a single unit to be drawn at random from a
subpopulation.
STB: Standardized regression coefficients.
XPX, I: Prints matrices used in regression computations.
NOINT: This option forces the regression response to pass through the origin. With this option the
total sum of squares is uncorrected, and hence the R-square statistic is usually much larger than that
for models with an intercept.
However, if a no-intercept model is to be fitted with the corrected total sum of squares, so that the
usual definitions of statistics such as R-square and MSE are retained, the statement RESTRICT
intercept = 0; may be used after the MODEL statement.
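The effect of the uncorrected total sum of squares on R-square can be seen with a small calculation. The following Python sketch (not SAS code) fits a line through the origin to a small hypothetical data set, invented only for illustration, and computes R-square against both the uncorrected and the corrected total sum of squares.

```python
# Hypothetical data, used only to illustrate the two definitions of R-square
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Least squares slope for a model without intercept: b = sum(xy) / sum(x^2)
slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
sse = sum((yi - slope * xi) ** 2 for xi, yi in zip(x, y))

y_bar = sum(y) / len(y)
ss_uncorrected = sum(yi ** 2 for yi in y)           # total SS about zero
ss_corrected = sum((yi - y_bar) ** 2 for yi in y)   # total SS about the mean

r2_uncorrected = 1 - sse / ss_uncorrected
r2_corrected = 1 - sse / ss_corrected
print(r2_uncorrected, r2_corrected)
```

Because the uncorrected total sum of squares can never be smaller than the corrected one, the first R-square is never smaller than the second for the same fit.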
To obtain residuals, studentized residuals and Cook's D statistic, the option R may be used in the
MODEL statement.
The ‘INFLUENCE’ option under model statement is used for detection of outliers in the data and
provides residuals, studentized residuals, diagonal elements of HAT MATRIX, COVRATIO,
DFFITS, DFBETAS, etc.
For detecting multicollinearity in the data, the options ‘VIF’ (variance inflation factors) and
‘COLLINOINT’ or ‘COLLIN’ may be used.
Besides these, options for weighted regression, output data sets, specification error, heterogeneous
variances, etc. are available under PROC REG.
PROC PRINCOMP can be utilized to perform the principal component analysis.
Multiple MODEL statements are permitted in PROC REG, unlike PROC ANOVA and PROC
GLM. A MODEL statement can contain several dependent variables.
The statement model y1 y2 y3 y4 = x1 x2 x3 x4 x5; performs four separate regression analyses of
the variables y1, y2, y3 and y4 on the set of variables x1, x2, x3, x4, x5.
Polynomial models can be fitted by using independent variables such as x1=x, x2=x**2,
x3=x**3, and so on, depending upon the order of the polynomial to be fitted. Several other
variables can be generated from a variable in the DATA step before the MODEL statement, and the
transformed variables can then be used in the MODEL statement. For example, LY=LOG(Y) and
LX=LOG(X) give the logarithms of Y and X to the base e, and LogY=LOG10(Y) and
LogX=LOG10(X) give the logarithms of Y and X to the base 10.
A TEST statement after the MODEL statement can be used to test hypotheses on individual
parameters or on any linear function(s) of the parameters.
For example, to test the equality of the coefficients of x1 and x2 in the regression model
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, use the statement
TEST1: TEST x1 - x2 = 0;
The general form is
Label: TEST <equation, ..., equation>;
The fitted model can be changed by using a separate MODEL statement or by using the DELETE
statement.
PROC REG provides two types of sums of squares, obtained by the SS1 or SS2 options in the
MODEL statement. Type I SS are sequential sums of squares and Type II SS are partial sums of
squares; the two are the same for the variable fitted last in the model.
For most applications, the desired test for a single parameter is based on the Type II sum of
squares, which is equivalent to the t-test for the parameter estimate. The Type I sums of
squares, however, are useful if there is a need for a specific sequencing of tests on individual
coefficients, as in polynomial models.
PROC ANOVA and PROC GLM are general purpose procedures that can be used for a broad
range of data classifications. In contrast, PROC NESTED is a specialized procedure that is useful
only for nested classifications. It provides estimates of the components of variance using the
analysis of variance method of estimation. The CLASS statement in PROC NESTED has a
broader purpose than it does in PROC ANOVA and PROC GLM; it encompasses the purpose of the
MODEL statement as well. The data must, however, be sorted appropriately. For example, suppose
microbial counts are made in a laboratory study whose objective is to assess the sources of
variation in the number of microbes. For this study n1 packages of the test material are purchased
and n2 samples are drawn from each package, i.e. samples are nested within packages. Suppose a
logarithmic transformation is to be used for the microbial counts. The appropriate SAS statements are:
PROC SORT; By package sample;
PROC NESTED;
CLASS package sample;
Var logcount;
run;
Corresponding PROC GLM statements are
PROC GLM;
Class package sample;
Model Logcount= package sample (package);
The F-statistic in the basic PROC GLM output is not necessarily correct for such a model. To get
the correct error term, a RANDOM statement listing all random effects in the model is used along
with its TEST option. However, for fixed effect models the same arguments for proper error terms
hold as in PROC GLM and PROC ANOVA. For the analysis of data using a linear mixed effects
model, PROC MIXED of SAS should be used. The best linear unbiased predictors of the random
effects and the solutions for the fixed effects can be obtained by using the option 's' in the RANDOM
and MODEL statements respectively.
PROCEDURES FOR SURVEY DATA ANALYSIS
The SURVEYMEANS procedure produces estimates of population means and totals from
sample survey data. You can use PROC SURVEYMEANS to compute the following statistics:
estimates of population means, with corresponding standard errors and t tests
estimates of population totals, with corresponding standard deviations and t tests
estimates of proportions for categorical variables, with standard errors and t tests
ratio estimates of population means and proportions, and their standard errors
confidence limits for population means, totals, and proportions
data summary information
The SURVEYFREQ procedure produces one-way to n-way frequency and crosstabulation
tables from sample survey data. These tables include estimates of population totals, population
proportions (overall proportions, and also row and column proportions), and corresponding
standard errors. Confidence limits, coefficients of variation, and design effects are also available.
The procedure also provides a variety of options to customize your table display.
The SURVEYREG procedure fits linear models for survey data and computes regression
coefficients and their variance-covariance matrix. The procedure allows you to specify
classification effects using the same syntax as in the GLM procedure. The procedure also
provides hypothesis tests for the model effects, for any specified estimable linear functions of the
model parameters, and for custom hypothesis tests for linear combinations of the regression
parameters. The procedure also computes the confidence limits of the parameter estimates and
their linear estimable functions.
The SURVEYLOGISTIC procedure investigates the relationship between discrete responses
and a set of explanatory variables for survey data. The procedure fits linear logistic regression
models for discrete response survey data by the method of maximum likelihood, incorporating
the sample design into the analysis. The SURVEYLOGISTIC procedure enables you to use
categorical classification variables (also known as CLASS variables) as explanatory variables in
an explanatory model, using the familiar syntax for main effects and interactions employed in the
GLM and LOGISTIC procedures.
The SURVEYSELECT procedure provides a variety of methods for selecting probability-based
random samples. The procedure can select a simple random sample or a sample according to a
complex multistage sample design that includes stratification, clustering, and unequal
probabilities of selection. With probability sampling, each unit in the survey population has a
known, positive probability of selection. This property of probability sampling avoids selection
bias and enables you to use statistical theory to make valid inferences from the sample to the
survey population.
PROC SURVEYSELECT provides methods for both equal probability sampling and sampling
with probability proportional to size (PPS). In PPS sampling, a unit's selection probability is
proportional to its size measure. PPS sampling is often used in cluster sampling, where you
select clusters (groups of sampling units) of varying size in the first stage of selection. Available
PPS methods include without replacement, with replacement, systematic, and sequential with
minimum replacement. The procedure can apply these methods for stratified and replicated
sample designs.
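The idea of selection with probability proportional to size can be illustrated with a small calculation. The following Python sketch is not SAS code and does not reproduce the algorithms of PROC SURVEYSELECT; it only shows PPS selection with replacement. The cluster names, sizes and the seed are hypothetical.

```python
import random

# Hypothetical clusters with their size measures (e.g. number of households)
cluster_sizes = {"A": 120, "B": 60, "C": 20}
total_size = sum(cluster_sizes.values())

# Probability of selecting each cluster at a single draw is size / total size
draw_prob = {name: size / total_size for name, size in cluster_sizes.items()}
print(draw_prob)

# Draw 2 clusters with replacement, with probability proportional to size
rng = random.Random(2024)  # fixed seed so that the draw is reproducible
sample = rng.choices(list(cluster_sizes), weights=list(cluster_sizes.values()), k=2)
print(sample)
```

Larger clusters are more likely to enter the sample, which is why the selection probabilities must be carried into the analysis as weights.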
3. Exercises
Example 3.1: An experiment was conducted to study the hybrid seed production of bottle gourd
(Lagenaria siceraria (Mol) Standl) Cv. Pusa hybrid-3 under open field conditions during
Kharif-2005 at Indian Agricultural Research Institute, New Delhi. The main aim of the
investigation was to compare natural pollination and hand pollination. The data were collected
on 10 randomly selected plants from each of natural pollination and hand pollination on
number of fruit set for the period of 45 days, fruit weight (kg), seed yield per plant (g) and
seedling length (cm). The data obtained is as given below:
Group No. of fruit Fruit weight Seed yield/plant Seedling length
1 7.0 1.85 147.70 16.86
1 7.0 1.86 136.86 16.77
1 6.0 1.83 149.97 16.35
1 7.0 1.89 172.33 18.26
1 7.0 1.80 144.46 17.90
1 6.0 1.88 138.30 16.95
1 7.0 1.89 150.58 18.15
1 7.0 1.79 140.99 18.86
1 6.0 1.85 140.57 18.39
1 7.0 1.84 138.33 18.58
2 6.3 2.58 224.26 18.18
2 6.7 2.74 197.50 18.07
2 7.3 2.58 230.34 19.07
2 8.0 2.62 217.05 19.00
2 8.0 2.68 233.84 18.00
2 8.0 2.56 216.52 18.49
2 7.7 2.34 211.93 17.45
2 7.7 2.67 210.37 18.97
2 7.0 2.45 199.87 19.31
2 7.3 2.44 214.30 19.36
{Here 1 denotes natural pollination and 2 denotes the hand pollination}
1. Test whether the mean of the population of Seed yield/plant (g) is 200 or not.
2. Test whether the natural pollination and hand pollination under open field conditions are
equally effective or are significantly different.
3. Test whether hand pollination is better alternative in comparison to natural pollination.
Procedure:
For performing analysis, input the data in the following format. {Here Number of fruit (45
days) is termed as nfs45, Fruit weight (kg) is termed as fw, seed yield/plant (g) is termed as syp
and Seedling length (cm) is termed as sl. It may, however, be noted that one can retain the same
name or can code in any other fashion}.
data ttest1; /*one can enter any other name for data*/
input group nfs45 fw syp sl;
cards;
. . . . .
. . . . .
. . . . .
;
*To answer the question number 1 use the following SAS statements;
proc ttest H0=200;
var syp;
run;
*To answer the question number 2 use the following SAS statements;
proc ttest;
class group;
var nfs45 fw syp sl;
run;
To answer question number 3 one has to perform a one-tailed t-test. The easiest way to
convert a two-tailed test into a one-tailed test is to take half of the p-value given in the two-tailed
test output, provided the observed difference is in the direction of the alternative hypothesis. The
other way is to use the SIDES option in the PROC statement. Here we are interested in testing
whether hand pollination is a better alternative than natural pollination. Since PROC TTEST takes
the difference as group 1 (natural) minus group 2 (hand), the alternative is that this difference is
negative, and therefore we use SIDES=L as
proc ttest sides=L;
class group;
var nfs45 fw syp sl;
run;
Similarly this option can also be used in one sample test and for right tail test Sides=U is used.
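The choice of the lower tail can be verified from the data themselves. The following Python sketch (not SAS code) computes the pooled two-sample t statistic for seed yield per plant from Example 3.1, assuming the difference is taken as group 1 (natural) minus group 2 (hand). The p-value is not computed here because it needs the t distribution.

```python
from math import sqrt
from statistics import mean, variance

# Seed yield/plant (g) from Example 3.1
natural = [147.70, 136.86, 149.97, 172.33, 144.46, 138.30, 150.58, 140.99, 140.57, 138.33]
hand = [224.26, 197.50, 230.34, 217.05, 233.84, 216.52, 211.93, 210.37, 199.87, 214.30]

n1, n2 = len(natural), len(hand)
diff = mean(natural) - mean(hand)  # group 1 minus group 2

# Pooled variance, assuming equal variances in the two groups
pooled_var = ((n1 - 1) * variance(natural) + (n2 - 1) * variance(hand)) / (n1 + n2 - 2)
t_stat = diff / sqrt(pooled_var * (1 / n1 + 1 / n2))
print(diff, t_stat)
```

A negative difference and t statistic correspond to higher yield under hand pollination, which is the direction tested by SIDES=L.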
Exercise 3.2: A study was undertaken to find out whether the average grain yield of paddy of
farmers using laser levelling is more than the farmers using traditional land levelling methods.
For this study data on grain yield in tonne/hectare was collected from 59 farmers (33 using
traditional land levelling methods and 26 using new land leveller) and is given as:
3.67 3.6 3.79 3.95
4.04 3.7 3.17 5.3
3.49 5.3 3.58 5.8
2.75 4.4 4.08 2.8
2.63 5.4 4.25 3.0
2.46 3.4 5.21 4.78
2.50 3.5 5.63 4.07
2.88 8.2 3.42 4.88
2.45 7.5 3.88 4.37
2.46 7.6 3.29
2.67 7.0 3.92
2.38 7.4 2.25
2.42 3.4 2.58
2.54 3.6 3.25
3.88 5.6 3.46
3.88 5.6 3.79
3.42 5.4
Test whether the traditional land levelling and laser levelling give equivalent yields or are
significantly different.
Procedure:
For performing analysis, input the data in the following format. {Here traditional land levelling
is termed as TL, laser levelling as LL, method of levelling as MLevel and grain yield in t/ha as
gyld. It may, however, be noted that one can retain the same name or can code in any other
fashion}.
data ttestL; /*one can enter any other name for data*/
input MLevel $ gyld; /* MLevel is a character variable (TL or LL) */
cards;
. . . . .
. . . . .
. . . . .
;
*To compare the two levelling methods use the following SAS statements;
proc ttest data =ttestL;
class MLevel;
var gyld;
run;
Exercise 3.3: The observations obtained from 15 experimental units before and after application
of the treatment are the following:
Unit No. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Before 80 73 70 60 88 84 65 37 91 98 52 78 40 79 59
After 82 71 95 69 100 71 75 60 95 99 65 83 60 86 62
1. Test whether the mean score before application of treatment is 65.
2. Test whether the application of treatments has resulted into some change in the score of the
experimental units.
3. Test whether application of treatment has improved the scores.
Procedure:
data ttest;
input sn preapp postapp;
cards;
1 80 82
2 73 71
3 70 95
4 60 69
5 88 100
6 84 71
7 65 75
8 37 60
9 91 95
10 98 99
11 52 65
12 78 83
13 40 60
14 79 86
15 59 62
;
*For objective 1, use the following;
PROC TTEST H0=65;
VAR PREAPP;
RUN;
*For objective 2, use the following;
PROC TTEST;
PAIRED PREAPP*POSTAPP;
RUN;
*For objective 3, use the following;
PROC TTEST sides=L;
PAIRED PREAPP*POSTAPP;
RUN;
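The direction of the one-sided paired test can be checked by computing the differences directly. The following Python sketch (not SAS code) assumes the paired difference is taken as PREAPP minus POSTAPP, in the order given in the PAIRED statement, and computes the paired t statistic; the p-value is left to PROC TTEST.

```python
from math import sqrt
from statistics import mean, stdev

# Scores of the 15 experimental units from Exercise 3.3
pre = [80, 73, 70, 60, 88, 84, 65, 37, 91, 98, 52, 78, 40, 79, 59]
post = [82, 71, 95, 69, 100, 71, 75, 60, 95, 99, 65, 83, 60, 86, 62]

# Paired differences: before minus after
d = [a - b for a, b in zip(pre, post)]
mean_d = mean(d)
t_stat = mean_d / (stdev(d) / sqrt(len(d)))  # paired t statistic with n-1 df
print(sum(d), mean_d, t_stat)
```

An improvement in scores makes the mean difference negative, so the lower-tailed alternative (SIDES=L) is the appropriate one for objective 3.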
Exercise 3.4: In the F2 population of a breeding trial on pea, out of a total of 556 seeds, the
frequencies of seeds of different shape and colour are: 315 round and yellow, 101 wrinkled and
yellow, 108 round and green, 32 wrinkled and green. Test at 5% level of significance whether
the different shapes and colours of seeds are in the proportion 9:3:3:1 respectively.
Procedure:
/*rndyel=round and yellow, rndgrn=round and green, wrnkyel=wrinkled and yellow,
wrnkgrn=wrinkled and green*/;
data peas;
input shape_color $ count;
cards;
rndyel 315
rndgrn 108
wrnkyel 101
wrnkgrn 32
;
proc freq data=peas order=data;
weight count ;
tables shape_color / chisq testp=(0.5625 0.1875 0.1875 0.0625);
exact chisq;
run;
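The chi-square statistic reported by PROC FREQ for this exercise can be reproduced by hand. The following Python sketch (not SAS code) computes the goodness of fit statistic for the 9:3:3:1 ratio and compares it with 7.815, the tabulated 5% critical value of chi-square with 3 degrees of freedom.

```python
# Observed counts from Exercise 3.4 and the hypothesised 9:3:3:1 ratio
observed = {"rndyel": 315, "rndgrn": 108, "wrnkyel": 101, "wrnkgrn": 32}
ratio = {"rndyel": 9, "rndgrn": 3, "wrnkyel": 3, "wrnkgrn": 1}

total = sum(observed.values())
ratio_sum = sum(ratio.values())

chi_sq = 0.0
for category, obs in observed.items():
    expected = total * ratio[category] / ratio_sum  # expected count under H0
    chi_sq += (obs - expected) ** 2 / expected

CRITICAL_5PCT_3DF = 7.815  # tabulated chi-square value, 3 df, 5% level
reject_h0 = chi_sq > CRITICAL_5PCT_3DF
print(round(chi_sq, 3), reject_h0)  # 0.47 False
```

Since the statistic is far below the critical value, the data are consistent with the 9:3:3:1 ratio.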
Exercise 3.5: The educational standard and adoptability of new innovations among 600 farmers are
given below:
[Table: educational standard by adoption of innovation]
Draw inferences on whether educational standard has any impact on the adoptability of innovations.
Procedure:
data innovation;
input edu $ adopt $ count;
cards;
. . .
;
proc freq order=data;
weight count ;
tables edu*adopt / chisq; /* chi-square test of independence */
run;
Exercise 3.6: An Experiment was conducted using a Randomized complete block design in 5
treatments a, b, c, d & e with three replications. The data (yield) obtained is given below:
Treatment(TRT)
Replication(REP) a b c d e
1 16.9 18.2 17.0 15.1 18.3
2 16.5 19.2 18.1 16.0 18.3
3 17.5 17.1 17.3 17.8 19.8
1. Perform the analysis of variance of the data.
2. Test the equality of treatment means.
3. Test H0: 2T1=T2+T3, where T1, T2, T3, T4 and T5 are treatment effects.
Procedure:
Prepare a SAS data file using
DATA Name;
INPUT REP TRT $ yield;
Cards;
. . .
. . .
. . .
;
Print data using PROC PRINT. Perform analysis using PROC ANOVA, obtain means of
treatments and obtain pairwise comparisons using the least significant difference, Duncan's New
Multiple Range Test and Tukey's Honest Significant Difference test. Make use of the following
statements:
PROC Print;
PROC ANOVA;
Class REP TRT;
Model Yield = REP TRT;
Means TRT/lsd;
Means TRT/duncan;
Means TRT/tukey;
Run;
Perform contrast analysis using PROC GLM.
Proc glm;
Class rep trt;
Model yield = rep trt;
Means TRT/lsd;
Means TRT/duncan;
Means TRT/tukey;
Contrast '1 Vs 2&3' trt 2 -1 -1 0 0;
Run;
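The contrast for H0: 2T1=T2+T3 can be checked by hand from the treatment means. The following Python sketch (not SAS code) uses the standard result for a balanced design that the contrast sum of squares equals r times the squared contrast estimate divided by the sum of squared coefficients, where r is the number of replications.

```python
# Yields from Exercise 3.6, one list per treatment (replications 1 to 3)
yields = {
    "a": [16.9, 16.5, 17.5],
    "b": [18.2, 19.2, 17.1],
    "c": [17.0, 18.1, 17.3],
    "d": [15.1, 16.0, 17.8],
    "e": [18.3, 18.3, 19.8],
}
# Contrast coefficients for 2*T1 - T2 - T3 (treatments d and e get zero)
coef = {"a": 2, "b": -1, "c": -1, "d": 0, "e": 0}
r = 3  # number of replications

means = {trt: sum(obs) / len(obs) for trt, obs in yields.items()}
estimate = sum(coef[trt] * means[trt] for trt in coef)
contrast_ss = r * estimate ** 2 / sum(c * c for c in coef.values())
print(round(estimate, 4), round(contrast_ss, 4))  # -1.7 1.445
```

The contrast sum of squares has one degree of freedom and is tested against the error mean square from the analysis of variance.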
Exercise 3.7: In order to select suitable tree species for Fuel, Fodder and Timber an experiment
was conducted in a randomized complete block design with ten different trees and four
replications. The plant height was recorded in cm. The details of the experiment are given below:
Plant Height (Cms): Place – Kanpur
Name of Tree Spacing Replications
1 2 3 4
A. Indica 4x4 144.44 145.11 104.00 105.44
D. Sisso 4x2 113.50 118.61 118.61 123.00
A. Procer 4x2 60.88 90.94 80.33 92.00
A. Nilotic 4x2 163.44 158.55 158.88 153.11
T. Arjuna 4x2 110.11 116.00 119.66 103.22
L. Loucoc 4x1 260.05 102.27 256.22 217.80
M. Alba 4x2 114.00 115.16 114.88 106.33
C. Siamia 4x2 91.94 58.16 76.83 79.50
E. Hybrid 4x1 156.11 177.97 148.22 183.17
A. Catech 4x2 80.2 108.05 45.18 79.55
Analyze the data and draw your conclusions.
Exercise 3.8: An experiment was conducted with 49 crop varieties (TRT) using a simple lattice
design. The layout and data obtained (Yld) is as given below:
REPLICATION (REP)-I
BLOCKS(BLK)
1 2 3 4 5 6 7
22(7) 10(12) 45(22) 37(25) 18(33) 30(33) 5(28)
24(20) 14(26) 44(21) 41(23) 19(17) 34(31) 6(74)
28(25) 8(42) 43(16) 40(11) 21(13) 35(10) 7(14)
27(68) 9(13) 47(37) 42(24) 17(10) 32(12) 2(14)
25(4) 13(10) 49(13) 36(30) 15(36) 29(22) 1(16)
26(11) 12(21) 48(21) 39(34) 20(30) 33(33) 3(11)
23(45) 11(11) 46(12) 38(15) 16(14) 31(18) 4(7)
REPLICATION (REP)-II
BLOCKS(BLK)
1 2 3 4 5 6 7
22(29) 18(64) 20(25) 23(45) 5(19) 3(13) 14(60)
8(127) 25(31) 27(71) 16(22) 19(47) 24(23) 49(72)
43(119) 46(85) 13(51) 2(13) 47(86) 17(51) 21(10)
1(24) 11(51) 48(121) 37(85) 40(33) 10(30) 42(23)
36(58) 4(39) 41(22) 9(10) 12(48) 31(50) 35(54)
29(97) 39(67) 6(75) 30(65) 33(73) 38(30) 28(54)
15(47) 32(93) 34(44) 44(5) 26(56) 45(103) 7(85)
1. Perform the analysis of variance of the data. Also obtain Type II SS.
2. Obtain adjusted treatment means with their standard errors.
3. Test the equality of all adjusted treatment means.
4. Test whether the sum of treatment means 1 to 3 is equal to the sum of treatment means 4 to 6.
5. Estimate the difference between the mean of treatment 1 and the average of treatment means 2 to 4.
6. Divide the between block sum of squares into between replication sum of squares and
between blocks within replications sum of squares.
7. Assuming that the varieties are a random selection from a population, obtain the genotypic
variance.
8. Analyze the data using block effects as random.
PROCEDURE
Prepare the DATA file.
DATA Name;
INPUT REP BLK TRT yield;
Cards;
. . . .
. . . .
. . . .
;
Print data using PROC PRINT. Perform analysis of 1 to 5 objectives using PROC GLM. The
statements are as follows:
Proc print;
Proc glm;
Class rep blk trt;
Model yield= blk trt/ss2;
Lsmeans trt/stderr; /* adjusted treatment means with standard errors */
Contrast 'A' trt 1 1 1 -1 -1 -1;
Estimate 'A' trt 3 -1 -1 -1/divisor=3;
Run;
The objective 6 can be achieved by another model statement.
Proc glm;
Class rep blk trt;
Model yield= rep blk (rep) trt/ss2;
run;
Objective 7 can be achieved by using another PROC statement:
Proc Varcomp Method=type1;
Class blk trt;
Model yield = blk trt/fixed = 1;
Run;
The above obtains the variance components using Henderson's method. The methods of
maximum likelihood, restricted maximum likelihood and minimum variance quadratic unbiased
estimation can also be used by specifying METHOD=ML, REML and MIVQUE0 respectively.
Objective 8 can be achieved by using PROC MIXED.
Proc Mixed ratio covtest;
Class blk trt;
Model yield = trt;
Random blk/s;
Lsmeans trt/pdiff;
Store lattice;
Run;
PROC PLM SOURCE = lattice;
LSMEANS trt /pdiff lines;
RUN;
Exercise 3.9: Analyze the data obtained through a Split-plot experiment involving the yield of 3
Irrigation (IRRIG) treatments applied to main plots and two Cultivars (CULT) applied to
subplots in three Replications (REP). The layout and data (YLD) is given below:
Replication-I           Replication-II          Replication-III
I1      I2      I3      I1      I2      I3      I1      I2      I3
C1(1.6) C1(2.6) C1(4.7) C1(3.4) C1(4.6) C1(5.5) C1(3.2) C1(5.1) C1(5.7)
C2(3.3) C2(5.1) C2(6.8) C2(4.7) C2(1.1) C2(6.6) C2(5.6) C2(6.2) C2(4.5)
Perform the analysis of the data. (HINT: Steps are given in text).
Remark 3.9.1: Another way proposed for the analysis of split plot designs is to take replication as a
random effect and analyse the data using PROC MIXED of SAS. For the above case, the steps
for using PROC MIXED are:
PROC MIXED COVTEST;
CLASS rep irrig cult;
MODEL yield = irrig cult irrig*cult / DDFM=KR;
RANDOM rep rep*irrig;
LSMEANS irrig cult irrig*cult / PDIFF;
STORE spd;
run;
/* An item store is a special SAS-defined binary file format used to store and restore information with a hierarchical
structure*/
/* The PLM procedure performs post fitting statistical analyses for the contents of a SAS item store that was
previously created with the STORE statement in some other SAS/STAT procedure*/
PROC PLM SOURCE = SPD;
LSMEANS irrig cult irrig*cult /pdiff lines;
RUN;
Remark 3.9.2: In many experimental situations, split plot designs are conducted across
environments and a pooled analysis is required. One way of analysing data of split plot designs
with two factors A and B conducted across environments is
PROC MIXED COVTEST;
CLASS year rep a b;
MODEL yield = a b a*b / DDFM=KR;
/* DDFM specifies the method for computing the denominator degrees of freedom for the tests of fixed effects
resulting from the MODEL*/
RANDOM year rep(year) year*a year*rep*a year*a*b;
LSMEANS a b a*b / PDIFF;
STORE spd1;
run;
PROC PLM SOURCE = SPD1;
LSMEANS a b a*b/pdiff lines;
RUN;
Exercise 3.10: An agricultural field experiment was conducted in 9 treatments using 36 plots
arranged in 4 complete blocks and a sample of harvested output from all the 36 plots are to be
analysed blockwise by three technicians using three different operations. The data collected is
given below:
Block-1
Operation  Technician 1  Technician 2  Technician 3
1          1(1.1)        2(2.1)        3(3.1)
2          4(4.2)        5(5.3)        6(6.3)
3          7(7.4)        8(8.7)        9(9.6)
Block-2
Operation  Technician 1  Technician 2  Technician 3
1          1(2.1)        4(5.2)        7(8.3)
2          2(3.2)        5(6.7)        8(9.9)
3          3(4.5)        6(7.6)        9(10.3)
Block-3
Operation  Technician 1  Technician 2  Technician 3
1          1(1.2)        6(6.3)        8(8.7)
2          9(9.4)        2(2.7)        4(4.8)
3          5(5.9)        7(7.8)        3(3.3)
Block-4
Operation  Technician 1  Technician 2  Technician 3
1          1(3.1)        9(11.3)       5(7.8)
2          6(8.1)        2(4.5)        7(9.3)
3          8(10.7)       4(6.9)        3(5.8)
1. Perform the analysis of the data considering that technicians and operations are crossed with
each other and nested in the blocking factor.
2. Perform the analysis by considering the effects of technicians as negligible.
3. Perform the analysis by ignoring the effects of the operations and technicians.
Procedure:
Prepare the data file.
DATA Name;
INPUT BLK TECH OPER TRT OBS;
Cards;
. . . .
. . . .
. . . .
;
Perform analysis of objective 1 using PROC GLM. The statements are as follows:
Proc glm;
Class blk tech oper trt;
Model obs= blk tech (blk) oper(blk) trt/ss2;
Lsmeans trt oper(blk)/pdiff;
Run;
Perform analysis of objective 2 using PROC GLM with the additional statements as follows:
Proc glm;
Class blk tech oper trt;
Model obs= blk oper(blk) trt/ss2;
run;
Perform analysis of objective 3 using PROC GLM with the additional statements as follows:
Proc glm;
Class blk tech oper trt;
Model obs = blk trt/ss2;
run;
Exercise 3.11: A greenhouse experiment on tobacco mosaic virus was conducted. The
experimental unit was a single leaf. Individual plants were found to be contributing significantly
to error and hence were taken as one source causing heterogeneity in the experimental material.
The position of the leaf within plants was also found to be contributing significantly to the error.
Therefore, the three positions of the leaves viz. top, middle and bottom were identified as levels
of second factor causing heterogeneity. 7 solutions were applied to leaves of 7 plants and
number of lesions produced per leaf was counted. Analyze the data of this experiment.
Plants
Leaf Position 1 2 3 4 5 6 7
Top 1(2) 2(3) 3(1) 4(5) 5(3) 6(2) 7(1)
Middle 2(4) 3(3) 4(2) 5(6) 6(4) 7(2) 1(1)
Bottom 4(3) 5(4) 6(7) 7(6) 1(3) 2(4) 3(7)
The figures at the intersections of the plants and leaf position are the solution numbers and the
figures in the parenthesis are number of lesions produced per leaf.
Procedure:
Prepare the data file.
DATA Name;
INPUT plant posi $ trt count;
Cards;
. . . .
. . . .
. . . .
;
Perform analysis using PROC GLM. The statements are as follows:
Proc glm;
Class plant posi trt;
Model count= plant posi trt/ss2;
Lsmeans trt/pdiff; Run;
Exercise 3.12: The following data was collected through a pilot sample survey on Hybrid Jowar
crop on yield and biometrical characters. The biometrical characters were average Plant
Population (PP), average Plant Height (PH), average Number of Green Leaves (NGL) and Yield
(kg/plot).
1. Obtain correlation coefficient between each pair of the variables PP, PH, NGL and yield.
2. Fit a multiple linear regression equation by taking yield as dependent variable and
biometrical characters as explanatory variables. Print the matrices used in the regression
computations.
3. Test the significance of the regression coefficients and also equality of regression
coefficients of a) PP and PH b) PH and NGL
4. Obtain the predicted values corresponding to each observation in the data set.
5. Identify the outliers in the data set.
6. Check for the linear relationship among the biometrical characters.
7. Fit the model without intercept.
8. Perform principal component analysis.
No. PP PH NGL Yield
1 142.00 0.5250 8.20 2.470
2 143.00 0.6400 9.50 4.760
3 107.00 0.6600 9.30 3.310
4 78.00 0.6600 7.50 1.970
5 100.00 0.4600 5.90 1.340
6 86.50 0.3450 6.40 1.140
7 103.50 0.8600 6.40 1.500
8 155.99 0.3300 7.50 2.030
9 80.88 0.2850 8.40 2.540
10 109.77 0.5900 10.60 4.900
11 61.77 0.2650 8.30 2.910
12 79.11 0.6600 11.60 2.760
13 155.99 0.4200 8.10 0.590
14 61.81 0.3400 9.40 0.840
15 74.50 0.6300 8.40 3.870
16 97.00 0.7050 7.20 4.470
17 93.14 0.6800 6.40 3.310
18 37.43 0.6650 8.40 1.570
19 36.44 0.2750 7.40 0.530
20 51.00 0.2800 7.40 1.150
21 104.00 0.2800 9.80 1.080
22 49.00 0.4900 4.80 1.830
23 54.66 0.3850 5.50 0.760
24 55.55 0.2650 5.00 0.430
25 88.44 0.9800 5.00 4.080
26 99.55 0.6450 9.60 2.830
27 63.99 0.6350 5.60 2.570
28 101.77 0.2900 8.20 7.420
29 138.66 0.7200 9.90 2.620
30 90.22 0.6300 8.40 2.000
31 76.92 1.2500 7.30 1.990
32 126.22 0.5800 6.90 1.360
33 80.36 0.6050 6.80 0.680
34 150.23 1.1900 8.80 5.360
35 56.50 0.3550 9.70 2.120
36 136.00 0.5900 10.20 4.160
37 144.50 0.6100 9.80 3.120
38 157.33 0.6050 8.80 2.070
39 91.99 0.3800 7.70 1.170
40 121.50 0.5500 7.70 3.620
41 64.50 0.3200 5.70 0.670
42 116.00 0.4550 6.80 3.050
43 77.50 0.7200 11.80 1.700
44 70.43 0.6250 10.00 1.550
45 133.77 0.5350 9.30 3.280
46 89.99 0.4900 9.80 2.690
Procedure:
Prepare a data file
Data mlr;
Input PP PH NGL Yield;
Cards;
. . . .
. . . .
;
For obtaining correlation coefficient, Use PROC CORR;
Proc Corr;
Var PP PH NGL Yield;
run;
For fitting of multiple linear regression equation, use PROC REG
Proc Reg;
Model Yield = PP PH NGL/ p r influence vif collin xpx i;
Test1: Test PP=0;
Test2: Test PH=0;
Test3: Test NGL=0;
Test4: Test PP-PH=0;
Test4a: Test PP=0, PH=0; /* joint test that both coefficients are zero */
Test5: Test PH-NGL=0;
Test5a: Test PH=0, NGL=0; /* joint test that both coefficients are zero */
Model Yield = PP PH NGL/noint;
run;
Proc reg;
Model Yield = PP PH NGL;
Restrict intercept =0;
Run;
For diagnostic plots
Proc Reg plots(unpack)=diagnostics;
Model Yield = PP PH NGL;
run;
For variable selection, one can use the following option in model statement:
Selection=stepwise sls=0.10;
For performing principal component analysis, use the following:
PROC PRINCOMP;
VAR PP PH NGL YIELD;
run;
Example 3.13: An experiment was conducted at the Division of Agricultural Engineering, IARI,
New Delhi for studying the capacity of a grader in number of hours when used with three
different speeds and three processor settings. The experiment was conducted using a factorial
completely randomised design in 3 replications. The treatment combinations and data obtained
on capacity of grader in hours are given below:
Replication speed Processor setting trt cgrader
1 1 1 1 1852
1 1 2 2 1848
1 1 3 3 1855
1 2 1 4 2270
1 2 2 5 2279
1 2 3 6 2272
1 3 1 7 3035
1 3 2 8 3042
1 3 3 9 3028
2 1 1 1 1845
2 1 2 2 1855
2 1 3 3 1860
2 2 1 4 2276
2 2 2 5 2275
2 2 3 6 2248
2 3 1 7 3036
2 3 2 8 3033
2 3 3 9 3038
3 1 1 1 1851
3 1 2 2 1840
3 1 3 3 1840
3 2 1 4 2265
3 2 2 5 2280
3 2 3 6 2278
3 3 1 7 3040
3 3 2 8 3028
3 3 3 9 3040
Experimenter was interested in identifying the best combination of speed and processor setting
that gives maximum capacity of the grader in hours.
Solution: This data can be analysed as per procedure of factorial CRD and one can use the
following SAS steps for performing the analysis:
Data ex1a;
/*here rep: replication; proset: processor setting and cgrader: capacity of the grader in hours*/
Input rep speed proset cgrader;
Cards;
1 1 1 1852
1 1 2 1848
1 1 3 1855
. . . .
. . . .
. . . .
3 3 1 3040
3 3 2 3028
3 3 3 3040
;
Proc glm data=ex1a;
Class speed proset;
Model cgrader = speed proset speed*proset;
Lsmeans speed proset speed*proset/pdiff adjust=tukey lines;
Run;
The above analysis tests the significance of the main effects of speed and processor
setting and of their interaction. Through this analysis one can also identify the speed level
(averaged over processor settings) and the processor setting (averaged over speed levels) at which
the capacity of the grader is maximum. The multiple comparisons between means of combinations
of speed and processor setting would help in identifying the combination at which the capacity of
the grader is maximum.
Exercise 3.14: An experiment was conducted with five levels of each of the four fertilizer
treatments nitrogen, Phosphorus, Potassium and Zinc. The levels of each of the four factors and
yield obtained are as given below. Fit a second order response surface design using the original
data. Test the lack of fit of the model. Compute the ridge of maximum and minimum responses.
Obtain predicted residual Sum of squares.
N P2O5 K2O Zn Yield
40 30 25 20 11.28
40 30 25 60 8.44
40 30 75 20 13.29
40 90 25 20 7.71
120 30 25 20 8.94
40 30 75 60 10.9
40 90 25 60 11.85
120 30 25 60 11.03
120 30 75 20 8.26
120 90 25 20 7.87
40 90 75 20 12.08
40 90 75 60 11.06
120 30 75 60 7.98
120 90 75 60 10.43
120 90 75 20 9.78
120 90 75 60 12.59
160 60 50 40 8.57
0 60 50 40 9.38
80 120 50 40 9.47
80 0 50 40 7.71
80 60 100 40 8.89
80 60 0 40 9.18
80 60 50 80 10.79
80 60 50 0 8.11
80 60 50 40 10.14
80 60 50 40 10.22
80 60 50 40 10.53
80 60 50 40 9.5
80 60 50 40 11.53
80 60 50 40 11.02
Procedure:
Prepare a data file.
/* yield at different levels of several factors */
title 'yield with factors N P K Zn';
data dose;
input n p k Zn y ; label y = "yield" ;
cards;
. . . . .
. . . . .
. . . . .
;
*Use PROC RSREG;
ods graphics on;
proc rsreg data=dose plots(unpack)=surface(3d);
model y= n p k Zn/ nocode lackfit press;
run;
ods graphics off;
*If we do not want surface plots, then we may use;
proc rsreg;
model y= n p k Zn/ nocode lackfit press;
Ridge min max;
run;
Exercise 3.15: Fit a second order response surface design to the following data. Take
replications as covariate.
Fertilizer1 Fertilizer2 X1 X2 Yields (lb/plot)
                              Replication I Replication II
50 15 -1 -1 7.52 8.12
120 15 +1 -1 12.37 11.84
50 25 -1 +1 13.55 12.35
120 25 +1 +1 16.48 15.32
35 20 -√2 0 8.63 9.44
134 20 +√2 0 14.22 12.57
85 13 0 -√2 7.90 7.33
85 27 0 +√2 16.49 17.40
85 20 0 0 15.73 17.00
Procedure:
Prepare a data file.
/* yield at different levels of several factors */
title 'yield with factors x1 x2';
data respcov;
input rep fert1 fert2 x1 x2 yield ; /* rep: replication number, used as covariate */
cards;
. . . . .
. . . . .
. . . . .
;
/*Use PROC RSREG.*/
ODS Graphics on;
proc rsreg plots(unpack)=surface(3d);
model yield = rep fert1 fert2/ covar=1 nocode lackfit ;
Ridge min max;
run;
ods graphics off;
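Because the MODEL statement uses rep as the covariate, each plot yield must appear on its own observation together with its replication number. The following is a minimal sketch of one way to build such a data set from lines that carry both replications; the variable names y1 and y2 are assumptions, and only two of the design points are typed in for illustration.

```sas
/* Sketch: one input line per design point, two output observations (one per replication). */
data respcov;
   input fert1 fert2 x1 x2 y1 y2;      /* y1, y2: yields in replications I and II (assumed names) */
   rep = 1; yield = y1; output;
   rep = 2; yield = y2; output;
   drop y1 y2;
   datalines;
120 25 1 1 16.48 15.32
85 20 0 0 15.73 17.00
;
run;

proc print data=respcov noobs;
run;
```

The remaining design points would be added to the DATALINES block in the same layout.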
Exercise 3.16: The following data relate to the length (in cm) of the ear-head of a wheat variety:
9.3, 18.8, 10.7, 11.5, 8.2, 9.7, 10.3, 8.6, 11.3, 10.7, 11.2, 9.0, 9.8, 9.3, 10.3, 10, 10.1, 9.6, 10.4.
Test the hypothesis that the median length of ear-head is 9.9 cm.
Procedure:
This may be tested using any of the three tests for location available in PROC UNIVARIATE, viz.
Student's t test, the sign test, and the Wilcoxon signed rank test. All three tests produce a test
statistic for the null hypothesis that the mean or median is equal to a given value μ0 against the
two-sided alternative that the mean or median is not equal to μ0. By default, PROC
UNIVARIATE sets the value of μ0 to zero. You can use the MU0= option in the PROC
UNIVARIATE statement to specify the value of μ0. If the data are from a normal population, then
we can infer using the t-test; otherwise the non-parametric sign test and Wilcoxon signed rank
test may be used for drawing inferences.
Procedure:
data npsign;
input length;
cards;
9.3
18.8
10.7
11.5
8.2
9.7
10.3
8.6
11.3
10.7
11.2
9.0
9.8
9.3
10.3
10.0
10.1
9.6
10.4
;
PROC UNIVARIATE DATA=npsign MU0=9.9;
VAR length;
HISTOGRAM / NOPLOT ;
RUN;
QUIT;
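If only the three location tests are of interest, the printed output can be restricted to that table. The following is a minimal sketch, assuming the data set npsign created above; TestsForLocation is the ODS table name that PROC UNIVARIATE uses for these tests.

```sas
/* Sketch: print only the tests for location (Student's t, sign, signed rank). */
ods select TestsForLocation;
proc univariate data=npsign mu0=9.9;
   var length;
run;
```

The ODS SELECT statement applies only to the procedure step that follows it, so later steps print their full output again.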
Exercise 3.17: An experiment was conducted with 21 animals to determine if the four different
feeds have the same distribution of Weight gains on experimental animals. The feeds 1, 3 and 4
were given to 5 randomly selected animals and feed 2 was given to 6 randomly selected animals.
The data obtained is presented in the following table.
Feeds Weight gains (kg)
1 3.35 3.8 3.55 3.36 3.81
2 3.79 4.1 4.11 3.95 4.25 4.4
3 4 4.5 4.51 4.75 5
4 3.57 3.82 4.09 3.96 3.82
Procedure:
data np;
input feed wt;
datalines;
1 3.35
1 3.80
1 3.55
1 3.36
1 3.81
2 3.79
2 4.10
2 4.11
2 3.95
2 4.25
2 4.40
3 4.00
3 4.50
3 4.51
3 4.75
3 5.00
4 3.57
4 3.82
4 4.09
4 3.96
4 3.82
;
PROC NPAR1WAY DATA=np WILCOXON; /* for performing Kruskal-Wallis test */
VAR wt;
CLASS feed;
RUN;
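Before reading the Kruskal-Wallis test, it often helps to look at the location and spread of the weight gains within each feed. The following is a minimal sketch, assuming the data set np created above; the choice of statistics is only illustrative.

```sas
/* Sketch: descriptive summary of weight gain by feed. */
proc means data=np n mean median min max maxdec=2;
   class feed;
   var wt;
run;
```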
Example 3.18: Finney (1971) gave data representing the effect of a series of doses of rotenone
(an insecticide) when sprayed on Macrosiphoniella sanborni (the chrysanthemum aphid). The table
below contains the concentration, the number of insects tested at each dose, the proportion dying
and the probit transformation (probit+5) of each of the observed proportions.
Concentration (mg/l) | No. of insects (n) | No. of affected (r) | % kill (P) | Log concentration (x) | Empirical probit
10.2 50 44 88 1.01 6.18
7.7 49 42 86 0.89 6.08
5.1 46 24 52 0.71 5.05
3.8 48 16 33 0.58 4.56
2.6 50 6 12 0.41 3.82
0 49 0 0 - -
Perform the probit analysis on the above data.
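The last three columns of the table can be reproduced with a short DATA step. This is a minimal sketch: the data set name empirical and the variable names p, x and eprobit are assumptions, and the zero-dose row is left out because neither log10(0) nor the probit of a zero proportion is defined. Small differences from the table may arise because the table rounds the percentage kill before transforming it.

```sas
/* Sketch: log concentration and empirical probit (normal deviate + 5). */
data empirical;
   input con n r;
   p       = r / n;            /* observed proportion killed */
   x       = log10(con);       /* log concentration          */
   eprobit = probit(p) + 5;    /* empirical probit           */
   datalines;
10.2 50 44
7.7 49 42
5.1 46 24
3.8 48 16
2.6 50 6
;
run;

proc print data=empirical noobs;
   format p 5.2 x 5.2 eprobit 5.2;
run;
```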
Procedure
data probit;
input con n r;
datalines;
10.2 50 44
7.7 49 42
5.1 46 24
3.8 48 16
2.6 50 6
0 49 0
;
ods html;
proc probit data=probit log10;
model r/n = con / lackfit inversecl;
title 'Output of probit analysis';
run;
ods html close;
Model Information
Data Set WORK.PROBIT
Events Variable r
Trials Variable n
Number of Observations 5
Number of Events 132
Number of Trials 243
Name of Distribution Normal
Log Likelihood -120.0516414
Number of Observations Used 5
Number of Events 132
Number of Trials 243
Algorithm converged.
Goodness-of-Fit Tests
Statistic Value DF Pr > ChiSq
Pearson Chi-Square 1.7289 3 0.6305
L.R. Chi-Square 1.7390 3 0.6283
Response-Covariate Profile
Response Levels 2
Number of Covariate Values 5
Since the chi-square is small (p > 0.1000), fiducial limits will be calculated using a t value of 1.96
Type III Analysis of Effects
Effect DF Wald Chi-Square Pr > ChiSq
Log10(con) 1 77.5920 <.0001
Analysis of Parameter Estimates
Parameter DF Estimate Standard Error 95% Confidence Limits Chi-Square Pr > ChiSq
Intercept 1 -2.8875 0.3501 -3.5737 -2.2012 68.01 <.0001
Log10(con) 1 4.2132 0.4783 3.2757 5.1507 77.59 <.0001
Probit Model in Terms of
Tolerance Distribution
MU SIGMA
0.68533786 0.23734947
Estimated Covariance Matrix for
Tolerance Parameters
MU SIGMA
MU 0.000488 -0.000063
SIGMA -0.000063 0.000726
Probit Analysis on Log10(con)
Probability Log10(con) 95% Fiducial Limits
0.01 0.13318 -0.03783 0.24452
0.02 0.19788 0.04453 0.29830
0.03 0.23893 0.09668 0.33253
0.04 0.26981 0.13584 0.35834
0.05 0.29493 0.16764 0.37940
0.06 0.31631 0.19466 0.39737
0.07 0.33506 0.21832 0.41316
0.08 0.35184 0.23946 0.42733
0.09 0.36711 0.25866 0.44026
0.10 0.38116 0.27631 0.45218
0.15 0.43934 0.34898 0.50192
0.20 0.48558 0.40618 0.54202
0.25 0.52525 0.45467 0.57700
0.30 0.56087 0.49759 0.60904
0.35 0.59388 0.53666 0.63942
0.40 0.62521 0.57295 0.66905
0.45 0.65551 0.60716 0.69861
0.50 0.68534 0.63983 0.72870
0.55 0.71516 0.67142 0.75986
0.60 0.74547 0.70240 0.79265
0.65 0.77679 0.73330 0.82766
0.70 0.80980 0.76480 0.86563
0.75 0.84543 0.79777 0.90761
0.80 0.88510 0.83352 0.95533
0.85 0.93133 0.87427 1.01188
0.90 0.98951 0.92456 1.08401
0.91 1.00357 0.93658 1.10155
0.92 1.01883 0.94960 1.12065
0.93 1.03562 0.96387 1.14170
0.94 1.05436 0.97976 1.16526
0.95 1.07574 0.99783 1.19218
0.96 1.10086 1.01898 1.22388
0.97 1.13174 1.04490 1.26294
0.98 1.17279 1.07924 1.31498
0.99 1.23750 1.13315 1.39721
Probit Analysis on con
Probability con 95% Fiducial Limits
0.01 1.35888 0.91657 1.75599
0.02 1.57718 1.10799 1.98745
0.03 1.73353 1.24935 2.15043
0.04 1.86129 1.36724 2.28215
0.05 1.97212 1.47110 2.39553
0.06 2.07163 1.56554 2.49671
0.07 2.16302 1.65317 2.58917
0.08 2.24825 1.73565 2.67506
0.09 2.32868 1.81410 2.75586
0.10 2.40526 1.88932 2.83257
0.15 2.75005 2.23349 3.17629
0.20 3.05900 2.54788 3.48353
0.25 3.35157 2.84884 3.77571
0.30 3.63808 3.14478 4.06477
0.35 3.92538 3.44084 4.35935
0.40 4.21897 3.74068 4.66710
0.45 4.52389 4.04724 4.99582
0.50 4.84549 4.36343 5.35423
0.55 5.18995 4.69265 5.75260
0.60 5.56506 5.03963 6.20374
0.65 5.98127 5.41132 6.72450
0.70 6.45363 5.81830 7.33883
0.75 7.00531 6.27722 8.08377
0.80 7.67532 6.81590 9.02252
0.85 8.53758 7.48633 10.27723
0.90 9.76143 8.40534 12.13411
0.91 10.08243 8.64132 12.63428
0.92 10.44313 8.90434 13.20233
0.93 10.85466 9.20181 13.85792
0.94 11.33346 9.54469 14.63036
0.95 11.90537 9.95006 15.56609
0.96 12.61427 10.44674 16.74479
0.97 13.54388 11.08927 18.32046
0.98 14.88655 12.00168 20.65263
0.99 17.27807 13.58779 24.95808
Interpretation: The goodness-of-fit tests (p-values = 0.6305, 0.6283) suggest that the
distribution and the model fit the data adequately. In this case, the fitting is done on the normal
equivalent deviate only, without adding 5. Therefore, log LD50 or log ED50 corresponds to the
value of probit = 0. Log LD50 is obtained as 0.685338. Therefore, the stress level at which
50% of the insects will be killed is 10^0.685338 = 4.845 mg/l. Similarly, the stress level at which
65% of the insects will be killed is 10^0.776793 = 5.981 mg/l. Both values are also given in
the tables above.
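The link between the parameter estimates, the tolerance parameters MU and SIGMA, and the lethal doses can be checked with a few lines of arithmetic. The following is a minimal sketch that types in the rounded intercept and slope from the output above, so the results agree with the printed values only up to rounding; the variable names are assumptions.

```sas
/* Sketch: tolerance parameters and lethal doses from the fitted probit line. */
data _null_;
   intercept = -2.8875;                      /* from Analysis of Parameter Estimates */
   slope     =  4.2132;
   mu    = -intercept / slope;               /* log10(LD50): dose at which probit = 0 */
   sigma = 1 / slope;                        /* scale of the tolerance distribution   */
   ld50  = 10**mu;                           /* back-transform to mg/l                */
   ld65  = 10**(mu + sigma * probit(0.65));  /* dose giving 65% kill                  */
   put mu= sigma= ld50= ld65=;
run;
```

The values written to the log should be close to MU = 0.685, SIGMA = 0.237, and lethal doses of about 4.85 and 5.98 mg/l.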
4. Discussion
We have initiated a link “Analysis of Data” at the Design Resources Server
(www.iasri.res.in/design) to provide steps of analysis of data generated from designed
experiments using statistical packages like SAS, SPSS, MINITAB, SYSTAT, MS-EXCEL, etc.
For details and live examples one may refer to the link Analysis of Data at
http://www.iasri.res.in/design/Analysis%20of%20data/Analysis%20of%20Data.html.
How to see SAS/STAT Examples?
One can learn from the examples available at
http://support.sas.com/rnd/app/examples/STATexamples.html
How to use HELP?
Help → SAS Help and Documentation → Contents → Learning to Use SAS → Sample SAS
Programs → SAS/STAT → …
5. Strengthening Statistical Computing for NARS
The NAIP Consortium on Strengthening Statistical Computing for NARS (www.iasri.res.in/sscnars)
aims at providing
research guidance in statistical computing and computational statistics and creating a sound
and healthy statistical computing environment; and
advanced, versatile, innovative and state-of-the-art high-end statistical packages to enable
researchers to draw meaningful and valid inferences from their research.
The efforts also involve designing intelligent algorithms for implementing statistical
techniques, particularly for analysing massive data sets, simulation, bootstrap, etc.
The objectives of the consortium are:
To strengthen the high end statistical computing environment for the scientists in NARS;
To organize customized training programmes and also to develop training modules and
manuals for the trainers at various hubs; and
To sensitize the scientists in NARS with the statistical computing capabilities available
for enhancing their computing and research analytics skills.
This consortium has provided the platform for closer interactions among all NARS
organizations.
Capacity Building
For capacity building of researchers in the usage of high end statistical computing facility and
statistical techniques,
209 trainers have been trained through 30 working days training programmes;
2166 researchers have been trained through 104 training programmes of one week duration
each in the usage.
The capacity building efforts have paved the way for publishing research papers in the high
impact factor journals.
Indian NARS Statistical Computing Portal
To provide service-oriented computing, the Indian NARS Statistical Computing Portal has been
developed and established. It is available to NARS users through IP authentication at
http://stat.iasri.res.in/sscnarsportal. Any researcher from Indian NARS may obtain a user name
and password from the Nodal Officer of their NARS organization; the list is available at
www.iasri.res.in/sscnars. The portal follows the software-as-a-service paradigm of computing,
so there is no need to install a statistical package on the client side. The following 24
modules for analysis of data are available on this portal, classified into four categories:
Basic Statistics
Descriptive Statistics
Univariate Distribution Fitting
Test of Significance based on t-test
Test of Significance based on Chi-square test
Correlation Analysis
Regression Analysis
Designs of Experiments
Completely randomized designs
Block Designs (includes both complete and incomplete block designs)
Combined Block Designs
Augmented Block Designs
Resolvable Block Designs
Nested Block Designs
Row-Column Designs
Cross Over Designs
Split Plot Designs
Split-Split-Plot Designs
Split Factorial (main A, sub B×C) designs
Split Factorial (main A×B, sub C×D) designs
Strip Plot Designs
Response Surface Designs
Multivariate Analysis
Principal Component Analysis
Linear Discriminant Analysis
Statistical Genetics
Estimation of Heritability from half- sib data
Estimation of variance-Covariance matrix from Block Designs
The above modules can be used by uploading *.xlsx, *.csv and *.txt files, and results can be
saved as *.RTF or *.pdf files. This has helped researchers analyze their data efficiently
and without loss of time.
Requirements of Excel Files during analysis over Indian NARS Statistical Computing
Portal
1. The data file must have the extension .xls, .xlsx, .csv or .txt.
2. The system considers only the first sheet of the Excel file, i.e. the sheet whose name appears
first in lexicographic order. It does not analyze data that lie in subsequent sheets of the
Excel file.
3. Do not use a period (.) or zero (0) to denote missing values; these will not be treated as
missing. Please leave the missing observations as blank cells.
4. If you get a wrong analysis, kindly check your Excel file. Go to the first cell of the first
column and press Ctrl+Shift+End. This selects all the filled rows and columns. If the
selection includes empty rows or columns, kindly delete those rows and columns;
otherwise the analysis results will be wrong.
5. Do not use special characters in the variable/column names. Also variable names should
6. Do not apply any formatting to the Excel sheet, including formats or expressions in the cell
values. Cells should contain plain data values.
7. If the cells of the first row have been merged, they will not be detected as column/variable names.
8. If any rows or columns are hidden, they will be displayed during the analysis.
Basic Statistics
9. Descriptive Statistics: The data file should contain at least one quantitative analysis
variable.
10. Univariate Distribution Fitting: The data file should contain at least one quantitative
numeric variable.
11. Test of Significance based on t-distribution: The data file should contain at least one
quantitative variable name and one classificatory variable.
12. Chi-Square Test: The data file should contain at least one categorical variable and
weights or frequency counts variable if frequencies are entered in a separate column.
Data may also have classificatory in it.
13. Correlation: The data file should contain at least two quantitative variables.
14. Regression Analysis: The data file should contain at least one Dependent and one
Independent variable.
Design of Experiments
15. Unblock Design: Prepare a data file containing one variable to describe the Treatment
details and at least one response/ dependent variable in the experimental data to be
analyzed. Also, the treatment details may be coded or may have actual names (i.e. data
values, for variable describing treatment column may be in numeric or character). The
maximum length of treatment value is 20 characters. The variables can be entered in any
order.
16. Block Design: Prepare a data file containing two variables to describe the block and
treatment details. There should be at least one response/ dependent variable in the
experimental data to be analyzed. Also, the block/treatment details may be coded or may
have actual names (i.e. data values, for variables describing block and treatment column
may be in numeric or character). The maximum length of treatment value is 20 character.
The variables can be entered in any order. (These conditions are applicable to other
similar experimental designs also)
17. Combined Block Design: The data file should contain three variables to describe
Environment, Block, Treatment variables and at least one Dependent variable.
18. Augmented Block Design: The data file should contain two variables to describe Block
& Treatment variables and at least one Dependent variable. At present, Portal supports
only numeric treatment and block variables for augmented designs. An augmented block
design involves two sets of treatments known as check or control and test treatments. The
treatments should be numbered in such a fashion that the check or control treatments are
numbered first followed by test treatments. For example, if there are 4 control treatments
and 8 test treatments, then the control treatments are renumbered as 1, 2, 3, 4 and tests are
renumbered as 5, 6, 7, 8, 9, 10, 11, 12.
19. Resolvable Block Design: The data file should contain three variables to describe the
Replication, Block, Treatment variables and at least one Dependent/ response variable.
20. Nested Block Design: The data file should contain three variables to describe Block,
SubBlock, Treatment variables and at least one Dependent variable.
21. Row Column Design: The data file should contain three variables to describe Row,
Column, Treatment variables and at least one Dependent variable.
22. Crossover Design: Create a data file with at least 5 variables: one for units, one for
periods, one for treatments, one for residual effects, and one for the dependent or analysis
variable. For performing analysis using the portal, please arrange the data in the following
order: animal numbers as units; periods coded as 1, 2, 3, and so on; treatments as
letters or numbers; and the residual effect as residual. The residual variable may be coded
as follows: in the first period it is assigned the fixed code 1, and in the other periods it
takes the code (1 to 3) of the treatment received by the unit in the previous period. It
may, however, be noted that one can retain the same names or can code in any other
fashion. A carry-over or residual term has the special property, as a factor or class
variate, of having no level in the first period, because the treatment in the first period is
not affected by any residual or carry-over effect of any treatment. In practice, carry-over
or residual effects are adjusted for period effects (by default all effects are adjusted for
all others in these analyses). As a consequence, any level can be assigned to the residual
variate in the first period, provided the same level is always used. An adjustment for
periods then removes this part of the residual term. (For details, reference may be made to
Jones, B. and Kenward, M.G. 2003. Design and Analysis of Cross Over Trials. Chapman and
Hall/CRC, New York. Pp: 212)
23. Split Plot Design: The data file should contain three variables to describe Replication,
Main Plot, Sub Plot variables and at least one Dependent variable.
24. Split Split Plot Design: The data file should contain four variables to describe
Replication, Main Plot, Sub Plot, and Sub-Sub Plot Treatment variables and at least one
Dependent variable.
25. Split Factorial (Main A, Sub B×C) Plot Design: The data file should contain four
variables to describe Replication, Main Plot, Sub Plot(1) {levels of factor 1 in sub plot}
and Sub Plot(2) {levels of factor 2 in sub plot} Treatment variables and at least one
Dependent variable.
26. Split Factorial (Main A×B, Sub C×D) Plot Design: Create a data file with at least 6
variables, one for block or replication, one for main plot- treatment factor 1, one main
plot- treatment factor 2, one for subplot- treatment factor 1, one for subplot- treatment
factor 2 and at least one for the dependent or analysis variable. If the data on more than
one dependent variable is collected in the same experiment, the data on all variables may
be entered in additional columns. One may give actual levels used for different factors
applied in main plot-treatment factor 1, main plot- treatment factor 2, subplot- treatment
factor 1 and subplot- treatment factor 2. Please remember that there should not be any
space between a single data value. Main plot- treatment factor 1, main plot- treatment
factor 2, subplot- treatment factor 1, subplot- treatment factor 2 treatments and block
numbers may be coded as 1, 2, 3 and so on. One can have character values also.
27. Strip Plot Design: The data file should contain at least 4 variables to describe
Replication, Horizontal Strip, Vertical Strip variables and at least one Dependent
variable.
28. Response Surface Design: The data file should contain at least one treatment factor
variable and at least one dependent variable
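The coding of the residual (carry-over) variable described for the Crossover Design above can be sketched in a DATA step. This is a minimal sketch, not the portal's own code: the data set names, variable names and response values are hypothetical and serve only to show the coding rule (fixed code 1 in the first period, previous period's treatment code afterwards).

```sas
/* Hypothetical cross-over data: 2 units, 3 periods, treatments coded 1 to 3. */
/* The response values are illustrative only.                                  */
data cross;
   input unit period treat y;
   datalines;
1 1 1 20.1
1 2 2 22.4
1 3 3 21.7
2 1 2 19.8
2 2 3 23.0
2 3 1 20.5
;
run;

proc sort data=cross;
   by unit period;
run;

data cross2;
   set cross;
   by unit;
   previous = lag(treat);             /* treatment on the preceding record       */
   if first.unit then residual = 1;   /* fixed code in the first period          */
   else residual = previous;          /* treatment received in the previous period */
   drop previous;
run;

proc print data=cross2 noobs;
run;
```

The LAG function is called on every observation, and its value is simply ignored on the first record of each unit; calling it only inside the ELSE branch would give wrong results.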
Multivariate Analysis
29. Principal Component Analysis: The data file should contain at least one quantitative
analysis variable.
30. Discriminant Analysis: The data file should contain at least one quantitative analysis
variable and a classificatory variable.
Statistical Genetics
31. Genetic Variance Covariance: Create a data file with at least 4 variables, one for
blocking variable, one for treatments and at least two analysis variable.
32. Heritability Estimation from Half-Sib Data: The data file should contain at least one
quantitative analysis variable and a classificatory variable.
Other IP Authenticated Services
Following can also be accessed through IP authenticated networks:
Web Report Studio: http://stat.iasri.res.in/sscnarswebreportstudio
BI DashBoard: http://stat.iasri.res.in/sscnarsbidashboard
Web OLAP Viewer: http://sas.iasri.res.in:8080/sscnarswebolapviewer
E-Miner 6.1: http://sas.iasri.res.in:6401/AnalyticsPlatform
E-Miner 7.1: http://stat.iasri.res.in/SASEnterpriseMinerJWS/Status
Accessing SAS E-Miner through URL (IP Authenticated Services)
For Accessing E-miner 6.1 and 7.1 through URLs, following ports should be open
Server Ports
1) Metadata Server 8561
2) Object spawner 8581
3) Table Server 2171
4) Remote Server 5091
5) SAS App. Olap Server 5451
6) SAS Deployment Tester Server 10021
7) Analytics Platform Server 6411
8) Framework Server 22031
However, if you are accessing only E-miner 6.1, then following port need not be opened.
Framework Server 22031
Steps for accessing SAS Enterprise Miner 6.1 and SAS Enterprise Miner 7.1 separately
SAS Enterprise Miner 6.1
Pre-requisite:
JRE 1.5 Update 15
If Firewall and proxy has been implemented then kindly open following ports:
Server Ports
2) Object spawner 8581
3) Table Server 2171
4) Remote Server 5091
5) SAS App. OLAP Server 5451
6) SAS Deployment Tester Server 10021
7) Analytics Platform Server 6411
Steps to be followed:
If you have installed multiple Java Runtime Environments, then go to Control Panel →
Java → Java tab → View, keep JRE 1.5.0_15 checked and uncheck all others.
Check for the entry of sas.iasri.res.in in the hosts file. If it is not present, open the hosts
file in C:\Windows\System32\drivers\etc and add the internal or external IP given by
IASRI. The internal IP is to be specified only at IASRI, New Delhi. All other NARS
organizations should specify the external IP only, entered as:
203.197.217.209 sas.iasri.res.in sas
Now Go to URL: http://sas.iasri.res.in:6401/AnalyticsPlatform
Click on Launch and then Run
SAS Enterprise Miner 7.1
Pre-requisite:
JRE 1.6 Update 16 or higher
If Firewall and/or proxy has been implemented then kindly open the following ports:
Server Ports
2) Object spawner 8581
3) Framework Server 22031
4) Remote Server 5091
5) SAS App. Olap Server 5451
6) SAS Deployment Tester Server 10021
Steps to be followed:
If you have installed multiple Java Runtime Environments, then go to Control Panel →
Java → Java tab → View, keep JRE 1.6.0_16 or the highest available version checked and
uncheck all others.
Check for the entry of stat.iasri.res.in in the hosts file. If it is not present, open the hosts
file in C:\Windows\System32\drivers\etc and add the internal or external IP given by
IASRI, New Delhi. The internal IP is to be specified only at IASRI, New Delhi. All other
NARS organizations should specify the external IP only, entered as:
14.139.56.156 stat.iasri.res.in stat
(earlier 203.197.217.221 stat.iasri.res.in stat)
Now Go to URL: http://stat.iasri.res.in/SASEnterpriseMinerJWS/Status
Click on Launch and then Run
Please note: You cannot run both E-Miner 6.1 and E-Miner 7.1 together. To run
E-Miner 6.1, JAVA 1.5.0_15 should be available, and to run E-Miner 7.1, JAVA
version 1.6 onwards should be available on your system.
Indian NARS Statistical Computing Portal and other IP authenticated services are best viewed in
Internet Explorer 6 to 8 and Firefox 2.0.0.11 and 3.0.6
Macros Developed
Macros have been developed for some commonly used statistical analysis and made available at
Project Website www.iasri.res.in/sscnars. Following macros have been developed:
1. Analysis of data from Augmented Block designs
http://www.iasri.res.in/sscnars/augblkdsgn.aspx
2. Analysis of data from Split Factorial (Main A, Sub B×C) designs
http://www.iasri.res.in/sscnars/spltfctdsgn.aspx
3. Analysis of data from Split Factorial (Main A×B, Sub C) designs
http://www.iasri.res.in/sscnars/spltfctdsgnm2s1.aspx
4. Analysis of data from Split Factorial (Main A×B, Sub C×D) designs
http://www.iasri.res.in/sscnars/spltfactm2s2.aspx
5. Analysis of data from Split Split Plot designs
http://www.iasri.res.in/sscnars/spltpltdsgn.aspx
6. Analysis of data from Strip Plot designs
http://www.iasri.res.in/sscnars/StripPlot.aspx
7. Analysis of data from Strip-Split Plot designs
http://www.iasri.res.in/sscnars/stripsplit.aspx
8. Econometric Analysis (diversity indices, instability index, compound growth rate, Garrett
scoring technique and demand analysis using the LA-AIDS model)
http://www.iasri.res.in/sscnars/ecoanlysis.aspx
9. Estimation of heritability along with its standard error from half sib data
http://www.iasri.res.in/sscnars/heritability.aspx
10. Generation of Polycross designs
http://www.iasri.res.in/sscnars/polycrossdesign.aspx
11. Generation of TFNBCB designs
http://www.iasri.res.in/sscnars/TFNBCBdesigns.aspx
How to see updated version of reference manual?
Reference manual is updated regularly and updated version may be downloaded from
http://www.iasri.res.in/sscnars/contentmain.htm
How to Renew License Files for SAS 9.2M2?
1. Go to http://stat.iasri.res.in/sscnarsportal/public
2. Click on SAS License Downloads 2011-12. It will redirect to a new page and show a
Yellow Bar below the URL bar. Click on the Yellow Bar and select Download
File. A dialog box showing Open/Save/Cancel will appear. Click on Save and browse to the
desired location for saving the file.
3. Click on the Portal Page link at the top of the page to go back to the main page.
4. Click on How to apply License Files?. It will again redirect to a new page and start the
download automatically, after which a Yellow Bar is shown below the URL bar. Click on the
Yellow Bar and select Download File. A dialog box showing Open/Save/Cancel will appear.
Click on Save and browse to the desired location for saving the file.
http://support.sas.com/kb/31/187.html
Following link is only for Windows 7 and Windows Vista:
http://support.sas.com/kb/31/290.html
SAS 9.3
In SAS 9.3, the default destination in the SAS windowing environment is HTML, and ODS
Graphics is enabled by default. These new defaults have several advantages. Graphs are
integrated with tables, and all output is displayed in the same HTML file using a new style. This
new style, HTML Blue, is an all-color style that is designed to integrate tables and modern
statistical graphics. The default settings in the Results tab are as follows:
The Create listing check box is not selected, so LISTING output is not created.
The Create HTML check box is selected, so HTML output is created.
The Use WORK folder check box is selected, so both HTML and graph image files are
saved in the WORK folder (and not your current directory).
The default style, HTMLBlue, is selected from the Style drop-down list.
The Use ODS Graphics check box is selected, so ODS Graphics is enabled.
Internal browser is selected so results are viewed in an internal SAS browser
We can view and modify the default settings by selecting
Tools → Options → Preferences → Results tab from the menu at the top of the SAS window,
usually known as TOPR, pronounced "topper". Snap shot is as under.
To get SAS listing output instead of HTML, select the Create listing check box and deselect
the Create HTML check box.
Once the Create HTML check box is deselected, "Use WORK folder" gets deselected automatically.
Select View results as they are generated, if ODS Graphics is not required as default
output. In many cases, graphs are an integral part of a data analysis. If we do not need
graphics, ODS Graphics should be disabled, which will improve the performance of our
program in terms of time and memory. One can disable and re-enable ODS Graphics in
SAS programs with the ODS GRAPHICS OFF and ODS GRAPHICS ON statements.
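The same choices can be made from within a program instead of the Preferences dialog. The following is a minimal sketch using ODS statements; SASHELP.CLASS is a sample data set shipped with SAS and is used here only as an assumption to have something to print.

```sas
ods html close;       /* stop writing HTML output      */
ods listing;          /* open the LISTING destination  */
ods graphics off;     /* disable ODS Graphics          */

proc means data=sashelp.class;   /* sample data set, for illustration only */
   var height weight;
run;

ods graphics on;      /* re-enable ODS Graphics        */
ods listing close;    /* close the LISTING destination */
ods html;             /* return to HTML output         */
```

Settings made with ODS statements last for the current session only, whereas changes made in the Results tab are kept as preferences.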
References
Littell, R.C., Freund, R.J. and Spector, P.C. (1991). SAS System for Linear Models, Third
Edition. SAS Institute Inc.
Searle, S.R. (1971). Linear Models. John Wiley & Sons, New York.
Searle, S.R., Casella, G. and McCulloch, C.E. (1992). Variance Components. John
Wiley & Sons, New York.
www.sas.com
www.support.sas.com
www.iasri.res.in/design
www.iasri.res.in/sscnars
http://stat.iasri.res.in/sscnarsportal