Special Missing Values in SAS (Part 2)
In previous section, we introduced the distinctive features and characteristics of special missing values in SAS. In this section, we will present the neat tips of using special missing values and illustrate with some real work examples.
For example, if we want to use special missing values as filters to subset the data, the below DATA step gives a variety of choices. The MISSING function can detect all kinds of missing values, including both standard and special missing values. We can use the IF…THEN clause to test special missing values and create an indicator called Special. Please note, when we perform calculations and create new numeric variables, all the special missing values will be propagated as standard missing values. If we want to retain these special values, we need to use the second IF…THEN clause to create the new variable Net_Income. In this way, the special missing values will be retained in the derived Net_Income variable, otherwise they will be changed to standard missing values.
****************** Tips of using special missing values. ****************;
data Tips;
set Survey;
/*where Age=.R;*/
/*where Age in (.I, .E);*/
/*where Age in (., .I, .E, .R);*/
/*where missing(Age);*/
/*where missing(Age) and Income=.R;*/
/*where missing(Age) and Age ^=.;*/
/*if Age=.R;*/
/*where Age > ._;*/
if missing(Age) and Age ^=. then Special =1;
else if missing(Age) and Age =. then Special =0;
if not missing(Income) then Net_Income=Income-9800;
else Net_Income=Income;
run;
Code language: SAS (sas)
A real work example of using special missing values is in statistical model vetting. As everyone knows, credit risk scores usually have a range from 300 to 900. We often need to perform risk score validation through decile analysis . However, real input data usually contains invalid values which represent the No Hit or No Score reasons. For decile analysis and other score validations, we must exclude these No Hit or No Score records from analysis. However, for a complete reporting, we need to include and present them in separate tables. To accomplish this task, missing values provide a good solution again.
As shown below, the input Sample data (derived from the SAS course data Develop) has a character variable Score (model score) and a numeric variable Bad (customer performance indicator). Score has values of No Hit, No Score or a number ranging from 300 to 900. In the beginning DATA step, we first create a new numeric score variable named Score_Num based on the Score variable. No Hit and No Score categories are represented by special missing values of .H and .S respectively.
**Work Example: Use of Special Missing Values in Model Score Validation.**;
data Prep;
set Sample;
if Score="No Hit" then Score_Num=.H;
else if Score="No Score" then Score_Num=.S;
else Score_Num=input(Score, 3.);
run;
proc format;
value Score
.H="No Hit"
.S="No Score"
0<-high="Scorable";
value Bad
0="Good"
1="Bad";
value Rank
.H="No Hit"
.S="No Score";
run;
proc freq data=Prep nlevels;
format Score_Num Score. ;
tables Score_Num /missing;
run;
Code language: SAS (sas)
If we apply PROC FREQ along with the custom Score format on this variable, we will get the overall frequency distribution of model scores shown in Table 11.
Then PROC RANK is used to rank the valid model scores and divide them into 10 score buckets. With the use of special missing values, PROC RANK will automatically exclude No Hit and No Score records from binning and set them as separate categories, because it does NOT rank missing values by default. Table 12 demonstrates the frequency distributions of the 10 score buckets along with the No Hit/ No Score categories obtained by using PROC FREQ procedure.
******************** Decile Analysis on Model Score. ********************;
proc rank data=Prep out=Ranked groups=10 ;
var Score_Num;
ranks Rank;
run;
proc freq data=Ranked ;
format Rank Rank.;
tables Rank /missing;
tables Rank*Bad/nocol;
run;
Code language: SAS (sas)
If we do a two-way cross tabulation on Rank and Bad, but without the MISSING option for TABLES statement, it will produce a decile analysis table Table 13. We can see that No Hit and No Score records are excluded from the decile analysis table, and the highlighted numbers are the bad rate for each decile. Please be noted that: in this case, we cannot use numeric missing codes such as negative or zero values to represent the special missing categories, otherwise SAS will include them into the ranking calculations and create problems.
Conclusion
Special values are unique features with SAS, they have many valuable uses in leveraging and analyzing big data. If we have in-depth understanding of them and utilize them appropriately, they can bring significant benefits to our data analysis work with SAS programming.
Note: This article is part of my 2016 SAS paper: Missing Values, They Are NOT Nothing , which I presented it twice at the 2016 SAS global forum at Las Vegas and won the Certificate of Appreciation:
“For sharing your fresh ideas, knowledge and deep expertise with the SAS® users community to make this year’s conference a success. We couldn’t have done it without your valuable contribution.”
This paper uses real-world examples to demonstrate the creative uses of missing values in data analysis and SAS programming. The illustrated methods and advanced programming skills can be used in a wide variety of data analysis and business analytics fields. Please read the paper for more skills and details.
Responses