Special Missing Values in SAS (Part 1)
Special missing values are a type of numeric missing value that lets us represent different categories of missing data. They are fairly unique to SAS and very useful in some circumstances. Unfortunately, base R and Python have no built-in equivalent. This is one of the beauties of SAS as commercial software, though it is very expensive.
In data analysis, sometimes we need to use special missing values to analyze and distinguish different types of missing data. Under these circumstances, the different categories of missing data have their own business meanings and provide useful information for our analytics. If we ignore them or combine them into one category, it will result in a loss of information or analytic bias.
SAS provides 27 special missing values, written as a period followed by a letter A-Z or an underscore: ._, .A, .B, .C, …, .X, .Y, .Z. Their sort order is ._ < . (standard missing) < .A < .B < … < .Z, all of them sort below any number, and they are all treated as FALSE in Boolean logic. We can use any of the 26 letters of the alphabet (uppercase or lowercase; they are not case sensitive) or the underscore (_), but only a single character. Please note that special missing values are available only for numeric variables. They are processed correctly as missing in DATA step and procedure calculations, while still distinguishing among different types of missing data.
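The ordering and Boolean behavior described above can be checked with a short DATA step. This is only a minimal sketch: the variable names a, b, c, d and n_missing are made up for illustration, and the messages are written to the SAS log.

```sas
data _null_;
   a = ._;   /* underscore special missing */
   b = .;    /* standard missing           */
   c = .A;   /* special missing A          */
   d = .z;   /* lowercase is accepted; same value as .Z */

   /* Sort order: ._ < . < .A < .Z, and all of them are below any number */
   if a < b and b < c and c < d and d < -999999 then
      put "Order holds: ._ < . < .A < .Z < any number";

   /* MISSING() returns 1 for every kind of numeric missing value */
   n_missing = missing(a) + missing(b) + missing(c) + missing(d);
   put n_missing=;

   /* A missing value counts as FALSE in a Boolean test */
   if c then put ".A is treated as TRUE";
   else put ".A is treated as FALSE";

   /* Equality tests tell the different missing types apart */
   if c = . then put ".A equals standard missing";
   else put ".A differs from standard missing";
run;
```

One practical consequence of the sort order is that a test such as `Age = .` matches only the standard missing value, whereas `missing(Age)` matches all of them.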
For example, in clinical trials, data censoring is a common problem. Data censoring refers to the missing data that occur when participating patients fail to complete the study and drop out without further measurements. Handling these missing data is usually both challenging and complicated. There are different reasons for data censoring, such as death, adverse reactions, unpleasant study procedures, lack of improvement, early recovery, and other factors related or unrelated to the trial procedure and treatments. These different categories of missing data are informative in nature: analyzing them may provide useful information for the clinical study and help to reduce study bias. If we ignore these missing values or treat them all the same, we may lose important information for our study. Instead, we should record and characterize them with different indicators for further data analysis. In SAS, special missing values are the natural way to represent and analyze them.
As shown below, we use a Survey data set to illustrate the use of special missing values. In this survey, Age and Income are private and sensitive information. They may have missing values for different reasons, such as:
1) Invalid or out-of-range data. For example, zero, negative or extreme values (>200) for age.
2) Skip patterns.
3) The survey subject refuses to provide.
4) The interviewer forgot to ask.
5) Any other reasons.
These different categories of missing values have their own meanings, and analyzing them helps us to understand the validity of the study and reduce bias in the analytic results. We cannot ignore them or treat them all the same.
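As a minimal sketch of the first category in the list above (invalid or out-of-range data), a special missing value can also be assigned directly in a DATA step. The data set names Survey_raw and Survey_clean and the sample values are hypothetical and used only for illustration.

```sas
/* Hypothetical raw data with some out-of-range ages */
data Survey_raw;
   input Survey_ID Age;
   datalines;
2001 34
2002 0
2003 -5
2004 250
2005 .
;
run;

data Survey_clean;
   set Survey_raw;
   /* Missing values sort below every number, so Age <= 0 alone would  */
   /* also match them. The MISSING() guard leaves existing missing     */
   /* values untouched and flags only the invalid non-missing ages.    */
   if not missing(Age) and (Age <= 0 or Age > 200) then Age = .I;
run;

proc print data=Survey_clean;
run;
```

Here the ages 0, -5 and 250 become .I, while the standard missing value for 2005 stays as it is, so the two kinds of missing data remain distinguishable.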
To read special missing values from raw text files or from in-stream data (CARDS or DATALINES), we must use the MISSING statement. It declares which characters in the input data represent special missing values for numeric variables. Please note that the MISSING statement is a global statement, so it can appear anywhere in a SAS program. Table 8 shows the printout of the Survey data set.
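********************** Special Missing Values.***************************;
data Survey;
   * The double trailing at sign reads several observations per line;
   * MISSOVER is not used here because SAS rejects it together with a double trailing at sign;
   input Survey_ID Age Income @@;
   format Income dollar12.0;
   * F, I, N and R in the raw data are read as .F, .I, .N and .R;
   missing F I N R;
   datalines;
1001 I 35000 1002 62 R 1003 36 42000 1004 F R
1005 47 76300 1006 29 N 1007 R 48200 1008 55 63000
1009 43 F 1010 R 58600 1011 . 39200 1012 25 .
;
run;

* Custom formats that label each missing category and group the valid values;
proc format;
   value Age
      .       = "Skip pattern"
      .I      = "Invalid"
      .F      = "Forgot to ask"
      .R      = "Refuse to provide"
      low-<20 = "< 20"
      20-40   = "20-40"
      41-60   = "41-60"
      61-high = "> 60";
   value Income
      .           = "Skip pattern"
      .F          = "Forgot to ask"
      .N          = "Not applicable"
      .R          = "Refuse to provide"
      low-<40000  = "< $40K"
      40000-60000 = "$40-60K"
      60000<-high = "> $60K";
run;

* The MISSING option keeps each missing category as its own level;
proc freq data=Survey;
   format Age Age. Income Income.;
   tables Age Income / missing;
run;

* All missing categories are excluded from the statistics and counted in NMISS;
proc means data=Survey N Nmiss Min Max Mean STD maxdec=1;
   var Age Income;
run;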
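As shown above, the missing data are represented by the special missing values I, F, N and R, together with the standard missing value (.). These special missing values allow us to distinguish among different categories of missing data and still perform calculations correctly. When we apply PROC FREQ (with the MISSING option) and PROC MEANS to the data set along with the custom formats, PROC FREQ reports each category of missing value as its own level, while PROC MEANS excludes all of them from the statistics and counts them under NMISS.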
It is important to note that special missing values have distinct advantages over numeric missing codes such as Forgot to ask = -9, Invalid = -8, Refuse to provide = -7, etc. These numeric missing codes cause the following problems and lead to incorrect analytic results:
- SAS will NOT treat them as missing when performing calculations or analysis. Instead, it will include them in numeric calculations.
- They may be included accidentally in format ranges.
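The first problem can be illustrated with a small comparison. This is a sketch on made-up data: the data set Coded and the variables Age_coded and Age_special are hypothetical, holding the same four answers once with numeric codes (-9, -7) and once with special missing values (F, R).

```sas
data Coded;
   missing F R;   /* read F and R as .F and .R */
   input Age_coded Age_special;
   datalines;
30 30
-9 F
50 50
-7 R
;
run;

/* Age_coded:   N=4, NMISS=0, mean = (30 - 9 + 50 - 7) / 4 = 16.0 */
/* Age_special: N=2, NMISS=2, mean = (30 + 50) / 2 = 40.0         */
proc means data=Coded n nmiss mean maxdec=1;
   var Age_coded Age_special;
run;
```

With numeric codes the mean age is dragged down to 16.0, whereas the special missing values are left out of the calculation and the mean stays at 40.0.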
Special missing values avoid both problems. Unfortunately, base R and Python have no built-in counterpart, which can be inconvenient in practice. This is one of the beauties of SAS as commercial software.