Home › Forums › Main Forums › Big Data › How to count the number of NAs in a dataset by SAS/SQL/R/Python?
-
How to count the number of NAs in a dataset by SAS/SQL/R/Python?
-
A student had a job interview with a bank yesterday. They asked her an interesting but challenging question:
How do you count the total number of NAs (missing values) in a dataset with a lot of variables by SAS/SQL/R/Python?
This question is not easy to answer, especially for SAS and SQL. Let’s discuss about it.
-
It is very easy to count the total number of missing values in a data frame with R and Python. Below is the code. Suppose we have a data frame named df.
R code
sum( is.na(df) )
df[ is.na(df) ]<- -999 ### we can replace all the missing values in this way.Python code: we can use either isnull() or is.na() function. Please note: isna()/notna() are same as isnull()/notnull() in Python.
### Below code counts the number of missing values of each column.
df.isnull().sum()
df.isna().sum()
### Below code counts the number of missing values of the whole data frame.
df.isnull().sum().sum()
df.isna().sum().sum()For SAS programming, it is not an easy job. Some people may think of using macro to do it. SAS macro will work it out certainly, but it is too much work. My suggestion is to use macro only when you have to. Actually, we have a much easier and better solution if we use the CMISS() function in SAS. Below gives the SAS code.
/*cmiss() function: This function counts the number of missing arguments.*/
/* argument*: specifies a constant, variable, or expression. Argument can be either a character value or a numeric value.*/
data AAA;
set P1.New_Wireless End=EOF ;
miss=CMISS(of acctno -- sales);
Total_Miss + Miss;
if EOF=1 then output;
run;
************* The correct answer is 371,468 for the above input data. ********;The most challenging task is to use SQL to do it. It is very complicated and we may have to use the “information schema” to do it. Does any one have a better solution?
- This reply was modified 3 years, 10 months ago by Justin.
-
For SQL, we can extract all column names from information_schema. Then concatenate the column names with ‘IS NULL’ in where clause to get the number.
SET @query = CONCAT(
'SELECT COUNT(*) FROM sakila.staff WHERE CONCAT(',
(SELECT GROUP_CONCAT(COLUMN_NAME)
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'sakila' AND
TABLE_NAME = 'staff'),
') IS NULL');
PREPARE CHECK_NULL from @query;
EXECUTE CHECK_NULL;
- This reply was modified 3 years, 10 months ago by Wong.
Log in to reply.