How to delete variables with huge percent of missings in table in SAS?

410 Views Asked by At

I have table in SAS with missing values like below:

col1 | col2 | col3 | ... | coln
-----|------|------|-----|-------
111  |      | abc  | ... | abc
222  | 11   | C1   | ... | 11
333  | 18   |      | ... | 12
...  | ...  | ...  | ... | ...

And I need to delete from above table variables where is more than 80% missing values (>=80%).

How can I do taht in SAS ?

1

There are 1 best solutions below

1
Stu Sztukowski On BEST ANSWER

The macro below will create a macro variable named &drop_vars that holds a list of variables to drop from your dataset that exceed missing threshold. This works for both character and numeric variables. If you have a ton of them then this macro will fail but it can easily be modified to handle any number of variables. You can save and reuse this macro.

%macro get_missing_vars(lib=, dsn=, threshold=);
    %global drop_vars;

    /* Generate a select statement that calculates the proportion missing:
       nmiss(var1)/count(*) as var1, nmiss(var2)/count(*) as var2, ... */   
    proc sql noprint;
        select cat('nmiss(', strip(name), ')/count(*) as ', strip(name) )
        into :calculate_pct_missing separated by ','
        from dictionary.columns
        where     libname = upcase("&lib")
              AND memname = upcase("&dsn")
        ;
    quit;

    /* Calculate the percent missing */
    proc sql;
        create table pct_missing as
            select &calculate_pct_missing.
            from &lib..&dsn.
        ;
    quit;

    /* Convert to a long table */
    proc transpose data=pct_missing out=drop_list;
        var _NUMERIC_;
    run;

    /* Get a list of variables to drop that are >= the drop threshold */
    proc sql noprint;
        select _NAME_
        into :drop_vars separated by ' '
        from drop_list
        where COL1 GE &threshold.
        ;
    quit;
%mend;

It has three parameters:

lib: Library of your dataset

dsn: Dataset name without the library

threshold: Proportion of missing values a variable must meet or exceed to be dropped

For example, let's generate some sample data and use this. col1 col2 col3 all have 80% missing values.

data have;
    array col[10];
    do i = 1 to 10;
        do j = 1 to 10;
            col[j] = i;
            if(i > 2 AND j in(1, 2, 3) ) then col[j] = .;
        end;
        
        output;
    end;

    drop i j;
run; 

We'll run the macro and check the log:

%get_missing_vars(lib=work, dsn=have, threshold=0.8);

%put &drop_vars;

The log shows:

col1 col2 col3

Now we can pass this into a simple data step.

data want;
    set have;
    drop &drop_vars;
run;