median with hash table in sas 9.4

52 Views Asked by At

I want to calculate a median by groups for 100 variables with hash tables.

I have find this to calculate the median of invoice on sashelp.cars, but if I want to do the median by Make and Model, for example, how do I adapt it?

data percentiles ; 
keep percentile Invoice ;
format percentile percent5.;

dcl hash ptiles (dataset: "sashelp.cars(where=(Invoice gt 0))",multidata:"Y",ordered:"A");
ptiles.definekey("Invoice");
ptiles.definedone();

declare hiter iterP ("ptiles");

array _ptiles(6) _temporary_ (.5 .05 .1 .25 .75 .95);
call sortn(of _ptiles(*));

num_items=ptiles.num_items;

do i=1 to dim (_ptiles);
    percentile=_ptiles(i);
    do while (Counter lt percentile*num_items);
    Counter+1;
    iterP.next();
end;
output;
end;
stop;
set sashelp.cars;
run;

In fact, in my reel data i want to calculate the median of 100 variables. Actually , i do this with proc univariate but it is so long (>12hours)

2

There are 2 best solutions below

0
Quentin On

This is not a good idea. You are very unlikely to write a DATA step like this that will be faster than PROC MEANS:

proc means data=sashelp.cars p5 p10 p25 p50 p75 p95;
  var Invoice ;
run ;

Note PROC MEANS may be significantly faster than PROC UNIVARIATE, because it does less work. You could also try a simple SQL step.

If your UNIVARIATE step is taking a long time, you probably have lots of data. On My PC, I ran your DATA step and PROC MEANS using as input:

data cars ;
  set sashelp.cars ;
  do i=1 to 100000 ;
    output ;
  end ;
run ;

Your DATA step took 8 seconds, PROC MEANS took 2 seconds.

Also note that your DATA step has a very simplistic approach for calculating percentiles, which is not handling ties. If you compare the results from your step they will not match results from PROC UNIVARIATE or PROC MEANS.

There are lots of efficiency factors to consider when working with SAS (where is the data? Is there a network involved? etc.) But generally you don't want to hand-code something that is provided by SAS as a feature.

0
Stu Sztukowski On

Rather than use hash tables calculating the percentile for every value, consider using hpsummary or means to calculate the median with the option qmethod=p2 to greatly improve efficiency (link). hpsummary is designed for big data, but means is also multithreaded, so try both and see which has the best performance. Check out the performance on this 10M row table with 100 variables using qmethod=p2:

data have;
    array var[100];
    do i = 1 to 10000000;
        do j = 1 to 100;
            var[j] = rand('normal');
        end;
        output;
    end;
run;

proc hpsummary data=have qmethod=p2;
    var var:;
    output out=want
        median=;
quit;

This took 1 minute and 5 seconds with 4 threads on a machine with 16GB of RAM.

NOTE: There were 10000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 1 observations and 102 variables.
NOTE: PROCEDURE HPSUMMARY used (Total process time):
      real time           1:05.69
      cpu time            4:05.92

You can also use qmethod=p2 with proc means:

proc means data=have qmethod=p2 noprint;
    var var:;
    output out=want
        median=;
quit;
NOTE: There were 10000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 1 observations and 102 variables.
NOTE: PROCEDURE MEANS used (Total process time):
      real time           46.33 seconds
      cpu time            3:05.50