I have a dataset which has an account number, a billing cycle date, and a transaction date. For every account, I need to take the first two occurrences of the billing cycle date and then check whether the transaction date lies between those two billing cycle dates.
Data that I have:

data have;
    input acct_num billing_date :date9. tran_date :date9.;
    format billing_date tran_date date9.;
    datalines;
1111 23FEB2020 27FEB2020
1111 23MAR2020 27FEB2020
2222 17MAR2020 22MAR2020
2222 27APR2020 22MAR2020
2222 27MAY2020 22MAR2020
3333 23JUL2018 02JUN2022
3333 23AUG2018 02JUN2022
;
run;
Data that I want:
1111 23FEB2020 27FEB2020
1111 23MAR2020 27FEB2020
2222 17MAR2020 22MAR2020
2222 27APR2020 22MAR2020
If you look at the final dataset, the third occurrence of account 2222 has been removed, and all rows of account 3333 have been removed because no tran_date fell between the first and second billing cycle dates for that account.
Thanks.
The main problem here is that you want to look ahead one row within each group and then decide to keep or drop both rows simultaneously. We can make this easier by transposing the billing dates into a wide format (one row per account) and then merging that back with the original data:
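A minimal sketch of that step, assuming the input dataset is called have as read above (billing_wide is my name for the transposed output):

proc sort data=have;
    by acct_num;
run;

/* One row per acct_num; the billing dates spread across COL1, COL2, ... */
proc transpose data=have out=billing_wide;
    by acct_num;
    var billing_date;
run;

For account 2222, for example, this yields COL1=17MAR2020, COL2=27APR2020, and COL3=27MAY2020.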
All we care about are the first two rows.
COL1 and COL2 represent those first two rows. Now we can merge this back to our original dataset and check if the transaction date is between COL1 and COL2 for every acct_num. Since we only need to keep the first two rows of data, we're going to count the number of rows, n, and keep track of it for every group. At the start of each group we'll reset the counter. We will only output a row if:

- n is < 3
- tran_date is between COL1 and COL2
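A sketch of that merge-and-filter step, continuing from billing_wide above (want is my name for the result):

data want;
    merge have billing_wide;
    by acct_num;

    /* reset the row counter at the start of each group */
    if first.acct_num then n = 0;
    n + 1;

    /* keep only the first two rows per account whose tran_date
       falls between the first two billing dates */
    if n < 3 and COL1 <= tran_date <= COL2 then output;

    drop n _NAME_ COL:;
run;

On the sample data this keeps exactly the four rows shown in the desired output.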
To do this with lags, you'd still need two steps: first create a list of values to drop, and second create the final filtered dataset:
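Here's what that two-step lag approach could look like (drop_list is my name; lag(billing_date) on the second row of a group returns the group's first billing date because the lag runs on every iteration):

/* Step 1: flag accounts whose tran_date does not fall between
   the first two billing dates */
data drop_list;
    set have;
    by acct_num;
    prev_billing = lag(billing_date);
    if first.acct_num then n = 0;
    n + 1;
    if n = 2 and not (prev_billing <= tran_date <= billing_date) then output;
    keep acct_num;
run;

/* Step 2: keep the first two rows per account, skipping
   any account on the drop list */
data want;
    merge have(in=inHave) drop_list(in=inDrop);
    by acct_num;
    if first.acct_num then n = 0;
    n + 1;
    if inHave and not inDrop and n < 3;
    drop n;
run;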
If you want a really crazy way of doing this in one step with hash tables, here's that method:
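This is my sketch of the idea rather than a guaranteed match for the original code: in a single data step, preload every billing date into a hash keyed by account and within-account row number, then look up each account's first two billing dates while reading the data once. Variable names (b, bill1, bill2, seq) are mine:

data want;
    /* On the first iteration, load every billing date into a hash
       keyed by account and within-account row number */
    if _n_ = 1 then do;
        declare hash h();
        h.defineKey('acct', 'row');
        h.defineData('b');
        h.defineDone();

        prev = .;
        do until(eof);
            set have(keep=acct_num billing_date
                     rename=(acct_num=acct billing_date=b)) end=eof;
            if acct ne prev then row = 0;
            row + 1;
            prev = acct;
            rc = h.add();
        end;
    end;

    set have;
    by acct_num;
    if first.acct_num then seq = 0;
    seq + 1;

    /* look up the first two billing dates for this account */
    acct = acct_num;
    row = 1;
    if h.find() ne 0 then b = .;
    bill1 = b;
    row = 2;
    if h.find() ne 0 then b = .;
    bill2 = b;

    /* keep only the first two rows whose tran_date lies
       between those two billing dates */
    if seq < 3 and nmiss(bill1, bill2) = 0
        and bill1 <= tran_date <= bill2;

    drop acct row prev rc b seq bill1 bill2;
run;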