Calculating Cumulative Number of Library Books Returned

42 Views Asked by At

I created this table ("date_ranges") in Python and uploaded it to an SQL Server:

import pandas as pd
from dateutil.relativedelta import relativedelta

def generate_dates(start_year, end_year):
   
    dates = pd.date_range(start=f'{start_year}-01-01', end=f'{end_year}-12-01', freq='MS')
    
    formatted_dates = dates.strftime('%Y-%m-%d')
    
    return formatted_dates

dates1_df = pd.DataFrame({'Date': pd.to_datetime(generate_dates(2010, 2011)), 'year': 2009, 'start': pd.to_datetime('2010-01-01')})
dates2_df = pd.DataFrame({'Date': pd.to_datetime(generate_dates(2011, 2012)), 'year': 2010, 'start': pd.to_datetime('2011-01-01')})
dates3_df = pd.DataFrame({'Date': pd.to_datetime(generate_dates(2012, 2013)), 'year': 2011, 'start': pd.to_datetime('2012-01-01')})

final_df = pd.concat([dates1_df, dates2_df, dates3_df])


final_df['diff'] = (final_df['Date'] - final_df['start']).dt.days

#rename
date_ranges = final_df
 

    Date  year      start  diff
0  2010-01-01  2009 2010-01-01     0
1  2010-02-01  2009 2010-01-01    31
2  2010-03-01  2009 2010-01-01    59
3  2010-04-01  2009 2010-01-01    90
4  2010-05-01  2009 2010-01-01   120
..        ...   ...        ...   ...
19 2013-08-01  2011 2012-01-01   578
20 2013-09-01  2011 2012-01-01   609
21 2013-10-01  2011 2012-01-01   639
22 2013-11-01  2011 2012-01-01   670
23 2013-12-01  2011 2012-01-01   700

I also have this table of library books ("my_table"):

CREATE TABLE my_table (
    name VARCHAR(50),
    date_library_book_returned DATE,
    year_book_taken_out INT,
    library_book_due_date DATE
);


INSERT INTO  my_table (name, date_library_book_returned, year_book_taken_out, library_book_due_date)
VALUES
    ('john', '2010-05-01', 2009, '2010-03-01'),
    ('john', '2011-07-02', 2010, '2011-03-01'),
    ('john', '2012-05-01', 2011, '2012-03-01'),
    ('jack', '2010-02-01', 2009, '2010-03-01'),
    ('jack', '2011-02-02', 2010, '2011-03-01'),
    ('jack', '2012-02-01', 2011, '2012-03-01'),
    ('jason', NULL, 2009, '2010-03-01'),
    ('jason', NULL, 2010, '2011-03-01'),
    ('jason', NULL, 2011, '2012-03-01'),
    ('jeff', '2013-05-05', 2009, '2010-03-01'),
    ('jeff', '2013-05-05', 2010, '2011-03-01'),
    ('jeff', '2013-05-05', 2011, '2012-03-01');


  name date_library_book_returned year_book_taken_out library_book_due_date
  john                 2010-05-01                2009            2010-03-01
  john                 2011-07-02                2010            2011-03-01
  john                 2012-05-01                2011            2012-03-01
  jack                 2010-02-01                2009            2010-03-01
  jack                 2011-02-02                2010            2011-03-01
  jack                 2012-02-01                2011            2012-03-01
 jason                       NULL                2009            2010-03-01
 jason                       NULL                2010            2011-03-01
 jason                       NULL                2011            2012-03-01
  jeff                 2013-05-05                2009            2010-03-01
  jeff                 2013-05-05                2010            2011-03-01
  jeff                 2013-05-05                2011            2012-03-01

I am trying to accomplish the following:

  • for all books taken out in 2009: what percent (and number) of them were returned by 2010-01-01, what percent (and number) of them were returned by 2010-02-01, what percent (and number) of them were returned by 2010-03-01, etc. all the way to 2012-01-01 (i.e. 2 years)

  • for all books taken out in 2010: what percent (and number) of them were returned by 2011-01-01, what percent (and number) of them were returned by 2011-02-01, what percent (and number) of them were returned by 2011-03-01, etc. all the way to 2013-01-01 (i.e. 2 years)

  • repeat for books taken out in 2011

Originally I was doing this manually, but it was taking too long:

SELECT 
    COUNT(*) AS total_books,
    SUM(CASE WHEN date_library_book_returned <= '2010-01-01' THEN 1 ELSE 0 END) AS returned_by_20100101,
    SUM(CASE WHEN date_library_book_returned <= '2010-02-01' THEN 1 ELSE 0 END) AS returned_by_20100201,
    #### etc etc ####
FROM 
    my_table
WHERE 
    year_book_taken_out = 2009;

I tried to do everything at once with the following code:

SELECT 
    dr.*, 
    COUNT(mt.name) AS num_returned, 
    (SELECT COUNT(*) FROM my_table WHERE year_book_taken_out = dr.year - 1) AS total_books,
    COUNT(mt.name) * 100.0 / (SELECT COUNT(*) FROM my_table WHERE year_book_taken_out = dr.year - 1) AS percent_returned
FROM 
    date_ranges dr 
LEFT JOIN 
    my_table mt 
ON 
    dr.Date >= mt.date_library_book_returned AND mt.year_book_taken_out = dr.year - 1
WHERE 
    dr.year IN (2009, 2010, 2011)
GROUP BY 
    dr.Date
ORDER BY 
    dr.Date;

Is this the correct way to do this?

Note that Netezza is an old SQL language that doesn't support functions like generate_series, list_agg, cross joins (in Netezza we do cross joins on 1=1), recursive queries, correlated queries. This is why I created the reference table in Python prior to the analysis.

0

There are 0 best solutions below