Finding records which might indicate fraudulent activities with SQL

87 Views Asked by At

I have following tables:

CREATE TABLE IF NOT EXISTS auth_user
(
    id SERIAL PRIMARY KEY,
    username VARCHAR(30),
    first_name VARCHAR(30),
    last_name VARCHAR(30)
);

CREATE TABLE IF NOT EXISTS statistics_restaccesslog
(
    id SERIAL PRIMARY KEY,
    user_id INTEGER,
    url VARCHAR(200),
    course_id INTEGER,
    date_visited TIMESTAMP,
    ip_address VARCHAR(16)
);

Consider below records:

INSERT INTO auth_user (username, first_name, last_name)
VALUES
    ('user1', 'John', 'Doe'),
    ('user2', 'Jane', 'Smith'),
    ('user3', 'Michael', 'Johnson'),
    ('user4', 'Emily', 'Brown'),
    ('user5', 'David', 'Miller'),
    ('user6', 'Olivia', 'Davis'),
    ('user7', 'Daniel', 'Wilson'),
    ('user8', 'Sophia', 'Anderson'),
    ('user9', 'Andrew', 'Taylor'),
    ('user10', 'Emma', 'Thomas');

INSERT INTO statistics_restaccesslog (user_id, url, course_id, date_visited, ip_address)
VALUES
    (1, 'https://example.com/page', 1, CURRENT_TIMESTAMP, '192.168.0.1'),
    (1, 'https://example.com/page', 1, CURRENT_TIMESTAMP, '192.168.0.2'),
    (3, 'https://example.com/page', 1, CURRENT_TIMESTAMP, '192.168.0.2'),
    (4, 'https://example.com/page', 1, CURRENT_TIMESTAMP, '192.168.0.3'),
    (5, 'https://example.com/page', 1, CURRENT_TIMESTAMP, '192.168.0.4'),
    (6, 'https://example.com/page', 2, CURRENT_TIMESTAMP, '192.168.0.5'),
    (7, 'https://example.com/page', 2, CURRENT_TIMESTAMP, '192.168.0.6'),
    (8, 'https://example.com/page', 3, CURRENT_TIMESTAMP, '192.168.0.7'),
    (9, 'https://example.com/page', 4, CURRENT_TIMESTAMP, '192.168.0.8'),
    (10, 'https://example.com/page', 5, CURRENT_TIMESTAMP, '192.168.0.9');

Point 1: Note the records with same combination of (user_id, url) for two different ip_addresses:

    (1, 'https://example.com/page', 1, CURRENT_TIMESTAMP, '192.168.0.1')
    (1, 'https://example.com/page', 1, CURRENT_TIMESTAMP, '192.168.0.2')

Point 2: Note the records with same ip_address with two different combination of (user_id, url):

    (1, 'https://example.com/page', 1, CURRENT_TIMESTAMP, '192.168.0.2')
    (3, 'https://example.com/page', 1, CURRENT_TIMESTAMP, '192.168.0.2')

I want to build SQL query which will return all such records. I tried following:

SELECT rat.id, rat.user_id, rat.url, rat.course_id, rat.date_visited, 
       rat.ip_address, u.username, CONCAT(u.first_name, ' ', u.last_name) AS name
    FROM statistics_restaccesslog rat, 
        auth_user u
    WHERE ((url, user_id) IN (
        SELECT url, user_id
        FROM statistics_restaccesslog
        GROUP BY url, user_id
        HAVING COUNT(DISTINCT ip_address) > 1
    ) 
    OR ip_address IN (
        SELECT ip_address
        FROM statistics_restaccesslog
        GROUP BY ip_address
        HAVING COUNT(DISTINCT user_id) > 1
    ))
    AND
    rat.user_id = u.id
    ;

On db-fiddle, it correctly returns first three records:

enter image description here

My real database have following records:

enter image description here

(The greyed out urls are exactly the same.)

However, when I run above query, it returns only first record, that is one with id=17. Record with id=18 is not returned, even though it satisfies point 2. Why is this so? What I am missing?

Update

I have added four more records to statistics_restaccesslog:

-- same ip_address, two different (user_id, url)
(17, '/module1/api/rest1/', 1019, CURRENT_TIMESTAMP, '10.0.0.3'),
(18, '/module1/api/rest1/', 1019, CURRENT_TIMESTAMP, '10.0.0.3'),

-- same (user_id, url), two different ip_address
(19, '/module1/api/rest1/', 1019, CURRENT_TIMESTAMP, '10.0.0.4'),
(19, '/module1/api/rest1/', 1019, CURRENT_TIMESTAMP, '10.0.0.5');

First two satisfy point 2 and last two satisfy point 1. So all four records should appear in the query output. But they dont. The query output stays the same! I am not getting why these records are not coming in the output.

Also below is equivalent query with JOIN. It also results in similar behavior:

SELECT s.user_id, s.url, s.course_id, s.date_visited, s.ip_address, u.username, u.first_name, u.last_name
FROM statistics_restaccesslog AS s
JOIN auth_user AS u ON s.user_id = u.id
WHERE (s.url, s.user_id) IN (
    SELECT url, user_id
    FROM statistics_restaccesslog
    GROUP BY url, user_id
    HAVING COUNT(DISTINCT ip_address) > 1
)
OR s.ip_address IN (
    SELECT ip_address
    FROM statistics_restaccesslog
    GROUP BY ip_address
    HAVING COUNT(DISTINCT user_id) > 1
);

PS: Am using postgresql 11

1

There are 1 best solutions below

1
marcothesane On

Find the "keys" of the ip-address, resp. the user_id/url combination that matches the criteria using a GROUP BY Query and inner join that group by query back to the input table.

You need to formulate a clever HAVING clause:

  • COUNT(*) > 1 AND ...
  • MAX() <> MIN() of the differing information.

For a user_id >< url combination, you need to make a single expression of the two to make a MIN() / MAX() out of it. I used casting the user_id to VARCHAR and concatenating that with url .

I am almost completely sure you don't get the data you're looking for because there are no auth_user rows with the id as the user_ids returned from the two CTE-s.

To make my point, I left join the result tables to auth_user, and add the user_id column to the result table:

WITH
mult_usr_url AS (
  SELECT
    user_id
  , url
  FROM statistics_restaccesslog
  GROUP BY 1,2
  HAVING COUNT(*) > 1
     AND MAX(ip_address) <> MIN(ip_address)
)
,
mult_ip_addr AS (
  SELECT
    ip_address
  FROM statistics_restaccesslog
  GROUP BY 1
  HAVING COUNT(*) > 1
     AND MAX(user_id::VARCHAR(3)||ip_address) <> MIN(user_id::VARCHAR(3)||ip_address) 
)
SELECT
  s.id
, s.user_id
, s.url
, s.course_id
, s.date_visited
, s.ip_address
, u.username
, u.first_name||' '||u.last_name AS fullname
FROM statistics_restaccesslog s
JOIN mult_usr_url USING(user_id,url)
LEFT JOIN auth_user u ON u.id=s.user_id
UNION
SELECT
  s.id
, s.user_id
, s.url
, s.course_id
, s.date_visited
, s.ip_address
, u.username
, u.first_name||' '||u.last_name AS fullname
FROM statistics_restaccesslog s
JOIN mult_ip_addr USING(ip_address)
LEFT JOIN auth_user u ON u.id=s.user_id
ORDER BY 1
;
id user_id url course_id date_visited ip_address username fullname
1 1 https://example.com/page 1 2023-06-03 23:55:57.122415 192.168.0.1 user1 John Doe
2 1 https://example.com/page 1 2023-06-03 23:55:57.122415 192.168.0.2 user1 John Doe
3 3 https://example.com/page 1 2023-06-03 23:55:57.122415 192.168.0.2 user3 Michael Johnson
11 17 /module1/api/rest1/ 1,019 2023-06-03 23:55:57.122415 10.0.0.3 (null) (null)
12 18 /module1/api/rest1/ 1,019 2023-06-03 23:55:57.122415 10.0.0.3 (null) (null)
13 19 /module1/api/rest1/ 1,019 2023-06-03 23:55:57.122415 10.0.0.4 (null) (null)
14 19 /module1/api/rest1/ 1,019 2023-06-03 23:55:57.122415 10.0.0.5 (null) (null)