What to do with groups of a very small size during EDA?


I'm a beginner data analyst (Python) and I'm currently working on two study projects. I'm stuck on this issue: while exploring different categories in my datasets, I find groups of a very small size (fewer than 100 observations in a dataset of 20,000 rows). The question is: what shall I do with these groups? Shall I delete them, keep them, or put them in a separate DataFrame?

The projects are the following:

Project 1. There's data on bank clients (age, education, family status, number of children, total income, source of income, what the loan was for (e.g. car or flat), whether there's been a failure to pay it back (Y/N)) - 21 500 rows.

The task is to find out what influences whether the loan will be repaid on time. The findings will be used to create a loan scoring model. (Do I understand it right that this dataset should then be prepared for a classification algorithm?)

In this dataset, for example, there are only 5 clients with 5 children and only 2 entrepreneurs. Shall I just delete them and mention 'there's not enough data on these groups of customers'?
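
To make it concrete, here's a minimal sketch of how I'm spotting these rare categories with pandas (toy data; the column names `children` and `income_type` are my guesses at what such a dataset might look like):

```python
import pandas as pd

# Toy stand-in for the bank-clients dataset; column names are assumed
df = pd.DataFrame({
    "children": [0, 1, 2, 5, 0, 1, 5, 0],
    "income_type": ["employee", "employee", "business", "entrepreneur",
                    "employee", "retiree", "entrepreneur", "employee"],
})

# Size of each group; rare categories end up at the bottom
counts = df["income_type"].value_counts()
print(counts)

# Flag categories below a minimum support threshold (threshold is arbitrary here)
rare = counts[counts < 3].index.tolist()
print(rare)
```

On the real dataset the threshold would obviously be larger (e.g. under 100 rows out of 21,500), but the question stays the same: what to do with the rows that end up in `rare`.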

Project 2. There's data on real estate properties that were sold (total area, price, number of rooms, location, etc.) - 23 700 rows. The task is to establish the parameters that determine the market price of a given property. Then an anomaly detection model should be built.

In this dataset, for example, there are three boolean columns describing the property type (is_studio, is_apartment and open_plan) where True values constitute only about 1%. Shall I just drop these columns? But what if there's something important in them for anomaly detection?
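
For reference, this is roughly how I'm measuring how rare the True values are (toy data; the flag column names are the real ones, the counts and the `price` column are made up):

```python
import pandas as pd

# Toy stand-in for the real-estate dataset: 100 rows,
# with True shares of 2%, 1% and 3% by construction
df = pd.DataFrame({
    "is_studio":    [False] * 98 + [True] * 2,
    "is_apartment": [False] * 99 + [True] * 1,
    "open_plan":    [False] * 97 + [True] * 3,
    "price":        range(100),
})

# Share of True values per flag column (mean of a boolean column)
flag_cols = ["is_studio", "is_apartment", "open_plan"]
true_share = df[flag_cols].mean()
print(true_share)
```

In the real dataset each of these shares comes out around 0.01, which is what makes me unsure whether the columns carry any signal worth keeping.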

I would greatly appreciate some help with this issue, as I haven't found any answers on the internet and I don't remember it being mentioned in the courses I've taken. Thanks!
