I am trying to solve an ML problem: predicting whether a person will deliver an order or not. The dataset is highly imbalanced. Here is a glimpse of my dataset:
[{'order_id': '1bjhtj', 'Delivery Guy': 'John', 'Target': 0},
{'order_id': '1aec', 'Delivery Guy': 'John', 'Target': 0},
{'order_id': '1cgfd', 'Delivery Guy': 'John', 'Target': 0},
{'order_id': '1bceg', 'Delivery Guy': 'Tom', 'Target': 0},
{'order_id': '1a2fg', 'Delivery Guy': 'Tom', 'Target': 0},
{'order_id': '1cbsf', 'Delivery Guy': 'Tom', 'Target': 1},
{'order_id': '1bc5', 'Delivery Guy': 'Jay', 'Target': 0},
{'order_id': '1a22', 'Delivery Guy': 'Jay', 'Target': 0},
{'order_id': '1bzc5', 'Delivery Guy': 'Jay', 'Target': 0},
{'order_id': '1av22', 'Delivery Guy': 'Jay', 'Target': 0},
{'order_id': '1bsc5', 'Delivery Guy': 'Jay', 'Target': 1},
{'order_id': '1a2t2', 'Delivery Guy': 'Jay', 'Target': 0},
{'order_id': '1bc5b', 'Delivery Guy': 'Jay', 'Target': 0},
{'order_id': '1a22a', 'Delivery Guy': 'Mary', 'Target': 0},
{'order_id': '1c5bv', 'Delivery Guy': 'Mary', 'Target': 0},
{'order_id': 'vb2er', 'Delivery Guy': 'Mary', 'Target': 0},
{'order_id': '1bs5s', 'Delivery Guy': 'Mary', 'Target': 0},
{'order_id': '1a22n', 'Delivery Guy': 'Mary', 'Target': 0},
{'order_id': '122a', 'Delivery Guy': 'James', 'Target': 1},
{'order_id': '1cw5bv', 'Delivery Guy': 'James', 'Target': 0},
{'order_id': 'vb=er', 'Delivery Guy': 'James', 'Target': 0},
{'order_id': '1b5s', 'Delivery Guy': 'James', 'Target': 0},
{'order_id': '1a2n', 'Delivery Guy': 'James', 'Target': 1}]
This is my table:
| order_id | Delivery Guy | Target |
|----------|--------------|--------|
| 1bjhtj | John | 0 |
| 1aec | John | 0 |
| 1cgfd | John | 0 |
| 1bceg | Tom | 0 |
| 1a2fg | Tom | 0 |
| 1cbsf | Tom | 1 |
| 1bc5 | Jay | 0 |
| 1a22 | Jay | 0 |
| 1bzc5 | Jay | 0 |
| 1av22 | Jay | 0 |
| 1bsc5 | Jay | 1 |
| 1a2t2 | Jay | 0 |
| 1bc5b | Jay | 0 |
| 1a22a | Mary | 0 |
| 1c5bv | Mary | 0 |
| vb2er | Mary | 0 |
| 1bs5s | Mary | 0 |
| 1a22n | Mary | 0 |
| 122a | James | 1 |
| 1cw5bv | James | 0 |
| vb=er | James | 0 |
| 1b5s | James | 0 |
| 1a2n | James | 1 |
I want my machine learning model to learn each person's attributes and predict these two cases: will deliver ("0") and will not deliver ("1").
I want to split my data into train and test sets so that each set preserves a few rows of every name and a few rows of each Target class, so that the model learns all the patterns.
I have used this so far:
from sklearn.model_selection import train_test_split

X = df.drop(columns="Target")
y = df["Target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, stratify=y)
It does give me rows for each Delivery Guy, but it misses the case where 'James' should be split so that one "1" is in train and the other "1" is in test. Could anyone help me approach this problem in a different way?
Here's an approach to ensure that:

- Every "Delivery Guy" is represented in both the training and test sets.
- Each "Target" class is adequately represented in both sets.

Step 1: Split Data by "Delivery Guy"
Step 2: For Each Group, Further Split by "Target"
Step 3: Allocate Train/Test Data
Step 4: Combine Data Back
After allocating both "Target" classes for each "Delivery Guy" to both sets, combine these allocations back into your final training and test sets.
Here's how you could implement this in Python:
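A minimal sketch of the four steps, assuming pandas is available. The function name `split_by_group_and_target`, its parameters, and the rule that a lone row in a ("Delivery Guy", "Target") cell goes to train are assumptions made for illustration; adjust them to your needs.

```python
import pandas as pd

# Sample data from the question.
rows = [
    ("1bjhtj", "John", 0), ("1aec", "John", 0), ("1cgfd", "John", 0),
    ("1bceg", "Tom", 0), ("1a2fg", "Tom", 0), ("1cbsf", "Tom", 1),
    ("1bc5", "Jay", 0), ("1a22", "Jay", 0), ("1bzc5", "Jay", 0),
    ("1av22", "Jay", 0), ("1bsc5", "Jay", 1), ("1a2t2", "Jay", 0),
    ("1bc5b", "Jay", 0),
    ("1a22a", "Mary", 0), ("1c5bv", "Mary", 0), ("vb2er", "Mary", 0),
    ("1bs5s", "Mary", 0), ("1a22n", "Mary", 0),
    ("122a", "James", 1), ("1cw5bv", "James", 0), ("vb=er", "James", 0),
    ("1b5s", "James", 0), ("1a2n", "James", 1),
]
df = pd.DataFrame(rows, columns=["order_id", "Delivery Guy", "Target"])


def split_by_group_and_target(df, group_col="Delivery Guy", target_col="Target",
                              train_size=0.7, random_state=42):
    """Split every (group, target) cell separately, then recombine the pieces."""
    train_parts, test_parts = [], []

    # Steps 1 and 2: one cell per (delivery guy, target) combination.
    for _, cell in df.groupby([group_col, target_col], sort=False):
        # Shuffle the cell so the allocation is random but reproducible.
        cell = cell.sample(frac=1, random_state=random_state)
        n = len(cell)

        # Step 3: allocate rows to train and test.
        if n == 1:
            # Assumption: a single row cannot be in both sets, so it goes to train.
            n_train = 1
        else:
            # Roughly train_size of the rows, keeping at least one row on each side.
            n_train = min(max(int(round(n * train_size)), 1), n - 1)

        train_parts.append(cell.iloc[:n_train])
        test_parts.append(cell.iloc[n_train:])

    # Step 4: combine the pieces and shuffle the final sets.
    train = pd.concat(train_parts).sample(frac=1, random_state=random_state)
    test = pd.concat(test_parts).sample(frac=1, random_state=random_state)
    return train, test


train, test = split_by_group_and_target(df)

X_train, y_train = train.drop(columns="Target"), train["Target"]
X_test, y_test = test.drop(columns="Target"), test["Target"]

# Inspect how each delivery guy's classes ended up in the two sets.
print(train.groupby(["Delivery Guy", "Target"]).size())
print(test.groupby(["Delivery Guy", "Target"]).size())
```

In the sample data, Tom and Jay each have only one "1" row, so that row can land in just one set (train, under the assumption above). James has two "1" rows, so one goes to train and the other to test.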