Right now i am working with a excel sheet, which contains some columns which presents links to previous tasks and following tasks of an project shedule.
My goal is to analyze these links with the Panda library in Python, unfortunately these examples tends to fill columns with a large entry of values.
For a better Overview Jupyter is also included. As an IDE i use Visual Studio Code.
Here is an example how these columns looks, these are the first two entries of the table:
| Vorgänger | Nachfolger |
|---|---|
| 221;215;715EA+2 Tage;811EA+3 Tage;816;254;261;268;275;282;289;296;311;437;362EA+6 Tage;524;450;454;352EA+2 Tage;372EA+18 Tage;381EA+12 Tage;663;652;469EA+1 Tag;687;402EA+8 Tage;499;540;637EA+1 Tag;666;669;672;674;685EA+3 Tage;737EA+3 Tage;786;927EA+6 Ta... | 54;20 |
| 26;944 |
I got several entries in both columns as far as i can see the first has the largest quantity in that table.
With the following code i read in the sheet:
import pandas as pd
import numpy as np
from datetime import datetime
from pandas import *
# Einlesen der Excel, 04.01.: 100 Einträge der Spalte Nachfolger von Analyse-DCMA-RKE Neu -07-2017.xlsx Sheet Tabelle2
excel_file = 'F:\\Google Drive\\Matthias\\Arbeit\\ISQM\\08 - Tool Menü - Skripte DD 19.12.2023\\Excel\\Gesamtauszug Primavera - dd 27.10.2023ohne3.xlsx'
xlsx = ExcelFile(excel_file)
excel_Tabelle = xlsx.sheet_names[0]
df = xlsx.parse(excel_Tabelle)
For the following section i separate the columns i work on:
# Speichern Spalte Nachfolger und Vorgänger, Komplette/Inkomplette Tasks
df_1 = df['Nachfolger'].to_frame()
df_2 = df['Vorgänger'].to_frame()
df_3 = df['%'].to_frame()
After some cleaning and concatenations i use the the following code to split these entries into columns and save it into another frame
# Aufsplitten der Spalten
df_ms_1 = df_ms.assign(Vorgänger =df_cpr['Vorgänger'].str.split(';'),
Nachfolger =df_cpr['Nachfolger'].str.split(';')).explode('Vorgänger').explode('Nachfolger')
df_ms_1['Vorgänger'] = df_ms_1['Vorgänger'].fillna(0)
df_ms_1['Nachfolger'] = df_ms_1['Nachfolger'].fillna(0)
At the the first glance this operation worked well, the entries got separated into columns and are saved in the frame. But then i noticed not every entrie from the source table has been processed. For an example the first entries i showed in the table doesnt even appear in the frame df_ms_1 . The Frame starts practically at row 12 from the source and leaves the first 11 out.
The output Frame df_ms_1 in Excel looks like this:
| | Budget | Vorgänger | Nachfolger |
| -- | ------ | ----------- | ---------- |
| 0 | 1000 | 0 | 0 |
| 1 | 1000 | 0 | 0 |
| 2 | 1000 | 0 | 0 |
| 3 | 1000 | 0 | 0 |
| 4 | 0 | 0 | 0 |
| 5 | 1000 | 0 | 0 |
| 6 | 1000 | 0 | 0 |
| 7 | 1000 | 0 | 0 |
| 8 | 1000 | 0 | 0 |
| 9 | 1000 | 0 | 0 |
| 10 | 1000 | 61 | 0 |
| 10 | 1000 | 67 | 0 |
| 10 | 1000 | 73 | 0 |
| 10 | 1000 | 78 | 0 |
| 10 | 1000 | 84 | 0 |
| 10 | 1000 | 134EA-1 Tag | 0 |
| 10 | 1000 | 97 | 0 |
| 10 | 1000 | 100 | 0 |
| 10 | 1000 | 103 | 0 |
| 10 | 1000 | 106 | 0 |
| 10 | 1000 | 107 | 0 |
| 10 | 1000 | 109 | 0 |
| 10 | 1000 | 110 | 0 |
The frame begins at 12th entry of the source table, it has been transformed from this column:
| Vorgänger | Nachfolger |
|---|---|
| 61;67;73;78;84;134EA-1 Tag;97;100;103;106;107;109;110 |
I took a look in the documentation of the Panda library, but dataframe.explode doesnt have a lot of options.
It appears to me this operation cant split columns with large quantity of values. So i have the following questions:
Did anyone experienced the same issues ?
Are there different approaches in Panda to solve this setup?