I have a Dataframe containing a point column with multiple values separated using commas, hyphens, alphabets and other values are text wrapped meaning there are more than two pairs of values(see highlighted rows). I have managed to split the values separated by hyphens, and those separated by alphabets using my python script below:
import os
import pandas as pd
data_path= r'pppppp'
table = 'ddddd.xlsx'
file = os.path.join(data_path, table)
df = pd.read_excel(file)
point = df.iloc[:, 1].copy() # Create a copy as a DataFrame
print(point)
def split_column(value):
# Handle NaN values
if pd.isna(value):
return pd.NaT, pd.NaT
#Remove any '?' and characters after it
value = str(value).rstrip('...').split('...')[0]
# Function to split the values based on the provided criteria
if '-' in value:
parts = value.split('-')
left, right = parts[-1], parts[0]
return right[-7:], left[-7:]
elif ',' in value:
parts = value.split(',')
left, right = parts[-1].strip(), parts[0].strip()
return right[-7:], left[-7:]
elif any(c.isalpha() for c in value):
# Find the index of the first alphabet character
index = next((i for i, c in enumerate(value) if c.isalpha()), None)
if index is not None:
left = str(value)[:index]
right = str(value)[index + 1:index + 8]
return left[-7:], right
else:
return str(value)[-7:], str(value)[:-7]
# Apply the split_column function to the specified column
points = point.dropna().explode().str.split('\n').explode()
pointseparated = points.apply(split_column).apply(pd.Series)
print(pointseparated)
# Rename the resulting columns
singlepoint = pointseparated.rename(columns={0: 'start', 1:'end'})
#If only one value is found, assign it to start
if singlepoint.shape[1] == 1:
singlepoint.column = ['start']
#Reset the index of both DataFrames
df.reset_index(drop=True, inplace=True)
singlepoint.reset_index(drop=True, inplace=True)
# Concatenate the original Dataframe and the new columns
results_df = pd.concat([df, singlepoint], axis=1)
# Export the Dataframe to an Excel Worksheet
results_df.to_excel(test.xlsx, index=False)
I generated the start and end columns by splitting the values in the point column as shown in the code above. The challenge I now have is in splitting the point column row values which have line breaks meaning having more than one set of values in them in the point column as shown below. In the example below, I would like to return the start and end values of the point values after the linebreak.
number point start end
06-102.1 5103023A5104021A 5103023 5104021
5104055…
06-221 4411020-4411030 4411020 (4411141)
4411140A4411141A
19-083 4114057, 4114057 4114010
4114048O4114010O
19-130 5002038, 5002040, 5002042, 5002043, 5002044 5002038 (5002044)
20-038 4514011, 4714026, 4517019 4514011 (4517019)
Also I am trying to write all the values in the point column that are separated by a comma, only to the start column and not in the end column as shown above.The last two values in the end column enclosed with brackets() should not be written instead all the point values should be returned on the start column.
How can I add these changes in my script?