I am trying to separate two team names that are conjoined in each cell in a column. I am hoping for some help in coming up with a way to separate them.
As you can see from the code below, I am importing the data from a website and cleaning up the dataframe.
What I want to achieve is creating a new column that is the df_games_2023['text_only_away'] column stripped of the df_games_2023['text_only'] column. So the new column df_games_2023 ['new_text_column'] would be "Mississippi Valley St.", "Brescia", "Pacific", etc..
import pandas as pd
import re
# URL of the CSV file on the website
url = "https://www.barttorvik.com/2023_results.csv" # Replace with the actual URL
#name the columns columns
colnames = ['matchup', 'date','home_team', 'xyz', 'xyz1','away_score','home_score','xyz2','xyz3','xyz4','xyz5']
# Read CSV data into pandas DataFrame
df_games_2023 = pd.read_csv(url, names = colnames)
#eliminate columns from dataframe
df_games_2023 = df_games_2023[['matchup','date', 'home_team', 'away_score','home_score']]
#name the dataframe columns
#df_games_2023.columns = ['matchup', 'date','home_team', 'xyz', 'xyz1','away_score','home_score','xyz2','xyz3','xyz4','xyz5']
#clean up the home_team data
# Extract only text using regex
df_games_2023['text_only'] = df_games_2023['home_team'].apply(lambda x: re.sub(r'\d+', '', x))
# Define the phrases to drop
phrases_to_drop = [',','-','.','(',')','%']
# Drop the specified phrases from the column
for phrase in phrases_to_drop:
df_games_2023['text_only'] = df_games_2023['text_only'].str.replace(phrase, '', regex=True)
#Clean up away team
df_games_2023['text_only_away'] = df_games_2023['matchup'].apply(lambda x: re.sub(r'\d+', '', x))
#we are removing a random '-' with this string of code
df_games_2023['text_only_away'] = df_games_2023['text_only_away'].apply(lambda x: x.rstrip('-'))
# Now you have your DataFrame ready
df_games_2023
The above code works just fine, but the issue is when I try to use logic to isolate one team name from the df_games_2023['text_only_away'] column. The following is the code that I am using to create a new column by stripping ['text_only'] from ['text_only_away']:
def remove_data(row):
text_column = row['text_only_away']
phrase = row['text_only']
if text_column.endswith(phrase):
return text_column[:-len(phrase)].rstrip()
else:
return text_column
# Apply the function to each row and create a new column
df_games_2023['new_text_column'] = df_games_2023.apply(remove_data, axis=1)
Any help as to how to create a new column with the team that is not the team listed in the ['text_only'] would be very helpful.
I am hoping for a new column on
df_games_2023 = pd.DataFrame({'new_text_column': ['Mississippi Valley St.', 'Brescia', 'Pacific', etc..])
You didn't state the actual symptom of what was failing, but your code to remove punctuation characters from the end of each string was throwing an exception since '(' by itself is illegal as a regex.
Anyway here's a more compact regex syntax to do all the replacement in one go:
Note also it's recommended practice to protest your regex with r'' rawstring.