I am trying to replace 'yyyy-MM' with 'yyyy-MM'+'-01' below is my code and I am not getting it right. Note, I am working on databricks:
from pyspark.sql.functions import col, concat, lit, when
# Show the DataFrame
new_df.show(5)
from pyspark.sql.functions import col, concat, lit, when
# Create new columns with replaced values
new_df = clinicaltrial_df.withColumn(
'Start_new',
when(
col('Start').contains('-'),
col('Start')
).otherwise(
concat(col('Start'), lit('-01'))
)
).withColumn(
'Complete_new',
when(
col('Completion').contains('-'),
col('Completion')
).otherwise(
concat(col('Completion'), lit('-01'))
)
)
# Show the DataFrame
new_df.show(5)
Your code is aimed at appending
'-01'to values in theStartandCompletioncolumns of a DataFrame if they do not already contain a '-'. However, it seems you want to specifically target strings formatted as'yyyy-MM'and ensure they become'yyyy-MM-01'. To achieve this, you need to identify strings that precisely match the'yyyy-MM'format.You could use the
regexp_replacefunction from PySpark'ssql.functionsmodule. This function can search for a regular expression pattern and replace matching parts of the string with a specified replacement string. For your case, you can look for strings that match the pattern of a year and month ('yyyy-MM') and do not end with a day. Then, append'-01'to these strings to standardize them as full dates ('yyyy-MM-01').Here's how you can adjust your code:
The approach shown above ensures that only strings formatted exactly as
'yyyy-MM'are altered, accurately targeting the described requirement and leveraging the powerful pattern matching capabilities of regular expressions to implement the desired transformation.Example
Here's an example of the above solution for some dummy data:
Note: the dates only get appended if both the year and month are present on the string in that respective order. If you also have dates that only contain the year (
'YYYY'), dates that have orderings of year/month other than'YYYY-MM', or dates that do not use'-'as year/month separator (e.g.,'YYYY/MM'), these values will remain unmodified.Regex pattern breakdown
If you're not familiar with regex string patterns, here's a breakdown of the pattern being used in the code above:
regexp_replaceis used with a regular expression patternr'^(\d{4}-\d{2})$':^asserts the start of the string.(\d{4}-\d{2})matches and captures a group consisting of four digits (representing the year), followed by a hyphen, and then two digits (representing the month).$asserts the end of the string.