I have a dataset (in .tsv format) where one of the columns is date of birth. However, the data is old, and not all dates are in YYYY-MM-DD format. Some entries have only the year of birth (month and day are missing) and are in the format YYYY-##-## (the ##s are literally present in the data wherever only the year is known). I wish to load this dataset into my Postgres database with the date-of-birth column typed as date, not string, so that I can do comparisons on dates.
A small sample is shown below (irrelevant columns of data are not shown):
1924-##-##
1965-09-04
1944-11-05
1951-##-##
-388-##-##
1893-01-26
1037-##-##
Directly bulk loading the dataset obviously gives an error:
ERROR: invalid input syntax for type date: "1924-##-##"
LINE 1: insert into d values ('1924-##-##');
^
The dataset is quite large, with around 6 crore (60 million) entries. Currently I am thinking of running a script that replaces the ##s with 01 and then inserting the modified data into the database. But I don't like this idea because:
- It is time consuming.
- It is disk-space consuming (I would like to keep the original, untampered data as well).
- Not all the data in my database would be genuine.
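For concreteness, the replacement script I have in mind would look roughly like this (a minimal sketch: the file handling, column index, and sample rows are illustrative assumptions, not part of my actual pipeline):

```python
# Sketch of the "replace ## with 01" idea: stream the TSV row by row,
# rewrite the date-of-birth column, and emit a modified copy.
import csv
import io

def fix_dob(value):
    # "1924-##-##" -> "1924-01-01"; complete dates pass through unchanged.
    return value.replace("-##-##", "-01-01")

def rewrite(src, dst, dob_col=0):
    # dob_col is an assumed column index for illustration.
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    for row in reader:
        row[dob_col] = fix_dob(row[dob_col])
        writer.writerow(row)

# Example run on a hypothetical in-memory sample:
src = io.StringIO("1924-##-##\tAlice\n1965-09-04\tBob\n")
dst = io.StringIO()
rewrite(src, dst)
print(dst.getvalue())
```

This illustrates why I dislike the approach: it produces a second full copy of the data on disk, and the 01-01 values it writes are indistinguishable from genuine January 1st birthdays.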
Is there any way I can ask Postgres to take the dates as they are, by just ignoring the ##s (and keeping only the year where the month and day are missing)? Or is there a better solution to this problem?
You have two options here: