I have a dataframe df = DataFrame(CSV.File("file.csv", delim=";;")).
The dataframe has three columns (column1 = Date, column2 = String31, column3 = String15).
column1 | column2 | column3
date | String31 | String15
2022-06-29 | Test | 100.00
Only column1 has the right datatype. I would like to change both column2 (to plain String) and column3 (to Real or Float64). I managed to change column2, but when I tried to change column3 I got an error saying that a string can't be converted to Real.
How would I go about changing these two columns?
On `column2`: I would recommend leaving it as `String31` unless you run into an issue with that (and if you do, maybe raise an issue with the `InlineStrings.jl` package). `String31` is a datatype mainly aimed at data analysis workflows where a large number of strings are created in memory (such as in a long DataFrame column), which puts a lot of pressure on Julia's garbage collector. Working with InlineStrings like `String31` is therefore likely to speed up the analysis in many cases (this won't matter if your data set is small).
For `column3`: if you want to get a number from a string you need to `parse` it, for example `parse(Float64, "100.00")`. You can apply this to the whole column by broadcasting: `df.column3 = parse.(Float64, df.column3)`.
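As a minimal sketch of the two calls just mentioned, the snippet below uses a plain vector as a hypothetical stand-in for `df.column3`, so it runs without CSV.jl or DataFrames.jl installed. The sample values are made up for illustration.

```julia
# Hypothetical stand-in for df.column3: a plain vector of numeric strings.
column3 = ["100.00", "250.50", "3.14"]

# Parse a single string into a Float64.
single = parse(Float64, "100.00")

# Broadcast parse over every element; the type argument Float64 is
# treated as a scalar, so each string is parsed in turn.
parsed = parse.(Float64, column3)

# With a real DataFrame the same pattern would be:
#   df.column3 = parse.(Float64, df.column3)
```

The result is a `Vector{Float64}`, which can be assigned back to the column to replace the string version.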
That said, this operation is likely to fail, because if it worked, CSV.jl would have parsed the column as numeric already. The fact that the column is a `String` type tells you that there's likely something in there which can't be parsed as a number. One popular example is a thousands separator (e.g. in files that came from Excel). `parse` will, however, tell you where it failed: the `ArgumentError` it throws quotes the string it could not parse. In this case you would use `parse.(Float64, replace.(df.column3, "," => ""))` to remove the thousands separators.
[If `parse` just works without any changes, you might have discovered a bug in `CSV.jl`'s type detection algorithm, which might be worth filing an issue for.]
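If the column is long, it can help to locate the offending rows before deciding how to clean them. The sketch below is one way to do that, assuming the problem really is a thousands separator; the vector is again a hypothetical stand-in for `df.column3`, and `tryparse` is used here as an alternative to reading the error message from `parse`.

```julia
# Hypothetical column contents; the second entry has a thousands separator.
column3 = ["100.00", "1,250.50", "3.14"]

# tryparse returns `nothing` instead of throwing when a string is not a
# valid number, so the failing positions can be collected.
attempt = tryparse.(Float64, column3)
bad_rows = findall(isnothing, attempt)

# Inspect the raw strings that failed to parse.
bad_values = column3[bad_rows]

# Strip the separators first, then parse the cleaned strings.
cleaned = parse.(Float64, replace.(column3, "," => ""))
```

Looking at `bad_values` first is a cheap way to confirm what is actually in the column before applying a blanket `replace`, since other culprits (currency symbols, decimal commas, stray whitespace) would need a different fix.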