I am working on a beginner's case study and have loaded the relevant data into R, but I found an issue when I checked the data types of several columns.
I wish to change the character format to numeric in 3 columns, namely:
1) started_at
2) ended_at
3) ride_length
Initially, the conversion appeared to succeed, but I received a warning: NAs introduced by coercion. I also tried changing the format of the CSV file and loading it again, but this didn't work. Could you please help me rectify this issue?
I have attached a screenshot for your consideration.
Regards, Shanawaz
Following the instructions in one of the questions on Stack Overflow, I used the following code:
cols.num <- c("started_at","ended_at","ride_length")
jan_2022[cols.num] <- sapply(jan_2022[cols.num], as.numeric)
sapply(jan_2022,class)
summary(jan_2022)
It did change the data type to numeric, but I received a warning: NAs introduced by coercion.
Edit: Sharing the data for your consideration.
structure(list(ride_id = c("C2F7DD78E82EC875", "A6CF8980A652D272",
"BD0F91DFF741C66D", "CBB80ED419105406", "DDC963BFDDA51EEA"),
rideable_type = c("electric_bike", "electric_bike", "classic_bike",
"classic_bike", "classic_bike"), started_at = c("1/13/2022 11:59",
"1/10/2022 8:41", "1/25/2022 4:53", "1/4/2022 0:18", "1/20/2022 1:31"
), ended_at = c("1/13/2022 12:02", "1/10/2022 8:46", "1/25/2022 4:58",
"1/4/2022 0:33", "1/20/2022 1:37"), start_station_name = c("Glenwood Ave & Touhy Ave",
"Glenwood Ave & Touhy Ave", "Sheffield Ave & Fullerton Ave",
"Clark St & Bryn Mawr Ave", "Michigan Ave & Jackson Blvd"
), start_station_id = c("525", "525", "TA1306000016", "KA1504000151",
"TA1309000002"), end_station_name = c("Clark St & Touhy Ave",
"Clark St & Touhy Ave", "Greenview Ave & Fullerton Ave",
"Paulina St & Montrose Ave", "State St & Randolph St"), end_station_id = c("RP-007",
"RP-007", "TA1307000001", "TA1309000021", "TA1305000029"),
start_lat = c(42.0128005, 42.012763, 41.92560188, 41.983593,
41.87785), start_lng = c(-87.665906, -87.6659675, -87.65370804,
-87.669154, -87.62408), end_lat = c(42.01256012, 42.01256012,
41.92533, 41.961507, 41.88462107), end_lng = c(-87.67436712,
-87.67436712, -87.6658, -87.671387, -87.62783423), member_casual = c("casual",
"casual", "member", "casual", "member"), ride_length = c("0:02:57",
"0:04:21", "0:04:21", "0:14:56", "0:06:02"), day_of_week = c(5L,
2L, 3L, 3L, 5L)), row.names = c(NA, 5L), class = "data.frame")
With the data that you provided, I could see that the columns started_at, ended_at and ride_length actually represent dates and times. These are not well represented as numeric in R; they have special classes instead.
There is an easy way to get this right from the start:
If you are using RStudio, you can read the data via a graphical interface that allows you to manually set the data type of columns beforehand.
To open this interface, you go to File -> Import Dataset -> From Text (readr)...
Then you can choose your .csv file by clicking the "Browse" button.
After the preview appears, you should be able to select the desired data type from a drop-down menu right below the corresponding column name. In your case, you should select the format "DateTime" for started_at and ended_at, and "Time" for ride_length.
In the bottom right, code is automatically generated. You can copy it into your script to make the process reproducible.
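For orientation, here is a rough sketch of what such import code might look like if written by hand with readr. The exact code the dialog generates may differ. The two-row temporary file and the format strings are my assumptions, based on the sample data in the question; in practice you would point read_csv() at your own file.

```r
library(readr)

# Hypothetical two-row file in the same layout as the question's sample data,
# written to a temporary path so that the sketch runs on its own.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "ride_id,started_at,ended_at,ride_length",
  "C2F7DD78E82EC875,1/13/2022 11:59,1/13/2022 12:02,0:02:57",
  "A6CF8980A652D272,1/10/2022 8:41,1/10/2022 8:46,0:04:21"
), csv_path)

# Declare the date-time and time columns up front; all other columns are guessed.
# The format strings are assumptions derived from values like "1/13/2022 11:59".
jan_2022 <- read_csv(
  csv_path,
  col_types = cols(
    started_at  = col_datetime(format = "%m/%d/%Y %H:%M"),
    ended_at    = col_datetime(format = "%m/%d/%Y %H:%M"),
    ride_length = col_time(format = "%H:%M:%S")
  )
)

# Show the first class of every column to check the result.
sapply(jan_2022, function(x) class(x)[1])
```

If you still need a plain number afterwards, as.numeric(jan_2022$ride_length) should give the ride length in seconds.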
Background Info
NAs are generated whenever the text data cannot be converted automatically to your chosen type. This might be correct behavior, and it is pretty useful.
But if you think that there should not be an NA in that place, it might have happened due to:
having chosen the wrong data type for the column
the data containing unexpected text strings or other unconventional formatting.
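To illustrate the first cause with values copied from the question, here is a minimal base R sketch. The helper to_seconds() is a name I made up for illustration, and tz = "UTC" is an assumption; use whatever time zone the data was recorded in.

```r
# Two values copied from the question's sample data.
started_at  <- c("1/13/2022 11:59", "1/10/2022 8:41")
ride_length <- c("0:02:57", "0:04:21")

# as.numeric() cannot interpret these strings, so every value becomes NA and R
# warns "NAs introduced by coercion" (suppressed here to keep the output clean).
suppressWarnings(as.numeric(ride_length))

# Parse the timestamps with an explicit format instead.
# tz = "UTC" is an assumption, not something stated in the question.
started_parsed <- as.POSIXct(started_at, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Hypothetical helper: convert "H:MM:SS" strings to a number of seconds.
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}
ride_seconds <- to_seconds(ride_length)
```

Here started_parsed has class POSIXct and ride_seconds is an ordinary numeric vector, so no NAs appear for these values.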