CSV file data manipulation in R

I have been trying to clean a dataset named logbook.csv, which records fuel usage by users around the world. The first step is to clean a column named "date_fueled", which holds the date on which each user purchased fuel. The column has dates in a format such as "Apr 12 2020", but it also contains non-date values that themselves contain commas, e.g. "Cooling System, Heating System, Lights, Spark Plugs". I have tried to clean this column with several libraries (lubridate, parsedate, dplyr and readr), but I either get errors or all of my dates turn into NA values. After restarting RStudio and starting over, I noticed that I already get a warning message when importing the dataset.

The warning message is as follows:

> library(readr)
> logbook <- read_csv("C:/Users/theet/Downloads/logbook.csv")
Rows: 1174870 Columns: 9                                                                                                
── Column specification ─────────────────────────────────────────────────────────────
Delimiter: ","
chr (5): date_fueled, date_captured, cost_per_gallon, total_spent, user_url
dbl (3): gallons, mpg, miles
num (1): odometer

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Warning message:
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat) 
> View(logbook)

After reading the above, I ran "problems(logbook)" and received the following output:

> problems(logbook)
# A tibble: 398 × 5
     row   col expected actual     file                                
   <int> <int> <chr>    <chr>      <chr>                               
 1  5409     4 a double 8,583.478  C:/Users/theet/Downloads/logbook.csv
 2  5790     8 a double 1,182.5    C:/Users/theet/Downloads/logbook.csv
 3  9681     8 a double 1,888.2    C:/Users/theet/Downloads/logbook.csv
 4 12023     4 a double 10,738.000 C:/Users/theet/Downloads/logbook.csv
 5 12140     7 a double 1,049.2    C:/Users/theet/Downloads/logbook.csv
 6 12140     8 a double 2,713.3    C:/Users/theet/Downloads/logbook.csv
 7 13609     8 a double 132,388.0  C:/Users/theet/Downloads/logbook.csv
 8 16234     4 a double 2,817.502  C:/Users/theet/Downloads/logbook.csv
 9 20879     4 a double 16,378.667 C:/Users/theet/Downloads/logbook.csv
10 26262     8 a double 49,725.2   C:/Users/theet/Downloads/logbook.csv
# ℹ 388 more rows
# ℹ Use `print(n = ...)` to see more rows
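
Every value shown in the "actual" column contains a comma used as a thousands separator, which is a plausible reason the double parser rejects it. Below is a minimal base-R sketch of that situation, using three of the values from the output above. The variable names are only illustrative, and it is an assumption that all 398 flagged rows follow the same pattern:

```r
# Values copied from the problems() output above.
flagged <- c("8,583.478", "1,182.5", "132,388.0")

# as.numeric() does not understand grouping commas, so these become NA
# (the coercion warning is silenced here).
naive <- suppressWarnings(as.numeric(flagged))

# Dropping the commas first gives ordinary numbers.
cleaned <- as.numeric(gsub(",", "", flagged, fixed = TRUE))

print(naive)
print(cleaned)
```

readr also provides parse_number() and col_number(), which ignore grouping marks. Declaring the affected columns as col_number() in col_types at import time might avoid the warning, although that has not been checked against the full file.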

The link to my dataset is: https://drive.google.com/file/d/18TbpdmNS7hsBtUU-wkItEK9IBEfy9Hqr/view?usp=drive_link

Here is the code I wrote using the lubridate library:

library(parsedate)
library(lubridate)
library(dplyr)
library(readr)

logbook2 <- read_csv("C:/Users/theet/Downloads/logbook.csv")

# Convert date_fueled to actual date objects
logbook2  <- logbook2 %>% 
  mutate(date_fueled = as.Date(date_fueled, format = "%b %d %Y")

# Replace NA values in date_fueled with NA

logbook2 <- logbook2 %>% 
  mutate(date_fueled = ifelse(is.na(date_fueled), NA, date_fueled))

head(logbook2)

The above code gave me this error:

Error: unexpected symbol in:
"#Replace NA values in date_fueled with NA
logbook2"
> 

Please help me fix this error, and let me know if there are any other mistakes in my code.
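
For what it is worth, the "unexpected symbol" error is consistent with the first mutate() call being one closing parenthesis short: as.Date( is closed but mutate( is not, so R keeps reading and trips over logbook2 in the next statement. Below is a minimal sketch of the conversion with balanced parentheses, run on a made-up three-row data frame. The sample rows and the name toy are hypothetical, and base R is used so that it runs without extra packages. It also suggests that the second step is unnecessary, since strings that do not match the format already come back as NA, and ifelse() would drop the Date class:

```r
# %b depends on the locale; force English month abbreviations for this sketch.
invisible(Sys.setlocale("LC_TIME", "C"))

# Hypothetical sample rows mimicking the mix of values in date_fueled.
toy <- data.frame(
  date_fueled = c("Apr 12 2020",
                  "Cooling System, Heating System, Lights, Spark Plugs",
                  "Jan 03 2019"),
  stringsAsFactors = FALSE
)

# Strings that do not match the format become NA without raising an error.
toy$date_fueled <- as.Date(toy$date_fueled, format = "%b %d %Y")

# ifelse() returns a plain vector without the Date class, so the original
# "replace NA with NA" step would turn the dates into day counts.
stripped <- ifelse(is.na(toy$date_fueled), NA, toy$date_fueled)

print(toy)
print(class(stripped))
```

With dplyr, the equivalent call would be mutate(date_fueled = as.Date(date_fueled, format = "%b %d %Y")), with both parentheses closed.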
