Removing punctuation and all capitalization in newly generated columns (RStudio)

Question

36 Views Asked by Malaxis28 At 16 November 2022 at 07:12

I am new to R, and while I do know some of the basics, I've been unable to figure out how to add new columns to a table (preferably using the mutate() function) whose values lack any punctuation or capitalization.

I exported around 20,000 observations from the citizen science network iNaturalist in an effort to determine which species are most commonly misidentified. To accomplish this, my goal is to have R compare the value for each observation in the species_guess column (which consists of variably punctuated and capitalized common and scientific names) to the corresponding name in either the taxon_species_name column (standardized, uniform scientific names) or the common_name column (standardized, uniform common names). For each observation, I'd like a new column, correct_identification, to read TRUE if species_guess matches one of the latter two columns and FALSE otherwise.

I expect that accomplishing this would require the following:

1. The creation of three new columns which are the same as species_guess, taxon_species_name, and common_name but are all lowercase and have no punctuation.
2. The creation of a correct_identification column which reads TRUE or FALSE depending on whether the new species_guess matches the new taxon_species_name or common_name. I think I can do this step myself.
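
The two steps above can be sketched with dplyr as follows. This is only a sketch under stated assumptions: normalize_name() is a hypothetical helper, the clean_ prefix is an arbitrary naming choice, and the three-row table is toy data standing in for the real export.

```r
library(dplyr)

# Hypothetical helper: lowercase, turn each run of punctuation/whitespace
# into a single space, then trim leading and trailing spaces
normalize_name <- function(x) {
  trimws(gsub("[[:punct:][:space:]]+", " ", tolower(x)))
}

# Toy data standing in for the iNaturalist export (values are illustrative)
obs <- tibble(
  species_guess      = c("Spiranthes lucida", "Shining Ladies' Tresses", "Nodding Ladies'-tresses"),
  taxon_species_name = c("Spiranthes lucida", "Spiranthes lucida", "Spiranthes lucida"),
  common_name        = c("Shining Ladies' Tresses", "Shining Ladies' Tresses", "Shining Ladies' Tresses")
)

obs <- obs %>%
  mutate(
    # Step 1: one cleaned copy of each of the three name columns
    across(c(species_guess, taxon_species_name, common_name),
           normalize_name, .names = "clean_{.col}"),
    # Step 2: TRUE when the cleaned guess equals either cleaned reference name
    correct_identification =
      clean_species_guess == clean_taxon_species_name |
      clean_species_guess == clean_common_name
  )
```

Rows with a missing name may come out as NA rather than FALSE, so it is worth checking how many NA values the real data contains before counting misidentifications.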

Please don't hesitate to ask clarifying questions as needed. I am happy to provide more code samples. As requested, the output from the dput function (specifically using the code provided by @IRTFM) has been pasted at the bottom.

I found information on grep() and tolower(), but I really have no idea how to use them to create a new column. There's a lot on removing punctuation from a string, but I'm not sure how those methods would be applicable to an entire column in a dataset.
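
On the point about whole columns: tolower() and gsub() are vectorized, so passing a column returns a cleaned vector of the same length, which can be assigned straight back as a new column. (grep() only finds matching elements; gsub() is the function that replaces text.) A minimal base R sketch, where df and species_guess_clean are made-up names for illustration:

```r
# Toy data frame; the values are made up for illustration
df <- data.frame(species_guess = c("Shining Ladies' Tresses", "SPIRANTHES LUCIDA!"))

# tolower() and gsub() each take the whole column and return
# a character vector with one element per row
df$species_guess_clean <- gsub("[[:punct:]]", "", tolower(df$species_guess))
```

The same expression works unchanged inside mutate(), where the column is referred to by its bare name.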

Thanks!


structure(list(id = c(99512L, 190432L, 207211L, 276566L, 298366L, 
380464L), observed_on_string = c("Fri Jul 06 2012 14:35:33 GMT-0400 (EDT)", 
"2009-09-19", "2012-06-13", "6/23/2010", "2013-06-13", "2013-08-27"
), observed_on = c("2012-07-06", "2009-09-19", "2012-06-13", 
"2010-06-23", "2013-06-13", "2013-08-27"), time_observed_at = c("2012-07-06 18:35:33 UTC", 
NA, NA, NA, NA, NA), time_zone = c("Eastern Time (US & Canada)", 
"Eastern Time (US & Canada)", "Eastern Time (US & Canada)", "Eastern Time (US & Canada)", 
"Eastern Time (US & Canada)", "Eastern Time (US & Canada)"), 
    user_id = c(2179L, 12610L, 13594L, 12035L, 12610L, 13406L
    ), user_login = c("charlie", "susanelliott", "bheitzman", 
    "sfaccio", "susanelliott", "hobiecat"), user_name = c("Charlie Hohn", 
    "Susan Elliott", "Bob Heitzman", "Steve Faccio", "Susan Elliott", 
    NA), created_at = c("2012-07-07 19:56:36 UTC", "2013-02-02 16:19:29 UTC", 
    "2013-03-01 02:00:25 UTC", "2013-05-23 19:32:44 UTC", "2013-06-13 18:57:38 UTC", 
    "2013-08-28 03:04:18 UTC"), updated_at = c("2019-01-08 21:22:48 UTC", 
    "2020-02-13 19:16:34 UTC", "2021-06-27 23:36:32 UTC", "2016-09-20 02:53:33 UTC", 
    "2017-09-26 01:21:35 UTC", "2020-02-12 01:23:48 UTC"), quality_grade = c("research", 
    "research", "research", "research", "research", "research"
    ), license = c("CC0", "CC-BY-NC", "CC-BY-NC", NA, "CC-BY-NC", 
    "CC-BY-NC"), url = c("http://www.inaturalist.org/observations/99512", 
    "http://www.inaturalist.org/observations/190432", "http://www.inaturalist.org/observations/207211", 
    "http://www.inaturalist.org/observations/276566", "http://www.inaturalist.org/observations/298366", 
    "http://www.inaturalist.org/observations/380464"), image_url = c("https://inaturalist-open-data.s3.amazonaws.com/photos/144232/medium.jpg", 
    "https://inaturalist-open-data.s3.amazonaws.com/photos/244969/medium.jpg", 
    "https://inaturalist-open-data.s3.amazonaws.com/photos/262914/medium.JPG", 
    "http://static.inaturalist.org/photos/342086/medium.JPG", 
    "https://inaturalist-open-data.s3.amazonaws.com/photos/369424/medium.jpg", 
    "https://inaturalist-open-data.s3.amazonaws.com/photos/475664/medium.jpg"
    ), sound_url = c(NA, NA, NA, NA, NA, NA), tag_list = c(NA, 
    "Spiranthes, ladies tresses, plant", "Spiranthes, lucida, orchid, Vermont", 
    NA, NA, NA), description = c(NA, NA, "S. lucida can be found in heavily scoured sections of the river banks, generally on the downstream side of boulders, where they are protected during floods.  Very hardy, stout plants, with distinctive thick leaf whorls.\nFlower spikes are distinctive in mid-June, with 6-20 blossoms in a spiral.", 
    "Many blooming around pond edge.", NA, "Ladies' Tresses   "
    ), num_identification_agreements = c(2L, 0L, 2L, 1L, 1L, 
    1L), num_identification_disagreements = c(0L, 0L, 0L, 0L, 
    0L, 0L), captive_cultivated = c("false", "false", "false", 
    "false", "false", "false"), oauth_application_id = c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
    ), place_guess = c("United States", "Vermont, US", "Vermont, US", 
    "Vermont, US", "Vermont, US", "Grand Isle, VT"), latitude = c(43.6243306384, 
    44.7147801982, 43.6528495032, 43.9558655593, 43.8546044617, 
    44.75182), longitude = c(-73.2028825367, -71.933891759, -72.2231645845, 
    -72.5525452841, -73.1619811058, -73.30593), positional_accuracy = c(5L, 
    NA, NA, NA, 166L, NA), private_place_guess = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_), private_latitude = c(NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_), private_longitude = c(NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), public_positional_accuracy = c(27443L, 
    27285L, 27443L, 27396L, 27396L, NA), geoprivacy = c("obscured", 
    "obscured", "obscured", NA, NA, NA), taxon_geoprivacy = c("obscured", 
    NA, "obscured", "obscured", "obscured", NA), coordinates_obscured = c("true", 
    "true", "true", "true", "true", "false"), positioning_method = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_), positioning_device = c(NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    ), place_town_name = c(NA, NA, NA, NA, NA, "Grand Isle"), 
    place_county_name = c("Rutland", "Essex", "Windsor", "Orange", 
    "Addison", "Grand Isle"), place_state_name = c("Vermont", 
    "Vermont", "Vermont", "Vermont", "Vermont", "Vermont"), species_guess = c("Northern Slender Ladies'-tresses", 
    "Sphinx ladies’ tresses", "Spiranthes lucida", "Shining Ladies' Tresses", 
    "Shining Ladies' Tresses", "Sphinx ladies’ tresses"), scientific_name = c("Spiranthes lacera lacera", 
    "Spiranthes incurva", "Spiranthes lucida", "Spiranthes lucida", 
    "Spiranthes lucida", "Spiranthes incurva"), common_name = c("Northern Slender Ladies'-tresses", 
    "Sphinx ladies’ tresses", "Shining Ladies' Tresses", "Shining Ladies' Tresses", 
    "Shining Ladies' Tresses", "Sphinx ladies’ tresses"), iconic_taxon_name = c("Plantae", 
    "Plantae", "Plantae", "Plantae", "Plantae", "Plantae"), taxon_id = c(243059L, 
    773387L, 62254L, 62254L, 62254L, 773387L), taxon_subfamily_name = c("Orchidoideae", 
    "Orchidoideae", "Orchidoideae", "Orchidoideae", "Orchidoideae", 
    "Orchidoideae"), taxon_tribe_name = c("Cranichideae", "Cranichideae", 
    "Cranichideae", "Cranichideae", "Cranichideae", "Cranichideae"
    ), taxon_subtribe_name = c("Spiranthinae", "Spiranthinae", 
    "Spiranthinae", "Spiranthinae", "Spiranthinae", "Spiranthinae"
    ), taxon_genus_name = c("Spiranthes", "Spiranthes", "Spiranthes", 
    "Spiranthes", "Spiranthes", "Spiranthes"), taxon_species_name = c("Spiranthes lacera", 
    "Spiranthes incurva", "Spiranthes lucida", "Spiranthes lucida", 
    "Spiranthes lucida", "Spiranthes incurva"), taxon_hybrid_name = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_), taxon_variety_name = c("Spiranthes lacera lacera", 
    NA, NA, NA, NA, NA)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

UPDATE: found a solution!

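library(dplyr)

# Lowercase species_guess, then collapse each run of punctuation and/or spaces into one space
spiranthes <- spiranthes %>%
  mutate(standardized_species_guess = gsub("[[:punct:] ]+", " ", tolower(species_guess)))
View(spiranthes)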

Hopefully this helps anyone else who may be struggling with the same thing.
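
One caveat for data like the sample above: some values use a typographic apostrophe (’, U+2019) rather than the ASCII one, and whether [[:punct:]] matches it may depend on the locale and regex engine in use. A hedged sketch that replaces it explicitly before cleaning; the vector names here are made up for illustration:

```r
# Two spellings of the same name: typographic vs. ASCII apostrophe
guess <- c("Sphinx ladies\u2019 tresses", "Sphinx Ladies' Tresses")

# Replace the typographic apostrophe (U+2019) with the ASCII one first
ascii <- gsub("\u2019", "'", guess, fixed = TRUE)

# Then lowercase, collapse punctuation/space runs, and trim the ends
clean <- trimws(gsub("[[:punct:] ]+", " ", tolower(ascii)))
```

The trimws() call removes the trailing space that the replacement leaves behind when a value ends in punctuation or spaces.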
