I have columns with different ratings from 1-5 with descriptors next to the number. The format is "number dash descriptor", e.g. "1 - very happy" or "5 - hungry". I want to replace these with just the number, but there are too many different descriptors to recode them all manually.
Because they all include a dash, I'm sure there must be a way to do something like replace all instances of cells that contain "1 -" with "1", but I can't seem to make anything simple work.
Any help is appreciated!
I can use str_contains to find cells that contain a dash, but can't make that work with replace in dplyr.
To extract numbers from text strings in R, I would use the {stringr} package.

First, let's reproduce your data in a simple data frame:
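A minimal stand-in for that data, assuming hypothetical column names `mood` and `appetite` (the real column names and descriptors are not known), might look like this:

```r
# Hypothetical example data: the column names and descriptors are made up
df <- data.frame(
  mood     = c("1 - very happy", "3 - neutral", "5 - sad"),
  appetite = c("5 - hungry", "2 - a bit full", "4 - peckish"),
  stringsAsFactors = FALSE
)

df
```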
We can use `str_extract` from the {stringr} package to extract the first single character from a string, using the regex syntax for any character (`.`) at the beginning of the string (`^`).

But this won't work if there are numbers with more than a single digit. So, we can use the regex for a number of any length (`\\d+`) in `str_extract` to extract only the number from a string, wherever in the string it appears.

This method also lets us find any number that comes before a dash, by including the dash in the pattern.
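The patterns described above can be compared side by side. This is only a sketch: the vector `x` is a hypothetical example (with a two-digit rating added to show where the first pattern falls short), and `"\\d+ -"` is one assumed way of writing "a number followed by a space and a dash".

```r
library(stringr)

# Hypothetical ratings; the two-digit one is there to show the difference
x <- c("1 - very happy", "5 - hungry", "10 - starving")

# Any single character at the start of the string
first_char <- str_extract(x, "^.")
first_char
# "1" "5" "1"   (the "10" is cut short)

# First run of one or more digits, wherever it appears
digits <- str_extract(x, "\\d+")
digits
# "1" "5" "10"

# Digits followed by a space and a dash; the dash is part of the match
with_dash <- str_extract(x, "\\d+ -")
with_dash
# "1 -" "5 -" "10 -"
```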
Note that we have to remove the dash afterwards. This can be avoided using what regex calls a positive lookahead, that is, matching things that meet the criteria only when they come before something else, like extracting any number that comes before a space and a dash:
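A sketch of the lookahead applied to whole columns with dplyr, since the question asks about dplyr. The data frame and its column names (`mood`, `appetite`) are hypothetical, and the conversion to integer is an extra step that may or may not be wanted:

```r
library(dplyr)
library(stringr)

# Hypothetical example data: the column names and descriptors are made up
df <- data.frame(
  mood     = c("1 - very happy", "3 - neutral", "5 - sad"),
  appetite = c("5 - hungry", "2 - a bit full", "4 - peckish"),
  stringsAsFactors = FALSE
)

# "\\d+(?= -)" matches digits only when followed by " -",
# without including the " -" in the match
df_clean <- df %>%
  mutate(across(
    c(mood, appetite),
    ~ as.integer(str_extract(.x, "\\d+(?= -)"))
  ))

df_clean
```

Listing the columns inside `across()` keeps any other columns untouched; a selection helper such as `everything()` could be used instead if every column follows the same format.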
Finally, other packages such as {readr} have functions that help with these kinds of data-cleaning tasks:
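One such function is `parse_number()`, which picks out the first number in a string and drops the surrounding text. A minimal sketch, with a hypothetical input vector `x`:

```r
library(readr)

# Hypothetical ratings in the "number dash descriptor" format
x <- c("1 - very happy", "5 - hungry")

# parse_number() returns a numeric vector rather than strings
ratings <- parse_number(x)
ratings
```

Unlike `str_extract`, this returns numbers directly, so no separate conversion step is needed.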