Full list of characters stripped by str.strip() by default


As stated in the documentation:

str.strip([chars])
Return a copy of the string with the leading and trailing characters removed. The chars argument is a string specifying the set of characters to be removed. If omitted or None, the chars argument defaults to removing whitespace.

What is whitespace?

   import string
   print(repr(string.whitespace))

gives ' \t\n\r\x0b\x0c'

At the same time,

'\t\n\r\f\x85\x1c\x1d\v\u2028\u2029'.strip()

gives '', so characters that are not in string.whitespace are stripped as well.

So the question is: what is the full list of characters stripped by str.strip() by default?
Sorry, but GPT gives nonsense answers on this.


3
user2357112 On BEST ANSWER

string.whitespace is documented as

A string containing all ASCII characters that are considered whitespace. This includes the characters space, tab, linefeed, return, formfeed, and vertical tab.

It only includes ASCII whitespace, not all whitespace.
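A small sketch of the difference, using a hand-picked sample of characters (the `extras` string is only an illustration, not a complete list). Note that U+001C is an ASCII control character, yet it is not part of string.whitespace either:

```python
import string

# Hand-picked characters that str.strip() removes by default
# but that string.whitespace does not contain.
extras = "\x1c\x85\xa0\u2028"

for ch in extras:
    # Columns: code point, "in string.whitespace?", "str.isspace()?"
    print(f"U+{ord(ch):04X}", ch in string.whitespace, ch.isspace())

# A no-break space and a line separator are both stripped by default.
print(repr("\xa0text\u2028".strip()))
```

So checking membership in string.whitespace is not a reliable way to predict what str.strip() will remove.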

As documented under str.isspace,

A character is whitespace if in the Unicode character database (see unicodedata), either its general category is Zs (“Separator, space”), or its bidirectional class is one of WS, B, or S.

This is the definition of whitespace used by str.strip. All characters with the listed Unicode properties will be stripped from the ends of the string. The code to check for this is generated from the Unicode character database and hardcoded into the Python interpreter, so it will reflect whatever version of the Unicode character database was used to build the Python version you're running.
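To get the list for the interpreter you are actually running, one option is to test every code point with str.isspace. The sketch below does that and prints the Unicode properties mentioned above; the helper names `default_stripped_chars` and `describe` are made up for this example:

```python
import sys
import unicodedata

def default_stripped_chars():
    """Return every character for which str.isspace() is true."""
    return [chr(cp) for cp in range(sys.maxunicode + 1) if chr(cp).isspace()]

def describe(ch):
    """Format a character as code point, category, bidi class, and name."""
    # Control characters have no Unicode name, so supply a fallback.
    name = unicodedata.name(ch, "<control character>")
    category = unicodedata.category(ch)
    bidi = unicodedata.bidirectional(ch)
    return f"U+{ord(ch):04X}  {category:2}  {bidi:2}  {name}"

for ch in default_stripped_chars():
    print(describe(ch))
```

The output depends on the Unicode database version your Python was built with, so it may differ slightly between Python versions.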

3
Ferret On
\n = Line break
\t = Tab
\r = Return

For the others, I recommend looking up the table at https://www.scaler.com/topics/escape-sequence-in-python/

2
Sash Sinha On

From the CPython source in unicodetype_db.h, with inline comments naming each character added by me:

/* Returns 1 for Unicode characters having the bidirectional
 * type 'WS', 'B' or 'S' or the category 'Zs', 0 otherwise.
 */
int _PyUnicode_IsWhitespace(const Py_UCS4 ch)
{
    switch (ch) {
    case 0x0009:  // HORIZONTAL TAB
    case 0x000A:  // LINE FEED
    case 0x000B:  // VERTICAL TAB
    case 0x000C:  // FORM FEED
    case 0x000D:  // CARRIAGE RETURN
    case 0x001C:  // FILE SEPARATOR
    case 0x001D:  // GROUP SEPARATOR
    case 0x001E:  // RECORD SEPARATOR
    case 0x001F:  // UNIT SEPARATOR
    case 0x0020:  // SPACE
    case 0x0085:  // NEXT LINE
    case 0x00A0:  // NO-BREAK SPACE
    case 0x1680:  // OGHAM SPACE MARK
    case 0x2000:  // EN QUAD
    case 0x2001:  // EM QUAD
    case 0x2002:  // EN SPACE
    case 0x2003:  // EM SPACE
    case 0x2004:  // THREE-PER-EM SPACE
    case 0x2005:  // FOUR-PER-EM SPACE
    case 0x2006:  // SIX-PER-EM SPACE
    case 0x2007:  // FIGURE SPACE
    case 0x2008:  // PUNCTUATION SPACE
    case 0x2009:  // THIN SPACE
    case 0x200A:  // HAIR SPACE
    case 0x2028:  // LINE SEPARATOR
    case 0x2029:  // PARAGRAPH SEPARATOR
    case 0x202F:  // NARROW NO-BREAK SPACE
    case 0x205F:  // MEDIUM MATHEMATICAL SPACE
    case 0x3000:  // IDEOGRAPHIC SPACE
        return 1;
    }
    return 0;
}
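Since this table is generated per Unicode version, it may be worth comparing it with your own interpreter. A minimal sketch, assuming the code points above were copied correctly into the `LISTED` list (a name invented for this example):

```python
import sys

# Code points copied from the switch statement above.
LISTED = (
    list(range(0x0009, 0x000E))      # TAB, LF, VT, FF, CR
    + list(range(0x001C, 0x0021))    # FS, GS, RS, US, SPACE
    + [0x0085, 0x00A0, 0x1680]       # NEXT LINE, NO-BREAK SPACE, OGHAM SPACE MARK
    + list(range(0x2000, 0x200B))    # EN QUAD through HAIR SPACE
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)

# Every code point the running interpreter treats as whitespace.
runtime = {cp for cp in range(sys.maxunicode + 1) if chr(cp).isspace()}

# Both differences are empty when the table matches this interpreter.
print("listed but not whitespace here:", sorted(set(LISTED) - runtime))
print("whitespace here but not listed:", sorted(runtime - set(LISTED)))
```

Any code point printed in either line would mean your Python was built against a Unicode database that differs from the one behind this listing.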