https://www.w3.org/TR/xpath-functions/#func-tokenize explains the single-argument version of tokenize:

    The one-argument form of this function splits the supplied string at whitespace boundaries.
and then goes on to define or explain that with:

    Calling fn:tokenize($input) is equivalent to calling fn:tokenize(fn:normalize-space($input), ' ') where the second argument is a single space character (x20).
However, when I try count(tokenize('1 2 3')), count(tokenize('1&#10;2&#10;3')) (the first string separates the numbers with spaces, the second with newlines, written here as the &#10; character reference) with Saxon or BaseX or XmlPrime I get 3 3, while the supposedly equivalent count(tokenize('1 2 3', ' ')), count(tokenize('1&#10;2&#10;3', ' ')) in all three implementations gives me 3 1.
So all three implementations seem to do with tokenize($s) what the textual explanation says ("splits the supplied string at whitespace boundaries"), but the equivalence of fn:tokenize($input) and fn:tokenize(fn:normalize-space($input), ' ') given in the spec doesn't seem to hold up: if a space is literally passed in as the second argument, then only that single space is used as a separator, not whitespace boundaries.
Is that equivalence given in the spec as a definition of the single-argument version wrong?
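For reference, here is the test as a self-contained query (I am assuming an XQuery 3.1 processor, which all three of the products mentioned provide; &#10; is the character reference for a newline):

    (: space-separated vs. newline-separated input :)
    let $spaces := '1 2 3'
    let $newlines := '1&#10;2&#10;3'
    return (
      (: one-argument form splits at whitespace boundaries: 3, 3 :)
      count(tokenize($spaces)), count(tokenize($newlines)),
      (: two-argument form with a literal space: 3, 1 :)
      count(tokenize($spaces, ' ')), count(tokenize($newlines, ' '))
    )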
The call on normalize-space() replaces newlines by x20 space characters. So while count(tokenize('1&#10;2&#10;3', ' ')) gives 1, count(tokenize(normalize-space('1&#10;2&#10;3'), ' ')) gives 3.

The substitution of newlines and tabs by single spaces could have been achieved using a smarter regular expression, but the key thing that the call on normalize-space() achieves is to trim leading and trailing whitespace. For example tokenize(" red green blue ", "\s+") gives 5 tokens, but tokenize(" red green blue ") gives 3.
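To make that concrete, here is a small sketch (any XPath 3.1 / XQuery 3.1 processor should do; &#9; and &#10; are the character references for a tab and a newline):

    (: a string with tabs, newlines and leading/trailing whitespace :)
    let $s := '  red&#9;green&#10;blue '
    return (
      (: the one-argument form: "red", "green", "blue" :)
      tokenize($s),
      (: true: it matches the spec's rewrite using normalize-space() :)
      deep-equal(tokenize($s), tokenize(normalize-space($s), ' ')),
      (: 5: a bare \s+ regex leaves empty tokens at both ends :)
      count(tokenize($s, '\s+'))
    )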