Why does np.array([1, "a"]) consume Unicode String of 21 characters?


When I check the data type of an array holding a one-character string, I get dtype <U1, as expected.

import numpy

print(numpy.array(["a"]).dtype)

Output: <U1

But after adding an integer to the array, why does it consume 21 characters?

print(numpy.array([1,"a"]).dtype)

Output: <U21

Dani Mesejo On BEST ANSWER

Why does it consume 21 characters?

Because the elements are being promoted: NumPy converts them to a common type, which the promote_types documentation describes as

the smallest size and smallest scalar kind to which both type1 and type2 may be safely cast.

For example, using promote_types:

import numpy as np

print(np.promote_types('i8', '<U1'))

Output

<U21

Regarding U21, it consists of two parts, as you already know: U denotes Unicode and 21 is the number of characters each element can hold; see more on this answer.
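The two parts of the dtype can also be inspected directly. This is only a minimal sketch, assuming NumPy is imported as np; the variable names are illustrative and the character count it prints depends on the platform's default integer type.

```python
import numpy as np

arr = np.array([1, "a"])
dt = arr.dtype

kind = dt.kind               # 'U' means fixed-width Unicode string
n_chars = dt.itemsize // 4   # NumPy stores each Unicode character in 4 bytes

print(kind, n_chars)
print(arr)  # the integer has been converted to the string '1'
```

Note that the integer does not survive as a number: every element of the resulting array is a string.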

Since 1 is stored as int64 by default (platform dependent though), and the string form of an int64 can need up to 20 characters, the result is converted to U21. To find the number of characters a number can have, you can do:

ii64 = np.iinfo(np.int64)
print(ii64)

Output

Machine parameters for int64
---------------------------------------------------------------
min = -9223372036854775808
max = 9223372036854775807
---------------------------------------------------------------

In particular:

print(len(str(ii64.min)))

Output

20
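The same check can be repeated for other integer widths and compared with what promote_types reserves. The following is a rough sketch rather than a description of NumPy's internals; the rows list and loop variable names are made up for illustration.

```python
import numpy as np

rows = []
for int_type in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(int_type)
    # Longest decimal string any value of this type can produce (sign included).
    needed = max(len(str(info.min)), len(str(info.max)))
    # Width NumPy reserves when this integer type is promoted with a string.
    promoted = np.promote_types(int_type, "U1")
    reserved = promoted.itemsize // 4  # 4 bytes per Unicode character
    rows.append((np.dtype(int_type).name, needed, reserved))

for name, needed, reserved in rows:
    print(f"{name}: needs up to {needed} characters, promoted dtype reserves {reserved}")
```

For every type the reserved width should be at least the needed width, which is what makes the cast to a string safe.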

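With the NumPy version this answer was written against, you can keep U1 by putting the string first. The result depends on element order and on the NumPy version, so newer releases may return <U21 here too: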

print(np.array(["a", 1]).dtype) # put the string first

Output

<U1
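An approach that does not rely on element order is to convert every element to a Python string before building the array, so NumPy only ever sees strings and sizes the dtype to the longest one. A minimal sketch, assuming that turning the integers into strings is acceptable for your use case (the names mixed and as_strings are illustrative):

```python
import numpy as np

mixed = [1, "a"]

# Convert in Python first, so no integer-to-string promotion takes place.
as_strings = [str(item) for item in mixed]
arr = np.array(as_strings)

print(arr.dtype)
print(arr)
```

The width then follows the data: a longer element such as 123 would widen the dtype to fit it.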

See more on this GitHub issue.