Pandas slicing by index inconsistency


I have the following DataFrames/Series exhibiting very surprising [] slicing behaviour:

# slicing by integers forces iloc-like indexing even if the index is integer
In [1]: pd.DataFrame({"a": {3: 1, 1: 2}})["a"][2:]
Out[1]: Series([], Name: a, dtype: int64)

# slicing by index element uses sort order of index,
# and in this case dict insertion order is NOT respected
In [2]: pd.DataFrame({"a": {"d": 1, "b": 2}})["a"]["c":]
Out[2]: 
d    1
Name: a, dtype: int64

# if index is not sorted,
# slicing by index element that is not present
# should trigger an exception
In [3]: pd.DataFrame({"a": [1, 2]}, index=["d", "b"])["a"]["c":]
Out[3]: 
b    2
Name: a, dtype: int64

Isn't the last one a bug in pandas? I would expect it to raise an exception.

Moral of the story: never use [] on a DataFrame or Series, especially with slices...

Best answer, by e-motta

Maybe you're overlooking the difference between the two types of selection that pandas supports:

  • Selection by position: works like regular integer-based indexing. It is used when you select with iloc, or with an integer slice in plain brackets such as Series[:2], even when the index itself holds integers.

  • Selection by label: if the index is sorted (monotonically increasing or decreasing), pandas includes in the slice everything between the start and stop labels, endpoints included, and excludes everything else. The slice labels do not have to be present in the index. It is used when you select with loc, or with a label slice in plain brackets such as Series['c':].
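
As a minimal sketch of the two modes, here they are spelled out explicitly with iloc and loc. The Series `s` and its values are made up for illustration and are not taken from the question:

```python
import pandas as pd

# Hypothetical example data: a Series with a sorted (increasing) string index.
s = pd.Series([10, 20, 30], index=["a", "b", "d"])

# Selection by position: everything from position 1 onward (labels b and d).
by_position = s.iloc[1:]

# Selection by label: "c" is not in the index, but the index is sorted,
# so pandas keeps every label that sorts at or after "c" (only d).
by_label = s.loc["c":]

# Label slices include both endpoints (labels a and b).
by_label_inclusive = s.loc["a":"b"]

print(by_position)
print(by_label)
print(by_label_inclusive)
```

Writing iloc or loc explicitly makes the intent visible to the reader, instead of leaving it to what plain brackets infer from the type of the slice.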


Your first example:

pd.DataFrame({"a": {3: 1, 1: 2}})["a"][2:]
  1. You select using [2:].
  2. This selects everything from position 2 onward, counting from zero. Nothing is returned, since the Series only has 2 elements (positions 0 and 1).
Series([], Name: a, dtype: int64)

Compare this with selecting from position 1:

pd.DataFrame({"a": {3: 1, 1: 2}})["a"].iloc[1:]
1    2
Name: a, dtype: int64

Your second and third examples (both give me the same result, which suggests my pandas version keeps the dict insertion order, so the index is ['d', 'b'] in both):

pd.DataFrame({"a": [1, 2]}, index=["d", "b"])["a"]["c":]
  1. You select a slice beginning at label 'c', using ["c":].
  2. Since 'd' and 'b' are in decreasing order, the index is still monotonic, so pandas can work out where the missing label 'c' would fall (between 'd' and 'b'). The slice runs from that point to the end of the index, which leaves only 'b':
b    2
Name: a, dtype: int64

Compare this with an index that is neither increasing nor decreasing:

pd.DataFrame({"a": [0, 1, 2]}, index=["a", "d", "b"])["a"]["c":]

This raises KeyError: 'c', because pandas cannot place the missing label 'c' in an index that is not monotonic.
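
To predict which of the two outcomes you will get, you can check whether the index is monotonic before slicing. The following is only a sketch: the variable names are my own, and sorting with sort_index() is one possible way to make the label slice well defined, not the only one:

```python
import pandas as pd

# Same shapes as the examples above; the variable names are illustrative.
decreasing = pd.Series([1, 2], index=["d", "b"])
unordered = pd.Series([0, 1, 2], index=["a", "d", "b"])

# A decreasing index still counts as monotonic, so label slicing works.
print(decreasing.index.is_monotonic_decreasing)  # True
tail = decreasing.loc["c":]                      # keeps only label b

# This index is neither increasing nor decreasing.
print(unordered.index.is_monotonic_increasing)   # False
print(unordered.index.is_monotonic_decreasing)   # False

# Slicing from a missing label on a non-monotonic index raises KeyError.
try:
    unordered.loc["c":]
    raised = False
except KeyError:
    raised = True

# Sorting the index first makes the same slice well defined (keeps label d).
sorted_tail = unordered.sort_index().loc["c":]
print(sorted_tail)
```

Note that a slice whose labels are all present in the index does not need this check; the KeyError only comes up when a slice label is missing and the index is not monotonic.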