I want to sort a pandas frame by multiple columns. The constraint that is giving me trouble is that one of the columns (the first) needs to be natural sorted, so I tried the following:
sortedFrame = inFrame.sort_values(by=['Col_Arg', 'Col_Step'],
key=lambda x:np.argsort(index_natsorted(inFrame['Col_Arg'])))
but this code results in the frame only being sorted by Col_Arg. E.g. the input frame
| Col_Arg | Col_Step |
|---|---|
| 1 First | 20 |
| 2 Second | 10 |
| 1 First | 10 |
results in
| Col_Arg | Col_Step |
|---|---|
| 1 First | 20 |
| 1 First | 10 |
| 2 Second | 10 |
You can imagine Col_Arg as an indexed headline. Inside that indexed headline are steps to execute (Col_Step). Since Col_Arg is a string which cannot be converted to an integer, I want to use natsort, which works fine for sorting by Col_Arg alone but not with multiple column names. The easy way would be to introduce an additional index for the headlines. Then I could simply use:
sortedFrame = inFrame.sort_values(['Col_Arg_Idx', 'Col_Step'])
Since I am quite new to Python and pandas, I am curious and want to understand what my misconception is and how you would do it, since I think it should be possible. I suspect it has to do with the usage of key:
key: callable, optional. Apply the key function to the values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect a Series and return a Series with the same shape as the input. It will be applied to each column in by independently.
But that does not mean it is applied on all... I am a little confused.
In order to introduce intermediate steps later, the indexes are initially incremented by 10 starting from 10.
Thanks in advance.
So the problem with your code is that when you use multiple columns inside `pandas.DataFrame.sort_values` alongside the `key` parameter, pandas takes each series in the order you've defined in the parameter `by=["col1", "col2"]` and calls that function, passing the column values to it in the order they appeared before calling `sort_values`.

For example, let's define a simple function that only prints the arguments it receives, and use it as our `key` parameter:

So basically, `pandas.DataFrame.sort_values` passes each column as a Series to your function, and it expects your function to do some transformation to make the column "sortable". Here's the parameter description from the pandas documentation:

`key`: callable, optional

Description: Apply the `key` function to the values before sorting. This is similar to the `key` argument in the builtin `sorted()` function, with the notable difference that this `key` function should be vectorized. It should expect a Series and return a Series with the same shape as the input. It will be applied to each column in `by` independently.

In other words, if you want to sort both columns in the same `pandas.DataFrame.sort_values` operation, you need to pass in a function that's able to convert `'Col_Arg'` to a numeric form, while returning `'Col_Step'` unmodified. Additionally, by using `inFrame` in `key=lambda x:np.argsort(index_natsorted(inFrame['Col_Arg']))` instead of passing `x`, the key function will sort values based on the `inFrame` indexes in the order they existed prior to calling the `sort_values` function. Here's an example:

So, the first time the `key` function gets called it sorts the dataframe indexes using `[3 1 2 5 4 0]`, then it applies the same order as it did before, but now all indexes have already been moved, so it ends up ruining the sort operation.

Quick Fix
As stated previously, the `key` function takes each column's values in the order they exist prior to the sort operation. So we need to create a function that converts `'Col_Arg'` values to numbers, instead of trying to sort inside the key function. There's a package called number-parser that does that for you. To install it, run `pip install number-parser`.

Then, you can create a function to use inside `key` like so:

Solution 2: another option would be to do something like this:
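As a minimal sketch of a `key` function that handles both columns in a single `sort_values` call (an illustration under assumptions, not necessarily the code originally posted here): the helper `natural_key` is a hypothetical regex-based stand-in for natsort, `sort_key` is a made-up name, and the extra `10 Tenth` row is added only to make the natural ordering visible. Because `key` is called once per column listed in `by`, the function checks `col.name`, maps `Col_Arg` to a natural-order rank in which equal headlines share the same rank, and returns `Col_Step` unchanged so that it can break ties.

```python
import re

import pandas as pd

# Sample frame from the question, plus one assumed row ("10 Tenth")
# to show that 10 sorts after 2 rather than before it.
inFrame = pd.DataFrame({
    "Col_Arg": ["1 First", "2 Second", "1 First", "10 Tenth"],
    "Col_Step": [20, 10, 10, 10],
})


def natural_key(text):
    """Hypothetical helper: split a string into text and integer chunks."""
    parts = re.split(r"(\d+)", text)
    # With a capturing group, odd positions hold the digit runs.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def sort_key(col):
    """Key for sort_values; pandas calls it once per column in `by`."""
    print("key called for column:", col.name)
    if col.name != "Col_Arg":
        return col  # leave Col_Step (and any other column) untouched
    # Rank the distinct headlines in natural order. Equal headlines get
    # the same rank, so Col_Step can still decide the order among them.
    ordered = sorted(col.unique(), key=natural_key)
    rank = {value: position for position, value in enumerate(ordered)}
    return col.map(rank)


sortedFrame = inFrame.sort_values(by=["Col_Arg", "Col_Step"], key=sort_key)
print(sortedFrame)
```

If natsort is installed, `natsorted(col.unique())` should be usable in place of the `sorted(..., key=natural_key)` call. Note that `np.argsort(index_natsorted(...))` gives every row a distinct rank, which leaves no ties for the second column to resolve; a rank shared by equal values avoids that.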