input1
import pandas as pd
import numpy as np
np.random.seed(0)
data = {'item': np.random.choice(['skirt', 'shirt', 'coat'], 6),
        'size': np.random.choice(['S', 'M', 'L', 'XL'], 6)}
df1 = pd.DataFrame(data)
df1:
item size
0 skirt S
1 shirt XL
2 skirt L
3 shirt S
4 shirt S
5 coat S
When I sort by size:
df1.sort_values('size')
out:
item size
2 skirt L
0 skirt S
3 shirt S
4 shirt S
5 coat S
1 shirt XL
The data is sorted by the size column, and rows with the same size keep their original relative order.
input2
import pandas as pd
import numpy as np
pd.options.display.max_rows = 6
np.random.seed(0)
data1 = {'item': np.random.choice(['skirt', 'shirt', 'coat'], 1000000),
         'size': np.random.choice(['S', 'M', 'L', 'XL'], 1000000)}
df2 = pd.DataFrame(data1)
df2
item size
0 skirt M
1 shirt L
2 skirt M
... ... ...
999997 coat S
999998 shirt S
999999 skirt L
[1000000 rows x 2 columns]
df2 has 1M rows
When I sort by size:
df2.sort_values('size')
out:
item size
999999 skirt L <- why top?
645704 shirt L
645714 shirt L
... ... ...
822256 coat XL
699230 coat XL
400737 skirt XL
[1000000 rows x 2 columns]
I don't understand why row 999999 is at the top of the sorted df2.
Shouldn't the existing order be preserved when the sizes are the same?
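What you want is a "stable" sort. "Stable" means the sort preserves the existing relative order of rows whose keys are identical. The default algorithm for sort_values, quicksort, is not stable, so it makes no promise about the order of ties. With only six rows in df1 the original order happened to survive, but that is not guaranteed, and with 1M rows in df2 it does not.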
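As a minimal sketch of the fix, you can ask for a stable algorithm through the `kind` parameter of `sort_values`. The example below rebuilds the 1M-row frame from the question and uses `kind='mergesort'`, which is stable. The names `stable` and `is_stable` are just illustrative, and the check at the end is only there to confirm that ties keep their original order.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
n = 1_000_000
df2 = pd.DataFrame({
    'item': np.random.choice(['skirt', 'shirt', 'coat'], n),
    'size': np.random.choice(['S', 'M', 'L', 'XL'], n),
})

# kind='mergesort' requests a stable sort, so rows with the same
# 'size' keep the relative order they had in df2.
stable = df2.sort_values('size', kind='mergesort')

# Sanity check: within each size, the original index labels
# should still be in increasing order.
is_stable = all(
    stable.index[(stable['size'] == s).to_numpy()].is_monotonic_increasing
    for s in stable['size'].unique()
)
print(is_stable)
```

Recent pandas versions also list `kind='stable'` as an option; if your version accepts it, it should serve the same purpose. Note that `kind` only applies when sorting by a single column or label.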