I have two pandas dataframes in which each row is a person and their response data in the form of a list:
df_1 = pd.DataFrame({'ID': ['a', 'b', 'c', 'd', 'e', 'f'],
                     'response': [["apple", "berry", "cherry"],
                                  ["pear", "pineapple", "plum"],
                                  ["blue_berry"],
                                  ["orange", "lemon"],
                                  ["tomato", "pumpkin"],
                                  ["avocado", "strawberry"]],
                     'group': [1, 2, 1, 2, 1, 2]})
df_2 = pd.DataFrame({'ID': ['A', 'B', 'C', 'D', 'E', 'F'],
                     'response': [["pear", "plum", "cherry"],
                                  ["orange", "lemon", "lime", "pineapple"],
                                  ["pumpkin"],
                                  ["tomato", "strawberry"],
                                  ["avocado", "apple"],
                                  ["berry", "cherry", "apple"]],
                     'group': [1, 2, 1, 2, 1, 2]})
I am trying to construct a matrix in which every row and column index is an (ID, group) pair, and each cell is the pairwise Jensen-Shannon divergence (JSD) calculated from response. My eventual goal is to visualize this as a heatmap to assess reliability between people's responses, but first I am struggling to put my data into the correct matrix form.
I am not sure how to convert these dataframes into squareform and then calculate the JSD using the below function:
import numpy as np

def jsdiv(P, Q):
    """Compute the Jensen-Shannon divergence between two probability distributions.

    Input
    -----
    P, Q : array-like
        Probability distributions of equal length that sum to 1
    """
    def _kldiv(A, B):
        # Terms where A is 0 evaluate to NaN and are dropped from the sum.
        return np.sum([v for v in A * np.log2(A / B) if not np.isnan(v)])

    P = np.array(P)
    Q = np.array(Q)
    M = 0.5 * (P + Q)
    return 0.5 * (_kldiv(P, M) + _kldiv(Q, M))
First of all, you need to combine the two dataframes that you have. I suggest the following approach,
which gives you a data frame of the type:
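As a minimal sketch of one way to do the combination: this assumes that each response list should become a probability distribution over the shared vocabulary of items (one column per item, each row summing to 1), which is my reading of what jsdiv expects. The names combined, vocab and dist are my own and do not come from the question.

```python
import pandas as pd

df_1 = pd.DataFrame({
    'ID': ['a', 'b', 'c', 'd', 'e', 'f'],
    'response': [["apple", "berry", "cherry"], ["pear", "pineapple", "plum"],
                 ["blue_berry"], ["orange", "lemon"],
                 ["tomato", "pumpkin"], ["avocado", "strawberry"]],
    'group': [1, 2, 1, 2, 1, 2],
})
df_2 = pd.DataFrame({
    'ID': ['A', 'B', 'C', 'D', 'E', 'F'],
    'response': [["pear", "plum", "cherry"], ["orange", "lemon", "lime", "pineapple"],
                 ["pumpkin"], ["tomato", "strawberry"],
                 ["avocado", "apple"], ["berry", "cherry", "apple"]],
    'group': [1, 2, 1, 2, 1, 2],
})

# Stack both frames so that every person is one row, labelled by (ID, group).
combined = pd.concat([df_1, df_2], ignore_index=True).set_index(['ID', 'group'])

# Shared vocabulary: every item that anyone mentioned, in a fixed order.
vocab = sorted({item for resp in combined['response'] for item in resp})

# Assumption: a response is a uniform distribution over the items it lists,
# so each row holds relative frequencies that sum to 1.
dist = pd.DataFrame(
    [[resp.count(word) / len(resp) for word in vocab] for resp in combined['response']],
    index=combined.index,
    columns=vocab,
)
print(dist.round(2))
```

Every row of dist is then a vector of equal length, which is the shape jsdiv asks for.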
You can then apply your function,
which will give you
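A sketch of the pairwise step, under the same assumption as before (each response is treated as a uniform distribution over the items it lists). jsdiv is copied from the question; dist and jsd are names I made up for illustration.

```python
import numpy as np
import pandas as pd

def jsdiv(P, Q):
    """Jensen-Shannon divergence (log base 2), as defined in the question."""
    def _kldiv(A, B):
        return np.sum([v for v in A * np.log2(A / B) if not np.isnan(v)])
    P = np.array(P)
    Q = np.array(Q)
    M = 0.5 * (P + Q)
    return 0.5 * (_kldiv(P, M) + _kldiv(Q, M))

df_1 = pd.DataFrame({
    'ID': ['a', 'b', 'c', 'd', 'e', 'f'],
    'response': [["apple", "berry", "cherry"], ["pear", "pineapple", "plum"],
                 ["blue_berry"], ["orange", "lemon"],
                 ["tomato", "pumpkin"], ["avocado", "strawberry"]],
    'group': [1, 2, 1, 2, 1, 2],
})
df_2 = pd.DataFrame({
    'ID': ['A', 'B', 'C', 'D', 'E', 'F'],
    'response': [["pear", "plum", "cherry"], ["orange", "lemon", "lime", "pineapple"],
                 ["pumpkin"], ["tomato", "strawberry"],
                 ["avocado", "apple"], ["berry", "cherry", "apple"]],
    'group': [1, 2, 1, 2, 1, 2],
})

# One row per person, one column per item, each row summing to 1 (assumed encoding).
combined = pd.concat([df_1, df_2], ignore_index=True).set_index(['ID', 'group'])
vocab = sorted({item for resp in combined['response'] for item in resp})
dist = pd.DataFrame(
    [[resp.count(word) / len(resp) for word in vocab] for resp in combined['response']],
    index=combined.index, columns=vocab,
)

# jsdiv deliberately produces NaNs for zero entries and filters them out,
# so silence the NumPy warnings that the log of zero would otherwise raise.
rows = dist.to_numpy()
with np.errstate(divide='ignore', invalid='ignore'):
    jsd = pd.DataFrame(
        [[jsdiv(p, q) for q in rows] for p in rows],
        index=dist.index, columns=dist.index,
    )
print(jsd.round(3))
```

The result is a square, symmetric frame indexed by (ID, group) on both axes. With base-2 logarithms the values lie between 0 (identical responses) and 1 (no items in common).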
However, I do not understand your function. You do know that you could be using a library implementation directly, right?
This will give you
Your definition of jsdiv is overcomplicating things.
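A sketch of the simpler route, assuming the library function in question is SciPy's scipy.spatial.distance.jensenshannon. Note that it returns the Jensen-Shannon distance, which is the square root of the divergence, so I square it here and pass base=2 to match the log2 in jsdiv. As above, the encoding of responses as uniform distributions and the names dist and jsd are my own assumptions.

```python
import pandas as pd
from scipy.spatial.distance import jensenshannon, pdist, squareform

df_1 = pd.DataFrame({
    'ID': ['a', 'b', 'c', 'd', 'e', 'f'],
    'response': [["apple", "berry", "cherry"], ["pear", "pineapple", "plum"],
                 ["blue_berry"], ["orange", "lemon"],
                 ["tomato", "pumpkin"], ["avocado", "strawberry"]],
    'group': [1, 2, 1, 2, 1, 2],
})
df_2 = pd.DataFrame({
    'ID': ['A', 'B', 'C', 'D', 'E', 'F'],
    'response': [["pear", "plum", "cherry"], ["orange", "lemon", "lime", "pineapple"],
                 ["pumpkin"], ["tomato", "strawberry"],
                 ["avocado", "apple"], ["berry", "cherry", "apple"]],
    'group': [1, 2, 1, 2, 1, 2],
})

# One row per person, one column per item, each row summing to 1 (assumed encoding).
combined = pd.concat([df_1, df_2], ignore_index=True).set_index(['ID', 'group'])
vocab = sorted({item for resp in combined['response'] for item in resp})
dist = pd.DataFrame(
    [[resp.count(word) / len(resp) for word in vocab] for resp in combined['response']],
    index=combined.index, columns=vocab,
)

# pdist evaluates the metric once per unordered pair of rows;
# squareform expands that condensed vector into the full symmetric matrix.
condensed = pdist(dist.to_numpy(),
                  metric=lambda p, q: jensenshannon(p, q, base=2) ** 2)
jsd = pd.DataFrame(squareform(condensed), index=dist.index, columns=dist.index)
print(jsd.round(3))
```

From here the heatmap is a single call on jsd with whichever plotting library you prefer.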