I'm trying to apply Bayes' theorem to a problem of my own so I can understand the methodology and how to set up the numbers.
Essentially, I've got some data on four artists and the number of objects that they make every month over 80 periods.
I'm interested in artist 'h' and believe there are three equally likely possibilities for the last update: 1) they have left work, 2) they have been promoted and now split their time between making and managing others, 3) they have been working on a project.
I've used Allen Downey's Think Bayes code to work through the process.
from empiricaldist import Pmf
# Define hypotheses
hypos = ["left", "managing", "project"]
# Define prior probabilities - they are all equally likely
prior = Pmf(1/len(hypos), hypos)
# Display prior probabilities
print("Prior probabilities:")
print(prior)
The result:
Prior probabilities:
left 0.333333
managing 0.333333
project 0.333333
dtype: float64
Code:
# Normalize the data to calculate the likelihoods
normalized_data = df.div(df.sum(axis=1), axis=0)
Normalized Data:
          h         j         n    t
0  0.000000  1.000000  0.000000  0.0
1  0.666667  0.333333  0.000000  0.0
2  0.571429  0.428571  0.000000  0.0
3  0.769231  0.230769  0.000000  0.0
4  0.700000  0.300000  0.000000  0.0
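As a check on what the normalization step computes, here is a minimal plain-Python sketch of the same row-wise division, using the first three rows of raw counts from the data. The helper name `row_shares` is made up for illustration. Each value is an artist's share of that month's total, which is not necessarily the same thing as a likelihood under one of the three hypotheses.

```python
# Raw counts for the first three periods, copied from the data (columns h, j, n, t).
rows = {
    0: {"h": 0.0, "j": 2.0, "n": 0.0, "t": 0.0},
    1: {"h": 2.0, "j": 1.0, "n": 0.0, "t": 0.0},
    2: {"h": 4.0, "j": 3.0, "n": 0.0, "t": 0.0},
}


def row_shares(counts):
    """Divide each count by the row total (hypothetical helper)."""
    total = sum(counts.values())
    if total == 0:
        # pandas would produce NaN for an all-zero row; this sketch uses None.
        return {name: None for name in counts}
    return {name: value / total for name, value in counts.items()}


shares = {period: row_shares(counts) for period, counts in rows.items()}
for period, share in shares.items():
    print(period, {name: round(value, 6) for name, value in share.items()})
```

The printed shares for periods 0 to 2 should agree with the first three rows of the normalized table.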
Now I get confused.
from empiricaldist import Pmf

# Define hypotheses
hypos = ["left", "managing", "project"]

# Calculate the likelihoods using the normalized data
likelihoods = {}
for hypo in hypos:
    likelihoods[hypo] = normalized_data.apply(lambda row: row if hypo == "left" else 1, axis=1)

# Perform Bayesian update to obtain the posterior probabilities
posterior_history = [prior]
for hypo in hypos:
    posterior = prior.copy()  # Create a copy of the prior probabilities
    if hypo == "left":
        # Ensure alignment of row labels and perform element-wise multiplication
        for index, row in likelihoods[hypo].iterrows():
            if index in posterior.index:
                posterior.loc[index] *= row
    posterior /= posterior.sum()  # Normalize the posterior probabilities
    posterior_history.append(posterior)
The output is:
[left 0.333333
managing 0.333333
project 0.333333
dtype: float64,
left 0.333333
managing 0.333333
project 0.333333
dtype: float64,
left 0.333333
managing 0.333333
project 0.333333
dtype: float64,
left 0.333333
managing 0.333333
project 0.333333
dtype: float64]
I was confused by the output for two reasons: 1) the posterior is the same as the prior, and 2) there are four outputs.
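Both observations may come from how the loop is structured rather than from the data. A minimal plain-Python sketch, assuming the rows of `normalized_data` are labeled 0, 1, 2, ... while the posterior is labeled by hypothesis name (the variable names below are illustrative, not taken from any library):

```python
hypos = ["left", "managing", "project"]

# Labels of the posterior: one entry per hypothesis.
posterior_labels = set(hypos)

# Labels of the data rows: integers (only the first five are used here).
row_labels = range(5)

# The update is guarded by `index in posterior.index`; count how often it passes.
updates_applied = sum(1 for label in row_labels if label in posterior_labels)
print(updates_applied)  # 0, so no multiplication ever happens

# The history starts with the prior and gains one entry per hypothesis.
history = ["prior"] + [f"posterior after {hypo}" for hypo in hypos]
print(len(history))  # 4
```

If that reading is right, each copy of the prior is renormalized without ever being multiplied by anything, which would explain four identical distributions.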
Maybe I'm over-complicating this and should update the values using just the one column of the normalized data.
What can I try next?
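One possible direction, sketched under loud assumptions: give each hypothesis its own model for how many objects 'h' makes in a month, then multiply the prior by the probability of each observed count and renormalize. The Poisson model and the three monthly rates below are made-up placeholders, not estimates from the data; the counts are the last five 'h' values (periods 78 to 82). Plain Python is used so the arithmetic is visible.

```python
from math import exp, factorial


def poisson_pmf(k, rate):
    """Probability of observing k events when the expected count is rate."""
    return exp(-rate) * rate**k / factorial(k)


# ASSUMPTION: hypothetical monthly rates for 'h' under each hypothesis.
rates = {"left": 0.1, "managing": 1.5, "project": 3.0}

# Equal priors, as in the question.
posterior = {hypo: 1 / len(rates) for hypo in rates}

# Last five monthly counts for 'h' (periods 78 to 82 in the data).
observed = [2, 4, 1, 2, 0]

for count in observed:
    # Multiply each hypothesis by the likelihood of this count ...
    for hypo, rate in rates.items():
        posterior[hypo] *= poisson_pmf(count, rate)
    # ... then renormalize so the probabilities sum to 1.
    total = sum(posterior.values())
    posterior = {hypo: p / total for hypo, p in posterior.items()}

print(posterior)
```

The key difference from the loop above is that the likelihood is indexed by hypothesis, so every observation changes the weight of every hypothesis. The result depends entirely on the assumed rates, so they would need to be justified or estimated before the posterior means anything.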
I've created a dict of the data, data_to_dict, thus:
{'h': {0: 0.0,
1: 2.0,
2: 4.0,
3: 10.0,
4: 7.0,
5: 6.0,
6: 4.0,
7: 10.0,
8: 11.0,
9: 3.0,
10: 4.0,
11: 6.0,
12: 3.0,
13: 4.0,
14: 8.0,
15: 9.0,
16: 6.0,
17: 5.0,
18: 6.0,
19: 5.0,
20: 4.0,
21: 1.0,
22: 3.0,
23: 4.0,
24: 0.0,
25: 2.0,
26: 6.0,
27: 4.0,
28: 8.0,
29: 2.0,
30: 4.0,
31: 2.0,
32: 2.0,
33: 3.0,
34: 2.0,
35: 3.0,
36: 2.0,
37: 3.0,
38: 3.0,
39: 1.0,
40: 4.0,
41: 2.0,
42: 1.0,
43: 3.0,
44: 3.0,
45: 1.0,
46: 1.0,
47: 1.0,
48: 5.0,
49: 2.0,
50: 2.0,
51: 4.0,
52: 4.0,
53: 2.0,
54: 3.0,
55: 4.0,
56: 2.0,
57: 2.0,
58: 1.0,
59: 4.0,
60: 3.0,
61: 3.0,
62: 3.0,
63: 1.0,
64: 3.0,
65: 2.0,
66: 2.0,
67: 4.0,
68: 2.0,
69: 2.0,
70: 1.0,
71: 0.0,
72: 5.0,
73: 0.0,
74: 3.0,
75: 3.0,
76: 2.0,
77: 2.0,
78: 2.0,
79: 4.0,
80: 1.0,
81: 2.0,
82: 0.0},
'j': {0: 2.0,
1: 1.0,
2: 3.0,
3: 3.0,
4: 3.0,
5: 2.0,
6: 1.0,
7: 9.0,
8: 7.0,
9: 4.0,
10: 0.0,
11: 3.0,
12: 6.0,
13: 2.0,
14: 5.0,
15: 4.0,
16: 1.0,
17: 2.0,
18: 2.0,
19: 3.0,
20: 6.0,
21: 6.0,
22: 3.0,
23: 4.0,
24: 5.0,
25: 3.0,
26: 2.0,
27: 1.0,
28: 4.0,
29: 0.0,
30: 1.0,
31: 0.0,
32: 0.0,
33: 2.0,
34: 2.0,
35: 1.0,
36: 0.0,
37: 4.0,
38: 2.0,
39: 0.0,
40: 0.0,
41: 2.0,
42: 2.0,
43: 1.0,
44: 2.0,
45: 1.0,
46: 1.0,
47: 2.0,
48: 0.0,
49: 1.0,
50: 1.0,
51: 2.0,
52: 0.0,
53: 0.0,
54: 0.0,
55: 1.0,
56: 2.0,
57: 1.0,
58: 0.0,
59: 1.0,
60: 0.0,
61: 1.0,
62: 1.0,
63: 1.0,
64: 2.0,
65: 0.0,
66: 2.0,
67: 2.0,
68: 5.0,
69: 1.0,
70: 2.0,
71: 2.0,
72: 3.0,
73: 0.0,
74: 3.0,
75: 0.0,
76: 1.0,
77: 2.0,
78: 5.0,
79: 3.0,
80: 1.0,
81: 4.0,
82: 2.0},
'n': {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0,
20: 0.0,
21: 0.0,
22: 0.0,
23: 0.0,
24: 0.0,
25: 0.0,
26: 0.0,
27: 0.0,
28: 0.0,
29: 0.0,
30: 0.0,
31: 0.0,
32: 0.0,
33: 0.0,
34: 0.0,
35: 0.0,
36: 0.0,
37: 0.0,
38: 0.0,
39: 0.0,
40: 0.0,
41: 0.0,
42: 0.0,
43: 0.0,
44: 0.0,
45: 0.0,
46: 0.0,
47: 0.0,
48: 0.0,
49: 0.0,
50: 0.0,
51: 0.0,
52: 0.0,
53: 0.0,
54: 0.0,
55: 0.0,
56: 0.0,
57: 0.0,
58: 0.0,
59: 0.0,
60: 0.0,
61: 0.0,
62: 0.0,
63: 0.0,
64: 0.0,
65: 0.0,
66: 0.0,
67: 0.0,
68: 0.0,
69: 0.0,
70: 0.0,
71: 0.0,
72: 0.0,
73: 1.0,
74: 3.0,
75: 6.0,
76: 8.0,
77: 2.0,
78: 3.0,
79: 2.0,
80: 2.0,
81: 5.0,
82: 2.0},
't': {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 6.0,
9: 3.0,
10: 4.0,
11: 8.0,
12: 2.0,
13: 5.0,
14: 5.0,
15: 3.0,
16: 7.0,
17: 3.0,
18: 4.0,
19: 2.0,
20: 5.0,
21: 1.0,
22: 2.0,
23: 2.0,
24: 2.0,
25: 1.0,
26: 1.0,
27: 6.0,
28: 4.0,
29: 5.0,
30: 2.0,
31: 3.0,
32: 6.0,
33: 1.0,
34: 2.0,
35: 1.0,
36: 2.0,
37: 1.0,
38: 2.0,
39: 1.0,
40: 0.0,
41: 2.0,
42: 2.0,
43: 2.0,
44: 2.0,
45: 2.0,
46: 3.0,
47: 0.0,
48: 2.0,
49: 5.0,
50: 3.0,
51: 4.0,
52: 0.0,
53: 1.0,
54: 1.0,
55: 0.0,
56: 3.0,
57: 1.0,
58: 1.0,
59: 0.0,
60: 1.0,
61: 1.0,
62: 1.0,
63: 2.0,
64: 0.0,
65: 1.0,
66: 1.0,
67: 0.0,
68: 0.0,
69: 0.0,
70: 0.0,
71: 0.0,
72: 0.0,
73: 0.0,
74: 0.0,
75: 0.0,
76: 0.0,
77: 0.0,
78: 0.0,
79: 0.0,
80: 0.0,
81: 0.0,
82: 0.0}}
import pandas as pd
df = pd.DataFrame(data_to_dict)