I am currently desperate regarding my master's thesis, in which I am working with R. I hope someone can help me!
I have a data frame with about 70,000 rows and 38 columns. Now I want to calculate the entropy for one of the columns (variables), which consists of character strings. The entropy should then be added to the table as an extra column (variable) (see image - Extract from the table).
The variable Verbatim, for which I want to calculate the entropy, contains strings such as:
"A LIGHT STOMACH" or "LEFT ANKLE FRACTURE" or "WORSENING INCREASED CREATININE". So these are always different sentences for which I want to calculate the entropy.
I have tried the following code among others, but it always gives the same entropy value for each of the same sentences (Verbatim).
DistEventsAllInfo_NOOUTL$ENTROPY <- entropy(DistEventsAllInfo_NOOUTL$VERBATIM)
Thank you for your help in advance!
Sandra, as mentioned, you will find a lot of friends here if you provide a minimal reproducible example. Read up on how to create one.
Entropy (and the information measures derived from it) is defined for a probability distribution over the states of a system. Thus, you first define the states and then measure the probability with which each state occurs across the whole population.
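To make the definition concrete, here is a minimal base-R sketch. It assumes, purely for illustration, that the "states" are the individual characters of one sentence; the helper names shannon_entropy() and char_entropy() are made up for this example and do not come from any package. Whether characters, words, or whole sentences are the right states for your thesis is something only you can decide.

```r
# Shannon entropy (in bits) of a discrete distribution given as counts.
shannon_entropy <- function(counts) {
  p <- counts / sum(counts)  # relative frequencies = estimated probabilities
  p <- p[p > 0]              # drop empty states; 0 * log(0) is treated as 0
  -sum(p * log2(p))
}

# Assumption: the states are the single characters of the sentence.
char_entropy <- function(s) {
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  shannon_entropy(table(chars))
}

char_entropy("AAAA")  # only one state occurs, so the entropy is 0
char_entropy("ABAB")  # two equally likely states, so the entropy is 1 bit

# Applied to a whole column, this gives one value per row, e.g.
# sapply(c("A LIGHT STOMACH", "LEFT ANKLE FRACTURE"), char_entropy)
```

With states defined per sentence like this, different sentences generally get different entropy values, because each sentence has its own distribution of characters.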
I. dummy data
Let's create a data frame of cases:
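A data frame of this shape could be built as sketched here. All values are invented for illustration: GROUP stands in for your VERBATIM column, and VALUE is a hypothetical numeric column of counts, since your real data is not available.

```r
# Hypothetical dummy data; none of these values come from the real table.
df <- data.frame(
  GROUP = c("A LIGHT STOMACH", "A LIGHT STOMACH",
            "LEFT ANKLE FRACTURE", "LEFT ANKLE FRACTURE", "LEFT ANKLE FRACTURE",
            "WORSENING INCREASED CREATININE"),
  VALUE = c(3, 1, 2, 2, 4, 5),   # invented counts
  stringsAsFactors = FALSE       # keep GROUP as character, as in the question
)

print(df)
```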
This yields:
II. data crunching - which entropy() function are you using?
It is unclear which package you are using for the entropy() function or whether you have written this function yourself.
From the {entropy} package, the function entropy() requires a numeric variable (think column of your data frame).
Thus applying entropy(df$GROUP) will throw an error.
If you have written a function, please post it here. This way we can troubleshoot what the function does.
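The behaviour described above can be checked with a short sketch. It assumes the {entropy} package is installed; the vector verbatim is made-up example data. Turning the sentences into counts with table() is one possible way to obtain numeric input, but then the states are whole sentences, which may or may not be what you need.

```r
library(entropy)

# Made-up character data standing in for the VERBATIM column.
verbatim <- c("A LIGHT STOMACH", "A LIGHT STOMACH", "LEFT ANKLE FRACTURE")

# Character input fails, because the counts cannot be summed.
res <- try(entropy(verbatim), silent = TRUE)
inherits(res, "try-error")

# Numeric input works: count how often each distinct sentence occurs.
counts <- as.vector(table(verbatim))
h <- entropy(counts, unit = "log2")  # entropy in bits over the sentences
h
```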
III. data crunching with the {entropy} package function entropy()
What you get when you run a function on a vector without grouping
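A minimal sketch of the ungrouped case, assuming the {entropy} package and the same invented dummy data (GROUP and VALUE are hypothetical stand-ins):

```r
library(entropy)

# Hypothetical dummy data; VALUE is an invented numeric column of counts.
df <- data.frame(
  GROUP = c("A LIGHT STOMACH", "A LIGHT STOMACH",
            "LEFT ANKLE FRACTURE", "LEFT ANKLE FRACTURE", "LEFT ANKLE FRACTURE",
            "WORSENING INCREASED CREATININE"),
  VALUE = c(3, 1, 2, 2, 4, 5)
)

# entropy() returns a single number for the whole VALUE column ...
h_all <- entropy(df$VALUE)
length(h_all)

# ... and R recycles that single number into every row of the new column.
df$ENTROPY <- h_all
df
```

This is the same pattern as the assignment in the question, which is why every row ends up with an identical value.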
Here the full population is considered as 1 group, as you only supply the VALUE variable/column to the entropy() function.
We can calculate the entropy by "grouping" the cases (Note: I renamed VERBATIM to GROUP to make this clearer for you).
{dplyr} calculates the group-wise entropy and injects this in the new column ENTROPY. Similar to above, the calculated entropy value is inserted as a vector per group (i.e. you will get the same values for each group member).
You want to use dplyr::summarise() to simplify the output.
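The two grouped variants described above could look roughly like this. It is a sketch assuming {dplyr} and {entropy} are installed, using the same invented dummy data; per_row and per_group are just illustrative names.

```r
library(dplyr)
library(entropy)

# Hypothetical dummy data; VALUE is an invented numeric column of counts.
df <- data.frame(
  GROUP = c("A LIGHT STOMACH", "A LIGHT STOMACH",
            "LEFT ANKLE FRACTURE", "LEFT ANKLE FRACTURE", "LEFT ANKLE FRACTURE",
            "WORSENING INCREASED CREATININE"),
  VALUE = c(3, 1, 2, 2, 4, 5)
)

# Variant 1: mutate() keeps every row and repeats the group's entropy on each.
per_row <- df %>%
  group_by(GROUP) %>%
  mutate(ENTROPY = entropy(VALUE)) %>%
  ungroup()

# Variant 2: summarise() collapses each group to a single row.
per_group <- df %>%
  group_by(GROUP) %>%
  summarise(ENTROPY = entropy(VALUE))

per_row
per_group
```

Note that a group with a single case has only one state, so its entropy is 0.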