Calculate entropy of various text strings in an extra column of a table in RStudio


I am currently desperate regarding my master's thesis, in which I am working with R. I hope someone can help me!

I have a data frame with about 70,000 rows and 38 columns. Now I want to calculate the entropy for one of the columns (variables), which consists of character strings. The entropy should then be displayed as an extra column (variable) in the table (see image - Extract from the table).

The variable Verbatim for which I want to calculate the entropy contains the following string, for example:

"A LIGHT STOMACH" or "LEFT ANKLE FRACTURE" or "WORSENING INCREASED CREATININE". So these are always different sentences for which I want to calculate the entropy.

I have tried the following code among others, but it always gives the same entropy value for each of the same sentences (Verbatim).

DistEventsAllInfo_NOOUTL$ENTROPY <-  entropy(DistEventsAllInfo_NOOUTL$VERBATIM)

Thank you for your help in advance!


There are 1 best solutions below

Ray On

Sandra, as mentioned, you will find a lot of friends here, if you provide a minimal workable example. Read up on how to create one.

Entropy (and the information measures derived from it) is defined for a probability distribution over the states of a system. Thus, you first define the states and then measure the probability of each state occurring across the whole population.
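To make the "states and probabilities" idea concrete, here is a minimal base-R sketch of one possible reading of the question: each character of a sentence is treated as a state, so every sentence gets its own entropy. The helpers `shannon_entropy()` and `char_entropy()` and the vector `verbatim` are made up for illustration and are not part of any package; whether characters (rather than words, for example) are the right states depends on your thesis.

```r
# Hypothetical helper: plug-in Shannon entropy (natural log, i.e. nats)
# computed from a vector of counts
shannon_entropy <- function(counts) {
  p <- counts / sum(counts)   # relative frequencies = estimated probabilities
  p <- p[p > 0]               # drop empty states so log(0) never occurs
  -sum(p * log(p))
}

# Hypothetical helper: treat every character of one string as a state
char_entropy <- function(s) {
  chars <- strsplit(s, "")[[1]]
  shannon_entropy(as.numeric(table(chars)))
}

# Assumed example strings, taken from the question
verbatim <- c("A LIGHT STOMACH",
              "LEFT ANKLE FRACTURE",
              "WORSENING INCREASED CREATININE")

# One entropy value per string, in the same order as the input
string_entropy <- vapply(verbatim, char_entropy, numeric(1), USE.NAMES = FALSE)
```

A string with a single distinct character has entropy 0, and a string with two equally frequent characters has entropy log(2); anything in between depends on how evenly the characters are spread.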

  • Below I create a dummy data sample - adapt this (e.g. the names) to your case.
  • I also use the {tidyverse} family of packages to help you see how this works (i.e. defining the groups/cases ~ states you are interested in).

I. dummy data

Let's create a data frame of cases:

library(dplyr)   # or library(tidyverse) - dplyr is one package for data crunching

# our dummy data
# we abbreviate DistEventsAllInfo_NOOUTL to df!
# to make the case, we name VERBATIM as GROUP!
# the variable VALUE is an arbitrary description
# we do not know your case, e.g. days of treatment?
# VALUE is a metric of your state!
df <- data.frame(
    GROUP = c("A LIGHT STOMACH", "A LIGHT STOMACH",
              "LEFT ANKLE FRACTURE", "LEFT ANKLE FRACTURE",
              "WORSENING INCREASED CREATININE", "WORSENING INCREASED CREATININE",
              "WORSENING INCREASED CREATININE"),
    VALUE = c(17, 11, 36, 48, 42, 15, 19)
)

This yields:

df
                           GROUP VALUE
1                A LIGHT STOMACH    17
2                A LIGHT STOMACH    11
3            LEFT ANKLE FRACTURE    36
4            LEFT ANKLE FRACTURE    48
5 WORSENING INCREASED CREATININE    42
6 WORSENING INCREASED CREATININE    15
7 WORSENING INCREASED CREATININE    19

II. data crunching - which entropy() function are you using?

It is unclear which package you are using for the entropy() function or whether you have written this function yourself.

The entropy() function from the {entropy} package requires a numeric vector of counts (think of a numeric column of your data frame).

Thus applying entropy(df$GROUP) will throw an error.
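If what you are after is the entropy of how often each sentence occurs in the column, the character values first have to be turned into counts. The following is a minimal base-R sketch under that assumption; the vector `verbatim` and the helper `shannon_entropy()` are made up for illustration. With the {entropy} package loaded, the same counts could be passed to entropy::entropy() instead of the helper.

```r
# Hypothetical example data, standing in for the VERBATIM / GROUP column
verbatim <- c("A LIGHT STOMACH", "A LIGHT STOMACH",
              "LEFT ANKLE FRACTURE", "LEFT ANKLE FRACTURE",
              "WORSENING INCREASED CREATININE",
              "WORSENING INCREASED CREATININE",
              "WORSENING INCREASED CREATININE")

# table() counts how often each distinct sentence occurs;
# as.numeric() strips the names and leaves a plain numeric vector
counts <- as.numeric(table(verbatim))

# Hypothetical helper: plug-in Shannon entropy (natural log) from counts
shannon_entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log(p))
}

# A single number describing the whole column, not one value per row
column_entropy <- shannon_entropy(counts)
```

Note that this yields one value for the entire column, which is why it cannot by itself give a different entropy per sentence.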

If you have written the function yourself, please post it here. This way we can troubleshoot what the function does.

III. data crunching with the {entropy} package function entropy()

This is what you get when you run a function on a vector without grouping:

library(entropy)
# we stress the package by using the entropy:: notation

df |> mutate(ENTROPY = entropy::entropy(VALUE))

                           GROUP VALUE  ENTROPY
1                A LIGHT STOMACH    17 1.816692
2                A LIGHT STOMACH    11 1.816692
3            LEFT ANKLE FRACTURE    36 1.816692
4            LEFT ANKLE FRACTURE    48 1.816692
5 WORSENING INCREASED CREATININE    42 1.816692
6 WORSENING INCREASED CREATININE    15 1.816692
7 WORSENING INCREASED CREATININE    19 1.816692

Here the full population is treated as one group, because you only supply the VALUE variable/column to the entropy() function.

We can calculate the entropy by "grouping" the cases (Note: I renamed VERBATIM to GROUP to make this clearer for you).

df |> 
  group_by(GROUP) |>    # dplyr's grouping
  mutate(ENTROPY = entropy::entropy(VALUE))

# A tibble: 7 × 3
# Groups:   GROUP [3]
  GROUP                          VALUE ENTROPY
  <chr>                          <dbl>   <dbl>
1 A LIGHT STOMACH                   17   0.670
2 A LIGHT STOMACH                   11   0.670
3 LEFT ANKLE FRACTURE               36   0.683
4 LEFT ANKLE FRACTURE               48   0.683
5 WORSENING INCREASED CREATININE    42   0.995
6 WORSENING INCREASED CREATININE    15   0.995
7 WORSENING INCREASED CREATININE    19   0.995

{dplyr} calculates the group-wise entropy and writes it into the new column ENTROPY. As above, the single entropy value is repeated for every row of its group (i.e. all members of a group get the same value).
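For readers who prefer not to depend on {dplyr} or {entropy}, the same group-wise column can be sketched in base R with ave(), which applies a function per group and repeats the result for each row of that group. The helper `shannon_entropy()` is made up for illustration and assumes, like the output above, that VALUE holds counts and that the natural logarithm is used.

```r
# Dummy data as defined earlier in this answer
df <- data.frame(
  GROUP = c("A LIGHT STOMACH", "A LIGHT STOMACH",
            "LEFT ANKLE FRACTURE", "LEFT ANKLE FRACTURE",
            "WORSENING INCREASED CREATININE", "WORSENING INCREASED CREATININE",
            "WORSENING INCREASED CREATININE"),
  VALUE = c(17, 11, 36, 48, 42, 15, 19)
)

# Hypothetical helper: plug-in Shannon entropy (natural log) from counts
shannon_entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log(p))
}

# ave() splits VALUE by GROUP, applies the helper to each piece and
# recycles the single result across all rows of that group
df$ENTROPY <- ave(df$VALUE, df$GROUP, FUN = shannon_entropy)
```

Rounded to three decimals, this should reproduce the 0.670, 0.683 and 0.995 shown in the grouped tibble above.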

If you only need one row per group, use dplyr::summarise() to simplify the output:

df |> 
  group_by(GROUP) |> 
  summarise(ENTROPY = entropy::entropy(VALUE))

# A tibble: 3 × 2
  GROUP                          ENTROPY
  <chr>                            <dbl>
1 A LIGHT STOMACH                  0.670
2 LEFT ANKLE FRACTURE              0.683
3 WORSENING INCREASED CREATININE   0.995