How to get proper value instead of "NA_character_" in pandas dataframe while calling R function from Python?


I'm calling an R function from a Python script to apply SMOTE to a dummy dataset in which the majority class is 0 (90%) and the minority class is 1 (10%). Calling the R function directly gives the expected output, but calling the same function from Python returns NA_character_ values. Below is the R function:

# file r_test.r
library(performanceEstimation)

rtest <- function(r_df, over_val, under_val) {
  set.seed(0)
  new_df <- smote(y ~ ., r_df, perc.over = over_val, perc.under = under_val,  k = 5)
  table(new_df$y)
  return(new_df)
}

Below is the Python code that calls this function:

import os
import numpy as np
import pandas as pd

import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

from sklearn.datasets import make_classification

def function2(r_df, over_val, under_val):
    r=ro.r
    r.source(path)
    p=r.rtest(r_df, over_val, under_val)
    return p

path=os.path.join(os.getcwd(), "r_test.r")

X, y = make_classification(n_classes=2,
    class_sep=2, 
    weights=[0.90, 0.10], 
    n_informative=4, 
    n_redundant=1, 
    flip_y=0,
    n_features=5, 
    n_clusters_per_class=1,
    n_samples=100,
    random_state=10)

df = pd.DataFrame(X, columns = ["x1", "x2", "x3", "x4", "x5"])
df['y'] = y
df['y'].value_counts()

Output -

0    90
1    10
Name: y, dtype: int64

base = importr('base')

with localconverter(ro.default_converter + pandas2ri.converter):
    r_from_pd_df = ro.conversion.py2rpy(df)
    
with localconverter(ro.default_converter + pandas2ri.converter):
    pd_from_r_df = ro.conversion.rpy2py(function2(r_from_pd_df, 5, 2))

pd_from_r_df['y'].value_counts()

Output -

0                100
NA_character_     50
1                 10
Name: y, dtype: int64

The number of NA_character_ values is exactly the number of minority-class samples the smote function should generate. What mistake am I making in the code above, and how can I get 1s instead of NA_character_? Note: I'm completely new to R. If the problem is in the R code, please point it out with a complete example.

margusl On BEST ANSWER

Try converting the y column to a factor first. Some other implementations (like themis::smote()) will give you a clear, informative error if the types don't match.
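
With the rpy2 setup from the question, the conversion can also be done on the Python side before the data frame is handed to R. The sketch below assumes that rpy2's pandas converter maps a categorical column with string categories to an R factor (that is my understanding of pandas2ri, so verify it with your rpy2 version). with_factor_target and to_r_dataframe are hypothetical helper names, not part of any library.

```python
import pandas as pd


def with_factor_target(df: pd.DataFrame, target: str = "y") -> pd.DataFrame:
    """Return a copy of df whose target column is a string-valued categorical."""
    out = df.copy()
    # Cast to str first: the assumption is that rpy2 only turns categoricals
    # with string categories into R factors.
    out[target] = out[target].astype(str).astype("category")
    return out


def to_r_dataframe(df: pd.DataFrame):
    """Convert a pandas DataFrame to an R data.frame (requires rpy2 and R)."""
    # Imports are local so the helper above can be used without rpy2 installed.
    import rpy2.robjects as ro
    from rpy2.robjects import pandas2ri
    from rpy2.robjects.conversion import localconverter

    with localconverter(ro.default_converter + pandas2ri.converter):
        return ro.conversion.py2rpy(df)


# Small stand-in for the question's data frame.
demo = pd.DataFrame({"x1": [0.1, 0.2, 0.3], "y": [0, 0, 1]})
prepared = with_factor_target(demo)
print(prepared["y"].dtype)
print(list(prepared["y"].cat.categories))
```

In the question's script this would amount to calling to_r_dataframe(with_factor_target(df)) in place of the first localconverter block, leaving the R function unchanged.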

Walk-through with reticulate, Python from R:

library(reticulate)
library(performanceEstimation)

# original:
rtest <- function(r_df, over_val, under_val) {
  set.seed(0)
  new_df <- smote(y ~ ., r_df, perc.over = over_val, perc.under = under_val,  k = 5)
  table(new_df$y)
  return(new_df)
}

py_run_string('
from sklearn.datasets import make_classification
import pandas as pd

X, y = make_classification(n_classes=2,
    class_sep=2, 
    weights=[0.90, 0.10], 
    n_informative=4, 
    n_redundant=1, 
    flip_y=0,
    n_features=5, 
    n_clusters_per_class=1,
    n_samples=100,
    random_state=10)

df = pd.DataFrame(X, columns = ["x1", "x2", "x3", "x4", "x5"])
df["y"] = y
df["y"].value_counts()')

# py$ to access objects in reticulate python environment
# check initial state
str(py$df)
#> 'data.frame':    100 obs. of  6 variables:
#>  $ x1: num  -0.00637 2.47159 3.32977 2.38089 3.59025 ...
#>  $ x2: num  1.78 -1.34 -3.3 -1.92 -1.51 ...
#>  $ x3: num  -1.6937 -0.0247 1.3269 0.0854 0.5175 ...
#>  $ x4: num  1.8 2.57 1.72 1.97 1.93 ...
#>  $ x5: num  0.407 -1.455 -2.571 -2.15 -2.427 ...
#>  $ y : num  0 0 0 0 0 0 0 0 0 0 ...
#>  - attr(*, "pandas.index")=RangeIndex(start=0, stop=100, step=1)
table(py$df$y)
#> 
#>  0  1 
#> 90 10

# apply rtest
new_df <- rtest(py$df, 5, 2)
# and check results
str(new_df)
#> 'data.frame':    160 obs. of  6 variables:
#>  $ x1: num  2.479 1.694 2.314 2.774 0.626 ...
#>  $ x2: num  -1.5 -2.82 -1.62 -2.01 -1.54 ...
#>  $ x3: num  0.65 0.496 -0.714 0.336 -1.025 ...
#>  $ x4: num  1.7 1.2 2.74 1.38 2.41 ...
#>  $ x5: num  -1.31 -2.21 -2.38 -2.72 -1.37 ...
#>  $ y : chr  "0" "0" "0" "0" ...
#>  - attr(*, "pandas.index")=RangeIndex(start=0, stop=100, step=1)
table(new_df$y)
#> 
#>   0   1 
#> 100  10

# but there should be 160 observations in total ...
# let's check the tail
tail(new_df)
#>            x1        x2          x3       x4       x5    y
#> 451 -2.601792 -2.428654 -0.29214031 2.509291 2.282252 <NA>
#> 461 -2.553342 -2.445119 -0.22325568 2.487501 2.303546 <NA>
#> 471 -2.334285 -2.270024 -0.12400004 2.349623 2.256293 <NA>
#> 48  -2.228444 -2.429856  0.08238596 2.431736 2.391834 <NA>
#> 491 -2.636799 -2.416758 -0.34191319 2.525036 2.266866 <NA>
#> 50  -2.070363 -2.569577  0.43053504 2.234752 2.470430 <NA>

# so apparently there are NA values,
# but table() does not include those by default
table(new_df$y, useNA = "ifany") 
#> 
#>    0    1 <NA> 
#>  100   10   50

Let's modify that function for a better match with examples in ?smote , i.e. turn response into factor:

rtest2 <- function(r_df, over_val, under_val) {
  set.seed(0)
  r_df$y <- as.factor(r_df$y)
  smote(y ~ ., r_df, perc.over = over_val, perc.under = under_val,  k = 5)
}

new_df2 <- rtest2(py$df, 5, 2)
str(new_df2)
#> 'data.frame':    160 obs. of  6 variables:
#>  $ x1: num  2.479 1.694 2.314 2.774 0.626 ...
#>  $ x2: num  -1.5 -2.82 -1.62 -2.01 -1.54 ...
#>  $ x3: num  0.65 0.496 -0.714 0.336 -1.025 ...
#>  $ x4: num  1.7 1.2 2.74 1.38 2.41 ...
#>  $ x5: num  -1.31 -2.21 -2.38 -2.72 -1.37 ...
#>  $ y : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  - attr(*, "pandas.index")=RangeIndex(start=0, stop=100, step=1)

# and let's check our new response distribution:
table(new_df2$y, useNA = "ifany") 
#> 
#>   0   1 
#> 100  60

# counts from python (`r.` to access R objects):
py_eval("r.new_df2['y'].value_counts()")
#> 0    100
#> 1     60
#> Name: y, dtype: int64
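
As a sanity check on those counts, the arithmetic can be sketched in Python. This assumes perc.over is the number of synthetic cases generated per minority case and perc.under is the number of majority cases selected per synthetic case, which is my reading of ?smote and is consistent with the output above. expected_smote_counts is a hypothetical helper, not part of any package.

```python
def expected_smote_counts(n_minority: int, perc_over: float, perc_under: float) -> dict:
    """Estimate class sizes after SMOTE under the assumptions stated above."""
    # Assumption: perc_over synthetic cases are created per original minority case.
    synthetic = int(perc_over * n_minority)
    # Assumption: perc_under majority cases are kept per synthetic case.
    majority = int(perc_under * synthetic)
    minority = n_minority + synthetic
    return {
        "synthetic": synthetic,
        "majority": majority,
        "minority": minority,
        "total": majority + minority,
    }


# Values from the question: 10 minority rows, perc.over = 5, perc.under = 2.
counts = expected_smote_counts(n_minority=10, perc_over=5, perc_under=2)
print(counts)
```

Under these assumptions the 50 synthetic rows are exactly the 50 rows that showed up as NA before the factor conversion, and the 100 majority rows exceed the 90 originally present, so some majority rows must appear more than once.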

Created on 2023-09-30 with reprex v2.0.2