I'm trying to integrate PyDeequ with PySpark in my Streamlit application to run data quality checks on a CSV file. I want to use PyDeequ to test completeness, correctness, uniqueness, outliers, and date validity. However, I get an error saying that the 'JavaPackage' object is not callable. Below are the relevant code, the error message, and the tests I'm trying to perform:
import streamlit as st
from pyspark.sql import SparkSession
from pydeequ import AnalysisRunner
from pydeequ.analyzers import Completeness

def create_spark_session():
    return SparkSession.builder.appName("DataQualityCheck").getOrCreate()

def read_csv_data(spark, uploaded_file):
    df = spark.read.csv(uploaded_file, header=True, inferSchema=True)
    return df

def main():
    st.title("Data Quality Checker")
    uploaded_file = st.file_uploader("Choose a CSV file:", key="csv_uploader", type="csv")
    if uploaded_file is not None:
        spark = create_spark_session()
        df = read_csv_data(spark, uploaded_file)
        analysis_runner = AnalysisRunner(spark)
        analysis_result = analysis_runner.onData(df).addAnalyzer(Completeness("MRN")).run()
        completeness_results = analysis_result['Completeness']
        completeness_mrn = completeness_results['MRN']
        completeness_percent_mrn = completeness_mrn['completeness']
        missing_count_mrn = completeness_mrn['count']

if __name__ == "__main__":
    main()
TypeError: 'JavaPackage' object is not callable
Traceback:
  File "E:\Deequ\pydeequ_env\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 542, in _run_script
    exec(code, module.__dict__)
  File "E:\data_quality.py", line 43, in <module>
    completeness_mrn = completeness_results['MRN']
  File "E:\Deequ\pydeequ_env\lib\site-packages\pydeequ\analyzers.py", line 52, in onData
    return AnalysisRunBuilder(self._spark_session, df)
  File "E:\Deequ\pydeequ_env\lib\site-packages\pydeequ\analyzers.py", line 124, in __init__
    self._AnalysisRunBuilder = self._jvm.com.amazon.deequ.analyzers.runners.AnalysisRunBu
Data Quality Tests:
- Completeness: Ensure that certain columns (e.g., "MRN" and "Date of Admission") have complete data.
- Correctness: Verify that data in specific columns adheres to certain format or correctness rules (e.g., "MRN" format correctness).
- Uniqueness: Check if certain columns contain unique values (e.g., "MRN" uniqueness).
- Outlier Detection: Identify any outliers in numerical columns (e.g., "Billing Amount").
- Future Date Check: Ensure that dates in a certain column (e.g., "Date of Admission") are not in the future.
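To make the intent of each test concrete, here is a rough plain-Python sketch (no Spark or PyDeequ involved) of what I mean by each check. The column names come from my data, but the MRN pattern, the date format, the 1.5 x IQR outlier rule, and the sample rows are assumptions made up for illustration; this is not how PyDeequ computes its metrics.

```python
import re
import statistics
from collections import Counter
from datetime import date, datetime

# Hypothetical sample rows, shaped like csv.DictReader output.
SAMPLE_ROWS = [
    {"MRN": "MRN001", "Date of Admission": "2023-01-05"},
    {"MRN": "MRN002", "Date of Admission": "2023-02-10"},
    {"MRN": "MRN002", "Date of Admission": ""},
    {"MRN": "", "Date of Admission": "2999-01-01"},
]

def _present(rows, column):
    """Return the non-empty values found in `column`."""
    return [r[column] for r in rows if r.get(column) not in (None, "")]

def completeness(rows, column):
    """Fraction of rows with a non-empty value in `column`."""
    return len(_present(rows, column)) / len(rows) if rows else 1.0

def pattern_compliance(rows, column, pattern):
    """Fraction of non-empty values that fully match the regex `pattern`."""
    values = _present(rows, column)
    if not values:
        return 1.0
    return sum(1 for v in values if re.fullmatch(pattern, v)) / len(values)

def duplicates(rows, column):
    """Sorted list of non-empty values that appear more than once."""
    counts = Counter(_present(rows, column))
    return sorted(v for v, n in counts.items() if n > 1)

def iqr_outliers(values):
    """Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (assumed outlier rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

def future_dates(rows, column, fmt="%Y-%m-%d", today=None):
    """Non-empty date strings in `column` that fall after `today`."""
    today = today or date.today()
    return [
        v for v in _present(rows, column)
        if datetime.strptime(v, fmt).date() > today
    ]
```

For example, `completeness(SAMPLE_ROWS, "MRN")` counts the three rows with an MRN out of four, and `duplicates(SAMPLE_ROWS, "MRN")` flags the repeated "MRN002". What I would like is to express the same checks through PyDeequ so they run on the Spark DataFrame.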
I have installed PyDeequ version 1.2.0 and downgraded PySpark to version 3.3.1 in my environment. Could someone please help me understand why I'm encountering this error and how to resolve it?