I am doing some tests in a Python Jupyter Notebook in Visual Studio Code, connecting a local PySpark session to a PostgreSQL database on localhost that runs in a Docker container.
from pyspark.sql import SparkSession
# create a spark instance
spark = SparkSession.builder \
    .appName("ETL_PostgreSQL") \
    .config("spark.master", "local") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.5.4") \
    .getOrCreate()
# Source PostgreSQL database connection settings
source_url = "jdbc:postgresql://localhost:5430/chinook"
source_properties = {
    "user": "root",
    "password": "****",
    "driver": "org.postgresql.Driver"
}
table_df = spark.read.jdbc(url=source_url, table="genre", properties=source_properties)
table_df.show()
spark.stop()
I get the following error on the spark.read command:
...
Py4JJavaError: An error occurred while calling o1946.jdbc.
: java.lang.ClassNotFoundException: org.postgresql.Driver
...
I already checked the Java version ("1.8.0_401"), the Windows system variables JAVA_HOME and PYTHONPATH, and the Py4J installation. I also tried different .config("spark.jars", ..) settings. Connecting to the database with the psycopg2 library works without problems.
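One more check that might help narrow this down is whether the running session actually carries the driver setting. This is only a sketch under an assumption: if a Spark session was already created earlier in the same notebook kernel, getOrCreate() returns that existing session, and a jar setting passed to the builder afterwards may not take effect. The helper name driver_settings and the sample values are hypothetical; the commented-out last line shows how it would be called against the live session.

```python
def driver_settings(conf_pairs, needle="postgresql"):
    """Return the jar-related Spark settings whose value mentions the driver.

    conf_pairs: iterable of (key, value) tuples, for example the list
    returned by spark.sparkContext.getConf().getAll().
    """
    jar_keys = ("spark.jars", "spark.jars.packages", "spark.driver.extraClassPath")
    return {k: v for k, v in conf_pairs if k in jar_keys and needle in v.lower()}


# Hypothetical sample values standing in for a live session's configuration.
sample = [
    ("spark.app.name", "ETL_PostgreSQL"),
    ("spark.master", "local"),
    ("spark.jars.packages", "org.postgresql:postgresql:42.5.4"),
]
found = driver_settings(sample)
print(found)

# In the notebook, the same check against the running session would be:
# driver_settings(spark.sparkContext.getConf().getAll())
```

An empty result for the live session would suggest that the session was created without the driver setting, for example because an older session was reused.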
Can you please help me with this error? Thank you!