Problems reading delta lake files in Databricks


In another question here I was having trouble reading all the files, and someone helped me out. I can now read them all, but could you give me some more help?

I'm reading 12 different files, and a new file is only inserted once a year.

Each of the 12 files corresponds to one year, and I want to add a column containing the year that file refers to.

In each file, the first line contains only "Fiscal year: 2013" (for example), and I want to turn that into a column with one year per file. But since all the files are read together, every row ends up with the same single year.

I'm doing it this way:

import re

# Extract the year from the file header.
# Note: path_files covers all 12 files, so .first() returns the first
# line of only one of them -- which is why every row gets the same year.
first_line = spark.read.text(path_files).first()[0]
file_year = re.search(r"\d{4}", first_line).group()
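Since `spark.read.text(path_files)` loads all 12 files into a single DataFrame, `.first()` returns one header line and every row gets the same year. One way around this is to read each file separately and extract its own year. The header-parsing step can be sketched in plain Python; the header strings and the `extract_year` helper below are illustrative, not part of the original post:

```python
import re

def extract_year(first_line):
    # Pull the first 4-digit number out of a header like "Fiscal year: 2013".
    match = re.search(r"\d{4}", first_line)
    if match is None:
        raise ValueError(f"no year found in header: {first_line!r}")
    return match.group()

# Illustrative header lines, one per file:
headers = ["Fiscal year: 2013", "Fiscal year: 2014", "Fiscal year: 2015"]
print([extract_year(h) for h in headers])  # ['2013', '2014', '2015']
```

In Databricks you would loop over the file paths (for example from `dbutils.fs.ls`), call `spark.read.text(path).first()[0]` for each file, attach that file's year with `withColumn`, and union the per-file DataFrames.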

There is 1 answer below

Answered by Karthikeyan Rasipalay Durairaj:

You can use the following logic for your requirement.

Note: based on the sample file you shared, I could not determine the file's delimiter, so I have assumed it is a comma (',').

Input file format :

year: 2017  
CITY,COD
abc,123
def,456
geh,789

# Get the year of the data from the file's first line.

var_lines_df = spark.read.csv("/FileStore/tables/input1")
var_first_line = var_lines_df.first()[0]
var_year = var_first_line.split(":")[1].strip()
print(var_year)

# Create the dataframe, skipping the header line, and add a column with the year of the data.

from pyspark.sql.functions import lit

lines_to_skip = 1
# Read the raw lines and drop the first line ("year: 2017") by index.
lines_rdd = spark.sparkContext.textFile("/FileStore/tables/input1")
filtered_rdd = lines_rdd.zipWithIndex().filter(lambda x: x[1] >= lines_to_skip).map(lambda x: x[0])
# spark.read.csv also accepts an RDD of CSV strings.
df = spark.read.option("delimiter", ",").csv(filtered_rdd, header=True, inferSchema=True)
df_with_new_column = df.withColumn("year", lit(var_year))
display(df_with_new_column)
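The `zipWithIndex` step above simply drops the first N lines by index. The same logic, together with the year extraction and the added column, can be checked without Spark in plain Python; this is a toy analogue of the answer's pipeline using the sample input shown earlier, not the Spark code itself:

```python
# Toy, Spark-free analogue of the pipeline above.
lines = ["year: 2017", "CITY,COD", "abc,123", "def,456", "geh,789"]

# Extract the year from the first line, as the answer does with split(":").
year = lines[0].split(":")[1].strip()

# Skip the header line by index, mirroring zipWithIndex + filter.
lines_to_skip = 1
data_lines = [line for i, line in enumerate(lines) if i >= lines_to_skip]

# Parse the CSV rows and append the year as an extra column.
header, *rows = [line.split(",") for line in data_lines]
rows_with_year = [row + [year] for row in rows]
print(header + ["year"])   # ['CITY', 'COD', 'year']
print(rows_with_year[0])   # ['abc', '123', '2017']
```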
