Multiple rows in a Spark DataFrame using a schema


This is my schema:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType, IntegerType

my_schema = StructType([
    StructField('uid', StringType(), True),
    StructField('test_id', StringType(), True),
    StructField("struct_ids", ArrayType(
        StructType([
            StructField("st", IntegerType(), True),
            StructField("mt", IntegerType(), True),
        ])
    ))
])

This is my data:

my_data = {'table_test': {'uid': 'test',
                          'test_id': 'test',
                          'struct_ids': [{'st': 1234, 'mt': 1111}, {'st': 6789, 'mt': 2222}]}}

This is how I create a DataFrame, and it works:

df = spark.createDataFrame(data=[my_data['table_test']], schema=my_schema)

How do I create multiple rows? For example, how can I add the row below, either when the DataFrame is created or later?

{'uid': 'test2',
 'test_id': 'test2',
 'struct_ids': [{'st': 3333, 'mt': 114411}, {'st': 333, 'mt': 444}]}

Wrapping the rows in an array did not work for me.
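A minimal sketch of the two usual approaches, assuming the same spark session and the my_schema defined above:

row2 = {'uid': 'test2',
        'test_id': 'test2',
        'struct_ids': [{'st': 3333, 'mt': 114411}, {'st': 333, 'mt': 444}]}

# At creation time: pass one dict per row.
df = spark.createDataFrame(data=[my_data['table_test'], row2], schema=my_schema)

# Or after creation: union on a one-row DataFrame built with the same schema.
df = df.union(spark.createDataFrame(data=[row2], schema=my_schema))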


2 Answers

Answer from Shubham Sharma

Explode the array of structs with F.inline, which produces one output row per struct and one column per struct field:

from pyspark.sql import functions as F

df.select('*', F.inline('struct_ids')).drop('struct_ids').show()

+----+-------+----+----+
| uid|test_id|  st|  mt|
+----+-------+----+----+
|test|   test|1234|1111|
|test|   test|6789|2222|
+----+-------+----+----+
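Note that F.inline is a relatively recent addition to pyspark.sql.functions, although the underlying SQL function is much older. A sketch of two equivalents for older PySpark versions, assuming the df from the question:

from pyspark.sql import functions as F

# Reach the SQL inline() function directly through expr.
df.select('*', F.expr('inline(struct_ids)')).drop('struct_ids').show()

# Or explode to one row per struct, then flatten the struct's fields.
(df.select('uid', 'test_id', F.explode('struct_ids').alias('s'))
   .select('uid', 'test_id', 's.st', 's.mt')
   .show())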
Answer from Akhaya Chandan Mishra

If your required output is one row per struct, with each row keeping its uid and test_id and a single-element struct_ids array, split the data yourself while creating the table:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, IntegerType
spark = SparkSession.builder.appName("SplitDataExample").getOrCreate()
my_schema = StructType([
    StructField('uid', StringType(), True),
    StructField('test_id', StringType(), True),
    StructField("struct_ids", ArrayType(
        StructType([
            StructField("st", IntegerType(), True),
            StructField("mt", IntegerType(), True),
        ])
    ))
])
my_data = {
    'table_test': {
        'uid': 'test',
        'test_id': 'test',
        'struct_ids': [{'st': 1234, 'mt': 1111}, {'st': 6789, 'mt': 2222}]
    }
}
# Split the array: build one tuple per struct, wrapping each struct
# in a single-element list so it still matches the ArrayType schema.
uid = my_data['table_test']['uid']
test_id = my_data['table_test']['test_id']
struct_ids = my_data['table_test']['struct_ids']
rows = [(uid, test_id, [struct]) for struct in struct_ids]
df = spark.createDataFrame(rows, schema=my_schema)
df.show(truncate=False)
spark.stop()
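For reference, on Spark 3.x the df.show(truncate=False) above should print roughly:

+----+-------+--------------+
|uid |test_id|struct_ids    |
+----+-------+--------------+
|test|test   |[{1234, 1111}]|
|test|test   |[{6789, 2222}]|
+----+-------+--------------+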