Multiple rows in a Spark DataFrame using a schema


This is my schema:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType, IntegerType

my_schema = StructType([
    StructField('uid', StringType(), True),
    StructField('test_id', StringType(), True),
    StructField("struct_ids", ArrayType(
        StructType([
            StructField("st", IntegerType(), True),
            StructField("mt", IntegerType(), True),
        ])
    ))
])

This is my data:

my_data = {'table_test': {'uid': 'test',
                          'test_id': 'test',
                          'struct_ids': [{'st': 1234, 'mt': 1111}, {'st': 6789, 'mt': 2222}]}}

This is how I create a DataFrame, and it works:

df = spark.createDataFrame(data=[my_data['table_test']], schema=my_schema)

How do I create multiple rows? For example, how can I add the row below, either when the DataFrame is created or later?

{'uid': 'test2',
 'test_id': 'test2',
 'struct_ids': [{'st': 3333, 'mt': 114411}, {'st': 333, 'mt': 444}]}

Wrapping the rows in an array did not work for me.
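A minimal sketch of the two usual approaches, assuming the same spark session and the my_schema defined above:

row2 = {'uid': 'test2',
        'test_id': 'test2',
        'struct_ids': [{'st': 3333, 'mt': 114411}, {'st': 333, 'mt': 444}]}

# At creation time: pass one dict per row.
df = spark.createDataFrame(data=[my_data['table_test'], row2], schema=my_schema)

# Or after creation: union on a one-row DataFrame built with the same schema.
df = df.union(spark.createDataFrame(data=[row2], schema=my_schema))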


2 Answers

Answer from Shubham Sharma

Explode the array of structs with F.inline, which produces one output row per struct and one column per struct field:

from pyspark.sql import functions as F

df.select('*', F.inline('struct_ids')).drop('struct_ids').show()

+----+-------+----+----+
| uid|test_id|  st|  mt|
+----+-------+----+----+
|test|   test|1234|1111|
|test|   test|6789|2222|
+----+-------+----+----+
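Note that F.inline is a relatively recent addition to pyspark.sql.functions, although the underlying SQL function is much older. A sketch of two equivalents for older PySpark versions, assuming the df from the question:

from pyspark.sql import functions as F

# Reach the SQL inline() function directly through expr.
df.select('*', F.expr('inline(struct_ids)')).drop('struct_ids').show()

# Or explode to one row per struct, then flatten the struct's fields.
(df.select('uid', 'test_id', F.explode('struct_ids').alias('s'))
   .select('uid', 'test_id', 's.st', 's.mt')
   .show())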
Answer from Akhaya Chandan Mishra

If your required output is one row per struct, with each row keeping its uid and test_id and a single-element struct_ids array, split the data yourself while creating the table:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, IntegerType
spark = SparkSession.builder.appName("SplitDataExample").getOrCreate()
my_schema = StructType([
    StructField('uid', StringType(), True),
    StructField('test_id', StringType(), True),
    StructField("struct_ids", ArrayType(
        StructType([
            StructField("st", IntegerType(), True),
            StructField("mt", IntegerType(), True),
        ])
    ))
])
my_data = {
    'table_test': {
        'uid': 'test',
        'test_id': 'test',
        'struct_ids': [{'st': 1234, 'mt': 1111}, {'st': 6789, 'mt': 2222}]
    }
}
# Split the array: build one tuple per struct, wrapping each struct
# in a single-element list so it still matches the ArrayType schema.
uid = my_data['table_test']['uid']
test_id = my_data['table_test']['test_id']
struct_ids = my_data['table_test']['struct_ids']
rows = [(uid, test_id, [struct]) for struct in struct_ids]
df = spark.createDataFrame(rows, schema=my_schema)
df.show(truncate=False)
spark.stop()
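For reference, on Spark 3.x the df.show(truncate=False) above should print roughly:

+----+-------+--------------+
|uid |test_id|struct_ids    |
+----+-------+--------------+
|test|test   |[{1234, 1111}]|
|test|test   |[{6789, 2222}]|
+----+-------+--------------+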