StructType Columns
The basics
We can define StructType columns in typedspark as follows:
[1]:
from pyspark.sql.types import IntegerType, StringType
from typedspark import DataSet, StructType, Schema, Column
class Values(Schema):
    name: Column[StringType]
    severity: Column[IntegerType]

class Actions(Schema):
    consequeces: Column[StructType[Values]]
The sub-columns are reachable through the column's dtype.schema attribute, which gives us auto-complete (and refactoring support) for them:
[2]:
def get_high_severity_actions(df: DataSet[Actions]) -> DataSet[Actions]:
    return df.filter(Actions.consequeces.dtype.schema.severity > 5)
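As a rough mental model for why the typed reference is refactoring-friendly, the toy sketch below builds the dotted path that Spark uses to address a struct field (here consequeces.severity) from a chain of parent references, so each name is written in exactly one place. ColumnRef and its path property are hypothetical names invented for this illustration; this is an assumption-laden sketch, not typedspark's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnRef:
    """Toy stand-in for a typed column reference (hypothetical, not typedspark's Column)."""

    name: str
    parent: Optional["ColumnRef"] = None

    @property
    def path(self) -> str:
        # Prefix the field name with its parent's path, if it has a parent.
        if self.parent is None:
            return self.name
        return f"{self.parent.path}.{self.name}"


consequeces = ColumnRef("consequeces")
severity = ColumnRef("severity", parent=consequeces)

print(severity.path)  # consequeces.severity
```

Renaming the struct column in this toy model means changing a single string, and every path derived from it follows along.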
Transform to schema
To populate a StructType column in transform_to_schema(), wrap the sub-column transformations in structtype_column():
[3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.ui.showConsoleProgress", "false").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
[4]:
from typedspark import create_partially_filled_dataset, transform_to_schema, structtype_column
class Input(Schema):
    a: Column[StringType]
    b: Column[IntegerType]

df = create_partially_filled_dataset(
    spark,
    Input,
    {
        Input.a: ["a", "b", "c"],
        Input.b: [1, 2, 3],
    },
)

transform_to_schema(
    df,
    Actions,
    {
        Actions.consequeces: structtype_column(
            Actions.consequeces.dtype.schema,
            {
                Actions.consequeces.dtype.schema.name: Input.a,
                Actions.consequeces.dtype.schema.severity: Input.b,
            },
        )
    },
).show()
+-----------+
|consequeces|
+-----------+
|     {a, 1}|
|     {b, 2}|
|     {c, 3}|
+-----------+
Note that, just as in transform_to_schema(), the keys of the transformations dictionary passed to structtype_column(..., transformations) must be columns with unique names.
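To see why unique names matter: if two keys resolved to the same column name, it would be ambiguous which expression should fill that column. The plain-Python sketch below shows one way such a clash could be detected up front. find_duplicate_names is a hypothetical helper written for this illustration only; it is not part of typedspark, and the library's own check may work differently.

```python
from collections import Counter
from typing import Iterable, List


def find_duplicate_names(names: Iterable[str]) -> List[str]:
    """Return the names that occur more than once, sorted alphabetically."""
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


# Names of the keys used in the structtype_column() call above: all unique.
ok_keys = ["name", "severity"]
# A hypothetical mapping that targets "severity" twice.
bad_keys = ["name", "severity", "severity"]

print(find_duplicate_names(ok_keys))   # []
print(find_duplicate_names(bad_keys))  # ['severity']
```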
Generating DataSets
We can generate DataSets with StructType columns as follows:
[5]:
from typedspark import create_partially_filled_dataset
values = create_partially_filled_dataset(
    spark,
    Values,
    {
        Values.severity: [1, 2, 3],
    },
)

actions = create_partially_filled_dataset(
    spark,
    Actions,
    {
        Actions.consequeces: values.collect(),
    },
)
actions.show()
+-----------+
|consequeces|
+-----------+
|  {NULL, 1}|
|  {NULL, 2}|
|  {NULL, 3}|
+-----------+
Or in row-wise format:
[6]:
from typedspark import create_structtype_row
create_partially_filled_dataset(
    spark,
    Actions,
    [
        {
            Actions.consequeces: create_structtype_row(
                Values, {Values.name: "a", Values.severity: 1}
            ),
        },
        {
            Actions.consequeces: create_structtype_row(
                Values, {Values.name: "b", Values.severity: 2}
            ),
        },
        {
            Actions.consequeces: create_structtype_row(
                Values, {Values.name: "c", Values.severity: 3}
            ),
        },
    ],
).show()
+-----------+
|consequeces|
+-----------+
|     {a, 1}|
|     {b, 2}|
|     {c, 3}|
+-----------+
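The column-wise and row-wise formats carry the same information. As a plain-Python sketch (no Spark involved), the hypothetical helper columns_to_rows below turns column-wise data into row-wise records and fills every field that was not supplied with None, which mirrors the NULL shown for name in the column-wise example. It only illustrates the idea and should not be read as typedspark's implementation.

```python
from typing import Any, Dict, List, Sequence


def columns_to_rows(
    fields: Sequence[str], columns: Dict[str, List[Any]]
) -> List[Dict[str, Any]]:
    """Convert column-wise data to row-wise dicts, using None for missing fields."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all supplied columns must have the same length")
    n_rows = lengths.pop() if lengths else 0
    return [
        {field: columns[field][i] if field in columns else None for field in fields}
        for i in range(n_rows)
    ]


# Only "severity" is supplied, so "name" is filled with None in every row.
rows = columns_to_rows(["name", "severity"], {"severity": [1, 2, 3]})
print(rows[0])  # {'name': None, 'severity': 1}
```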