Auto-complete & easier refactoring using schema attributes
Schemas allow us to replace the numerous column-name strings scattered throughout the code with schema attributes. Consider the following example:
[1]:
from typedspark import Column, DataSet, Schema
from pyspark.sql.types import LongType, StringType
from pyspark.sql.functions import col


class Person(Schema):
    id: Column[LongType]
    name: Column[StringType]
    age: Column[LongType]


def birthday(df: DataSet[Person]) -> DataSet[Person]:
    return DataSet[Person](
        df.withColumn("age", col("age") + 1),
    )
We can replace this with:
[2]:
def birthday(df: DataSet[Person]) -> DataSet[Person]:
    return DataSet[Person](
        df.withColumn(Person.age.str, Person.age + 1),
    )
This allows:
Autocomplete of column names during coding
Easy refactoring of column names
Note that we have two options when using schema attributes:
Person.age, which is similar to a Spark Column object (i.e. col("age"))
Person.age.str, which is just the column name (i.e. "age")
It is usually fairly obvious which one to use. For instance, in the above example, withColumn() expects a string as the first argument and a Column object as the second argument.
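The distinction between the two forms can be sketched without Spark. The toy model below is not how typedspark is implemented; ToyExpr, ToyColumn, ToyPerson and with_column are hypothetical names invented for this illustration, and the sample row is made up. It only shows the idea: the attribute itself behaves like an expression, while .str is the plain column name.

```python
from typing import Any, Callable, Dict

Row = Dict[str, Any]


class ToyExpr:
    """Hypothetical stand-in for a Spark Column: an expression evaluated per row."""

    def __init__(self, fn: Callable[[Row], Any]) -> None:
        self.fn = fn

    def __add__(self, other: Any) -> "ToyExpr":
        # Build a new expression instead of computing a value right away.
        return ToyExpr(lambda row: self.fn(row) + other)


class ToyColumn(ToyExpr):
    """Hypothetical schema attribute: an expression that also exposes its name as .str."""

    def __init__(self) -> None:
        # The name is looked up lazily, once __set_name__ has filled in self.str.
        super().__init__(lambda row: row[self.str])

    def __set_name__(self, owner: type, name: str) -> None:
        # Python calls this when the class is created, passing the attribute name.
        self.str = name


class ToyPerson:
    id = ToyColumn()
    name = ToyColumn()
    age = ToyColumn()


def with_column(row: Row, name: str, expr: ToyExpr) -> Row:
    """Toy analogue of withColumn(): a string name first, an expression second."""
    if not isinstance(name, str):
        raise TypeError("first argument must be a column name (use .str)")
    if not isinstance(expr, ToyExpr):
        raise TypeError("second argument must be an expression")
    return {**row, name: expr.fn(row)}


# Made-up sample row, standing in for one record of a DataFrame.
person = {"id": 1, "name": "Ann", "age": 30}
older = with_column(person, ToyPerson.age.str, ToyPerson.age + 1)
print(older)
```

In this sketch, passing ToyPerson.age where a name is expected raises a TypeError, which mirrors the rule of thumb above: use .str wherever a method expects a column name, and the attribute itself wherever it expects a column expression.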