Auto-complete & easier refactoring using schema attributes

Schemas allow us to replace the numerous strings throughout the code with schema attributes. Consider the following example:

[1]:

from typedspark import Column, DataSet, Schema
from pyspark.sql.types import LongType, StringType
from pyspark.sql.functions import col


class Person(Schema):
    id: Column[LongType]
    name: Column[StringType]
    age: Column[LongType]


def birthday(df: DataSet[Person]) -> DataSet[Person]:
    # The column name "age" appears twice as a hardcoded string.
    return DataSet[Person](
        df.withColumn("age", col("age") + 1),
    )

We can replace this with:

[2]:

def birthday(df: DataSet[Person]) -> DataSet[Person]:
    # Person.age.str is the column name; Person.age is the Column expression.
    return DataSet[Person](
        df.withColumn(Person.age.str, Person.age + 1),
    )

This allows:

Autocomplete of column names during coding
Easy refactoring of column names

Note that we have two options when using schema attributes:

Person.age, which is similar to a Spark Column object (i.e. col("age"))
Person.age.str, which is just the column name (i.e. "age")

Which one to use usually follows from the signature of the function being called. For instance, in the above example, withColumn() expects a string (the column name) as the first argument and a Column object as the second argument.