Advanced usage of type hints

Functions that do not affect the schema

There are a number of functions in DataSet which do not affect the schema. For example:

[1]:

from pyspark.sql import SparkSession

spark = SparkSession.Builder().config("spark.ui.showConsoleProgress", "false").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

[2]:

from typedspark import Column, Schema, DataSet, create_partially_filled_dataset
from pyspark.sql.types import StringType


class A(Schema):
    a: Column[StringType]


df = create_partially_filled_dataset(
    spark,
    A,
    {
        A.a: ["a", "b", "c"],
    },
)
res = df.filter(A.a == "a")

In the above example, filter() will not actually make any changes to the schema, hence we have implemented the return type of DataSet.filter() to be a DataSet of the same Schema that you started with. In other words, a linter will see that res is of the type DataSet[A].

This allows you to skip casting steps in many cases and instead define functions as:

[3]:

def foo(df: DataSet[A]) -> DataSet[A]:
    return df.filter(A.a == "a")

The functions for which this is currently implemented include:

filter()
distinct()
orderBy()
where()
alias()
persist()
unpersist()

Functions applied to two DataSets of the same schema

Similarly, some functions return a DataSet[A] when they take two DataSet[A] as an input. For example, here a linter will see that res is of the type DataSet[A].

[4]:

df_a = create_partially_filled_dataset(spark, A, {A.a: ["a", "b", "c"]})
df_b = create_partially_filled_dataset(spark, A, {A.a: ["d", "e", "f"]})

res = df_a.unionByName(df_b)

The functions in this category include:

unionByName()
join(..., how="semi")

Transformations

Finally, the transform() function can also be typed. In the following example, a linter will see that res is of the type DataSet[B].

[5]:

from typedspark import transform_to_schema
from pyspark.sql.functions import lit


class B(A):
    b: Column[StringType]


def foo(df: DataSet[A]) -> DataSet[A]:
    return transform_to_schema(
        df,
        B,
        {
            B.b: lit("hi"),
        },
    )


res = create_partially_filled_dataset(
    spark,
    A,
    {
        A.a: ["a", "b", "c"],
    },
).transform(foo)

Did we miss anything?

There are likely more functions that we did not yet cover. Feel free to make an issue and reach out when there is one that you’d like typedspark to support!