Advanced usage of type hints
Functions that do not affect the schema
There are a number of functions in DataSet which do not affect the schema. For example:
[1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.ui.showConsoleProgress", "false").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
[2]:
from typedspark import Column, Schema, DataSet, create_partially_filled_dataset
from pyspark.sql.types import StringType
class A(Schema):
    a: Column[StringType]

# Build a small DataSet[A] with three rows in column a
df = create_partially_filled_dataset(
    spark,
    A,
    {
        A.a: ["a", "b", "c"],
    },
)
res = df.filter(A.a == "a")
In the above example, filter() does not change the schema, so DataSet.filter() is typed to return a DataSet with the same Schema it was called on. In other words, a linter will see that res is of the type DataSet[A].
This allows you to skip casting steps in many cases and instead define functions as:
[3]:
def foo(df: DataSet[A]) -> DataSet[A]:
    return df.filter(A.a == "a")
The functions for which this is currently implemented include:
filter(), distinct(), orderBy(), where(), alias(), persist(), unpersist()
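To make the idea concrete, here is a minimal sketch of how a schema-preserving return type can be expressed with Python generics. It runs without Spark. TypedFrame and its dict-based rows are hypothetical stand-ins for DataSet, and this is not typedspark's actual implementation; it only illustrates why a linter can keep track of the schema through such calls.

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")  # stands in for the schema class


class TypedFrame(Generic[T]):
    """Hypothetical stand-in for DataSet; rows are kept as plain dicts."""

    def __init__(self, rows: List[dict]) -> None:
        self.rows = rows

    def filter(self, predicate: Callable[[dict], bool]) -> "TypedFrame[T]":
        # Rows may be dropped, but the columns are untouched, so the
        # schema parameter T is carried through to the return type.
        return TypedFrame([row for row in self.rows if predicate(row)])

    def distinct(self) -> "TypedFrame[T]":
        # Removing duplicate rows does not change the columns either.
        seen: List[dict] = []
        for row in self.rows:
            if row not in seen:
                seen.append(row)
        return TypedFrame(seen)


class A:
    """Hypothetical schema marker with a single column a."""


df: TypedFrame[A] = TypedFrame([{"a": "a"}, {"a": "b"}, {"a": "a"}])

# A type checker infers TypedFrame[A] for res, so no cast is needed.
res = df.filter(lambda row: row["a"] == "a").distinct()
```

Because every method in the chain returns TypedFrame[T], the schema survives any number of chained calls.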
Functions applied to two DataSets of the same schema
Similarly, some functions return a DataSet[A] when they are given two DataSet[A] objects as input. For example, here a linter will see that res is of the type DataSet[A].
[4]:
df_a = create_partially_filled_dataset(spark, A, {A.a: ["a", "b", "c"]})
df_b = create_partially_filled_dataset(spark, A, {A.a: ["d", "e", "f"]})
res = df_a.unionByName(df_b)
The functions in this category include:
unionByName(), join(..., how="semi")
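One way such signatures could be written is with typing.overload and Literal, so that only the "semi" variant of a join promises to preserve the schema. The sketch below assumes a hypothetical TypedFrame stand-in with dict-based rows and snake_case method names; it is not typedspark's actual code and it runs without Spark.

```python
from typing import Any, Generic, List, Literal, TypeVar, overload

T = TypeVar("T")  # stands in for the schema class


class TypedFrame(Generic[T]):
    """Hypothetical stand-in for DataSet; rows are kept as plain dicts."""

    def __init__(self, rows: List[dict]) -> None:
        self.rows = rows

    def union_by_name(self, other: "TypedFrame[T]") -> "TypedFrame[T]":
        # Both inputs share schema T, so the result has schema T as well.
        return TypedFrame(self.rows + other.rows)

    @overload
    def join(
        self, other: "TypedFrame[Any]", on: str, how: Literal["semi"]
    ) -> "TypedFrame[T]": ...

    @overload
    def join(
        self, other: "TypedFrame[Any]", on: str, how: str = ...
    ) -> "TypedFrame[Any]": ...

    def join(self, other, on, how="inner"):
        if how == "semi":
            # A semi join keeps only left-hand rows that have a match,
            # and only left-hand columns, so the left schema is preserved.
            keys = {row[on] for row in other.rows}
            return TypedFrame([row for row in self.rows if row[on] in keys])
        # Any other value is treated as an inner join in this sketch;
        # the merged rows may have extra columns, hence TypedFrame[Any].
        return TypedFrame(
            [{**l, **r} for l in self.rows for r in other.rows if l[on] == r[on]]
        )


class A:
    """Hypothetical schema marker with a single column a."""


df_a: TypedFrame[A] = TypedFrame([{"a": "a"}, {"a": "b"}])
df_b: TypedFrame[A] = TypedFrame([{"a": "b"}, {"a": "c"}])

res_union = df_a.union_by_name(df_b)           # checker sees TypedFrame[A]
res_semi = df_a.join(df_b, on="a", how="semi")  # checker sees TypedFrame[A]
```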
Transformations
Finally, the transform() function can also be typed. In the following example, a linter will see that res is of the type DataSet[B].
[5]:
from typedspark import transform_to_schema
from pyspark.sql.functions import lit
class B(A):
    b: Column[StringType]

# foo adds column b, so it returns a DataSet[B] rather than a DataSet[A]
def foo(df: DataSet[A]) -> DataSet[B]:
    return transform_to_schema(
        df,
        B,
        {
            B.b: lit("hi"),
        },
    )

res = create_partially_filled_dataset(
    spark,
    A,
    {
        A.a: ["a", "b", "c"],
    },
).transform(foo)
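The typing of transform() can be pictured as taking the return type from the function that is passed in. The following is a minimal sketch under that assumption, using a hypothetical TypedFrame stand-in with dict-based rows rather than typedspark's real classes, so it runs without Spark.

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")  # schema of the input frame
U = TypeVar("U")  # schema returned by the transforming function


class TypedFrame(Generic[T]):
    """Hypothetical stand-in for DataSet; rows are kept as plain dicts."""

    def __init__(self, rows: List[dict]) -> None:
        self.rows = rows

    def transform(
        self, func: "Callable[[TypedFrame[T]], TypedFrame[U]]"
    ) -> "TypedFrame[U]":
        # The result type is whatever func declares that it returns.
        return func(self)


class A:
    """Hypothetical schema marker with column a."""


class B(A):
    """Hypothetical schema marker with columns a and b."""


def add_b(df: TypedFrame[A]) -> TypedFrame[B]:
    # Add a constant column b to every row.
    return TypedFrame([{**row, "b": "hi"} for row in df.rows])


df: TypedFrame[A] = TypedFrame([{"a": "a"}, {"a": "b"}])

# A type checker infers TypedFrame[B] for res from add_b's annotation.
res = df.transform(add_b)
```

Note that the inferred type is only as accurate as the annotation on the function passed to transform().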
Did we miss anything?
There are likely more functions that we have not yet covered. Feel free to open an issue and reach out when there is one that you’d like typedspark to support!