Easier unit testing through the creation of empty DataSets from schemas

We provide helper functions that generate empty or partially filled DataSets from existing schemas. This is particularly helpful in unit tests, where you often need only a few columns populated.

Column-wise definition of your DataSets

First, let us consider the column-wise definition of new DataSets. This is useful when you have few rows, but many columns.

[1]:

from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.ui.showConsoleProgress", "false").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

[2]:

from typedspark import Column, Schema, create_empty_dataset, create_partially_filled_dataset
from pyspark.sql.types import LongType, StringType


class Person(Schema):
    id: Column[LongType]
    name: Column[StringType]
    age: Column[LongType]


# By default, the empty DataSet contains three rows, with every column set to NULL.
df_empty = create_empty_dataset(spark, Person)
df_empty.show()

+----+----+----+
|  id|name| age|
+----+----+----+
|NULL|NULL|NULL|
|NULL|NULL|NULL|
|NULL|NULL|NULL|
+----+----+----+

[3]:

df_partially_filled = create_partially_filled_dataset(
    spark,
    Person,
    {
        Person.id: [1, 2, 3],
        Person.name: ["John", "Jane", "Jack"],
    },
)
df_partially_filled.show()

+---+----+----+
| id|name| age|
+---+----+----+
|  1|John|NULL|
|  2|Jane|NULL|
|  3|Jack|NULL|
+---+----+----+

Row-wise definition of your DataSets

It is also possible to define your DataSets row by row. This is useful when you have few columns, but many rows. Columns that are left out of a row, such as id in the example below, are filled with NULL.

[4]:

create_partially_filled_dataset(
    spark,
    Person,
    [
        {Person.name: "Alice", Person.age: 20},
        {Person.name: "Bob", Person.age: 30},
        {Person.name: "Charlie", Person.age: 40},
        {Person.name: "Dave", Person.age: 50},
        {Person.name: "Eve", Person.age: 60},
        {Person.name: "Frank", Person.age: 70},
        {Person.name: "Grace", Person.age: 80},
    ],
).show()

+----+-------+---+
|  id|   name|age|
+----+-------+---+
|NULL|  Alice| 20|
|NULL|    Bob| 30|
|NULL|Charlie| 40|
|NULL|   Dave| 50|
|NULL|    Eve| 60|
|NULL|  Frank| 70|
|NULL|  Grace| 80|
+----+-------+---+
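To make the relationship between the two layouts concrete, the following plain-Python sketch turns a column-wise and a row-wise specification into the same list of rows, padding unspecified columns with None. It is purely illustrative: the helper names (rows_from_columns, rows_from_records) and the use of plain strings as column keys are assumptions made for this example, and it does not reflect how typedspark implements create_partially_filled_dataset internally.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical stand-in for the column names of the Person schema.
COLUMNS = ["id", "name", "age"]


def rows_from_columns(columns: List[str], data: Dict[str, List[Any]]) -> List[Tuple[Any, ...]]:
    """Column-wise input: one list of values per specified column."""
    lengths = {len(values) for values in data.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of values")
    n_rows = lengths.pop() if lengths else 0
    # Columns that were not specified become a list of None values.
    filled = [data.get(name, [None] * n_rows) for name in columns]
    return list(zip(*filled))


def rows_from_records(columns: List[str], records: List[Dict[str, Any]]) -> List[Tuple[Any, ...]]:
    """Row-wise input: one dict per row; missing keys become None."""
    return [tuple(record.get(name) for name in columns) for record in records]


column_wise = {"name": ["Alice", "Bob"], "age": [20, 30]}
row_wise = [{"name": "Alice", "age": 20}, {"name": "Bob", "age": 30}]

rows_a = rows_from_columns(COLUMNS, column_wise)
rows_b = rows_from_records(COLUMNS, row_wise)
print(rows_a)  # [(None, 'Alice', 20), (None, 'Bob', 30)]
print(rows_a == rows_b)  # True
```

Both layouts describe the same table, so the choice between them is mostly a matter of which one keeps the test data easier to read.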

Example unit test

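The following snippet shows what a complete unit test using typedspark can look like. It compares the observed and expected DataSets with chispa's assert_df_equality, and it expects the spark argument to be supplied by the test runner, for example through a pytest fixture.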

[5]:

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType
from typedspark import Column, DataSet, Schema, create_partially_filled_dataset, transform_to_schema
from chispa.dataframe_comparer import assert_df_equality


class Person(Schema):
    name: Column[StringType]
    age: Column[LongType]


def birthday(df: DataSet[Person]) -> DataSet[Person]:
    # Overwrite age with age + 1; name is carried over unchanged.
    return transform_to_schema(df, Person, {Person.age: Person.age + 1})


def test_birthday(spark: SparkSession):
    df = create_partially_filled_dataset(
        spark,
        Person,
        {
            Person.name: ["Alice", "Bob"],
            Person.age: [20, 30],
        },
    )

    observed = birthday(df)
    expected = create_partially_filled_dataset(
        spark,
        Person,
        {
            Person.name: ["Alice", "Bob"],
            Person.age: [21, 31],
        },
    )

    assert_df_equality(observed, expected, ignore_row_order=True, ignore_nullable=True)
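
The test above takes a spark argument that the snippet itself does not define. One common way to supply it is a session-scoped pytest fixture. The configuration fragment below is a minimal sketch under two assumptions that the text above does not state: that pytest is the test runner, and that the fixture lives in a conftest.py file next to the tests. The master URL and the app name are placeholder choices.

```python
# conftest.py (assumed location; pytest picks up fixtures defined here automatically)
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    """Provide one local SparkSession that is shared by all tests in the run."""
    session = (
        SparkSession.builder
        .master("local[1]")  # placeholder: a single local worker thread
        .appName("typedspark-unit-tests")  # placeholder name
        .config("spark.ui.showConsoleProgress", "false")
        .getOrCreate()
    )
    yield session
    # Runs once after the last test has finished.
    session.stop()
```

Scoping the fixture to the session avoids paying the SparkSession start-up cost for every test function.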