Easier unit testing through the creation of empty DataSets from schemas

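typedspark provides two helper functions, create_empty_dataset() and create_partially_filled_dataset(), that generate (partially) empty DataSets from existing schemas. This is particularly helpful in unit tests, where you usually only care about a few of the columns in a schema.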

Column-wise definition of your DataSets

First, let us consider the column-wise definition of new DataSets. This is useful when you have few rows, but many columns.

[1]:
from pyspark.sql import SparkSession

# Build (or reuse) a session; hide the console progress bar and reduce log noise.
spark = SparkSession.builder.config("spark.ui.showConsoleProgress", "false").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
[2]:
from typedspark import Column, Schema, create_empty_dataset, create_partially_filled_dataset
from pyspark.sql.types import LongType, StringType


class Person(Schema):
    id: Column[LongType]
    name: Column[StringType]
    age: Column[LongType]


# "Empty" means rows filled with nulls; three rows are generated by default.
df_empty = create_empty_dataset(spark, Person)
df_empty.show()
+----+----+----+
|  id|name| age|
+----+----+----+
|NULL|NULL|NULL|
|NULL|NULL|NULL|
|NULL|NULL|NULL|
+----+----+----+

[3]:
# Columns that are not listed (here: age) are filled with nulls.
df_partially_filled = create_partially_filled_dataset(
    spark,
    Person,
    {
        Person.id: [1, 2, 3],
        Person.name: ["John", "Jane", "Jack"],
    },
)
df_partially_filled.show()
+---+----+----+
| id|name| age|
+---+----+----+
|  1|John|NULL|
|  2|Jane|NULL|
|  3|Jack|NULL|
+---+----+----+
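
To make the "partially filled" idea concrete, here is a minimal plain-Python sketch of the same behaviour: every schema column that is not supplied is padded with None. It is an illustration only, not typedspark's implementation; fill_columns and its arguments are hypothetical names, and it works on plain strings and lists rather than on Column objects and DataSets.

```python
from typing import Any, Dict, List, Sequence, Tuple


def fill_columns(
    schema_columns: Sequence[str],
    data: Dict[str, List[Any]],
) -> List[Tuple[Any, ...]]:
    """Turn a column-wise definition into rows, using None for missing columns."""
    unknown = set(data) - set(schema_columns)
    if unknown:
        raise ValueError(f"Columns not in schema: {sorted(unknown)}")

    lengths = {len(values) for values in data.values()}
    if len(lengths) > 1:
        raise ValueError("All columns must have the same number of values")
    n_rows = lengths.pop() if lengths else 0

    # Columns that were not supplied become a list of None of the right length.
    filled = [data.get(name, [None] * n_rows) for name in schema_columns]

    # Transpose the list of columns into a list of rows.
    return list(zip(*filled))


rows = fill_columns(
    ["id", "name", "age"],
    {"id": [1, 2, 3], "name": ["John", "Jane", "Jack"]},
)
print(rows)
```

The resulting rows have the same shape as the table above: id and name are set, and age is None in every row.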

Row-wise definition of your DataSets

DataSets can also be defined row-wise, by passing a list of dictionaries with one dictionary per row. This is useful when you have few columns, but many rows.

[4]:
# Each dictionary describes one row; id is never set, so it is null in every row.
create_partially_filled_dataset(
    spark,
    Person,
    [
        {Person.name: "Alice", Person.age: 20},
        {Person.name: "Bob", Person.age: 30},
        {Person.name: "Charlie", Person.age: 40},
        {Person.name: "Dave", Person.age: 50},
        {Person.name: "Eve", Person.age: 60},
        {Person.name: "Frank", Person.age: 70},
        {Person.name: "Grace", Person.age: 80},
    ],
).show()
+----+-------+---+
|  id|   name|age|
+----+-------+---+
|NULL|  Alice| 20|
|NULL|    Bob| 30|
|NULL|Charlie| 40|
|NULL|   Dave| 50|
|NULL|    Eve| 60|
|NULL|  Frank| 70|
|NULL|  Grace| 80|
+----+-------+---+
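
The row-wise and column-wise forms carry the same information. As a rough plain-Python sketch (again not typedspark's own code; rows_to_columns is a hypothetical helper operating on plain strings), a list of row dictionaries can be rearranged into a column dictionary, with None wherever a row does not mention a column:

```python
from typing import Any, Dict, List, Sequence


def rows_to_columns(
    schema_columns: Sequence[str],
    rows: List[Dict[str, Any]],
) -> Dict[str, List[Any]]:
    """Convert a row-wise definition into a column-wise one, padding with None."""
    # dict.get() returns None when a row does not mention a column.
    return {name: [row.get(name) for row in rows] for name in schema_columns}


columns = rows_to_columns(
    ["id", "name", "age"],
    [
        {"name": "Alice", "age": 20},
        {"name": "Bob", "age": 30},
    ],
)
print(columns)
```

Which form to use is mostly a matter of readability: the row-wise form keeps the values of one record together, while the column-wise form avoids repeating the column names on every row.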

Example unit test

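The following code snippet shows what a full unit test using typedspark can look like. The test function receives its SparkSession through a spark argument, which is assumed to be supplied by the test runner (for example, as a pytest fixture), and uses chispa to compare the observed and expected DataSets.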

[5]:
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType
from typedspark import Column, DataSet, Schema, create_partially_filled_dataset, transform_to_schema
from chispa.dataframe_comparer import assert_df_equality


class Person(Schema):
    name: Column[StringType]
    age: Column[LongType]


def birthday(df: DataSet[Person]) -> DataSet[Person]:
    # Increment every person's age by one; name is carried over unchanged.
    return transform_to_schema(df, Person, {Person.age: Person.age + 1})


def test_birthday(spark: SparkSession):
    df = create_partially_filled_dataset(
        spark,
        Person,
        {
            Person.name: ["Alice", "Bob"],
            Person.age: [20, 30],
        },
    )

    observed = birthday(df)
    expected = create_partially_filled_dataset(
        spark,
        Person,
        {
            Person.name: ["Alice", "Bob"],
            Person.age: [21, 31],
        },
    )

    assert_df_equality(observed, expected, ignore_row_order=True, ignore_nullable=True)
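
The test above takes a spark argument but does not show where it comes from. One common way to supply it, assuming the tests are run with pytest, is a session-scoped fixture. The sketch below is an assumption about the test setup rather than part of typedspark; the file location, master setting and application name are illustrative choices.

```python
# conftest.py (hypothetical location; pytest discovers fixtures defined here)
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    """Provide one local SparkSession shared by all tests in the run."""
    session = (
        SparkSession.builder.master("local[1]")
        .appName("typedspark-unit-tests")  # illustrative name
        .config("spark.ui.showConsoleProgress", "false")
        .getOrCreate()
    )
    yield session
    # Runs once after the last test has finished.
    session.stop()
```

Scoping the fixture to the session avoids paying Spark's start-up cost for every test function.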