{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "f25abb69", "metadata": {}, "source": [ "# Easier unit testing through the creation of empty DataSets from schemas\n", "\n", "We provide helper functions to generate (partially) empty DataSets from existing schemas. This can be helpful in certain situations, such as unit testing.\n", "\n", "## Column-wise definition of your DataSets\n", "\n", "First, let us consider the column-wise definition of new `DataSets`. This is useful when you have few rows, but many columns." ] }, { "cell_type": "code", "execution_count": 1, "id": "25b679fc", "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "\n", "spark = SparkSession.Builder().config(\"spark.ui.showConsoleProgress\", \"false\").getOrCreate()\n", "spark.sparkContext.setLogLevel(\"ERROR\")" ] }, { "cell_type": "code", "execution_count": 2, "id": "7a2690c7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----+----+----+\n", "|  id|name| age|\n", "+----+----+----+\n", "|NULL|NULL|NULL|\n", "|NULL|NULL|NULL|\n", "|NULL|NULL|NULL|\n", "+----+----+----+\n", "\n" ] } ], "source": [ "from typedspark import Column, Schema, create_empty_dataset, create_partially_filled_dataset\n", "from pyspark.sql.types import LongType, StringType\n", "\n", "\n", "class Person(Schema):\n", "    id: Column[LongType]\n", "    name: Column[StringType]\n", "    age: Column[LongType]\n", "\n", "\n", "df_empty = create_empty_dataset(spark, Person)\n", "df_empty.show()" ] }, { "cell_type": "code", "execution_count": 3, "id": "d5ce76e1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---+----+----+\n", "| id|name| age|\n", "+---+----+----+\n", "|  1|John|NULL|\n", "|  2|Jane|NULL|\n", "|  3|Jack|NULL|\n", "+---+----+----+\n", "\n" ] } ], "source": [ "df_partially_filled = create_partially_filled_dataset(\n", "    spark,\n", "    Person,\n", "    {\n", "        Person.id: [1, 2, 3],\n", "        Person.name: [\"John\", \"Jane\", \"Jack\"],\n", "    },\n", ")\n", "df_partially_filled.show()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "571cf714", "metadata": {}, "source": [ "## Row-wise definition of your DataSets\n", "\n", "It is also possible to define your DataSets in a row-wise fashion. This is useful for cases where you have few columns, but many rows!" ] }, { "cell_type": "code", "execution_count": 4, "id": "a68e52bf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----+-------+---+\n", "|  id|   name|age|\n", "+----+-------+---+\n", "|NULL|  Alice| 20|\n", "|NULL|    Bob| 30|\n", "|NULL|Charlie| 40|\n", "|NULL|   Dave| 50|\n", "|NULL|    Eve| 60|\n", "|NULL|  Frank| 70|\n", "|NULL|  Grace| 80|\n", "+----+-------+---+\n", "\n" ] } ], "source": [ "create_partially_filled_dataset(\n", "    spark,\n", "    Person,\n", "    [\n", "        {Person.name: \"Alice\", Person.age: 20},\n", "        {Person.name: \"Bob\", Person.age: 30},\n", "        {Person.name: \"Charlie\", Person.age: 40},\n", "        {Person.name: \"Dave\", Person.age: 50},\n", "        {Person.name: \"Eve\", Person.age: 60},\n", "        {Person.name: \"Frank\", Person.age: 70},\n", "        {Person.name: \"Grace\", Person.age: 80},\n", "    ],\n", ").show()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "bd5c7c1c", "metadata": {}, "source": [ "## Example unit test\n", "\n", "The following code snippet shows what a full unit test using typedspark can look like."
] }, { "cell_type": "code", "execution_count": 5, "id": "f80ddebf", "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "from pyspark.sql.types import LongType, StringType\n", "from typedspark import Column, DataSet, Schema, create_partially_filled_dataset, transform_to_schema\n", "from chispa.dataframe_comparer import assert_df_equality\n", "\n", "\n", "class Person(Schema):\n", "    name: Column[StringType]\n", "    age: Column[LongType]\n", "\n", "\n", "def birthday(df: DataSet[Person]) -> DataSet[Person]:\n", "    return transform_to_schema(df, Person, {Person.age: Person.age + 1})\n", "\n", "\n", "# `spark` is assumed to be supplied by a pytest fixture (for example, one defined in conftest.py)\n", "def test_birthday(spark: SparkSession):\n", "    df = create_partially_filled_dataset(\n", "        spark,\n", "        Person,\n", "        {\n", "            Person.name: [\"Alice\", \"Bob\"],\n", "            Person.age: [20, 30],\n", "        },\n", "    )\n", "\n", "    observed = birthday(df)\n", "    expected = create_partially_filled_dataset(\n", "        spark,\n", "        Person,\n", "        {\n", "            Person.name: [\"Alice\", \"Bob\"],\n", "            Person.age: [21, 31],\n", "        },\n", "    )\n", "\n", "    assert_df_equality(observed, expected, ignore_row_order=True, ignore_nullable=True)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "f5e716a8", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "typedspark", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 5 }