{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "72077ffb", "metadata": {}, "source": [ "# Other complex datatypes\n", "Spark contains several other complex data types. \n", "\n", "## MapType, ArrayType, DecimalType and DayTimeIntervalType\n", "These can be used in `typedspark` as follows:" ] }, { "cell_type": "code", "execution_count": 1, "id": "934c2a2d", "metadata": {}, "outputs": [], "source": [ "from typing import Literal\n", "from pyspark.sql.types import StringType\n", "from typedspark import (\n", " ArrayType,\n", " DayTimeIntervalType,\n", " DecimalType,\n", " IntervalType,\n", " MapType,\n", " Schema,\n", " Column,\n", ")\n", "\n", "\n", "class Values(Schema):\n", " array: Column[ArrayType[StringType]]\n", " map: Column[MapType[StringType, StringType]]\n", " decimal: Column[DecimalType[Literal[38], Literal[18]]]\n", " interval: Column[DayTimeIntervalType[IntervalType.HOUR, IntervalType.SECOND]]" ] }, { "attachments": {}, "cell_type": "markdown", "id": "271da325", "metadata": {}, "source": [ "## Generating DataSets\n", "\n", "You can generate `DataSets` using complex data types in the following way:" ] }, { "cell_type": "code", "execution_count": 2, "id": "f82a8b4f", "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "\n", "spark = SparkSession.Builder().config(\"spark.ui.showConsoleProgress\", \"false\").getOrCreate()\n", "spark.sparkContext.setLogLevel(\"ERROR\")" ] }, { "cell_type": "code", "execution_count": 3, "id": "3ccb56ff", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------+--------+--------------------+--------------------+----------+-------------------+\n", "| array| map| decimal| interval| date| timestamp|\n", "+---------+--------+--------------------+--------------------+----------+-------------------+\n", "|[a, b, c]|{a -> b}|32.00000000000000...|INTERVAL '26:03:0...|2020-01-01|2020-01-01 10:15:00|\n", "+---------+--------+--------------------+--------------------+----------+-------------------+\n", "\n" ] } ], "source": [ "from datetime import date, datetime, timedelta\n", "from decimal import Decimal\n", "from pyspark.sql.types import DateType, TimestampType\n", "from typedspark._utils.create_dataset import create_partially_filled_dataset\n", "\n", "\n", "class MoreValues(Values):\n", " date: Column[DateType]\n", " timestamp: Column[TimestampType]\n", "\n", "\n", "create_partially_filled_dataset(\n", " spark,\n", " MoreValues,\n", " {\n", " MoreValues.array: [[\"a\", \"b\", \"c\"]],\n", " MoreValues.map: [{\"a\": \"b\"}],\n", " MoreValues.decimal: [Decimal(32)],\n", " MoreValues.interval: [timedelta(days=1, hours=2, minutes=3, seconds=4)],\n", " MoreValues.date: [date(2020, 1, 1)],\n", " MoreValues.timestamp: [datetime(2020, 1, 1, 10, 15)],\n", " },\n", ").show()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "8a2232b3", "metadata": {}, "source": [ "## Did we miss a data type?\n", "\n", "Feel free to make an issue! We can extend the list of supported data types." ] }, { "attachments": {}, "cell_type": "markdown", "id": "33d83e5f", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "typedspark", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 5 }