From e91e5fd02c1e2a2784a692c2cae06c922847243b Mon Sep 17 00:00:00 2001 From: Marcel Giar <marcel.giar@hpc-hessen.de> Date: Tue, 13 Sep 2022 15:58:13 +0200 Subject: [PATCH] Update course material --- .../WeatherData_Analysis.ipynb | 207 ++++++++++++++---- .../WeatherData_Analysis_tasks.ipynb | 107 +++++---- 2 files changed, 223 insertions(+), 91 deletions(-) diff --git a/exercises/Pandas_WeatherData/WeatherData_Analysis.ipynb b/exercises/Pandas_WeatherData/WeatherData_Analysis.ipynb index 3b51422..eea21f1 100644 --- a/exercises/Pandas_WeatherData/WeatherData_Analysis.ipynb +++ b/exercises/Pandas_WeatherData/WeatherData_Analysis.ipynb @@ -61,7 +61,7 @@ }, { "cell_type": "markdown", - "id": "eeb2132c", + "id": "56c155a5", "metadata": {}, "source": [ "URL to the dataset used in this exercise (*Source: Deutscher Wetterdienst*)." @@ -107,6 +107,24 @@ "After having imported the data gather some `info`rmation on the data (e.g. datatypes of columns or memory usage)." ] }, + { + "cell_type": "code", + "execution_count": null, + "id": "d5ec602a", + "metadata": {}, + "outputs": [], + "source": [ + "! 
head -n 10 tmp/produkt_tu_stunde_19500101_20211231_01639.txt" + ] + }, + { + "cell_type": "markdown", + "id": "06461f7c", + "metadata": {}, + "source": [ + "Import the data:" + ] + }, { "cell_type": "code", "execution_count": null, @@ -117,10 +135,22 @@ "df_weather = pd.read_csv(\n", " TMP_DIRECTORY / \"produkt_tu_stunde_19500101_20211231_01639.txt\",\n", " delimiter=\";\",\n", - " usecols=[\"MESS_DATUM\",\"TT_TU\", \"RF_TU\"]\n", + " usecols=[\n", + " \"MESS_DATUM\", # date of the measurement\n", + " \"TT_TU\", # temperature in deg Celsius\n", + " \"RF_TU\" # relative humidity\n", + " ]\n", ")" ] }, + { + "cell_type": "markdown", + "id": "65a852ea", + "metadata": {}, + "source": [ + "Inspect the first few lines of the `DataFrame`:" + ] + }, { "cell_type": "code", "execution_count": null, @@ -131,6 +161,14 @@ "df_weather.head()" ] }, + { + "cell_type": "markdown", + "id": "f05b58c7", + "metadata": {}, + "source": [ + "Inspect the last lines of the `DataFrame`:" + ] + }, { "cell_type": "code", "execution_count": null, @@ -141,6 +179,14 @@ "df_weather.tail()" ] }, + { + "cell_type": "markdown", + "id": "379a7374", + "metadata": {}, + "source": [ + "What is the memory usage of the current `DataFrame` instance?" + ] + }, { "cell_type": "code", "execution_count": null, @@ -166,7 +212,7 @@ "source": [ "The column `MESS_DATUM` contains the data of each measurement in the format `%Y%m%d%H`. The datatype of this column is `int64`.\n", "\n", - "Create a new `DataFrame` named `df_weather_cleaned` that is based on the original `df_weather` from above.\n", + "Create a new `DataFrame` named `df_weather_tweaked` that is based on the original `df_weather` from above.\n", "\n", "Make the following modifications to the `df_weather` `DataFrame` to generate a new one that is then assigned to the `df_weather_tweaked` variable.\n", "\n", @@ -199,6 +245,16 @@ "In all the following tasks you are supposed to work with the new modified `df_weather_tweaked`." 
] }, + { + "cell_type": "code", + "execution_count": null, + "id": "5000fee3", + "metadata": {}, + "outputs": [], + "source": [ + "pd.to_datetime(df_weather[\"MESS_DATUM\"], format=\"%Y%m%d%H\")" + ] + }, { "cell_type": "code", "execution_count": null, @@ -207,15 +263,14 @@ "outputs": [], "source": [ "df_weather_tweaked = (\n", - " df_weather\n", " # Transform the integer dates to a date-like format\n", - " .assign(\n", + " df_weather.assign(\n", " MESS_DATUM=pd.to_datetime(\n", - " df_weather[\"MESS_DATUM\"].astype('str'), \n", + " df_weather[\"MESS_DATUM\"], \n", " format=\"%Y%m%d%H\"\n", " )\n", " )\n", - " # Rename the columns\n", + " # Rename the column with the dates\n", " .rename(\n", " columns={\n", " \"MESS_DATUM\": \"Date of Measurement\",\n", @@ -223,7 +278,7 @@ " \"RF_TU\": \"Humidity\"\n", " }\n", " )\n", - " # Set a new index\n", + " # Set the measurement dates as index\n", " .set_index(\"Date of Measurement\")\n", " .astype(\n", " {\n", @@ -232,7 +287,7 @@ " }\n", " )\n", ")\n", - "df_weather_tweaked" + "df_weather_tweaked.head()" ] }, { @@ -245,10 +300,15 @@ "When measurements are taken over a long period of time it is quite likely the erroneous data sneaks into the dataset. Indeed, we should remove this data from the `DataFrame`.\n", "\n", "Analyse the dataset in a suitable manner to investigate if the measured values for the temperature and the relative humidity are present that seem reasonable.\n", - "\n", - "- Plot the distribution of the temperature and the relative humidity. Look for suitable functions in the [`pandas.DataFrame.plot`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html) module.\n", - "- Determine the smallest (minimal) as well as the largest (maximal) value for each of the data columns.\n", - "- Remove all conspicuously small or large values from the dataset. Make sure not to generate a new `DataFrame` but rather to perform all adjustments with the already-existing one. 
Afterwards check re-check your results to assure all " + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "1c6d2689", + "metadata": {}, + "source": [ + "- Plot the distribution of the temperature and the relative humidity. Look for suitable functions in the [`pandas.DataFrame.plot`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html) module." ] }, { @@ -260,10 +320,21 @@ "source": [ "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=\"row\")\n", "\n", - "ax1.set_xlabel(\"temperature / degree Celsius\")\n", - "df_weather_tweaked.plot.hist(ax=ax1, y=\"Temperature\", bins=50)\n", - "ax2.set_xlabel(\"relative humidity / %\")\n", - "df_weather_tweaked.plot.hist(ax=ax2, y=\"Humidity\", bins=50)" + "ax1.set_xlabel(\"Temperature / degree Celsius\")\n", + "ax1.set_yscale(\"log\")\n", + "df_weather_tweaked.plot.hist(ax=ax1, y=\"Temperature\", bins=46)\n", + "\n", + "ax2.set_xlabel(\"Relative Humidity / %\")\n", + "ax2.set_yscale(\"log\")\n", + "df_weather_tweaked.plot.hist(ax=ax2, y=\"Humidity\", bins=46)" + ] + }, + { + "cell_type": "markdown", + "id": "d9c3c339", + "metadata": {}, + "source": [ + "- Determine the smallest (minimal) as well as the largest (maximal) value for each of the data columns.\n" ] }, { @@ -276,6 +347,14 @@ "df_weather_tweaked.describe()" ] }, + { + "cell_type": "markdown", + "id": "e63a133a", + "metadata": {}, + "source": [ + "- Remove all conspicuously small or large values from the dataset. Make sure not to generate a new `DataFrame` but rather to perform all adjustments with the already-existing one. Afterwards, re-check your results." 
+ ] + }, { "cell_type": "code", "execution_count": null, @@ -283,9 +362,8 @@ "metadata": {}, "outputs": [], "source": [ - "# clear all values that are \n", - "boolean_mask = df_weather_tweaked.index[(df_weather[\"TT_TU\"] < -998.9999) | (df_weather[\"RF_TU\"] < -998.9999)]\n", - "df_weather_tweaked.drop(boolean_mask, inplace=True)" + "mask = (df_weather_tweaked[\"Temperature\"] < -998.9999) | (df_weather_tweaked[\"Humidity\"] < -998.9999) \n", + "df_weather_tweaked.drop(df_weather_tweaked.index[ mask ], inplace=True)" ] }, { @@ -305,13 +383,7 @@ "metadata": {}, "outputs": [], "source": [ - "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))\n", - "\n", - "ax1.set_xlabel(\"temperature / degree Celsius\")\n", - "ax2.set_xlabel(\"relative humidity / %\")\n", - "\n", - "df_weather_tweaked[\"Temperature\"].value_counts().plot.line(ax=ax1, style=\"o\")\n", - "df_weather_tweaked[\"Humidity\"].value_counts().plot.line(ax=ax2, style=\"s\")" + "# YOUR CODE GOES HERE" ] }, { @@ -328,10 +400,15 @@ "metadata": {}, "source": [ "### Monthly distribution of temperature and humidity\n", - "\n", - "- Group the data by month in which each measurement has been conducted. *Hint*: The `index` of the `DataFrame` has a `month` attribute.\n", - "\n", - "- Display the distribution of the temperature and the relative humidity for each month in a [violin plot](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.violinplot.html). The abscissa must show each month as a integer value while the ordinate must show the values for the temperature or the relative humidity, respectively. *Hint*: In order to extract the data from the subframes of the `DataFrameGroupBy` object you need to iterate over it in a suitable manner." + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "7bee5b7d", + "metadata": {}, + "source": [ + "- Group the data by month in which each measurement has been conducted. 
*Hint*: The `index` of the `DataFrame` has a `month` attribute.\n" ] }, { @@ -341,7 +418,15 @@ "metadata": {}, "outputs": [], "source": [ - "by_month = df_weather_tweaked.groupby(df_weather_tweaked.index.month)" + "grouped_by_month = df_weather_tweaked.groupby(df_weather_tweaked.index.month)" + ] + }, + { + "cell_type": "markdown", + "id": "28c23a20", + "metadata": {}, + "source": [ + "- Display the distribution of the temperature and the relative humidity for each month in a [violin plot](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.violinplot.html). The abscissa must show each month as an integer value while the ordinate must show the values for the temperature or the relative humidity, respectively. *Hint*: In order to extract the data from the subframes of the `DataFrameGroupBy` object you need to iterate over it in a suitable manner." ] }, { @@ -351,15 +436,25 @@ "metadata": {}, "outputs": [], "source": [ - "fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 5), sharex=\"col\")\n", + "fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(6, 6), sharex=\"col\")\n", "\n", - "ax2.set_xticks(range(1, 13))\n", + "ax2.set_xticks( range(1, 13 ) )\n", "ax2.set_xlabel(\"month\")\n", "\n", - "ax1.set_ylabel(\"temperature / deg. C\")\n", - "ax1.violinplot([subframe[\"Temperature\"] for _, subframe in by_month]); # added to avoid verbose output\n", - "ax2.set_ylabel(\"rel. 
humidity / %\")\n", - "ax2.violinplot([subframe[\"Humidity\"] for _, subframe in by_month]); # added tp avoid verbose output" + "ax1.set_ylabel(\"Temperature / degree Celsius\")\n", + "ax1.violinplot(\n", + " [\n", + " df[\"Temperature\"]\n", + " for _, df in grouped_by_month \n", + " ]\n", + ");\n", + "ax2.set_ylabel(\"Relative Humidity / %\")\n", + "ax2.violinplot(\n", + " [\n", + " df[\"Humidity\"]\n", + " for _, df in grouped_by_month \n", + " ]\n", + ");" ] }, { @@ -367,11 +462,16 @@ "id": "95b85f06", "metadata": {}, "source": [ - "### Yearly mean temperature\n", - "\n", - "- Group the data by the year in which the measurements have been conducted. Then use *two* different methods of your choice to compute the mean value of the temperatures in each subframe. The result is the average temperature for each year in the dataset.\n", + "### Yearly mean temperature" + ] + }, + { + "cell_type": "markdown", + "id": "44fbb458", + "metadata": {}, + "source": [ "\n", - "- Plot the results for the yearly averaged temperate in a suitable manner." + "- Group the data by the year in which the measurements have been conducted. Then use *two* different methods of your choice to compute the mean value of the temperatures in each subframe. The result is the average temperature for each year in the dataset." ] }, { @@ -381,7 +481,7 @@ "metadata": {}, "outputs": [], "source": [ - "by_year = df_weather_tweaked.groupby(df_weather_tweaked.index.year)" + "grouped_by_year = df_weather_tweaked.groupby(df_weather_tweaked.index.year)" ] }, { @@ -391,8 +491,16 @@ "metadata": {}, "outputs": [], "source": [ - "df_by_year_agg = by_year.agg([np.mean])\n", - "df_by_year_apply = by_year.apply(lambda x: x.mean())" + "grouped_by_year.agg([np.mean])\n", + "grouped_by_year.apply(lambda x: x.mean())" + ] + }, + { + "cell_type": "markdown", + "id": "c3000dae", + "metadata": {}, + "source": [ + "- Plot the results for the yearly averaged temperature in a suitable manner." 
] }, { @@ -402,13 +510,18 @@ "metadata": {}, "outputs": [], "source": [ - "df_by_year_apply[\"Temperature\"].plot.line(style=\"o\", xlabel=\"year\", ylabel=\"temperature / degree Celsius\")" + "ax = grouped_by_year.agg([np.mean])[\"Temperature\"].plot.line(\n", + " style=\"o\",\n", + " xlabel=\"year\",\n", + " ylabel=\"Temperature / degree Celsius\"\n", + ")\n", + "ax.set_title(\"Yearly average of temperature\")" ] }, { "cell_type": "code", "execution_count": null, - "id": "489abbfd", + "id": "9bcce04b", "metadata": {}, "outputs": [], "source": [] diff --git a/exercises/Pandas_WeatherData/WeatherData_Analysis_tasks.ipynb b/exercises/Pandas_WeatherData/WeatherData_Analysis_tasks.ipynb index f416747..2c43dd1 100644 --- a/exercises/Pandas_WeatherData/WeatherData_Analysis_tasks.ipynb +++ b/exercises/Pandas_WeatherData/WeatherData_Analysis_tasks.ipynb @@ -91,7 +91,7 @@ "source": [ "## Importing the measurement data\n", "\n", - "The file `produkt_tu_stunde_19500101_20201231_01639.txt` is a CSV file (although the suffix `.txt` conveys something else).\n", + "The file `produkt_tu_stunde_19500101_20211231_01639.txt` is a CSV file (although the suffix `.txt` conveys something else).\n", "\n", "The single columns of the file have the following headers:\n", "\n", @@ -107,6 +107,14 @@ "After having imported the data gather some `info`rmation on the data (e.g. datatypes of columns or memory usage)." 
] }, + { + "cell_type": "code", + "execution_count": null, + "id": "ff8e949d", + "metadata": {}, + "outputs": [], + "source": [] + }, { "cell_type": "markdown", "id": "06461f7c", @@ -121,9 +129,7 @@ "id": "f99173c3", "metadata": {}, "outputs": [], - "source": [ - "### YOUR CODE GOES HERE" - ] + "source": [] }, { "cell_type": "markdown", @@ -139,9 +145,7 @@ "id": "28ec724b", "metadata": {}, "outputs": [], - "source": [ - "### YOUR CODE GOES HERE" - ] + "source": [] }, { "cell_type": "markdown", @@ -157,9 +161,7 @@ "id": "a9448957", "metadata": {}, "outputs": [], - "source": [ - "### YOUR CODE GOES HERE" - ] + "source": [] }, { "cell_type": "markdown", @@ -175,9 +177,7 @@ "id": "0377e45a", "metadata": {}, "outputs": [], - "source": [ - "# YOUR CODE GOES HERE" - ] + "source": [] }, { "cell_type": "markdown", @@ -194,7 +194,7 @@ "source": [ "The column `MESS_DATUM` contains the data of each measurement in the format `%Y%m%d%H`. The datatype of this column is `int64`.\n", "\n", - "Create a new `DataFrame` named `df_weather_cleaned` that is based on the original `df_weather` from above.\n", + "Create a new `DataFrame` named `df_weather_tweaked` that is based on the original `df_weather` from above.\n", "\n", "Make the following modifications to the `df_weather` `DataFrame` to generate a new one that is then assigned to the `df_weather_tweaked` variable.\n", "\n", @@ -227,15 +227,21 @@ "In all the following tasks you are supposed to work with the new modified `df_weather_tweaked`." 
] }, + { + "cell_type": "code", + "execution_count": null, + "id": "e53d7a7f", + "metadata": {}, + "outputs": [], + "source": [] + }, { "cell_type": "code", "execution_count": null, "id": "8030b621", "metadata": {}, "outputs": [], - "source": [ - "# YOUR CODE GOES HERE" - ] + "source": [] }, { "cell_type": "markdown", @@ -264,9 +270,7 @@ "id": "2a28e418", "metadata": {}, "outputs": [], - "source": [ - "# YOUR CODE GOES HERE" - ] + "source": [] }, { "cell_type": "markdown", @@ -282,9 +286,7 @@ "id": "d16a3cfb", "metadata": {}, "outputs": [], - "source": [ - "# YOUR CODE GOES HERE" - ] + "source": [] }, { "cell_type": "markdown", @@ -300,9 +302,7 @@ "id": "3eb31737", "metadata": {}, "outputs": [], - "source": [ - "# YOUR CODE GOES HERE" - ] + "source": [] }, { "cell_type": "code", @@ -310,9 +310,7 @@ "id": "2ee9be9c", "metadata": {}, "outputs": [], - "source": [ - "# YOUR CODE GOES HERE" - ] + "source": [] }, { "cell_type": "code", @@ -320,9 +318,7 @@ "id": "546c3127", "metadata": {}, "outputs": [], - "source": [ - "# YOUR CODE GOES HERE" - ] + "source": [] }, { "cell_type": "markdown", @@ -356,7 +352,7 @@ "metadata": {}, "outputs": [], "source": [ - "# YOUR CODE GOES HERE" + "grouped_by_month = df_weather_tweaked.groupby(df_weather_tweaked.index.month)" ] }, { @@ -374,7 +370,28 @@ "metadata": {}, "outputs": [], "source": [ - "# YOUR CODE GOES HERE" + "fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(6, 6), sharex=\"col\")\n", + "\n", + "ax2.set_xticks( range(1, 13 ) )\n", + "ax2.set_xlabel(\"month\")\n", + "\n", + "ax1.set_ylabel(\"Temperature / degree Celsius\")\n", + "ax1.violinplot(\n", + " [\n", + " df[\"Temperature\"]\n", + " for _, df in grouped_by_month \n", + " ]\n", + ");\n", + "ax2.set_ylabel(\"Relative Humidity / %\")\n", + "ax2.violinplot(\n", + " [\n", + " df[\"Humidity\"]\n", + " for _, df in grouped_by_month \n", + " ]\n", + ");\n", + "\n", + "\n", + "\n" ] }, { @@ -400,9 +417,7 @@ "id": "b750cd9d", "metadata": {}, "outputs": [], - "source": [ - "# YOUR 
CODE GOES HERE" - ] + "source": [] }, { "cell_type": "code", @@ -410,9 +425,7 @@ "id": "a83a4e4b", "metadata": {}, "outputs": [], - "source": [ - "# YOUR CODE GOES HERE" - ] + "source": [] }, { "cell_type": "markdown", @@ -428,14 +441,20 @@ "id": "ed4157a9", "metadata": {}, "outputs": [], - "source": [ - "# YOUR CODE GOES HERE" - ] + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a9e943ec", + "metadata": {}, + "outputs": [], + "source": [] }, { "cell_type": "code", "execution_count": null, - "id": "9bcce04b", + "id": "c62bbfcd", "metadata": {}, "outputs": [], "source": [] -- GitLab
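
Note on the patch above: the notebook parses the integer `MESS_DATUM` column with the format `%Y%m%d%H` and drops rows whose temperature or humidity lies below -998.9999. The following is a minimal standard-library sketch of those two steps, not the notebook's pandas code. The sample rows and the names `rows`, `SENTINEL_LIMIT`, `parse_hourly` and `cleaned` are hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Hypothetical sample rows (MESS_DATUM, TT_TU, RF_TU); the values are invented.
rows = [
    (1950010100, -3.2, 83.0),
    (1950010101, -999.0, -999.0),  # placeholder-like row that should be dropped
    (1950010102, -3.5, 85.0),
]

# Same threshold as the mask in the notebook's cleaning cell.
SENTINEL_LIMIT = -998.9999


def parse_hourly(stamp: int) -> datetime:
    """Decode an integer such as 1950010100 using the %Y%m%d%H layout."""
    return datetime.strptime(str(stamp), "%Y%m%d%H")


# Keep only rows where neither value falls below the threshold,
# keyed by the parsed measurement time.
cleaned = {
    parse_hourly(stamp): (temperature, humidity)
    for stamp, temperature, humidity in rows
    if temperature >= SENTINEL_LIMIT and humidity >= SENTINEL_LIMIT
}

for when, (temperature, humidity) in cleaned.items():
    print(when.isoformat(), temperature, humidity)
```

In the notebook itself the same idea is expressed with `pd.to_datetime(..., format="%Y%m%d%H")` and a boolean mask passed to `DataFrame.drop`; the sketch only makes the format string and the threshold comparison explicit.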