From e91e5fd02c1e2a2784a692c2cae06c922847243b Mon Sep 17 00:00:00 2001
From: Marcel Giar <marcel.giar@hpc-hessen.de>
Date: Tue, 13 Sep 2022 15:58:13 +0200
Subject: [PATCH] Update course material

---
 .../WeatherData_Analysis.ipynb                | 207 ++++++++++++++----
 .../WeatherData_Analysis_tasks.ipynb          | 107 +++++----
 2 files changed, 223 insertions(+), 91 deletions(-)

diff --git a/exercises/Pandas_WeatherData/WeatherData_Analysis.ipynb b/exercises/Pandas_WeatherData/WeatherData_Analysis.ipynb
index 3b51422..eea21f1 100644
--- a/exercises/Pandas_WeatherData/WeatherData_Analysis.ipynb
+++ b/exercises/Pandas_WeatherData/WeatherData_Analysis.ipynb
@@ -61,7 +61,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "eeb2132c",
+   "id": "56c155a5",
    "metadata": {},
    "source": [
     "URL to the dataset used in this exercise (*Source: Deutscher Wetterdienst*)."
@@ -107,6 +107,24 @@
     "After having imported the data gather some `info`rmation on the data (e.g. datatypes of columns or memory usage)."
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d5ec602a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! head -n 10  tmp/produkt_tu_stunde_19500101_20211231_01639.txt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "06461f7c",
+   "metadata": {},
+   "source": [
+    "Import the data:"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -117,10 +135,22 @@
     "df_weather = pd.read_csv(\n",
     "    TMP_DIRECTORY / \"produkt_tu_stunde_19500101_20211231_01639.txt\",\n",
     "    delimiter=\";\",\n",
-    "    usecols=[\"MESS_DATUM\",\"TT_TU\", \"RF_TU\"]\n",
+    "    usecols=[\n",
+    "        \"MESS_DATUM\", # date of the measurement\n",
+    "        \"TT_TU\",      # temperature in deg Celsius\n",
+    "        \"RF_TU\"       # relative humidity\n",
+    "    ]\n",
     ")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "65a852ea",
+   "metadata": {},
+   "source": [
+    "Inspect the first few lines of the `DataFrame`:"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -131,6 +161,14 @@
     "df_weather.head()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "f05b58c7",
+   "metadata": {},
+   "source": [
+    "Inspect the last lines of the `DataFrame`:"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -141,6 +179,14 @@
     "df_weather.tail()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "379a7374",
+   "metadata": {},
+   "source": [
+    "What is the memory usage of the current `DataFrame` instance?"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -166,7 +212,7 @@
    "source": [
     "The column `MESS_DATUM` contains the data of each measurement in the format `%Y%m%d%H`. The datatype of this column is `int64`.\n",
     "\n",
-    "Create a new `DataFrame` named `df_weather_cleaned` that is based on the original `df_weather` from above.\n",
+    "Create a new `DataFrame` named `df_weather_tweaked` that is based on the original `df_weather` from above.\n",
     "\n",
     "Make the following modifications to the `df_weather` `DataFrame` to generate a new one that is then assigned to the `df_weather_tweaked` variable.\n",
     "\n",
@@ -199,6 +245,16 @@
     "In all the following tasks you are supposed to work with the new modified `df_weather_tweaked`."
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5000fee3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pd.to_datetime(df_weather[\"MESS_DATUM\"], format=\"%Y%m%d%H\")"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -207,15 +263,14 @@
    "outputs": [],
    "source": [
     "df_weather_tweaked = (\n",
-    "    df_weather\n",
     "    # Transform the integer dates to a date-like format\n",
-    "    .assign(\n",
+    "    df_weather.assign(\n",
     "        MESS_DATUM=pd.to_datetime(\n",
-    "            df_weather[\"MESS_DATUM\"].astype('str'), \n",
+    "            df_weather[\"MESS_DATUM\"], \n",
     "            format=\"%Y%m%d%H\"\n",
     "        )\n",
     "    )\n",
-    "    # Rename the columns\n",
+    "    # Rename the column with the dates\n",
     "    .rename(\n",
     "        columns={\n",
     "            \"MESS_DATUM\": \"Date of Measurement\",\n",
@@ -223,7 +278,7 @@
     "            \"RF_TU\": \"Humidity\"\n",
     "        }\n",
     "    )\n",
-    "    # Set a new index\n",
+    "    # Set the measurement dates as index\n",
     "    .set_index(\"Date of Measurement\")\n",
     "    .astype(\n",
     "        {\n",
@@ -232,7 +287,7 @@
     "        }\n",
     "    )\n",
     ")\n",
-    "df_weather_tweaked"
+    "df_weather_tweaked.head()"
    ]
   },
   {
@@ -245,10 +300,15 @@
     "When measurements are taken over a long period of time it is quite likely the erroneous data sneaks into the dataset. Indeed, we should remove this data from the `DataFrame`.\n",
     "\n",
     "Analyse the dataset in a suitable manner to investigate if the measured values for the temperature and the relative humidity are present that seem reasonable.\n",
-    "\n",
-    "- Plot the distribution of the temperature and the relative humidity. Look for suitable functions in the [`pandas.DataFrame.plot`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html) module.\n",
-    "- Determine the smallest (minimal) as well as the largest (maximal) value for each of the data columns.\n",
-    "- Remove all conspicuously small or large values from the dataset. Make sure not to generate a new `DataFrame` but rather to perform all adjustments with the already-existing one. Afterwards check re-check your results to assure all "
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1c6d2689",
+   "metadata": {},
+   "source": [
+    "- Plot the distribution of the temperature and the relative humidity. Look for suitable functions in the [`pandas.DataFrame.plot`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html) module."
    ]
   },
   {
@@ -260,10 +320,21 @@
    "source": [
     "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=\"row\")\n",
     "\n",
-    "ax1.set_xlabel(\"temperature / degree Celsius\")\n",
-    "df_weather_tweaked.plot.hist(ax=ax1, y=\"Temperature\", bins=50)\n",
-    "ax2.set_xlabel(\"relative humidity / %\")\n",
-    "df_weather_tweaked.plot.hist(ax=ax2, y=\"Humidity\", bins=50)"
+    "ax1.set_xlabel(\"Temperature / degree Celsius\")\n",
+    "ax1.set_yscale(\"log\")\n",
+    "df_weather_tweaked.plot.hist(ax=ax1, y=\"Temperature\", bins=46)\n",
+    "\n",
+    "ax2.set_xlabel(\"Relative Humidity / %\")\n",
+    "ax2.set_yscale(\"log\")\n",
+    "df_weather_tweaked.plot.hist(ax=ax2, y=\"Humidity\", bins=46)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d9c3c339",
+   "metadata": {},
+   "source": [
+    "- Determine the smallest (minimal) as well as the largest (maximal) value for each of the data columns.\n"
    ]
   },
   {
@@ -276,6 +347,14 @@
     "df_weather_tweaked.describe()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "e63a133a",
+   "metadata": {},
+   "source": [
+    "- Remove all conspicuously small or large values from the dataset. Make sure not to generate a new `DataFrame` but rather to perform all adjustments with the already-existing one. Afterwards, re-check your results."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -283,9 +362,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# clear all values that are \n",
-    "boolean_mask = df_weather_tweaked.index[(df_weather[\"TT_TU\"] < -998.9999) | (df_weather[\"RF_TU\"] < -998.9999)]\n",
-    "df_weather_tweaked.drop(boolean_mask, inplace=True)"
+    "mask = (df_weather_tweaked[\"Temperature\"] < -998.9999) | (df_weather_tweaked[\"Humidity\"] < -998.9999)\n",
+    "df_weather_tweaked.drop(df_weather_tweaked.index[mask], inplace=True)"
    ]
   },
   {
@@ -305,13 +383,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))\n",
-    "\n",
-    "ax1.set_xlabel(\"temperature / degree Celsius\")\n",
-    "ax2.set_xlabel(\"relative humidity / %\")\n",
-    "\n",
-    "df_weather_tweaked[\"Temperature\"].value_counts().plot.line(ax=ax1, style=\"o\")\n",
-    "df_weather_tweaked[\"Humidity\"].value_counts().plot.line(ax=ax2, style=\"s\")"
+    "# YOUR CODE GOES HERE"
    ]
   },
   {
@@ -328,10 +400,15 @@
    "metadata": {},
    "source": [
     "### Monthly distribution of temperature and humidity\n",
-    "\n",
-    "- Group the data by month in which each measurement has been conducted. *Hint*: The `index` of the `DataFrame` has a `month` attribute.\n",
-    "\n",
-    "- Display the distribution of the temperature and the relative humidity for each month in a [violin plot](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.violinplot.html). The abscissa must show each month as a integer value while the ordinate must show the values for the temperature or the relative humidity, respectively. *Hint*: In order to extract the data from the subframes of the `DataFrameGroupBy` object you need to iterate over it in a suitable manner."
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7bee5b7d",
+   "metadata": {},
+   "source": [
+    "- Group the data by month in which each measurement has been conducted. *Hint*: The `index` of the `DataFrame` has a `month` attribute.\n"
    ]
   },
   {
@@ -341,7 +418,15 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "by_month = df_weather_tweaked.groupby(df_weather_tweaked.index.month)"
+    "grouped_by_month = df_weather_tweaked.groupby(df_weather_tweaked.index.month)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "28c23a20",
+   "metadata": {},
+   "source": [
+    "- Display the distribution of the temperature and the relative humidity for each month in a [violin plot](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.violinplot.html). The abscissa must show each month as an integer value while the ordinate must show the values for the temperature or the relative humidity, respectively. *Hint*: In order to extract the data from the subframes of the `DataFrameGroupBy` object you need to iterate over it in a suitable manner."
    ]
   },
   {
@@ -351,15 +436,25 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 5), sharex=\"col\")\n",
+    "fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(6, 6), sharex=\"col\")\n",
     "\n",
-    "ax2.set_xticks(range(1, 13))\n",
+    "ax2.set_xticks(range(1, 13))\n",
     "ax2.set_xlabel(\"month\")\n",
     "\n",
-    "ax1.set_ylabel(\"temperature / deg. C\")\n",
-    "ax1.violinplot([subframe[\"Temperature\"] for _, subframe in by_month]); # added to avoid verbose output\n",
-    "ax2.set_ylabel(\"rel. humidity / %\")\n",
-    "ax2.violinplot([subframe[\"Humidity\"] for _, subframe in by_month]); # added tp avoid verbose output"
+    "ax1.set_ylabel(\"Temperature / degree Celsius\")\n",
+    "ax1.violinplot(\n",
+    "    [\n",
+    "        df[\"Temperature\"]\n",
+    "        for _, df in grouped_by_month \n",
+    "    ]\n",
+    ");\n",
+    "ax2.set_ylabel(\"Relative Humidity / %\")\n",
+    "ax2.violinplot(\n",
+    "    [\n",
+    "        df[\"Humidity\"]\n",
+    "        for _, df in grouped_by_month \n",
+    "    ]\n",
+    ");"
    ]
   },
   {
@@ -367,11 +462,16 @@
    "id": "95b85f06",
    "metadata": {},
    "source": [
-    "### Yearly mean temperature\n",
-    "\n",
-    "- Group the data by the year in which the measurements have been conducted. Then use *two* different methods of your choice to compute the mean value of the temperatures in each subframe. The result is the average temperature for each year in the dataset.\n",
+    "### Yearly mean temperature"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "44fbb458",
+   "metadata": {},
+   "source": [
     "\n",
-    "- Plot the results for the yearly averaged temperate in a suitable manner."
+    "- Group the data by the year in which the measurements have been conducted. Then use *two* different methods of your choice to compute the mean value of the temperatures in each subframe. The result is the average temperature for each year in the dataset."
    ]
   },
   {
@@ -381,7 +481,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "by_year = df_weather_tweaked.groupby(df_weather_tweaked.index.year)"
+    "grouped_by_year = df_weather_tweaked.groupby(df_weather_tweaked.index.year)"
    ]
   },
   {
@@ -391,8 +491,16 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "df_by_year_agg = by_year.agg([np.mean])\n",
-    "df_by_year_apply = by_year.apply(lambda x: x.mean())"
+    "grouped_by_year.agg([np.mean])\n",
+    "grouped_by_year.apply(lambda x: x.mean())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c3000dae",
+   "metadata": {},
+   "source": [
+    "- Plot the results for the yearly averaged temperature in a suitable manner."
    ]
   },
   {
@@ -402,13 +510,18 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "df_by_year_apply[\"Temperature\"].plot.line(style=\"o\", xlabel=\"year\", ylabel=\"temperature / degree Celsius\")"
+    "ax = grouped_by_year.agg([np.mean])[\"Temperature\"].plot.line(\n",
+    "    style=\"o\",\n",
+    "    xlabel=\"year\",\n",
+    "    ylabel=\"Temperature / degree Celsius\"\n",
+    ")\n",
+    "ax.set_title(\"Yearly average of temperature\")"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "489abbfd",
+   "id": "9bcce04b",
    "metadata": {},
    "outputs": [],
    "source": []
diff --git a/exercises/Pandas_WeatherData/WeatherData_Analysis_tasks.ipynb b/exercises/Pandas_WeatherData/WeatherData_Analysis_tasks.ipynb
index f416747..2c43dd1 100644
--- a/exercises/Pandas_WeatherData/WeatherData_Analysis_tasks.ipynb
+++ b/exercises/Pandas_WeatherData/WeatherData_Analysis_tasks.ipynb
@@ -91,7 +91,7 @@
    "source": [
     "## Importing the measurement data\n",
     "\n",
-    "The file `produkt_tu_stunde_19500101_20201231_01639.txt` is a CSV file (although the suffix `.txt` conveys something else).\n",
+    "The file `produkt_tu_stunde_19500101_20211231_01639.txt` is a CSV file (although the suffix `.txt` conveys something else).\n",
     "\n",
     "The single columns of the file have the following headers:\n",
     "\n",
@@ -107,6 +107,14 @@
     "After having imported the data gather some `info`rmation on the data (e.g. datatypes of columns or memory usage)."
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ff8e949d",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
   {
    "cell_type": "markdown",
    "id": "06461f7c",
@@ -121,9 +129,7 @@
    "id": "f99173c3",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "### YOUR CODE GOES HERE"
-   ]
+   "source": []
   },
   {
    "cell_type": "markdown",
@@ -139,9 +145,7 @@
    "id": "28ec724b",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "### YOUR CODE GOES HERE"
-   ]
+   "source": []
   },
   {
    "cell_type": "markdown",
@@ -157,9 +161,7 @@
    "id": "a9448957",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "### YOUR CODE GOES HERE"
-   ]
+   "source": []
   },
   {
    "cell_type": "markdown",
@@ -175,9 +177,7 @@
    "id": "0377e45a",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "# YOUR CODE GOES HERE"
-   ]
+   "source": []
   },
   {
    "cell_type": "markdown",
@@ -194,7 +194,7 @@
    "source": [
     "The column `MESS_DATUM` contains the data of each measurement in the format `%Y%m%d%H`. The datatype of this column is `int64`.\n",
     "\n",
-    "Create a new `DataFrame` named `df_weather_cleaned` that is based on the original `df_weather` from above.\n",
+    "Create a new `DataFrame` named `df_weather_tweaked` that is based on the original `df_weather` from above.\n",
     "\n",
     "Make the following modifications to the `df_weather` `DataFrame` to generate a new one that is then assigned to the `df_weather_tweaked` variable.\n",
     "\n",
@@ -227,15 +227,21 @@
     "In all the following tasks you are supposed to work with the new modified `df_weather_tweaked`."
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e53d7a7f",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
   {
    "cell_type": "code",
    "execution_count": null,
    "id": "8030b621",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "# YOUR CODE GOES HERE"
-   ]
+   "source": []
   },
   {
    "cell_type": "markdown",
@@ -264,9 +270,7 @@
    "id": "2a28e418",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "# YOUR CODE GOES HERE"
-   ]
+   "source": []
   },
   {
    "cell_type": "markdown",
@@ -282,9 +286,7 @@
    "id": "d16a3cfb",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "# YOUR CODE GOES HERE"
-   ]
+   "source": []
   },
   {
    "cell_type": "markdown",
@@ -300,9 +302,7 @@
    "id": "3eb31737",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "# YOUR CODE GOES HERE"
-   ]
+   "source": []
   },
   {
    "cell_type": "code",
@@ -310,9 +310,7 @@
    "id": "2ee9be9c",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "# YOUR CODE GOES HERE"
-   ]
+   "source": []
   },
   {
    "cell_type": "code",
@@ -320,9 +318,7 @@
    "id": "546c3127",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "# YOUR CODE GOES HERE"
-   ]
+   "source": []
   },
   {
    "cell_type": "markdown",
@@ -356,7 +352,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# YOUR CODE GOES HERE"
+    "grouped_by_month = df_weather_tweaked.groupby(df_weather_tweaked.index.month)"
    ]
   },
   {
@@ -374,7 +370,28 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# YOUR CODE GOES HERE"
+    "fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(6, 6), sharex=\"col\")\n",
+    "\n",
+    "ax2.set_xticks(range(1, 13))\n",
+    "ax2.set_xlabel(\"month\")\n",
+    "\n",
+    "ax1.set_ylabel(\"Temperature / degree Celsius\")\n",
+    "ax1.violinplot(\n",
+    "    [\n",
+    "        df[\"Temperature\"]\n",
+    "        for _, df in grouped_by_month \n",
+    "    ]\n",
+    ");\n",
+    "ax2.set_ylabel(\"Relative Humidity / %\")\n",
+    "ax2.violinplot(\n",
+    "    [\n",
+    "        df[\"Humidity\"]\n",
+    "        for _, df in grouped_by_month \n",
+    "    ]\n",
+    ");\n",
+    "\n",
+    "\n",
+    "\n"
    ]
   },
   {
@@ -400,9 +417,7 @@
    "id": "b750cd9d",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "# YOUR CODE GOES HERE"
-   ]
+   "source": []
   },
   {
    "cell_type": "code",
@@ -410,9 +425,7 @@
    "id": "a83a4e4b",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "# YOUR CODE GOES HERE"
-   ]
+   "source": []
   },
   {
    "cell_type": "markdown",
@@ -428,14 +441,20 @@
    "id": "ed4157a9",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "# YOUR CODE GOES HERE"
-   ]
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a9e943ec",
+   "metadata": {},
+   "outputs": [],
+   "source": []
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "9bcce04b",
+   "id": "c62bbfcd",
    "metadata": {},
    "outputs": [],
    "source": []
-- 
GitLab