compare results for each dataset

documentation about reproducibility

Commit f59ee58e authored 1 month ago by DesireeWyrzylala
--- a/data/train/temp/149_Stock_id_1_Finance_tr_500_1st_7.csv
+++ b/data/train/temp/149_Stock_id_1_Finance_tr_500_1st_7.csv
--- a/data/train/temp/151_Stock_id_3_Finance_tr_500_1st_62.csv
+++ b/data/train/temp/151_Stock_id_3_Finance_tr_500_1st_62.csv
--- a/data/train/temp/152_Stock_id_4_Finance_tr_500_1st_2.csv
+++ b/data/train/temp/152_Stock_id_4_Finance_tr_500_1st_2.csv
--- a/data/train/temp/155_Stock_id_7_Finance_tr_500_1st_18.csv
+++ b/data/train/temp/155_Stock_id_7_Finance_tr_500_1st_18.csv
--- a/data/train/temp/156_Stock_id_8_Finance_tr_500_1st_24.csv
+++ b/data/train/temp/156_Stock_id_8_Finance_tr_500_1st_24.csv
--- a/data/train/temp/160_Stock_id_12_Finance_tr_500_1st_14.csv
+++ b/data/train/temp/160_Stock_id_12_Finance_tr_500_1st_14.csv
--- a/data/train/temp/161_Stock_id_13_Finance_tr_500_1st_3.csv
+++ b/data/train/temp/161_Stock_id_13_Finance_tr_500_1st_3.csv
--- a/data/train/temp/162_Stock_id_14_Finance_tr_500_1st_4.csv
+++ b/data/train/temp/162_Stock_id_14_Finance_tr_500_1st_4.csv
--- a/data/train/temp/165_Stock_id_17_Finance_tr_500_1st_7.csv
+++ b/data/train/temp/165_Stock_id_17_Finance_tr_500_1st_7.csv
--- a/data/train/temp/166_Stock_id_18_Finance_tr_500_1st_18.csv
+++ b/data/train/temp/166_Stock_id_18_Finance_tr_500_1st_18.csv
--- a/data/train/temp/257_TAO_id_1_Environment_tr_500_1st_3.csv
+++ b/data/train/temp/257_TAO_id_1_Environment_tr_500_1st_3.csv
--- a/data/train/temp/258_TAO_id_2_Environment_tr_500_1st_4.csv
+++ b/data/train/temp/258_TAO_id_2_Environment_tr_500_1st_4.csv
--- a/data/train/temp/259_TAO_id_3_Environment_tr_500_1st_7.csv
+++ b/data/train/temp/259_TAO_id_3_Environment_tr_500_1st_7.csv
--- a/docs/evaluation/Reproducibility.md
+++ b/docs/evaluation/Reproducibility.md
+# 1. Methods Reproducibility
+**Meaning:**
+- [x]  Build the original model and train it on the training data
+- [x]  Evaluate on the test data
+- [ ]  Can the model be rebuilt?
+
+**General:**
+- As in the authors' setup, training and evaluation were run per time series, so the final results are averaged values
+- Of the 870 files, the authors list 397 that they used for tuning and training. Exactly these files were used here as well
+- The same seeds were used
+- Some methods are unsupervised (= most of the statistical models) and some are semi-supervised (= neural networks and OCSVM). The latter required a test dataset
+- The data were split in the same way as by the authors
+- For some models, each time series was additionally cut into batches with a sliding window, serving as a kind of separate training dataset
+- The authors used a single set of optimal hyperparameters per model, even though the data are very diverse. For the replication, a grid search was run and each time series was evaluated with its own optimal hyperparameters
+- Overall, most models could be rebuilt very well
+- Problems:
+	- KShapeAD: despite identical hyperparameters and datasets, we had to debug errors because creating subsequences produced empty one-dimensional arrays instead of two-dimensional matrices
+	- LOF: the optimal hyperparameters do not match the ones tested in the authors' git repository
+	- CNN: testing the hyperparameters specified by the authors raised errors while building the layers. ``num_channels`` is the parameter for the number of neurons per layer, and the length of the array defines the number of layers. The learning rate was optimized instead.
+
+### KShapeAD
+
+
+### POLY
+
+
+### PCA
+
+
+
+### IForest
+
+
+
+### Sub-IForest
+
+
+
+### USAD
+
+
+
+### LSTMAD
+
+
+
+### KMeansAD
+
+
+
+### Sub-KNN
+
+
+
+### OmniAnomaly
+
+
+
+### LOF
+
+
+
+### OCSVM
+
+
+
+### CNN
+
+
+
+# 2. Results Reproducibility
+**Meaning:**
+- [ ] Do we obtain results similar to those in the paper?
+- [ ]  If not, what could be the reason?
+
+**General:**
+*  Because a grid search was run per time series instead of applying the authors' optimal hyperparameters uniformly to all time series, the models are expected to improve
++ For some models on some datasets, however, the results cannot be reproduced even with the authors' optimal hyperparameters (which mostly give even worse values). Only VUS-PR can be compared here, because no AUC-PR values per dataset and model are available.
+	*Missing: Sub-IForest and KNN*
+	*Affected datasets:*
+	+ **TAO** (Models: **PCA**: -0.09; **KShapeAD**: -0.19; **Poly**: -0.11; **KMeans**: -0.17; **IForest**: -0.26; **Sub-LOF**: -0.11; **Sub-OCSVM**: -0.2; **LOF**: -0.2)
+	+ **Stock** (Models: **IForest**: -0.25)
+	+ **MSL** (Models: **KShapeAD**: -0.09)
+	+ **IOPS** (Models: **IForest**: -0.15)
+	+ **SWaT** (Models: **IForest**: -0.13)
+	+ **Yahoo** (Models: **IForest**: -0.3)
+	+ **WSD** (Models: **IForest**: -0.11)
+*Datasets that improved strongly for individual models:*
++ **Power** (Models: **KShapeAD**: +0.32; **KMeans**: +0.13)
++ **MSL** (Models: **POLY**: +0.19; **IForest**: +0.16; **OA**: +0.18; **Sub-LOF**: +0.19)
++ **OPPORTUNITY** (Models: **Poly**: +0.33; **CNN**: +0.22; **IForest**: +0.31)
++ **Exathlon** (Models: **Poly**: +0.11; **KMeans**: +0.21; **IForest**: +0.29; **Sub-OCSVM**: +0.27)
++ **NAB** (Models: **Poly**: +0.1; **IForest**: +0.12; **OA**: +0.15; **Sub-LOF**: +0.21; **Sub-OCSVM**: +0.12)
++ **NEK** (Models: **Poly**: +0.15; **KMeans**: +0.12; **USAD**: +0.1; **Sub-LOF**: +0.44; **Sub-OCSVM**: +0.11)
++ **WSD** (Models: **Poly**: +0.12; **USAD**: +0.12; **Sub-LOF**: +0.52)
++ **LTDB** (Models: **KMeans**: +0.35; **IForest**: +0.2; **Sub-OCSVM**: +0.36)
++ **MITDB** (Models: **KMeans**: +0.32; **Sub-LOF**: +0.11; **Sub-OCSVM**: +0.21)
++ **SVDB** (Models: **KMeans**: +0.37; **IForest**: +0.22; **OA**: +0.36; **Sub-LOF**: +0.2; **Sub-OCSVM**: +0.39)
++ **UCR** (Models: **KMeans**: +0.11; **Sub-LOF**: +0.22; **Sub-OCSVM**: +0.16)
++ **Yahoo** (Models: **KMeans**: +0.23; **Sub-LOF**: +0.15)
++ **Daphnet** (Models: **IForest**: +0.16; **Sub-LOF**: +0.21)
++ **SED** (Models: **IForest**: +0.46; **Sub-OCSVM**: +0.1)
++ **Catsv2** (Models: **OA**: +0.16)
++ **SMAP** (Models: **OA**: +0.27; **Sub-LOF**: +0.2)
++ **SMD** (Models: **OA**: +0.12; **Sub-LOF**: +0.5)
++ **TODS** (Models: **OA**: +0.1; **Sub-LOF**: +0.16; **Sub-OCSVM**: +0.1)
++ **IOPS** (Models: **Sub-LOF**: +0.27)
++ **MGAB** (Models: **Sub-LOF**: +0.19)
+### KShapeAD
+
+
+### POLY
+
+
+### PCA
+
+
+
+### IForest
+
+
+
+### Sub-IForest
+
+
+
+### USAD
+
+
+
+### LSTMAD
+
+
+
+### KMeansAD
+
+
+
+### Sub-KNN
+
+
+
+### OmniAnomaly
+
+
+
+### LOF
+
+
+
+### OCSVM
+
+
+
+### CNN
+
+
+
+
+# 3. Inferential Reproducibility
+**Meaning:**
+- [ ] Compare the results, the effort, the required background knowledge, and the ease of implementation of the statistical algorithms with those of the neural networks
+- [ ]  Do we reach the same conclusion, and do we have to question the superiority of neural networks?
+- [x]  Could overfitting play a role?
+- [x]  Could the authors be biased?
+
+**Assessment:**
+*Overfitting:*
++ Most models are evaluated unsupervised and have no test data. Overfitting therefore plays no role for these models
++ The semi-supervised models required a test and a training dataset. Our code followed the authors' code here: the entire dataset was used for testing and a part of it for training, so the evaluation also covered some training data. This can bias the results of the semi-supervised models upward
+*Bias:*
+- The authors built a dataset that is meant to be as good (correct labeling, feasibility of the dataset, free of bias) and as diverse as possible for anomaly detection.
+- For the evaluation they additionally used VUS-PR (Volume Under the Surface of the precision-recall curve), a metric they created themselves, because they doubt the accuracy of the other metrics (F1, recall, precision, AUC-ROC) and consider them insufficient
+- On some datasets there are large differences between AUC-PR and VUS-PR (from very poor AUC-PR values around 0.15 to very good VUS-PR values around 0.75-0.95), especially on those that contain point anomalies and have a clearly higher anomaly share
+	- Datasets: Stock, TAO and TODS
+	  
+	  *Point-anomaly datasets*
+	- Only these three point-anomaly datasets show a higher anomaly share. In all others, the share of point anomalies lies between 0.3 % and 1.5 %. Those remain poor under both AUC-PR and VUS-PR. Could VUS-PR be this good because it benefits from the high anomaly share?
+	  
+	  *Sequence-anomaly datasets*
+	- High anomaly share in: Exathlon, NAB, NEK, LTDB, Daphnet, SWaT and Power
+	  
+	  *Anomaly share:*
+		- Average: 4.1 %
+		- **TAO:** 9.4 %
+		- **TODS:** 6.4 %
+		- **Stock:** 8.9 %
+		-  UCR: 0.3 %
+		+ SMD: 1.9 %
+		+ YAHOO: 0.9 %
+		+ **Exathlon:** 10.9 % 
+		+ **NAB:** 10.2 % 
+		+ OPPORTUNITY: 4.8 %
+		+ WSD: 0.5 %
+		+ SVDB: 3.7 %
+		+ SMAP: 2.9 %
+		+ IOPS: 1.5 %
+		+ MGAB: 0.2 %
+		+ MSL: 4.0 %
+		+ **NEK:** 8.0 %
+		+ ***LTDB:*** 18.8 % 
+		+ MITDB: 4.2 %
+		+ SED: 4.0 %
+		+ **Daphnet:** 5.9 % 
+		+ SWaT: 12.1 %
+		+ **Power:** 8.6 % 
+		+ CATSv2: 4.9 %
+
+*To be updated once Sub-IForest and, if applicable, a new KNN have been added*
+
+| Dataset      | AUC-PR | VUS-PR | Difference (VUS-PR minus AUC-PR) | Difference in percent |
+| ------------ | ------ | ------ | -------------------------------- | --------------------- |
+| **Exathlon** | 0.658  | 0.659  | 0.001                   |                      |
+| **NAB**      | 0.341  | 0.367  | 0.026                   |                      |
+| **NEK**      | 0.566  | 0.592  | 0.026                   |                      |
+| **LTDB**     | 0.485  | 0.545  | *0.06*                  |                      |
+| **Daphnet**  | 0.227  | 0.217  | -0.01                   |                      |
+| **SWaT**     | 0.434  | 0.302  | *-0.132*                |                      |
+| **Power**    | 0.174  | 0.172  | -0.002                  |                      |
+| **TAO**      | 0.297  | 0.817  | *0.52*                  |                      |
+| **TODS**     | 0.252  | 0.614  | *0.362*                 |                      |
+| **Stock**    | 0.214  | 0.769  | *0.555*                 |                      |
+| MITDB        | 0.313  | 0.300  |                         |                      |
+| MSL          | 0.357  | 0.412  |                         |                      |
+| YAHOO        | 0.268  | 0.374  |                         |                      |
+| OPPORTUNITY  | 0.464  | 0.464  |                         |                      |
+| SMAP         | 0.426  | 0.470  |                         |                      |
+| MGAB         | 0.095  | 0.081  |                         |                      |
+| UCR          | 0.210  | 0.219  |                         |                      |
+| SVDB         | 0.441  | 0.440  |                         |                      |
+| SMD          | 0.396  | 0.381  |                         |                      |
+| WSD          | 0.206  | 0.188  |                         |                      |
+| SED          | 0.260  | 0.332  |                         |                      |
+| IOPS         | 0.226  | 0.187  |                         |                      |
+| CATSv2       | 0.394  | 0.266  |                         |                      |
+| **Overall:** | 0.335  | 0.399  |                         |                      |
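
The empty difference cells in the table above follow from the two metric columns. A minimal sketch, assuming the absolute difference is VUS-PR minus AUC-PR (this matches the rows that are already filled in) and that the percentage is taken relative to AUC-PR (an assumption, since the source leaves that column empty). The helper name `metric_gap` and the selection of rows are illustrative only.

```python
# (AUC-PR, VUS-PR) pairs copied from a subset of the table rows.
scores = {
    "Exathlon": (0.658, 0.659),
    "SWaT": (0.434, 0.302),
    "TAO": (0.297, 0.817),
    "TODS": (0.252, 0.614),
    "Stock": (0.214, 0.769),
}


def metric_gap(auc_pr: float, vus_pr: float) -> tuple[float, float]:
    """Return (absolute gap, gap in percent of AUC-PR).

    The percent convention is an assumption, not taken from the source.
    """
    diff = vus_pr - auc_pr
    return round(diff, 3), round(100.0 * diff / auc_pr, 1)


gaps = {name: metric_gap(auc, vus) for name, (auc, vus) in scores.items()}

for name, (abs_gap, rel_gap) in gaps.items():
    print(f"{name:<10} {abs_gap:+.3f}  {rel_gap:+.1f} %")
```

Computing both columns from the raw scores keeps the table consistent once the Sub-IForest and KNN results are added.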
+	  
--- a/docs/evaluation/Vergleich der Ergebnisse.md
+++ b/docs/evaluation/Vergleich der Ergebnisse.md
@@ -225,19 +225,74 @@
 + Clearly better on sequence anomalies
 + Good AUC-PR on SED, where many of the models struggle
 + Scores on Yahoo are really poor, but nearly perfect on the sequence anomalies
-+ Stock clearly worse (VUS-PR = 0.74 instead of 0.99 according to the authors)
++ Stock clearly worse (VUS-PR = 0.74 instead of 0.99 according to the authors), as are several other datasets, see below
 + Large gap between AUC-PR and VUS-PR on TAO (0.14 vs. 0.73)
 + Best detected group: 
 	+ by AUC-PR:
-		+ Exathlon: 0.962806
+		+ Exathlon: 0.962806 (TAO: 0.125817)
 	+ by VUS-PR:
-		+ Exathlon: 0.964836 (cf. TAO 0.99)
+		+ Exathlon: 0.964836 (cf. 0.67; TAO: 0.728970, cf. TAO 0.99)
 + Worst detected group: 
 	+ by AUC-PR: 
 		+ MGAB: 0.004336
 	+ by VUS-PR:
 		+ MGAB: 0.004379 (cf. MGAB 0.00)

+**TAO DATASET**
+*Authors' optimal hyperparameters:*
++ {'n_estimators': 200}
+*Hyperparameters after per-dataset tuning (V1):*
++ {'n_estimators': 25}, 257_TAO_id_1_Environment_tr_500_1st_3.csv
++ {'n_estimators': 25}, 258_TAO_id_2_Environment_tr_500_1st_4.csv
++ {'n_estimators': 25}, 259_TAO_id_3_Environment_tr_500_1st_7.csv
+
+**Results:**
++ File: 257_TAO_id_1_Environment_tr_500_1st_3.csv, 
+	+ AUC-PR: 0.10360803817787549
+	+ VUS-PR: 0.8775058093917094
++ File: 258_TAO_id_2_Environment_tr_500_1st_4.csv, 
+	+ AUC-PR: 0.1482664785011723
+	+ VUS-PR: 0.9397830084380918
++ File: 259_TAO_id_3_Environment_tr_500_1st_7.csv, 
+	+ AUC-PR: 0.05900208133187386
+	+ VUS-PR: 0.3387773679381048
++ Mean AUC-PR: 0.10362553267030722
++ Mean VUS-PR: 0.718688728589302
+
+$\rightarrow$ Even with the optimal hyperparameters, the large deviation from the authors' results remains
+
+**Further strongly deviating datasets:**
++ *Stock*: 0.74 vs. 0.99 $\rightarrow$ 0.25 worse than the authors
+	+ The grid search yielded different parameters on the datasets
+	  **Results with the authors' optimal hyperparameters**
+	+ Mean AUC-PR: 0.10581727147232856
+	+ Mean VUS-PR: 0.7280962165055002
+	  $\rightarrow$ Results are worse than with the grid search. 
++ *IOPS*: 0.13 vs. 0.28 $\rightarrow$ 0.15 worse than the authors
+	+ The grid search yielded different parameters on the datasets
+	  **Results with the authors' optimal hyperparameters**
+	+ Mean AUC-PR: 
+	+ Mean VUS-PR:
+	  $\rightarrow$ 
++ *SWaT*: 0.37 vs. 0.5 $\rightarrow$ 0.13 worse than the authors
+	+ The grid search yielded the same parameters on the datasets
+	  **Results with the authors' optimal hyperparameters**
+	+ Mean AUC-PR:
+	+ Mean VUS-PR: 
+	  $\rightarrow$ 
++ *Yahoo*: 0.14 vs. 0.44 $\rightarrow$ 0.3 worse than the authors
+	+ The grid search yielded different parameters on the datasets
+	  **Results with the authors' optimal hyperparameters**
+	+ Mean AUC-PR: 
+	+ Mean VUS-PR: 
+	  $\rightarrow$ 
++ *WSD*: 0.03 vs. 0.14 $\rightarrow$ 0.11 worse than the authors
+	+ The grid search yielded different parameters on the datasets
+	  **Results with the authors' optimal hyperparameters**
+	+ Mean AUC-PR: 
+	+ Mean VUS-PR: 
+	  $\rightarrow$ 
+	
 # Sub-iForest
 + Processor: CPU
 	+ Model: iCore 5 8th
@@ -326,7 +381,32 @@
 		+ SED: 0.020994 (MGAB: 0.027123)
 	+ by VUS-PR:
 		+ MGAB: 0.005551 (cf. 0.01, presumably the same; SED: 0.030270)
-	
+
+**TAO DATASET**
+Our result: 0.84 
+Authors: 0.93
+*Authors' optimal hyperparameters:*
++ {'periodicity': 1, 'n_components': None}
+*Hyperparameters after per-dataset tuning (V1):*
++ {'periodicity': 1, 'n_components': 0.75}, 257_TAO_id_1_Environment_tr_500_1st_3.csv
++ {'periodicity': 1, 'n_components': 0.75}, 258_TAO_id_2_Environment_tr_500_1st_4.csv
++ {'periodicity': 1, 'n_components': 0.75}, 259_TAO_id_3_Environment_tr_500_1st_7.csv
+
+**Results:**
++ File: 257_TAO_id_1_Environment_tr_500_1st_3.csv, 
+	+ AUC-PR: 0.1262219668734263
+	+ VUS-PR: 0.9091747292463418
++ File: 258_TAO_id_2_Environment_tr_500_1st_4.csv, 
+	+ AUC-PR: 0.17567161748656224
+	+ VUS-PR: 0.9507001756819773
++ File: 259_TAO_id_3_Environment_tr_500_1st_7.csv, 
+	+ AUC-PR: 0.1437347444925471
+	+ VUS-PR: 0.6562715916627513
++ Mean AUC-PR: 0.14854277628417856
++ Mean VUS-PR: 0.8387154988636901
+
+$\rightarrow$ The result stays the same even with the authors' optimal hyperparameters. No explanation for why the large deviation persists
+
 # USAD
 + Processor: GPU
 	+ Model: NVIDIA GeForce RTX 2060 SUPER
@@ -587,6 +667,14 @@ $\rightarrow$ Optimalen Hyperparameter pro Datensatz stimmen mit denen der Autor

 $\rightarrow$ With the authors' optimal hyperparameters, the TAO results are even worse than with the per-time-series grid search

+**Further strongly deviating datasets:**
++ *MSL*: 0.46 vs. 0.55 $\rightarrow$ 0.09 worse than the authors
+	+ The grid search yielded different parameters on the datasets
+	  **Results with the authors' optimal hyperparameters**
+	+ Mean AUC-PR: 0.4300357746597835
+	+ Mean VUS-PR: 0.4774590136658375
+	  $\rightarrow$ Minimal improvement in the VUS-PR values, but still not as good as the authors' results
+	
 # kMeansAD
 + Processor: CPU
 	+ Model: iCore 5 8th
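
The means reported in the first TAO block of this file (the one listing `n_estimators`) are plain arithmetic means over the three per-file scores. A minimal sketch that recomputes them from the values listed in the diff. The dictionary layout and the helper `mean_metric` are assumptions for illustration, not the project's evaluation code.

```python
from statistics import fmean

# Per-file scores copied from the first TAO block in this diff.
tao_results = {
    "257_TAO_id_1_Environment_tr_500_1st_3.csv": {
        "AUC-PR": 0.10360803817787549,
        "VUS-PR": 0.8775058093917094,
    },
    "258_TAO_id_2_Environment_tr_500_1st_4.csv": {
        "AUC-PR": 0.1482664785011723,
        "VUS-PR": 0.9397830084380918,
    },
    "259_TAO_id_3_Environment_tr_500_1st_7.csv": {
        "AUC-PR": 0.05900208133187386,
        "VUS-PR": 0.3387773679381048,
    },
}


def mean_metric(results: dict, metric: str) -> float:
    """Unweighted mean of one metric over all files of a dataset."""
    return fmean(file_scores[metric] for file_scores in results.values())


mean_auc_pr = mean_metric(tao_results, "AUC-PR")
mean_vus_pr = mean_metric(tao_results, "VUS-PR")
print(f"Mean AUC-PR: {mean_auc_pr:.6f}")
print(f"Mean VUS-PR: {mean_vus_pr:.6f}")
```

Because the mean is unweighted, a single weak file such as `259_TAO_id_3` (VUS-PR of about 0.34) pulls the dataset score down noticeably, which is worth keeping in mind when comparing against the authors' per-dataset numbers.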

--- a/src/datensatz/dataset_investigation.ipynb
+++ b/src/datensatz/dataset_investigation.ipynb
@@ -12,13 +12,6 @@
    "import matplotlib.pyplot as mpl"
   ]
  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -347,25 +340,95 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 83,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#test amount of anomalies of special datasets\n",
+    "dataset1 = 'TAO'\n",
+    "dataset2 = 'Stock'\n",
+    "dataset3 = 'TODS'\n",
+    "\n",
+    "\n",
+    "df_tao = df_univariate[df_univariate['FileName'].str.contains('_'+dataset1+'_')]\n",
+    "df_stock = df_univariate[df_univariate['FileName'].str.contains('_'+dataset2+'_')]\n",
+    "df_tods = df_univariate[df_univariate['FileName'].str.contains('_'+dataset3+'_')]\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 84,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "FileName\n",
+       "287_TODS_id_1_Synthetic_tr_500_1st_11        5000\n",
+       "288_TODS_id_2_Synthetic_tr_500_1st_65        5000\n",
+       "289_TODS_id_3_Synthetic_tr_500_1st_26        5000\n",
+       "290_TODS_id_4_Synthetic_tr_500_1st_26        5000\n",
+       "291_TODS_id_5_Synthetic_tr_500_1st_11        5000\n",
+       "292_TODS_id_6_Synthetic_tr_500_1st_11        5000\n",
+       "293_TODS_id_7_Synthetic_tr_500_1st_7         5000\n",
+       "294_TODS_id_8_Synthetic_tr_500_1st_200       5000\n",
+       "295_TODS_id_9_Synthetic_tr_1250_1st_2046     5000\n",
+       "296_TODS_id_10_Synthetic_tr_500_1st_26       5000\n",
+       "297_TODS_id_11_Synthetic_tr_500_1st_7        5000\n",
+       "298_TODS_id_12_Synthetic_tr_500_1st_0        5000\n",
+       "299_TODS_id_13_Synthetic_tr_500_1st_65       5000\n",
+       "300_TODS_id_14_Synthetic_tr_1250_1st_2555    5000\n",
+       "301_TODS_id_15_Synthetic_tr_500_1st_245      5000\n",
+       "Name: count, dtype: int64"
+      ]
+     },
+     "execution_count": 84,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df_tods['FileName'].value_counts()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 85,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#get percentage of anomalies in selected datasets\n",
+    "anomalies_tao= df_tao['Label'].value_counts()[1]\n",
+    "anomalies_stock= df_stock['Label'].value_counts()[1]\n",
+    "anomalies_tods= df_tods['Label'].value_counts()[1]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "Anteil an Anomalien: 0.04185815322611735\n"
+      "Anteil an Anomalien: 0.04185815322611735\n",
+      "Anteil an Anomalien in TAO: 9.407\n",
+      "Anteil an Anomalien in Stock: 8.864\n",
+      "Anteil an Anomalien in TODS: 6.377\n"
     ]
    }
   ],
   "source": [
    "print(f'Anteil an Anomalien: {anomalies/df_univariate.shape[0]}')\n",
-    "#print(f'Anteil an Anomalien: {913261 / (913261 + 32854956)}')\n"
+    "print(f'Anteil an Anomalien in {dataset1}: {(anomalies_tao*100/df_tao.shape[0]):.3f}')\n",
+    "print(f'Anteil an Anomalien in {dataset2}: {(anomalies_stock*100/df_stock.shape[0]):.3f}')\n",
+    "print(f'Anteil an Anomalien in {dataset3}: {(anomalies_tods*100/df_tods.shape[0]):.3f}')\n"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 12,
+   "execution_count": 87,
   "metadata": {},
   "outputs": [
    {
@@ -438,7 +501,7 @@
       "4  0.092652    0.0  280_NEK_id_4_WebService_tr_500_1st_231"
      ]
     },
-     "execution_count": 12,
+     "execution_count": 87,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -458,7 +521,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 13,
+   "execution_count": 88,
   "metadata": {},
   "outputs": [
    {
@@ -467,7 +530,7 @@
       "(1073, 3)"
      ]
     },
-     "execution_count": 13,
+     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -478,15 +541,17 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 14,
+   "execution_count": 89,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
-      "c:\\Users\\desiw\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\plotly\\express\\_core.py:1985: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.\n",
-      "  sf: grouped.get_group(s if len(s) > 1 else s[0])\n"
+      "c:\\Users\\desiw\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\plotly\\express\\_core.py:1985: FutureWarning:\n",
+      "\n",
+      "When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.\n",
+      "\n"
     ]
    },
    {
@@ -3369,7 +3434,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 15,
+   "execution_count": 90,
   "metadata": {},
   "outputs": [
    {
@@ -3379,7 +3444,7 @@
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mFileNotFoundError\u001b[0m                         Traceback (most recent call last)",
-      "Cell \u001b[1;32mIn[15], line 3\u001b[0m\n\u001b[0;32m      1\u001b[0m path_multivariate \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mdata/TSB-AD-M/\u001b[39m\u001b[38;5;124m'\u001b[39m\n\u001b[0;32m      2\u001b[0m file \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m019_MITDB_id_1_Medical_tr_37500_1st_103211.csv\u001b[39m\u001b[38;5;124m'\u001b[39m\n\u001b[1;32m----> 3\u001b[0m df_multivariate_single \u001b[38;5;241m=\u001b[39m \u001b[43mpd\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mread_csv\u001b[49m\u001b[43m(\u001b[49m\u001b[43mpath_multivariate\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m+\u001b[39;49m\u001b[43m \u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43m/\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m \u001b[49m\u001b[38;5;241;43m+\u001b[39;49m\u001b[43m \u001b[49m\u001b[43mfile\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m      4\u001b[0m df_multivariate_single\u001b[38;5;241m.\u001b[39mhead()\n",
+      "Cell \u001b[1;32mIn[90], line 3\u001b[0m\n\u001b[0;32m      1\u001b[0m path_multivariate \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mdata/TSB-AD-M/\u001b[39m\u001b[38;5;124m'\u001b[39m\n\u001b[0;32m      2\u001b[0m file \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m019_MITDB_id_1_Medical_tr_37500_1st_103211.csv\u001b[39m\u001b[38;5;124m'\u001b[39m\n\u001b[1;32m----> 3\u001b[0m df_multivariate_single \u001b[38;5;241m=\u001b[39m \u001b[43mpd\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mread_csv\u001b[49m\u001b[43m(\u001b[49m\u001b[43mpath_multivariate\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m+\u001b[39;49m\u001b[43m \u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43m/\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m \u001b[49m\u001b[38;5;241;43m+\u001b[39;49m\u001b[43m \u001b[49m\u001b[43mfile\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m      4\u001b[0m df_multivariate_single\u001b[38;5;241m.\u001b[39mhead()\n",
      "File \u001b[1;32mc:\\Users\\desiw\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:1026\u001b[0m, in \u001b[0;36mread_csv\u001b[1;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)\u001b[0m\n\u001b[0;32m   1013\u001b[0m kwds_defaults \u001b[38;5;241m=\u001b[39m _refine_defaults_read(\n\u001b[0;32m   1014\u001b[0m     dialect,\n\u001b[0;32m   1015\u001b[0m     delimiter,\n\u001b[1;32m   (...)\u001b[0m\n\u001b[0;32m   1022\u001b[0m     dtype_backend\u001b[38;5;241m=\u001b[39mdtype_backend,\n\u001b[0;32m   1023\u001b[0m )\n\u001b[0;32m   1024\u001b[0m kwds\u001b[38;5;241m.\u001b[39mupdate(kwds_defaults)\n\u001b[1;32m-> 1026\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43m_read\u001b[49m\u001b[43m(\u001b[49m\u001b[43mfilepath_or_buffer\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mkwds\u001b[49m\u001b[43m)\u001b[49m\n",
      "File \u001b[1;32mc:\\Users\\desiw\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:620\u001b[0m, in \u001b[0;36m_read\u001b[1;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[0;32m    617\u001b[0m _validate_names(kwds\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mnames\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m))\n\u001b[0;32m    619\u001b[0m \u001b[38;5;66;03m# Create the parser.\u001b[39;00m\n\u001b[1;32m--> 620\u001b[0m parser \u001b[38;5;241m=\u001b[39m \u001b[43mTextFileReader\u001b[49m\u001b[43m(\u001b[49m\u001b[43mfilepath_or_buffer\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwds\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m    622\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m chunksize \u001b[38;5;129;01mor\u001b[39;00m iterator:\n\u001b[0;32m    623\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m parser\n",
      "File \u001b[1;32mc:\\Users\\desiw\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:1620\u001b[0m, in \u001b[0;36mTextFileReader.__init__\u001b[1;34m(self, f, engine, **kwds)\u001b[0m\n\u001b[0;32m   1617\u001b[0m     \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mhas_index_names\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m kwds[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mhas_index_names\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n\u001b[0;32m   1619\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mhandles: IOHandles \u001b[38;5;241m|\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m-> 1620\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_engine \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_make_engine\u001b[49m\u001b[43m(\u001b[49m\u001b[43mf\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mengine\u001b[49m\u001b[43m)\u001b[49m\n",

 %% Cell type:code id: tags:

 ``` python
 import pandas as pd
 import os
 import plotly.express as px
 import matplotlib.pyplot as mpl
 ```

-%% Cell type:code id: tags:
-
-``` python
-```
-
 %% Cell type:markdown id: tags:

 # Univariate Daten

 %% Cell type:code id: tags:

 ``` python
 path_univariate = '../../data/train/all'
 file = '225_MGAB_id_1_Synthetic_tr_25000_1st_38478.csv'
 df_univariate_single = pd.read_csv(path_univariate+ '/' +file)
 df_univariate_single.head()
 ```

 %% Output

           Data  Label
    0  1.391493      0
    1  1.378404      0
    2  1.344583      0
    3  1.290358      0
    4  1.219080      0

 %% Cell type:code id: tags:

 ``` python
 df_univariate_single.shape[0]
 ```

 %% Output

    100000

 %% Cell type:code id: tags:

 ``` python
 df_univariate_single['Label'].value_counts()
 ```

 %% Output

    Label
    0    99800
    1      200
    Name: count, dtype: int64

 %% Cell type:code id: tags:

 ``` python
 print(f'Anteil an Anomalien: {18109/(631891+18109)}')
 ```

 %% Output

    Anteil an Anomalien: 0.02786

 %% Cell type:code id: tags:

 ``` python
 def check_if_label_only_contains_two_values(lst,):
    allowed_values = {0, 1,'0','1'}
    return set(lst).issubset(allowed_values)
 ```

 %% Cell type:code id: tags:

 ``` python
 #read all univariate data
 #create empty dataframe to concat all data from files
 df_univariate = pd.DataFrame({
    'Data': [],
    'Label': [],
    'FileName': []
 })
 for file in os.listdir(path_univariate):
    name = file.split('.')[0]
    df_single = pd.read_csv(path_univariate +  '/' + file)
    #make sure, that file has correct shape
    if df_single.shape[1] != 2:
        raise ValueError(f'Shape of Dataframe made from file: {name} is not valid!')
    df_single.columns = ['Data','Label']
    #check that the second column contains only the keys for normal and abnormal data points
    if not check_if_label_only_contains_two_values(list(df_single['Label'])):
        raise ValueError(f'Labeling is not consistent in file: {name}')
    #add column FileName, which allows tracing rows back to their file
    df_single['FileName'] = name
    #concat to combined dataframe
    df_univariate = pd.concat([df_univariate,df_single])

 df_univariate.shape
 ```

 %% Output

    (20414183, 3)

 %% Cell type:code id: tags:

 ``` python
 df_univariate.head()
 ```

 %% Output

         Data  Label                                FileName
    0  47.606    0.0  001_NAB_id_1_Facility_tr_1007_1st_2014
    1  42.580    0.0  001_NAB_id_1_Facility_tr_1007_1st_2014
    2  46.030    0.0  001_NAB_id_1_Facility_tr_1007_1st_2014
    3  44.992    0.0  001_NAB_id_1_Facility_tr_1007_1st_2014
    4  45.238    0.0  001_NAB_id_1_Facility_tr_1007_1st_2014

 %% Cell type:code id: tags:

 ``` python
 df_univariate['Label'].value_counts()
 ```

 %% Output

    Label
    0.0    19559683
    1.0      854500
    Name: count, dtype: int64

 %% Cell type:code id: tags:

 ``` python
 anomalies= df_univariate['Label'].value_counts()[1]
 print(f'Anteil an Anomalien total: {anomalies}')
 ```

 %% Output

    Anteil an Anomalien total: 854500

 %% Cell type:code id: tags:

 ``` python
+#test amount of anomalies of special datasets
+dataset1 = 'TAO'
+dataset2 = 'Stock'
+dataset3 = 'TODS'
+
+
+df_tao = df_univariate[df_univariate['FileName'].str.contains('_'+dataset1+'_')]
+df_stock = df_univariate[df_univariate['FileName'].str.contains('_'+dataset2+'_')]
+df_tods = df_univariate[df_univariate['FileName'].str.contains('_'+dataset3+'_')]
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df_tods['FileName'].value_counts()
+```
+
+%% Output
+
+    FileName
+    287_TODS_id_1_Synthetic_tr_500_1st_11        5000
+    288_TODS_id_2_Synthetic_tr_500_1st_65        5000
+    289_TODS_id_3_Synthetic_tr_500_1st_26        5000
+    290_TODS_id_4_Synthetic_tr_500_1st_26        5000
+    291_TODS_id_5_Synthetic_tr_500_1st_11        5000
+    292_TODS_id_6_Synthetic_tr_500_1st_11        5000
+    293_TODS_id_7_Synthetic_tr_500_1st_7         5000
+    294_TODS_id_8_Synthetic_tr_500_1st_200       5000
+    295_TODS_id_9_Synthetic_tr_1250_1st_2046     5000
+    296_TODS_id_10_Synthetic_tr_500_1st_26       5000
+    297_TODS_id_11_Synthetic_tr_500_1st_7        5000
+    298_TODS_id_12_Synthetic_tr_500_1st_0        5000
+    299_TODS_id_13_Synthetic_tr_500_1st_65       5000
+    300_TODS_id_14_Synthetic_tr_1250_1st_2555    5000
+    301_TODS_id_15_Synthetic_tr_500_1st_245      5000
+    Name: count, dtype: int64
+
+%% Cell type:code id: tags:
+
+``` python
+#count the anomalies in the selected datasets (converted to percentages below)
+anomalies_tao= df_tao['Label'].value_counts()[1]
+anomalies_stock= df_stock['Label'].value_counts()[1]
+anomalies_tods= df_tods['Label'].value_counts()[1]
+```
+
+%% Cell type:code id: tags:
+
+``` python
 print(f'Share of anomalies: {anomalies/df_univariate.shape[0]}')
-#print(f'Share of anomalies: {913261 / (913261 + 32854956)}')
+print(f'Percentage of anomalies in {dataset1}: {(anomalies_tao*100/df_tao.shape[0]):.3f}')
+print(f'Percentage of anomalies in {dataset2}: {(anomalies_stock*100/df_stock.shape[0]):.3f}')
+print(f'Percentage of anomalies in {dataset3}: {(anomalies_tods*100/df_tods.shape[0]):.3f}')
 ```

 %% Output

    Share of anomalies: 0.04185815322611735
+    Percentage of anomalies in TAO: 9.407
+    Percentage of anomalies in Stock: 8.864
+    Percentage of anomalies in TODS: 6.377
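
The per-dataset percentages above are computed with one filtered frame per dataset. A minimal sketch of how the same figure could be obtained for all datasets at once, assuming every `FileName` follows the `<number>_<dataset>_id_...` pattern seen in the outputs. The function name `anomaly_percentage_by_dataset`, the helper column `Dataset`, and the toy values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def anomaly_percentage_by_dataset(df):
    """Return the percentage of rows labelled 1 for each dataset name."""
    # assumption: the dataset name is the second underscore-separated token
    dataset = df['FileName'].str.split('_').str[1]
    # the mean of a 0/1 label column is the share of anomalous rows
    return df.assign(Dataset=dataset).groupby('Dataset')['Label'].mean() * 100


# toy data, not taken from the real files
example = pd.DataFrame({
    'Data': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    'Label': [0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    'FileName': ['257_TAO_id_1_Environment_tr_500_1st_3'] * 2
                + ['149_Stock_id_1_Finance_tr_500_1st_7'] * 4,
})
percentages = anomaly_percentage_by_dataset(example)
print(percentages)
```

Grouping on the parsed token avoids the substring match of `str.contains`, at the cost of depending on the position of the dataset name in the file name.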

 %% Cell type:code id: tags:

 ``` python
 #select a single time series to plot
 df_univariate_first_entry = df_univariate.where(df_univariate['FileName'] == '280_NEK_id_4_WebService_tr_500_1st_231').dropna()
 df_univariate_first_entry.head()
 ```

 %% Output

           Data  Label                                FileName
    0  0.166970    0.0  280_NEK_id_4_WebService_tr_500_1st_231
    1  0.085052    0.0  280_NEK_id_4_WebService_tr_500_1st_231
    2  0.080585    0.0  280_NEK_id_4_WebService_tr_500_1st_231
    3  0.100995    0.0  280_NEK_id_4_WebService_tr_500_1st_231
    4  0.092652    0.0  280_NEK_id_4_WebService_tr_500_1st_231

 %% Cell type:code id: tags:

 ``` python
 ```

 %% Cell type:code id: tags:

 ``` python
 df_univariate_first_entry.shape
 ```

 %% Output

    (1073, 3)

 %% Cell type:code id: tags:

 ``` python
 colors = {
    1: 'firebrick',
    0: 'darkolivegreen'
 }

 # Insert NaN to separate the transitions between labels
 df = df_univariate_first_entry.copy()
 df["Data"] = df["Data"].mask(df["Label"].ne(df["Label"].shift()))

 fig = px.line(df,x=df.index,y='Data',color='Label', color_discrete_map= colors)
 fig.update_layout(
        legend=dict(
            x=0.9,
            y=1.0,
            traceorder="reversed",
            title_font_family="Times New Roman",
            font=dict(
                family="Courier",
                size=15,
                color="black"
            ),
            title_text='Group',
            bgcolor="white",
            bordercolor="lightgrey",
            borderwidth=2,
        ),
        template = 'ggplot2',
        title={
                'text': '<b> Time series of a univariate example with a point anomaly </b>',
                'y':0.95,
                'xanchor':'center',
                'yanchor':'top'},
        height=600,width=1200,
        font=dict(
            family="Arial",
            size=18,  # Set the font size here
        )
        )
 fig.update_xaxes(title_text = 'Time (t)')
 fig.update_yaxes(title_text= 'Measured values')
 fig.update_traces(marker=dict(size=12,
                        opacity=0.8,
                        line=dict(width=2,
                                        color='DarkSlateGrey'),
                        ),
                selector=dict(mode='markers'))
 fig.show()
 ```

 %% Output

-    c:\Users\desiw\AppData\Local\Programs\Python\Python311\Lib\site-packages\plotly\express\_core.py:1985: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
-      sf: grouped.get_group(s if len(s) > 1 else s[0])
+    c:\Users\desiw\AppData\Local\Programs\Python\Python311\Lib\site-packages\plotly\express\_core.py:1985: FutureWarning:
+    
+    When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
+    
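
The plotting cell masks `Data` wherever the label differs from the previous row, so that the two coloured traces are not joined across a label change. A small sketch of what that expression does on toy data (the frame `toy` and its values are made up for illustration):

```python
import pandas as pd

# toy series with a single anomalous sample at position 2
toy = pd.DataFrame({
    'Data': [1.0, 2.0, 3.0, 4.0, 5.0],
    'Label': [0, 0, 1, 0, 0],
})

# True wherever the label differs from the previous row; the first row is
# compared against the NaN produced by shift() and is therefore also True
label_changed = toy['Label'].ne(toy['Label'].shift())

# same masking as in the plotting cell: values at change points become NaN
masked = toy['Data'].mask(label_changed)
print(masked.tolist())
```

In this toy case the anomalous value itself is replaced by NaN, along with the first row and the first normal value after the anomaly. For an anomaly that is only one sample long this would leave nothing to draw in the anomaly trace; whether that affects the notebook's figure depends on the anomaly lengths in the selected file.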


 %% Cell type:markdown id: tags:

 # Multivariate Daten

 %% Cell type:code id: tags:

 ``` python
 path_multivariate = 'data/TSB-AD-M/'
 file = '019_MITDB_id_1_Medical_tr_37500_1st_103211.csv'
 df_multivariate_single = pd.read_csv(path_multivariate + '/' + file)
 df_multivariate_single.head()
 ```

 %% Output

    ---------------------------------------------------------------------------
    FileNotFoundError                         Traceback (most recent call last)
-Cell     In[15], line 3
+Cell     In[90], line 3
          1 path_multivariate = 'data/TSB-AD-M/'
          2 file = '019_MITDB_id_1_Medical_tr_37500_1st_103211.csv'
    ----> 3 df_multivariate_single = pd.read_csv(path_multivariate + '/' + file)
          4 df_multivariate_single.head()
 File     c:\Users\desiw\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\readers.py:1026, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
       1013 kwds_defaults = _refine_defaults_read(
       1014     dialect,
       1015     delimiter,
       (...)
       1022     dtype_backend=dtype_backend,
       1023 )
       1024 kwds.update(kwds_defaults)
    -> 1026 return _read(filepath_or_buffer, kwds)
 File     c:\Users\desiw\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\readers.py:620, in _read(filepath_or_buffer, kwds)
        617 _validate_names(kwds.get("names", None))
        619 # Create the parser.
    --> 620 parser = TextFileReader(filepath_or_buffer, **kwds)
        622 if chunksize or iterator:
        623     return parser
 File     c:\Users\desiw\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\readers.py:1620, in TextFileReader.__init__(self, f, engine, **kwds)
       1617     self.options["has_index_names"] = kwds["has_index_names"]
       1619 self.handles: IOHandles | None = None
    -> 1620 self._engine = self._make_engine(f, self.engine)
 File     c:\Users\desiw\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\readers.py:1880, in TextFileReader._make_engine(self, f, engine)
       1878     if "b" not in mode:
       1879         mode += "b"
    -> 1880 self.handles = get_handle(
       1881     f,
       1882     mode,
       1883     encoding=self.options.get("encoding", None),
       1884     compression=self.options.get("compression", None),
       1885     memory_map=self.options.get("memory_map", False),
       1886     is_text=is_text,
       1887     errors=self.options.get("encoding_errors", "strict"),
       1888     storage_options=self.options.get("storage_options", None),
       1889 )
       1890 assert self.handles is not None
       1891 f = self.handles.handle
 File     c:\Users\desiw\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\common.py:873, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
        868 elif isinstance(handle, str):
        869     # Check whether the filename is to be opened in binary mode.
        870     # Binary mode does not support 'encoding' and 'newline'.
        871     if ioargs.encoding and "b" not in ioargs.mode:
        872         # Encoding
    --> 873         handle = open(
        874             handle,
        875             ioargs.mode,
        876             encoding=ioargs.encoding,
        877             errors=errors,
        878             newline="",
        879         )
        880     else:
        881         # Binary mode
        882         handle = open(handle, ioargs.mode)
    FileNotFoundError: [Errno 2] No such file or directory: 'data/TSB-AD-M//019_MITDB_id_1_Medical_tr_37500_1st_103211.csv'
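
The path in the error message contains a doubled separator (`data/TSB-AD-M//...`) because `path_multivariate` already ends in `/`. A doubled separator is usually tolerated by the operating system, so the more likely cause is that the relative path does not resolve from the notebook's working directory; that is an assumption, since the directory layout is not visible in this diff. A minimal sketch of a more defensive read, with the hypothetical helper name `read_csv_checked`:

```python
from pathlib import Path

import pandas as pd


def read_csv_checked(directory, file_name):
    """Build the path with pathlib and fail early with a clearer message."""
    # joining with "/" on Path objects never duplicates separators
    csv_path = Path(directory) / file_name
    if not csv_path.is_file():
        # include the working directory, since relative paths depend on it
        raise FileNotFoundError(
            f'{csv_path} not found (current working directory: {Path.cwd()})'
        )
    return pd.read_csv(csv_path)
```

A call such as `read_csv_checked(path_multivariate, file)` would then report the working directory alongside the missing path, which makes a wrong relative path easier to spot.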

 %% Cell type:code id: tags:

 ``` python
 df_multivariate_single.shape
 ```

 %% Output

    (150000, 3)

 %% Cell type:code id: tags:

 ``` python
 file_2 = '050_GHL_id_19_Sensor_tr_43750_1st_55001.csv'
 df_multivariate_single_2 = pd.read_csv(path_multivariate+file_2)
 df_multivariate_single_2.head()
 ```

 %% Output

       RT_level_ini  RT_temperature.T  C_temperature.T  RT_level  out_valve_act  \
    0      2.227218        307.379578        327.69635  0.841864            0.0
    1      2.227218        307.299835        327.69635  0.843873            0.0
    2      2.227218        307.220490        327.69635  0.845882            0.0
    3      2.227218        307.141571        327.69635  0.847891            0.0
    4      2.227218        307.063049        327.69635  0.849899            0.0
    
        dT_rand  inv_valve_act  limiter.y  inj_valve_act  Relaxing.active  \
    0 -1.175146            0.0      274.0            1.0              0.0
    1 -1.175146            0.0      274.0            1.0              0.0
    2 -1.175146            0.0      274.0            1.0              0.0
    3 -1.175146            0.0      274.0            1.0              0.0
    4 -1.175146            0.0      274.0            1.0              0.0
    
       boundary.m_flow_in  dir_valve_act   dt_rand   C_level  HT_temperature.T  \
    0                20.0            0.0  0.054231  0.212269        311.273926
    1                20.0            0.0  0.054231  0.212269        311.249725
    2                20.0            0.0  0.054231  0.212269        311.225586
    3                20.0            0.0  0.054231  0.212269        311.201477
    4                20.0            0.0  0.054231  0.212269        311.177368
    
       heater_act  HT_level  limiter1.y   dL_rand  Label
    0         0.0       0.1  100.542313 -1.175146      0
    1         0.0       0.1  100.542313 -1.175146      0
    2         0.0       0.1  100.542313 -1.175146      0
    3         0.0       0.1  100.542313 -1.175146      0
    4         0.0       0.1  100.542313 -1.175146      0

 %% Cell type:code id: tags:

 ``` python
 df_multivariate_single_2.shape
 ```

 %% Output

    (175001, 20)

 %% Cell type:code id: tags:

 ``` python
 ```

 %% Cell type:code id: tags:

 ``` python
 ```

--- a/src/group_evaluation/create_heatmap.ipynb
+++ b/src/group_evaluation/create_heatmap.ipynb
--- a/src/models/desi/desi_evaluate_groups.ipynb
+++ b/src/models/desi/desi_evaluate_groups.ipynb
--- a/src/models/desi/test_hyperparameters.ipynb
+++ b/src/models/desi/test_hyperparameters.ipynb