Tutorial Summary: Training ConvLSTM Models for SST Prediction

pip install xarray numpy dask matplotlib

This tutorial demonstrated the process of training a ConvLSTM model for sea surface temperature (SST) prediction, including data preprocessing, model training, evaluation, and visualization.

Data Preprocessing

  • Load and preprocess SST data from a Zarr store.
  • Normalize data and handle NaN values.
  • Split data into training, validation, and test sets.

Model Construction and Training

  • Build a ConvLSTM model with TensorFlow and Keras.
  • Compile and train the model with the Adam optimizer and MAE loss function.

Evaluating and Visualizing the Model

  • Prepare the test dataset and evaluate model performance.
  • Use utility functions to preprocess input data, make predictions, and visualize results.
  • Compare the predicted output with the true output to assess model accuracy.

In this tutorial, we will use several important Python libraries. Below is an explanation of each import and its purpose:

  • import xarray as xr: Xarray is a library for working with labeled multi-dimensional arrays, particularly useful for handling time-series, meteorological, and oceanographic data. We will use Xarray to process and analyze datasets.

  • import numpy as np: NumPy is a fundamental library for scientific computing in Python, providing support for large multi-dimensional arrays and matrices, along with a collection of mathematical functions. We will use NumPy for array operations and data processing.

  • import dask.array as da: Dask is a parallel computing library, and Dask Array provides parallel computation capabilities similar to NumPy arrays, enabling us to handle larger datasets efficiently. We will use Dask to process data in parallel to improve efficiency.

  • import matplotlib.pyplot as plt: Matplotlib is a plotting library, and pyplot is a module within Matplotlib that offers MATLAB-like plotting functions. We will use pyplot to visualize our data and results.

  • import tensorflow as tf: TensorFlow is an open-source machine learning framework for building and training neural network models. We will use TensorFlow to create and train deep learning models.

    • from tensorflow.keras.callbacks import EarlyStopping: Keras is a high-level neural networks API within TensorFlow that simplifies the construction of neural networks. EarlyStopping is a callback function that stops training early when a monitored metric stops improving, preventing overfitting.

    • from tensorflow.keras.models import Sequential: Sequential is a type of model in Keras used for stacking multiple neural network layers linearly.

    • from tensorflow.keras.layers import ConvLSTM2D, BatchNormalization, Conv2D, Dropout: These are different types of neural network layers in Keras. ConvLSTM2D is a convolutional long short-term memory layer for handling spatiotemporal data. BatchNormalization normalizes inputs after each layer to prevent vanishing gradients and speed up training. Conv2D is a two-dimensional convolutional layer used for image processing. Dropout is a regularization technique that randomly drops some neurons during training to prevent overfitting.

By using these libraries and modules together, we can efficiently process data, build and train complex deep learning models, and visualize the results to gain meaningful insights.

import xarray as xr
import numpy as np

import dask.array as da

import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import ConvLSTM2D, BatchNormalization, Conv2D, Dropout

Checking for Available GPUs

When training ConvLSTM models, using a GPU can significantly speed up the training process compared to using a CPU. The following code checks whether a GPU is available in your environment; by default, TensorFlow will use an available GPU for training.

# list all the physical devices
physical_devices = tf.config.list_physical_devices()
print("All Physical Devices:", physical_devices)

# list all the available GPUs
gpus = tf.config.list_physical_devices('GPU')
print("Available GPUs:", gpus)

# Print information for the available GPUs, if any exist
if gpus:
    for gpu in gpus:
        details = tf.config.experimental.get_device_details(gpu)
        print("GPU Details:", details)
else:
    print("No GPU available")
All Physical Devices: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Available GPUs: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
GPU Details: {'device_name': 'NVIDIA GeForce RTX 4060 Laptop GPU', 'compute_capability': (8, 9)}
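
If you run into GPU out-of-memory errors with larger batch sizes, you can optionally ask TensorFlow to allocate GPU memory on demand rather than reserving it all up front. The following is an optional sketch (not part of the original run) using the standard tf.config API; it must be executed before any operation initializes the GPU.

# Optional sketch: enable on-demand GPU memory allocation.
# Must run before the GPU is initialized by any operation.
for gpu in tf.config.list_physical_devices('GPU'):
    try:
        tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Raised if the GPU has already been initialized
        print(e)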

Data Preprocessing

Step 1: Load the Dataset

We start by loading the dataset from a Zarr store at the path “PycharmProjects/ML/Intern/INDIAN_OCEAN_025GRID_DAILY.zarr”.

Step 2: Select a Subset of the Data

Next, we select a specific region of interest by slicing the latitude and longitude dimensions.

Step 3: Filter Out Dates with All NaN Values

We identify all dates where the SST variable contains only NaN values and exclude those dates from the dataset. Note that this step is very important since our model can’t take in any NaNs!

Step 4: Sort and Select Data by Time

We sort the dataset by time and select the time range from January 1, 2015, to December 31, 2022.

# Step 1: Load the dataset from the Zarr store
zarr_ds = xr.open_zarr(store='PycharmProjects/ML/Intern/INDIAN_OCEAN_025GRID_DAILY.zarr', consolidated=True)

# Step 2: Select the region of interest
zarr_new = zarr_ds.sel(lat=slice(35, -5), lon=slice(45, 90))

# Step 3: Identify dates where SST is all NaN and exclude them
all_nan_dates = np.isnan(zarr_new["sst"]).all(dim=["lon", "lat"]).compute()
zarr_ds = zarr_new.sel(time=all_nan_dates == False)

# Step 4: Sort by time and select 2015-01-01 through 2022-12-31
zarr_ds = zarr_ds.sortby('time')
zarr_ds = zarr_ds.sel(time=slice('2015-01-01', '2022-12-31'))
zarr_ds
<xarray.Dataset> Size: 5GB
Dimensions:          (time: 2763, lat: 149, lon: 181)
Coordinates:
  * lat              (lat) float32 596B 32.0 31.75 31.5 ... -4.5 -4.75 -5.0
  * lon              (lon) float32 724B 45.0 45.25 45.5 ... 89.5 89.75 90.0
  * time             (time) datetime64[ns] 22kB 2015-01-01 ... 2022-12-31
Data variables: (12/19)
    CHL              (time, lat, lon) float32 298MB dask.array<chunksize=(51, 149, 181), meta=np.ndarray>
    CHL_uncertainty  (time, lat, lon) float32 298MB dask.array<chunksize=(51, 149, 181), meta=np.ndarray>
    adt              (time, lat, lon) float32 298MB dask.array<chunksize=(51, 149, 181), meta=np.ndarray>
    air_temp         (time, lat, lon) float32 298MB dask.array<chunksize=(51, 149, 181), meta=np.ndarray>
    curr_dir         (time, lat, lon) float32 298MB dask.array<chunksize=(51, 149, 181), meta=np.ndarray>
    curr_speed       (time, lat, lon) float32 298MB dask.array<chunksize=(51, 149, 181), meta=np.ndarray>
    ...               ...
    ug_curr          (time, lat, lon) float32 298MB dask.array<chunksize=(51, 149, 181), meta=np.ndarray>
    v_curr           (time, lat, lon) float32 298MB dask.array<chunksize=(51, 149, 181), meta=np.ndarray>
    v_wind           (time, lat, lon) float32 298MB dask.array<chunksize=(51, 149, 181), meta=np.ndarray>
    vg_curr          (time, lat, lon) float32 298MB dask.array<chunksize=(51, 149, 181), meta=np.ndarray>
    wind_dir         (time, lat, lon) float32 298MB dask.array<chunksize=(51, 149, 181), meta=np.ndarray>
    wind_speed       (time, lat, lon) float32 298MB dask.array<chunksize=(51, 149, 181), meta=np.ndarray>
Attributes: (12/17)
    creator_email:              minhphan@uw.edu
    creator_name:               Minh Phan
    creator_type:               person
    date_created:               2023-07-19
    geospatial_lat_max:         32.0
    geospatial_lat_min:         -12.0
    ...                         ...
    geospatial_lon_units:       degrees_east
    source:                     Earth & Space Research (ESR), Copernicus Clim...
    summary:                    Daily mean of 0.25 x 0.25 degrees gridded dat...
    time_coverage_end:          2022-12-31T23:59:59
    time_coverage_start:        1979-01-01T00:00:00
    title:                      Climate Data for Coastal Upwelling Machine Le...
As a quick check of the selected region, we can plot the SST field for a single day:

p = zarr_new.sel(time='2018-07-15').sst.plot(y='lat', x='lon')

    Function: preprocess_day_data

    The preprocess_day_data function processes daily SST data by removing the mean value for each day. This helps in normalizing the data.

    Function: preprocess_data

    The preprocess_data function processes the entire dataset by dividing it into chunks and applying the preprocess_day_data function to each chunk. It also handles NaN values by replacing them with 0.0.

    Function: prepare_data_from_processed

    The prepare_data_from_processed function prepares the processed data for training by creating input sequences (X) and corresponding targets (y). Each input sequence contains a window of SST data.

    def preprocess_day_data(day_data):
        day_data = da.squeeze(day_data)
        mean_val = da.nanmean(day_data).compute()  # compute here to get scalar value
        return day_data - mean_val
    
    def preprocess_data(zarr_ds, chunk_size=200):
        total_len = zarr_ds['sst'].shape[0]
        chunk_shape = (chunk_size,) + zarr_ds['sst'].shape[1:]  # Adjusted chunking
        chunks = []
    
        for start_idx in range(0, total_len, chunk_size):
            end_idx = min(start_idx + chunk_size, total_len)
            
            # Directly slice the dask array without wrapping it with da.from_array again
            chunk = zarr_ds['sst'][start_idx:end_idx]
            
            processed_chunk = chunk.map_blocks(preprocess_day_data)
            
            # Use da.where to replace NaNs with 0.0
            processed_chunk = da.where(da.isnan(processed_chunk), 0.0, processed_chunk)
            
            chunks.append(processed_chunk)
    
        return da.concatenate(chunks, axis=0)
    
    processed_data = preprocess_data(zarr_ds).compute()
    
    def prepare_data_from_processed(processed_data, window_size=5): 
        length = processed_data.shape[0]
        X, y = [], []
    
        for i in range(length - window_size):
            X.append(processed_data[i:i+window_size])
            y.append(processed_data[i+window_size])
    
        X, y = da.array(X), da.array(y)
        return X, y
    
    X, y = prepare_data_from_processed(processed_data)
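
    With 2,763 daily time steps and a window size of 5, X should end up with shape (2758, 5, 149, 181) and y with shape (2758, 149, 181). A quick sanity check (hypothetical, not part of the original run):

    print(X.shape)  # expected: (2758, 5, 149, 181)
    print(y.shape)  # expected: (2758, 149, 181)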

    Function: time_series_split

    The time_series_split function splits the input data into training, validation, and test sets based on specified ratios. This is particularly important for time series data to maintain the temporal order.

    The train/validation/test split is 70%, 20%, and 10%, which is a common choice in ML tasks.

    def time_series_split(X, y, train_ratio=0.7, val_ratio=0.2):
        total_length = X.shape[0]
        
        train_end = int(total_length * train_ratio)
        val_end = int(total_length * (train_ratio + val_ratio))
        
        X_train = X[:train_end]
        y_train = y[:train_end]
        
        X_val = X[train_end:val_end]
        y_val = y[train_end:val_end]
        
        X_test = X[val_end:]
        y_test = y[val_end:]
        
        return X_train, y_train, X_val, y_val, X_test, y_test
    
    X_train, y_train, X_val, y_val, X_test, y_test = time_series_split(X, y)

    Function: create_improved_model

    The create_improved_model function defines and constructs a deep learning model using the Sequential API from TensorFlow’s Keras. The model architecture includes a ConvLSTM2D layer followed by several Conv2D layers, each with batch normalization and dropout for regularization.

    Model Architecture

    1. ConvLSTM2D Layer:
      • Purpose: This layer combines convolutional operations with LSTM to capture both spatial and temporal dependencies in the input data. It processes sequences of images (time series of spatial data) to learn spatiotemporal features.
    2. BatchNormalization:
      • Purpose: Normalizes the activations of the previous layer to speed up training and reduce sensitivity to initialization. It helps stabilize and accelerate the training process by maintaining the mean output close to 0 and the output standard deviation close to 1.
    3. Dropout:
      • Purpose: Randomly sets a fraction of input units to 0 at each update during training time to prevent overfitting. In this case, 30% of the input units are dropped out.
    4. Conv2D Layer:
      • Purpose: Applies convolutional filters to the input data to capture spatial features. This layer uses ReLU activation to introduce non-linearity and help the model learn more complex patterns.
    5. Output Layer (Conv2D):
      • Purpose: The final convolutional layer with a linear activation function to predict the SST values. This layer reduces the output to a single channel representing the SST predictions.

    Explanation of Model Parameters and Common Activation Functions

    Parameters

      • filters: The number of feature maps the layer produces (32 or 64 in this model).
      • kernel_size: The spatial size of the convolutional window, (3, 3) for every layer.
      • padding='same': Pads the input so that the spatial dimensions (149 x 181) are preserved through each layer.
      • return_sequences=False: The ConvLSTM2D layer returns only the output for the last time step instead of the full sequence, collapsing the time dimension before the Conv2D layers.
      • input_shape=(5, 149, 181, 1): Five time steps of 149 x 181 grids with one channel (SST).

    Common Activation Functions

      • relu: Outputs max(0, x); introduces non-linearity and is a standard choice for hidden convolutional layers.
      • linear: Passes values through unchanged; used in the output layer so the predicted values can take any real value, as appropriate for regression.

    def create_improved_model(input_shape=(5, 149, 181, 1)):
        model = Sequential()
    
        # ConvLSTM layer
        model.add(ConvLSTM2D(filters=32, kernel_size=(3, 3),
                             input_shape=input_shape,
                             padding='same', return_sequences=False))
        model.add(BatchNormalization())
        model.add(Dropout(0.3))
    
        # Additional Conv2D layers
        model.add(Conv2D(filters=64, kernel_size=(3, 3), padding='same', activation='relu'))
        model.add(BatchNormalization())
        model.add(Dropout(0.3))
    
        model.add(Conv2D(filters=32, kernel_size=(3, 3), padding='same', activation='relu'))
        model.add(BatchNormalization())
        model.add(Dropout(0.3))
    
        # Output layer
        model.add(Conv2D(filters=1, kernel_size=(3, 3), padding='same', activation='linear'))
    
        return model
    
    model = create_improved_model()
    model.summary()
    Model: "sequential_4"
    _________________________________________________________________
     Layer (type)                Output Shape              Param #   
    =================================================================
     conv_lstm2d_7 (ConvLSTM2D)  (None, 149, 181, 32)      38144     
                                                                     
     batch_normalization_10 (Bat  (None, 149, 181, 32)     128       
     chNormalization)                                                
                                                                     
     dropout_6 (Dropout)         (None, 149, 181, 32)      0         
                                                                     
     conv2d_7 (Conv2D)           (None, 149, 181, 64)      18496     
                                                                     
     batch_normalization_11 (Bat  (None, 149, 181, 64)     256       
     chNormalization)                                                
                                                                     
     dropout_7 (Dropout)         (None, 149, 181, 64)      0         
                                                                     
     conv2d_8 (Conv2D)           (None, 149, 181, 32)      18464     
                                                                     
     batch_normalization_12 (Bat  (None, 149, 181, 32)     128       
     chNormalization)                                                
                                                                     
     dropout_8 (Dropout)         (None, 149, 181, 32)      0         
                                                                     
     conv2d_9 (Conv2D)           (None, 149, 181, 1)       289       
                                                                     
    =================================================================
    Total params: 75,905
    Trainable params: 75,649
    Non-trainable params: 256
    _________________________________________________________________

    Model Compilation and Training

    Compilation

    The model is compiled with the following configuration:

      • Optimizer: ‘adam’. The Adam optimizer is used for its efficiency in training deep learning models by adaptively adjusting the learning rate.
      • Loss Function: ‘mae’ (Mean Absolute Error). MAE is chosen for its simplicity and effectiveness in regression tasks, providing a direct measure of the average absolute error between predicted and actual values.
      • Metrics: [‘mae’]. MAE is also used as a metric to evaluate the model’s performance during training and validation.

    Early Stopping

    To prevent overfitting and save training time, an early stopping callback is used:

      • EarlyStopping: Monitors the validation loss and stops training if it doesn’t improve for 10 consecutive epochs (patience=10).
      • restore_best_weights=True: Ensures the model restores the best weights obtained during training when early stopping is triggered.

    Data Preparation

    The training and validation datasets are prepared using TensorFlow’s tf.data.Dataset API:

      • Train Dataset: Created from X_train and y_train, shuffled with a buffer size of 1024 to ensure randomness in each batch, and batched into mini-batches of size 4 for training. Note: you can start with a larger batch size such as 32 and only reduce it if memory issues occur.
      • Validation Dataset: Created from X_val and y_val and batched into mini-batches of size 4 for validation.

    Training

    The model is trained using the prepared datasets and the early stopping callback:

      • Epochs: 50. The training process runs for up to 50 epochs, with early stopping potentially halting it sooner.
      • Validation Data: val_dataset. The validation data is used to monitor the model’s performance and trigger early stopping if necessary.
      • Callbacks: [early_stop]. The early stopping callback halts training when improvement ceases.

    model.compile(optimizer='adam', loss='mae', metrics=['mae'])
    
    early_stop = EarlyStopping(patience=10, restore_best_weights=True)
    
    train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
    train_dataset = train_dataset.shuffle(buffer_size=1024).batch(4)
    
    val_dataset = tf.data.Dataset.from_tensor_slices((X_val, y_val))
    val_dataset = val_dataset.batch(4)
    
    history = model.fit(train_dataset, epochs=50, validation_data=val_dataset, callbacks=[early_stop])
    Epoch 1/50
    483/483 [==============================] - 56s 111ms/step - loss: 0.1785 - mae: 0.1785 - val_loss: 0.1043 - val_mae: 0.1043
    Epoch 2/50
    483/483 [==============================] - 53s 110ms/step - loss: 0.1799 - mae: 0.1799 - val_loss: 0.1079 - val_mae: 0.1079
    Epoch 3/50
    483/483 [==============================] - 54s 111ms/step - loss: 0.1777 - mae: 0.1777 - val_loss: 0.1004 - val_mae: 0.1004
    Epoch 4/50
    483/483 [==============================] - 53s 110ms/step - loss: 0.1756 - mae: 0.1756 - val_loss: 0.1149 - val_mae: 0.1149
    Epoch 5/50
    483/483 [==============================] - 54s 111ms/step - loss: 0.1691 - mae: 0.1691 - val_loss: 0.1290 - val_mae: 0.1290
    Epoch 6/50
    483/483 [==============================] - 53s 110ms/step - loss: 0.1533 - mae: 0.1533 - val_loss: 0.1027 - val_mae: 0.1027
    Epoch 7/50
    483/483 [==============================] - 54s 111ms/step - loss: 0.1417 - mae: 0.1417 - val_loss: 0.1112 - val_mae: 0.1112
    Epoch 8/50
    483/483 [==============================] - 53s 111ms/step - loss: 0.1377 - mae: 0.1377 - val_loss: 0.1165 - val_mae: 0.1165
    Epoch 9/50
    483/483 [==============================] - 54s 110ms/step - loss: 0.1363 - mae: 0.1363 - val_loss: 0.1182 - val_mae: 0.1182
    Epoch 10/50
    483/483 [==============================] - 53s 110ms/step - loss: 0.1331 - mae: 0.1331 - val_loss: 0.1063 - val_mae: 0.1063
    Epoch 11/50
    483/483 [==============================] - 53s 110ms/step - loss: 0.1339 - mae: 0.1339 - val_loss: 0.1073 - val_mae: 0.1073
    Epoch 12/50
    483/483 [==============================] - 54s 111ms/step - loss: 0.1315 - mae: 0.1315 - val_loss: 0.1113 - val_mae: 0.1113
    Epoch 13/50
    483/483 [==============================] - 54s 112ms/step - loss: 0.1319 - mae: 0.1319 - val_loss: 0.1030 - val_mae: 0.1030

    Visualizing Training History

    To understand how the model’s performance improves over time, we can visualize the training and validation loss, as well as the Mean Absolute Error (MAE) over the epochs.

    # Plot training & validation loss values
    plt.figure(figsize=(10, 6))
    plt.plot(history.history['loss'], label='Train Loss')
    plt.plot(history.history['val_loss'], label='Validation Loss')
    plt.title('Model Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend(loc='upper right')
    plt.grid(True)
    plt.show()
    
    # Plot training & validation MAE values
    plt.figure(figsize=(10, 6))
    plt.plot(history.history['mae'], label='Train MAE')
    plt.plot(history.history['val_mae'], label='Validation MAE')
    plt.title('Model Mean Absolute Error (MAE)')
    plt.xlabel('Epoch')
    plt.ylabel('MAE')
    plt.legend(loc='upper right')
    plt.grid(True)
    plt.show()

    Evaluating the Model on the Test Dataset

    After training the model, it’s important to evaluate its performance on a separate test dataset to ensure it generalizes well to unseen data.

    # Prepare test dataset
    test_dataset = tf.data.Dataset.from_tensor_slices((X_test, y_test))
    test_dataset = test_dataset.batch(4)
    
    # Evaluate the model on the test dataset
    test_loss, test_mae = model.evaluate(test_dataset)
    print(f"Test Loss: {test_loss}")
    print(f"Test MAE: {test_mae}")
    69/69 [==============================] - 4s 46ms/step - loss: 0.1042 - mae: 0.1042
    Test Loss: 0.10424796491861343
    Test MAE: 0.10424796491861343

    Saving the Model

    Save the entire model, including the architecture, weights, and optimizer state.

    You can use model = tf.keras.models.load_model('ConvLSTM_SST.keras') to load the saved model for further training or evaluation.

    model.save('ConvLSTM_SST.keras')
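
    As a quick illustration (a sketch, not part of the original run), the saved file can be reloaded and evaluated on the test dataset prepared earlier to confirm it reproduces the same test MAE:

    # Sketch: reload the saved model and re-evaluate it on the test set
    loaded_model = tf.keras.models.load_model('ConvLSTM_SST.keras')
    loaded_loss, loaded_mae = loaded_model.evaluate(test_dataset)
    print(f"Reloaded model test MAE: {loaded_mae}")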

    Functions for Visualizing Model Predictions

    Here are some utility functions to preprocess input data, postprocess predictions, and visualize the results of the trained model’s predictions.

    Preprocess Input Data

    The preprocess_vis_input_data function preprocesses the input data by normalizing it and handling NaN values.

    Postprocess Prediction

    The postprocess_prediction function postprocesses the model’s prediction by applying a land mask and adding back the historical mean.

    Predict and Plot

    The predict_and_plot function selects a time window, makes a prediction, postprocesses it, and visualizes the input, predicted, and true output data.

    Compute MAE

    The compute_mae function computes the Mean Absolute Error (MAE) between true and predicted values, ignoring NaNs.

    def preprocess_vis_input_data(day_data):
        day_data = np.squeeze(day_data)
        mean_val = np.nanmean(day_data)
        processed_data = day_data - mean_val
        # Replace NaNs with 0.0
        processed_data = np.where(np.isnan(processed_data), 0.0, processed_data)
        return processed_data
    
    def postprocess_prediction(prediction, input_data):
        # Load the precomputed land mask (True over land cells)
        land_mask = np.load('land_mask_nc.npy')
        # Set land positions in the prediction to NaN
        prediction[land_mask] = np.nan
        
        # Add back the historical mean
        mean_val = np.nanmean(input_data)
        prediction = np.where(np.isnan(prediction), np.nan, prediction + mean_val)
        
        return prediction
    
    def predict_and_plot(date_to_predict, window_size, model, dataset, plot=True):
        # Step 1: Select the time window
        time_index = np.where(dataset['time'].values == np.datetime64(date_to_predict))[0][0]
        input_data_raw = dataset['sst'][time_index-window_size:time_index].values
        true_output_raw = dataset['sst'][time_index].values
        print(input_data_raw.shape)
        print(true_output_raw.shape)
        # Preprocess the input data
        input_data = np.array([preprocess_vis_input_data(day) for day in input_data_raw])
        
        # Step 2: Make prediction
        prediction = model.predict(input_data[np.newaxis, ...])[0]
        
        # Postprocess the prediction
        prediction_postprocessed = postprocess_prediction(prediction, input_data_raw)
        print(prediction_postprocessed.shape)
        # Step 3: Visualize
        if plot:
            # Determine common scale for all plots
            input_data_raw = input_data_raw[..., np.newaxis]
            true_output_raw = true_output_raw[np.newaxis, ..., np.newaxis]
            prediction_postprocessed = prediction_postprocessed[np.newaxis, ...]
            
            all_data = np.concatenate([input_data_raw, prediction_postprocessed, true_output_raw])
            vmin = np.nanmin(all_data)
            vmax = np.nanmax(all_data)
            
            def plot_sample(sample, title=''):
                sample_2d = np.squeeze(sample)
                plt.imshow(sample_2d, cmap='viridis', vmin=vmin, vmax=vmax)
                plt.title(title)
                plt.colorbar()
                plt.show()
    
            # show input frames
            for i, frame in enumerate(input_data_raw):
                plot_sample(frame, title=f'Input Frame {i+1} ({dataset["time"].values[time_index-window_size+i]})')
            
            # show predicted output
            plot_sample(prediction_postprocessed, title=f'Predicted Output ({date_to_predict})')
            
            # show true output
            plot_sample(true_output_raw, title=f'True Output ({date_to_predict})')
    
        return input_data_raw, prediction_postprocessed, true_output_raw
    
    def compute_mae(y_true, y_pred):
        mask = ~np.isnan(y_true) & ~np.isnan(y_pred)
        return np.mean(np.abs(y_true[mask] - y_pred[mask]))
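
    The postprocess_prediction function relies on a precomputed land mask stored in land_mask_nc.npy, which is not generated in this notebook. Below is a minimal sketch of how such a mask could be built, under the assumption that land cells are NaN in the SST field of a reference day (the actual mask file may have been produced differently):

    # Sketch (assumption): derive a land mask from NaN cells of one SST field and save it.
    reference_sst = zarr_ds['sst'].isel(time=0).values        # shape (149, 181)
    land_mask = np.isnan(reference_sst)[..., np.newaxis]      # shape (149, 181, 1), matching the prediction
    np.save('land_mask_nc.npy', land_mask)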

    Predict and Plot

    First, we select a date to predict and define the window size for the input data. We then use the predict_and_plot function to get the input data, predicted output, and true output for the specified date.

    A useful comparison standard is that the MAE between the predicted output and the true output should be smaller than the MAE between the last input frame and the true output. This indicates that the model’s prediction is closer to the true output than simply using the last known input.

    date_to_predict = '2020-07-02'
    window_size = 5
    input_data, predicted_output, true_output = predict_and_plot(date_to_predict, window_size, model, zarr_ds)
    
    predicted_mae = compute_mae(true_output, predicted_output)
    print(f"MAE between Predicted Output and True Output: {predicted_mae}")
    
    last_input_frame = input_data[-1]
    last_input_frame_2d = np.squeeze(last_input_frame)
    true_output_2d = np.squeeze(true_output)
    last_frame_mae = compute_mae(true_output_2d, last_input_frame_2d)
    print(f"MAE between Last Input Frame and True Output: {last_frame_mae}")
    (5, 149, 181)
    (149, 181)
    1/1 [==============================] - 0s 23ms/step
    (149, 181, 1)
    MAE between Predicted Output and True Output: 0.13109762966632843
    MAE between Last Input Frame and True Output: 0.13788197934627533