Benchmarking Model Accuracy

How to run this?

Go here to instantly run all the code below in your browser.

Use Case

This use case accompanies the document "How to compare different ship performance models?". We recommend reading through the document first.

This document provides a framework to compare different ship performance models fairly. To recap, these are the broad steps of the framework:

  1. First, we establish a benchmark dataset

This dataset will provide the ‘ground truth’ of measured values that the models should try to estimate. It should be free of errors and constrained to sea-going conditions.

  2. Next, we determine a set of quantitative metrics

To measure how accurate the models are, it is important to have meaningful, robust and quantifiable metrics. All metrics must be calculated on the exact same dataset.

  3. Finally, we benchmark the results

We compare our obtained quantitative metrics to determine the most accurate model.

This use case will demonstrate how steps 1 and 2 of the framework can be applied to a Toqua Ship Kernel. Step 3 requires the presence of other models to compare the results against.

Some familiarity with the Toqua API and used Python frameworks is required to understand this notebook. If this is not the case, we recommend first going through the "Basics" and "Speed-Fuel curve" use cases.

Setup

Fill in the IMO number of your ship below.

IMO_NUMBER = "9999999"

1. Establish a benchmark dataset

As a first step, we establish a dataset on which models can be benchmarked. This dataset will provide the ‘ground truth’ of measured values that the models should try to estimate. It should be free of errors and constrained to sea-going conditions. It is important that this exact same dataset is used for all models.

Creating such a benchmarking dataset is done by following this five-step process:

  1. Choosing a data source (sensor data or noon reports)
  2. Choosing an evaluation period
  3. Filtering outliers
  4. Filtering for unaccounted external effects
  5. Constraining to sea-going conditions

We will go through each of these steps in turn.

1. Choosing a data source

The first step is to determine which data source to use. Usually this comes down to a choice between noon reports and sensor data. Sensor data is preferred: it has a significantly higher frequency than noon reports and is the most accurate, provided that data quality is properly monitored.

For illustration purposes, we've generated two toy datasets: one sensor dataset containing all parameters required by our Ship Kernel models and one dataset containing cleaning events.

Let's examine what these datasets look like.

import pandas as pd

df = pd.read_csv("data/ship_data.csv", parse_dates=["datetime_end"])
df
datetime_end voyage stw me_fo_consumption ship_heading wind_direction wind_speed wave_direction wave_height current_speed current_direction fuel_specific_energy draft_avg sea_depth
0 2022-06-24 22:45:00+00:00 1 12.736735 51.056979 116.834744 256.825078 4.652994 167.403422 1.808797 0.174668 1.132839 41.500563 19.199545 1754.005338
1 2022-06-24 23:00:00+00:00 1 11.912045 46.441744 97.254636 170.989375 4.937206 161.889526 2.046644 0.170478 104.104376 41.500402 19.199505 848.125564
2 2022-06-24 23:15:00+00:00 1 13.043936 48.837654 192.273266 -23.061444 4.768308 250.303015 0.390783 0.599106 218.323053 41.500723 19.199856 -1114.681358
3 2022-06-24 23:30:00+00:00 1 12.340733 46.220231 158.798366 210.453751 3.023702 230.254281 1.296289 -0.142017 51.892648 41.498476 19.200029 1357.116871
4 2022-06-24 23:45:00+00:00 1 12.867363 44.134012 104.596701 175.070960 5.494352 188.850350 0.912953 0.526234 244.260118 41.499404 19.198857 693.841692
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
31945 2023-05-23 17:00:00+00:00 11 12.106177 29.341714 247.211329 120.444830 7.266383 193.127350 1.548775 0.241010 144.109422 41.499693 9.500587 4421.447117
31946 2023-05-23 17:15:00+00:00 11 13.274439 36.390085 342.079971 109.189566 7.885734 242.281907 1.818478 0.436219 89.660876 41.500398 9.498560 4176.561835
31947 2023-05-23 17:30:00+00:00 11 13.459992 38.990243 256.667607 103.239181 6.589824 212.017387 1.508247 0.093702 180.322267 41.499994 9.499034 3604.806021
31948 2023-05-23 17:45:00+00:00 11 12.720441 34.616783 269.605791 135.022811 7.284068 175.945316 1.956245 0.095444 95.676050 41.498040 9.499939 3918.618366
31949 2023-05-23 18:00:00+00:00 11 13.161837 37.028321 306.750306 87.565056 6.327873 190.728030 1.919243 0.231607 150.171653 41.499354 9.499044 5235.721062

31950 rows × 14 columns

df.describe()
voyage stw me_fo_consumption ship_heading wind_direction wind_speed wave_direction wave_height current_speed current_direction fuel_specific_energy draft_avg sea_depth
count 31950.000000 31950.000000 31950.000000 31950.000000 31950.000000 31950.000000 31950.000000 31950.000000 31950.000000 31950.000000 31950.000000 31950.000000 31950.000000
mean 6.025039 9.815572 38.082985 149.195428 169.363234 5.797681 173.740267 1.434671 0.313175 160.650100 41.504645 15.268149 2273.463318
std 2.899940 4.666131 15.047520 91.118987 89.429703 2.777208 75.972676 0.941641 0.241369 92.110663 0.067866 4.663984 1853.460963
min 1.000000 -5.129072 -31.911774 -262.832715 -238.175233 -5.784175 -211.287347 -1.935043 -0.846737 -265.876376 40.519639 3.807594 -3847.904748
25% 3.000000 9.945383 22.365897 69.583797 104.914917 3.791703 130.165614 0.652885 0.150052 108.216858 41.499380 9.781429 318.648104
50% 6.000000 11.963889 42.508627 149.153555 168.878906 5.757093 182.832441 1.283217 0.271978 155.810387 41.500054 16.901380 2270.289218
75% 9.000000 12.712263 50.635050 217.592238 223.890564 7.644993 215.076361 2.120300 0.440932 224.127395 41.500564 19.201264 3692.330283
max 11.000000 19.905735 76.803888 639.734658 636.982189 18.290113 560.098908 5.125406 1.747020 601.952975 43.750813 26.274611 9429.949260

We see there's about 11 months of data, with 15 minutes between samples. All parameters necessary for our Ship Kernel model are present, including weather data. Weather data is typically not measured on board, but it can be obtained from external weather services and merged onto a sensor dataset using timestamp and geolocation. That is what we've done to create this dataset.

The dataset contains some obviously erroneous values, such as negative values for parameters that cannot be negative (sea depth, wave height, fuel consumption, ...). A voyage number is also given, making it possible to quantify the accuracy per voyage.

Note that if the voyage number is not available in the dataset, it can simply be set to the same value everywhere; the code below will still work.
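As a minimal sketch of that fallback (the tiny frame below is hypothetical, standing in for the sensor dataset):

```python
import pandas as pd

# hypothetical minimal frame standing in for the sensor DataFrame
df = pd.DataFrame({"stw": [12.1, 12.5, 11.9]})

# if no voyage numbers exist, assign the same id everywhere so that
# the per-voyage groupby later in this notebook keeps working
if "voyage" not in df.columns:
    df["voyage"] = 1
```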

df_events = pd.read_csv("data/ship_events.csv", parse_dates=["datetime"])
df_events
datetime type
0 2022-06-25 hull_cleaning
1 2023-05-08 dry_dock

The ship had a hull cleaning on June 25, 2022 and went into dry dock on May 8, 2023.

2. Choosing an evaluation period

Now that we have a dataset, we must choose an appropriate evaluation period. This is a period during which the ship's performance is assumed to be stable, meaning it should be free from long idling periods, cleaning events, dry-dockings and ESD installations. To get a robust estimate of how good the model is, it should also be as long as possible.

To determine the evaluation period we will create a small plot visualizing the events and the STW over time, colored per voyage. For now, this will give us an adequate indication of when the ship is idling. However, in real-world scenarios the STW sensor can fail, so we recommend combining multiple idling indicators such as the ship location, SOG and RPM.
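Combining multiple indicators could look like the sketch below. The `sog` and `rpm` columns and the thresholds are hypothetical; the toy dataset in this notebook only contains STW.

```python
import pandas as pd

# hypothetical sample rows; in practice these columns would come from sensors
df = pd.DataFrame({
    "stw": [0.1, 12.4, 0.0],
    "sog": [0.2, 12.1, 0.1],
    "rpm": [0.0, 65.0, 0.0],
})

# flag a sample as idling only when several independent indicators agree,
# so a single failing sensor cannot mislabel the data (thresholds illustrative)
idling = (df["stw"] < 1) & (df["sog"] < 1) & (df["rpm"] < 5)
```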

import matplotlib.pyplot as plt
from datetime import datetime

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(20, 3))
ax.scatter(df["datetime_end"], df["stw"], c=df["voyage"], s=0.2)
ax.set_ylabel("STW [kn]")
for event in df_events.itertuples():
    ax.axvline(event.datetime, color="red")
    ax.text(event.datetime, y=1.03, s=event.type, color="red", transform=ax.get_xaxis_transform(), horizontalalignment='center')
    
ax.axvspan("2023-04-02", "2023-04-20", color="green", alpha=0.2)
ax.text(datetime.strptime("2023-04-11", "%Y-%m-%d"), y=1.03, s="long\nidling period", transform=ax.get_xaxis_transform(), horizontalalignment='center', color='green')

ax.axvspan("2022-06-25", "2022-12-12", color="blue", alpha=0.1)
ax.text(datetime.strptime("2022-09-12", "%Y-%m-%d"), y=1.03, s="training period", transform=ax.get_xaxis_transform(), horizontalalignment='center', color='blue')

[Figure: STW over time, colored per voyage, with the hull cleaning, dry dock, long idling period and training period annotated]

On the above graph we see that the ship had a hull cleaning, a dry dock and a long idling period of about 20 days. We also see a period in blue, indicated as "training period".

The hull cleaning, dry dock and long idling period are events that could alter ship performance. Therefore, our evaluation period will have to lie between the hull cleaning and the idling period. However, not all data within that window can be used. The Toqua Ship Kernel is a data-driven model, meaning it learns (is "trained") from seen data to predict the outcome on new data. In this case, the period between 2022-06-24 and 2022-12-13 has already been used to train the model. As the model has "seen" this data, it would be unfair to evaluate it on that same data.

Therefore, only the data after the training period remains, ranging from 2022-12-14 to 2023-04-02 and spanning 3 voyages.

Ideally, aim for at least 2 months of data spanning 2 voyages.

period_filters = [
    "datetime_end > '2022-12-14T00:00+00:00'", # keep only data after the training period
    "datetime_end < '2023-04-02T00:00+00:00'"  # keep only data before the long idling period
]

# apply each filter in turn using the .query method
for f in period_filters:
    df = df.query(f)

3. Filtering outliers

Outliers are data points that are significantly different from the rest of the data. They can be caused by sensor errors, human errors or other external factors. We can filter outliers using advanced outlier detection algorithms, or by applying simple thresholds.

In this case, we will apply a set of very simple thresholds. These thresholds filter out data points that are obviously erroneous.
For example: directions must be between 0 and 360 degrees, fuel consumption cannot be below 0 and sea depth cannot be below 0.

From knowledge of the ship we also know that she can only sail between a draft of 9 and 20m and never uses fuels with an LCV value below 40 MJ/kg.

Be wary that these simple filters definitely do not capture all outliers.
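As a sketch of what a slightly more advanced approach could look like, a rolling-median filter can catch spikes that sit within the plausible value range and would therefore pass the simple thresholds. The window size and threshold below are illustrative, not tuned values.

```python
import pandas as pd

# toy consumption signal with one spike that simple range thresholds would miss
s = pd.Series([40.0, 41.0, 39.5, 75.0, 40.5, 41.2, 39.8])

# flag points that deviate strongly from a centered rolling median
rolling_median = s.rolling(window=5, center=True, min_periods=1).median()
is_outlier = (s - rolling_median).abs() > 10
```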

outlier_filters = [
    "stw > 0",
    "me_fo_consumption > 0",
    "fuel_specific_energy > 40",
    "9 < draft_avg < 20",
    "0 < ship_heading < 360",
    "0 < wind_direction < 360",
    "wind_speed > 0",
    "0 < wave_direction < 360",
    "wave_height > 0",
    "0 < current_direction < 360",
    "current_speed > 0",
    "sea_depth > 0",
]

for f in outlier_filters:
    df = df.query(f)

4. Filtering unaccounted for external effects

Next, we filter out samples that are affected by external factors which are not accounted for by the model. These are factors the model "has no idea about". In this case, our Ship Kernel takes into account all parameters of the dataset except sea depth, i.e. sea depth is not an input of the model. However, a shallow sea does result in a perceived decrease in performance due to additional drag. As this effect is not accounted for by the model, we will filter out samples with a sea depth below 3 times the draft.

If available, we also recommend filtering out extreme rudder angles and sea water temperatures below 5°C.

external_effect_filters = [
    "sea_depth > 3 * draft_avg"
]

for f in external_effect_filters:
    df = df.query(f)
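If rudder angle and sea water temperature were available, the same query pattern would extend naturally. The column names and thresholds below are hypothetical, chosen to mirror the recommendations above:

```python
import pandas as pd

# hypothetical columns; the toy dataset in this notebook does not contain them
df = pd.DataFrame({
    "rudder_angle": [2.0, 15.0, -1.5],
    "sea_water_temperature": [12.0, 14.0, 3.0],
})

extra_filters = [
    "-10 < rudder_angle < 10",    # exclude extreme rudder angles
    "sea_water_temperature > 5",  # exclude near-freezing water
]

for f in extra_filters:
    df = df.query(f)
```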

5. Constraining to sea-going conditions

Finally, we remove any conditions during which the vessel is not in normal operation conditions (e.g. during port stay, manoeuvring, etc.).
For this we will use the following filters:

  • Speed > 5 kn
  • Fuel Consumption > 5 mt/day

sea_condition_filters = [
    "stw > 5",
    "me_fo_consumption > 5",
]

for f in sea_condition_filters:
    df = df.query(f)

After all filters, the following dataset remains, meeting all the requirements for a correct benchmark dataset.

df.describe()
voyage stw me_fo_consumption ship_heading wind_direction wind_speed wave_direction wave_height current_speed current_direction fuel_specific_energy draft_avg sea_depth
count 5884.000000 5884.000000 5884.000000 5884.000000 5884.000000 5884.000000 5884.000000 5884.000000 5884.000000 5884.000000 5884.000000 5884.000000 5884.000000
mean 8.213460 11.872455 40.795409 123.062166 173.442223 6.053886 185.825918 1.658507 0.308868 167.165553 41.503330 16.600232 2652.149262
std 0.819602 1.163993 14.286145 62.510900 66.431414 2.628462 53.737770 0.961057 0.175661 77.421203 0.045546 4.478385 1776.279635
min 7.000000 5.038316 5.232447 0.024387 0.602252 0.001108 0.923097 0.000026 0.000378 0.040091 40.968636 9.028157 29.987202
25% 7.000000 10.978792 22.446645 69.331952 122.681096 4.147982 160.235449 0.698805 0.179125 109.978126 41.499322 9.501307 1101.395129
50% 8.000000 12.319606 48.437033 113.701521 173.516801 5.912339 192.175421 1.770468 0.291150 166.916770 41.500001 19.199770 2703.290472
75% 9.000000 12.690115 51.749307 185.721533 219.451190 7.818401 215.603917 2.393301 0.415947 224.693505 41.500725 19.201743 4122.457020
max 9.000000 16.950918 67.399952 358.343680 359.091475 15.143205 359.514260 4.706954 1.436451 359.438882 42.629039 20.000000 8401.586696

2. Calculate quantitative metrics

To measure how accurate the models are it is important to have meaningful, robust and quantifiable metrics. We will provide an implementation of the following metrics:

  • Mean Absolute Percentage Error (MAPE)
    $$\text{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left|\frac{\hat{y_i} - y_i}{y_i}\right|$$

  • Daily Percentage Error (DPE)
    $$\text{DPE} = \frac{100}{D} \sum_{d=1}^{D} \left| \frac{\sum_{i \in d} (\hat{y}_i - y_i)}{\sum_{j \in d} y_j} \right|$$

  • Voyage Error (VE)
    $$\text{VE} = \frac{100}{V} \sum_{v=1}^{V} \left| \frac{\sum_{i \in v} (\hat{y}_i - y_i)}{\sum_{j \in v} y_j} \right|$$

  • Bias Percentage Error (BPE)
    $$\text{BPE} = 100 \cdot \frac{\sum_{i=1}^{N} \hat{y}_i - \sum_{j=1}^{N} y_j}{\sum_{k=1}^{N} y_k}$$

A detailed explanation of these metrics, accompanied by toy examples, may be found in the whitepaper.

import numpy as np

def mape(true, pred):
    """Mean Absolute Percentage Error
    
    true: a list of measured values
    pred: a list of predicted values
    """
    return np.mean(np.abs((pred - true) / true)) * 100

def bpe(true, pred):
    """Bias Percentage Error
    
    true: a list of measured values
    pred: a list of predicted values
    """
    return np.sum(pred - true) / np.sum(true) * 100

def dpe(true, pred, days):
    """Daily Percentage Error
    
    true: a list of measured values
    pred: a list of predicted values
    days: a list of days, corresponding to the day on which the measurement was done
    """
    metrics_per_day = []
    for d in np.unique(days):
        mask = days == d
        metrics_per_day.append(np.abs(bpe(true[mask], pred[mask])))
    return np.mean(metrics_per_day)


def ve(true, pred, voyages):
    """Voyage Error
    
    true: a list of measured values
    pred: a list of predicted values
    voyages: a list of voyage ids, corresponding to the voyage during which the measurement was done
    """
    metrics_per_voyage = []
    for v in np.unique(voyages):
        mask = voyages == v
        metrics_per_voyage.append(np.abs(bpe(true[mask], pred[mask])))
    return np.mean(metrics_per_voyage)
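As a quick sanity check, the metric functions can be exercised on a tiny hand-made example. The definitions are repeated here so the snippet runs on its own, and the values are chosen for easy mental arithmetic:

```python
import numpy as np

# repeated from above so this snippet is self-contained
def mape(true, pred):
    return np.mean(np.abs((pred - true) / true)) * 100

def bpe(true, pred):
    return np.sum(pred - true) / np.sum(true) * 100

true = np.array([100.0, 100.0, 200.0, 200.0])
pred = np.array([125.0, 75.0, 250.0, 150.0])

# per-sample errors are +25%, -25%, +25% and -25%; MAPE averages their magnitudes
print(mape(true, pred))  # 25.0
# over the whole period the over- and underestimates cancel out exactly
print(bpe(true, pred))   # 0.0
```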

We can use above functions to calculate the metrics, given the ground truth values and the values predicted by the model. The ground truth values we obtain from the benchmark dataset established earlier.

All that is left is to obtain the predicted values from the Ship Kernel through the Toqua API.

import logging
from time import sleep
from typing import Dict
import pandas as pd
import requests

def make_api_call(method, url, payload=None):
    """Generic function to make an API call to the Toqua API."""
    headers = {
        "accept": "application/json",
        "content-type": "application/json",
        "X-API-Key": API_KEY,
    }
    
    if method == 'GET':
        response = requests.get(url, headers=headers)
    elif method == 'POST':
        response = requests.post(url, json=payload, headers=headers)
    else:
        print("Error: Invalid method")
        return None
    try:
        response.raise_for_status()
    except requests.HTTPError as e:
        logging.error(f"Error: {e}")
        logging.error(response.content)
        raise e
    output_json = response.json()
    del output_json["errors"]
    return output_json


def predict(imo_number, df):
    """Do a prediction on the Toqua Ship Kernel of a given vessel.
    
    This implementation is a bit more complicated than in previous use cases. Because the Toqua API
    accepts a maximum of 9000 rows at a time, we need to split the data into chunks of at most 9000 rows.
    """
    url = f"https://api.toqua.ai/ships/{imo_number}/models/latest/predict"
    
    # the Toqua API accepts at most 9000 rows per request
    chunk_size = 9000
    # split the input data into the required number of chunks (ceiling division)
    chunks = np.array_split(df, int(np.ceil(len(df) / chunk_size)))
    prediction_chunks = []
    # for each chunk, predict the values
    for chunk in chunks:
        model_input = { "data": chunk.drop(columns=["datetime_end", "me_fo_consumption"]).to_dict(orient="list") }
        pred = make_api_call('POST', url, model_input)
        prediction_chunks.append(pd.DataFrame(pred, index=chunk.index))
    # finally, concatenate the results again
    return pd.concat(prediction_chunks)
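The chunking arithmetic can be sketched in isolation: `np.array_split` distributes rows as evenly as possible over the requested number of chunks (the toy sizes below are illustrative):

```python
import numpy as np
import pandas as pd

# toy frame of 10 rows with a hypothetical chunk size of 4
df = pd.DataFrame({"x": range(10)})
chunk_size = 4

# ceiling division gives the number of requests needed: 10 rows -> 3 chunks
n_chunks = int(np.ceil(len(df) / chunk_size))
chunks = np.array_split(df, n_chunks)
print([len(c) for c in chunks])  # [4, 3, 3]
```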


def calculate_metrics(df):
    """Calculate all metrics from a single dataframe containing the true and predicted values.
    
    The column me_fo_consumption contains the true fuel consumption values, the column predicted_me_fo_consumption contains the predicted values.
    """
    return pd.Series([
        mape(df.me_fo_consumption, df.predicted_me_fo_consumption),
        dpe(df.me_fo_consumption, df.predicted_me_fo_consumption, df.datetime_end.dt.round("1d")),
        ve(df.me_fo_consumption, df.predicted_me_fo_consumption, df.voyage),
        bpe(df.me_fo_consumption, df.predicted_me_fo_consumption)
    ], index=["Mean Absolute Percentage Error [%]", "Daily Percentage Error [%]", "Voyage Error [%]", "Bias Percentage Error [%]"])

df_predictions = predict(IMO_NUMBER, df)

# add the predictions as new column in the old dataframe for easier comparison
df["predicted_me_fo_consumption"] = df_predictions["me_fo_consumption"]

# show the true values vs the raw predictions
df[["me_fo_consumption", "predicted_me_fo_consumption"]]
me_fo_consumption predicted_me_fo_consumption
16518 49.292736 47.298896
16519 46.580038 49.660815
16520 45.939625 44.305224
16522 58.059408 50.579893
16528 47.758690 40.819772
... ... ...
26946 51.041670 44.648453
26948 41.748385 38.688569
26949 50.465256 49.859098
26950 50.465256 50.770080
26951 50.465256 21.524687

5884 rows × 2 columns

Finally, we calculate the metrics and print them.

Note that if the voyage number is not available in the dataset, it can be set to the same value everywhere; the Voyage Error will then simply equal the absolute value of the BPE.

calculate_metrics(df)
Mean Absolute Percentage Error [%]    6.681542
Daily Percentage Error [%]            2.296083
Voyage Error [%]                      0.945030
Bias Percentage Error [%]            -0.401064
dtype: float64

We can see that our Ship Kernel has a MAPE of 6.68%, a DPE of 2.30%, a Voyage Error of 0.95% and a BPE of -0.40%.

As expected, the MAPE has the highest error, since it is calculated on the shortest time scale (every 15 minutes). The BPE is calculated on the longest time scale (the entire evaluation period) and thus has the lowest error, because positive and negative errors cancel each other out. It shows that, on average, the model tends to underestimate by 0.40%. The DPE and VE are calculated on intermediate time scales and fall somewhere in between.

The voyage error is an interesting metric to look at more closely. We can examine the errors per voyage separately to determine whether the model is more accurate on certain voyages than on others, which would indicate that the model is less performant in specific conditions.

# group the dataframe by voyage and apply the metrics calculation function to each group
df.groupby("voyage").apply(calculate_metrics)
Mean Absolute Percentage Error [%] Daily Percentage Error [%] Voyage Error [%] Bias Percentage Error [%]
voyage
7 6.702404 2.172954 1.595164 -1.595164
8 6.753125 2.209183 0.843010 -0.843010
9 6.627150 2.740728 0.396915 0.396915

In this per-voyage table, the Voyage Error and the Bias Percentage Error coincide (up to sign): the Voyage Error is simply the mean of the absolute values of the per-voyage BPEs, and each group here contains only a single voyage. We can see quite some differences between the voyages, demonstrating that the model is not infallible.
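The same groupby pattern extends to other time scales. As a sketch with hypothetical values, a per-sample error column could be aggregated per calendar week using `pd.Grouper`:

```python
import numpy as np
import pandas as pd

# toy frame with a hypothetical per-sample error column
df = pd.DataFrame({
    "datetime_end": pd.date_range("2023-01-01", periods=28, freq="D"),
    "error_pct": np.linspace(-2.0, 2.0, 28),
})

# aggregate the errors per calendar week instead of per voyage
weekly = df.groupby(pd.Grouper(key="datetime_end", freq="W"))["error_pct"].mean()
```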

All that remains now to execute an objective benchmarking study is to run this same methodology for other models and see how they stack up against each other.