Benchmarking Model Accuracy

How to run this?

Go here to instantly run all the code below in your browser.

Use Case

This use case accompanies the document "How to compare different ship performance models?". We recommend reading through the document first.

This document provides a framework to compare different ship performance models fairly. To recap, these are the broad steps of the framework:

  1. First, we establish a benchmark dataset

This dataset will provide the ‘ground truth’ of measured values that the models should try to estimate. It should be free of errors and constrained to sea-going conditions.

  2. Next, we determine a set of quantitative metrics

To measure how accurate the models are, it is important to have meaningful, robust and quantifiable metrics. All metrics must be calculated on the exact same dataset.

  3. Finally, we benchmark the results

We compare our obtained quantitative metrics to determine the most accurate model.

This use case will demonstrate how steps 1 and 2 of the framework can be applied to a Toqua Ship Kernel. Step 3 requires the presence of other models to compare the results against.

Some familiarity with the Toqua API and used Python frameworks is required to understand this notebook. If this is not the case, we recommend first going through the "Basics" and "Speed-Fuel curve" use cases.

Setup

Fill in the IMO number of your ship below.

IMO_NUMBER = "9999999"

1. Establish a benchmark dataset

As a first step, we establish a dataset on which models can be benchmarked. This dataset will provide the ‘ground truth’ of measured values that the models should try to estimate. It should be free of errors and constrained to sea-going conditions. It is important that this exact same dataset is used for all models.

Creating such a benchmarking dataset is done by following this five-step process:

  1. Choosing a data source (sensor data or noon reports)
  2. Choosing an evaluation period
  3. Filtering outliers
  4. Filtering for unaccounted external effects
  5. Constraining to sea-going conditions

We will go through each of these steps in turn.

1. Choosing a data source

The first step is to determine which data source to use. Usually this comes down to a choice between noon reports and sensor data. Sensor data is preferred: it has a significantly higher frequency than noon reports and is the most accurate, provided that data quality is properly monitored.

For illustration purposes, we've generated two toy datasets: one sensor dataset containing all parameters required by our Ship Kernel models and one dataset containing cleaning events.

Let's examine what these datasets look like.

import pandas as pd

df = pd.read_csv("data/ship_data.csv", parse_dates=["datetime_end"])
df
datetime_end voyage stw me_fo_consumption ship_heading wind_direction wind_speed wave_direction wave_height current_speed current_direction fuel_specific_energy draft_avg sea_depth
0 2022-06-24 22:45:00+00:00 1 12.736735 51.056979 116.834744 256.825078 4.652994 167.403422 1.808797 0.174668 1.132839 41.500563 19.199545 1754.005338
1 2022-06-24 23:00:00+00:00 1 11.912045 46.441744 97.254636 170.989375 4.937206 161.889526 2.046644 0.170478 104.104376 41.500402 19.199505 848.125564
2 2022-06-24 23:15:00+00:00 1 13.043936 48.837654 192.273266 -23.061444 4.768308 250.303015 0.390783 0.599106 218.323053 41.500723 19.199856 -1114.681358
3 2022-06-24 23:30:00+00:00 1 12.340733 46.220231 158.798366 210.453751 3.023702 230.254281 1.296289 -0.142017 51.892648 41.498476 19.200029 1357.116871
4 2022-06-24 23:45:00+00:00 1 12.867363 44.134012 104.596701 175.070960 5.494352 188.850350 0.912953 0.526234 244.260118 41.499404 19.198857 693.841692
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
31945 2023-05-23 17:00:00+00:00 11 12.106177 29.341714 247.211329 120.444830 7.266383 193.127350 1.548775 0.241010 144.109422 41.499693 9.500587 4421.447117
31946 2023-05-23 17:15:00+00:00 11 13.274439 36.390085 342.079971 109.189566 7.885734 242.281907 1.818478 0.436219 89.660876 41.500398 9.498560 4176.561835
31947 2023-05-23 17:30:00+00:00 11 13.459992 38.990243 256.667607 103.239181 6.589824 212.017387 1.508247 0.093702 180.322267 41.499994 9.499034 3604.806021
31948 2023-05-23 17:45:00+00:00 11 12.720441 34.616783 269.605791 135.022811 7.284068 175.945316 1.956245 0.095444 95.676050 41.498040 9.499939 3918.618366
31949 2023-05-23 18:00:00+00:00 11 13.161837 37.028321 306.750306 87.565056 6.327873 190.728030 1.919243 0.231607 150.171653 41.499354 9.499044 5235.721062

31950 rows × 14 columns

df.describe()
voyage stw me_fo_consumption ship_heading wind_direction wind_speed wave_direction wave_height current_speed current_direction fuel_specific_energy draft_avg sea_depth
count 31950.000000 31950.000000 31950.000000 31950.000000 31950.000000 31950.000000 31950.000000 31950.000000 31950.000000 31950.000000 31950.000000 31950.000000 31950.000000
mean 6.025039 9.815572 38.082985 149.195428 169.363234 5.797681 173.740267 1.434671 0.313175 160.650100 41.504645 15.268149 2273.463318
std 2.899940 4.666131 15.047520 91.118987 89.429703 2.777208 75.972676 0.941641 0.241369 92.110663 0.067866 4.663984 1853.460963
min 1.000000 -5.129072 -31.911774 -262.832715 -238.175233 -5.784175 -211.287347 -1.935043 -0.846737 -265.876376 40.519639 3.807594 -3847.904748
25% 3.000000 9.945383 22.365897 69.583797 104.914917 3.791703 130.165614 0.652885 0.150052 108.216858 41.499380 9.781429 318.648104
50% 6.000000 11.963889 42.508627 149.153555 168.878906 5.757093 182.832441 1.283217 0.271978 155.810387 41.500054 16.901380 2270.289218
75% 9.000000 12.712263 50.635050 217.592238 223.890564 7.644993 215.076361 2.120300 0.440932 224.127395 41.500564 19.201264 3692.330283
max 11.000000 19.905735 76.803888 639.734658 636.982189 18.290113 560.098908 5.125406 1.747020 601.952975 43.750813 26.274611 9429.949260

We see there's about 11 months of data, with 15 minutes between samples. All parameters necessary for our Ship Kernel model are present, including weather data. Weather data is typically not measured on board, but it can be obtained from external weather services and merged onto a sensor dataset using timestamp and geolocation. That is what we've done to create this dataset.

The dataset contains some obviously erroneous values, such as negative values for parameters that cannot be negative (sea depth, wave height, fuel consumption, ...). A voyage number is also given, making it possible to quantify the accuracy per voyage.

Note that if the voyage number is not available in the dataset, it can simply be set to the same value everywhere; the code below will still work.
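As a minimal sketch of that fallback (the tiny frame below is hypothetical, standing in for the sensor dataset):

```python
import pandas as pd

# hypothetical minimal frame standing in for the sensor DataFrame
df = pd.DataFrame({"stw": [12.1, 12.5, 11.9]})

# if no voyage numbers exist, assign the same id everywhere so that
# the per-voyage groupby later in this notebook keeps working
if "voyage" not in df.columns:
    df["voyage"] = 1
```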

df_events = pd.read_csv("data/ship_events.csv", parse_dates=["datetime"])
df_events
datetime type
0 2022-06-25 hull_cleaning
1 2023-05-08 dry_dock

The ship had a hull cleaning on June 25, 2022 and went into dry dock on May 8, 2023.

2. Choosing an evaluation period

Now that we have a dataset, we must choose an appropriate evaluation period. This is a period during which the ship's performance is assumed to be stable, meaning it should be free from long idling periods, cleaning events, dry-dockings and ESD installations. To get a robust estimate of how good the model is, it should also be as long as possible.

To determine the evaluation period we will create a small plot visualizing the events and the STW over time, colored per voyage. For now, this will give us an adequate indication of when the ship is idling. However, in real-world scenarios the STW sensor can fail, so we recommend combining multiple idling indicators such as the ship location, SOG and RPM.
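Combining multiple indicators could look like the sketch below. The `sog` and `rpm` columns and the thresholds are hypothetical; the toy dataset in this notebook only contains STW.

```python
import pandas as pd

# hypothetical sample rows; in practice these columns would come from sensors
df = pd.DataFrame({
    "stw": [0.1, 12.4, 0.0],
    "sog": [0.2, 12.1, 0.1],
    "rpm": [0.0, 65.0, 0.0],
})

# flag a sample as idling only when several independent indicators agree,
# so a single failing sensor cannot mislabel the data (thresholds illustrative)
idling = (df["stw"] < 1) & (df["sog"] < 1) & (df["rpm"] < 5)
```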

import matplotlib.pyplot as plt
from datetime import datetime

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(20, 3))
ax.scatter(df["datetime_end"], df["stw"], c=df["voyage"], s=0.2)
ax.set_ylabel("STW [kn]")
for event in df_events.itertuples():
    ax.axvline(event.datetime, color="red")
    ax.text(event.datetime, y=1.03, s=event.type, color="red", transform=ax.get_xaxis_transform(), horizontalalignment='center')
    
ax.axvspan("2023-04-02", "2023-04-20", color="green", alpha=0.2)
ax.text(datetime.strptime("2023-04-11", "%Y-%m-%d"), y=1.03, s="long\nidling period", transform=ax.get_xaxis_transform(), horizontalalignment='center', color='green')

ax.axvspan("2022-06-25", "2022-12-12", color="blue", alpha=0.1)
ax.text(datetime.strptime("2022-09-12", "%Y-%m-%d"), y=1.03, s="training period", transform=ax.get_xaxis_transform(), horizontalalignment='center', color='blue')

[Figure: STW over time, colored per voyage, with the hull cleaning, dry dock, long idling period and training period annotated]

On the above graph we see that the ship had a hull cleaning, a dry dock and a long idling period of about 20 days. We also see a period in blue, indicated as "training period".

The hull cleaning, dry dock and long idling period are events that could alter ship performance. Therefore, our evaluation period will have to lie between the hull cleaning and the idling period. However, not all data within that window can be used. The Toqua Ship Kernel is a data-driven model, meaning it learns (is "trained") from seen data to predict the outcome on new data. In this case, the period between 2022-06-24 and 2022-12-13 has already been used to train the model. As the model has "seen" this data, it would be unfair to evaluate it on that same data.

Therefore, only the data after the training period remains, ranging from 2022-12-14 to 2023-04-02 and spanning 3 voyages.

Ideally, aim for at least 2 months of data spanning 2 voyages.

period_filters = [
    "datetime_end > '2022-12-14T00:00+00:00'", # keep only data after the training period
    "datetime_end < '2023-04-02T00:00+00:00'"  # keep only data before the long idling period
]

# apply each filter in turn using the .query method
for f in period_filters:
    df = df.query(f)

3. Filtering outliers

Outliers are data points that are significantly different from the rest of the data. They can be caused by sensor errors, human errors or other external factors. We can filter outliers using advanced outlier detection algorithms, or by applying simple thresholds.

In this case, we will apply a set of very simple thresholds. These thresholds filter out data points that are obviously erroneous.
For example: directions must be between 0 and 360 degrees, fuel consumption cannot be below 0 and sea depth cannot be below 0.

From knowledge of the ship we also know that she can only sail between a draft of 9 and 20m and never uses fuels with an LCV value below 40 MJ/kg.

Be wary that these simple filters definitely do not capture all outliers.
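As a sketch of what a slightly more advanced approach could look like, a rolling-median filter can catch spikes that sit within the plausible value range and would therefore pass the simple thresholds. The window size and threshold below are illustrative, not tuned values.

```python
import pandas as pd

# toy consumption signal with one spike that simple range thresholds would miss
s = pd.Series([40.0, 41.0, 39.5, 75.0, 40.5, 41.2, 39.8])

# flag points that deviate strongly from a centered rolling median
rolling_median = s.rolling(window=5, center=True, min_periods=1).median()
is_outlier = (s - rolling_median).abs() > 10
```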

outlier_filters = [
    "stw > 0",
    "me_fo_consumption > 0",
    "fuel_specific_energy > 40",
    "9 < draft_avg < 20",
    "0 < ship_heading < 360",
    "0 < wind_direction < 360",
    "wind_speed > 0",
    "0 < wave_direction < 360",
    "wave_height > 0",
    "0 < current_direction < 360",
    "current_speed > 0",
    "sea_depth > 0",
]

for f in outlier_filters:
    df = df.query(f)

4. Filtering unaccounted for external effects

Next, we filter out samples that are affected by external factors which are not accounted for by the model. These are factors the model "has no idea about". In this case, our Ship Kernel takes into account all parameters of the dataset except sea depth, i.e. sea depth is not an input of the model. However, a shallow sea does result in a perceived decrease in performance due to additional drag. As this effect is not accounted for by the model, we will filter out samples with a sea depth below 3 times the draft.

If available, we also recommend filtering out extreme rudder angles and sea water temperatures below 5°C.

external_effect_filters = [
    "sea_depth > 3 * draft_avg"
]

for f in external_effect_filters:
    df = df.query(f)
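If rudder angle and sea water temperature were available, the same query pattern would extend naturally. The column names and thresholds below are hypothetical, chosen to mirror the recommendations above:

```python
import pandas as pd

# hypothetical columns; the toy dataset in this notebook does not contain them
df = pd.DataFrame({
    "rudder_angle": [2.0, 15.0, -1.5],
    "sea_water_temperature": [12.0, 14.0, 3.0],
})

extra_filters = [
    "-10 < rudder_angle < 10",    # exclude extreme rudder angles
    "sea_water_temperature > 5",  # exclude near-freezing water
]

for f in extra_filters:
    df = df.query(f)
```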

5. Constraining to sea-going conditions

Finally, we remove any conditions during which the vessel is not in normal operation conditions (e.g. during port stay, manoeuvring, etc.).
For this we will use the following filters:

  • Speed > 5 kn
  • Fuel Consumption > 5 mt/day

sea_condition_filters = [
    "stw > 5",
    "me_fo_consumption > 5",
]

for f in sea_condition_filters:
    df = df.query(f)

After all filters, the following dataset remains, meeting all the requirements for a correct benchmark dataset.

df.describe()
voyage stw me_fo_consumption ship_heading wind_direction wind_speed wave_direction wave_height current_speed current_direction fuel_specific_energy draft_avg sea_depth
count 5884.000000 5884.000000 5884.000000 5884.000000 5884.000000 5884.000000 5884.000000 5884.000000 5884.000000 5884.000000 5884.000000 5884.000000 5884.000000
mean 8.213460 11.872455 40.795409 123.062166 173.442223 6.053886 185.825918 1.658507 0.308868 167.165553 41.503330 16.600232 2652.149262
std 0.819602 1.163993 14.286145 62.510900 66.431414 2.628462 53.737770 0.961057 0.175661 77.421203 0.045546 4.478385 1776.279635
min 7.000000 5.038316 5.232447 0.024387 0.602252 0.001108 0.923097 0.000026 0.000378 0.040091 40.968636 9.028157 29.987202
25% 7.000000 10.978792 22.446645 69.331952 122.681096 4.147982 160.235449 0.698805 0.179125 109.978126 41.499322 9.501307 1101.395129
50% 8.000000 12.319606 48.437033 113.701521 173.516801 5.912339 192.175421 1.770468 0.291150 166.916770 41.500001 19.199770 2703.290472
75% 9.000000 12.690115 51.749307 185.721533 219.451190 7.818401 215.603917 2.393301 0.415947 224.693505 41.500725 19.201743 4122.457020
max 9.000000 16.950918 67.399952 358.343680 359.091475 15.143205 359.514260 4.706954 1.436451 359.438882 42.629039 20.000000 8401.586696

2. Calculate quantitative metrics

To measure how accurate the models are it is important to have meaningful, robust and quantifiable metrics. We will provide an implementation of the following metrics:

  • Mean Absolute Percentage Error (MAPE)
    $$\text{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left|\frac{\hat{y_i} - y_i}{y_i}\right|$$

  • Daily Percentage Error (DPE)
    $$\text{DPE} = \frac{100}{D} \sum_{d=1}^{D} \left| \frac{\sum_{i \in d} (\hat{y}_i - y_i)}{\sum_{j \in d} y_j} \right|$$

  • Voyage Error (VE)
    $$\text{VE} = \frac{100}{V} \sum_{v=1}^{V} \left| \frac{\sum_{i \in v} (\hat{y}_i - y_i)}{\sum_{j \in v} y_j} \right|$$

  • Bias Percentage Error (BPE)
    $$\text{BPE} = 100 \cdot \frac{\sum_{i=1}^{N} \hat{y}_i - \sum_{j=1}^{N} y_j}{\sum_{k=1}^{N} y_k}$$

A detailed explanation of these metrics, accompanied by toy examples, may be found in the whitepaper.

import numpy as np

def mape(true, pred):
    """Mean Absolute Percentage Error
    
    true: a list of measured values
    pred: a list of predicted values
    """
    return np.mean(np.abs((pred - true) / true)) * 100

def bpe(true, pred):
    """Bias Percentage Error
    
    true: a list of measured values
    pred: a list of predicted values
    """
    return np.sum(pred - true) / np.sum(true) * 100

def dpe(true, pred, days):
    """Daily Percentage Error
    
    true: a list of measured values
    pred: a list of predicted values
    days: a list of days, corresponding to the day on which the measurement was done
    """
    metrics_per_day = []
    for d in np.unique(days):
        mask = days == d
        metrics_per_day.append(np.abs(bpe(true[mask], pred[mask])))
    return np.mean(metrics_per_day)


def ve(true, pred, voyages):
    """Voyage Error
    
    true: a list of measured values
    pred: a list of predicted values
    voyages: a list of voyage ids, corresponding to the voyage during which the measurement was done
    """
    metrics_per_voyage = []
    for v in np.unique(voyages):
        mask = voyages == v
        metrics_per_voyage.append(np.abs(bpe(true[mask], pred[mask])))
    return np.mean(metrics_per_voyage)
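As a quick sanity check, the metric functions can be exercised on a tiny hand-made example. The definitions are repeated here so the snippet runs on its own, and the values are chosen for easy mental arithmetic:

```python
import numpy as np

# repeated from above so this snippet is self-contained
def mape(true, pred):
    return np.mean(np.abs((pred - true) / true)) * 100

def bpe(true, pred):
    return np.sum(pred - true) / np.sum(true) * 100

true = np.array([100.0, 100.0, 200.0, 200.0])
pred = np.array([125.0, 75.0, 250.0, 150.0])

# per-sample errors are +25%, -25%, +25% and -25%; MAPE averages their magnitudes
print(mape(true, pred))  # 25.0
# over the whole period the over- and underestimates cancel out exactly
print(bpe(true, pred))   # 0.0
```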

We can use above functions to calculate the metrics, given the ground truth values and the values predicted by the model. The ground truth values we obtain from the benchmark dataset established earlier.

All that is left is to obtain the predicted values from the Ship Kernel through the Toqua API.

import logging
from time import sleep
from typing import Dict
import pandas as pd
import requests

def make_api_call(method, url, payload=None):
    """Generic function to make an API call to the Toqua API."""
    headers = {
        "accept": "application/json",
        "content-type": "application/json",
        "X-API-Key": API_KEY,
    }
    
    if method == 'GET':
        response = requests.get(url, headers=headers)
    elif method == 'POST':
        response = requests.post(url, json=payload, headers=headers)
    else:
        print("Error: Invalid method")
        return None
    try:
        response.raise_for_status()
    except requests.HTTPError as e:
        logging.error(f"Error: {e}")
        logging.error(response.content)
        raise e
    output_json = response.json()
    del output_json["errors"]
    return output_json


def predict(imo_number, df):
    """Do a prediction on the Toqua Ship Kernel of a given vessel.
    
    This implementation is a bit more complicated than in previous use cases. Because the Toqua API
    accepts a maximum of 9000 rows at a time, we need to split the data into chunks of at most 9000 rows.
    """
    url = f"https://api.toqua.ai/ships/{imo_number}/models/latest/predict"
    
    # the Toqua API accepts at most 9000 rows per request
    chunk_size = 9000
    # split the input data into the required number of chunks (ceiling division)
    chunks = np.array_split(df, int(np.ceil(len(df) / chunk_size)))
    prediction_chunks = []
    # for each chunk, predict the values
    for chunk in chunks:
        model_input = { "data": chunk.drop(columns=["datetime_end", "me_fo_consumption"]).to_dict(orient="list") }
        pred = make_api_call('POST', url, model_input)
        prediction_chunks.append(pd.DataFrame(pred, index=chunk.index))
    # finally, concatenate the results again
    return pd.concat(prediction_chunks)
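The chunking arithmetic can be sketched in isolation: `np.array_split` distributes rows as evenly as possible over the requested number of chunks (the toy sizes below are illustrative):

```python
import numpy as np
import pandas as pd

# toy frame of 10 rows with a hypothetical chunk size of 4
df = pd.DataFrame({"x": range(10)})
chunk_size = 4

# ceiling division gives the number of requests needed: 10 rows -> 3 chunks
n_chunks = int(np.ceil(len(df) / chunk_size))
chunks = np.array_split(df, n_chunks)
print([len(c) for c in chunks])  # [4, 3, 3]
```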


def calculate_metrics(df):
    """Calculate all metrics from a single dataframe containing the true and predicted values.
    
    The column me_fo_consumption contains the true fuel consumption values, the column predicted_me_fo_consumption contains the predicted values.
    """
    return pd.Series([
        mape(df.me_fo_consumption, df.predicted_me_fo_consumption),
        dpe(df.me_fo_consumption, df.predicted_me_fo_consumption, df.datetime_end.dt.round("1d")),
        ve(df.me_fo_consumption, df.predicted_me_fo_consumption, df.voyage),
        bpe(df.me_fo_consumption, df.predicted_me_fo_consumption)
    ], index=["Mean Absolute Percentage Error [%]", "Daily Percentage Error [%]", "Voyage Error [%]", "Bias Percentage Error [%]"])

df_predictions = predict(IMO_NUMBER, df)

# add the predictions as new column in the old dataframe for easier comparison
df["predicted_me_fo_consumption"] = df_predictions["me_fo_consumption"]

# show the true values vs the raw predictions
df[["me_fo_consumption", "predicted_me_fo_consumption"]]
me_fo_consumption predicted_me_fo_consumption
16518 49.292736 47.298896
16519 46.580038 49.660815
16520 45.939625 44.305224
16522 58.059408 50.579893
16528 47.758690 40.819772
... ... ...
26946 51.041670 44.648453
26948 41.748385 38.688569
26949 50.465256 49.859098
26950 50.465256 50.770080
26951 50.465256 21.524687

5884 rows × 2 columns

Finally, we calculate the metrics and print them.

Note that if the voyage number is not available in the dataset, it can be set to the same value everywhere; the Voyage Error will then simply equal the absolute value of the BPE.

calculate_metrics(df)
Mean Absolute Percentage Error [%]    6.681542
Daily Percentage Error [%]            2.296083
Voyage Error [%]                      0.945030
Bias Percentage Error [%]            -0.401064
dtype: float64

We can see that our Ship Kernel has a MAPE of 6.68%, a DPE of 2.30%, a Voyage Error of 0.95% and a BPE of -0.40%.

As expected, the MAPE has the highest error, since it is calculated on the shortest time scale (every 15 minutes). The BPE is calculated on the longest time scale (the entire evaluation period) and thus has the lowest error, because positive and negative errors cancel each other out. It shows that, on average, the model tends to underestimate by 0.40%. The DPE and VE are calculated on intermediate time scales and fall somewhere in between.

The voyage error is an interesting metric to look at more closely. We can examine the errors per voyage separately to determine whether the model is more accurate on certain voyages than on others, which would indicate that the model is less performant in specific conditions.

# group the dataframe by voyage and apply the metrics calculation function to each group
df.groupby("voyage").apply(calculate_metrics)
Mean Absolute Percentage Error [%] Daily Percentage Error [%] Voyage Error [%] Bias Percentage Error [%]
voyage
7 6.702404 2.172954 1.595164 -1.595164
8 6.753125 2.209183 0.843010 -0.843010
9 6.627150 2.740728 0.396915 0.396915

In this per-voyage table, the Voyage Error and the Bias Percentage Error coincide (up to sign): the Voyage Error is simply the mean of the absolute values of the per-voyage BPEs, and each group here contains only a single voyage. We can see quite some differences between the voyages, demonstrating that the model is not infallible.
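The same groupby pattern extends to other time scales. As a sketch with hypothetical values, a per-sample error column could be aggregated per calendar week using `pd.Grouper`:

```python
import numpy as np
import pandas as pd

# toy frame with a hypothetical per-sample error column
df = pd.DataFrame({
    "datetime_end": pd.date_range("2023-01-01", periods=28, freq="D"),
    "error_pct": np.linspace(-2.0, 2.0, 28),
})

# aggregate the errors per calendar week instead of per voyage
weekly = df.groupby(pd.Grouper(key="datetime_end", freq="W"))["error_pct"].mean()
```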

All that remains now to execute an objective benchmarking study is to run this same methodology for other models and see how they stack up against each other.