Benchmarking Model Accuracy
How to run this?
Go here to instantly run all of the code below in your browser.
Use Case
This use case accompanies the document "How to compare different ship performance models?". We recommend reading through the document first.
This document provides a framework to compare different ship performance models fairly. To recap, these are the broad steps of the framework:
- First, we establish a benchmark dataset
This dataset will provide the ‘ground truth’ of measured values that the models should try to estimate. It should be free of errors and constrained to sea-going conditions.
- Next, we determine a set of quantitative metrics
To measure how accurate the models are, it is important to have meaningful, robust and quantifiable metrics. It is equally important that all metrics are calculated on the exact same dataset.
- Finally, we benchmark the results
We compare our obtained quantitative metrics to determine the most accurate model.
This use case will demonstrate how steps 1 and 2 of the framework can be applied to a Toqua Ship Kernel. Step 3 requires the presence of other models to compare the results against.
Some familiarity with the Toqua API and the Python frameworks used is required to understand this notebook. If this is not the case, we recommend first going through the "Basics" and "Speed-Fuel curve" use cases.
Setup
Fill in the IMO number of your ship below.
IMO_NUMBER = "9999999"
1. Establish a benchmark dataset
As a first step, we establish a dataset on which the models can be benchmarked. This dataset will provide the ‘ground truth’ of measured values that the models should try to estimate. It should be free of errors and constrained to sea-going conditions. It is important that this exact same dataset is used for all models.
Creating such a benchmarking dataset is done by following this five-step process:
- Choosing a data source (sensor data or noon reports)
- Choosing an evaluation period
- Filtering outliers
- Filtering for unaccounted external effects
- Constraining to sea-going conditions
We will go through each of these steps in turn.
1. Choosing a data source
The first step is to determine which data source to use. Usually this comes down to a choice between noon reports and sensor data. Sensor data is preferred, as it has a significantly higher frequency than noon reports and is the most accurate, provided that data quality is properly monitored.
For illustration purposes, we've generated two toy datasets: one sensor dataset containing all parameters required by our Ship Kernel models and one dataset containing cleaning events.
Let's examine what these datasets look like.
import pandas as pd
df = pd.read_csv("data/ship_data.csv", parse_dates=["datetime_end"])
df
datetime_end | voyage | stw | me_fo_consumption | ship_heading | wind_direction | wind_speed | wave_direction | wave_height | current_speed | current_direction | fuel_specific_energy | draft_avg | sea_depth | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-06-24 22:45:00+00:00 | 1 | 12.736735 | 51.056979 | 116.834744 | 256.825078 | 4.652994 | 167.403422 | 1.808797 | 0.174668 | 1.132839 | 41.500563 | 19.199545 | 1754.005338 |
1 | 2022-06-24 23:00:00+00:00 | 1 | 11.912045 | 46.441744 | 97.254636 | 170.989375 | 4.937206 | 161.889526 | 2.046644 | 0.170478 | 104.104376 | 41.500402 | 19.199505 | 848.125564 |
2 | 2022-06-24 23:15:00+00:00 | 1 | 13.043936 | 48.837654 | 192.273266 | -23.061444 | 4.768308 | 250.303015 | 0.390783 | 0.599106 | 218.323053 | 41.500723 | 19.199856 | -1114.681358 |
3 | 2022-06-24 23:30:00+00:00 | 1 | 12.340733 | 46.220231 | 158.798366 | 210.453751 | 3.023702 | 230.254281 | 1.296289 | -0.142017 | 51.892648 | 41.498476 | 19.200029 | 1357.116871 |
4 | 2022-06-24 23:45:00+00:00 | 1 | 12.867363 | 44.134012 | 104.596701 | 175.070960 | 5.494352 | 188.850350 | 0.912953 | 0.526234 | 244.260118 | 41.499404 | 19.198857 | 693.841692 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
31945 | 2023-05-23 17:00:00+00:00 | 11 | 12.106177 | 29.341714 | 247.211329 | 120.444830 | 7.266383 | 193.127350 | 1.548775 | 0.241010 | 144.109422 | 41.499693 | 9.500587 | 4421.447117 |
31946 | 2023-05-23 17:15:00+00:00 | 11 | 13.274439 | 36.390085 | 342.079971 | 109.189566 | 7.885734 | 242.281907 | 1.818478 | 0.436219 | 89.660876 | 41.500398 | 9.498560 | 4176.561835 |
31947 | 2023-05-23 17:30:00+00:00 | 11 | 13.459992 | 38.990243 | 256.667607 | 103.239181 | 6.589824 | 212.017387 | 1.508247 | 0.093702 | 180.322267 | 41.499994 | 9.499034 | 3604.806021 |
31948 | 2023-05-23 17:45:00+00:00 | 11 | 12.720441 | 34.616783 | 269.605791 | 135.022811 | 7.284068 | 175.945316 | 1.956245 | 0.095444 | 95.676050 | 41.498040 | 9.499939 | 3918.618366 |
31949 | 2023-05-23 18:00:00+00:00 | 11 | 13.161837 | 37.028321 | 306.750306 | 87.565056 | 6.327873 | 190.728030 | 1.919243 | 0.231607 | 150.171653 | 41.499354 | 9.499044 | 5235.721062 |
31950 rows × 14 columns
df.describe()
voyage | stw | me_fo_consumption | ship_heading | wind_direction | wind_speed | wave_direction | wave_height | current_speed | current_direction | fuel_specific_energy | draft_avg | sea_depth | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 31950.000000 | 31950.000000 | 31950.000000 | 31950.000000 | 31950.000000 | 31950.000000 | 31950.000000 | 31950.000000 | 31950.000000 | 31950.000000 | 31950.000000 | 31950.000000 | 31950.000000 |
mean | 6.025039 | 9.815572 | 38.082985 | 149.195428 | 169.363234 | 5.797681 | 173.740267 | 1.434671 | 0.313175 | 160.650100 | 41.504645 | 15.268149 | 2273.463318 |
std | 2.899940 | 4.666131 | 15.047520 | 91.118987 | 89.429703 | 2.777208 | 75.972676 | 0.941641 | 0.241369 | 92.110663 | 0.067866 | 4.663984 | 1853.460963 |
min | 1.000000 | -5.129072 | -31.911774 | -262.832715 | -238.175233 | -5.784175 | -211.287347 | -1.935043 | -0.846737 | -265.876376 | 40.519639 | 3.807594 | -3847.904748 |
25% | 3.000000 | 9.945383 | 22.365897 | 69.583797 | 104.914917 | 3.791703 | 130.165614 | 0.652885 | 0.150052 | 108.216858 | 41.499380 | 9.781429 | 318.648104 |
50% | 6.000000 | 11.963889 | 42.508627 | 149.153555 | 168.878906 | 5.757093 | 182.832441 | 1.283217 | 0.271978 | 155.810387 | 41.500054 | 16.901380 | 2270.289218 |
75% | 9.000000 | 12.712263 | 50.635050 | 217.592238 | 223.890564 | 7.644993 | 215.076361 | 2.120300 | 0.440932 | 224.127395 | 41.500564 | 19.201264 | 3692.330283 |
max | 11.000000 | 19.905735 | 76.803888 | 639.734658 | 636.982189 | 18.290113 | 560.098908 | 5.125406 | 1.747020 | 601.952975 | 43.750813 | 26.274611 | 9429.949260 |
We see there's roughly 11 months of data, with 15 minutes between samples. All parameters necessary for our Ship Kernel model are present, including weather data. Weather data is typically not measured on board, but it can be obtained from external weather services and merged onto a sensor dataset using timestamp and geolocation. That is what we've done to create this dataset.
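As a sketch of that merge step, `pandas.merge_asof` can attach the nearest weather observation to each sensor sample. The frames, column names and tolerance below are illustrative, and the geolocation matching is omitted for brevity:

```python
import pandas as pd

# Hypothetical sensor samples (15-minute frequency)
sensor = pd.DataFrame({
    "datetime_end": pd.date_range("2022-06-24 22:45", periods=4, freq="15min", tz="UTC"),
    "stw": [12.7, 11.9, 13.0, 12.3],
})

# Hypothetical weather service data (hourly frequency)
weather = pd.DataFrame({
    "datetime": pd.date_range("2022-06-24 22:00", periods=3, freq="1h", tz="UTC"),
    "wind_speed": [4.5, 5.1, 4.9],
})

# Join each sensor sample to the nearest weather observation within one hour
merged = pd.merge_asof(
    sensor.sort_values("datetime_end"),
    weather.sort_values("datetime"),
    left_on="datetime_end",
    right_on="datetime",
    direction="nearest",
    tolerance=pd.Timedelta("1h"),
)
print(merged[["datetime_end", "stw", "wind_speed"]])
```

A real pipeline would additionally restrict the match to the weather grid cell containing the ship's reported position.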
The dataset contains some obviously erroneous values, such as negative values for parameters that clearly cannot be negative (sea depth, wave height, fuel consumption, ...). A voyage number is also included, making it possible to quantify the accuracy per voyage.
Note that if a voyage number is not available in the dataset, it can simply be set to the same value everywhere; the code below will still work.
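A minimal sketch of that fallback, on a hypothetical frame without voyage information:

```python
import pandas as pd

# Hypothetical sensor frame that lacks voyage numbers
df_no_voyage = pd.DataFrame({"stw": [12.1, 12.5, 11.8]})

# Assign the same voyage id everywhere; any groupby("voyage") then yields one group
df_no_voyage["voyage"] = 1
print(df_no_voyage["voyage"].nunique())  # 1
```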
df_events = pd.read_csv("data/ship_events.csv", parse_dates=["datetime"])
df_events
datetime | type | |
---|---|---|
0 | 2022-06-25 | hull_cleaning |
1 | 2023-05-08 | dry_dock |
The ship had a hull cleaning on June 25, 2022 and went into dry dock on May 8, 2023.
2. Choosing an evaluation period
Now that we have a dataset, we must choose an appropriate evaluation period. This is a period during which the ship's performance is assumed to be stable, meaning it should be free from long idling periods, cleaning events, dry-dockings and ESD installations. To get a robust estimate of how good the model is, the period should also be as long as possible.
To determine the evaluation period we will create a small plot visualizing the events and the STW over time, colored per voyage. For now, this gives an adequate indication of when the ship is idling. However, in real-world scenarios the speed log can fail, so we recommend combining multiple idling indicators, such as the ship's location, SOG and RPM.
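Such a multi-indicator idling check could look as follows. The `sog` and `rpm` columns are hypothetical, as the toy dataset above does not contain them, and the thresholds are illustrative:

```python
import pandas as pd

# Hypothetical sample with STW, SOG and RPM readings
sample = pd.DataFrame({
    "stw": [0.2, 12.5, 0.1, 11.9],
    "sog": [0.1, 12.3, 0.3, 12.0],
    "rpm": [0.0, 65.0, 2.0, 63.0],
})

# A sample counts as idling only when several independent indicators agree,
# which protects against a single failing sensor
idling = (sample["stw"] < 1) & (sample["sog"] < 1) & (sample["rpm"] < 5)
print(idling.tolist())  # [True, False, True, False]
```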
import matplotlib.pyplot as plt
from datetime import datetime
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(20, 3))
ax.scatter(df["datetime_end"], df["stw"], c=df["voyage"], s=0.2)
ax.set_ylabel("STW [kn]")
for event in df_events.itertuples():
    ax.axvline(event.datetime, color="red")
    ax.text(event.datetime, y=1.03, s=event.type, color="red", transform=ax.get_xaxis_transform(), horizontalalignment='center')
ax.axvspan("2023-04-02", "2023-04-20", color="green", alpha=0.2)
ax.text(datetime.strptime("2023-04-11", "%Y-%m-%d"), y=1.03, s="long\nidling period", transform=ax.get_xaxis_transform(), horizontalalignment='center', color='green')
ax.axvspan("2022-06-25", "2022-12-12", color="blue", alpha=0.1)
ax.text(datetime.strptime("2022-09-12", "%Y-%m-%d"), y=1.03, s="training period", transform=ax.get_xaxis_transform(), horizontalalignment='center', color='blue')
On the above graph we see that the ship had a hull cleaning, a dry dock and a long idling period of about 20 days. We also see a period in blue, indicated as "training period".
The hull cleaning, dry dock and long idling period are events that could alter ship performance. Therefore, our choice of evaluation period will have to lie within the period between the hull cleaning and the idling period. However, not all data within that period can be used. The Toqua Ship Kernel is a data-driven model, which means it learns (is "trained") from seen data to predict the outcome on new data. In this case, the period between 2022-06-24 and 2022-12-13 has already been used to train the model. As the model has "seen" this data, it would be unfair to evaluate it on this same data.
Therefore, only the data after the training period remains, ranging from 2022-12-14 to 2023-04-02 and spanning 3 voyages.
Ideally, aim for at least 2 months of data spanning 2 voyages.
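A quick sanity check of a candidate period against this guideline might look like this (synthetic timestamps and voyage ids; the thresholds follow the guideline above):

```python
import pandas as pd

# Synthetic evaluation slice: 15-minute samples split over three voyages
eval_df = pd.DataFrame({
    "datetime_end": pd.date_range("2022-12-14", "2023-04-01", freq="15min", tz="UTC"),
})
eval_df["voyage"] = pd.cut(eval_df.index, bins=3, labels=[7, 8, 9])

period_days = (eval_df["datetime_end"].max() - eval_df["datetime_end"].min()).days
n_voyages = eval_df["voyage"].nunique()

# Guideline: at least ~2 months of data spanning at least 2 voyages
assert period_days >= 60 and n_voyages >= 2
print(period_days, n_voyages)
```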
period_filters = [
    "datetime_end > '2022-12-14T00:00+00:00'",  # keep only data after the training period
    "datetime_end < '2023-04-02T00:00+00:00'"   # keep only data before the long idling period
]
# apply each filter in turn using the .query method
for f in period_filters:
    df = df.query(f)
3. Filtering outliers
Outliers are data points that differ significantly from the rest of the data. They can be caused by sensor errors, human errors, or other external factors. We can filter outliers using advanced outlier detection algorithms, or by applying simple thresholds.
In this case, we will apply a set of very simple thresholds. These thresholds filter out data points that are obviously erroneous.
For example: directions must be between 0 and 360 degrees, fuel consumption cannot be below 0 and sea depth cannot be below 0.
From knowledge of the ship we also know that she can only sail between a draft of 9 and 20m and never uses fuels with an LCV value below 40 MJ/kg.
Be wary that these simple filters definitely do not capture all outliers.
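As one example of a slightly more statistical filter, a z-score cutoff can catch values that pass the hard thresholds but are still implausible. This is only a sketch on synthetic data; the 3-sigma cutoff is a common but arbitrary choice:

```python
import numpy as np
import pandas as pd

# Synthetic STW series with one injected outlier
rng = np.random.default_rng(0)
stw = pd.Series(rng.normal(12, 0.5, 500))
stw.iloc[100] = 25.0  # obviously erroneous reading

# Keep only samples within 3 standard deviations of the mean
z = (stw - stw.mean()) / stw.std()
filtered = stw[z.abs() < 3]
print(len(stw) - len(filtered))  # number of samples removed
```

In practice a rolling window works better than global statistics, since the "normal" speed level shifts between voyages.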
outlier_filters = [
    "stw > 0",
    "me_fo_consumption > 0",
    "fuel_specific_energy > 40",
    "9 < draft_avg < 20",
    "0 < ship_heading < 360",
    "0 < wind_direction < 360",
    "wind_speed > 0",
    "0 < wave_direction < 360",
    "wave_height > 0",
    "0 < current_direction < 360",
    "current_speed > 0",
    "sea_depth > 0",
]
for f in outlier_filters:
    df = df.query(f)
4. Filtering for unaccounted external effects
Next, we filter out samples that are affected by external factors which are not accounted for by the model. These are factors the model "has no idea about". In this case, our Ship Kernel takes into account all parameters of the dataset except sea depth, i.e. sea depth is not an input of the model. However, a shallow sea does result in a perceived decrease in performance due to additional drag. As this effect is not accounted for by the model, we will filter out samples with a sea depth below 3 times the draft.
If available, we recommend also filtering out extreme rudder angles and temperatures below 5°C.
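If such signals were present, the same `query` pattern would apply. The `rudder_angle` and `sea_water_temp` columns below are hypothetical, as they are not part of the dataset used in this notebook:

```python
import pandas as pd

# Toy frame with hypothetical rudder angle [deg] and water temperature [°C] columns
toy = pd.DataFrame({
    "rudder_angle": [1.2, 12.0, -0.5, 3.1],
    "sea_water_temp": [18.0, 17.5, 3.0, 21.0],
})

extra_filters = [
    "-5 < rudder_angle < 5",  # exclude extreme rudder angles
    "sea_water_temp > 5",     # exclude near-freezing water
]
for f in extra_filters:
    toy = toy.query(f)
print(len(toy))  # rows surviving both filters
```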
external_effect_filters = [
    "sea_depth > 3 * draft_avg"
]
for f in external_effect_filters:
    df = df.query(f)
5. Constraining to sea-going conditions
Finally, we remove any samples during which the vessel is not in normal operating conditions (e.g. during port stays, manoeuvring, etc.).
For this we will use the following filters:
- Speed > 5 kn
- Fuel Consumption > 5 mt/day
sea_condition_filters = [
    "stw > 5",
    "me_fo_consumption > 5",
]
for f in sea_condition_filters:
    df = df.query(f)
The following dataset remains, meeting all the requirements of a correct benchmark dataset.
df.describe()
voyage | stw | me_fo_consumption | ship_heading | wind_direction | wind_speed | wave_direction | wave_height | current_speed | current_direction | fuel_specific_energy | draft_avg | sea_depth | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 5884.000000 | 5884.000000 | 5884.000000 | 5884.000000 | 5884.000000 | 5884.000000 | 5884.000000 | 5884.000000 | 5884.000000 | 5884.000000 | 5884.000000 | 5884.000000 | 5884.000000 |
mean | 8.213460 | 11.872455 | 40.795409 | 123.062166 | 173.442223 | 6.053886 | 185.825918 | 1.658507 | 0.308868 | 167.165553 | 41.503330 | 16.600232 | 2652.149262 |
std | 0.819602 | 1.163993 | 14.286145 | 62.510900 | 66.431414 | 2.628462 | 53.737770 | 0.961057 | 0.175661 | 77.421203 | 0.045546 | 4.478385 | 1776.279635 |
min | 7.000000 | 5.038316 | 5.232447 | 0.024387 | 0.602252 | 0.001108 | 0.923097 | 0.000026 | 0.000378 | 0.040091 | 40.968636 | 9.028157 | 29.987202 |
25% | 7.000000 | 10.978792 | 22.446645 | 69.331952 | 122.681096 | 4.147982 | 160.235449 | 0.698805 | 0.179125 | 109.978126 | 41.499322 | 9.501307 | 1101.395129 |
50% | 8.000000 | 12.319606 | 48.437033 | 113.701521 | 173.516801 | 5.912339 | 192.175421 | 1.770468 | 0.291150 | 166.916770 | 41.500001 | 19.199770 | 2703.290472 |
75% | 9.000000 | 12.690115 | 51.749307 | 185.721533 | 219.451190 | 7.818401 | 215.603917 | 2.393301 | 0.415947 | 224.693505 | 41.500725 | 19.201743 | 4122.457020 |
max | 9.000000 | 16.950918 | 67.399952 | 358.343680 | 359.091475 | 15.143205 | 359.514260 | 4.706954 | 1.436451 | 359.438882 | 42.629039 | 20.000000 | 8401.586696 |
2. Calculate quantitative metrics
To measure how accurate the models are it is important to have meaningful, robust and quantifiable metrics. We will provide an implementation of the following metrics:
- Mean Absolute Percentage Error (MAPE)
$$\text{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left|\frac{\hat{y}_i - y_i}{y_i}\right|$$
- Daily Percentage Error (DPE)
$$\text{DPE} = \frac{100}{D} \sum_{d=1}^{D} \left| \frac{\sum_{i \in d} (\hat{y}_i - y_i)}{\sum_{j \in d} y_j} \right|$$
- Voyage Error (VE)
$$\text{VE} = \frac{100}{V} \sum_{v=1}^{V} \left| \frac{\sum_{i \in v} (\hat{y}_i - y_i)}{\sum_{j \in v} y_j} \right|$$
- Bias Percentage Error (BPE)
$$\text{BPE} = 100 \cdot \frac{\sum_{i=1}^{N} (\hat{y}_i - y_i)}{\sum_{i=1}^{N} y_i}$$
A detailed explanation of these metrics, accompanied by toy examples, may be found in the whitepaper.
import numpy as np

def mape(true, pred):
    """Mean Absolute Percentage Error

    true: a list of measured values
    pred: a list of predicted values
    """
    return np.mean(np.abs((pred - true) / true)) * 100

def bpe(true, pred):
    """Bias Percentage Error

    true: a list of measured values
    pred: a list of predicted values
    """
    return np.sum(pred - true) / np.sum(true) * 100

def dpe(true, pred, days):
    """Daily Percentage Error

    true: a list of measured values
    pred: a list of predicted values
    days: a list of days, corresponding to the day on which the measurement was done
    """
    metrics_per_day = []
    for d in np.unique(days):
        mask = days == d
        metrics_per_day.append(np.abs(bpe(true[mask], pred[mask])))
    return np.mean(metrics_per_day)

def ve(true, pred, voyages):
    """Voyage Error

    true: a list of measured values
    pred: a list of predicted values
    voyages: a list of voyage ids, corresponding to the voyage during which the measurement was done
    """
    metrics_per_voyage = []
    for v in np.unique(voyages):
        mask = voyages == v
        metrics_per_voyage.append(np.abs(bpe(true[mask], pred[mask])))
    return np.mean(metrics_per_voyage)
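A toy example makes the difference between MAPE and BPE concrete: when one sample is overestimated and another underestimated by the same amount, the BPE cancels out while the MAPE does not (the two functions are repeated here so the snippet is self-contained):

```python
import numpy as np

def mape(true, pred):
    # Mean Absolute Percentage Error
    return np.mean(np.abs((pred - true) / true)) * 100

def bpe(true, pred):
    # Bias Percentage Error
    return np.sum(pred - true) / np.sum(true) * 100

true = np.array([10.0, 10.0])
pred = np.array([11.0, 9.0])  # one over-, one under-estimate

print(mape(true, pred))  # ~10.0: every sample is 10% off
print(bpe(true, pred))   # 0.0: over- and under-estimation cancel out
```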
We can use the above functions to calculate the metrics, given the ground truth values and the values predicted by the model. The ground truth values come from the benchmark dataset established earlier.
All that is left is to obtain the predicted values from the Ship Kernel through the Toqua API.
import logging
from time import sleep
from typing import Dict

import pandas as pd
import requests

def make_api_call(method, url, payload=None):
    """Generic function to make an API call to the Toqua API."""
    headers = {
        "accept": "application/json",
        "content-type": "application/json",
        "X-API-Key": API_KEY,
    }
    if method == 'GET':
        response = requests.get(url, headers=headers)
    elif method == 'POST':
        response = requests.post(url, json=payload, headers=headers)
    else:
        print("Error: Invalid method")
        return None
    try:
        response.raise_for_status()
    except requests.HTTPError as e:
        logging.error(f"Error: {e}")
        logging.error(response.content)
        raise e
    output_json = response.json()
    output_json.pop("errors", None)  # drop the errors field if present
    return output_json
def predict(imo_number, df):
    """Do a prediction on the Toqua Ship Kernel of a given vessel.

    This implementation is a bit more involved than in previous use cases. Because the Toqua API
    accepts a maximum of 9000 rows at a time, we need to split the data into chunks of at most 9000 rows.
    """
    url = f"https://api.toqua.ai/ships/{imo_number}/models/latest/predict"
    # Toqua API limit
    chunk_size = 9000
    # split the input data into chunks of at most 9000 rows (np.array_split needs an integer)
    n_chunks = int(np.ceil(len(df) / chunk_size))
    chunks = np.array_split(df, n_chunks)
    prediction_chunks = []
    # for each chunk, predict the values
    for chunk in chunks:
        model_input = {"data": chunk.drop(columns=["datetime_end", "me_fo_consumption"]).to_dict(orient="list")}
        pred = make_api_call('POST', url, model_input)
        prediction_chunks.append(pd.DataFrame(pred, index=chunk.index))
    # finally, concatenate the results again
    return pd.concat(prediction_chunks)
def calculate_metrics(df):
    """Calculate all metrics from a single dataframe containing the true and predicted values.

    The column me_fo_consumption contains the true fuel consumption values; the column
    predicted_me_fo_consumption contains the predicted values.
    """
    return pd.Series([
        mape(df.me_fo_consumption, df.predicted_me_fo_consumption),
        dpe(df.me_fo_consumption, df.predicted_me_fo_consumption, df.datetime_end.dt.round("1d")),
        ve(df.me_fo_consumption, df.predicted_me_fo_consumption, df.voyage),
        bpe(df.me_fo_consumption, df.predicted_me_fo_consumption)
    ], index=["Mean Absolute Percentage Error [%]", "Daily Percentage Error [%]", "Voyage Error [%]", "Bias Percentage Error [%]"])
df_predictions = predict(IMO_NUMBER, df)
# add the predictions as new column in the old dataframe for easier comparison
df["predicted_me_fo_consumption"] = df_predictions["me_fo_consumption"]
# show the true values vs the raw predictions
df[["me_fo_consumption", "predicted_me_fo_consumption"]]
me_fo_consumption | predicted_me_fo_consumption | |
---|---|---|
16518 | 49.292736 | 47.298896 |
16519 | 46.580038 | 49.660815 |
16520 | 45.939625 | 44.305224 |
16522 | 58.059408 | 50.579893 |
16528 | 47.758690 | 40.819772 |
... | ... | ... |
26946 | 51.041670 | 44.648453 |
26948 | 41.748385 | 38.688569 |
26949 | 50.465256 | 49.859098 |
26950 | 50.465256 | 50.770080 |
26951 | 50.465256 | 21.524687 |
5884 rows × 2 columns
Finally, we calculate the metrics and print them.
Note that if the voyage number is not available in the dataset, it can be set to the same value everywhere; in that case the Voyage Error will simply equal the absolute value of the BPE.
calculate_metrics(df)
Mean Absolute Percentage Error [%] 6.681542
Daily Percentage Error [%] 2.296083
Voyage Error [%] 0.945030
Bias Percentage Error [%] -0.401064
dtype: float64
We can see that our Ship Kernel has a MAPE of 6.68%, a DPE of 2.30%, a Voyage Error of 0.95% and a BPE of -0.40%.
As expected, the MAPE is calculated on the shortest time scale (every 15 minutes) and therefore has the highest error. The BPE is calculated on the longest time scale (the entire evaluation period) and thus has the lowest error, because over- and under-estimations cancel each other out. It shows that on average, the model tends to underestimate by 0.40%. The DPE and VE are calculated on intermediate time scales and fall somewhere in between.
The Voyage Error is an interesting metric to look at more closely. We can examine the errors per voyage separately to determine whether the model is more accurate on certain voyages than on others, which would indicate that the model is less performant in specific conditions.
# group the dataframe by voyage and apply the metrics calculation function to each group
df.groupby("voyage").apply(calculate_metrics)
Mean Absolute Percentage Error [%] | Daily Percentage Error [%] | Voyage Error [%] | Bias Percentage Error [%] | |
---|---|---|---|---|
voyage | ||||
7 | 6.702404 | 2.172954 | 1.595164 | -1.595164 |
8 | 6.753125 | 2.209183 | 0.843010 | -0.843010 |
9 | 6.627150 | 2.740728 | 0.396915 | 0.396915 |
Within each group, the Voyage Error and Bias Percentage Error coincide (up to sign): the Voyage Error is the mean of the absolute per-voyage BPE values, and each group here contains only a single voyage. We also see clear differences between the voyages, demonstrating that the model is not equally accurate in all conditions.
All that remains now to execute an objective benchmarking study is to run this same methodology for other models and see how they stack up against each other.
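As a sketch of what that final comparison could look like, the metrics of several models can be tabulated and ranked side by side. The model names and all numbers besides the Ship Kernel's are purely illustrative:

```python
import pandas as pd

# Illustrative metric values for three models; only the Ship Kernel's
# column reflects the numbers computed above
results = pd.DataFrame({
    "Toqua Ship Kernel": [6.68, 2.30, 0.95, -0.40],
    "Baseline model A": [9.10, 4.20, 2.10, 1.80],
    "Baseline model B": [8.30, 3.70, 1.60, -1.20],
}, index=["MAPE [%]", "DPE [%]", "VE [%]", "BPE [%]"])

# Rank models per metric by absolute value (lower is better)
ranking = results.abs().rank(axis=1)
print(ranking)
```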