This benchmark is not meant for iterative testing sessions or parameter tweaking. Each algorithm should only be submitted once its development is finished (or it is in a state that you want to reference in a publication). We have limited the total number of submissions to three. You cannot edit or update your submission after it has been accepted. Do not waste submissions on tweaking your parameters or training data.
To be listed on the public leaderboard, please follow these steps:
1. Get the data. Create an account to get access to the download links. The download packages contain additional technical submission details.
2. Compute your results. Identical parameter settings must be used for all frames.
3. Upload and submit. Log in, upload your results, add a brief description of your method, and submit for evaluation. To support double-blind review processes, author details may initially remain anonymous and be updated later upon request.
4. Check your results. Your submission is evaluated automatically; you will be notified via email as soon as we approve the evaluation results.
Frequently Asked Questions
wd141-wd155 are weird/broken!
Yes indeed! These are negative test cases: cases where we expect the algorithm to fail. These out-of-scope cases are only evaluated in the 'negative' category of the benchmark and have no influence on the other scores.
How do you evaluate negative test cases?
Any pixel with a void label is considered correct, as is any pixel matching an additional best-case GT. For example, for the upside-down image, either void labels or an upside-down version of the regular GT count as correct. Likewise, the correct result for the black-and-white image can either be void or the respective labels from a colored image showing the same scene. This is evaluated per pixel, so mixtures of void and valid labels are scored fairly. In future versions of the benchmark (after the CVPR ROB 2018 challenge), only 'unlabeled' (id 0) will count as a correct void label (in addition to the best-case GT).
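The per-pixel rule above (a pixel counts as correct if it is void or matches the best-case GT) can be sketched as follows. This is an illustrative sketch, not the official evaluation script; the set of void ids shown here is an assumption:

```python
import numpy as np

# Hypothetical set of void label ids; WildDash v1 accepts any void label,
# not only 'unlabeled' (id 0).
VOID_IDS = (0, 1, 2, 3, 4, 5, 6)

def negative_case_correct(pred, best_case_gt, void_ids=VOID_IDS):
    """Per-pixel correctness for a negative test case: a pixel counts as
    correct if it carries a void label OR matches the best-case GT."""
    pred = np.asarray(pred)
    is_void = np.isin(pred, void_ids)          # void pixels are always fine
    matches_gt = pred == np.asarray(best_case_gt)  # or match the best-case GT
    return is_void | matches_gt
```

For example, a prediction `[[0, 7], [26, 7]]` against a best-case GT of all `7` marks only the `26` pixel (valid but wrong) as incorrect.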
For instance segmentation, mixing different silhouettes would skew results, so here we only exchange an empty *.txt file for the respective best-case GT.
The classic cityscapes trainIds 0-18 do not contain any void labels. How do I fix this?
We do not encourage any solution that lacks a negative/void class. One quick-and-dirty hack to introduce void pixels is a post-processing step that maps all pixels with an argmax probability below some threshold to id 0.
Whatever steps and mechanisms you choose, you have to handle all frames of the dataset in the same way. Applying post-processing to negative test frames only is considered cheating! In the end, you have to balance producing good-quality output against failing gracefully in out-of-scope situations.
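The thresholding hack mentioned above could look like the sketch below. The threshold value is an assumption you must tune yourself, and if your network outputs trainIds you should convert them to full cityscapes label ids first, so that id 0 is actually reserved for 'unlabeled':

```python
import numpy as np

def map_uncertain_to_void(probs, threshold=0.5):
    """Quick-and-dirty post-processing sketch: pixels whose top class
    probability falls below `threshold` are mapped to id 0 ('unlabeled').
    `probs` is an (H, W, C) array of per-class probabilities whose channel
    index corresponds to the full cityscapes label id."""
    labels = probs.argmax(axis=-1)       # most likely class per pixel
    confidence = probs.max(axis=-1)      # its probability
    labels[confidence < threshold] = 0   # id 0 == 'unlabeled'
    return labels
```

Remember that this step must be applied identically to every frame, negative test cases included.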
The negative test cases are unfair!
WildDash focuses on algorithm robustness rather than benchmarking best-case performance, and we apply the same metric to all submissions. In the CVPR challenge, negative test cases only affect a single ranking out of more than 30 that are used to calculate the aggregated rank, so on average they have less than a 5% impact on your algorithm's total rank.
wd150 shows an indoor scene and ROB 2018's ScanNet has labels for indoors. Should I do something special here?
No, WildDash evaluates labels compatible with the cityscapes label policy. For our dataset, the indoor scene is out-of-scope (see above for how negative tests are evaluated). The ROB 2018 dev kit will automatically map the different label ids for each dataset and map all ScanNet labels to 'unlabeled' (id 0) before submission to WildDash. Please see the ROB website for ROB-specific questions.
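The kind of id remapping such a dev kit performs can be sketched with a simple lookup table. This is a hypothetical illustration, not the actual dev kit code; the assumed ScanNet id range is a placeholder:

```python
import numpy as np

# Hypothetical mapping: every ScanNet id (assumed here to lie in 0..40)
# maps to WildDash/cityscapes 'unlabeled' (id 0).
SCANNET_TO_WILDDASH = {scannet_id: 0 for scannet_id in range(41)}

def remap_labels(label_img, mapping):
    """Remap a label image through a lookup table built from `mapping`."""
    lut = np.zeros(max(mapping) + 1, dtype=np.uint8)
    for src, dst in mapping.items():
        lut[src] = dst
    return lut[label_img]
```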
How do you calculate your metrics? Which label ids are relevant?
Version one of WildDash (wd_val_01, wd_bench_01, wd_both_01) is evaluated with the cityscapes evaluation scripts.
The same label policy, metrics, and weighting are used. See the official cityscapes website for details.
Thus, only 19 labels having the cityscapes trainIds (ignoreInEval == False) are evaluated.
In addition, see above about the handling of void labels for negative test cases.
All evaluation categories except 'negative' ignore regions with void labels in the GT but count void labels in algorithm results as bad pixels.
In the future (after ROB 2018 has finished), v2 will evaluate all labels that are also present in the training data.
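The scoring rule described above (GT void regions are ignored; predicted void pixels elsewhere count against you) can be sketched with a simplified per-class IoU. This is not the official cityscapes script; `num_classes=19` and `ignore_id=255` follow the usual trainId convention but are stated here as assumptions:

```python
import numpy as np

def iou_per_class(pred, gt, num_classes=19, ignore_id=255):
    """Simplified sketch of cityscapes-style per-class IoU: GT pixels with
    the ignore id (void regions) are excluded entirely; everywhere else a
    void/invalid prediction simply fails to match the GT class and so
    counts against the score."""
    valid = gt != ignore_id
    ious = []
    for c in range(num_classes):
        pred_c = (pred == c) & valid
        gt_c = (gt == c) & valid
        union = (pred_c | gt_c).sum()
        ious.append(float('nan') if union == 0
                    else (pred_c & gt_c).sum() / union)
    return ious
```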
I want to download/submit, but get the error message "Due to data protection legislation, we need to manually approve each account before granting access to download the data."
The error message says it all: we have to manually approve each of our users individually before they can access the privacy-relevant data from our dataset. This process is done periodically but may take up to a week. Please use your academic email address to speed up the process, and register well ahead of paper/challenge deadlines. Sorry for the delays, but this is a necessity to fulfill our data protection obligations.
The WildDash Benchmark is part of the semantic and instance segmentation challenges of the
Robust Vision Challenge 2018.
If you want to participate, follow the instructions above and add _ROB as a postfix to your method name.
Please note that you must use the same model / parameter setup to compute your results for all benchmarks of the respective challenge.