WildDash 2 Benchmark

New: WildDash 2 with 4256 public frames, new labels & panoptic GT!  
See also: RailSem19 dataset for rail scene understanding.

Please note: This benchmark is not meant for iterative testing sessions or parameter tweaking. Each algorithm should only be submitted once its development is finished (or in a state that you want to reference in a publication). We have limited the total number of submissions to three. You cannot edit/update your submission after it has been accepted. Do not waste submissions to tweak your parameters or training data.
To be listed on the public leaderboard, please follow these steps:

  1. Get the data. Create an account to get access to the download links. The download packages contain additional technical submission details.
  2. Compute your results. Identical parameter settings must be used for all frames.
  3. Upload and submit. Log in, upload your results, add a brief description of your method, and submit for evaluation. To support double-blind review processes, author details may first be anonymous and may later be updated upon request.
  4. Check your results. Your submission will be evaluated automatically. You will get notified via email as soon as we approve the evaluation results.

Frequently Asked Questions
  1. wd0141-wd0284 are weird/broken!
    Yes indeed! These are negative test cases, i.e. cases where we expect the algorithm to fail. These out-of-scope cases are only evaluated in the "negative" category of the benchmark. They do not influence the other scores.
  2. How do you evaluate negative test cases?
    Any pixel with a void label is considered correct, as is any pixel that matches an additional best-case GT. For example, the upside-down image allows either void labels or an upside-down version of the regular GT as correct. Likewise, the correct result for the black-and-white image can either be void or the respective labels from a colored image showing the same scene. This is done for each pixel, so mixtures of void and valid labels will be evaluated fairly. For instance segmentation, the mixing of different silhouettes would skew results. Thus, here we only exchange the empty *.txt for the respective best-case GT. Panoptic segmentation uses a mixture of approaches described in our WildDash 2 paper.
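    As a rough illustration of the per-pixel rule for semantic segmentation (not the official evaluation code), the check could be sketched as follows. The void id of 0, the function name, and the toy label values are assumptions made for this example.

```python
import numpy as np

VOID_ID = 0  # assumed void label id; check the label policy in the download package


def negative_pixel_correct(pred, best_case_gt, void_id=VOID_ID):
    """Return a boolean mask of pixels accepted on a negative test frame."""
    pred = np.asarray(pred)
    best_case_gt = np.asarray(best_case_gt)
    # A pixel is accepted if it is void OR if it matches the best-case GT.
    return (pred == void_id) | (pred == best_case_gt)


# Toy 2x3 frame with arbitrary label ids.
best_case = np.array([[7, 7, 11], [7, 11, 11]])
prediction = np.array([[7, 0, 11], [26, 0, 11]])
mask = negative_pixel_correct(prediction, best_case)
print(mask)
```

    Only the pixel predicted as 26 is rejected here: it is neither void nor equal to the best-case GT at that position.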
  3. The classic Cityscapes trainIds 0-18 do not contain any void labels. How do I fix this?
    Note that WildDash 2 evaluates more classes than Cityscapes, including new classes like van, pickup, street light, billboard, and ego-vehicle. We do not encourage any solution that has no negative/void class. One quick-and-dirty hack to introduce void pixels could be a post-processing step where you map all pixels with an argmax probability < threshold to id 0. Whatever steps and mechanisms you choose: you have to handle all frames of the dataset in the same way. If you apply some post-processing to negative test frames only, then this is considered cheating! In the end, you have to find a balance between producing good-quality output and failing gracefully in out-of-scope situations.
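    The thresholding hack mentioned above could look roughly like the following sketch. The function name, the (C, H, W) probability layout, the channel-to-label mapping, and the threshold of 0.5 are all assumptions; the same threshold would have to be applied to every frame.

```python
import numpy as np

VOID_ID = 0  # id 0 is used for void, as described in the answer above


def argmax_with_void(probs, label_ids, threshold=0.5, void_id=VOID_ID):
    """Map per-class probabilities to label ids, marking unsure pixels as void.

    probs:     array of shape (C, H, W) with per-class probabilities
    label_ids: length-C sequence mapping channel index -> output label id
    """
    probs = np.asarray(probs, dtype=np.float64)
    label_ids = np.asarray(label_ids)
    best_channel = probs.argmax(axis=0)   # winning channel per pixel
    best_prob = probs.max(axis=0)         # its probability
    labels = label_ids[best_channel]      # fancy indexing returns a new array
    labels[best_prob < threshold] = void_id
    return labels


# Toy example: 3 classes, one row of 2 pixels.
probs = np.array([
    [[0.80, 0.40]],
    [[0.10, 0.35]],
    [[0.10, 0.25]],
])
result = argmax_with_void(probs, label_ids=[7, 8, 11], threshold=0.5)
print(result)
```

    The first pixel keeps its argmax label, while the second pixel falls below the threshold and becomes void.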
  4. The negative test cases are unfair!
    WildDash focuses on algorithm robustness rather than benchmarking the best-case performance. We apply the same metric to all submissions and weigh negative test cases and positive test cases according to their occurrence in the benchmark (18.6% vs. 81.4%) for a final meta average.
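    One plausible reading of this weighting is a simple weighted mean of the two scores. The sketch below follows that reading; the function name and the example scores are made up, and the official scripts may combine the values differently.

```python
NEGATIVE_WEIGHT = 0.186  # share of negative test cases stated above
POSITIVE_WEIGHT = 0.814  # share of positive test cases stated above


def meta_average(positive_score, negative_score):
    """Weighted mean of positive and negative scores (assumed combination rule)."""
    return POSITIVE_WEIGHT * positive_score + NEGATIVE_WEIGHT * negative_score


# Hypothetical scores in percent.
print(round(meta_average(70.0, 40.0), 2))
```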
  5. wd150 shows an indoor scene and the RVC challenge includes indoor datasets with indoor labels. Should I do something special here?
    No, WildDash evaluates labels compatible with the WildDash 2 label policy (Cityscapes label ids plus new labels ordered after that). For our dataset, the indoor scene is out-of-scope (see above about how negative tests are evaluated). You have to remap your ids accordingly. Please see the RVC website and the RVC dev kit for RVC specific questions.
  6. How do you calculate your metrics? Which label ids are relevant?
    Version two of WildDash (wd_public_02, wd_both_02) is evaluated with extended Cityscapes evaluation scripts. A total of six new labels are evaluated besides the classic 19 classes of Cityscapes: van, pickup, street light, billboard, guard rail, and ego-vehicle. In addition, see above about the handling of void labels for negative test cases. All evaluations other than the negative category ignore regions with void labels in the GT but otherwise count void in algorithm results as bad pixels.
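    The void handling for the non-negative categories can be illustrated with a small IoU sketch. This is not the extended Cityscapes script itself; the function name, the void id of 0, and the toy labels are assumptions.

```python
import numpy as np

VOID_ID = 0  # assumed void label id


def per_class_iou(pred, gt, class_ids, void_id=VOID_ID):
    """IoU per class; GT-void pixels are ignored, predicted void counts as an error."""
    pred = np.asarray(pred).ravel()
    gt = np.asarray(gt).ravel()
    valid = gt != void_id            # drop regions that are void in the GT
    pred, gt = pred[valid], gt[valid]
    ious = {}
    for c in class_ids:
        tp = int(np.sum((pred == c) & (gt == c)))
        fp = int(np.sum((pred == c) & (gt != c)))
        # A void prediction on a class-c GT pixel lands here as a miss.
        fn = int(np.sum((pred != c) & (gt == c)))
        denom = tp + fp + fn
        ious[c] = tp / denom if denom > 0 else float("nan")
    return ious


gt = [7, 7, 7, 0, 11, 11]
pred = [7, 0, 7, 7, 11, 7]
ious = per_class_iou(pred, gt, class_ids=[7, 11])
print(ious)
```

    The fourth pixel is void in the GT and is skipped entirely, whereas the void prediction on the second pixel lowers the IoU of class 7.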
  7. I want to download/submit, but get the error message "Due to data protection legislation, we need to manually approve each account before granting access to download the data."
    The error message says it all: We have to manually approve each of our users individually before they can access the privacy-relevant data from our dataset. This process is done periodically but may take up to a week. Please use your academic email address to speed up the process and register well ahead of paper/challenge deadlines. Correctly fill out the fields on your profile page (including "Latest publication") to speed things up. Sorry for the delays, but this is necessary to fulfill our data protection obligations. Do not register using anonymous accounts from public webmail providers (e.g. qq.com, outlook.com, gmail.com, 163.com, ...). We cannot check if these are valid email accounts!
  8. Which parts of the dataset may I use during training?
    You can use all 4256 frames from wd_public_02 and the associated GT during training in any way you see fit. The use of the 776 benchmarking frames (starting with "wd") from wd_both_02 during training (e.g. by creating GT yourself or by unsupervised learning) is considered cheating. We remove cheating submissions from the leaderboard and may impose temporary or permanent bans on cheating users or institutes.
  9. Can I use/apply FancyPreprocessingSteps/FancyPostprocessingSteps?
    You can use any kind and mixture of classifiers and pre/post-processing effects as long as there are no manual steps to distinguish parts of the benchmarking data (e.g. defining that a specific subset consists of negative test cases and processing them differently). We accept solutions that use the image content (rather than the image's file name or other meta-data) to automatically detect negative test examples and handle them differently.
  10. Is there an archived version of the RVC 2018 results?
    Yes! You can find the overall results here. The WildDash specific results are archived here: semantic / instance segmentation.
  11. Why are only evaluation results for validation frames visualized on the website? Where are the benchmarking frames?
    The GT for benchmarking frames should remain hidden to allow a fair evaluation. Visualizations with comparisons against the benchmarking GT would lead to the deciphering of our GT.
  12. The interface shows all submissions but the total number is less than expected!
    We will set the status of results for algorithms which were stable for more than a few months to "archived". This frees up new slots for you to submit new algorithms so that previously participating teams will not be at a disadvantage.
  13. My own validation results differ slightly from the numbers on the benchmark!
    The WildDash validation and benchmarking frames should have similar composition and difficulty, but some divergence is still to be expected. Additionally, WildDash evaluates metrics per frame and then averages these per-frame metrics over all frames of a given subset. This differs slightly from averaging over all pixels of all frames but better represents our interpretation of frames as individual test cases.
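    The difference between the two averaging schemes can be seen in a small sketch. The per-frame intersection/union pixel counts below are invented for illustration, and the function names are hypothetical.

```python
def per_frame_mean_iou(frames):
    """Average of per-frame IoU values; each frame counts as one test case."""
    ious = [inter / union for inter, union in frames if union > 0]
    return sum(ious) / len(ious)


def pooled_iou(frames):
    """IoU over the pixels of all frames pooled together."""
    total_inter = sum(inter for inter, _ in frames)
    total_union = sum(union for _, union in frames)
    return total_inter / total_union


# (intersection, union) pixel counts of one class in two frames.
frames = [(50, 100), (900, 1000)]
per_frame = per_frame_mean_iou(frames)
pooled = pooled_iou(frames)
print(round(per_frame, 3), round(pooled, 3))
```

    With these numbers the per-frame average is 0.7, while pooling gives about 0.864, because the frame with more affected pixels dominates the pooled result.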
  14. Which licenses apply to the datasets and benchmarks?
    These are available in the download itself and here:
    WildDash dataset and benchmark license
    RailSem19 dataset license
  15. Which format is used for the panoptic segmentation challenge?
    The panoptic segmentation challenge uses the COCO panoptic format. The image_id, which links an annotation to a benchmark frame, and category_ids can be obtained from the dataset.json located in the benchmark data folder.
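    For orientation, the sketch below builds one annotation entry in the COCO panoptic layout: a segments_info list in JSON plus a PNG whose colors encode the segment id as R + 256 * G + 256^2 * B. The image_id, file name, segment ids, and category ids are placeholders; the real values have to be taken from dataset.json, and the download package should be consulted for the exact submission layout.

```python
import json

import numpy as np


def id_to_rgb(segment_ids):
    """Encode a 2D map of segment ids as an (H, W, 3) uint8 RGB array."""
    ids = np.asarray(segment_ids, dtype=np.uint32)
    rgb = np.zeros(ids.shape + (3,), dtype=np.uint8)
    rgb[..., 0] = (ids % 256).astype(np.uint8)
    rgb[..., 1] = ((ids // 256) % 256).astype(np.uint8)
    rgb[..., 2] = ((ids // 256 ** 2) % 256).astype(np.uint8)
    return rgb


def make_annotation(image_id, file_name, segment_ids, segment_categories):
    """Build one annotation dict; ids missing from segment_categories are skipped."""
    segments_info = [
        {"id": int(s), "category_id": int(segment_categories[int(s)])}
        for s in np.unique(segment_ids)
        if int(s) in segment_categories
    ]
    return {"image_id": image_id, "file_name": file_name,
            "segments_info": segments_info}


# Placeholder data: segment id 0 is assumed to mean "unlabeled" and is left out.
segment_map = np.array([[1, 1], [300, 0]])
annotation = make_annotation("example_frame", "example_frame.png",
                             segment_map, {1: 7, 300: 26})
rgb = id_to_rgb(segment_map)
print(json.dumps(annotation))
```

    The rgb array would then be written as a PNG with an image library of your choice, and the annotation dicts collected in the "annotations" list of the result JSON.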
  16. I can't download the datasets
    Make sure your account is verified (i.e. you could submit results to the benchmark). Our server will terminate open connections after approx. 30 minutes. Use a download manager with resume support for longer downloads. Also, have a look at the automatic downloader script provided here: https://github.com/ozendelait/wilddash_scripts.
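    If you prefer to script a resumable download yourself, an HTTP Range request is the usual mechanism. The following is a minimal sketch using only the Python standard library; it assumes that the server honours Range headers and it does not handle the login required for the actual download links. The function names are hypothetical, and the URL and destination path are supplied by the caller.

```python
import os
import urllib.request


def resume_request(url, dest_path):
    """Build a request that continues after the bytes already on disk."""
    offset = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
    request = urllib.request.Request(url)
    if offset > 0:
        request.add_header("Range", f"bytes={offset}-")
    return request, offset


def download_with_resume(url, dest_path, chunk_size=1 << 20):
    """Download url to dest_path, appending if a partial file exists.

    Note: if the file is already complete, the server is likely to answer
    with an error (416), which urllib raises as HTTPError.
    """
    request, offset = resume_request(url, dest_path)
    with urllib.request.urlopen(request) as response:
        # Status 206 (Partial Content) means the Range header was honoured;
        # otherwise the server sends the whole file and we start over.
        mode = "ab" if offset > 0 and response.status == 206 else "wb"
        with open(dest_path, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

    Calling download_with_resume again after an interrupted transfer continues from the current file size instead of starting from scratch.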

The WildDash Benchmark is part of the semantic and instance segmentation challenges of the Robust Vision Challenge 2020. If you want to participate, follow the instructions above and add _RVC as a postfix to your method name. Please note that you must use the same model / parameter setup to compute your results for all benchmarks of the respective challenge.