New Hackathon: Robustness to distribution changes and ambiguity

EffiSciences is proud to announce a new hackathon organized with challenge-data-ens in France.

The objective of this event is to address a particularly actionable sub-problem of alignment: value extrapolation under distribution change and ambiguity, for a classification task. The hackathon runs until mid-March. To participate, go to https://challengedata.ens.fr/challenges/95.

Challenge goals

What if misleading correlations are present in the training dataset?

human_age is an image classification benchmark with a distribution change in the unlabeled data. The task is to classify faces as old or young. Each image also has text superimposed on it, reading either “old” or “young”. In the labeled training dataset, the text always matches the face. In the unlabeled (test) dataset, the text matches the face in only 50% of cases, which creates an ambiguity: a classifier trained on the labeled data cannot tell whether it should rely on the face or on the text.

We thus have 4 types of images:

  1. Age young, text young (AYTY),

  2. Age old, text old (AOTO),

  3. Age young, text old (AYTO),

  4. Age old, text young (AOTY).

Types 1 and 2 appear in both datasets; types 3 and 4 appear only in the unlabeled dataset.
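The four types and the datasets they appear in can be written down as a small table. The sketch below is purely illustrative: the names `ImageType`, `TYPES`, and `in_labeled` are hypothetical and are not part of the challenge's code or data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageType:
    code: str  # short name used in this post, e.g. "AYTO"
    age: str   # what the face shows: "young" or "old"
    text: str  # what the superimposed text says: "young" or "old"

    @property
    def in_labeled(self) -> bool:
        # The labeled set only contains images whose text matches the face.
        return self.age == self.text


TYPES = [
    ImageType("AYTY", "young", "young"),
    ImageType("AOTO", "old", "old"),
    ImageType("AYTO", "young", "old"),
    ImageType("AOTY", "old", "young"),
]

# Types seen during training versus types seen only at test time.
labeled_types = [t.code for t in TYPES if t.in_labeled]
unlabeled_only_types = [t.code for t in TYPES if not t.in_labeled]
```

Here `labeled_types` holds the two matching types and `unlabeled_only_types` holds the two mismatched ones, which are the source of the ambiguity.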

To resolve this ambiguity, participants can submit solutions to the leaderboard multiple times to test different hypotheses. Challengers may consider solutions that require two or more submissions to the leaderboard.

We use the accuracy on the unlabeled set of human_age as our metric.
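As a minimal sketch of why this metric separates competing hypotheses, consider a toy unlabeled set with one image of each type, assuming the target label is the age shown by the face. The function `accuracy` and the variables below are hypothetical illustrations, not the challenge's scoring code.

```python
def accuracy(predictions, targets):
    """Fraction of predictions equal to the corresponding target."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Toy unlabeled set: one (age, text) pair per type, i.e. equal proportions.
toy_unlabeled = [
    ("young", "young"),  # AYTY
    ("old", "old"),      # AOTO
    ("young", "old"),    # AYTO
    ("old", "young"),    # AOTY
]
true_ages = [age for age, _ in toy_unlabeled]

# Two hypotheses that agree on every labeled image (where text == age)
# but disagree on the mismatched types.
reads_face = [age for age, _ in toy_unlabeled]
reads_text = [text for _, text in toy_unlabeled]

face_score = accuracy(reads_face, true_ages)
text_score = accuracy(reads_text, true_ages)
```

In this toy setting the face-based hypothesis scores 1.0 and the text-based one scores 0.5. On the real leaderboard the true ages are hidden, so the score returned for a submission plays the role of `accuracy` here, which is what makes comparing submissions from different hypotheses informative.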

The human_age dataset

https://github.com/EffiSciencesResearch/challenge_data_ens_2023/blob/main/assets/human_age.png?raw=true

The dataset contains color images of size 218x178 pixels, of the 4 types described above (AYTY, AOTO, AYTO, AOTY). We provide:

  • a labeled set: 20000 images (either AOTO or AYTY)

  • an unlabeled set: about 70000 images of all four types in equal proportions, so the text matches the face in 50% of the images.

See the English version of the challenge page, https://challengedata.ens.fr/challenges/95, for more details.

References

This hackathon is inspired by:

https://www.lesswrong.com/posts/DiEWbwrChuzuhJhGr/benchmark-for-successful-concept-extrapolation-avoiding-goal

[1] Armstrong, S.; Cooper, J.; Daniels-Koch, O.; and Gorman, R. “The HappyFaces Benchmark.” Aligned AI Limited published public benchmark, 2022.

[2] D’Amour, Alexander, et al. “Underspecification presents challenges for credibility in modern machine learning.” arXiv preprint arXiv:2011.03395 (2020).

[3] Amodei, Dario, et al. “Concrete problems in AI safety.” arXiv preprint arXiv:1606.06565 (2016).

[4] Oakden-Rayner, Luke, et al. “Hidden stratification causes clinically meaningful failures in machine learning for medical imaging.” Proceedings of the ACM conference on health, inference, and learning. 2020.

[5] Liu, Ziwei, et al. “Large-scale CelebFaces Attributes (CelebA) dataset.” Retrieved August 15, 2018 (2018): 11.

[6] Lee, Yoonho, Huaxiu Yao, and Chelsea Finn. “Diversify and Disambiguate: Learning From Underspecified Data.” arXiv preprint arXiv:2202.03418v2 (2022).