If you look at the original dev data, you will see every datapoint is distinct. Training data set, however, has a lot of repetitions. This makes it infeasible to do a 90-10-10 split.
Potential solutions:
- one reasonable thing to do would be to (a) separate in a "not overlapping" dev and set up a cross-fold validation experiment
- using a fraction of the original dev as internal dev for Anli