Publication-quality grouped bar chart comparing models across multiple benchmarks with error bars.
For results sections that compare 3–5 models across 2–4 benchmark datasets.
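A minimal sketch of the grouped layout described above, assuming matplotlib is the plotting library (the source does not name one). The names `bar_offsets` and `plot_grouped`, the 0.8 group width, and the output path are illustrative choices, not part of the original description. `scores` and `errors` are assumed to map each model name to one value per benchmark.

```python
def bar_offsets(n_models, group_width=0.8):
    """Return (offsets, bar_width) so that n_models bars sit side by side,
    centred on each benchmark's tick position."""
    width = group_width / n_models
    offsets = [(i - (n_models - 1) / 2) * width for i in range(n_models)]
    return offsets, width


def plot_grouped(scores, errors, benchmarks, path="grouped_bars.pdf"):
    """Draw one group of bars per benchmark, one bar per model, with error bars.

    scores / errors: dict of model name -> list with one value per benchmark.
    """
    # Imported here so the layout helper above works without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # file output only, no display needed
    import matplotlib.pyplot as plt

    offsets, width = bar_offsets(len(scores))
    fig, ax = plt.subplots(figsize=(6, 3.5))
    for offset, (model, values) in zip(offsets, scores.items()):
        xs = [group + offset for group in range(len(benchmarks))]
        ax.bar(xs, values, width, yerr=errors[model], capsize=3, label=model)

    ax.set_xticks(list(range(len(benchmarks))))
    ax.set_xticklabels(benchmarks)
    ax.set_ylabel("F1")
    ax.legend(frameon=False)
    fig.tight_layout()
    fig.savefig(path)  # vector output (PDF) scales cleanly in print
    plt.close(fig)
```

Calling `plot_grouped(scores, errors, ["Benchmark 1", "Benchmark 2"])` with placeholder benchmark names would write a single-axes figure. What the error bars represent (standard deviation over seeds, a confidence interval, or something else) is not stated in the source and should be named in the caption.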
Using the same data, render horizontal bars with the models sorted by mean F1 across benchmarks. Use one panel per benchmark, stacked vertically, with the panels sharing the y-axis labels.
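The sorted, stacked-panel variant could be sketched as follows, again assuming matplotlib. `order_by_mean` and `plot_horizontal_panels` are hypothetical helper names, and `scores` / `errors` are assumed to map each model name to one value per benchmark. The sort is by the unweighted mean across benchmarks, best model at the top, which is one reading of "sorted by mean F1".

```python
from statistics import mean


def order_by_mean(scores):
    """Model names sorted by mean score across benchmarks, best first."""
    return sorted(scores, key=lambda m: mean(scores[m]), reverse=True)


def plot_horizontal_panels(scores, errors, benchmarks, path="panels.pdf"):
    """One horizontal-bar panel per benchmark, stacked vertically."""
    # Imported here so order_by_mean can be used without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # file output only, no display needed
    import matplotlib.pyplot as plt

    order = order_by_mean(scores)
    ys = list(range(len(order)))
    fig, axes = plt.subplots(
        len(benchmarks), 1, sharey=True, squeeze=False,
        figsize=(5, 1.6 * len(benchmarks)),
    )
    for j, (ax, bench) in enumerate(zip(axes[:, 0], benchmarks)):
        ax.barh(
            ys,
            [scores[m][j] for m in order],
            xerr=[errors[m][j] for m in order],
            capsize=3,
        )
        ax.set_title(bench, loc="left", fontsize=9)

    # The y-axis is shared, so ticks, labels, and inversion set on one
    # panel apply to every panel.
    axes[0, 0].set_yticks(ys)
    axes[0, 0].set_yticklabels(order)
    axes[0, 0].invert_yaxis()  # put the best model at the top
    axes[-1, 0].set_xlabel("F1")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
```

Using the same model order in every panel keeps rows aligned across benchmarks, at the cost that bars within a single panel are not necessarily in descending order.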
Add significance asterisks (* p < 0.05, ** p < 0.01) above each bar whose score is significantly higher than the BERT baseline. Add a small footnote naming the test (paired bootstrap, B = 1000 resamples).
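One common form of the paired bootstrap can be sketched with the standard library alone. This is an assumption about which variant is meant: the test set is resampled with replacement B times, both systems are scored on the same resampled examples, and the p-value is the fraction of resamples in which the system fails to beat the baseline. The function names, the binary-label F1 helper, and the fixed seed are illustrative and not taken from the source.

```python
import random


def f1_binary(gold, pred):
    """F1 for binary labels (1 = positive); returns 0.0 if undefined."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def paired_bootstrap_p(gold, pred_sys, pred_base, metric=f1_binary,
                       B=1000, seed=0):
    """Fraction of B resamples in which the system does NOT beat the baseline.

    Each resample draws len(gold) example indices with replacement and
    scores both systems on that same draw (this is the "paired" part).
    """
    rng = random.Random(seed)
    n = len(gold)
    not_better = 0
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        delta = (metric(g, [pred_sys[i] for i in idx])
                 - metric(g, [pred_base[i] for i in idx]))
        if delta <= 0:
            not_better += 1
    return not_better / B


def stars(p):
    """Map a p-value to the asterisk label used on the chart."""
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""
```

The returned label can then be drawn just past the end of each bar, for example with `ax.text(x, height + err, stars(p), ha="center", va="bottom")` on a vertical chart. With B = 1000 the smallest nonzero p-value this procedure can report is 0.001, which is enough resolution for the two thresholds used here.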