Why do my bars look slightly different from my data?

Image-generation models only approximate pixel positions, so bar heights can drift from the values you supplied. For final figure submission, use the prompt to draft the layout, then redraw the chart in matplotlib or PGFPlots from your exact numbers.

Can I get an Excel-friendly export?

Not directly: paperbanana outputs PNG. To get a chart backed by your data, generate the image, then rebuild the same layout in matplotlib so the bars are drawn from your CSV.
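
A minimal sketch of that redraw step, under stated assumptions: the CSV column names (`dataset`, `model`, `score`, `ci`), the placeholder rows, and the helper names `load_groups`, `bar_offsets`, and `draw` are all hypothetical and not part of paperbanana. matplotlib is imported inside `draw` so the parsing and layout helpers work without it.

```python
import csv
import io

# Placeholder rows for illustration only; replace with your own file contents.
SAMPLE_CSV = """dataset,model,score,ci
set_a,model_x,90.0,0.5
set_a,model_y,92.0,0.6
set_b,model_x,85.0,0.4
set_b,model_y,88.0,0.7
"""


def load_groups(csv_text):
    """Parse CSV text into {dataset: {model: (score, ci)}}, keeping row order."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        table.setdefault(row["dataset"], {})[row["model"]] = (
            float(row["score"]),
            float(row["ci"]),
        )
    return table


def bar_offsets(n_series, group_width=0.8):
    """Return (offsets, bar_width) that centre n_series bars on one group tick."""
    width = group_width / n_series
    offsets = [(i - (n_series - 1) / 2) * width for i in range(n_series)]
    return offsets, width


def draw(table, path="bars.png", ylim=(80, 100)):
    """Draw grouped bars with error whiskers and save them to a PNG file."""
    import matplotlib
    matplotlib.use("Agg")  # render off-screen, no display needed
    import matplotlib.pyplot as plt

    datasets = list(table)
    models = list(next(iter(table.values())))
    offsets, width = bar_offsets(len(models))

    fig, ax = plt.subplots(figsize=(7, 4))
    for offset, model in zip(offsets, models):
        xs = [i + offset for i in range(len(datasets))]
        ys = [table[d][model][0] for d in datasets]
        errs = [table[d][model][1] for d in datasets]
        ax.bar(xs, ys, width, yerr=errs, capsize=2, label=model)

    ax.set_xticks(list(range(len(datasets))))
    ax.set_xticklabels(datasets)
    ax.set_ylim(*ylim)  # explicit range so small differences stay visible
    ax.set_ylabel("F1 score (%)")
    ax.legend(loc="upper right")
    fig.savefig(path, dpi=300)
    plt.close(fig)
```

Calling `draw(load_groups(SAMPLE_CSV))` writes `bars.png`. For your own data, read the file with `open(path).read()` and pass the text to `load_groups`.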

Charts · Text → Image · Low credit · EN

Model Benchmark Grouped Bar Chart

Publication-quality grouped bar chart comparing models across multiple benchmarks with error bars.

When to use this prompt

For results sections that compare 3–5 models across 2–4 benchmark datasets.

The prompt

A grouped bar chart comparing F1 scores of 5 models across 3 benchmark datasets.

Models (with bar colors):
- BERT (slate)
- RoBERTa (steel blue)
- DeBERTa (teal)
- GPT-4 (amber)
- Claude (deep purple)

Datasets (x-axis groups):
- SQuAD: BERT 88.2, RoBERTa 90.1, DeBERTa 91.3, GPT-4 89.7, Claude 92.0
- MNLI: BERT 84.6, RoBERTa 87.2, DeBERTa 89.1, GPT-4 88.4, Claude 90.3
- SST-2: BERT 93.5, RoBERTa 95.0, DeBERTa 95.6, GPT-4 96.1, Claude 96.4

Y-axis: F1 score (%), range 80–100, gridlines every 5.
Error bars: thin black whiskers showing 95% CI (±0.4 to ±0.8 per bar).

Legend: top-right inside the plot area.
Title: "Model F1 across QA / NLI / Sentiment benchmarks".

Style: clean academic look, minimal palette, no chart junk, sans-serif. Match Nature / Science figure style.

Variations

Horizontal bars, sorted

Same data, but render as horizontal bars sorted by mean F1 across benchmarks. One panel per benchmark stacked vertically, sharing the y-axis labels.
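
If you redraw this variation yourself, the sort order can be computed from the numbers in the prompt. The sketch below is one way to do it; the `SCORES` layout and the function name `models_by_mean_f1` are illustrative choices, not anything the generator uses.

```python
from statistics import mean

# F1 scores copied from the dataset lines of the prompt.
SCORES = {
    "SQuAD": {"BERT": 88.2, "RoBERTa": 90.1, "DeBERTa": 91.3, "GPT-4": 89.7, "Claude": 92.0},
    "MNLI": {"BERT": 84.6, "RoBERTa": 87.2, "DeBERTa": 89.1, "GPT-4": 88.4, "Claude": 90.3},
    "SST-2": {"BERT": 93.5, "RoBERTa": 95.0, "DeBERTa": 95.6, "GPT-4": 96.1, "Claude": 96.4},
}


def models_by_mean_f1(scores):
    """Return model names ordered from highest to lowest mean F1 across benchmarks."""
    models = list(next(iter(scores.values())))
    means = {m: mean(row[m] for row in scores.values()) for m in models}
    return sorted(means, key=means.get, reverse=True)


print(models_by_mean_f1(SCORES))
# ['Claude', 'DeBERTa', 'GPT-4', 'RoBERTa', 'BERT']
```

With default axes, matplotlib's `barh` puts the first entry at the bottom, so reverse the list or invert the y-axis if the best model should appear on top.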

With statistical-significance asterisks

Add significance asterisks (* p<0.05, ** p<0.01) on top of bars where the score is significantly higher than the BERT baseline. Add a small footnote explaining the test (paired bootstrap, B=1000).
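
The asterisks should come from a test you actually ran, not from the generator. Below is a rough sketch of one common paired-bootstrap variant, assuming you have one score per test example for both models. The function names, the one-sided definition of the p-value, and the fixed seed are assumptions made for illustration.

```python
import random
from statistics import mean


def paired_bootstrap_p(baseline, system, b=1000, seed=0):
    """Share of resamples in which `system` fails to beat `baseline`.

    Both lists hold per-example scores for the same examples in the same order.
    """
    if len(baseline) != len(system):
        raise ValueError("paired test needs one score per example from each model")
    rng = random.Random(seed)
    n = len(baseline)
    not_better = 0
    for _ in range(b):
        # Resample example indices with replacement, keeping the pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        delta = mean(system[i] - baseline[i] for i in idx)
        if delta <= 0:
            not_better += 1
    return not_better / b


def asterisks(p):
    """Map a p-value to the markers used in the prompt."""
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""
```

For example, `asterisks(paired_bootstrap_p(bert_scores, claude_scores))` gives the marker to request for one bar, where both arguments are your own per-example score lists.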

Tips

Always specify the y-axis range. Generators default to 0–100, which compresses the differences between bars.
List exact numbers. Without them the bars will look plausible but will not be your data.
Name the error-bar statistic (95% CI, SE, or SD) so the generator draws whiskers of matching length.
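
To see why the error-bar label matters, the sketch below computes all three whisker lengths from the same scores. It assumes a normal approximation (z = 1.96) for the 95% CI, which understates the interval for very small samples. The function name and the sample values are hypothetical.

```python
from math import sqrt
from statistics import stdev


def whisker_lengths(samples, z=1.96):
    """Return half-lengths for SD, SE, and normal-approximation 95% CI whiskers."""
    n = len(samples)
    sd = stdev(samples)   # sample standard deviation
    se = sd / sqrt(n)     # standard error of the mean
    return {"SD": sd, "SE": se, "95% CI": z * se}


# Hypothetical scores from four runs of one model, for illustration only.
runs = [91.0, 92.0, 93.0, 92.0]
for name, half_length in whisker_lengths(runs).items():
    print(f"{name}: ±{half_length:.2f}")
```

The three values differ noticeably for the same runs, so an unlabeled whisker cannot be interpreted.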