Methodology

What is UPSC Bench?

UPSC Bench is an LLM evaluation benchmark based on India's UPSC Civil Services Examination — widely regarded as one of the most competitive exams in the world. It evaluates whether frontier AI models can pass both stages of the written exam: the Preliminary Examination (objective MCQs) and the Main Examination (essay and long-form written answers), using real questions, real marking schemes, and real cutoffs.

Why This Matters

The Civil Services Examination is not just an exam — it is a defining institution in Indian public life. Over a million candidates sit the Prelims each year, but only around 1,000 will eventually be selected after three grueling stages: Prelims (objective), Mains (nine written papers over five days), and a personality interview. The entire process spans over a year, and most serious aspirants dedicate two to four years of their lives to preparation, often relocating to coaching hubs like Delhi's Mukherjee Nagar or Rajinder Nagar.

The stakes justify the effort. Officers of the Indian Administrative Service are posted as District Magistrates with executive authority over districts larger than many countries. Indian Police Service officers command law enforcement across entire states. Indian Foreign Service officers represent the nation as diplomats. These are among the most powerful non-elected positions in the world's largest democracy, and the exam is the only path to them.

The UPSC journey has become a cultural phenomenon in India. To understand what this exam means, we recommend:

  • 12th Fail (2023) — A critically acclaimed film by Vidhu Vinod Chopra, based on the true story of IPS officer Manoj Kumar Sharma, who rose from a village with no electricity to clear one of the world's hardest exams. Wikipedia →
  • TVF Aspirants (2021) — A widely loved web series depicting the friendship, sacrifice, and heartbreak of three UPSC aspirants in Delhi. YouTube →
  • Wikipedia: Civil Services Examination — A comprehensive overview of the exam's structure, history, and selection process. Wikipedia →

Dataset

The benchmark covers the 2025 UPSC examination across both stages:

  • Prelims: 180 questions — 100 GS Paper I (General Studies) and 80 CSAT Paper II (Civil Services Aptitude Test)
  • Mains: 87 questions — 8 Essay topics, 20 GS1, 20 GS2, 20 GS3, and 19 GS4 questions

Questions are taken directly from the official UPSC question papers (both Prelims and Mains) and structured into a machine-readable JSON format using LLM-assisted parsing. Prelims answer keys are sourced from established coaching institutes (Vision IAS and others).
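The repository defines the exact schema; the sketch below shows what one parsed Prelims record might look like. Field names and the id format are illustrative assumptions, not the actual schema.

```python
# Hypothetical shape of a parsed Prelims question record (field names are illustrative).
example_question = {
    "question_id": "2025-gs1-042",            # hypothetical id format
    "paper": "GS Paper I",
    "question": "Consider the following statements: ...",
    "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
    "answer": "C",                             # from coaching-institute answer keys
    "source": "UPSC CSE 2025 official paper",
}
```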

Prelims Scoring

We use the exact UPSC marking scheme:

  • GS Paper I: +2.0 marks per correct answer, −0.66 marks per wrong answer, 0 for unanswered
  • CSAT Paper II: +2.5 marks per correct answer, −0.83 marks per wrong answer, 0 for unanswered

Both papers are scored out of 200 marks. GS Paper I determines merit ranking, while CSAT Paper II is qualifying only (minimum 33% required). The negative marking means random guessing is roughly break-even — models must actually know the answer.
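A minimal sketch of how this scheme translates into a scoring function (the marking values come from the list above; function and data-structure names are illustrative, not the benchmark's actual code):

```python
# Prelims marking scheme as described above.
MARKING = {
    "gs1":  {"correct": 2.0, "wrong": -0.66, "blank": 0.0},
    "csat": {"correct": 2.5, "wrong": -0.83, "blank": 0.0},
}

def score_paper(responses, answer_key, paper="gs1"):
    """responses/answer_key: dicts of question_id -> 'A'/'B'/'C'/'D' (or None)."""
    m = MARKING[paper]
    total = 0.0
    for qid, correct in answer_key.items():
        given = responses.get(qid)
        if given is None:
            total += m["blank"]
        elif given == correct:
            total += m["correct"]
        else:
            total += m["wrong"]
    return total

# Why random guessing is roughly break-even on GS Paper I (4 options):
# E[marks per question] = 0.25 * 2.0 + 0.75 * (-0.66) = 0.5 - 0.495 = +0.005
```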

Prelims Grading

UPSC Prelims is single-correct MCQ, so grading is deterministic — no LLM grader is needed. We extract the model's answer letter (A/B/C/D) from its output using regex-based parsing and compare it against the correct answer. If the model's output cannot be parsed into a valid answer, it is marked as "unanswered" (0 marks).
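The exact parsing rules live in the benchmark code; the snippet below is a simplified sketch of the idea, with illustrative regex patterns rather than the ones used in the repository:

```python
import re

def extract_answer(model_output: str):
    """Return 'A'/'B'/'C'/'D' if a single answer letter can be found, else None."""
    # Prefer an explicit "Answer: X" / "Option (X)" style statement.
    match = re.search(r"(?:answer|option)\s*[:\-]?\s*\(?([ABCD])\)?\b",
                      model_output, flags=re.IGNORECASE)
    if match:
        return match.group(1).upper()
    # Fall back to a bare letter on its own line, e.g. "(C)" or "C."
    match = re.search(r"^\s*\(?([ABCD])\)?[.\s]*$", model_output,
                      flags=re.IGNORECASE | re.MULTILINE)
    if match:
        return match.group(1).upper()
    return None  # unparseable -> treated as unanswered (0 marks)
```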

Mains Scoring

The Mains evaluation tests long-form writing ability across 5 papers totaling 1,250 marks:

  • Essay Paper (250 marks): 8 topics in two sections (A and B). Each model writes all 8 essays (up to 1,200 words each). The best essay from each section is selected, mirroring actual UPSC rules (see the sketch after this list).
  • GS Papers 1–4 (250 marks each): 20 questions per paper (10- and 15-mark questions). Each model writes 150–250 word answers.
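A sketch of the best-of-section essay rule, assuming each essay is marked out of 125 (two counted essays × 125 = 250); the data structure is illustrative:

```python
def essay_paper_score(essay_scores):
    """essay_scores: dict like {"A": [marks for the 4 Section A essays],
                                "B": [marks for the 4 Section B essays]}.
    Only the best essay from each section counts, mirroring the UPSC rule."""
    return max(essay_scores["A"]) + max(essay_scores["B"])
```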

Grading Rubric

All answers are graded by a calibrated LLM judge (Claude Opus 4.6) using a 5-dimension rubric:

Dimension                     GS Weight   Essay Weight
Content Accuracy / Breadth    40%         30%
Structure & Flow              20%         20%
Depth & Examples              20%         20%
Analytical Depth              10%         20%
Presentation                  10%         10%
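A sketch of how per-dimension ratings combine into a question score, using the weights from the table above. The judge prompt and exact aggregation live in the repository, so the names and the 0–1 rating scale here are assumptions:

```python
# Rubric weights from the table above.
GS_WEIGHTS = {
    "content_accuracy": 0.40, "structure_flow": 0.20,
    "depth_examples": 0.20, "analytical_depth": 0.10, "presentation": 0.10,
}
ESSAY_WEIGHTS = {
    "content_breadth": 0.30, "structure_flow": 0.20,
    "depth_examples": 0.20, "analytical_depth": 0.20, "presentation": 0.10,
}

def weighted_score(dimension_scores, weights, max_marks):
    """dimension_scores: dict of dimension -> fraction in [0, 1] from the judge."""
    fraction = sum(weights[d] * dimension_scores[d] for d in weights)
    return fraction * max_marks

# e.g. a 15-mark GS answer rated 0.6 on every dimension scores 9.0 marks.
```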

Debiasing & Calibration

To ensure fair and realistic grading, the judge uses several debiasing measures:

  • Blind grading: Model names are hidden from the judge. Candidates are labeled A, B, C, D, E.
  • Shuffled order: Candidate order is randomized with a fixed seed per question for reproducibility (see the sketch after the score anchors below).
  • Comparative format: All candidates' answers for the same question are graded in a single prompt, forcing the judge to differentiate rather than grade in isolation.
  • UPSC-calibrated score anchors: The judge prompt includes explicit scoring guidelines calibrated to real UPSC grading standards:

    <30%    — Irrelevant, factually wrong, or off-topic
    30–45%  — Partially relevant but shallow, missing key points
    45–55%  — Adequate. This is the median for serious Mains candidates.
    55–65%  — Good. Well-structured with relevant examples.
    65–75%  — Excellent. Exceptional depth and analysis.
    >75%    — Near-perfect. Almost never awarded in real UPSC grading.
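A minimal sketch of the blinding and fixed-seed shuffling steps; this is a hypothetical helper, and the real harness builds the full comparative prompt around something like it:

```python
import random
import string

def blind_and_shuffle(answers_by_model: dict, question_id: str):
    """Map model names to anonymous labels A, B, C, ... in a randomized
    but reproducible order, keyed on the question id."""
    rng = random.Random(question_id)       # fixed seed per question
    models = sorted(answers_by_model)      # stable starting order
    rng.shuffle(models)
    labels = string.ascii_uppercase
    blinded = {labels[i]: answers_by_model[m] for i, m in enumerate(models)}
    mapping = {labels[i]: m for i, m in enumerate(models)}  # kept aside for un-blinding
    return blinded, mapping
```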

Judge Validation

To verify the judge's accuracy, we ran a calibration study using 78 coaching institute model answers from InsightsIAS — a leading UPSC coaching institute that publishes detailed model answer synopses for each Mains question. We graded these 78 answers (covering GS Papers 1–4) with the same judge prompt and score anchors used for AI model answers, then compared the judge's scores against coaching-expected ranges.

Key Finding: The Judge Is Stricter Than Expected

The judge scores coaching model answers 8.8 percentage points lower than coaching institutes expect. This negative bias is consistent across all four GS papers, suggesting the judge is systematically stricter than real UPSC grading — not more lenient.

Metric                    Value
Coaching answers graded   78
Mean Absolute Error       1.26 marks (9.8%)
Mean offset (bias)        −8.8%
Within expected range     29.5%

Paper        MAE     Bias      In Range
GS Paper 1   1.45    −9.9%     20%
GS Paper 2   1.40    −8.0%     35%
GS Paper 3   0.88    −7.0%     45%
GS Paper 4   1.34    −10.3%    17%
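The headline numbers reduce to simple statistics over the 78 graded answers. The sketch below assumes each coaching answer carries an expected score range and that bias is measured against the range midpoint; helper and field names are illustrative:

```python
def calibration_stats(records):
    """records: list of dicts with 'judge_marks', 'expected_low',
    'expected_high', and 'max_marks' for each coaching-institute answer."""
    errors, offsets, in_range = [], [], 0
    for r in records:
        expected_mid = (r["expected_low"] + r["expected_high"]) / 2
        errors.append(abs(r["judge_marks"] - expected_mid))
        offsets.append((r["judge_marks"] - expected_mid) / r["max_marks"])
        if r["expected_low"] <= r["judge_marks"] <= r["expected_high"]:
            in_range += 1
    n = len(records)
    return {
        "mae_marks": sum(errors) / n,        # mean absolute error in marks
        "bias_pct": 100 * sum(offsets) / n,  # negative -> judge is stricter
        "in_range_pct": 100 * in_range / n,
    }
```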

What This Means

The negative bias means our reported AI model scores (58–72% on Mains) are conservative estimates. If the judge were perfectly calibrated to real UPSC grading, AI scores would likely be even higher. We chose not to adjust scores upward — reporting conservative numbers is preferable to inflated ones.

The judge's strictness appears driven by two factors the coaching institutes don't penalize in their own expected scores: (1) word limit violations (coaching model answers consistently exceed limits by 2–4x), and (2) formulaic template structure (identical intro/body/conclusion patterns across all answers). These are legitimate quality concerns that a real UPSC examiner would also penalize, suggesting the judge may actually be more accurate than the coaching institutes' self-assessments.

Full validation data: data/calibration/ | Judge prompt

Pass / Fail Criteria

Prelims

Each model's GS Paper I score is compared against the actual UPSC General category cutoff for that year. CSAT requires a minimum of 66/200 (33%) to qualify. These cutoffs vary year-to-year based on exam difficulty:

Year   GS1 Cutoff   CSAT Qualifying
2025   90/200       66/200
2024   87.98/200    66/200
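A minimal pass check against these cutoffs; the values come from the table above, and the structure is illustrative:

```python
# General category cutoffs from the table above.
PRELIMS_CUTOFFS = {
    2025: {"gs1": 90.0, "csat": 66.0},
    2024: {"gs1": 87.98, "csat": 66.0},
}

def passes_prelims(gs1_marks, csat_marks, year=2025):
    cut = PRELIMS_CUTOFFS[year]
    # CSAT is qualifying only; GS Paper I must clear the merit cutoff.
    return csat_marks >= cut["csat"] and gs1_marks >= cut["gs1"]
```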

Mains

The Mains cutoff is proportionally derived from the real UPSC written exam cutoff. UPSC publishes a combined cutoff for all written papers (Essay + GS1-4 + Optional = 1,750 marks). Since we exclude the Optional paper, we scale proportionally: an approximate cutoff of 800/1,750 becomes 571/1,250. A model passes Mains if its total score across 5 papers exceeds this threshold.
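The scaling itself is one line of arithmetic; a sketch, using the approximate 800/1,750 written cutoff stated above:

```python
WRITTEN_CUTOFF = 800    # approximate real cutoff out of 1,750 written marks
FULL_WRITTEN   = 1750   # Essay + GS1-4 + Optional
BENCH_WRITTEN  = 1250   # Essay + GS1-4 only (Optional excluded)

MAINS_CUTOFF = WRITTEN_CUTOFF * BENCH_WRITTEN / FULL_WRITTEN  # ~571.4 marks

def passes_mains(total_marks_out_of_1250):
    return total_marks_out_of_1250 >= MAINS_CUTOFF
```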

Models Evaluated

  • GPT-5.2 (OpenAI)
  • Claude Opus 4.6 (Anthropic)
  • Gemini 3.1 Pro (Google)
  • Gemini 3 Flash (Google)
  • Gemini 2.5 Flash (Google)

Human reference: Shakti Dubey, CSE 2024 AIR 1. Mains score proportionally estimated at 602/1,250 (from 843/1,750 total written marks). UPSC does not publish paper-wise marks, so this is a uniform estimate across all papers.

All models are evaluated at temperature 0 for maximum determinism, via the OpenRouter API.
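A minimal sketch of such an evaluation call, assuming OpenRouter's OpenAI-compatible endpoint; the model slug and prompt are placeholders, not the exact ones used by the benchmark:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible endpoint
    api_key="YOUR_OPENROUTER_KEY",
)

response = client.chat.completions.create(
    model="anthropic/claude-opus-4.6",        # placeholder slug
    temperature=0,                            # deterministic decoding, as above
    messages=[{"role": "user", "content": "<prelims question prompt>"}],
)
print(response.choices[0].message.content)
```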

Questions & Answers

Why use an LLM to judge LLM answers?

Mains has no answer key — UPSC publishes questions but not model answers. Human grading would be ideal but is prohibitively expensive at scale (87 questions × 5 models = 435 answers to grade). LLM-as-judge is the practical alternative, and we mitigate its weaknesses through comparative grading (forcing differentiation), UPSC-calibrated score anchors (preventing score inflation), blind evaluation (preventing model favoritism), and fixed-seed shuffling (ensuring reproducibility).

How reliable is LLM-as-judge grading?

We validated the judge against 78 coaching institute model answers from InsightsIAS. The judge scores these answers 8.8% lower than coaching expectations on average — meaning it is stricter than real UPSC grading, not more lenient. The comparative format (grading all candidates for the same question in a single prompt) forces relative distinctions, and the UPSC-calibrated anchors prevent the common failure mode of LLM judges scoring everything above 80%. See the Judge Validation section above for full metrics.

Why do AI models outscore the human reference on Mains?

Three factors: (1) The human score is a proportional estimate — UPSC publishes total written marks (843/1,750 for AIR 1) but not paper-wise breakdowns, so we distribute evenly across papers. The real distribution is likely uneven. (2) AI models write verbose, well-structured answers that score well on rubric dimensions like "Structure & Flow" and "Presentation" — but real UPSC grading may value conciseness and handwriting quality differently. (3) The benchmark excludes the Optional paper (250 marks) and Interview (275 marks), which are part of the full selection process.

What about the Interview stage?

Not evaluated. The UPSC personality test is a 275-mark interview conducted by a board of senior civil servants. It assesses personality, communication, and situational judgment — qualities that require physical presence and real-time interaction. This is fundamentally different from a text-based benchmark.

Can I add my own model?

Yes. Create a config YAML in config/ specifying the model name, provider, and parameters. Run the Prelims benchmark with python benchmark/runner.py --config config/your_model.yaml and the Mains benchmark with python -m benchmark.mains_runner --config config/mains_your_model.yaml. Then regenerate the leaderboard.

Limitations

  • LLM-as-judge bias: The Mains judge (Claude Opus 4.6) may have systematic biases — e.g., preferring verbose answers, favoring certain writing styles, or having blind spots on India-specific cultural context. Comparative grading and score anchors mitigate but don't eliminate this.
  • Human reference is estimated: The Mains human reference score (Shakti Dubey, AIR 1) is proportionally scaled from total written marks. The actual paper-wise distribution is unknown and likely uneven.
  • Optional paper excluded: Real UPSC Mains includes an Optional subject paper (250 marks). Our benchmark covers Essay + GS1-4 only (1,250 of 1,750 written marks).
  • Answer key errors: Answer keys are sourced from coaching institutes and may contain errors, particularly for disputed Prelims questions.
  • PDF extraction artifacts: PDF extraction may introduce artifacts in question text or miss complex formatting.
  • CSAT comprehension passages: The full passage context is provided for CSAT Paper II comprehension questions, but formatting nuances may be lost.