Methodology

What is UPSC Bench?

UPSC Bench is an LLM evaluation benchmark based on India's UPSC Civil Services Preliminary Examination, widely regarded as one of the most competitive exams in the world. It tests whether frontier AI models can pass the same exam that millions of Indian aspirants spend years preparing for.

Why This Matters

The Civil Services Examination is not just an exam — it is a defining institution in Indian public life. Over a million candidates sit the Prelims each year, but only around 1,000 will eventually be selected after three grueling stages: Prelims (objective), Mains (nine written papers over five days), and a personality interview. The entire process spans over a year, and most serious aspirants dedicate two to four years of their lives to preparation, often relocating to coaching hubs like Delhi's Mukherjee Nagar or Rajinder Nagar.

The stakes justify the effort. Officers of the Indian Administrative Service are posted as District Magistrates with executive authority over districts larger than many countries. Indian Police Service officers command law enforcement across entire states. Indian Foreign Service officers represent the nation as diplomats. These are among the most powerful non-elected positions in the world's largest democracy, and the exam is the main direct route to them.

The UPSC journey has become a cultural phenomenon in India. To understand what this exam means, we recommend:

  • 12th Fail (2023): A critically acclaimed film by Vidhu Vinod Chopra, based on the true story of IPS officer Manoj Kumar Sharma, who rose from a village with no electricity to clear one of the world's hardest exams. (Wikipedia)
  • TVF Aspirants (2021): A widely loved web series depicting the friendship, sacrifice, and heartbreak of three UPSC aspirants in Delhi. (YouTube)
  • Civil Services Examination (Wikipedia): A comprehensive overview of the exam's structure, history, and selection process.

Dataset

The benchmark covers five years of UPSC Prelims papers (2020–2024), including both GS Paper I (General Studies, 100 questions per year) and CSAT Paper II (Civil Services Aptitude Test, 80 questions per year). At 180 questions per year over five years, the full dataset contains approximately 900 questions.

Questions are extracted from official UPSC PDF papers using the Reducto AI document parsing API, then structured into a machine-readable JSON format using LLM-assisted parsing. Answer keys are sourced from established coaching institutes (Vision IAS and others).
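To make the "machine-readable JSON format" concrete, here is a minimal sketch of what one question record and its validation could look like. The field names (`year`, `paper`, `number`, `text`, `options`, `answer`, `passage`, `image_path`) and the `load_question` helper are hypothetical, chosen for illustration; the actual schema used by the benchmark is not specified here.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Question:
    """One MCQ record. All field names are illustrative assumptions."""
    year: int                         # 2020-2024
    paper: str                        # "GS1" or "CSAT"
    number: int                       # question number within the paper
    text: str                         # question stem
    options: Dict[str, str]           # {"A": ..., "B": ..., "C": ..., "D": ...}
    answer: str                       # answer-key letter
    passage: Optional[str] = None     # CSAT comprehension passage, if any
    image_path: Optional[str] = None  # extracted map/diagram, if any


def load_question(raw: str) -> Question:
    """Parse one JSON record and check that it is a well-formed 4-option MCQ."""
    q = Question(**json.loads(raw))
    if q.paper not in ("GS1", "CSAT"):
        raise ValueError(f"unknown paper: {q.paper!r}")
    if set(q.options) != {"A", "B", "C", "D"}:
        raise ValueError("expected exactly the options A, B, C, D")
    if q.answer not in q.options:
        raise ValueError(f"answer {q.answer!r} is not one of the options")
    return q


# Placeholder record (not a real UPSC question).
sample = json.dumps({
    "year": 2024, "paper": "GS1", "number": 1,
    "text": "<question stem>",
    "options": {"A": "<a>", "B": "<b>", "C": "<c>", "D": "<d>"},
    "answer": "B",
})
question = load_question(sample)
```

Validating each record at load time is one way to catch the extraction artifacts that PDF parsing can introduce, such as a missing option.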

Scoring

We use the exact UPSC marking scheme:

  • GS Paper I: +2.0 marks per correct answer, -0.66 marks per wrong answer, 0 for unanswered
  • CSAT Paper II: +2.5 marks per correct answer, -0.83 marks per wrong answer, 0 for unanswered

Both papers are scored out of 200 marks. Only GS Paper I counts toward the Prelims cutoff, while CSAT Paper II is qualifying only (minimum 33%, i.e. 66 marks out of 200).
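The marking scheme above reduces to a short calculation. The sketch below is illustrative rather than the benchmark's actual code: the `MARKING` table and the `paper_score` function are hypothetical names, and rounding to two decimals is an assumption.

```python
# (marks per correct answer, penalty per wrong answer), as listed above.
MARKING = {
    "GS1": (2.0, 0.66),
    "CSAT": (2.5, 0.83),
}


def paper_score(paper: str, correct: int, wrong: int) -> float:
    """Total marks for one paper; unanswered questions contribute 0."""
    reward, penalty = MARKING[paper]
    # Rounding to two decimals is an assumption made for readable output.
    return round(correct * reward - wrong * penalty, 2)


# 60 correct, 30 wrong, 10 unanswered on GS Paper I.
gs1 = paper_score("GS1", correct=60, wrong=30)
# 40 correct, 20 wrong, 20 unanswered on CSAT Paper II.
csat = paper_score("CSAT", correct=40, wrong=20)
```

For example, 60 correct and 30 wrong answers on GS Paper I give 60 × 2.0 - 30 × 0.66 = 100.2 marks, and 40 correct with 20 wrong on CSAT give 100 - 16.6 = 83.4 marks.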

Grading

UPSC Prelims is single-correct MCQ, so grading is deterministic — no LLM grader is needed. We extract the model's answer letter (A/B/C/D) from its output using regex-based parsing and compare it against the correct answer. If the model's output cannot be parsed into a valid answer, it is marked as "unanswered" (0 marks).
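A minimal sketch of regex-based answer extraction follows. The two patterns and the function names `extract_answer` and `grade` are assumptions for illustration; the benchmark's actual regexes are not given here. This version accepts only uppercase letters, so that phrases such as "the answer is a matter of debate" are not misread as option A, and it takes the last explicitly stated answer when a model revises itself.

```python
import re
from typing import Optional

# "Answer: B", "the answer is (C)", "ANSWER - D"; the letter must be uppercase
# and must not be the first letter of a longer word (e.g. "Delhi").
_EXPLICIT = re.compile(r"(?i:answer)\s*(?i:is)?\s*[:\-]?\s*\(?([A-D])\)?(?![A-Za-z])")
# Output that consists of nothing but a letter, e.g. "B", "(C)", "D."
_BARE = re.compile(r"^\s*\(?([A-D])\)?[\s.)]*$")


def extract_answer(output: str) -> Optional[str]:
    """Return 'A'-'D', or None if no answer letter can be parsed."""
    explicit = _EXPLICIT.findall(output)
    if explicit:
        return explicit[-1]  # the last stated answer wins
    bare = _BARE.match(output)
    return bare.group(1) if bare else None


def grade(output: str, correct: str) -> str:
    """Classify a model output as 'correct', 'wrong', or 'unanswered'."""
    answer = extract_answer(output)
    if answer is None:
        return "unanswered"
    return "correct" if answer == correct else "wrong"
```

Returning `None` instead of guessing keeps unparseable outputs in the "unanswered" bucket, which scores 0 rather than incurring the negative mark for a wrong answer.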

Pass / Fail Criteria

Each model's GS Paper I score is compared against the actual UPSC General category cutoff for that year. These cutoffs vary year-to-year based on exam difficulty:

Year    GS1 Cutoff    CSAT Qualifying
2024    93.34/200     66/200
2023    91.09/200     66/200
2022    87.54/200     66/200
2021    87.54/200     66/200
2020    92.51/200     66/200
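The comparison against these cutoffs can be sketched as below. The cutoff values are copied from the table above; the function names are hypothetical, and two details are assumptions: that a score exactly equal to the cutoff counts as clearing it, and that an overall pass also requires the CSAT qualifying mark.

```python
# General category GS Paper I cutoffs from the table above (marks out of 200).
GS1_CUTOFF = {2024: 93.34, 2023: 91.09, 2022: 87.54, 2021: 87.54, 2020: 92.51}
CSAT_QUALIFYING = 66.0  # 33% of 200, identical for every year in the table


def clears_gs1(year: int, gs1_score: float) -> bool:
    """True if the GS Paper I score meets that year's cutoff (>= is an assumption)."""
    return gs1_score >= GS1_CUTOFF[year]


def qualifies_csat(csat_score: float) -> bool:
    """True if the CSAT Paper II score reaches the qualifying mark."""
    return csat_score >= CSAT_QUALIFYING


def passes_prelims(year: int, gs1_score: float, csat_score: float) -> bool:
    """Assumption: an overall pass needs both the GS1 cutoff and CSAT qualification."""
    return clears_gs1(year, gs1_score) and qualifies_csat(csat_score)
```

Keeping the two checks separate lets a results table report the GS Paper I verdict on its own, with CSAT qualification shown alongside it.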

Models Evaluated

  • Claude Opus 4 (Anthropic)
  • Claude Sonnet 4 (Anthropic)
  • GPT-4o (OpenAI)
  • Gemini 2.5 Pro (Google)
  • Gemini 2.0 Flash (Google)
  • DeepSeek R1 (DeepSeek)

All models are evaluated with temperature 0 for maximum determinism. Models with vision capabilities receive actual question images; text-only models receive image descriptions as text fallback.

Multimodal Support

Some UPSC questions include maps, diagrams, or charts. For models that support vision (all except DeepSeek R1), actual images extracted from the PDFs are sent as base64-encoded inputs. For text-only models, we provide descriptive text about the image content.
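The vision and text-fallback paths can be illustrated with a small helper. This is a sketch under stated assumptions: `build_question_payload` and the dictionary keys `text` and `image_base64` are hypothetical and do not correspond to any provider's request format, which each model API defines differently.

```python
import base64
from typing import Optional


def build_question_payload(
    question_text: str,
    image_bytes: Optional[bytes],
    image_description: Optional[str],
    supports_vision: bool,
) -> dict:
    """Build a provider-neutral payload for one question (illustrative only)."""
    payload = {"text": question_text}
    if image_bytes is None:
        return payload  # most questions have no image
    if supports_vision:
        # Vision models get the extracted image, base64-encoded.
        payload["image_base64"] = base64.b64encode(image_bytes).decode("ascii")
    elif image_description:
        # Text-only models get a description of the image instead.
        payload["text"] += f"\n\n[Image description: {image_description}]"
    return payload


vision = build_question_payload("Q?", b"abc", "a map of India", supports_vision=True)
text_only = build_question_payload("Q?", b"abc", "a map of India", supports_vision=False)
```

In a real harness the payload would then be translated into each provider's own message structure; that step is omitted here because it differs per API.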

Limitations

  • Answer keys from coaching institutes may have errors, particularly for disputed questions
  • PDF extraction may introduce artifacts in question text or miss complex formatting
  • CSAT Paper II includes comprehension passages — the full passage context is provided but may lose formatting nuances
  • This benchmark only covers Prelims (MCQ). The UPSC Mains (descriptive essays) and Interview stages are not evaluated