LLM-as-Judge for Scientific Creativity Scoring
This project investigates prompt engineering strategies for using large language models (LLMs) as automated judges to predict human-assigned creativity scores on the Scientific Creative Thinking Test (SCTT). Through a systematic ablation study, the project compares zero-shot, few-shot, chain-of-thought, and multi-dimensional rubric-based prompting approaches, examining how prompt specificity and reasoning effort affect regression performance against psychometrically derived creativity scores.
This project investigates LLM-based automated scoring of scientific creativity using the Scientific Creative Thinking Test (SCTT) — a psychometrically validated instrument where students generate ideas (experiments, hypotheses, research questions) for 15 scientific scenarios.
The central questions are:
- which prompt engineering strategies best enable an LLM to predict human-assigned creativity scores without task-specific fine-tuning?
- To what extent the reasoning effort, available with recent OpenAI models, can improve performance, and which aspects of creativity it attends to?
Research Questions:
- Which prompt engineering strategies most improve LLM-based creativity score prediction?
- How does prompt specificity — decomposing creativity into fluency, flexibility, originality, and elaboration — affect prediction performance?
Approach:
The project frames creativity scoring as a regression task. Ground-truth labels are JRT (Joint Rating Theory) scores derived from human raters, normalized to [0, 1]. LLM outputs (1–5 scale) are mapped to this continuous target. An ablation study systematically varies five dimensions:
- Zero-shot vs. few-shot (1, 3, 5 examples)
- No rubric vs. coarse vs. fine-grained rubric
- Holistic scoring vs. per-dimension (fluency, flexibility, originality, elaboration)
- Without vs. with chain-of-thought reasoning steps
- Pointwise vs. pairwise comparative judgment
Dataset: SCTT (~18K student responses across train/val/test/heldout splits); OSF: osf.io/439zs
Key Results (preliminary, val n=30): Best prompt-only condition (few-shot + rubric + CoT) achieves r = 0.294. Adding high reasoning effort (reasoning_effort="high") raises this to r = 0.523 — approaching the fine-tuned ceiling of r = 0.74 reported in prior work.
Future Directions:
A key open question is whether the large gain from reasoning_effort="high" reflects genuine creativity understanding or surface-level pattern matching. Planned work to investigate this includes:
- Reasoning trace analysis — Collect reasoning summaries (via the OpenAI Responses API) for high- and low-scoring responses to examine whether the model spontaneously attends to SCTT rubric dimensions (fluency, flexibility, originality, elaboration) without being prompted to.
- Reasoning-score alignment — Compare what the model reasons about vs. what it actually scores; systematic misalignment may explain error patterns on specific dimensions.
- Prompted vs. internal CoT — Directly compare prompt-level chain-of-thought traces against internal reasoning summaries on the same responses to determine whether explicit scaffolding constrains or enriches natural reasoning.
- Creativity construct validity — Analyze whether unprompted reasoning traces reference the same constructs specified in the SCTT rubric, informing whether LLM-as-judge is a valid proxy for human raters.
- Post-hoc score calibration — Fit a calibration mapping (e.g., isotonic regression) from LLM outputs to the JRT score distribution using a held-out set, as an alternative to fine-tuning.
- Fine-tuning on SCTT — Train on the 11,969 labeled responses as a ceiling-bound experiment.
Team: Stanley Nurnberger, Dr. Jiho Noh