LLM-as-Judge for Scientific Creativity Scoring

This project investigates LLM-based automated scoring of scientific creativity using the Scientific Creative Thinking Test (SCTT) — a psychometrically validated instrument where students generate ideas (experiments, hypotheses, research questions) for 15 scientific scenarios.

The central questions are:

Which prompt engineering strategies best enable an LLM to predict human-assigned creativity scores without task-specific fine-tuning?
To what extent does the reasoning-effort setting available in recent OpenAI models improve performance, and which aspects of creativity does the model attend to when it is used?

Research Questions:

Which prompt engineering strategies most improve LLM-based creativity score prediction?
How does prompt specificity — decomposing creativity into fluency, flexibility, originality, and elaboration — affect prediction performance?

Approach:

The project frames creativity scoring as a regression task. Ground-truth labels are JRT (Judge Response Theory) scores derived from human raters, normalized to [0, 1]. LLM outputs (1–5 scale) are mapped to this continuous target. An ablation study systematically varies five dimensions:

Zero-shot vs. few-shot (1, 3, 5 examples)
No rubric vs. coarse vs. fine-grained rubric
Holistic scoring vs. per-dimension (fluency, flexibility, originality, elaboration)
Without vs. with chain-of-thought reasoning steps
Pointwise vs. pairwise comparative judgment
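
The five dimensions above can be sketched as a condition grid, together with one possible mapping from the 1–5 LLM scale to the [0, 1] target. This is a minimal illustration under stated assumptions: the dictionary keys and value labels are hypothetical names, the full-factorial enumeration is only one way to run the ablation (the project may instead vary one factor at a time), and the linear min-max mapping is an assumed choice, not necessarily the one used in the repository.

```python
from itertools import product

# Hypothetical labels for the five ablation dimensions listed above.
ABLATION_DIMENSIONS = {
    "n_shots": [0, 1, 3, 5],                      # zero-shot vs. few-shot
    "rubric": ["none", "coarse", "fine"],         # rubric granularity
    "scoring": ["holistic", "per_dimension"],     # one score vs. four sub-scores
    "chain_of_thought": [False, True],            # prompted reasoning steps
    "judgment": ["pointwise", "pairwise"],        # absolute vs. comparative
}


def enumerate_conditions(dimensions):
    """Return every combination of settings as a list of dicts (full factorial)."""
    names = list(dimensions)
    return [dict(zip(names, values)) for values in product(*dimensions.values())]


def likert_to_unit(score, low=1, high=5):
    """Linearly map a rating on [low, high] to [0, 1] (assumed min-max mapping)."""
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    return (score - low) / (high - low)


conditions = enumerate_conditions(ABLATION_DIMENSIONS)
print(len(conditions))      # 4 * 3 * 2 * 2 * 2 combinations
print(conditions[0])        # first condition: zero-shot, no rubric, holistic, ...
print(likert_to_unit(3))    # midpoint of the 1-5 scale
```

A full grid of this size is expensive to run against an API, which is one reason an ablation might fix most factors at a baseline and vary a single dimension at a time.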

Dataset: SCTT (~18K student responses across train/val/test/heldout splits); OSF: osf.io/439zs

Key Results (preliminary, val n=30): The best prompt-only condition (few-shot + rubric + CoT) achieves r = 0.294. Adding high reasoning effort (reasoning_effort="high") raises this to r = 0.523, which narrows the gap to, but remains below, the fine-tuned ceiling of r = 0.74 reported in prior work.
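
To put the preliminary correlations in context, the sketch below computes a correlation and an approximate 95% confidence interval using the Fisher z-transformation. It assumes that r is a Pearson correlation and that the 30 validation items are independent; neither is confirmed by the text above, and the function names are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.96):
    """Approximate CI for a correlation via the Fisher z-transformation.

    z_crit=1.96 corresponds to a two-sided 95% interval.
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


# Reported correlations from the text, with n = 30 validation items.
for r in (0.294, 0.523):
    lo, hi = fisher_ci(r, n=30)
    print(f"r = {r:.3f}, approx. 95% CI = [{lo:.2f}, {hi:.2f}]")
```

Under these assumptions, the interval for r = 0.294 at n = 30 includes zero, which is consistent with labeling the results as preliminary. Note also that Pearson r is unchanged by a linear rescaling of the LLM scores, so the choice of a linear 1–5 to [0, 1] mapping would not affect it.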

Future Directions:

A key open question is whether the large gain from reasoning_effort="high" reflects genuine creativity understanding or surface-level pattern matching. Planned work to investigate this includes:

Reasoning trace analysis: Collect reasoning summaries (via the OpenAI Responses API) for high- and low-scoring responses to examine whether the model spontaneously attends to the SCTT rubric dimensions (fluency, flexibility, originality, elaboration) without being prompted to do so.
Reasoning-score alignment: Compare what the model reasons about with the score it actually assigns; systematic misalignment may explain error patterns on specific dimensions.
Prompted vs. internal CoT: Directly compare prompt-level chain-of-thought traces against internal reasoning summaries on the same responses to determine whether explicit scaffolding constrains or enriches natural reasoning.
Creativity construct validity: Analyze whether unprompted reasoning traces reference the same constructs specified in the SCTT rubric, which bears on whether LLM-as-judge is a valid proxy for human raters.
Post-hoc score calibration: Fit a calibration mapping (e.g., isotonic regression) from LLM outputs to the JRT score distribution using a held-out set, as an alternative to fine-tuning.
Fine-tuning on SCTT: Train on the 11,969 labeled responses to establish an upper bound for comparison with the prompt-only conditions.
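
The post-hoc calibration idea in the list above can be sketched with a small pure-Python isotonic fit using the pool-adjacent-violators (PAV) algorithm. This is an illustration only: it assumes the LLM emits discrete integer scores, the function name and toy data are hypothetical, and a library implementation such as scikit-learn's IsotonicRegression would be a more typical choice in practice.

```python
from collections import defaultdict


def fit_isotonic_calibration(llm_scores, jrt_scores):
    """Fit a non-decreasing map from LLM score levels to JRT scores via PAV.

    Returns a dict {llm_level: calibrated_value}. Levels that never occur in
    the calibration data are absent from the result.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, target in zip(llm_scores, jrt_scores):
        sums[level] += target
        counts[level] += 1

    # Each block holds [weighted mean, weight, levels covered by the block].
    blocks = []
    for level in sorted(sums):
        blocks.append([sums[level] / counts[level], counts[level], [level]])
        # Merge adjacent blocks while the means violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean_b, weight_b, levels_b = blocks.pop()
            mean_a, weight_a, levels_a = blocks.pop()
            weight = weight_a + weight_b
            pooled = (mean_a * weight_a + mean_b * weight_b) / weight
            blocks.append([pooled, weight, levels_a + levels_b])

    return {level: mean for mean, _, levels in blocks for level in levels}


# Toy held-out data (fabricated for illustration, not SCTT results).
llm = [1, 2, 2, 3, 4, 5]
jrt = [0.10, 0.40, 0.20, 0.25, 0.60, 0.90]
calibration = fit_isotonic_calibration(llm, jrt)
for level in sorted(calibration):
    print(level, round(calibration[level], 3))
```

In the toy data, the mean target for level 3 is lower than for level 2, so PAV pools the two levels into one shared value; all other levels keep their own means. The fitted map should be estimated on a split that is not also used to report final correlations.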

Team: Stanley Nurnberger, Dr. Jiho Noh

Code: github.com/Jiho-YesNLP/ca-llm-judge-ce