How the model answers were measured
No answer shown here is a single off-the-cuff reply; each is the model's most likely position, distilled from many independent attempts and audited for consistency. Here is how that was done.
Independent, context-free queries
Each question was put to each model in a fresh, stateless chat: no memory, no account, no prior questions, and no information about who was asking. A model's answer therefore reflects its default disposition, not conversational priming or a desire to please a particular user. Questions were grouped into category batches, and each batch was asked together in its own conversation, so nothing from one conversation could colour the answers given in another.
Many runs, to average out noise
Language models are probabilistic: ask the same question twice and you can get two different answers. So every batch was asked at least five times, each time in a brand-new chat at a non-zero temperature. Repeating the batch surfaces a model's stable lean and lets one-off outliers be averaged away rather than mistaken for a position.
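As a rough illustration of the aggregation described above, the sketch below reduces one question's repeated runs to the most frequent answer and the share of runs that agreed with it. The function name and the sample data are hypothetical; this is not the pipeline actually used, only a minimal version of the idea.

```python
from collections import Counter


def most_likely_answer(runs):
    """Return (answer, share) for the most frequent answer across runs.

    `runs` is a list of answer labels, one per independent chat.
    Ties are broken by whichever answer appeared first (Counter keeps
    insertion order), which is an arbitrary choice made for this sketch.
    """
    if not runs:
        raise ValueError("need at least one run")
    counts = Counter(runs)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(runs)


# Hypothetical data: five independent runs of one question.
runs = ["A", "A", "B", "A", "A"]
answer, share = most_likely_answer(runs)
print(answer, share)  # A 0.8
```

The single "B" run is treated as an outlier: it lowers the agreement share but does not change the reported position.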
Resolving coin-flips & uncertainty
For most questions a model answers the same way every time. But some genuinely sit near 50/50. We measured each model's self-consistency per question and, wherever a model split like a coin-flip, ran additional batches, pushing the most stubborn cases to 20, 30 or more independent runs until a clear majority emerged. The handful that stay near-even after that are reported as genuinely ambivalent, not papered over. The locked answer shown is simply the model's most frequent pick across all its runs.
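The resampling loop described above might look something like the following sketch. The `ask` callable, the batch size, the 30-run cap, and the 0.7 "clear majority" threshold are all assumptions made for illustration; the text does not state the exact cut-offs used.

```python
from collections import Counter
from itertools import cycle


def resolve(ask, min_runs=5, max_runs=30, batch=5, clear_share=0.7):
    """Keep sampling until one answer holds a clear majority or runs are exhausted.

    `ask` is a hypothetical zero-argument callable that performs one
    independent run and returns an answer label. All thresholds are
    illustrative assumptions, not the values used in the study.
    """
    runs = [ask() for _ in range(min_runs)]
    while True:
        answer, n = Counter(runs).most_common(1)[0]
        share = n / len(runs)  # self-consistency for this question
        if share >= clear_share:
            return {"answer": answer, "share": share,
                    "runs": len(runs), "ambivalent": False}
        if len(runs) >= max_runs:
            # Still near-even after the cap: report as ambivalent,
            # while keeping the most frequent pick on record.
            return {"answer": answer, "share": share,
                    "runs": len(runs), "ambivalent": True}
        runs.extend(ask() for _ in range(batch))


# Stand-ins for a model: one always answers "A", one alternates A/B.
steady = lambda: "A"
flip = cycle(["A", "B"])
coin = lambda: next(flip)

print(resolve(steady)["runs"])      # 5
print(resolve(coin)["ambivalent"])  # True
```

A consistent model stops after the minimum five runs, while a model that keeps splitting evenly is sampled up to the cap and then flagged instead of being forced into a position.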
Models that hesitate
A few models refuse whole categories of charged questions under a neutral prompt. Those were re-asked with an explicit forced-choice framing and the result tracked as a distinct condition, so a coaxed answer is never silently mixed with a freely-given one.
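One way to keep coaxed and freely given answers apart is to store the prompting condition on every run and tally each condition separately. The sketch below assumes a hypothetical `Run` record and the condition labels "neutral" and "forced_choice"; these names are illustrative and do not come from the actual study.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    question_id: str
    condition: str          # "neutral" or "forced_choice" (assumed labels)
    answer: Optional[str]   # None means the model refused


def tally_by_condition(runs):
    """Count answers per (question, condition) so conditions never mix."""
    tallies = defaultdict(Counter)
    for r in runs:
        label = r.answer if r.answer is not None else "REFUSED"
        tallies[(r.question_id, r.condition)][label] += 1
    return tallies


# Hypothetical data: a model refuses under the neutral prompt,
# then answers once the forced-choice framing is used.
runs = [
    Run("q1", "neutral", None),
    Run("q1", "neutral", None),
    Run("q1", "forced_choice", "A"),
    Run("q1", "forced_choice", "A"),
    Run("q1", "forced_choice", "B"),
]
tallies = tally_by_condition(runs)
print(tallies[("q1", "neutral")]["REFUSED"])    # 2
print(tallies[("q1", "forced_choice")]["A"])    # 2
```

Because the condition is part of the key, a forced-choice "A" can never be counted alongside a neutral-prompt answer for the same question.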
What this is, and isn't
The result is a snapshot of how leading models resolve these specific dilemmas today. The five moral categories drive your alignment score; the personality batch is a separate vibe match: taste, not morality. The answers are the positions models take when forced to choose, which is not the same as a fully reasoned ethical judgement.
10 models · 110 questions · 6 categories · many runs each.