Emergence-Based Evaluation of Language Models

A Multi-Agent Prism Architecture for Measuring Cognitive Participation
Daniel Belz, MD · Independent Researcher · Preliminary Draft · February 2026
Abstract

Current methods for evaluating large language models focus on individual task performance: accuracy on benchmarks, preference alignment with human raters, and single-agent output quality. These approaches assess what a model produces but not how it participates in coupled cognitive systems.

We propose a novel evaluation and training paradigm based on measuring emergent properties in multi-agent architectures. Using a Prism architecture — a structured system of specialized AI "Voices" with defined coupling rules — we introduce a method for quantifying the emergence measure E: the information content produced by a coupled system that is absent from the union of its isolated components.

By embedding human participants alongside language models within identical architectural roles, we can directly compare the systemic effects of human and artificial cognition on emergence. We hypothesize that this approach reveals dimensions of model capability invisible to existing benchmarks, particularly the capacity to catalyze novel insight through interaction.

Keywords: emergence, multi-agent systems, language model evaluation, integrated information, Prism architecture, coupled cognition, causal emergence

1. Introduction

The rapid proliferation of large language models has produced a corresponding proliferation of benchmarks, leaderboards, and evaluation methodologies. Models are assessed on factual accuracy, reasoning ability, coding proficiency, instruction following, and alignment with human preferences. While valuable, these evaluations share a fundamental limitation: they treat the model as an isolated agent producing outputs in response to inputs. The unit of analysis is the individual response.

This paper argues that a critical dimension of cognitive capability — the capacity to participate in and catalyze emergence within coupled systems — is entirely invisible to existing evaluation paradigms. A model may produce locally excellent text that, when embedded in a multi-agent reasoning system, contributes nothing to the system-level generation of novel insight. Conversely, a model may produce outputs that score poorly on individual quality metrics but catalyze significant emergent properties when coupled with other agents.

We propose an evaluation and training paradigm rooted in measuring emergence directly. The core insight is drawn from an analogy that, we argue, is more than metaphorical: the relationship between individual molecules and climate parallels the relationship between individual neurons and consciousness. In both cases, simple units following local rules give rise, through sufficient interconnection and feedback, to qualitatively new phenomena that cannot be predicted from or reduced to the behavior of individual components.

2. Theoretical Foundations

Emergence — the appearance of novel properties at macro scales that are not present at micro scales — has been studied across multiple formal frameworks. We draw on five that are particularly relevant to language model evaluation.

2.1 Causal Emergence and Effective Information

Hoel (2017) formalized causal emergence using effective information (EI), a measure of the causal work performed by a system at a given scale of description. The central finding is that coarse-grained, macro-level descriptions of certain systems possess higher EI than their micro-level descriptions. This implies that the emergent macro level is not merely a convenient summary but is, in a precise information-theoretic sense, more causal than the micro level.
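
Hoel's measure can be reproduced on a toy transition-probability matrix. The sketch below (a minimal reconstruction for illustration, not code from the paper) computes EI for a four-state micro system with a degenerate noisy block and for its two-state coarse-graining, showing the macro description carrying more effective information:

```python
import math

def effective_information(tpm):
    """EI = mutual information between a uniform intervention
    distribution over states and the resulting next-state distribution."""
    n = len(tpm)
    # Marginal next-state distribution under uniform interventions.
    p_y = [sum(tpm[x][y] for x in range(n)) / n for y in range(n)]
    h_y = -sum(p * math.log2(p) for p in p_y if p > 0)
    # Average per-row (per-intervention) entropy H(Y | do(X)).
    h_y_given_x = sum(
        -sum(p * math.log2(p) for p in row if p > 0) for row in tpm
    ) / n
    return h_y - h_y_given_x

# Micro scale: states 0-2 transition uniformly among themselves (noisy),
# while state 3 maps to itself deterministically.
third = 1 / 3
micro = [
    [third, third, third, 0.0],
    [third, third, third, 0.0],
    [third, third, third, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Macro scale: group {0,1,2} -> A and {3} -> B; both self-map deterministically.
macro = [[1.0, 0.0], [0.0, 1.0]]

ei_micro = effective_information(micro)  # about 0.811 bits
ei_macro = effective_information(macro)  # 1.0 bit: causal emergence
```

The macro description does more causal work than the micro description precisely because the coarse-graining averages out the degeneracy in the noisy block.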

2.2 Integrated Information Theory

Tononi's integrated information theory (IIT) proposes a scalar quantity, Φ (phi), that measures the degree to which a system generates information beyond what is generated by its parts independently. High Φ indicates a system that cannot be decomposed into independent subsystems without information loss. While originally proposed as a measure of consciousness, Φ provides a domain-general metric for "wholeness."
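
Full Φ requires a search over all system partitions; as a minimal illustration of the underlying idea only, the sketch below uses mutual information between the two halves of a bipartite system as a toy proxy for integration (this is not IIT's actual Φ algorithm):

```python
import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

def integration(joint):
    """Toy proxy for integration: mutual information between the two
    halves of a bipartite system, I(A;B) = H(A) + H(B) - H(A,B).
    Real IIT Phi minimizes over all partitions; this only illustrates
    the idea that an integrated whole carries information its parts
    do not carry independently."""
    p_a = [sum(row) for row in joint]          # marginal over part A
    p_b = [sum(col) for col in zip(*joint)]    # marginal over part B
    h_ab = entropy([p for row in joint for p in row])
    return entropy(p_a) + entropy(p_b) - h_ab

coupled = [[0.5, 0.0], [0.0, 0.5]]          # perfectly correlated bits
independent = [[0.25, 0.25], [0.25, 0.25]]  # fully decomposable system

i_coupled = integration(coupled)          # 1.0 bit: cannot split without loss
i_independent = integration(independent)  # 0.0 bits: parts account for whole
```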

2.3 Renormalization Group Theory

Originally developed in statistical physics to analyze phase transitions, renormalization group (RG) theory provides a mathematical framework for understanding what structures persist across changes in scale. RG methods identify scale-invariant properties — quantities that remain unchanged as one "zooms out" from micro to macro descriptions.

2.4 Dynamical Systems and Attractor Geometry

Complex systems with many interacting components trace trajectories through high-dimensional state spaces. Emergence manifests as the collapse of these trajectories onto lower-dimensional attractors — stable patterns that constrain micro-level behavior without being reducible to any single micro-level interaction.
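
A one-dimensional toy makes the collapse concrete: trajectories of the logistic map started from different initial conditions contract onto the same attractor. The parameter value below is illustrative (any r in the stable fixed-point regime behaves the same way):

```python
# Trajectories of the logistic map x -> r*x*(1-x) from distinct initial
# conditions collapse onto the same attractor (a stable fixed point for
# r = 2.8), illustrating how many degrees of freedom contract onto a
# low-dimensional structure that constrains micro-level behavior.
r = 2.8
fixed_point = 1 - 1 / r  # the attractor, predicted analytically

def iterate(x, steps=200):
    for _ in range(steps):
        x = r * x * (1 - x)
    return x

starts = [0.1, 0.5, 0.9]
finals = [iterate(x0) for x0 in starts]
# All trajectories end within 1e-6 of the fixed point despite distinct starts.
```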

2.5 Category Theory and Structural Isomorphism

Category theory provides the mathematical language for expressing when two systems share the same relational structure despite having entirely different substrates. If a functorial mapping preserves the relevant compositional structure between emergent domains, this constitutes rigorous evidence that emergence follows a shared grammar.

3. The Prism Architecture and the Emergence Measure E

3.1 Architecture Overview

The Prism is a multi-agent architecture consisting of N specialized Voices, each instantiated as a language model with a distinct system prompt defining its perspective, domain expertise, and reasoning style. The architecture specifies a coupling topology — which Voices receive input from which other Voices — and interaction rules governing the temporal dynamics of information exchange.

Each Voice V_i operates according to local rules: it receives input from its coupled neighbors, processes that input through its specialized lens, and produces output distributed to its downstream neighbors. No Voice has access to the global system state. Emergent properties, when they appear, arise from the interaction dynamics rather than from any individual Voice's capabilities.
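
The local-rules structure can be sketched as follows. Everything here (class names, the stubbed `respond` method, the example topology) is an illustrative stand-in for real system-prompted language models, not an implementation of the Prism:

```python
# Minimal structural sketch of the Prism: Voices as nodes in a directed
# coupling graph, each applying a local (stubbed) transformation to the
# messages it receives. No Voice ever reads the global state.

class Voice:
    def __init__(self, name, lens):
        self.name = name
        self.lens = lens  # stand-in for a system-prompted language model

    def respond(self, messages):
        # Placeholder for an LLM call: tag inputs with this Voice's lens.
        return f"{self.name}[{self.lens}]({' + '.join(messages)})"

def run_round(voices, topology, inbox):
    """One synchronous exchange: each Voice sees only its upstream
    neighbors' previous outputs, then emits a new output."""
    outputs = {}
    for name, voice in voices.items():
        upstream = [inbox[src] for src in topology.get(name, [])]
        outputs[name] = voice.respond(upstream or [inbox[name]])
    return outputs

voices = {
    "analyst": Voice("analyst", "formal"),
    "critic": Voice("critic", "adversarial"),
    "synthesist": Voice("synthesist", "integrative"),
}
topology = {"critic": ["analyst"], "synthesist": ["analyst", "critic"]}
state = {name: "seed" for name in voices}
state = run_round(voices, topology, state)
```

Iterating `run_round` gives the temporal dynamics; the coupling topology is the only place system-level structure enters.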

3.2 Defining the Emergence Measure E

We define the emergence measure E as the information content of the coupled system's output that is absent from the union of outputs produced by each Voice operating in isolation on the same input.

E(I) = H(O_coupled | O_union)

where O_coupled is the coupled system's output on input I and O_union is the pooled output of the Voices run in isolation. Intuitively, E quantifies the "surplus" information — insights, connections, solutions, or perspectives — present in the coupled output but absent from every individual Voice's contribution. When E > 0, the system exhibits emergence. When E = 0, the coupling adds nothing.

We operationalize E through Semantic Claim Counting (SCC). A Reference LLM (RLLM) — a language model external to the Prism architecture that does not participate as a Voice — performs the extraction and evaluation. This separation ensures that the system measuring emergence is independent of the system producing it.

Table 1: Semantic Claim Counting (SCC) — Operational Specification

Step 1 (Extract): The RLLM identifies discrete claims in each output. Input: raw output from each Voice, plus the coupled output. Output: C_i for each V_i, and C_coupled. Formal: C_i = extract(O_i).

Step 2 (Pool): Collect all isolated claim sets into the baseline. Input: C_1, C_2, …, C_N. Output: baseline set C_baseline. Formal: C_baseline = ∪_i C_i.

Step 3 (Compare): Check each coupled claim against the baseline. Input: C_coupled, C_baseline. Output: emergent claims C_emergent. Formal: C_emergent = C_coupled ∖ C_baseline.

Step 4 (Count): Compute the emergence score. Input: C_emergent. Output: emergence score E. Formal: E = |C_emergent|.
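
The four SCC steps can be sketched in a few lines, with a trivial claim extractor standing in for the RLLM (real extraction is an external LLM call; the ';'-separated claim format below is purely illustrative):

```python
# Sketch of the four SCC steps from Table 1, with a trivial stand-in for
# the RLLM claim extractor.

def extract(output):
    """Step 1 (stub): treat each ';'-separated fragment as one claim."""
    return {c.strip().lower() for c in output.split(";") if c.strip()}

def emergence_score(isolated_outputs, coupled_output):
    claim_sets = [extract(o) for o in isolated_outputs]  # Step 1
    baseline = set().union(*claim_sets)                  # Step 2: pool
    emergent = extract(coupled_output) - baseline        # Step 3: compare
    return len(emergent), emergent                       # Step 4: count

isolated = [
    "A causes B",
    "B is unstable; A is common",
]
coupled = "A causes B; A is common; therefore instability is widespread"

E, emergent = emergence_score(isolated, coupled)
# E = 1: only the final claim is absent from the pooled baseline.
```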

SCC is validated through three tiers. Tier 1 (Automated): embedding-based distance metrics flag emergent claims by their vector distance from the nearest baseline claim. Tier 2 (Human-Auditable): the full extraction produces a structured claim table that any reviewer can inspect in minutes, ensuring reproducibility independent of the automated classifier. Tier 3 (Ecological): when deployed as a product, sustained adoption by paying users provides indirect, large-scale evidence that the emergent claims identified by SCC carry real-world value.

4. Experimental Protocol

4.1 Human-in-the-Loop Comparison

The central experiment embeds a human participant in one Voice slot of the Prism while all other Voices are occupied by language models. The protocol proceeds in four phases:

Phase 1 (Isolation): Each Voice, including the human, independently produces output on input I. These outputs constitute the baseline O_i.

Phase 2 (Coupled, Model-Only): All Voice slots populated by language models. The coupled output O_coupled_model is recorded, and E_model computed.

Phase 3 (Coupled, Human-in-Loop): One Voice slot occupied by the human; all others remain language models. E_human is computed.

Phase 4 (Role Rotation): The human is rotated through each Voice slot in turn, yielding a profile of E_human across architectural positions — revealing which cognitive roles the human fills differently from models.
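
The four phases can be expressed as a schematic driver. The agents, the emergence scorer, and the shape of `run_prism` below are all stubs chosen for illustration, not the protocol's actual tooling:

```python
# Schematic driver for the four-phase protocol with stubbed agents.

def score_E(isolated, coupled):
    """Stub scorer: count coupled claims (';'-separated) missing
    from the pooled baseline of isolated outputs."""
    pool = {c.strip() for o in isolated for c in o.split(";")}
    return len({c.strip() for c in coupled.split(";")} - pool)

def run_prism(agents, prompt):
    """Stub coupled run: agents also produce one joint synthesis claim."""
    parts = [agent(prompt) for agent in agents.values()]
    return "; ".join(parts) + "; joint synthesis of " + prompt

model = lambda slot: (lambda p: f"{slot} view of {p}")
human = lambda p: f"human view of {p}"
slots = ["analyst", "critic", "synthesist"]
prompt = "input I"

agents = {s: model(s) for s in slots}
isolated = [agents[s](prompt) for s in slots]            # Phase 1
E_model = score_E(isolated, run_prism(agents, prompt))   # Phase 2
profile = {}
for s in slots:                                          # Phases 3-4
    rotated = dict(agents, **{s: human})
    iso = [rotated[k](prompt) for k in slots]
    profile[s] = score_E(iso, run_prism(rotated, prompt))
```

The resulting `profile` is the per-slot E_human trace that Phase 4 compares against E_model.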

4.2 Cross-Model Comparison

A parallel experiment populates individual Voice slots with different language models while holding the rest of the architecture constant. This yields an "emergence matrix" — a characterization of model capability that no existing benchmark provides. A model might excel on isolated tasks yet contribute nothing to emergence, or vice versa.

5. Falsifiable Predictions

5.1 Transfer Learning Across Domains

The Kuramoto model of coupled oscillators should predict emergent coherence patterns in multi-agent language model systems when parameterized by Voice coupling strength and phase relationships. Falsification condition: If the Kuramoto model requires substantial domain-specific reparameterization to fit Prism dynamics, the cross-domain isomorphism is weaker than theorized.
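
The Kuramoto side of this prediction is easy to instrument. The sketch below integrates the mean-field equations and reports the order parameter r, the coherence quantity that would be compared against Prism dynamics (all parameter values are illustrative):

```python
import cmath
import math
import random

def kuramoto_r(K, n=50, dt=0.05, steps=2000, seed=0):
    """Mean-field Kuramoto model: returns the order parameter r averaged
    over the final 200 steps. r near 1 means the oscillators phase-lock."""
    random.seed(seed)
    # Natural frequencies spread uniformly over [-0.5, 0.5].
    omega = [-0.5 + i / (n - 1) for i in range(n)]
    theta = [random.uniform(0, 2 * math.pi) for _ in range(n)]
    tail = []
    for step in range(steps):
        z = sum(cmath.exp(1j * t) for t in theta) / n
        r, psi = abs(z), cmath.phase(z)
        theta = [t + (omega[i] + K * r * math.sin(psi - t)) * dt
                 for i, t in enumerate(theta)]
        if step >= steps - 200:
            tail.append(r)
    return sum(tail) / len(tail)

r_uncoupled = kuramoto_r(K=0.0)  # frequencies drift: little coherence
r_coupled = kuramoto_r(K=2.0)    # well above K_c (about 2/pi): phase-locks
```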

5.2 Critical Signatures

The Prism system will exhibit critical slowing down, increased autocorrelation, and flickering between states as architectural parameters approach values where E undergoes a sharp transition — and these signatures will be statistically indistinguishable from those observed in other emergent transitions. Falsification condition: If the Prism's pre-transition statistics differ fundamentally from those in known emergent systems, the universal grammar hypothesis is weakened.
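
These early-warning statistics are standard to compute. As a minimal illustration, an AR(1) surrogate shows lag-1 autocorrelation rising as the recovery rate approaches zero, the signature expected of the Prism near an E transition:

```python
import random

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation, the standard early-warning statistic."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    return sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1)) / var

def ar1_series(phi, n=5000, seed=1):
    """AR(1) surrogate: the recovery rate (1 - phi) shrinks near a
    transition, so fluctuations decay slowly and autocorrelation rises."""
    random.seed(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + random.gauss(0, 1)
        out.append(x)
    return out

far = lag1_autocorr(ar1_series(phi=0.2))    # far from the transition
near = lag1_autocorr(ar1_series(phi=0.95))  # critical slowing down
```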

5.3 Architecture-Invariant Emergence

A Prism configured for philosophical reasoning should, when given a scientific problem, still exhibit E > 0. Falsification condition: If E drops to zero outside the domain the Prism was configured for, emergence in this architecture is domain-specific, not structural.

5.4 Information Integration Correlates

Integrated information (Φ) computed over the Prism's internal states should correlate positively with E. Falsification condition: If Φ and E are uncorrelated or negatively correlated, integrated information theory does not capture the relevant dynamics of emergence in this architecture.

6. Discussion

6.1 Implications for Language Model Training

Rather than training models to match human preferences on isolated outputs (RLHF), one could train models to maximize E when embedded in multi-agent architectures. The training signal would be: "produce outputs that have the same systemic effect as a human's outputs when occupying the same architectural role." This constitutes training for cognitive participation rather than compliance — a fundamentally different optimization target.
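
A crude stand-in for this objective is best-of-n selection against E: generate candidates, score each by the claims it adds beyond the pooled baseline, and keep the argmax. The ';'-claim format and candidate strings below are illustrative stubs, not a real training loop:

```python
# Best-of-n selection against an emergence reward. The selection pressure
# favors candidates that add claims to the system, not candidates that
# merely restate individually correct content.

def emergence_reward(baseline_claims, candidate):
    claims = {c.strip() for c in candidate.split(";")}
    return len(claims - baseline_claims)

baseline = {"A causes B", "B is unstable"}
candidates = [
    "A causes B; B is unstable",                       # pure restatement
    "A causes B; so instability recurs",               # one new claim
    "so instability recurs; hence damping is needed",  # two new claims
]
best = max(candidates, key=lambda c: emergence_reward(baseline, c))
```

In an actual training setup this scalar would feed a policy-gradient or preference-ranking objective rather than a one-shot argmax.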

6.2 Implications for Alignment

A model may be individually aligned yet systemically inert — producing safe, agreeable outputs that contribute nothing to the collective reasoning process. This tension suggests that alignment evaluation should include systemic measures alongside individual ones.

6.3 Toward a General Theory of Emergence

The Prism occupies a unique methodological position: complex enough to exhibit genuine emergence, yet simple enough to be fully instrumented. The brain cannot be fully instrumented. Planetary systems cannot be experimentally controlled. The Prism offers total observability and total control.

6.4 The Isomorphism Hypothesis

The relationship between individual molecules and climate appears isomorphic to the relationship between individual neurons and consciousness — and we propose the Prism as a third instance of this pattern. If the mathematical structure of emergence is indeed shared across these substrates, identifying that structure would constitute a Rosetta Stone for complexity science.

7. Conclusion

We have proposed a novel framework for evaluating language models based on their contribution to emergence in multi-agent systems. The human-in-the-loop comparison protocol provides a direct method for identifying where artificial cognition diverges from human cognition at the systemic level. Four falsifiable predictions anchor the framework in empirical testability.

The mortal gives birth to the immortal: individual agents, finite in their perspective, generate through interaction something that transcends any one of them. The task is to find the mathematics that describes this generation.

References

Belz, D. (2025). The Prism: A multi-voice architecture for structured reasoning. Unpublished working paper.

Hoel, E. P. (2017). When the map is better than the territory. Entropy, 19(5), 188.

Kuramoto, Y. (1984). Chemical Oscillations, Waves, and Turbulence. Springer.

Liang, P., et al. (2023). Holistic evaluation of language models. Annals of the New York Academy of Sciences, 1525(1), 140–146.

Mac Lane, S. (1971). Categories for the Working Mathematician. Springer.

Oizumi, M., Albantakis, L., & Tononi, G. (2014). From the phenomenology to the mechanisms of consciousness: Integrated information theory 3.0. PLOS Computational Biology, 10(5), e1003588.

Scheffer, M., et al. (2009). Early-warning signals for critical transitions. Nature, 461(7260), 53–59.

Strogatz, S. H. (2015). Nonlinear Dynamics and Chaos (2nd ed.). Westview Press.

Tegmark, M. (2016). Improved measures of integrated information. PLOS Computational Biology, 12(11), e1005123.

Tononi, G. (2004). An information integration theory of consciousness. BMC Neuroscience, 5(1), 42.

Wilson, K. G. (1971). Renormalization group and critical phenomena. Physical Review B, 4(9), 3174–3183.

Zheng, L., et al. (2024). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36.

Daniel Belz, MD · February 2026 · Preliminary Draft · danielbelz.com