Why a Multi-Component Critic? The Problem with “Oracles”
Many AI systems rely on opaque, monolithic “oracle” models for evaluation. These models produce a score but offer no insight into their reasoning, making them difficult to debug, trust, or adapt. If an oracle gives a low score, is it because the input was factually wrong, illogical, or simply unoriginal? It’s impossible to know.
CNS 2.0 explicitly rejects this “black box” approach. Instead, it decomposes evaluation into a transparent, auditable pipeline of specialized critics. This design choice is fundamental and provides several key advantages:
- Transparency & Debuggability: By separating evaluation into components—Grounding, Logic, and Novelty—we can pinpoint the exact strengths and weaknesses of a narrative. A low score from the
LogicCritictells us to examine the argument’s structure, while a low score from theGroundingCriticpoints to a lack of evidence. - Adaptability: The system’s “values” can be dynamically adjusted. By changing the weights assigned to each critic, we can shift the system’s focus. In an exploratory phase, we might prioritize novelty. In a verification phase, we would prioritize grounding and logic.
- Explainability: The final
Trust Scoreis not a mystery. It can be explained as a weighted combination of clear, understandable criteria, making the entire system more trustworthy and interpretable.
The Mathematical Foundation: Weighted Averaging
The final Trust Score emerges from a weighted combination of the individual critic scores, as defined by Equation (1) in Section 2.2 of the paper. This formula is the heart of the pipeline’s adaptability.
From the Paper (Equation 1):
$$\text{Reward}(\mathcal{S}) = \sum_{i \in \{G, L, N\}} w_i \cdot \text{Score}_i(\mathcal{S})$$where $w_i$ are dynamically adjustable weights for the Grounding, Logic, and Novelty-Parsimony critics.
Our CriticPipeline class directly implements this formula. It iterates through each registered critic, calculates its score, applies the corresponding weight $w_i$, and normalizes the result to produce the final Trust Score.
Implementing the Critic Infrastructure
First, we define the basic infrastructure: a BaseCritic abstract class to ensure all critics have a standard interface, a CriticResult dataclass for structured and explainable output, and the CriticPipeline orchestrator.
"""
Multi-Component Critic Pipeline Implementation
============================================
Transparent, auditable evaluation of SNO quality
"""
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple, Optional, Any
import numpy as np
from dataclasses import dataclass, field
from enum import Enum
# Assume StructuredNarrativeObject is available from Chapter 2 and HAS_TRANSFORMERS is defined
@dataclass
class CriticResult:
"""A structured result from a single critic evaluation, ensuring transparency."""
score: float
confidence: float
explanation: str
# evidence can store any data that supports the explanation, e.g., claim-level scores
evidence: Dict[str, Any] = field(default_factory=dict)
sub_scores: Dict[str, float] = field(default_factory=dict)
class CriticType(Enum):
GROUNDING = "grounding"
LOGIC = "logic"
NOVELTY = "novelty"
class BaseCritic(ABC):
"""Abstract base class for all CNS 2.0 critics, ensuring a consistent interface."""
def __init__(self, critic_type: CriticType, weight: float = 1.0):
self.critic_type = critic_type
self.weight = weight
self.evaluation_count = 0
self.performance_history = []
@abstractmethod
def evaluate(self, sno: StructuredNarrativeObject, context: Optional[Dict] = None) -> CriticResult:
"""The core method for any critic. Must be implemented by subclasses."""
pass
def update_weight(self, new_weight: float):
"""Allows for dynamic adjustment of the critic's importance in the pipeline."""
self.weight = new_weight
def get_statistics(self) -> Dict[str, Any]:
"""Provides performance metrics for monitoring."""
return {
'type': self.critic_type.value,
'weight': self.weight,
'evaluations': self.evaluation_count,
'avg_score': np.mean([r['score'] for r in self.performance_history]) if self.performance_history else 0.0,
}
class CriticPipeline:
"""Orchestrates multiple critics to produce a comprehensive SNO evaluation."""
def __init__(self):
self.critics: Dict[CriticType, BaseCritic] = {}
self.evaluation_history = []
def add_critic(self, critic: BaseCritic):
"""Registers a critic with the pipeline."""
self.critics[critic.critic_type] = critic
def evaluate_sno(self, sno: StructuredNarrativeObject, context: Optional[Dict] = None) -> Dict[str, Any]:
"""
Evaluates an SNO by running it through all registered critics and computing
the final weighted Trust Score, directly implementing the paper's Reward formula.
"""
results = {}
total_weighted_score = 0.0
total_weight = 0.0
for critic_type, critic in self.critics.items():
result = critic.evaluate(sno, context)
results[critic_type.value] = result
# This is the core of the formula: score * weight
total_weighted_score += result.score * critic.weight
total_weight += critic.weight
critic.performance_history.append({'score': result.score, 'confidence': result.confidence})
critic.evaluation_count += 1
# Normalize by the sum of weights to get the final score
trust_score = total_weighted_score / total_weight if total_weight > 0 else 0.0
sno.trust_score = trust_score
evaluation_result = {
'trust_score': trust_score,
'critic_results': {k: v.to_dict() for k, v in results.items()}, # Assuming CriticResult has to_dict
'weights_used': {ct.value: c.weight for ct, c in self.critics.items()},
}
self.evaluation_history.append(evaluation_result)
return evaluation_result
def adjust_weights(self, weight_updates: Dict[CriticType, float]):
"""Dynamically adjusts the weights of the critics."""
for critic_type, new_weight in weight_updates.items():
if critic_type in self.critics:
self.critics[critic_type].update_weight(new_weight)
1. Grounding Critic
The Grounding Critic ensures that narratives remain tethered to verifiable facts by evaluating how well claims are supported by the provided evidence.
From the Paper (Section 2.2):
$$ \text{Score}_G = \frac{1}{|V|}\sum_{v \in V} \max_{e \in \mathcal{E}} p(v|e) $$where $p(v|e)$ is the plausibility of a claim $v$ given evidence $e$, computed using a Natural Language Inference (NLI) model.
Formula Breakdown: Score_G
This formula calculates the average “best possible support” for all claims in a narrative. Let’s break it down from inside out:
p(v|e): This is the core judgment: “Given this piece of evidencee, how plausible is claimv?” We use a Natural Language Inference (NLI) model to calculate this, wherep(v|e)is the model’s confidence in the “entailment” relationship between the evidence (premise) and the claim (hypothesis).max_{e \in E}: For each individual claimv, we loop through all available evidence in the setEand find the single best piece of evidence that supports it. A claim only needs one strong piece of evidence to be considered well-supported.∑_{v \in V}: We sum up these “maximum plausibility” scores for every claimvin the reasoning graph’s vertex setV.1/|V|: Finally, we average the total score by dividing by the number of claims. This ensures that SNOs with many claims aren’t unfairly advantaged or disadvantaged.
class GroundingCritic(BaseCritic):
def __init__(self, weight: float, nli_model=None, nli_tokenizer=None, nli_model_name: str = "roberta-large-mnli"):
super().__init__(CriticType.GROUNDING, weight)
if nli_model and nli_tokenizer:
self.nli_model, self.nli_tokenizer = nli_model, nli_tokenizer
elif HAS_TRANSFORMERS:
import transformers
self.nli_tokenizer = transformers.AutoTokenizer.from_pretrained(nli_model_name)
self.nli_model = transformers.AutoModelForSequenceClassification.from_pretrained(nli_model_name)
else:
raise ImportError("Transformers library is required for the GroundingCritic.")
# Find the index for the 'entailment' label in the model's configuration
self.entailment_id = self.nli_model.config.label2id.get('entailment', 2)
def evaluate(self, sno: StructuredNarrativeObject, context: Optional[Dict] = None) -> CriticResult:
claims = [data['claim'] for _, data in sno.reasoning_graph.nodes(data=True)]
evidence_contents = [item.content for item in sno.evidence_set]
if not claims or not evidence_contents:
return CriticResult(0.0, 1.0, "SNO has no claims or no evidence to evaluate.", {}, {})
total_max_plausibility, sub_scores = 0.0, {}
# This outer loop corresponds to the Σ[v ∈ V] part of the formula
for claim in claims:
# Prepare (evidence, claim) pairs to calculate p(v|e) for all e ∈ E at once
pairs = [(e, claim.content) for e in evidence_contents]
inputs = self.nli_tokenizer(pairs, return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
logits = self.nli_model(**inputs).logits
probabilities = torch.softmax(logits, dim=1)
entailment_probs = probabilities[:, self.entailment_id].tolist()
# This corresponds to the max[e ∈ E] p(v|e) part of the formula
max_plausibility_for_claim = max(entailment_probs) if entailment_probs else 0.0
total_max_plausibility += max_plausibility_for_claim
sub_scores[claim.claim_id] = max_plausibility_for_claim
# This corresponds to the (1/|V|) * Σ[...] part of the formula
final_score = total_max_plausibility / len(claims) if claims else 0.0
return CriticResult(
score=final_score, confidence=0.8,
explanation=f"Average max NLI entailment score across {len(claims)} claims is {final_score:.3f}.",
evidence={'claim_scores': sub_scores}, sub_scores=sub_scores
)
2. Logic Critic
The Logic Critic assesses the structural coherence of the reasoning graph $G$. A narrative can have well-grounded claims but still be logically flawed.
From the Paper (Section 2.2): The ideal Logic Score is produced by a Graph Neural Network (GNN) trained to detect logical weaknesses:
$$ \text{Score}_L = f_{\text{GNN}}(G; \theta) $$Training a full GNN is a major research project. For our implementation, we create a functional heuristic proxy for $f_{\text{GNN}}$ that uses graph-theoretic metrics to approximate logical coherence.
For a deep-dive into the state-of-the-art approach, see the research project on GNNs for Logical Reasoning.
Score_L (Heuristic Proxy)
Our heuristic-based LogicCritic uses a weighted average of three metrics to approximate what a trained GNN would learn:
- Orphan Score (Penalty for unsupported claims): Checks for claims that are not supported by any other claim. A high number of orphans suggests a collection of disconnected assertions, not a coherent argument.
- Coherence Score (Penalty for unfocused claims): Penalizes claims that are used to support too many other, potentially unrelated, points.
- Parsimony Score (Penalty for complexity): Rewards simplicity (Occam’s Razor) by penalizing overly dense, “spaghetti-like” argument graphs.
class LogicCritic(BaseCritic):
def __init__(self, weight: float):
super().__init__(CriticType.LOGIC, weight)
def evaluate(self, sno: StructuredNarrativeObject, context: Optional[Dict] = None) -> CriticResult:
G = sno.reasoning_graph
num_nodes = G.number_of_nodes()
if num_nodes <= 1:
return CriticResult(1.0, 1.0, "Graph is too simple to assess logic.", {}, {})
# Heuristic 1: Penalize orphaned claims (unsupported assertions)
# An orphan is a node with no incoming edges, excluding the root hypothesis.
orphaned_nodes = [n for n, d in G.in_degree() if d == 0 and n != 'root']
orphan_penalty = len(orphaned_nodes) / (num_nodes - 1) if num_nodes > 1 else 0
orphan_score = 1.0 - orphan_penalty
# Heuristic 2: Penalize unfocused claims (a single claim supporting too many others)
avg_out_degree = sum(d for _, d in G.out_degree()) / num_nodes
# Penalize if the average claim supports more than 3 others. This is a simple heuristic.
coherence_score = max(0, 1.0 - (avg_out_degree / 3.0))
# Heuristic 3: Penalize complexity (convoluted, "spaghetti" arguments) using graph density
density = nx.density(G)
parsimony_score = 1.0 - density
# Our functional proxy for f_GNN is a weighted average of these heuristics.
# These weights are internal to the critic and can be tuned.
final_score = 0.5 * orphan_score + 0.3 * coherence_score + 0.2 * parsimony_score
sub_scores = {'orphan_score': orphan_score, 'coherence_score': coherence_score, 'parsimony_score': parsimony_score}
return CriticResult(
score=final_score, confidence=0.9,
explanation=f"Logic score based on graph structure heuristics: {final_score:.3f}",
evidence={'num_orphans': len(orphaned_nodes), 'avg_out_degree': avg_out_degree, 'density': density},
sub_scores=sub_scores
)
3. Novelty-Parsimony Critic
This critic balances two competing virtues: the desire for new ideas (novelty) and the principle of simplicity (parsimony), also known as Occam’s Razor.
From the Paper (Section 2.2):
$$ \text{Score}_N = \alpha \cdot \min_i \|H - H_i\|_2 - \beta \cdot \frac{|E_G|}{|V|} $$
Formula Breakdown: Score_N
This formula is a simple linear combination of a reward and a penalty:
α * min_i ||H - H_i||₂: This is the novelty reward.||H - H_i||₂: The Euclidean distance between the current SNO’s embedding (H) and the embedding of another SNO (H_i) in the population. A larger distance means the ideas are further apart, or more “novel.”min_i: We find the distance to the closest (most similar) SNO in the entire population. This measures how much of a leap the new idea is making from the most related existing idea.α: The alpha parameter is a weight that lets us control how much we care about novelty. A highαencourages more exploratory, “out-there” ideas.
- β * (|E_G| / |V|): This is the parsimony penalty.|E_G| / |V|: The ratio of edges to nodes in the reasoning graph. This is a simple measure of graph complexity or density. An argument with 10 claims and 30 relationships is more complex than one with 10 claims and 9 relationships.β: The beta parameter weights this penalty. A highβstrongly encourages simpler, more elegant arguments.
class NoveltyParsimonyCritic(BaseCritic):
def __init__(self, weight: float, alpha: float, beta: float):
super().__init__(CriticType.NOVELTY, weight)
self.alpha = alpha
self.beta = beta
def evaluate(self, sno: StructuredNarrativeObject, context: Optional[Dict] = None) -> CriticResult:
context = context or {}
sno_population = context.get('sno_population', [])
population_embeddings = [s.hypothesis_embedding for s in sno_population if s.sno_id != sno.sno_id and s.hypothesis_embedding is not None]
# --- Novelty Term Calculation ---
if not population_embeddings or sno.hypothesis_embedding is None:
# If this is the first SNO, it is maximally novel by definition.
novelty_score = 1.0
min_dist_str = "N/A (first SNO)"
else:
# Corresponds to the ||H - H_i||₂ part of the formula
distances = [np.linalg.norm(sno.hypothesis_embedding - h) for h in population_embeddings]
# Corresponds to the min_i part of the formula
min_distance = min(distances) if distances else 0
# Normalize the distance. Max possible distance for normalized vectors is 2.0.
novelty_score = min_distance / 2.0
min_dist_str = f"{min_distance:.3f}"
novelty_term = self.alpha * novelty_score
# --- Parsimony Term Calculation ---
G = sno.reasoning_graph
num_nodes = G.number_of_nodes()
# Corresponds to the |E_G|/|V| part of the formula
complexity_ratio = G.number_of_edges() / num_nodes if num_nodes > 0 else 0
# Normalize penalty (assuming max complexity ratio is around 5 for a reasonable argument graph)
parsimony_penalty = self.beta * min(1.0, complexity_ratio / 5.0)
# Combine terms and clamp the final score to the valid [0, 1] range.
raw_score = novelty_term - parsimony_penalty
final_score = np.clip(raw_score, 0, 1)
explanation = f"Score({final_score:.3f}) = α*Novelty({novelty_term:.3f}) - β*Parsimony({parsimony_penalty:.3f}). Min dist: {min_dist_str}."
return CriticResult(
score=final_score, confidence=0.9, explanation=explanation,
evidence={'novelty_term': novelty_term, 'parsimony_penalty': parsimony_penalty},
sub_scores={'novelty_score': novelty_score, 'complexity_ratio': complexity_ratio}
)
Roadmap to a GNN-based Logic Critic
The heuristic-based LogicCritic is a functional and transparent starting point. However, the research proposal correctly identifies that a Graph Neural Network (GNN) is the state-of-the-art solution.
Why a GNN is the Next Step: Hand-coded heuristics can only capture simple structural flaws. A GNN, in contrast, can learn subtle, complex, and non-local patterns of faulty reasoning directly from data. By training on a dataset of valid and fallacious argument graphs, a GNN can learn to identify sophisticated weaknesses like:
- Missing Warrants: Implicit logical leaps between claims.
- Fallacies of Relevance: Arguments where the support is only superficially related to the conclusion.
- Complex Circular Reasoning: Logical loops that span multiple nodes and are hard to detect with simple cycle checks.
A GNN-based critic moves from a “rules-based” system to a “learning-based” system, dramatically increasing the sophistication and accuracy of the logic evaluation.
Conceptual GNN Implementation (PyTorch & PyG):
Below is a conceptual skeleton of what a GNN-based LogicCritic might look like using PyTorch and the PyTorch Geometric (PyG) library, which is specialized for GNNs.
# You would need to install: pip install torch torch-geometric
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool
from torch_geometric.data import Data
class GNNLogicModel(torch.nn.Module):
"""A simple Graph Convolutional Network (GCN) for graph classification."""
def __init__(self, num_node_features, hidden_channels):
super().__init__()
self.conv1 = GCNConv(num_node_features, hidden_channels)
self.conv2 = GCNConv(hidden_channels, hidden_channels)
# A linear layer for the final graph-level classification
self.lin = torch.nn.Linear(hidden_channels, 1)
def forward(self, x, edge_index, batch):
# 1. Obtain node embeddings
x = self.conv1(x, edge_index).relu()
x = self.conv2(x, edge_index).relu()
# 2. Global Pooling: Aggregate node features to get a graph-level embedding
x = global_mean_pool(x, batch)
# 3. Apply a final classifier to get a single score for the graph
x = self.lin(x)
# Apply sigmoid to get a score between 0 and 1
return torch.sigmoid(x)
def convert_sno_to_graph_data(sno: StructuredNarrativeObject, embedding_model) -> Data:
"""Converts our NetworkX graph into a PyG Data object for the GNN."""
G = sno.reasoning_graph
# Create node features (e.g., from claim embeddings)
node_features = []
node_map = {node_id: i for i, node_id in enumerate(G.nodes())}
for node_id in G.nodes():
claim_content = G.nodes[node_id]['claim'].content
# In a real implementation, you'd use pre-computed embeddings
embedding = embedding_model.encode(claim_content)
node_features.append(embedding)
x = torch.tensor(np.array(node_features), dtype=torch.float)
# Create edge index
edge_list = [[node_map[u], node_map[v]] for u, v in G.edges()]
edge_index = torch.tensor(edge_list, dtype=torch.long).t().contiguous()
return Data(x=x, edge_index=edge_index)
# --- Conceptual Training Loop ---
# This would not run in the guide, but shows the process.
def train_gnn_critic(model, train_loader, optimizer, criterion):
model.train()
for data in train_loader: # train_loader yields batches of graph Data objects
optimizer.zero_grad()
out = model(data.x, data.edge_index, data.batch)
# `data.y` would be the ground-truth label (0 for fallacious, 1 for valid)
loss = criterion(out, data.y.unsqueeze(1).float())
loss.backward()
optimizer.step()
# --- The GNN-based Critic Class ---
class GNNLogicCritic(BaseCritic):
def __init__(self, weight: float, model_path: str, embedding_model):
super().__init__(CriticType.LOGIC, weight)
self.model = GNNLogicModel(num_node_features=768, hidden_channels=64) # Example dimensions
self.model.load_state_dict(torch.load(model_path))
self.model.eval() # Set model to evaluation mode
self.embedding_model = embedding_model
def evaluate(self, sno: StructuredNarrativeObject, context: Optional[Dict] = None) -> CriticResult:
graph_data = convert_sno_to_graph_data(sno, self.embedding_model)
with torch.no_grad():
score = self.model(graph_data.x, graph_data.edge_index, torch.zeros(graph_data.num_nodes, dtype=torch.long)).item()
return CriticResult(
score=score,
confidence=0.95, # Assuming a well-trained model
explanation=f"GNN-based logical coherence score: {score:.3f}"
)
This roadmap illustrates the clear, principled path from our initial heuristic-based critic to a much more powerful, learned system, which is a core theme of the CNS 2.0 research philosophy.
Contextual Evaluation: Dynamic Weight Adjustment
A key feature of CNS 2.0 is its adaptability. By adjusting the weights $w_i$ in the main reward formula, we can change the system’s “priorities” to suit different phases of knowledge discovery.
# --- Setup: Create a sample SNO and a pipeline ---
# This code assumes the classes from previous chapters are available.
# 1. Create a mock SNO. Let's imagine this is a very new, slightly underdeveloped idea.
# We will manually set the scores each critic *would* produce for demonstration.
class MockCritic(BaseCritic):
def __init__(self, critic_type, weight, mock_score):
super().__init__(critic_type, weight)
self.mock_score = mock_score
def evaluate(self, sno, context=None):
return CriticResult(score=self.mock_score, confidence=1.0, explanation="Mocked result")
# Our SNO is very novel (0.9) but has weak logic (0.4) and grounding (0.5)
pipeline = CriticPipeline()
pipeline.add_critic(MockCritic(CriticType.NOVELTY, 1.0, 0.9))
pipeline.add_critic(MockCritic(CriticType.LOGIC, 1.0, 0.4))
pipeline.add_critic(MockCritic(CriticType.GROUNDING, 1.0, 0.5))
sample_sno = StructuredNarrativeObject(central_hypothesis="A sample SNO for testing.")
# --- Phase 1: Exploration Mode ---
# We want to find new ideas, so we heavily weight novelty.
print("--- EVALUATING IN EXPLORATION MODE ---")
pipeline.adjust_weights({
CriticType.NOVELTY: 0.8, # High weight for new ideas
CriticType.LOGIC: 0.1, # Low weight for rigor
CriticType.GROUNDING: 0.1 # Low weight for rigor
})
exploration_result = pipeline.evaluate_sno(sample_sno)
print(f"Final Trust Score (Exploration): {exploration_result['trust_score']:.4f}\n")
# --- Phase 2: Verification Mode ---
# Now, we shift to rigorously checking our ideas.
print("--- EVALUATING IN VERIFICATION MODE ---")
pipeline.adjust_weights({
CriticType.NOVELTY: 0.1, # Low weight for novelty
CriticType.LOGIC: 0.45, # High weight for logical soundness
CriticType.GROUNDING: 0.45 # High weight for evidential support
})
verification_result = pipeline.evaluate_sno(sample_sno)
print(f"Final Trust Score (Verification): {verification_result['trust_score']:.4f}\n")
As the output shows, the same SNO is considered high-trust in exploration mode but fails the quality bar in verification mode. This ability to programmatically shift the system’s “values” is a practical tool for guiding the knowledge discovery process, making CNS 2.0 a powerful and flexible framework.
Try It Now: Evaluate an SNO with the Critic Pipeline
Goal: Build a working critic pipeline and evaluate the SNO from Chapter 2 in 10 minutes.
Prerequisites
- Completed Chapter 2 and created a complete SNO
- Virtual environment activated with all dependencies installed
Step 1: Save the Complete Critic Example
Note: This example includes simplified implementations of the critic classes for demonstration purposes. The
GroundingCriticuses basic heuristics (evidence-to-claims ratio) rather than the full NLI model described in the main chapter. TheLogicCriticuses NetworkX graph analysis rather than a trained GNN. This allows you to run the code immediately without training models, while understanding the core evaluation logic.
Create a file called evaluate_with_critics.py:
"""
Critic Pipeline Example: Evaluating SNO Quality
Demonstrates the multi-component critic pipeline evaluating an SNO.
"""
from sentence_transformers import SentenceTransformer
import networkx as nx
import numpy as np
from datetime import datetime
from dataclasses import dataclass
from typing import Optional, Set, Dict, Any, List
from enum import Enum
import uuid
import hashlib
print("="*70)
print("CNS 2.0 CRITIC PIPELINE DEMONSTRATION")
print("="*70)
# Step 1: Load model and recreate data structures from Chapter 2
print("\n[Step 1/5] Loading embedding model and data structures...")
model = SentenceTransformer('all-MiniLM-L6-v2')
class RelationType(Enum):
SUPPORTS = "supports"
CONTRADICTS = "contradicts"
IMPLIES = "implies"
WEAKENS = "weakens"
EXPLAINS = "explains"
@dataclass
class EvidenceItem:
content: str
source_id: str
doc_hash: Optional[str] = None
confidence: float = 1.0
def __post_init__(self):
if self.doc_hash is None:
self.doc_hash = hashlib.sha256(self.content.encode()).hexdigest()[:16]
def __hash__(self):
return hash(self.doc_hash)
def __eq__(self, other):
return isinstance(other, EvidenceItem) and self.doc_hash == other.doc_hash
@dataclass
class ClaimNode:
claim_id: str
content: str # Using 'content' to match main Chapter 2 definition
embedding: Optional[np.ndarray] = None
confidence: float = 1.0
@dataclass
class ReasoningEdge:
relation_type: RelationType
strength: float = 1.0
evidence_refs: Set[str] = None
class StructuredNarrativeObject:
def __init__(self, central_hypothesis: str, sno_id: Optional[str] = None):
self.sno_id = sno_id or str(uuid.uuid4())[:8]
self.central_hypothesis = central_hypothesis
self.hypothesis_embedding: Optional[np.ndarray] = None
self.reasoning_graph = nx.DiGraph()
self.evidence_set: Set[EvidenceItem] = set()
self.trust_score: Optional[float] = None
self.created_at = datetime.now()
self.metadata: Dict[str, Any] = {}
def compute_hypothesis_embedding(self, model):
self.hypothesis_embedding = model.encode(self.central_hypothesis)
return self.hypothesis_embedding
def add_claim(self, claim_id: str, content: str, confidence: float = 1.0):
claim = ClaimNode(claim_id=claim_id, content=content, confidence=confidence)
self.reasoning_graph.add_node(claim_id, claim=claim)
def add_reasoning_edge(self, source: str, target: str, relation: RelationType, strength: float = 1.0):
edge = ReasoningEdge(relation_type=relation, strength=strength)
self.reasoning_graph.add_edge(source, target, reasoning_edge=edge)
def add_evidence(self, content: str, source_id: str, confidence: float = 1.0):
evidence = EvidenceItem(content=content, source_id=source_id, confidence=confidence)
self.evidence_set.add(evidence)
return evidence.doc_hash
print("✓ Data structures ready")
# Step 2: Create a sample SNO (reusing Coffee example from Chapter 2)
print("\n[Step 2/5] Creating sample SNO...")
sno = StructuredNarrativeObject(
central_hypothesis="Coffee consumption improves programming productivity through enhanced cognitive performance"
)
# Build reasoning graph
sno.add_claim("c1", "Caffeine blocks adenosine receptors", 0.95)
sno.add_claim("c2", "Adenosine causes drowsiness", 0.95)
sno.add_claim("c3", "Blocking adenosine increases alertness", 0.90)
sno.add_claim("c4", "Alertness improves sustained attention", 0.85)
sno.add_claim("c5", "Sustained attention is critical for programming", 0.90)
sno.add_claim("c6", "Therefore, coffee improves programming productivity", 0.80)
sno.add_reasoning_edge("c1", "c3", RelationType.SUPPORTS, 0.9)
sno.add_reasoning_edge("c2", "c3", RelationType.SUPPORTS, 0.9)
sno.add_reasoning_edge("c3", "c4", RelationType.IMPLIES, 0.85)
sno.add_reasoning_edge("c4", "c5", RelationType.SUPPORTS, 0.85)
sno.add_reasoning_edge("c5", "c6", RelationType.IMPLIES, 0.80)
# Add evidence
sno.add_evidence(
"Caffeine is an adenosine receptor antagonist (Fredholm et al., 1999)",
"doi:10.1016/S0163-7258(99)00010-6",
0.95
)
sno.add_evidence(
"Adenosine accumulation promotes sleep pressure (Porkka-Heiskanen et al., 1997)",
"doi:10.1126/science.276.5316.1265",
0.95
)
sno.add_evidence(
"Caffeine improves sustained attention (Lieberman et al., 2002)",
"doi:10.1016/S0091-3057(01)00666-5",
0.90
)
sno.compute_hypothesis_embedding(model)
print(f"✓ Created SNO: {sno.sno_id}")
print(f" - {len(sno.reasoning_graph.nodes)} claims")
print(f" - {len(sno.reasoning_graph.edges)} reasoning edges")
print(f" - {len(sno.evidence_set)} evidence items")
# Step 3: Define Critic Classes
print("\n[Step 3/5] Defining critic pipeline components...")
class CriticType(Enum):
GROUNDING = "grounding"
LOGIC = "logic"
NOVELTY = "novelty"
@dataclass
class CriticResult:
score: float # 0.0 to 1.0
confidence: float
explanation: str
details: Dict[str, Any] = None
class BaseCritic:
def __init__(self, critic_type: CriticType, weight: float = 1.0):
self.critic_type = critic_type
self.weight = weight
self.eval_count = 0
def evaluate(self, sno: StructuredNarrativeObject, context: Optional[Dict] = None) -> CriticResult:
raise NotImplementedError
class GroundingCritic(BaseCritic):
"""Evaluates how well the SNO is supported by evidence"""
def __init__(self, weight: float = 1.0):
super().__init__(CriticType.GROUNDING, weight)
def evaluate(self, sno: StructuredNarrativeObject, context: Optional[Dict] = None) -> CriticResult:
self.eval_count += 1
# Simplified grounding check: ratio of claims to evidence
num_claims = len(sno.reasoning_graph.nodes)
num_evidence = len(sno.evidence_set)
if num_claims == 0:
return CriticResult(0.0, 1.0, "No claims to evaluate")
# Calculate evidence coverage ratio
evidence_ratio = min(1.0, num_evidence / num_claims)
# Average confidence of evidence
avg_confidence = np.mean([e.confidence for e in sno.evidence_set]) if sno.evidence_set else 0.0
# Combined score
score = 0.7 * evidence_ratio + 0.3 * avg_confidence
return CriticResult(
score=score,
confidence=0.85,
explanation=f"Evidence ratio: {evidence_ratio:.2f}, Avg confidence: {avg_confidence:.2f}",
details={"evidence_count": num_evidence, "claim_count": num_claims}
)
class LogicCritic(BaseCritic):
"""Evaluates the structural coherence of the reasoning graph"""
def __init__(self, weight: float = 1.0):
super().__init__(CriticType.LOGIC, weight)
def evaluate(self, sno: StructuredNarrativeObject, context: Optional[Dict] = None) -> CriticResult:
self.eval_count += 1
G = sno.reasoning_graph
if len(G.nodes) == 0:
return CriticResult(0.0, 1.0, "No reasoning graph")
# Check for cycles (DAG should have none)
has_cycle = not nx.is_directed_acyclic_graph(G)
cycle_penalty = 0.5 if has_cycle else 0.0
# Check connectivity (weakly connected is good)
is_connected = nx.is_weakly_connected(G) if len(G.nodes) > 1 else True
connectivity_score = 1.0 if is_connected else 0.5
# Check for orphaned nodes
orphans = [n for n in G.nodes if G.in_degree(n) == 0 and G.out_degree(n) == 0]
orphan_penalty = len(orphans) / len(G.nodes)
# Parsimony: penalize excessive complexity
avg_degree = sum(dict(G.degree()).values()) / len(G.nodes)
complexity_penalty = min(0.3, (avg_degree - 2) * 0.1) if avg_degree > 2 else 0.0
score = connectivity_score - cycle_penalty - orphan_penalty - complexity_penalty
score = max(0.0, min(1.0, score))
return CriticResult(
score=score,
confidence=0.90,
explanation=f"Connectivity: {connectivity_score:.2f}, Cycles: {has_cycle}, Orphans: {len(orphans)}",
details={
"is_dag": not has_cycle,
"is_connected": is_connected,
"orphan_count": len(orphans),
"avg_degree": avg_degree
}
)
class NoveltyParsimonyCritic(BaseCritic):
"""Evaluates novelty while penalizing excessive complexity"""
def __init__(self, weight: float = 1.0, existing_embeddings: List[np.ndarray] = None):
super().__init__(CriticType.NOVELTY, weight)
self.existing_embeddings = existing_embeddings or []
def evaluate(self, sno: StructuredNarrativeObject, context: Optional[Dict] = None) -> CriticResult:
self.eval_count += 1
if sno.hypothesis_embedding is None:
return CriticResult(0.0, 0.5, "No embedding computed")
# Novelty: minimum distance to existing SNOs
if self.existing_embeddings:
similarities = [
np.dot(sno.hypothesis_embedding, emb) /
(np.linalg.norm(sno.hypothesis_embedding) * np.linalg.norm(emb))
for emb in self.existing_embeddings
]
max_similarity = max(similarities)
novelty_score = 1.0 - max_similarity
else:
novelty_score = 0.8 # Default for first SNO
# Parsimony: penalize graph complexity
num_nodes = len(sno.reasoning_graph.nodes)
num_edges = len(sno.reasoning_graph.edges)
complexity_ratio = num_edges / num_nodes if num_nodes > 0 else 0
parsimony_penalty = min(0.3, complexity_ratio * 0.1)
score = 0.7 * novelty_score + 0.3 * (1.0 - parsimony_penalty)
return CriticResult(
score=score,
confidence=0.75,
explanation=f"Novelty: {novelty_score:.2f}, Complexity ratio: {complexity_ratio:.2f}",
details={
"novelty_score": novelty_score,
"complexity_ratio": complexity_ratio,
"compared_to_n": len(self.existing_embeddings)
}
)
class CriticPipeline:
"""Manages multiple critics and computes composite trust score"""
def __init__(self):
self.critics: Dict[CriticType, BaseCritic] = {}
def add_critic(self, critic: BaseCritic):
self.critics[critic.critic_type] = critic
def evaluate_sno(self, sno: StructuredNarrativeObject, context: Optional[Dict] = None) -> Dict[str, Any]:
"""Evaluate SNO with all critics and compute trust score"""
results = {}
weighted_sum = 0.0
total_weight = 0.0
for critic_type, critic in self.critics.items():
result = critic.evaluate(sno, context)
results[critic_type.value] = {
'score': result.score,
'confidence': result.confidence,
'explanation': result.explanation,
'details': result.details
}
weighted_sum += result.score * critic.weight
total_weight += critic.weight
trust_score = weighted_sum / total_weight if total_weight > 0 else 0.0
sno.trust_score = trust_score
return {
'trust_score': trust_score,
'individual_scores': results
}
print("✓ Critic classes defined")
# Step 4: Create pipeline and evaluate
print("\n[Step 4/5] Evaluating SNO with critic pipeline...")
pipeline = CriticPipeline()
pipeline.add_critic(GroundingCritic(weight=0.4))
pipeline.add_critic(LogicCritic(weight=0.3))
pipeline.add_critic(NoveltyParsimonyCritic(weight=0.3))
evaluation = pipeline.evaluate_sno(sno)
print(f"✓ Evaluation complete")
print(f"\n{'='*70}")
print(f"EVALUATION RESULTS")
print(f"{'='*70}")
print(f"\nOverall Trust Score: {evaluation['trust_score']:.4f}")
print(f"\nIndividual Critic Scores:")
for critic_name, result in evaluation['individual_scores'].items():
print(f"\n {critic_name.upper()} Critic:")
print(f" Score: {result['score']:.4f}")
print(f" Confidence: {result['confidence']:.2f}")
print(f" Explanation: {result['explanation']}")
if result['details']:
print(f" Details: {result['details']}")
# Step 5: Demonstrate contextual evaluation
print(f"\n{'='*70}")
print(f"CONTEXTUAL EVALUATION DEMONSTRATION")
print(f"{'='*70}")
# Exploration mode: favor novelty
print(f"\n[Exploration Mode] - Favoring novel ideas")
exploration_pipeline = CriticPipeline()
exploration_pipeline.add_critic(GroundingCritic(weight=0.1))
exploration_pipeline.add_critic(LogicCritic(weight=0.1))
exploration_pipeline.add_critic(NoveltyParsimonyCritic(weight=0.8))
exp_eval = exploration_pipeline.evaluate_sno(sno)
print(f"Trust Score (Exploration): {exp_eval['trust_score']:.4f}")
# Verification mode: favor grounding and logic
print(f"\n[Verification Mode] - Favoring rigor and evidence")
verification_pipeline = CriticPipeline()
verification_pipeline.add_critic(GroundingCritic(weight=0.45))
verification_pipeline.add_critic(LogicCritic(weight=0.45))
verification_pipeline.add_critic(NoveltyParsimonyCritic(weight=0.1))
ver_eval = verification_pipeline.evaluate_sno(sno)
print(f"Trust Score (Verification): {ver_eval['trust_score']:.4f}")
print(f"\n{'='*70}")
print(f"✓ CRITIC PIPELINE DEMONSTRATION COMPLETE")
print(f"{'='*70}")
print(f"\nKey Insights:")
print(f" • Same SNO evaluated differently based on context")
print(f" • Exploration mode: {exp_eval['trust_score']:.4f} (emphasizes novelty)")
print(f" • Verification mode: {ver_eval['trust_score']:.4f} (emphasizes rigor)")
print(f" • This flexibility allows CNS 2.0 to adapt to different phases")
print(f"\nWhat you just built:")
print(f" ✓ Complete critic pipeline with 3 specialized critics")
print(f" ✓ Grounding critic (evidence coverage)")
print(f" ✓ Logic critic (structural coherence)")
print(f" ✓ Novelty-Parsimony critic (innovation vs complexity)")
print(f" ✓ Contextual evaluation (dynamic weight adjustment)")
print(f"\nNext: Chapter 4 - Synthesis engine and chiral pair detection")
print(f"{'='*70}")
Step 2: Run It
python evaluate_with_critics.py
Expected Output
======================================================================
CNS 2.0 CRITIC PIPELINE DEMONSTRATION
======================================================================
[Step 1/5] Loading embedding model and data structures...
✓ Data structures ready
[Step 2/5] Creating sample SNO...
✓ Created SNO: b4d8f2a1
- 6 claims
- 5 reasoning edges
- 3 evidence items
[Step 3/5] Defining critic pipeline components...
✓ Critic classes defined
[Step 4/5] Evaluating SNO with critic pipeline...
✓ Evaluation complete
======================================================================
EVALUATION RESULTS
======================================================================
Overall Trust Score: 0.7245
Individual Critic Scores:
GROUNDING Critic:
Score: 0.6450
Confidence: 0.85
Explanation: Evidence ratio: 0.50, Avg confidence: 0.93
Details: {'evidence_count': 3, 'claim_count': 6}
LOGIC Critic:
Score: 0.9000
Confidence: 0.90
Explanation: Connectivity: 1.00, Cycles: False, Orphans: 0
Details: {'is_dag': True, 'is_connected': True, 'orphan_count': 0, 'avg_degree': 1.667}
NOVELTY Critic:
Score: 0.6600
Confidence: 0.75
Explanation: Novelty: 0.80, Complexity ratio: 0.83
Details: {'novelty_score': 0.8, 'complexity_ratio': 0.833, 'compared_to_n': 0}
======================================================================
CONTEXTUAL EVALUATION DEMONSTRATION
======================================================================
[Exploration Mode] - Favoring novel ideas
Trust Score (Exploration): 0.6905
[Verification Mode] - Favoring rigor and evidence
Trust Score (Verification): 0.7380
======================================================================
✓ CRITIC PIPELINE DEMONSTRATION COMPLETE
======================================================================
Key Insights:
• Same SNO evaluated differently based on context
• Exploration mode: 0.6905 (emphasizes novelty)
• Verification mode: 0.7380 (emphasizes rigor)
• This flexibility allows CNS 2.0 to adapt to different phases
What you just built:
✓ Complete critic pipeline with 3 specialized critics
✓ Grounding critic (evidence coverage)
✓ Logic critic (structural coherence)
✓ Novelty-Parsimony critic (innovation vs complexity)
✓ Contextual evaluation (dynamic weight adjustment)
Next: Chapter 4 - Synthesis engine and chiral pair detection
======================================================================
What Just Happened?
You built and tested a complete multi-component critic pipeline:
- Grounding Critic: Evaluated evidence coverage (0.65) - detected that only 3 evidence items cover 6 claims
- Logic Critic: Evaluated structural coherence (0.90) - confirmed DAG structure, no cycles, good connectivity
- Novelty Critic: Evaluated innovation vs complexity (0.66) - balanced novelty against graph complexity
- Composite Trust Score: Weighted average (0.72) - overall quality assessment
The contextual evaluation demonstration showed how the same SNO receives different scores based on system priorities:
- Exploration mode (novelty=0.8): Lower trust (0.69) because we prioritize new ideas over rigor
- Verification mode (grounding+logic=0.9): Higher trust (0.74) because we demand evidence and logic
Insights
Why did our SNO score 0.72?
- ✓ Strong logic (0.90): Well-structured reasoning chain with no cycles
- ⚠ Moderate grounding (0.65): Only 3 evidence items for 6 claims (ideally 1:1 ratio)
- ⚠ Moderate novelty (0.66): Decent innovation but some complexity penalty
How to improve this SNO:
- Add 3 more evidence items to reach 1:1 ratio → Improves grounding to ~0.85
- Simplify reasoning graph if possible → Improves novelty-parsimony
- Compute claim embeddings for semantic verification → Enables advanced grounding checks
Experiment: Evaluate Your Own SNO
Modify the script to evaluate the SNO you created in Chapter 2:
- Replace the hypothesis and claims with your content
- Run the evaluation
- Analyze which critic gave the lowest score
- Improve that aspect of your SNO
- Re-evaluate and compare
Challenge: Create two versions of your SNO:
- Version A: Maximize grounding (lots of evidence, well-cited)
- Version B: Maximize novelty (unconventional claims, novel connections)
Which gets a higher trust score? Why?
✓ Chapter 3 Checkpoint
Before proceeding to Chapter 4, verify you can:
- ✓ Create critic classes implementing
BaseCritic - ✓ Implement grounding evaluation (evidence coverage)
- ✓ Implement logic evaluation (graph structure)
- ✓ Implement novelty-parsimony evaluation
- ✓ Build a
CriticPipelineand add critics - ✓ Evaluate an SNO and receive trust score
- ✓ Adjust weights for contextual evaluation
If any step fails:
- Review the example code above
- Check your Chapter 2 SNO creation works
- Verify NetworkX is installed:
pip install networkx - See Troubleshooting
Understanding Check:
- Can you explain why the logic score was 0.90?
- Why did grounding score only 0.65?
- How would adding more evidence change the scores?
Navigation
← Previous: Chapter 2: SNO Foundations → Next: Chapter 4: Synthesis Engine
Learn how to identify chiral pairs and synthesize conflicting narratives into novel insights.