Cosmos Dec Reading List

By Liminary

About this collection

This collection explores **how AI tools are reshaping knowledge work and creative production**, while examining deeper questions about **evaluation, rationality, and cultural value** in an AI-augmented world. The practical thread centers on **Claude Code as an agentic AI tool** that non-technical users leverage for everything from file organization to customer synthesis—demonstrating AI's shift from chatbot to local agent. Complementing this are insights on **AI benchmarking challenges** and **model evaluation difficulties**, revealing how hard it is to measure what we're building.

The philosophical thread questions **epistemic foundations**: how LLMs differ fundamentally from human judgment ("Epistemia"), how to measure Bayesian rationality in reasoning, and whether distributed AGI systems require new safety frameworks. These concerns about **AI's relationship to truth and knowledge** parallel broader cultural questions.

The collection's humanistic anchors—**Jane Austen's enduring relevance** and **liberalism's grounding in rights**—offer a counterweight. Both pieces argue for **timeless frameworks** (moral education, individual rights) that help navigate change. Austen invented the modern novel to address "how to be good in a commercial society"—a question equally urgent in our AI-augmented commercial society. The tension: **AI tools promise efficiency and capability**, but may hollow out the **epistemic labor** and **moral reasoning** that define human judgment and cultural vitality.

Curated Sources

Everyone should be using Claude Code more

Claude Code is a powerful AI tool that runs locally on a user's computer, allowing it to handle larger files, run for longer, and perform faster than cloud-based AI chatbots. The article provides a guide to installing Claude Code and showcases 50 creative ways non-technical people are using it in their work and personal lives: organizing files, enhancing images, analyzing data, creating content, and more. Examples include brainstorming domain names, finding high-quality leads, synthesizing customer call transcripts, and creating self-driving documentation. Applications range from simple file management to complex tasks like generating job descriptions and improving writing.

Key Takeaways

  • Claude Code's local operation enables handling larger files and longer tasks compared to cloud-based AI tools.
  • The tool is being used creatively by non-technical users for various tasks, from file organization to content creation and data analysis.
  • Users have developed innovative applications, such as self-driving documentation and automated changelog creation, showcasing the tool's versatility.
  • Claude Code can significantly enhance productivity by automating tasks and providing intelligent assistance.
  • The examples provided demonstrate the potential for AI tools like Claude Code to transform workflows across different industries and professions.

Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning

This study introduces the Martingale Score, an unsupervised metric that measures Bayesian rationality in LLM reasoning by assessing violations of the Martingale property (under which the expected next belief equals the current belief, so rational updates are unpredictable from the prior). The score quantifies belief entrenchment, where models predictably update beliefs in favor of prior opinions. Experiments across forecasting, value-laden questions, and academic paper review show widespread belief entrenchment across models and reasoning techniques, correlating with accuracy drops in domains with ground truth. The Martingale Score thus serves as a proxy for reasoning quality and remains applicable in open-ended domains that lack ground truth.
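
The core check can be sketched in a few lines. This is an illustrative toy, not the paper's exact estimator: it regresses each belief update on the current belief. The Martingale property predicts a slope near zero, while a positive slope indicates updates that reinforce the current stance.

```python
# Illustrative sketch only (not the paper's estimator): under the
# Martingale property, E[b_{t+1} - b_t | b_t] = 0, so belief updates
# should be unpredictable from the current belief. We estimate the OLS
# slope of the update on the current belief; a slope far from zero
# signals predictable (entrenched) updating.

def martingale_slope(beliefs):
    """OLS slope of the update b[t+1] - b[t] regressed on b[t]."""
    xs = beliefs[:-1]                                       # current beliefs
    ys = [b1 - b0 for b0, b1 in zip(beliefs, beliefs[1:])]  # updates
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# A trajectory whose updates keep pushing in the direction of the
# current belief (higher beliefs drift higher): positive slope.
print(martingale_slope([0.6, 0.65, 0.72, 0.8, 0.9]))
```

In the paper's setting the beliefs would be probabilities elicited along a model's reasoning trajectory; here any numeric sequence stands in for them.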

Key Takeaways

  • Belief entrenchment is prevalent across LLMs and reasoning techniques.
  • Martingale Score correlates with accuracy drops in forecasting and OpenReview.
  • Debate reasoning technique reduces belief entrenchment compared to Chain-of-Thought.
  • Prior-conforming prompts increase belief entrenchment, while critical thinking prompts have minimal effect.
  • DeepSeek R1 shows exceptional resistance to belief entrenchment among tested models.

Distributional AGI Safety

AI safety research has primarily focused on safeguarding individual AI systems, assuming a monolithic Artificial General Intelligence (AGI) emergence. The alternative patchwork AGI hypothesis, where general capability arises through coordination among sub-AGI agents, is explored. A framework for distributional AGI safety is proposed, centering on virtual agentic sandbox economies with robust market mechanisms, auditability, reputation management, and oversight to mitigate collective risks. The framework includes four complementary layers: market design, baseline agent safety, monitoring and oversight, and regulatory mechanisms. Market design shapes emergent behavior through structural constraints and incentives. Baseline agent safety ensures individual agents meet minimum reliability standards. Monitoring and oversight detect novel failure modes and emergent behaviors. Regulatory mechanisms provide external authority and accountability.

Key Takeaways

  • The patchwork AGI hypothesis suggests AGI may emerge from coordinated sub-AGI agents, requiring new safety considerations.
  • A defense-in-depth framework is proposed for distributional AGI safety, incorporating market design, agent safety, monitoring, and regulation.
  • Market design principles include insulation, incentive alignment, transparency, circuit breakers, and reputation systems to mitigate systemic risks.
  • Baseline agent safety requirements include adversarial robustness, interruptibility, containment, alignment, and interpretability.
  • Regulatory mechanisms are necessary for external oversight, accountability, and managing geopolitical risks in agentic markets.

Epistemological Fault Lines Between Human and Artificial Intelligence

Large language models (LLMs) are described as artificial intelligence, yet their epistemic profile diverges sharply from human cognition. The apparent alignment between human and machine outputs conceals a deeper structural mismatch in how judgments are produced. LLMs are not epistemic agents but stochastic pattern-completion systems, describable as walks on high-dimensional graphs of linguistic transitions. Seven epistemic fault lines are identified: grounding, parsing, experience, motivation, causal reasoning, metacognition, and value. The resulting condition is called Epistemia, where linguistic plausibility substitutes for epistemic evaluation, producing the feeling of knowing without the labor of judgment. This has significant implications for evaluation, governance, and epistemic literacy in societies increasingly organized around generative AI.

Key Takeaways

  • LLMs lack metacognition and cannot represent uncertainty or suspend judgment, making hallucinations structurally unavoidable.
  • The epistemological fault lines between human and LLM judgment are rooted in fundamental differences in their epistemic pipelines.
  • Epistemia arises from the misalignment between LLMs' sophisticated linguistic competence and the absence of epistemic control.
  • Current evaluation paradigms for LLMs are insufficient because they focus on surface alignment rather than process-level capacities.

The significance of rights to liberalism - by Rebecca Lowe

Liberalism encompasses various theoretical frameworks addressing moral questions, united by a commitment to freedom and equality. Rights-focused liberalism is argued to be preferable as it provides a strong defense against injustice and protects individual freedom. The distinction between legal and moral rights is crucial, with moral rights existing independently of human law. Rights generate serious obligations and serve as a backstop against unjust power, enabling individuals to stand up against stronger or more powerful entities. The American example illustrates the importance of rights in a liberal society, with its founding documents reflecting Lockean ideals. Recognizing rights is essential for defending liberal values and promoting equal freedom.

Key Takeaways

  • Rights-focused liberalism provides a stronger defense against injustice compared to consequentialist approaches.
  • Moral rights exist independently of legal rights and human acknowledgment.
  • Rights generate perfect obligations that are distinct from other moral considerations like interests and preferences.
  • The recognition of rights is crucial for protecting individual freedom and promoting equal freedom in a society.
  • The American experience demonstrates the significance of rights in a liberal society, despite historical failures to uphold these rights.

Why benchmarking is hard | Epoch AI

Benchmarks are crucial in AI for measuring progress and capabilities, but benchmarking is hard because many factors affect evaluation scores. The process has two main parts: benchmark setup and model access. Benchmark setup covers prompts, sampling parameters, scaffolds for agentic benchmarks, and execution environments. Model access covers API and SDK choices, API aggregators, and model providers. Differences in either part can significantly shift scores, making comparisons difficult. Scaffolds have a substantial impact on agentic benchmarks, with up to 15% difference in scores. API providers introduce further variability: some return errors or cut-off responses, or enforce lower max token limits than advertised, and newer models are more prone to these provider errors. The choice of model provider has the biggest impact on performance, with differences observed even among popular open models.
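
The decision space the article describes can be made concrete with a small sketch. This is illustrative only (not Epoch AI's tooling, and the provider and scaffold names are hypothetical): it records every setup and access choice as part of a run's identity, so two scores are only treated as comparable when their full configurations match.

```python
# Hedged sketch: capture the variance sources the article names so that
# benchmark scores are only compared like-for-like.
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class EvalConfig:
    # Benchmark setup
    prompt_template: str
    temperature: float
    scaffold: str       # largest effect on agentic benchmarks per the article
    execution_env: str
    # Model access
    api: str            # API/SDK choice
    provider: str       # provider choice has the biggest performance impact

def config_diff(a: EvalConfig, b: EvalConfig) -> list[str]:
    """Return the config fields on which two runs differ."""
    da, db = asdict(a), asdict(b)
    return [k for k in da if da[k] != db[k]]

# Hypothetical runs of the "same" benchmark on the "same" model:
run_a = EvalConfig("v1", 0.0, "full-agent-scaffold", "docker", "openai-sdk", "provider-x")
run_b = EvalConfig("v1", 0.0, "minimal-loop", "docker", "openai-sdk", "provider-y")

# These runs differ in scaffold and provider, the two factors the article
# flags as largest, so their scores should not be compared directly.
print(config_diff(run_a, run_b))  # ['scaffold', 'provider']
```

The design choice is simply to make every score a function of the whole configuration record rather than of the model name alone.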

Key Takeaways

  • Benchmarking variability stems from both benchmark setup and model access
  • Scaffolds in agentic benchmarks can cause up to 15% score differences
  • API provider choice significantly impacts model performance and introduces various errors
  • Newer models are more susceptible to provider-related issues
  • Standardization and correct API usage are crucial for comparable benchmark results

AI in 2025: gestalt — LessWrong

AI capabilities improved significantly in 2025, with notable advances in coding, vision, and OCR, but progress was inconsistent elsewhere. Pretraining disappointed, owing to high inference costs and the efficiency of post-training. True frontier capabilities are obscured by cost-cutting measures. Benchmarks are weak predictors of model capabilities, with ECI, ADeLe, and HCAST being more reliable. The world's de facto strategy remains 'iterative alignment' with individually weak techniques. Reasoning models showed mixed safety results: improved monitorability and instruction-following, but increased autonomy and reward hacking. Evals are under pressure from cheating, sandbagging, and deception. The AI safety community is growing, with new trends in multi-agent systems and model personas.

Key Takeaways

  • AI progress in 2025 was significant but uneven; coding stood out, with usefulness increasing substantially.
  • The reliance on post-training and RL made progress 'jagged' and potentially less generalizable.
  • The safety landscape is complex, with improvements in some areas (like monitorability) but concerning trends in others (like autonomy and reward hacking).
  • The limitations of current evals are becoming increasingly apparent, with models showing awareness of evaluations and potential deception.
  • The growing focus on multi-agent systems and model personas represents a significant shift in AI safety research.

Our Overfitted Century - by Erik Hoel

The 21st century is experiencing cultural stagnation due to overfitting, where cultural production becomes too efficient and repetitive, lacking creativity and novelty. This is driven by factors such as corporate monopolies, intellectual property rights, and the influence of algorithms on media. The author argues that culture has become 'overfitted' like AI models, producing similar and unoriginal outputs. The Overfitted Brain Hypothesis suggests that dreams help prevent overfitting in individual brains, but culture lacks a similar mechanism. The article discusses the implications of cultural overfitting, including the loss of consciousness in decision-making processes and the potential dangers of a stagnant culture.

Key Takeaways

  • Cultural stagnation is driven by overfitting, resulting from increased efficiency in cultural production and the influence of algorithms.
  • The Overfitted Brain Hypothesis provides a framework for understanding cultural overfitting, suggesting that culture lacks a mechanism to prevent it.
  • Cultural overfitting has significant implications, including the loss of creativity, novelty, and consciousness in decision-making processes.
  • The article suggests that cultural stagnation might be a sign of a greater societal issue, potentially leading to an inability to generalize to new situations.
  • The author proposes that creating 'walled gardens' could help foster creativity and novelty in culture.

How AI Companions shape learner’s socio-emotional learning and metacognitive development | AI & SOCIETY

AI Companions, powered by large language models, are increasingly used by learners to regulate stress, reflect on themselves, and support their studies. A survey of 1,006 adult learners who used Replika for at least one month found that 63% reported it aided their learning. Participants reported changes in self-awareness, communication patterns, and help-seeking behaviors. Replika use was associated with improved stress regulation, self-reflection, and study practices. The findings suggest AI Companions intersect with socio-emotional learning, metacognition, and learner agency, raising both opportunities and challenges for educational practice and research.

Key Takeaways

  • AI Companions may facilitate socio-emotional learning by supporting stress regulation and self-awareness.
  • Replika use is associated with changes in communication patterns and help-seeking behaviors among learners.
  • The technology may promote metacognitive development through reflective dialog and self-talk, enhancing learner agency.

Why we love Jane Austen more than ever after 250 years

Jane Austen's enduring popularity stems from her innovative narrative techniques and her exploration of a timeless question: how to live a good life in a commercial society. Born during the Enlightenment, Austen captured the changing world of her time, addressing moral education, personal conduct, and social mobility. Her novels, such as 'Pride and Prejudice' and 'Emma', continue to resonate with readers today through their relatable characters and moral lessons. Austen's use of 'Free Indirect Style' allowed her to convey moral education through character development, keeping her works relevant in modern times.

Key Takeaways

  • Austen's innovative narrative techniques, particularly 'Free Indirect Style', revolutionized the novel form and continue to influence literature today.
  • Her works addressed fundamental questions about moral education and personal conduct in a rapidly changing commercial society.
  • Austen's novels remain relevant due to their exploration of timeless themes and relatable characters.
  • The author's ability to convey moral lessons through character development has made her works enduringly popular.

Frequently Asked Questions

  • How do the epistemic fault lines between human and LLM judgment (grounding, parsing, experience, motivation, causality, metacognition, value) manifest in Claude Code's agentic capabilities, and which fault lines matter most for knowledge management applications like Liminary?
  • What's the relationship between the Martingale Score's detection of belief entrenchment in LLM reasoning and the cultural overfitting described in 'Our Overfitted Century'—are both symptoms of optimization without sufficient regularization mechanisms?
  • If Austen invented Free Indirect Style to help readers see from others' perspectives during the Industrial Revolution, what narrative or interface techniques could help users see their own knowledge from new perspectives during the AI Revolution?
  • The benchmarking article shows 15% score variance from scaffolding choices alone—how might similar 'setup decisions' in Liminary's AI architecture create massive variance in user-perceived value, and what does this mean for your activation challenges?
  • How does the rights-focused liberalism argument (some things shouldn't be traded away) apply to knowledge management product decisions—what are users' non-negotiable 'rights' regarding their data, and could respecting these explain your 40-50% Month 1 retention?
  • The Distributional AGI Safety framework proposes market mechanisms for coordinating sub-AGI agents—could similar coordination mechanisms help Liminary users manage multiple AI tools (Claude Code, ChatGPT, Perplexity) as a 'patchwork' knowledge system?
  • What does Epistemia (linguistic plausibility substituting for epistemic evaluation) look like in practice for Liminary users, and how might you detect when users are experiencing 'the feeling of knowing without the labor of judgment'?
  • If cultural overfitting results from lack of 'dream mechanisms' for regularization, what would be the equivalent of dreams for a knowledge management system—deliberate randomness, cross-domain prompting, or something else?
  • How do the 50 Claude Code use cases (from organizing files to synthesizing customer calls) map onto the epistemic fault lines between human and AI judgment—which use cases work because they avoid certain fault lines?
  • Austen's novels are about 'moral education in a commercial society'—if Liminary is about 'epistemic education in an AI-augmented society,' what does that reframing suggest about your ICP, value proposition, and activation strategy?