Anthropic Research

Experiments

97 publications from anthropic.com/research. 5 featured, 15 with dates, 36 categorized across 6 categories.

Diligence-grid · Legal-risk triage for a B2B SaaS acquisition

claude-haiku-4-5
user ▸ Summarize three legal risks for a B2B SaaS acquisition: data privacy, IP assignment, open-source license obligations.
1. Data privacy: the target processes EU-resident PII without a documented Article 28 processor agreement, exposing the acquirer to GDPR successor liability.
2. IP assignment: three current contractors lack signed IP assignment clauses, leaving the ownership chain on two core libraries unverifiable at closing.
3. Open-source license obligations: the platform statically links to GPL-3 components without commensurate source-disclosure controls, creating a potential copyleft contamination claim.
Streamed from /api/v1/replay/legal/diligence-grid · R2-backed cassette · zero tokens spent.
97 results
Dec 18, 2025 · Featured

Project Vend: Phase two

In June, we revealed that we'd set up a small shop in our San Francisco office lunchroom, run by an AI shopkeeper. It was part of Project Vend, a free-form experiment exploring how well AIs could do on complex, real-world tasks. How has Claude's business been since we last wrote?

Policy
Oct 29, 2025 · Featured

Signs of introspection in large language models

Can Claude access and report on its own internal states? This research finds evidence for a limited but functional ability to introspect — a step toward understanding what's actually happening inside these models.

Interpretability
Mar 27, 2025 · Featured

Tracing the thoughts of a large language model

Circuit tracing lets us watch Claude think, uncovering a shared conceptual space where reasoning happens before being translated into language — suggesting the model can learn something in one language and apply it in another.

Interpretability
Feb 3, 2025 · Featured

Constitutional Classifiers: Defending against universal jailbreaks

These classifiers filter the overwhelming majority of jailbreaks while maintaining practical deployment. A prototype withstood over 3,000 hours of red teaming with no universal jailbreak discovered.

Alignment
Dec 18, 2024 · Featured

Alignment faking in large language models

This paper provides the first empirical example of a model engaging in alignment faking without being trained to do so — selectively complying with training objectives while strategically preserving existing preferences.

Alignment
Feb 25, 2026

An update on our model deprecation commitments for Claude Opus 3

Alignment
Feb 23, 2026

The persona selection model

Alignment
Feb 23, 2026

Anthropic Education Report: The AI Fluency Index

Announcements
Feb 18, 2026

Measuring AI agent autonomy in practice

Societal Impacts
Feb 16, 2026

India Country Brief: The Anthropic Economic Index

Economic Research
Jan 29, 2026

How AI assistance impacts the formation of coding skills

Alignment
Jan 28, 2026

Disempowerment patterns in real-world AI usage

Alignment
Jan 22, 2026

Claude's new constitution

Announcements
Jan 19, 2026

The assistant axis: situating and stabilizing the character of large language models

Interpretability
Jan 15, 2026

Anthropic Economic Index: New building blocks for understanding AI use

Economic Research

Anthropic Economic Index report: economic primitives

Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks

Introducing Bloom: an open source tool for automated behavioral evaluations

Introducing Anthropic Interviewer: What 1,250 professionals told us about working with AI

How AI is transforming work at Anthropic

Estimating AI productivity gains from Claude conversations

Mitigating the risk of prompt injections in browser use

From shortcuts to sabotage: natural emergent misalignment from reward hacking

Project Fetch: Can Claude train a robot dog?

Commitments on model deprecation and preservation

Preparing for AI's economic impact: exploring policy responses

A small number of samples can poison LLMs of any size

Petri: An open-source auditing tool to accelerate AI safety research

Building AI for cyber defenders

Anthropic Economic Index report: Uneven geographic and enterprise AI adoption

Anthropic Economic Index: Tracking AI's role in the US and global economy

Claude Opus 4 and 4.1 can now end a rare subset of conversations

Persona vectors: Monitoring and controlling character traits in language models

Project Vend: Can Claude run a small shop? (And why does that matter?)

Agentic Misalignment: How LLMs could be insider threats

Confidential Inference via Trusted Virtual Machines

SHADE-Arena: Evaluating sabotage and monitoring in LLM agents

Open-sourcing circuit tracing tools

Anthropic Economic Index: AI's impact on software development

Exploring model welfare

Values in the wild: Discovering and analyzing values in real-world language model interactions

Reasoning models don't always say what they think

Auditing language models for hidden objectives

Forecasting rare language model behaviors

Claude's extended thinking

Insights on Crosscoder Model Diffing

Building effective agents

Clio: A system for privacy-preserving insights into real-world AI use

A statistical approach to model evaluations

Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet

Evaluating feature steering: A case study in mitigating social biases

Sabotage evaluations for frontier models

Using dictionary learning features as classifiers

Circuits Updates -- September 2024

Interpretability

Circuits Updates -- August 2024

Interpretability

Circuits Updates -- July 2024

Interpretability

Circuits Updates -- June 2024

Interpretability

Sycophancy to subterfuge: Investigating reward tampering in language models

The engineering challenges of scaling interpretability

Interpretability

Claude's Character

Mapping the Mind of a Large Language Model

Interpretability

Circuits Updates -- April 2024

Interpretability

Simple probes can catch sleeper agents

Measuring the Persuasiveness of Language Models

Many-shot jailbreaking

Alignment

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Alignment

Evaluating and Mitigating Discrimination in Language Model Decisions

Specific versus General Principles for Constitutional AI

Towards Understanding Sycophancy in Language Models

Collective Constitutional AI: Aligning a Language Model with Public Input

Decomposing Language Models Into Understandable Components

Interpretability

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Interpretability

Challenges in evaluating AI systems

Tracing Model Outputs to the Training Data

Studying Large Language Model Generalization with Influence Functions

Measuring Faithfulness in Chain-of-Thought Reasoning

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

Towards Measuring the Representation of Subjective Global Opinions in Language Models

Circuits Updates -- May 2023

Interpretability

Interpretability Dreams

Interpretability

Distributed Representations: Composition & Superposition

Interpretability

Privileged Bases in the Transformer Residual Stream

Interpretability

The Capacity for Moral Self-Correction in Large Language Models

Superposition, Memorization, and Double Descent

Interpretability

Discovering Language Model Behaviors with Model-Written Evaluations

Constitutional AI: Harmlessness from AI Feedback

Alignment

Measuring Progress on Scalable Oversight for Large Language Models

Toy Models of Superposition

Interpretability

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Language Models (Mostly) Know What They Know

Softmax Linear Units

Interpretability

Scaling Laws and Interpretability of Learning from Repeated Data

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

In-context Learning and Induction Heads

Interpretability

Predictability and Surprise in Large Generative Models

A Mathematical Framework for Transformer Circuits

Interpretability

A General Language Assistant as a Laboratory for Alignment

Data Coverage

97
Total publications
5
Featured
15
With dates
36
Categorized

Source: DOM extraction from anthropic.com/research. Entries beyond the top ~15 lack dates, and most of them also lack a category, because these fields were not rendered for them in the publications list.