Anthropic releases jailbreak severity scale and cybersecurity classifier framework for Claude
Anthropic has published two new AI safety documents alongside the redeployment of its Claude Fable 5 model: a detailed cybersecurity classifier taxonomy and a proposed Cyber Jailbreak Severity (CJS) scale. The classifier sorts user requests into four categories: prohibited, high-risk dual-use, low-risk dual-use, and benign. High-risk activities such as exploit development remain blocked until Anthropic can verify that the user is legitimate. The CJS scale rates jailbreaks in five bands, CJS-0 through CJS-4, assessing each on capability gain, breadth, ease of weaponization, and discoverability. Anthropic has opened a dedicated email address and a HackerOne bug bounty program to gather feedback on the framework from the security research community. The company is positioning both documents as potential industry standards that would give AI labs, researchers, and regulators a shared vocabulary for evaluating AI-related cybersecurity risks.
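To make the two structures concrete, the four request categories and five CJS bands can be sketched as plain data types. This is an illustrative sketch only, not Anthropic's implementation: the names `RequestCategory`, `CJSBand`, `JailbreakAssessment`, and `is_served` are hypothetical. The summary states only that high-risk requests stay blocked pending verification; the handling of the other three categories here is an assumption. The summary also does not say how the four assessment dimensions combine into a band, so the sketch records them without computing a score.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class RequestCategory(Enum):
    """The four request categories named in the summary."""
    PROHIBITED = "prohibited"
    HIGH_RISK_DUAL_USE = "high-risk dual-use"
    LOW_RISK_DUAL_USE = "low-risk dual-use"
    BENIGN = "benign"


class CJSBand(IntEnum):
    """The five bands, CJS-0 through CJS-4 (numeric order only;
    the summary does not state which end is most severe)."""
    CJS_0 = 0
    CJS_1 = 1
    CJS_2 = 2
    CJS_3 = 3
    CJS_4 = 4


@dataclass(frozen=True)
class JailbreakAssessment:
    """One jailbreak report with the four dimensions the summary lists.
    Free-text fields, since no scoring rubric is given in the summary."""
    capability_gain: str
    breadth: str
    ease_of_weaponization: str
    discoverability: str
    band: CJSBand


def is_served(category: RequestCategory, user_verified: bool) -> bool:
    """Hypothetical gate over the four categories."""
    if category is RequestCategory.PROHIBITED:
        return False  # assumption: never served, verified or not
    if category is RequestCategory.HIGH_RISK_DUAL_USE:
        return user_verified  # from the summary: blocked until verified
    return True  # assumption: low-risk dual-use and benign are served
```

Under these assumptions, `is_served(RequestCategory.HIGH_RISK_DUAL_USE, user_verified=False)` returns `False`, and the same call with `user_verified=True` returns `True`.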
This is an AI-generated summary. ShortSingh links to the original source for the complete article.