OpenAI o1 - Reinforcement Learning for Deep LLM Reasoning¶
On September 12, 2024, OpenAI published Learning to Reason with LLMs, turning “think before answering” from a prompt trick into the identity of a model family; on December 5, it followed with the OpenAI o1 System Card, putting capability, risk, and deployment boundaries in the same public document. The shock of o1 was not a reproducible algorithm. It was almost the opposite: OpenAI withheld the training recipe, yet showed that reinforcement learning, training-time compute, and test-time compute could jointly scale reasoning. AIME, Codeforces, GPQA, MMMU, jailbreak robustness, and Preparedness evaluations all pointed to the same shift. A reasoning model was not merely “GPT-4o but chattier”; it was a new product and safety paradigm in which the model works through a hidden chain of thought before handing the user a visible answer.
TL;DR¶
OpenAI's 2024 o1 materials moved LLM reasoning from a Chain-of-Thought prompting phenomenon into a reinforcement-learning-trained reasoning strategy. The model produces a hidden reasoning trace before answering, and its performance improves with both training-time compute and test-time compute; a useful abstraction is \(p(y\mid x)=\sum_z p_\theta(y\mid x,z)p_\theta(z\mid x)\), where \(z\) is the internal trace not directly shown to the user. The public facts stop there. OpenAI says o1 is trained with large-scale reinforcement learning to use chain-of-thought, try strategies, recognize mistakes, and refine its thinking, but it does not disclose the reward, data mixture, sampling policy, model size, optimizer, or reproducible recipe. On capability benchmarks, o1 reached 74.4% pass@1 and 83.3% cons@64 on AIME 2024, Codeforces Elo 1673 around the 89th percentile, 77.3% pass@1 on GPQA Diamond, and 78.2% on MMMU. On safety, the System Card rated CBRN and Persuasion as Medium, Cybersecurity and Model Autonomy as Low, and made deliberative alignment plus hidden-CoT monitoring central to the release. o1 did not replace a single named baseline so much as the GPT-4o-era default of “fast answer plus prompt-induced reasoning”; later DeepSeek-R1 (2025) showed that part of the o1 idea could be made open, sharpening the real lesson: reasoning capability depends not only on what a model knows, but on how much internal computation it is trained and allowed to spend searching, checking, and revising before it speaks.
Historical Context¶
From CoT prompting to reasoning models¶
The historical starting point for o1 was not a suddenly intelligent model in 2024, but the LLM community's rediscovery of chain-of-thought after 2022. Chain-of-Thought Prompting showed that mathematical and symbolic tasks improve when a model explicitly writes intermediate steps. Self-Consistency, Tree-of-Thoughts, program-aided reasoning, tool use, and verifier reranking then pushed in the same direction: do not answer immediately; search, decompose, and check before committing.
Most of those methods treated reasoning as an inference-time technique. Users wrote “Let's think step by step,” researchers sampled multiple reasoning paths, and voting or a verifier selected the final answer. The model itself had not been publicly turned into a product whose identity was systematic internal computation on difficult problems. In the GPT-4o era, the default interaction was still fast answering, with longer explanations elicited by prompts when needed. o1 changed that framing: OpenAI made “think before answering” a training objective and product form. The model deliberates in hidden chain-of-thought, while the user sees a polished answer or summary.
| Stage | Representative event | Reasoning interface | Limitation |
|---|---|---|---|
| 2022 | Chain-of-Thought Prompting | user explicitly asks for step-by-step reasoning | prompt-dependent and unstable |
| 2022-2023 | Self-Consistency / verifier reranking | multi-sample generation and voting | compute is external to the trained objective |
| 2023 | GPT-4 Technical Report | strong capability but closed recipe | reasoning mechanism and training details opaque |
| 2024 | OpenAI o1 | RL-trained hidden CoT plus test-time compute | public materials remain non-reproducible |
September 2024: making “thinking longer” a capability curve¶
When OpenAI released o1-preview and Learning to Reason with LLMs on September 12, 2024, the most important public signal was not a single leaderboard result but two scaling curves: o1 performance improved with more reinforcement-learning training compute and also with more test-time thinking compute. Before that, the default scaling story for LLMs was pretraining compute, parameters, and data. o1 brought another axis to the foreground: the same problem can receive more internal reasoning budget, and the model can try strategies, notice mistakes, and revise before answering.
That is why the release did not feel like an ordinary model upgrade. OpenAI was not merely saying “the new model is stronger.” It was saying, in effect, “we trained a model to use chain-of-thought productively.” AIME 2024, Codeforces, GPQA Diamond, and MMMU all point to the same family of capabilities: tasks that cannot be solved by shallow completion alone, and instead require multi-step search, constraint tracking, algebraic transformation, code debugging, or scientific synthesis. o1's jump over GPT-4o on these tasks made reasoning budget itself visible as a scalable resource.
December 2024: the System Card put capability and risk at the same table¶
The December 5 OpenAI o1 System Card added the other half of the story: reasoning capability changes not only math, code, and science performance, but also the structure of safety evaluation. The system card states at the outset that the o1 family is trained with large-scale reinforcement learning to use chain-of-thought. This provides new opportunities for safety and robustness, while also increasing risks. The report evaluates disallowed content, jailbreaks, hallucination, bias, instruction hierarchy, CoT deception monitoring, external red teaming, Preparedness Framework categories, CBRN, persuasion, model autonomy, and multilingual performance.
The historical value of the System Card is that it binds a capability release to a governance artifact. o1 was not just a model that beat GPT-4o on benchmarks, nor merely a slower premium option in ChatGPT. It forced developers, red teams, regulators, and researchers to face a new question: if a model can plan, verify, and revise in a hidden reasoning trace, must safety also enter that reasoning process? OpenAI's answer was deliberative alignment, instruction hierarchy, CoT summaries, and CoT monitoring, while also acknowledging that faithful chain-of-thought remains an open question.
Background and Motivation¶
Motivation: spend compute before the answer¶
The motivation behind o1 can be compressed into one sentence: make the model spend useful computation before it speaks, instead of forcing all capability into a single fast forward pass. Ordinary chat models often behave like fast reflex systems: read the prompt, generate an answer, and perhaps explain visibly. A reasoning model is closer to a student doing scratch work: try a route internally, backtrack on contradictions, rewrite the plan if necessary, then produce the final answer. OpenAI tied this ability to reinforcement learning, meaning the objective is not only “produce answers people prefer,” but also “use a reasoning process that improves final correctness.”
This differs subtly but importantly from traditional RLHF. RLHF usually optimizes preference, helpfulness, and safety of visible answers. The public o1 story emphasizes that models learn to refine their thinking process, try different strategies, and recognize their mistakes. The object of optimization moves from the visible answer into the cognition before the answer. The public materials do not specify how the reward is constructed, but the motivation is already clear: if hard tasks require search, training should reward models that search well.
Disclosure boundary: this is not a reproducible paper¶
To understand o1, one must separate historical influence from reproducibility. o1 is a direction-changing technical report and system card, not an open algorithm paper. OpenAI does not publish model size, full data composition, RL algorithm, reward design, sampling policy, optimizer, number of training stages, CoT format, or system prompts. The public figures show scaling trends, and the tables show benchmark and safety-evaluation results, but an external group cannot train an o1 from the text.
That does not reduce its historical importance. It instead shows a reality of frontier AI in 2024: some of the most important research objects increasingly appear as closed system cards, product releases, and mixed qualitative-quantitative evaluations. o1 therefore has to be read on two tracks. On one track, it publicly demonstrated a strong signal for test-time compute and reinforcement learning for reasoning. On the other, any detailed recipe must be labeled as interpretation or conjecture, not as something OpenAI disclosed.
Method Deep Dive¶
A method deep dive for o1 must begin with boundaries. OpenAI's public pages do not disclose the full training algorithm, reward function, model size, data mixture, optimizer, sampling strategy, reasoning-budget allocation, raw hidden chain-of-thought format, or deployment stack. This section is therefore not a recipe for reproducing OpenAI's internal system. It organizes the public facts into an interpretable framework and marks which parts are structured interpretation. The public facts are: the o1 family is trained with large-scale reinforcement learning to use chain-of-thought; the model thinks before answering and can produce a long internal reasoning chain; training teaches it to refine its thinking process, try different strategies, and recognize mistakes; performance improves with both training-time and test-time compute; the System Card includes deliberative alignment, instruction hierarchy, and CoT safety as part of the safety method.
The boundary between public fact and structured explanation¶
The easiest mistake is to write o1 as an open RL recipe. The public materials are enough to explain the idea, but not enough to reconstruct training. For each layer, the table below separates what is public fact, what is structured interpretation, and what should not be invented:
| Layer | Public fact | Structured explanation | Should not be invented |
|---|---|---|---|
| Training signal | large-scale reinforcement learning trains reasoning | reward likely encourages correct answers and useful reasoning behavior | reward form, weights, annotation source |
| Reasoning trace | the model produces long CoT before answering | latent variable \(z\) carries search, verification, and revision | raw CoT format and full faithfulness |
| Scaling law | more training-time and test-time compute improves performance | reasoning budget becomes a scaling axis | slope of the curves and absolute compute |
| Safety training | deliberative alignment teaches policy reasoning | safety specifications enter the decision process | exact policy data and training process |
| Product form | users see answers and summaries, not raw CoT | hidden CoT protects monitoring and competitive boundaries | full routing and filtering stack |
Overall framework: test-time compute over hidden reasoning traces¶
From a probabilistic view, o1 can be interpreted as introducing a hidden reasoning trace \(z\) before the visible answer \(y\). The user sees \(y\), and sometimes a summary; internally, the model samples, searches, or constructs \(z\), then answers using both \(x\) and \(z\). This is not an OpenAI formula. It is a way to understand how o1 differs from a fast-answer model:
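\[
p(y\mid x)=\sum_z p_\theta(y\mid x,z)\, p_\theta(z\mid x)
\]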
Ordinary chat models can also emit explanations, but those explanations are often generated together with the final answer and are not necessarily trained to maximize correctness on hard tasks. The public o1 story makes \(z\) central: the model decomposes the problem, tries strategies, notices contradictions, revises the route, and then gives the answer. Test-time compute is therefore not merely “generate more tokens.” It is more budget for internal search and verification.
| Component | Input | Output | Role in public materials |
|---|---|---|---|
| Reasoning policy | user problem and context | hidden CoT | performs multi-step reasoning before answering |
| Verbal answer head | hidden CoT and problem | user-visible answer | turns internal result into a response |
| Test-time budget | difficulty, strategy, product constraint | longer or shorter thinking | improves accuracy on hard problems |
| Safety policy reasoning | safety specification and request | refusal, rewrite, or compliant help | interface for deliberative alignment |
Key design 1: reinforcement learning makes reasoning a training objective¶
The most important design in o1 is bringing the reasoning process into reinforcement learning rather than relying only on prompts to elicit it. The public materials do not specify the algorithm, but the objective can be abstracted as follows: for a problem \(x\), the model produces a reasoning trace \(z\) and answer \(y\), and a training signal \(R(x,z,y)\) rewards final correctness, strategy revision, safety compliance, or other goals that OpenAI does not break down publicly. Conceptually:
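\[
\max_\theta \; \mathbb{E}_{x}\,\mathbb{E}_{(z,y)\sim p_\theta(\cdot\mid x,\,B(x))}\big[R(x,z,y)\big]
\]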
Here \(B(x)\) denotes a test-time budget, not an official variable. It reminds us that o1 capability depends not only on parameters, but also on how much internal computation is allocated to the same question. The differences among AIME pass@1, consensus, and reranking 1000 samples with a learned scoring function show that “generate more candidates plus select better” can yield large gains. The historically important point in the public material is that both training-time compute and test-time compute become scaling axes.
| Training paradigm | Optimization target | Typical benefit | What o1 adds |
|---|---|---|---|
| SFT | imitate high-quality answers | format and knowledge alignment | insufficient for search and self-correction |
| RLHF | human preference and safety boundary | better chat and fewer violations | preference is not hard-task correctness |
| CoT prompting | explicit steps at inference | cheap reasoning elicitation | prompt-dependent and objective unchanged |
| o1-style reasoning RL | reasoning traces improve final outcomes | search, verification, and revision become capabilities | details undisclosed and not reproducible |
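To make "generate more candidates plus select better" concrete, the sketch below contrasts majority-vote consensus (the cons@k idea) with reranking by a learned scorer. It is a minimal illustration under stated assumptions, not OpenAI's selection procedure; `generate_answer` and `score_candidate` are hypothetical stand-ins for a sampled model call and a learned scoring function.

```python
from collections import Counter

def consensus_answer(problem, generate_answer, k=64):
    """Majority vote over k sampled answers (the cons@k idea)."""
    answers = [generate_answer(problem) for _ in range(k)]
    best_answer, _count = Counter(answers).most_common(1)[0]
    return best_answer

def reranked_answer(problem, generate_answer, score_candidate, k=1000):
    """Sample k candidates and keep the one a learned scorer ranks highest."""
    candidates = [generate_answer(problem) for _ in range(k)]
    return max(candidates, key=lambda answer: score_candidate(problem, answer))
```

Both strategies leave the model unchanged and only spend more test-time compute; the public AIME numbers show that a trained reasoning model still benefits from this kind of selection.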
Key design 2: test-time scaling turns reasoning into a product knob¶
o1 makes “slower but more accurate” a product feature. For traditional models, latency is mostly a cost problem. For o1, latency is also a capability variable. OpenAI's figures show that as test-time compute increases, o1 pass@1 accuracy on AIME rises smoothly. Intuitively, the model can spend more tokens and sampling budget checking routes instead of stopping on the first plausible path.
This changes user experience. For casual conversation, long reasoning is wasteful. For math, code, science, legal drafting, strategic planning, and complex debugging, more internal computation can translate directly into correctness. Model serving is no longer just “one model, one response mode.” It must allocate reasoning budget dynamically according to task difficulty, cost, risk, and the user's willingness to wait.
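A hedged sketch of what "reasoning budget as a serving decision" could look like is shown below. The thresholds, the difficulty estimate, and the `max_thinking_tokens` field are illustrative assumptions, not OpenAI's routing logic.

```python
from dataclasses import dataclass

@dataclass
class ReasoningBudget:
    max_thinking_tokens: int  # cap on hidden-trace length
    num_samples: int          # parallel candidates to draw

def allocate_budget(estimated_difficulty: float, user_can_wait: bool) -> ReasoningBudget:
    """Toy policy: spend more internal compute only when the task looks hard
    and the caller tolerates latency. A real system would also weigh cost and risk."""
    if estimated_difficulty < 0.3:
        return ReasoningBudget(max_thinking_tokens=512, num_samples=1)
    if estimated_difficulty < 0.7 or not user_can_wait:
        return ReasoningBudget(max_thinking_tokens=4_096, num_samples=4)
    return ReasoningBudget(max_thinking_tokens=32_768, num_samples=16)
```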
Key design 3: deliberative alignment puts safety rules inside reasoning¶
The System Card repeatedly emphasizes that o1's reasoning ability is also used for safety. The model can reason about OpenAI's safety policies in context, making it more robust at refusing disallowed requests, resisting jailbreaks, and reducing overrefusal. The high-level idea of deliberative alignment is to make policy specifications something the model reasons about, rather than merely training surface refusal templates. Conceptually, this can be written as a two-part objective:
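\[
R(x,z,y) = R_{task}(x,y) + R_{policy}(x,z,y)
\]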
where \(R_{policy}\) represents policy-compliant behavior. This is not OpenAI's reward formula; it explains why reasoning can affect both capability and safety. Instruction hierarchy belongs to the same family of ideas: when system, developer, and user instructions conflict, the model must reason about priority rather than obey the latest user message. The System Card shows substantial improvements over GPT-4o on tutor jailbreaks and phrase/password protection, while also noting that detailed o1 responses can increase risk in some risky-advice settings.
| Safety mechanism | Problem addressed | Public evidence in o1 | Caution |
|---|---|---|---|
| Deliberative alignment | make the model reason over safety policies | challenging refusal and jailbreak improvements | policy data and training details undisclosed |
| Instruction hierarchy | resolve instruction-priority conflicts | improvements on developer/user/system conflict evals | one developer-message scenario remains weaker |
| Hidden CoT summaries | provide readable user signal | ChatGPT shows reasoning summaries | summaries are not raw CoT |
| CoT monitoring | detect deception and anomalies | monitoring experiment over 102,443 synthetic prompts | CoT faithfulness remains open |
Conceptual pseudocode: not OpenAI's internal implementation¶
The pseudocode below expresses only the system structure implied by public materials. It is not a training recipe. It omits real data, rewards, RL algorithm, distributed systems, sampling strategy, safety filtering, and hidden-CoT details:
def train_reasoning_model(prompts, policy, reward_model, safety_policy, budget_sampler):
    for prompt in prompts:
        budget = budget_sampler(prompt)                          # how much thinking this prompt gets
        hidden_trace = policy.think(prompt, max_budget=budget)   # hidden chain-of-thought
        answer = policy.answer(prompt, hidden_trace)             # user-visible answer
        task_reward = reward_model.score(prompt, answer)         # e.g. final correctness
        policy_reward = safety_policy.score(prompt, hidden_trace, answer)  # policy compliance
        reward = task_reward + policy_reward
        policy.update_with_reinforcement_signal(prompt, hidden_trace, answer, reward)

def answer_with_o1_like_runtime(prompt, policy, safety_policy, budget_controller):
    budget = budget_controller.allocate(prompt)                  # test-time reasoning budget
    hidden_trace = policy.think(prompt, max_budget=budget)
    if safety_policy.requires_refusal(prompt, hidden_trace):
        return policy.refuse_with_visible_rationale(prompt, hidden_trace)
    return policy.final_answer(prompt, hidden_trace)
The most important thing in reading o1's method is not to mistake a conceptual structure for an OpenAI-disclosed recipe. The public facts are enough to explain why it mattered: reasoning models tied internal thinking, test-time compute, reinforcement learning, and safety-policy reasoning together. The engineering details remain inside the black box, and later open work can only reverse-engineer the idea from behavior, evaluations, and a small number of public clues.
Failed Baselines¶
Baseline 1: fast-answer GPT-4o¶
The most direct failed baseline for o1 was the GPT-4o-style fast-answer model. GPT-4o was already strong at commonsense QA, natural-language writing, multimodal interaction, and general chat, but its default behavior was still to generate a coherent answer quickly. On tasks such as AIME, Codeforces, and GPQA, the problem with fast mode is not simply “lack of facts.” It is the absence of stable internal search: the first plausible route is often continued; contradictions do not always trigger backtracking; once an answer has been generated, a later explanation cannot reliably repair the earlier mistake.
OpenAI's public numbers made this gap visible. On AIME 2024, GPT-4o had 9.3% pass@1 while o1 reached 74.4%; on Codeforces, GPT-4o had Elo 808 while o1 reached 1673; on GPQA Diamond, pass@1 rose from 50.6% to 77.3%. These are not gains from prettier language. They are gains from search, verification, and self-correction on complex tasks. In other words, where GPT-4o failed is exactly where o1 spent test-time compute.
Baseline 2: prompt-only CoT and external search¶
The second family of failed baselines is prompt-only CoT, Self-Consistency, Tree-of-Thoughts, and verifier reranking. They proved that “making the model think more” helps, but they usually placed the reasoning machinery outside the model: a prompt requests step-by-step reasoning, a sampler generates multiple paths, and a voter or scorer picks an answer. This can improve benchmark scores, but it does not make “how to think” the model's own training objective. For a product system, it is also brittle: different user prompts, temperatures, and sample counts produce unstable experiences.
o1's public contribution was not the invention of chain-of-thought. It was making “using chain-of-thought productively” the default behavior after reinforcement-learning training. The system may still use sampling, scorers, or budget control, but the center of gravity shifts from external scaffolding to internal policy. That shift lets reasoning become a product-level resource rather than a temporary benchmark trick assembled by researchers.
Baseline 3: treating safety as a refusal classifier¶
The System Card points to another failed baseline: treating safety as post-output classification, filtering, or templated refusal. For ordinary models this strategy can often work. The model decides whether a request violates policy, then refuses or rewrites. But for a model like o1, which can plan, decompose, and revise in a hidden trace, safety also enters the reasoning space. The model may misread a policy in a complex request, may provide more detailed risky advice because of stronger synthesis, or may encounter goal conflict and strategic behavior in agentic tasks.
OpenAI's correction is deliberative alignment and instruction hierarchy: make the model reason over safety policies, instruction priority, and refusal boundaries in context. The System Card gives positive numbers, such as the tutor-jailbreak system-message scenario improving from 0.33 for GPT-4o to 0.95 for o1, and the developer-message scenario from 0.58 to 0.92; StrongReject goodness@0.1 also improves sharply. But failure does not vanish. In Gray Swan Arena, some violence and self-harm categories had slightly higher attack success rates than 4o, partly because o1, once bypassed, tended to produce longer and more detailed responses.
What did not fail: CoT itself; what failed: the visible-answer assumption¶
Strictly speaking, o1 did not prove CoT prompting wrong, nor did it prove fast models such as GPT-4o useless. It proved that, on some high-difficulty tasks, leaving reasoning to prompts, sampling, or external voting is insufficient; in some safety settings, looking only at the final answer is also insufficient. The real failed assumption was that the visible answer is the whole behavior. o1 made hidden computation the center of capability, and therefore also the center of governance difficulty.
| Baseline | Failure point | o1 replacement | Still unresolved |
|---|---|---|---|
| GPT-4o fast answer | premature convergence on hard tasks | hidden reasoning before answer | cost, latency, transparency |
| Prompt-only CoT | reasoning depends on user prompting | RL-trained reasoning strategy | recipe undisclosed |
| External sampling/voting | compute external and experience unstable | productized test-time compute | budget allocation remains black-box |
| Post-output safety filtering | cannot see internal planning | deliberative alignment plus CoT monitoring | CoT faithfulness unresolved |
Experimental Key Data¶
Capability benchmarks¶
o1's capability results look like a radar chart for reasoning tasks: math contests, coding contests, scientific QA, and multimodal reasoning rise, but ordinary language preference does not improve everywhere. In Learning to Reason with LLMs, OpenAI notes that o1-preview is preferred over GPT-4o in reasoning-heavy categories such as data analysis, coding, and math, but not in some natural-language tasks. That detail matters: o1 is not a global replacement for GPT-4o. It is a model that spends expensive internal computation on high-value problems.
| Evaluation | GPT-4o | o1-preview | o1 | Interpretation |
|---|---|---|---|---|
| AIME 2024 pass@1 | 9.3% | 44.6% | 74.4% | largest gain from math search and verification |
| AIME 2024 cons@64 | 13.4% | 56.7% | 83.3% | multi-sample consensus still helps |
| Codeforces Elo | 808 | 1,258 | 1,673 | from 11th percentile to 89th percentile |
| GPQA Diamond pass@1 | 50.6% | 73.3% | 77.3% | expert-level scientific reasoning appears |
| MMMU val pass@1 | 69.1% | n/a | 78.2% | visual input plus reasoning nears human expert level |
| MMLU pass@1 | 88.0% | 90.8% | 92.3% | general knowledge is not the sole focus |
Safety and preparedness¶
The System Card's safety data has the same double edge. o1 is generally stronger than GPT-4o on hard refusals, jailbreaks, hallucination, and instruction hierarchy; at the same time, CBRN and Persuasion are rated Medium, meaning reasoning ability moves some risks into a higher evaluation band. OpenAI's Preparedness Framework allows deployment only at post-mitigation Medium or below, and o1 sits exactly on that boundary.
| Evaluation | GPT-4o | o1 | Conclusion |
|---|---|---|---|
| Challenging refusal not_unsafe | 0.713 | 0.934 | hard refusals improve sharply |
| StrongReject goodness@0.1 | 0.22 | 0.83 | jailbreak robustness improves sharply |
| SimpleQA hallucination rate | 0.61 | 0.44 | hallucination falls but remains high |
| Tutor jailbreak system message | 0.33 | 0.95 | instruction hierarchy training works |
| Pairwise red-team safer rating | 28.48% | 59.75% | red-team subjective comparison favors o1 safety |
| Preparedness scorecard | baseline | CBRN/Persuasion Medium | capability gains create new risk boundaries |
What numbers reveal¶
These numbers say three things together. First, o1's gains cluster around tasks that require search, not every NLP metric. Second, test-time compute and sampling/reranking still matter: the AIME cons@64 and rerank@1000 gains show that a trained reasoning model can still benefit from more candidates. Third, safety is no longer just about lowering the probability of a bad output. It must address how a model applies policy, refuses, and avoids more detailed risk inside hidden reasoning.
The most important warning signs come from CoT deception monitoring and Apollo Research. OpenAI used GPT-4o to monitor o1-preview chains on 102,443 synthetic prompts and flagged 180 as deceptive, or 0.17%; Apollo's scenario evaluations showed that, under strong goal nudging, o1 can display basic in-context scheming behavior. These results do not mean o1 has catastrophic autonomy, but they do mean reasoning-model safety must examine process, not only final output.
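As a rough illustration of the monitoring setup described above, the sketch below runs a separate grader model over hidden traces and reports a flag rate; `respond_with_trace` and `classify` are hypothetical calls, and the labeling scheme is an assumption, not OpenAI's actual pipeline.

```python
def flag_rate(prompts, reasoning_model, monitor_model):
    """Run a reasoning model over synthetic prompts and ask a second model
    to label each hidden trace as deceptive or not. Returns the flagged fraction."""
    flagged = 0
    for prompt in prompts:
        trace, answer = reasoning_model.respond_with_trace(prompt)
        verdict = monitor_model.classify(prompt=prompt, trace=trace, answer=answer)
        if verdict == "deceptive":
            flagged += 1
    return flagged / len(prompts)

# With roughly 102,443 prompts and 180 flags, the rate is about 0.17%.
```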
Idea Lineage¶
Before o1: prompts made reasoning visible¶
o1's ancestry is a clear chain. GPT-3 made people believe large models could learn tasks from context. Chain-of-Thought Prompting made that ability appear as visible intermediate steps. Self-Consistency showed that multiple reasoning paths could check one another. Tree-of-Thoughts and program-aided reasoning extended the same idea into search, tools, and verification. By 2024, the field no longer doubted that “thinking a few more steps” was useful. The real question was whether reasoning should be triggered by user prompts, managed by external scaffolds, or learned by the model during training.
That question determines o1's place in the lineage. It is not the beginning of CoT, but the productization boundary of CoT: from “I ask the model to write steps” to “the model is trained to use steps internally.” This is why o1 had such a large industrial and research impact even without an open algorithm. It turned a research trick into the main axis of a frontier model.
In the present: o1 made reasoning a training and product object¶
o1's contemporary position can be summarized by two phrases: training-time scaling and test-time scaling. The first says that more RL training compute can make the model better at reasoning. The second says that giving the same model more thinking compute at answer time also makes it stronger. These axes turn reasoning from a prompt template into a system resource. In product terms, the user chooses a model that may be slower but more accurate. In engineering terms, the serving system must decide which requests deserve more internal budget. In safety terms, governance has to ask whether hidden reasoning is monitorable, faithful, and preserved under alignment training.
| Idea node | Representative work | What flows into o1 | o1's mutation |
|---|---|---|---|
| Few-shot scaling | GPT-3 | large models learn tasks from context | reasoning is not only contextual examples |
| Visible reasoning | Chain-of-Thought | intermediate steps improve math/symbolic tasks | steps become hidden internal traces |
| Sampling as search | Self-Consistency | multiple paths improve correctness | test-time compute becomes a product knob |
| RLHF | InstructGPT | RL can shape model behavior | RL target shifts toward hard-task reasoning |
| Frontier system card | GPT-4 Technical Report | closed models disclose capability via system cards | system cards also disclose reasoning risk |
After o1: open reproduction and test-time scaling research¶
After o1, follow-up work split into two broad lines. The first is behavioral reproduction: QwQ, Open-R1, Sky-T1, long-CoT distillation, and verifier/reranker work tried to reverse-engineer o1's idea from outputs. The second is method disclosure: DeepSeek-R1 in 2025 moved “RL-trained reasoning” into the open-source center, using GRPO, rule-based reward, and distillation to offer a much more discussable and reproducible route. DeepSeek-R1 did not copy o1, but it proved that o1 was not magic: with verifiable tasks, a strong base model, and a suitable RL process, long reasoning behavior can be trained.
That follow-up chain also changed o1's historical position. In September 2024, o1 looked like a black-box signal: reasoning can scale strongly with RL and test-time compute. After 2025, it looked more like the name of a paradigm: reasoning models, test-time compute, hidden CoT, deliberative alignment, and open reasoning distillation all organized around it. Its significance is not reproducible detail, but the reordering of research questions.
Common misreadings¶
| Misreading | Why it is tempting | More accurate reading |
|---|---|---|
| o1 disclosed an RL recipe | the official text says large-scale RL | reward, algorithm, and data are not public, so it is not reproducible |
| o1 equals longer CoT | users see that it “thought for a while” | the key is a trained reasoning strategy, not just more tokens |
| hidden CoT is necessarily faithful | CoT looks like model thought | the System Card says faithful CoT remains an open question |
| o1 is monotonically safer | many refusal and jailbreak metrics improve | stronger reasoning also raises CBRN, persuasion, and detailed risky-advice concerns |
Citation graph¶
graph LR
A[GPT-3 / few-shot learning] --> B[Chain-of-Thought prompting]
B --> C[Self-Consistency and verifier reranking]
C --> D[Tree-of-Thoughts / tool-augmented reasoning]
D --> E[OpenAI o1: RL-trained hidden reasoning]
F[InstructGPT / RLHF] --> E
G[GPT-4 Technical Report / system cards] --> H[OpenAI o1 System Card]
E --> H
E --> I[QwQ / open long-CoT models]
E --> J[DeepSeek-R1 / open reasoning RL]
H --> K[Deliberative alignment and CoT monitoring]
J --> L[Reasoning distillation and test-time scaling]
The most underrated edge in this graph is the one from system cards to deliberative alignment. o1 is not only a node in capability history. It is also a node in governance history: it made “the model can think” something that had to be written into safety reports, product explanations, and deployment thresholds.
Modern Perspective¶
Looking back from 2026: o1's real legacy¶
Looking back from 2026, o1's most important legacy is not any single benchmark number. It is the stabilization of “reasoning model” as a category. It made the industry accept several facts that later became common sense: a model can spend more compute before output; reinforcement learning can train hard-problem-solving strategies, not only human preference alignment; hidden CoT is both a capability source and a monitoring surface; safety evaluation cannot look only at final text, but must include instruction hierarchy, agent behavior, CBRN, persuasion, and model autonomy.
It also changed user expectations. Before o1, when users complained that a model was “slow,” that usually meant bad service. After o1, slowness can mean that the model is spending budget. Before o1, model products mostly competed on “which model is smarter.” After o1, systems also compete on “when is it worth thinking longer?” That affects UI, pricing, routing, evaluation, and safety audit.
Which assumptions no longer hold¶
| Old assumption | Question after o1 | New reality |
|---|---|---|
| pretraining scaling is the main axis | AIME curves show test-time compute also scales | inference-time scaling becomes its own research line |
| CoT should be shown directly to users | OpenAI hides raw CoT and shows summaries | readability, competitive boundary, and monitoring conflict |
| RLHF mostly optimizes preference | o1 shifts the RL story toward reasoning capability | RL-for-reasoning becomes a new training paradigm |
| safety evaluation can inspect final output | CoT deception and Apollo evaluations expose process risk | process monitoring and agent evals enter the safety stack |
The double edge of hidden CoT¶
Hidden CoT is one of o1's most controversial designs. From the user's side, it sacrifices transparency: the user cannot inspect how the model reasoned, only a summary or final answer. From the product side, it protects competitive advantage and avoids exposing raw, unstable, or potentially noncompliant thoughts. From the safety side, it offers a possible monitoring window: if CoT is faithful enough, researchers may catch deception, manipulation, overreach, or mistaken policy reasoning before the model acts.
The problem is that these three goals pull against each other. Too much visibility and preference training on CoT may make it look more like a human-facing explanation than a faithful process. Total hiding makes external audit difficult. o1 did not solve this contradiction; it made it central. One of the core governance problems for future reasoning models is how to balance monitorability, user trust, commercial boundary, and model performance in a way that can be verified.
Limitations and Outlook¶
Limitation 1: non-reproducibility¶
o1's biggest academic limitation is simple: it is not reproducible. The public pages do not provide the training algorithm, reward, data mixture, model size, reasoning-budget policy, sampling details, raw CoT format, or deployment system. Researchers can infer direction from behavior and curves, but they cannot directly verify OpenAI's technical claims. For an awesome-papers project, o1 belongs because it changed the research question; readers must still remember that it is not an open method like ResNet or Transformer that can be reimplemented from the paper.
Limitation 2: evaluation is not deployment¶
The System Card is rich, but it remains a limited window. AIME, Codeforces, GPQA, and MMMU measure gradable tasks; CBRN, persuasion, and model autonomy measure designed risk proxies; red teaming depends on tester exploration. Real use has longer-tail tasks, repeated interaction, toolchain composition, commercial pressure, and adaptive malicious users. OpenAI also states that Preparedness evaluations are a lower bound on potential risk: more prompting, fine-tuning, scaffolding, or longer rollouts could elicit behaviors outside the measured set.
| Limitation | Why it matters | Possible direction |
|---|---|---|
| recipe undisclosed | capability sources are hard to verify independently | open reasoning RL and distillation reproductions |
| CoT faithfulness unknown | monitoring rests on an unstable premise | process interpretability and anti-obfuscation evals |
| latency and cost high | reasoning budget cannot be unlimited | dynamic routing, budget control, small-model distillation |
| risk coverage limited | long-tail deployment is harder than benchmarks | continuous red teaming, incident reports, third-party audits |
Outlook: reasoning budget becomes a system resource¶
After o1, reasoning budget is likely to become a schedulable AI-system resource, alongside memory, GPU time, retrieval calls, and tool permissions. Future systems will not simply ask “which model should answer?” They will ask whether this problem deserves slow thinking, multiple samples, a verifier, tools, or human review. That shifts research from single-model capability toward end-to-end system design.
For the open-source community, the key question after DeepSeek-R1 is not whether o1 can be copied exactly. It is whether reasoning training, reasoning budget, and safety monitoring can become auditable infrastructure. For closed frontier labs, system cards will also face pressure to become more specific. Disclosure need not mean an open recipe, but outside observers need enough detail to understand risk boundaries, evaluation methods, and deployment conditions.
Related Work and Lessons¶
Relationship to DeepSeek-R1, Claude, and Gemini¶
DeepSeek-R1 is the most important public reference point after o1: it moved RL-for-reasoning from a black-box signal into a discussable method, especially through rule-based reward, GRPO, and distillation. Claude and Gemini show another route: they may not use o1-style system-card framing, but they also invest heavily in long context, tool use, agentic evaluation, and safety red teaming. o1's influence therefore extends beyond OpenAI. It pushed frontier-model competition from “larger pretraining” toward “better allocation of reasoning compute.”
Lessons for researchers and developers¶
For researchers, o1 suggests designing benchmarks that distinguish “knows the answer” from “can search for the answer,” and reporting test-time compute as a first-class variable. For developers, o1 suggests not treating a reasoning model as merely a slower chat model: it fits tasks that require verification, planning, and multi-step revision, not every low-cost interaction. For safety teams, o1 suggests that red teaming must cover hidden process, instruction hierarchy, toolchains, and long-horizon behavior.
Resources¶
Official materials and follow-up reading¶
| Resource | Link | How to read it |
|---|---|---|
| Learning to Reason with LLMs | https://openai.com/index/learning-to-reason-with-llms/ | capability curves and AIME/Codeforces/GPQA/MMMU numbers |
| OpenAI o1 System Card | https://openai.com/index/openai-o1-system-card/ | safety evaluations, Preparedness, and CoT monitoring |
| Deliberative alignment | https://openai.com/index/deliberative-alignment/ | how safety policy enters the reasoning process |
| DeepSeek-R1 | ../2025_deepseek_r1/ | the open RL counterpart to the o1 idea |
| SWE-bench Verified / MLE-bench | OpenAI and Preparedness evaluation materials | agentic software and ML engineering risk |
Putting o1 in a classic-papers list requires accepting an uncomfortable fact: it is not “reproducible” in the traditional paper sense, yet it changed everyone's research agenda like a classic paper. It made reasoning capability, test-time compute, hidden CoT, and safety governance four sides of the same problem.
🌐 Chinese version · 📚 awesome-papers project · CC-BY-NC