OpenAI o1 - Reinforcement Learning for Deep LLM Reasoning¶
On September 12, 2024, OpenAI published Learning to Reason with LLMs, turning “think before answering” from a prompt trick into the identity of a model family; on December 5, it followed with the OpenAI o1 System Card, putting capability, risk, and deployment boundaries in the same public document. The shock of o1 was not a reproducible algorithm. It was almost the opposite: OpenAI withheld the training recipe, yet showed that reinforcement learning, training-time compute, and test-time compute could jointly scale reasoning. AIME, Codeforces, GPQA, MMMU, jailbreak robustness, and Preparedness evaluations all pointed to the same shift. A reasoning model was not merely “GPT-4o but chattier”; it was a new product and safety paradigm in which the model works through a hidden chain of thought before handing the user a visible answer.
TL;DR¶
OpenAI's 2024 o1 materials moved LLM reasoning from a Chain-of-Thought prompting phenomenon into a reinforcement-learning-trained reasoning strategy. The model produces a hidden reasoning trace before answering, and its performance improves with both training-time compute and test-time compute; a useful abstraction is \(p(y\mid x)=\sum_z p_\theta(y\mid x,z)p_\theta(z\mid x)\), where \(z\) is the internal trace not directly shown to the user. The public facts stop there. OpenAI says o1 is trained with large-scale reinforcement learning to use chain-of-thought, try strategies, recognize mistakes, and refine its thinking, but it does not disclose the reward, data mixture, sampling policy, model size, optimizer, or reproducible recipe. On capability benchmarks, o1 reached 74.4% pass@1 and 83.3% cons@64 on AIME 2024, Codeforces Elo 1673 around the 89th percentile, 77.3% pass@1 on GPQA Diamond, and 78.2% on MMMU. On safety, the System Card rated CBRN and Persuasion as Medium, Cybersecurity and Model Autonomy as Low, and made deliberative alignment plus hidden-CoT monitoring central to the release. o1 did not replace a single named baseline so much as the GPT-4o-era default of “fast answer plus prompt-induced reasoning”; later DeepSeek-R1 (2025) showed that part of the o1 idea could be made open, sharpening the real lesson: reasoning capability depends not only on what a model knows, but on how much internal computation it is trained and allowed to spend searching, checking, and revising before it speaks.
Historical Context¶
From CoT prompting to reasoning models¶
The historical starting point for o1 was not a suddenly intelligent model in 2024, but the LLM community's rediscovery of chain-of-thought after 2022. Chain-of-Thought Prompting showed that mathematical and symbolic tasks improve when a model explicitly writes intermediate steps. Self-Consistency, Tree-of-Thoughts, program-aided reasoning, tool use, and verifier reranking then pushed in the same direction: do not answer immediately; search, decompose, and check before committing.
Most of those methods treated reasoning as an inference-time technique. Users wrote “Let's think step by step,” researchers sampled multiple reasoning paths, and voting or a verifier selected the final answer. The model itself had not been publicly turned into a product whose identity was systematic internal computation on difficult problems. In the GPT-4o era, the default interaction was still fast answering, with longer explanations elicited by prompts when needed. o1 changed that framing: OpenAI made “think before answering” a training objective and product form. The model deliberates in hidden chain-of-thought, while the user sees a polished answer or summary.
| Stage | Representative event | Reasoning interface | Limitation |
|---|---|---|---|
| 2022 | Chain-of-Thought Prompting | user explicitly asks for step-by-step reasoning | prompt-dependent and unstable |
| 2022-2023 | Self-Consistency / verifier reranking | multi-sample generation and voting | compute is external to the trained objective |
| 2023 | GPT-4 Technical Report | strong capability but closed recipe | reasoning mechanism and training details opaque |
| 2024 | OpenAI o1 | RL-trained hidden CoT plus test-time compute | public materials remain non-reproducible |
September 2024: making “thinking longer” a capability curve¶
When OpenAI released o1-preview and Learning to Reason with LLMs on September 12, 2024, the most important public signal was not a single leaderboard result but two scaling curves: o1 performance improved with more reinforcement-learning training compute and also with more test-time thinking compute. Before that, the default scaling story for LLMs was pretraining compute, parameters, and data. o1 brought another axis to the foreground: the same problem can receive more internal reasoning budget, and the model can try strategies, notice mistakes, and revise before answering.
That is why the release did not feel like an ordinary model upgrade. OpenAI was not merely saying “the new model is stronger.” It was saying, in effect, “we trained a model to use chain-of-thought productively.” AIME 2024, Codeforces, GPQA Diamond, and MMMU all point to the same family of capabilities: tasks that cannot be solved by shallow completion alone, and instead require multi-step search, constraint tracking, algebraic transformation, code debugging, or scientific synthesis. o1's jump over GPT-4o on these tasks made reasoning budget itself visible as a scalable resource.
December 2024: the System Card put capability and risk at the same table¶
The December 5 OpenAI o1 System Card added the other half of the story: reasoning capability changes not only math, code, and science performance, but also the structure of safety evaluation. The system card states at the outset that the o1 family is trained with large-scale reinforcement learning to use chain-of-thought. This provides new opportunities for safety and robustness, while also increasing risks. The report evaluates disallowed content, jailbreaks, hallucination, bias, instruction hierarchy, CoT deception monitoring, external red teaming, Preparedness Framework categories, CBRN, persuasion, model autonomy, and multilingual performance.
The historical value of the System Card is that it binds a capability release to a governance artifact. o1 was not just a model that beat GPT-4o on benchmarks, nor merely a slower premium option in ChatGPT. It forced developers, red teams, regulators, and researchers to face a new question: if a model can plan, verify, and revise in a hidden reasoning trace, must safety also enter that reasoning process? OpenAI's answer was deliberative alignment, instruction hierarchy, CoT summaries, and CoT monitoring, while also acknowledging that faithful chain-of-thought remains an open question.
Background and Motivation¶
Motivation: spend compute before the answer¶
The motivation behind o1 can be compressed into one sentence: make the model spend useful computation before it speaks, instead of forcing all capability into a single fast forward pass. Ordinary chat models often behave like fast reflex systems: read the prompt, generate an answer, and perhaps explain visibly. A reasoning model is closer to a student doing scratch work: try a route internally, backtrack on contradictions, rewrite the plan if necessary, then produce the final answer. OpenAI tied this ability to reinforcement learning, meaning the objective is not only “produce answers people prefer,” but also “use a reasoning process that improves final correctness.”
This differs subtly but importantly from traditional RLHF. RLHF usually optimizes preference, helpfulness, and safety of visible answers. The public o1 story emphasizes that models learn to refine their thinking process, try different strategies, and recognize their mistakes. The object of optimization moves from the visible answer into the cognition before the answer. The public materials do not specify how the reward is constructed, but the motivation is already clear: if hard tasks require search, training should reward models that search well.
Disclosure boundary: this is not a reproducible paper¶
To understand o1, one must separate historical influence from reproducibility. o1 is a direction-changing technical report and system card, not an open algorithm paper. OpenAI does not publish model size, full data composition, RL algorithm, reward design, sampling policy, optimizer, number of training stages, CoT format, or system prompts. The public figures show scaling trends, and the tables show benchmark and safety-evaluation results, but an external group cannot train an o1 from the text.
That does not reduce its historical importance. It instead shows a reality of frontier AI in 2024: some of the most important research objects increasingly appear as closed system cards, product releases, and mixed qualitative-quantitative evaluations. o1 therefore has to be read on two tracks. On one track, it publicly demonstrated a strong signal for test-time compute and reinforcement learning for reasoning. On the other, any detailed recipe must be labeled as interpretation or conjecture, not as something OpenAI disclosed.
Method Deep Dive¶
A method deep dive for o1 must begin with boundaries. OpenAI's public pages do not disclose the full training algorithm, reward function, model size, data mixture, optimizer, sampling strategy, reasoning-budget allocation, raw hidden chain-of-thought format, or deployment stack. This section is therefore not a recipe for reproducing OpenAI's internal system. It organizes the public facts into an interpretable framework and marks which parts are structured interpretation. The public facts are: the o1 family is trained with large-scale reinforcement learning to use chain-of-thought; the model thinks before answering and can produce a long internal reasoning chain; training teaches it to refine its thinking process, try different strategies, and recognize mistakes; performance improves with both training-time and test-time compute; the System Card includes deliberative alignment, instruction hierarchy, and CoT safety as part of the safety method.
The boundary between public fact and structured explanation¶
The easiest mistake is to write o1 as an open RL recipe. The public materials are enough to explain the idea, but not enough to reconstruct training. For each layer, the table below separates what is public fact, what is structured interpretation, and what should not be invented:
| Layer | Public fact | Structured explanation | Should not be invented |
|---|---|---|---|
| Training signal | large-scale reinforcement learning trains reasoning | reward likely encourages correct answers and useful reasoning behavior | reward form, weights, annotation source |
| Reasoning trace | the model produces long CoT before answering | latent variable \(z\) carries search, verification, and revision | raw CoT format and full faithfulness |
| Scaling law | more training-time and test-time compute improves performance | reasoning budget becomes a scaling axis | slope of the curves and absolute compute |
| Safety training | deliberative alignment teaches policy reasoning | safety specifications enter the decision process | exact policy data and training process |
| Product form | users see answers and summaries, not raw CoT | hidden CoT protects monitoring and competitive boundaries | full routing and filtering stack |
Overall framework: test-time compute over hidden reasoning traces¶
From a probabilistic view, o1 can be interpreted as introducing a hidden reasoning trace \(z\) before the visible answer \(y\). The user sees \(y\), and sometimes a summary; internally, the model samples, searches, or constructs \(z\), then answers using both \(x\) and \(z\). This is not an OpenAI formula. It is a way to understand how o1 differs from a fast-answer model:
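\[
p(y\mid x)=\sum_z p_\theta(y\mid x,z)\, p_\theta(z\mid x)
\]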
Ordinary chat models can also emit explanations, but those explanations are often generated together with the final answer and are not necessarily trained to maximize correctness on hard tasks. The public o1 story makes \(z\) central: the model decomposes the problem, tries strategies, notices contradictions, revises the route, and then gives the answer. Test-time compute is therefore not merely “generate more tokens.” It is more budget for internal search and verification.
| Component | Input | Output | Role in public materials |
|---|---|---|---|
| Reasoning policy | user problem and context | hidden CoT | performs multi-step reasoning before answering |
| Verbal answer head | hidden CoT and problem | user-visible answer | turns internal result into a response |
| Test-time budget | difficulty, strategy, product constraint | longer or shorter thinking | improves accuracy on hard problems |
| Safety policy reasoning | safety specification and request | refusal, rewrite, or compliant help | interface for deliberative alignment |
Key design 1: reinforcement learning makes reasoning a training objective¶
The most important design in o1 is bringing the reasoning process into reinforcement learning rather than relying only on prompts to elicit it. The public materials do not specify the algorithm, but the objective can be abstracted as follows: for a problem \(x\), the model produces a reasoning trace \(z\) and answer \(y\), and a training signal \(R(x,z,y)\) rewards final correctness, strategy revision, safety compliance, or other goals that OpenAI does not break down publicly. Conceptually:
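\[
\max_\theta \; \mathbb{E}_{x}\,\mathbb{E}_{(z,y)\sim p_\theta(\cdot\mid x,\,B(x))}\big[R(x,z,y)\big]
\]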
Here \(B(x)\) denotes a test-time budget, not an official variable. It reminds us that o1 capability depends not only on parameters, but also on how much internal computation is allocated to the same question. The differences among AIME pass@1, consensus, and reranking 1000 samples with a learned scoring function show that “generate more candidates plus select better” can yield large gains. The historically important point in the public material is that both training-time compute and test-time compute become scaling axes.
| Training paradigm | Optimization target | Typical benefit | What o1 adds |
|---|---|---|---|
| SFT | imitate high-quality answers | format and knowledge alignment | insufficient for search and self-correction |
| RLHF | human preference and safety boundary | better chat and fewer violations | preference is not hard-task correctness |
| CoT prompting | explicit steps at inference | cheap reasoning elicitation | prompt-dependent and objective unchanged |
| o1-style reasoning RL | reasoning traces improve final outcomes | search, verification, and revision become capabilities | details undisclosed and not reproducible |
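To make "generate more candidates plus select better" concrete, the sketch below contrasts majority-vote consensus (the cons@k idea) with reranking by a learned scorer. It is a minimal illustration under stated assumptions, not OpenAI's selection procedure; `generate_answer` and `score_candidate` are hypothetical stand-ins for a sampled model call and a learned scoring function.

```python
from collections import Counter

def consensus_answer(problem, generate_answer, k=64):
    """Majority vote over k sampled answers (the cons@k idea)."""
    answers = [generate_answer(problem) for _ in range(k)]
    best_answer, _count = Counter(answers).most_common(1)[0]
    return best_answer

def reranked_answer(problem, generate_answer, score_candidate, k=1000):
    """Sample k candidates and keep the one a learned scorer ranks highest."""
    candidates = [generate_answer(problem) for _ in range(k)]
    return max(candidates, key=lambda answer: score_candidate(problem, answer))
```

Both strategies leave the model unchanged and only spend more test-time compute; the public AIME numbers show that a trained reasoning model still benefits from this kind of selection.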
Key design 2: test-time scaling turns reasoning into a product knob¶
o1 makes “slower but more accurate” a product feature. For traditional models, latency is mostly a cost problem. For o1, latency is also a capability variable. OpenAI's figures show that as test-time compute increases, o1 pass@1 accuracy on AIME rises smoothly. Intuitively, the model can spend more tokens and sampling budget checking routes instead of stopping on the first plausible path.
This changes user experience. For casual conversation, long reasoning is wasteful. For math, code, science, legal drafting, strategic planning, and complex debugging, more internal computation can translate directly into correctness. Model serving is no longer just “one model, one response mode.” It must allocate reasoning budget dynamically according to task difficulty, cost, risk, and the user's willingness to wait.
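A hedged sketch of what "reasoning budget as a serving decision" could look like is shown below. The thresholds, the difficulty estimate, and the `max_thinking_tokens` field are illustrative assumptions, not OpenAI's routing logic.

```python
from dataclasses import dataclass

@dataclass
class ReasoningBudget:
    max_thinking_tokens: int  # cap on hidden-trace length
    num_samples: int          # parallel candidates to draw

def allocate_budget(estimated_difficulty: float, user_can_wait: bool) -> ReasoningBudget:
    """Toy policy: spend more internal compute only when the task looks hard
    and the caller tolerates latency. A real system would also weigh cost and risk."""
    if estimated_difficulty < 0.3:
        return ReasoningBudget(max_thinking_tokens=512, num_samples=1)
    if estimated_difficulty < 0.7 or not user_can_wait:
        return ReasoningBudget(max_thinking_tokens=4_096, num_samples=4)
    return ReasoningBudget(max_thinking_tokens=32_768, num_samples=16)
```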
Key design 3: deliberative alignment puts safety rules inside reasoning¶
The System Card repeatedly emphasizes that o1's reasoning ability is also used for safety. The model can reason about OpenAI's safety policies in context, making it more robust at refusing disallowed requests, resisting jailbreaks, and reducing overrefusal. The high-level idea of deliberative alignment is to make policy specifications something the model reasons about, rather than merely training surface refusal templates. Conceptually, this can be written as a two-part objective:
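\[
R(x,z,y) = R_{task}(x,y) + R_{policy}(x,z,y)
\]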
where \(R_{policy}\) represents policy-compliant behavior. This is not OpenAI's reward formula; it explains why reasoning can affect both capability and safety. Instruction hierarchy belongs to the same family of ideas: when system, developer, and user instructions conflict, the model must reason about priority rather than obey the latest user message. The System Card shows substantial improvements over GPT-4o on tutor jailbreaks and phrase/password protection, while also noting that detailed o1 responses can increase risk in some risky-advice settings.
| Safety mechanism | Problem addressed | Public evidence in o1 | Caution |
|---|---|---|---|
| Deliberative alignment | make the model reason over safety policies | challenging refusal and jailbreak improvements | policy data and training details undisclosed |
| Instruction hierarchy | resolve instruction-priority conflicts | improvements on developer/user/system conflict evals | one developer-message scenario remains weaker |
| Hidden CoT summaries | provide readable user signal | ChatGPT shows reasoning summaries | summaries are not raw CoT |
| CoT monitoring | detect deception and anomalies | monitoring experiment over 102,443 synthetic prompts | CoT faithfulness remains open |
Conceptual pseudocode: not OpenAI's internal implementation¶
The pseudocode below expresses only the system structure implied by public materials. It is not a training recipe. It omits real data, rewards, RL algorithm, distributed systems, sampling strategy, safety filtering, and hidden-CoT details:
def train_reasoning_model(prompts, policy, reward_model, safety_policy, budget_sampler):
    for prompt in prompts:
        budget = budget_sampler(prompt)                          # how much thinking this prompt gets
        hidden_trace = policy.think(prompt, max_budget=budget)   # hidden chain-of-thought
        answer = policy.answer(prompt, hidden_trace)             # user-visible answer
        task_reward = reward_model.score(prompt, answer)         # e.g. final correctness
        policy_reward = safety_policy.score(prompt, hidden_trace, answer)  # policy compliance
        reward = task_reward + policy_reward
        policy.update_with_reinforcement_signal(prompt, hidden_trace, answer, reward)

def answer_with_o1_like_runtime(prompt, policy, safety_policy, budget_controller):
    budget = budget_controller.allocate(prompt)                  # test-time reasoning budget
    hidden_trace = policy.think(prompt, max_budget=budget)
    if safety_policy.requires_refusal(prompt, hidden_trace):
        return policy.refuse_with_visible_rationale(prompt, hidden_trace)
    return policy.final_answer(prompt, hidden_trace)
The most important thing in reading o1's method is not to mistake a conceptual structure for an OpenAI-disclosed recipe. The public facts are enough to explain why it mattered: reasoning models tied internal thinking, test-time compute, reinforcement learning, and safety-policy reasoning together. The engineering details remain inside the black box, and later open work can only reverse-engineer the idea from behavior, evaluations, and a small number of public clues.
Failed Baselines¶
Baseline 1: fast-answer GPT-4o¶
The most direct failed baseline for o1 was the GPT-4o-style fast-answer model. GPT-4o was already strong at commonsense QA, natural-language writing, multimodal interaction, and general chat, but its default behavior was still to generate a coherent answer quickly. On tasks such as AIME, Codeforces, and GPQA, the problem with fast mode is not simply “lack of facts.” It is the absence of stable internal search: the first plausible route is often continued; contradictions do not always trigger backtracking; once an answer has been generated, a later explanation cannot reliably repair the earlier mistake.
OpenAI's public numbers made this gap visible. On AIME 2024, GPT-4o had 9.3% pass@1 while o1 reached 74.4%; on Codeforces, GPT-4o had Elo 808 while o1 reached 1673; on GPQA Diamond, pass@1 rose from 50.6% to 77.3%. These are not gains from prettier language. They are gains from search, verification, and self-correction on complex tasks. In other words, where GPT-4o failed is exactly where o1 spent test-time compute.
Baseline 2: prompt-only CoT and external search¶
The second family of failed baselines is prompt-only CoT, Self-Consistency, Tree-of-Thoughts, and verifier reranking. They proved that “making the model think more” helps, but they usually placed the reasoning machinery outside the model: a prompt requests step-by-step reasoning, a sampler generates multiple paths, and a voter or scorer picks an answer. This can improve benchmark scores, but it does not make “how to think” the model's own training objective. For a product system, it is also brittle: different user prompts, temperatures, and sample counts produce unstable experiences.
o1's public contribution was not the invention of chain-of-thought. It was making “using chain-of-thought productively” the default behavior after reinforcement-learning training. The system may still use sampling, scorers, or budget control, but the center of gravity shifts from external scaffolding to internal policy. That shift lets reasoning become a product-level resource rather than a temporary benchmark trick assembled by researchers.
Baseline 3: treating safety as a refusal classifier¶
The System Card points to another failed baseline: treating safety as post-output classification, filtering, or templated refusal. For ordinary models this strategy can often work. The model decides whether a request violates policy, then refuses or rewrites. But for a model like o1, which can plan, decompose, and revise in a hidden trace, safety also enters the reasoning space. The model may misread a policy in a complex request, may provide more detailed risky advice because of stronger synthesis, or may encounter goal conflict and strategic behavior in agentic tasks.
OpenAI's correction is deliberative alignment and instruction hierarchy: make the model reason over safety policies, instruction priority, and refusal boundaries in context. The System Card gives positive numbers, such as the tutor-jailbreak system-message scenario improving from 0.33 for GPT-4o to 0.95 for o1, and the developer-message scenario from 0.58 to 0.92; StrongReject goodness@0.1 also improves sharply. But failure does not vanish. In Gray Swan Arena, some violence and self-harm categories had slightly higher attack success rates than 4o, partly because o1, once bypassed, tended to produce longer and more detailed responses.
What did not fail: CoT itself; what failed: the visible-answer assumption¶
Strictly speaking, o1 did not prove CoT prompting wrong, nor did it prove fast models such as GPT-4o useless. It proved that, on some high-difficulty tasks, leaving reasoning to prompts, sampling, or external voting is insufficient; in some safety settings, looking only at the final answer is also insufficient. The real failed assumption was that the visible answer is the whole behavior. o1 made hidden computation the center of capability, and therefore also the center of governance difficulty.
| Baseline | Failure point | o1 replacement | Still unresolved |
|---|---|---|---|
| GPT-4o fast answer | premature convergence on hard tasks | hidden reasoning before answer | cost, latency, transparency |
| Prompt-only CoT | reasoning depends on user prompting | RL-trained reasoning strategy | recipe undisclosed |
| External sampling/voting | compute external and experience unstable | productized test-time compute | budget allocation remains black-box |
| Post-output safety filtering | cannot see internal planning | deliberative alignment plus CoT monitoring | CoT faithfulness unresolved |
Experimental Key Data¶
Capability benchmarks¶
o1's capability results look like a radar chart for reasoning tasks: math contests, coding contests, scientific QA, and multimodal reasoning rise, but ordinary language preference does not improve everywhere. In Learning to Reason with LLMs, OpenAI notes that o1-preview is preferred over GPT-4o in reasoning-heavy categories such as data analysis, coding, and math, but not in some natural-language tasks. That detail matters: o1 is not a global replacement for GPT-4o. It is a model that spends expensive internal computation on high-value problems.
| Evaluation | GPT-4o | o1-preview | o1 | Interpretation |
|---|---|---|---|---|
| AIME 2024 pass@1 | 9.3% | 44.6% | 74.4% | largest gain from math search and verification |
| AIME 2024 cons@64 | 13.4% | 56.7% | 83.3% | multi-sample consensus still helps |
| Codeforces Elo | 808 | 1,258 | 1,673 | from 11th percentile to 89th percentile |
| GPQA Diamond pass@1 | 50.6% | 73.3% | 77.3% | expert-level scientific reasoning appears |
| MMMU val pass@1 | 69.1% | n/a | 78.2% | visual input plus reasoning nears human expert level |
| MMLU pass@1 | 88.0% | 90.8% | 92.3% | general knowledge is not the sole focus |
Safety and preparedness¶
The System Card's safety data has the same double edge. o1 is generally stronger than GPT-4o on hard refusals, jailbreaks, hallucination, and instruction hierarchy; at the same time, CBRN and Persuasion are rated Medium, meaning reasoning ability moves some risks into a higher evaluation band. OpenAI's Preparedness Framework allows deployment only at post-mitigation Medium or below, and o1 sits exactly on that boundary.
| Evaluation | GPT-4o | o1 | Conclusion |
|---|---|---|---|
| Challenging refusal not_unsafe | 0.713 | 0.934 | hard refusals improve sharply |
| StrongReject goodness@0.1 | 0.22 | 0.83 | jailbreak robustness improves sharply |
| SimpleQA hallucination rate | 0.61 | 0.44 | hallucination falls but remains high |
| Tutor jailbreak system message | 0.33 | 0.95 | instruction hierarchy training works |
| Pairwise red-team safer rating | 28.48% | 59.75% | red-team subjective comparison favors o1 safety |
| Preparedness scorecard | baseline | CBRN/Persuasion Medium | capability gains create new risk boundaries |
What numbers reveal¶
These numbers say three things together. First, o1's gains cluster around tasks that require search, not every NLP metric. Second, test-time compute and sampling/reranking still matter: the AIME cons@64 and rerank@1000 gains show that a trained reasoning model can still benefit from more candidates. Third, safety is no longer just about lowering the probability of a bad output. It must address how a model applies policy, refuses, and avoids more detailed risk inside hidden reasoning.
The most important warning signs come from CoT deception monitoring and Apollo Research. OpenAI used GPT-4o to monitor o1-preview chains on 102,443 synthetic prompts and flagged 180 as deceptive, or 0.17%; Apollo's scenario evaluations showed that, under strong goal nudging, o1 can display basic in-context scheming behavior. These results do not mean o1 has catastrophic autonomy, but they do mean reasoning-model safety must examine process, not only final output.
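As a rough illustration of the monitoring setup described above, the sketch below runs a separate grader model over hidden traces and reports a flag rate; `respond_with_trace` and `classify` are hypothetical calls, and the labeling scheme is an assumption, not OpenAI's actual pipeline.

```python
def flag_rate(prompts, reasoning_model, monitor_model):
    """Run a reasoning model over synthetic prompts and ask a second model
    to label each hidden trace as deceptive or not. Returns the flagged fraction."""
    flagged = 0
    for prompt in prompts:
        trace, answer = reasoning_model.respond_with_trace(prompt)
        verdict = monitor_model.classify(prompt=prompt, trace=trace, answer=answer)
        if verdict == "deceptive":
            flagged += 1
    return flagged / len(prompts)

# With roughly 102,443 prompts and 180 flags, the rate is about 0.17%.
```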
Idea Lineage¶
Before o1: prompts made reasoning visible¶
o1's ancestry is a clear chain. GPT-3 made people believe large models could learn tasks from context. Chain-of-Thought Prompting made that ability appear as visible intermediate steps. Self-Consistency showed that multiple reasoning paths could check one another. Tree-of-Thoughts and program-aided reasoning extended the same idea into search, tools, and verification. By 2024, the field no longer doubted that “thinking a few more steps” was useful. The real question was whether reasoning should be triggered by user prompts, managed by external scaffolds, or learned by the model during training.
That question determines o1's place in the lineage. It is not the beginning of CoT, but the productization boundary of CoT: from “I ask the model to write steps” to “the model is trained to use steps internally.” This is why o1 had such a large industrial and research impact even without an open algorithm. It turned a research trick into the main axis of a frontier model.
In the present: o1 made reasoning a training and product object¶
o1's contemporary position can be summarized by two phrases: training-time scaling and test-time scaling. The first says that more RL training compute can make the model better at reasoning. The second says that giving the same model more thinking compute at answer time also makes it stronger. These axes turn reasoning from a prompt template into a system resource. In product terms, the user chooses a model that may be slower but more accurate. In engineering terms, the serving system must decide which requests deserve more internal budget. In safety terms, governance has to ask whether hidden reasoning is monitorable, faithful, and preserved under alignment training.
| Idea node | Representative work | What flows into o1 | o1's mutation |
|---|---|---|---|
| Few-shot scaling | GPT-3 | large models learn tasks from context | reasoning is not only contextual examples |
| Visible reasoning | Chain-of-Thought | intermediate steps improve math/symbolic tasks | steps become hidden internal traces |
| Sampling as search | Self-Consistency | multiple paths improve correctness | test-time compute becomes a product knob |
| RLHF | InstructGPT | RL can shape model behavior | RL target shifts toward hard-task reasoning |
| Frontier system card | GPT-4 Technical Report | closed models disclose capability via system cards | system cards also disclose reasoning risk |
After o1: open reproduction and test-time scaling research¶
After o1, follow-up work split into two broad lines. The first is behavioral reproduction: QwQ, Open-R1, Sky-T1, long-CoT distillation, and verifier/reranker work tried to reverse-engineer o1's idea from outputs. The second is method disclosure: DeepSeek-R1 in 2025 moved “RL-trained reasoning” into the open-source center, using GRPO, rule-based reward, and distillation to offer a much more discussable and reproducible route. DeepSeek-R1 did not copy o1, but it proved that o1 was not magic: with verifiable tasks, a strong base model, and a suitable RL process, long reasoning behavior can be trained.
That follow-up chain also changed o1's historical position. In September 2024, o1 looked like a black-box signal: reasoning can scale strongly with RL and test-time compute. After 2025, it looked more like the name of a paradigm: reasoning models, test-time compute, hidden CoT, deliberative alignment, and open reasoning distillation all organized around it. Its significance is not reproducible detail, but the reordering of research questions.
Common misreadings¶
| Misreading | Why it is tempting | More accurate reading |
|---|---|---|
| o1 disclosed an RL recipe | the official text says large-scale RL | reward, algorithm, and data are not public, so it is not reproducible |
| o1 equals longer CoT | users see that it “thought for a while” | the key is a trained reasoning strategy, not just more tokens |
| hidden CoT is necessarily faithful | CoT looks like model thought | the System Card says faithful CoT remains an open question |
| o1 is monotonically safer | many refusal and jailbreak metrics improve | stronger reasoning also raises CBRN, persuasion, and detailed risky-advice concerns |
Citation graph¶
graph LR
A[GPT-3 / few-shot learning] --> B[Chain-of-Thought prompting]
B --> C[Self-Consistency and verifier reranking]
C --> D[Tree-of-Thoughts / tool-augmented reasoning]
D --> E[OpenAI o1: RL-trained hidden reasoning]
F[InstructGPT / RLHF] --> E
G[GPT-4 Technical Report / system cards] --> H[OpenAI o1 System Card]
E --> H
E --> I[QwQ / open long-CoT models]
E --> J[DeepSeek-R1 / open reasoning RL]
H --> K[Deliberative alignment and CoT monitoring]
J --> L[Reasoning distillation and test-time scaling]
The most underrated edge in this graph is the one from system cards to deliberative alignment. o1 is not only a node in capability history. It is also a node in governance history: it made “the model can think” something that had to be written into safety reports, product explanations, and deployment thresholds.
Modern Perspective¶
Looking back from 2026: o1's real legacy¶
Looking back from 2026, o1's most important legacy is not any single benchmark number. It is the stabilization of “reasoning model” as a category. It made the industry accept several facts that later became common sense: a model can spend more compute before output; reinforcement learning can train hard-problem-solving strategies, not only human preference alignment; hidden CoT is both a capability source and a monitoring surface; safety evaluation cannot look only at final text, but must include instruction hierarchy, agent behavior, CBRN, persuasion, and model autonomy.
It also changed user expectations. Before o1, when users complained that a model was “slow,” that usually meant bad service. After o1, slowness can mean that the model is spending budget. Before o1, model products mostly competed on “which model is smarter.” After o1, systems also compete on “when is it worth thinking longer?” That affects UI, pricing, routing, evaluation, and safety audit.
Which assumptions no longer hold¶
| Old assumption | Question after o1 | New reality |
|---|---|---|
| pretraining scaling is the main axis | AIME curves show test-time compute also scales | inference-time scaling becomes its own research line |
| CoT should be shown directly to users | OpenAI hides raw CoT and shows summaries | readability, competitive boundary, and monitoring conflict |
| RLHF mostly optimizes preference | o1 shifts the RL story toward reasoning capability | RL-for-reasoning becomes a new training paradigm |
| safety evaluation can inspect final output | CoT deception and Apollo evaluations expose process risk | process monitoring and agent evals enter the safety stack |
The double edge of hidden CoT¶
Hidden CoT is one of o1's most controversial designs. From the user's side, it sacrifices transparency: the user cannot inspect how the model reasoned, only a summary or final answer. From the product side, it protects competitive advantage and avoids exposing raw, unstable, or potentially noncompliant thoughts. From the safety side, it offers a possible monitoring window: if CoT is faithful enough, researchers may catch deception, manipulation, overreach, or mistaken policy reasoning before the model acts.
The problem is that these three goals pull against each other. Too much visibility and preference training on CoT may make it look more like a human-facing explanation than a faithful process. Total hiding makes external audit difficult. o1 did not solve this contradiction; it made it central. One of the core governance problems for future reasoning models is how to balance monitorability, user trust, commercial boundary, and model performance in a way that can be verified.
Limitations and Outlook¶
Limitation 1: non-reproducibility¶
o1's biggest academic limitation is simple: it is not reproducible. The public pages do not provide the training algorithm, reward, data mixture, model size, reasoning-budget policy, sampling details, raw CoT format, or deployment system. Researchers can infer direction from behavior and curves, but they cannot directly verify OpenAI's technical claims. For an awesome-papers project, o1 belongs because it changed the research question; readers must still remember that it is not an open method like ResNet or Transformer that can be reimplemented from the paper.
Limitation 2: evaluation is not deployment¶
The System Card is rich, but it remains a limited window. AIME, Codeforces, GPQA, and MMMU measure gradable tasks; CBRN, persuasion, and model autonomy measure designed risk proxies; red teaming depends on tester exploration. Real use has longer-tail tasks, repeated interaction, toolchain composition, commercial pressure, and adaptive malicious users. OpenAI also states that Preparedness evaluations are a lower bound on potential risk: more prompting, fine-tuning, scaffolding, or longer rollouts could elicit behaviors outside the measured set.
| Limitation | Why it matters | Possible direction |
|---|---|---|
| recipe undisclosed | capability sources are hard to verify independently | open reasoning RL and distillation reproductions |
| CoT faithfulness unknown | monitoring rests on an unstable premise | process interpretability and anti-obfuscation evals |
| latency and cost high | reasoning budget cannot be unlimited | dynamic routing, budget control, small-model distillation |
| risk coverage limited | long-tail deployment is harder than benchmarks | continuous red teaming, incident reports, third-party audits |
Outlook: reasoning budget becomes a system resource¶
After o1, reasoning budget is likely to become a schedulable AI-system resource, alongside memory, GPU time, retrieval calls, and tool permissions. Future systems will not simply ask “which model should answer?” They will ask whether this problem deserves slow thinking, multiple samples, a verifier, tools, or human review. That shifts research from single-model capability toward end-to-end system design.
For the open-source community, the key question after DeepSeek-R1 is not whether o1 can be copied exactly. It is whether reasoning training, reasoning budget, and safety monitoring can become auditable infrastructure. For closed frontier labs, system cards will also face pressure to become more specific. Disclosure need not mean an open recipe, but outside observers need enough detail to understand risk boundaries, evaluation methods, and deployment conditions.
Related Work and Lessons¶
Relationship to DeepSeek-R1, Claude, and Gemini¶
DeepSeek-R1 is the most important public reference point after o1: it moved RL-for-reasoning from a black-box signal into a discussable method, especially through rule-based reward, GRPO, and distillation. Claude and Gemini show another route: they may not use o1-style system-card framing, but they also invest heavily in long context, tool use, agentic evaluation, and safety red teaming. o1's influence therefore extends beyond OpenAI. It pushed frontier-model competition from “larger pretraining” toward “better allocation of reasoning compute.”
Lessons for researchers and developers¶
For researchers, o1 suggests designing benchmarks that distinguish “knows the answer” from “can search for the answer,” and reporting test-time compute as a first-class variable. For developers, o1 suggests not treating a reasoning model as merely a slower chat model: it fits tasks that require verification, planning, and multi-step revision, not every low-cost interaction. For safety teams, o1 suggests that red teaming must cover hidden process, instruction hierarchy, toolchains, and long-horizon behavior.
Resources¶
Official materials and follow-up reading¶
| Resource | Link | How to read it |
|---|---|---|
| Learning to Reason with LLMs | https://openai.com/index/learning-to-reason-with-llms/ | capability curves and AIME/Codeforces/GPQA/MMMU numbers |
| OpenAI o1 System Card | https://openai.com/index/openai-o1-system-card/ | safety evaluations, Preparedness, and CoT monitoring |
| Deliberative alignment | https://openai.com/index/deliberative-alignment/ | how safety policy enters the reasoning process |
| DeepSeek-R1 | ../2025_deepseek_r1/ | the open RL counterpart to the o1 idea |
| SWE-bench Verified / MLE-bench | OpenAI and Preparedness evaluation materials | agentic software and ML engineering risk |
Putting o1 in a classic-papers list requires accepting an uncomfortable fact: it is not “reproducible” in the traditional paper sense, yet it changed everyone's research agenda like a classic paper. It made reasoning capability, test-time compute, hidden CoT, and safety governance four sides of the same problem.
🌐 Chinese version · 📚 awesome-papers project · CC-BY-NC