CODEI/O: Condensing Reasoning Patterns via Code Input-Output Prediction

Junlong Li 1 2 *, Runxin Xu 1, Yu Wu 1, Junxian He 3

Abstract

Reasoning is a fundamental capability of Large Language Models. While prior research predominantly focuses on enhancing narrow skills like math or code generation, improving performance on many other reasoning tasks remains challenging due to sparse and fragmented training data. To address this issue, we propose CODEI/O, a novel approach that systematically condenses diverse reasoning patterns inherently embedded in contextually-grounded codes, through transforming the original code into a code input-output prediction format. By training models to predict inputs/outputs given code and test cases entirely in natural language as Chain-of-Thought (CoT) rationales, we expose them to universal reasoning primitives, like logic flow planning, state-space searching, decision tree traversal, and modular decomposition, while decoupling structured reasoning from code-specific syntax and preserving procedural rigor. Experimental results demonstrate CODEI/O leads to consistent improvements across symbolic, scientific, logic, math & numerical, and commonsense reasoning tasks. By matching the existing ground-truth outputs or re-executing the code with predicted inputs, we can verify each prediction and further enhance the CoTs through multi-turn revision, resulting in CODEI/O++ and achieving higher performance. Our data and models are available at https://github.com/hkust-nlp/CodeIO.

* Work done during internship at DeepSeek-AI. 1 DeepSeek-AI. 2 Shanghai Jiao Tong University. 3 HKUST. Correspondence to: Junlong Li, Junxian He.

1. Introduction

Reasoning is a fundamental aspect of human cognition and problem-solving, forming the basis for quickly transferring and adapting to new tasks (Dehaene et al., 2004; Knauff & Wolf, 2010; Wang & Chiew, 2010). It is also recognized as a cornerstone of advanced Large Language Models (LLMs) and a critical step toward achieving Artificial General Intelligence (AGI) (Huang & Chang, 2022; Qiao et al., 2022; Jaech et al., 2024; Xiang et al., 2025). Current approaches, however, face a fundamental paradox: while tasks like math problem solving (Shao et al., 2024; Yang et al., 2024; Zeng et al., 2024; Ying et al., 2024; Toshniwal et al., 2024) and code generation (Roziere et al., 2023; Mistral-AI, 2024; Zhu et al., 2024; Hui et al., 2024) benefit from abundant structured training data, most other reasoning domains, including logical deduction, scientific inference, and symbolic reasoning, suffer from sparse and fragmented supervision signals. As a result, it becomes crucial to identify training data that is rich in diverse reasoning patterns while remaining scalable to obtain.
We believe that real-world code programs reflect the integration of a wide range of reasoning patterns across diverse contexts, making them an ideal source for training while minimizing the risk of overfitting. However, conventional continual pre-training on raw code is suboptimal because the relevant reasoning signals are often implicit and intertwined with noisy information. Even the cleaner objective of directly training on text-to-code generation faces challenges, as it is constrained by the requirement to generate code-specific syntax, making it difficult to generalize to tasks beyond code-specific ones. To address such limitations, we convert raw code into executable functions and design a more straightforward task: given a function along with its corresponding textual query, the model needs to predict either the execution outputs given inputs or feasible inputs given outputs, entirely in natural language as CoT rationales. This approach aims to disentangle core reasoning flow from code-specific syntax while preserving logical rigor. By gathering and transforming functions from diverse sources, the resulting data incorporates a variety of foundational reasoning skills, such as logic flow orchestration, state-space exploration, recursive decomposition, and decision-making. Learning from these samples across the diverse contexts provided by the raw code files enables models to gain repeated exposure to these reasoning processes, allowing them to better internalize these skills.

Input/output prediction learning is introduced as a distinct training stage positioned before general instruction tuning in a two-stage manner, serving as an intermediate step to enhance the reasoning abilities of the base model. The prompt includes the function, the textual query, and the given input or output, while the response is directly collected by prompting a strong open-source model, DeepSeek-V2.5 (DeepSeek-AI et al., 2024). Notably, the instances for input-output prediction are highly scalable to collect, as we can sample hundreds of inputs from a separate Python input generator for each function and execute the code to obtain ground-truth outputs. In total, we collect over 450K functions from multiple sources, and for each function, several input-output pairs are generated by executing the corresponding code. Synthesizing CoTs for them results in a total of 3.5M training samples, yielding the CODEI/O data. To further leverage the verifiable characteristics of code, we verify all predictions based on code execution and prompt DeepSeek-V2.5 for a second turn of revisions on the responses it initially got wrong. These multi-turn revisions are concatenated into longer responses. The resulting CODEI/O++ dataset further enhances performance, demonstrating the effectiveness of this refinement process.

We validate the effectiveness of CODEI/O and CODEI/O++ across four base models with parameter sizes ranging from 7B to 30B. Assessments across 14 different benchmarks show that training on them enhances performance on a diverse range of reasoning tasks, not only code-related tasks but also more generalized ones such as logic, symbolic, mathematical & numerical, scientific, and commonsense reasoning. Compared to several strong data baselines, such as OpenMathInstruct2 (Toshniwal et al., 2024), OpenCoder-SFT-Stage1 (Huang et al., 2024), WebInstruct (Yue et al., 2024), and high-quality raw code (Ben Allal et al., 2024), CODEI/O achieves not only higher average scores across all four tested base models but also more balanced performance. Instead of boosting scores on only a small subset of evaluation benchmarks while causing declines on others, CODEI/O delivers consistent improvements across nearly all benchmarks, demonstrating balanced and generalizable reasoning abilities.

Figure 1: Overview of our training data construction: Raw code files are gathered from various sources and converted into a unified format. Input-output pairs are then generated by executing the code, while natural language CoTs for predictions are collected from DeepSeek-V2.5. The verified CoTs can undergo optional revisions to further enhance reasoning chains.

2. CODEI/O

Our data construction pipeline is presented in this section. We begin with collecting raw code files from various sources (Section 2.1) and transforming them into a unified format (Section 2.2). Next, I/O pairs are sampled from the transformed functions (Section 2.3). Finally, the complete training dataset is assembled (Section 2.4). An overview is depicted in Figure 1.

2.1. Collecting Raw Code Files

The effectiveness of CODEI/O lies in selecting diverse raw code sources that encompass a wide range of reasoning patterns. To achieve this, we select sources with different emphases: CodeMix, a large collection of raw Python code files retrieved from an in-house code pre-training corpus, where we filter out files that are either overly simplistic or excessively complex; and PyEdu-R (reasoning), a subset of Python-Edu (Ben Allal et al., 2024) that focuses on complex reasoning tasks such as STEM, system modeling, or logic puzzles. To avoid overlap with CodeMix, we deliberately exclude files centered on pure algorithms. Beyond these two sources, we also incorporate high-quality code files from a variety of smaller, reputable sources, including comprehensive algorithm repositories, challenging math problems, and well-known online coding platforms. In total, merging these sources yields approximately 810.5K code files. Further details on the data sources can be found in Appendix C.1.
2.2. Transforming to a Unified Format

The collected raw code files often lack structure, contain irrelevant elements, and are hard to execute in a self-contained way. Therefore, we preprocess them using DeepSeek-V2.5 (DeepSeek-AI et al., 2024), which refines them into a unified format that emphasizes the main logical functionality and makes them executable so that we can collect input-output pairs for later prediction tasks. This transformation organizes the data into the following components (a complete example is provided in Table 8 in Appendix G):

1) Cleaned Reference Code: We preprocess the raw code files by cleaning and refactoring the code to extract core logical functionalities into functions. Non-essential elements like visualization (e.g., print, plot) and file processing (e.g., read, write) are excluded.

2) Main Entrypoint Function: A main entrypoint function is added to summarize the overall logic of the code. It can call other functions or import external libraries, and it must have non-empty arguments (inputs) as well as return meaningful outputs. All inputs and outputs are required to be JSON-serializable to facilitate further processing.

3) Input/Output Description: The inputs and outputs of the main entrypoint function are clearly defined, including information on data types, constraints (e.g., output ranges), or more complex requirements (e.g., keys in a dictionary).

4) Input Generator: Rather than generating test cases directly, a standalone rule-based Python input generator function is created. This generator returns non-trivial inputs that follow the requirements of the main entrypoint function. Randomness is applied subject to constraints, enabling scalable data generation.

5) Query: A concise problem statement is generated based on the main entrypoint function, serving as a query to describe its intended functionality.

Figure 2: Two examples of the collected responses for input and output prediction, respectively. (The figure shows a recursive minimum-coin-change reference function with the query "You are given an amount of money 'amt' and a list of coin denominations 'coins'; determine the minimum number of coins needed to make up 'amt', or inf if it is not possible", together with an output-prediction response for the input {"amt": 25, "coins": [1, 4, 7]} and an input-prediction response for the target output 4.)
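To make the unified format concrete, the sketch below mirrors the coin-change example from Figure 2. It is only an illustration of how the components might fit together; the function names, generator ranges, and query wording are assumptions, not the exact artifacts produced by the released pipeline.

# Illustrative unified-format sample (names and ranges are assumptions, not the released data schema).
import random

# 1)/2) Cleaned reference code with a main entrypoint function; inputs and outputs are JSON-serializable.
def main_solution(amt, coins):
    """Return the minimum number of coins from `coins` needed to make up `amt`,
    or float('inf') if the amount cannot be formed."""
    if amt == 0:
        return 0
    if not coins:
        return float("inf")
    if coins[0] > amt:                                    # first denomination too large: skip it
        return main_solution(amt, coins[1:])
    use_it = 1 + main_solution(amt - coins[0], coins)     # use the first denomination once
    lose_it = main_solution(amt, coins[1:])               # never use the first denomination
    return min(use_it, lose_it)

# 4) Standalone rule-based input generator: returns non-trivial inputs respecting the constraints above.
def input_generator():
    amt = random.randint(1, 40)
    coins = sorted(random.sample(range(1, 10), random.randint(1, 4)))
    return {"amt": amt, "coins": coins}

# 3)/5) Input/output description and query, kept as plain text alongside the code.
query = ("You are given an amount of money 'amt' and a list of coin denominations 'coins'. "
         "Determine the minimum number of coins needed to make up 'amt'; "
         "if it is not possible, the answer is inf.")

if __name__ == "__main__":
    example = input_generator()
    print(example, "->", main_solution(**example))   # e.g. {'amt': 25, 'coins': [1, 4, 7]} -> 4

An output-prediction sample then fixes a generated input and asks the model for the execution result, while an input-prediction sample fixes a target output (e.g., 4) and asks for any feasible input, both answered in natural language CoT.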
2.3. Collecting Input and Output Pairs

After converting the collected raw code files into the unified format, we sample multiple inputs using the input generator for each function and obtain the corresponding outputs by executing the code. To ensure the outputs are deterministic, we skip all functions that involve randomness, such as those using import random. During execution, we also impose a series of limits on the runtime and on the complexity of the input/output objects (details in Appendix A). For each transformed function, we sample multiple input-output pairs, with the exact number depending on the source from which it originates (details in Appendix C.2). After filtering out non-executable code, samples that exceed the runtime limit, and input-output pairs surpassing the desired complexity, we obtain 3.5M instances derived from 454.9K raw code files. The distribution of input and output prediction instances is roughly balanced at 50%/50%.
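A minimal sketch of this collection step is shown below. The timeout value, complexity cap, and helper names are assumptions chosen for illustration; the actual limits are those specified in Appendix A, and the production pipeline is more involved.

import json
import multiprocessing

RUNTIME_LIMIT_S = 5        # assumed per-execution timeout; the paper's exact limit is in Appendix A
MAX_OBJECT_CHARS = 2000    # assumed cap on the serialized size of one input-output pair

def _worker(func, kwargs, queue):
    queue.put(func(**kwargs))

def execute_with_limit(func, kwargs, timeout=RUNTIME_LIMIT_S):
    """Run func(**kwargs) in a subprocess and enforce a wall-clock runtime limit."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_worker, args=(func, kwargs, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        raise TimeoutError("runtime limit exceeded")
    if queue.empty():
        raise RuntimeError("execution failed inside the subprocess")
    return queue.get()

def collect_io_pairs(main_solution, input_generator, n_samples=10):
    """Sample inputs, execute the reference code, and keep only well-formed input-output pairs."""
    pairs = []
    for _ in range(n_samples):
        inputs = input_generator()
        try:
            outputs = execute_with_limit(main_solution, inputs)
            serialized = json.dumps({"input": inputs, "output": outputs})
        except Exception:
            continue   # drop non-executable, over-time, or non-serializable samples
        if len(serialized) <= MAX_OBJECT_CHARS:   # drop overly complex objects
            pairs.append({"input": inputs, "output": outputs})
    return pairs

In the paper this loop is run per function with a source-dependent number of samples over hundreds of thousands of functions, which is why the per-sample limits matter.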
2.4. Building Samples for Input-Output Prediction

After collecting the input-output pairs as well as the transformed functions, we need to assemble them into a trainable format. For the supervised fine-tuning process we adopt, a prompt and a response are needed for each training sample. Since we aim at input-output prediction tasks, we construct the prompt with a designed template that combines the function, the query, the reference code, and either a specific input or output. We provide an example prompt in Figure 8 in Appendix G. The response should ideally be a natural language CoT that reasons about how to derive the correct output or a feasible input. In general, we choose the following two ways to construct the desired CoT responses.

Direct Prompting (CODEI/O). While having fully executable code theoretically allows us to generate reliable execution trajectories as responses, two challenges arise: 1) obtaining a deterministic reverse function for input prediction is impractical; 2) automatically constructed trajectories are constrained by pre-designed templates and lack the expressiveness and generalizability of free-form natural language reasoning. Thus, we adopt a fully LLM-based approach for synthesizing all the desired responses using DeepSeek-V2.5, as it has top-tier performance but extremely low cost. The dataset generated in this way is referred to as CODEI/O. We provide two examples of collected responses in Figure 2.

Making Full Use of Code (CODEI/O++). A common approach to enhance data quality is rejection sampling (Yuan et al., 2023), where incorrect predictions are discarded. Though this approach suits CODEI/O well, as we can verify all responses by re-executing the codes, we find it leads to suboptimal performance (Section 4.1). Therefore, we take an alternative approach to fully utilize the execution feedback from our reference code. For responses with incorrect predictions, we append the feedback as a second turn of input messages and ask DeepSeek-V2.5 to regenerate another response. In practice, we capture multiple types of feedback: for output prediction, we simply inform the model that it generated an incorrect answer; for input prediction, we additionally include the execution feedback obtained by running the predicted inputs; and for instances where the code fails to execute (e.g., due to a format error, an argument mismatch, or another runtime error), we also include such feedback explicitly. After the second turn, we re-check the correctness of the newly generated responses. We then construct the final response by concatenating all four components: Turn 1 response + Turn 1 feedback + Turn 2 response + Turn 2 feedback. For correct responses in the first turn, the Turn 1 feedback is simply "Success" with no Turn 2 contents. In general, about 50% of the responses are correct in the first turn, and 10% of the incorrect ones can be successfully revised in the second turn. Similar to CODEI/O, we keep all responses, whether correct or incorrect, after the revision. The dataset collected in this way is referred to as CODEI/O++, and we provide a complete example in Table 9 in Appendix G.
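The verification step behind this revision loop can be sketched as follows: output predictions are matched against the stored ground truth, while input predictions are checked by re-executing the reference code on the predicted input. The function interface and feedback strings here are illustrative assumptions, not the exact messages used in the paper.

def verify_prediction(main_solution, task, prediction, reference_pair):
    """Verify one model prediction against a reference input-output pair.

    task is "output" (predict the output for reference_pair["input"]) or
    "input" (predict any feasible input that yields reference_pair["output"]).
    Returns (is_correct, feedback_text) for use as the next-turn message.
    """
    if task == "output":
        # Output prediction: compare directly with the stored ground-truth output.
        if prediction == reference_pair["output"]:
            return True, "Success"
        return False, "Your predicted output is incorrect."

    # Input prediction: many inputs may be valid, so re-execute the reference code.
    try:
        observed = main_solution(**prediction)
    except Exception as err:   # format error, argument mismatch, or other runtime error
        return False, f"Your predicted input fails to execute ({type(err).__name__})."
    if observed == reference_pair["output"]:
        return True, "Success"
    return False, "Your predicted input does not produce the required output."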
3. Experiments

3.1. Settings

Models. We select the following base models as backbones: Qwen 2.5 7B Coder (Hui et al., 2024), DeepSeek v2 Lite Coder (MoE) (Zhu et al., 2024), LLaMA 3.1 8B (Dubey et al., 2024), and Gemma 2 27B (Gemma Team et al., 2024). These models were chosen as among the most advanced base models currently available, differing in architecture, size, and pre-training focus. Notably, we include two coder models, as previous studies have shown that coder models exhibit stronger reasoning capabilities compared to general-purpose models (Suzgun et al., 2023; Shao et al., 2024).

Instruction Tuning Data. We utilize an in-house instruction-tuning dataset containing approximately 1.18M samples in different languages, encompassing a wide range of domains such as math, coding, writing, and more. Tuning the model on this dataset enables it to effectively follow diverse instructions, making it applicable to, and testable on, a broad spectrum of downstream tasks.

Training Setups. Similar to continual pre-training, we employ a two-stage training strategy in most of our experiments. The first stage involves training on the CODEI/O or CODEI/O++ dataset, followed by a second stage of general instruction tuning. The reason for adopting this two-stage approach is rooted in the characteristics of our datasets: the CODEI/O(++) dataset contains a significantly larger number of samples than the instruction-tuning data, and simply mixing the two would result in a biased distribution, leading to insufficient learning on the instruction-tuning data and preventing the model from fully demonstrating its capacity to follow diverse instructions in downstream tasks. The two-stage training therefore first strengthens the model as a more robust base model for general reasoning, and then adapts it into a versatile instruction-following model through instruction tuning. Detailed training hyper-parameters are in Appendix E.

Evaluation Benchmarks. We evaluate all models on the following benchmarks: DROP (Dua et al., 2019), WinoGrande (Sakaguchi et al., 2020), GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021b), MMLU-STEM (Hendrycks et al., 2021a), BBH (Suzgun et al., 2023), GPQA (Rein et al., 2024), CruxEval (Gu et al., 2024), and ZebraGrid (Lin et al., 2025). These benchmarks span multiple key reasoning domains, including science, math & numerical, symbolic, commonsense, logic, and code understanding. We also include two comprehensive benchmarks: LiveBench (White et al., 2024) [1] and KorBench (Ma et al., 2024). Besides these established benchmarks, we test on two extra ones: BBH-ZH, a Chinese version of 9 BBH subtasks [2], as our instruction tuning data contains both English and Chinese examples, and LeetCode-O (LC-O), designed for bilingual output prediction on LeetCode questions with test cases. All evaluations are done with greedy decoding in a zero-shot setting, except for BBH-EN/-ZH, where we use a 3-shot setup. Details of all benchmarks are in Appendix B.

[1] We adopt the 2406-2407 split, excluding the code generation and instruction-following subtasks as they are not our focus.
[2] For clarity, BBH is referred to as BBH-EN in later sections.

Table 1: Main evaluation results on all benchmarks. WI = WebInstruct, OMI2 = OpenMathInstruct2, OC-SFT-1 = OpenCoder-SFT-Stage-1, PyEdu = PythonEdu. We also report the number of training samples (#, in millions) for each dataset. Color-coded cells (green/red) denote improvements or declines relative to the single-stage baseline, with deeper shades indicating larger score shifts. Cells marked "-" could not be recovered from the source text.

| Dataset | # (M) | WinoGrande | DROP | GSM8K | MATH | GPQA | MMLU-STEM | LC-O | CRUX-I | CRUX-O | BBH-EN | BBH-ZH | ZebraLogic | KorBench | LiveBench | AVG |

Qwen 2.5 Coder 7B
| 2nd Stage Only | - | 66.9 | 70.7 | 83.4 | 71.6 | 41.5 | 77.2 | 20.7 | 61.3 | 60.0 | 68.3 | 70.6 | 10.9 | 38.7 | 26.0 | 54.8 |
| WI | 3.5 | 66.3 | 73.5 | 87.0 | 71.4 | 39.1 | 77.5 | 18.3 | 59.1 | 61.6 | 68.6 | 68.7 | 10.2 | 42.5 | 26.0 | 55.0 |
| WI (Full) | 11.6 | 67.0 | 75.0 | 87.0 | 71.1 | 42.9 | 78.6 | 19.1 | 59.3 | 59.8 | 68.4 | 70.4 | 10.9 | 41.9 | 27.6 | 55.6 |
| OMI2 | 3.5 | 67.6 | 74.3 | 84.1 | 72.3 | 36.2 | 77.4 | 20.9 | 60.4 | 61.5 | 68.8 | 69.3 | 10.1 | 42.7 | 27.2 | 55.2 |
| OMI2 (Full) | 14.0 | 66.9 | 74.0 | 88.5 | 73.2 | 40.9 | 77.8 | 19.9 | 59.5 | 62.4 | 68.3 | 71.3 | 11.2 | 41.2 | 28.4 | 56.0 |
| OC-SFT-1 | 4.2 | 66.6 | 75.3 | 86.7 | 70.9 | 37.7 | 78.0 | 20.3 | 60.9 | 60.1 | 67.5 | 67.6 | 10.8 | 40.1 | 27.5 | 55.0 |
| PyEdu | 7.7 | 66.7 | 74.8 | 85.8 | 71.4 | 40.9 | 77.4 | 19.1 | 58.9 | 62.4 | 67.8 | 65.7 | 10.6 | 39.3 | - | 54.8 |
| CODEI/O | 3.5 | 67.9 | 76.4 | 86.4 | 71.9 | 43.3 | 77.3 | 23.7 | 63.6 | 64.9 | 69.3 | 72.8 | 10.7 | 44.3 | 28.5 | 57.2 |
| CODEI/O++ | 3.5 | 66.9 | 79.1 | 85.7 | 72.1 | 40.6 | 77.9 | 24.2 | 62.5 | 67.9 | 71.0 | 74.2 | 10.7 | 45.7 | 29.1 | 57.7 |

LLaMA 3.1 8B
| 2nd Stage Only | - | 71.3 | 73.1 | 83.2 | 49.9 | 40.6 | 70.0 | 4.1 | 44.5 | 46.9 | 65.8 | 65.6 | 9.8 | 39.8 | 25.7 | 49.3 |
| WI | 3.5 | 72.1 | 76.3 | 82.8 | 52.8 | 42.9 | 69.6 | 4.1 | 44.0 | 44.8 | 64.5 | 67.8 | 10.0 | 42.7 | 23.1 | 49.8 |
| OMI2 | 3.5 | 72.2 | 74.8 | 86.2 | 58.9 | 38.2 | 70.1 | 5.8 | 46.1 | 46.4 | 67.4 | 68.6 | - | 40.3 | 24.5 | 50.6 |
| OC-SFT-1 | 4.2 | 71.0 | 71.9 | 81.8 | 51.1 | 38.2 | 68.4 | 5.7 | 43.5 | 44.9 | 65.6 | 67.6 | 10.5 | 42.0 | 24.7 | 49.1 |
| PyEdu | 7.7 | 70.6 | 69.6 | 83.2 | 49.8 | 42.4 | 69.1 | 5.2 | 43.1 | 44.5 | 64.0 | 65.6 | 10.2 | 42.6 | 25.7 | 49.0 |
| CODEI/O | 3.5 | 71.7 | 73.9 | 83.6 | 53.8 | 43.5 | 69.0 | 9.3 | 50.1 | 53.3 | 67.5 | 65.3 | 10.4 | 40.9 | 24.7 | 51.2 |
| CODEI/O++ | 3.5 | 71.8 | 75.1 | 84.0 | 53.2 | 40.9 | 68.4 | 10.0 | 50.4 | 53.1 | 70.0 | 70.6 | 10.5 | 43.2 | 28.1 | 52.1 |

DeepSeek Coder v2 Lite 16B
| 2nd Stage Only | - | 68.4 | 73.4 | 82.5 | 60.0 | 38.6 | 68.5 | 14.8 | 53.0 | 54.9 | 61.1 | 69.2 | 6.7 | 44.7 | 26.6 | 51.6 |
| WI | 3.5 | 68.5 | 73.8 | 83.7 | 60.5 | 39.5 | 68.7 | 14.3 | 53.5 | 57.1 | 61.6 | 65.7 | 6.9 | 43.1 | 25.4 | 51.6 |
| OMI2 | 3.5 | 67.6 | 74.1 | 84.7 | 64.7 | 38.4 | 70.1 | 14.4 | 53.8 | 55.8 | 63.6 | 66.4 | 6.4 | 42.0 | 24.7 | 51.9 |
| OC-SFT-1 | 4.2 | 68.2 | 73.6 | 83.3 | 60.9 | 37.3 | 69.1 | 14.7 | 52.8 | 56.1 | 60.9 | 67.9 | 6.1 | 42.7 | 25.2 | 51.3 |
| PyEdu | 7.7 | 68.3 | 74.6 | 83.0 | 60.6 | 38.2 | 69.7 | 15.6 | 54.9 | 57.0 | 61.9 | 68.6 | 7.0 | 44.7 | 24.6 | 52.1 |
| CODEI/O | 3.5 | 68.4 | 74.6 | 83.6 | 60.9 | 38.6 | 70.3 | 18.7 | 58.4 | 62.8 | 63.1 | 70.8 | 7.8 | 46.0 | 26.1 | 53.6 |
| CODEI/O++ | 3.5 | 69.0 | 73.5 | 82.8 | 60.9 | 38.8 | 70.0 | 20.3 | 59.5 | 61.0 | 64.2 | 69.4 | 6.7 | 46.3 | 26.9 | 53.5 |

Gemma 2 27B
| 2nd Stage Only | - | 72.4 | 80.1 | 90.1 | 66.3 | 44.4 | 82.8 | 19.1 | 62.5 | 66.9 | 77.1 | 80.4 | 13.5 | 47.8 | 30.0 | 59.5 |
| WI | 3.5 | 73.2 | 79.0 | 91.5 | 70.6 | 44.9 | 82.7 | 20.7 | 63.5 | 66.3 | 77.6 | 77.2 | 17.1 | 47.3 | 33.3 | 60.4 |
| OMI2 | 3.5 | 73.1 | 79.3 | 90.8 | 67.1 | 44.0 | 83.4 | 19.2 | 61.4 | 66.0 | 77.1 | 80.5 | 13.9 | 49.7 | 40.7 | 60.4 |
| OC-SFT-1 | 4.2 | 73.5 | 79.9 | 91.1 | 66.1 | 46.9 | 81.8 | 20.2 | 62.8 | 65.6 | 77.3 | 78.9 | 14.0 | 46.9 | 35.3 | 60.0 |
| PyEdu | 7.7 | 73.7 | 79.5 | 90.3 | 66.0 | 45.3 | 82.8 | 18.7 | 61.3 | 64.9 | 77.4 | 79.0 | 14.2 | 48.9 | 34.0 | 59.7 |
| CODEI/O | 3.5 | 75.9 | 80.7 | 91.2 | 67.4 | 44.9 | 83.3 | 22.4 | 65.0 | 70.3 | 77.9 | 78.7 | 14.6 | 49.1 | 31.3 | 60.9 |
| CODEI/O++ | 3.5 | 73.1 | 82.0 | 91.4 | 66.9 | 46.0 | 83.0 | 26.6 | 64.4 | 70.6 | 78.4 | 77.8 | 16.4 | 49.4 | 35.3 | 61.5 |
Baselines. The primary baseline is to directly fine-tune the base model on the instruction-tuning dataset in a single stage (2nd Stage Only). This serves to evaluate whether the additional training stage provides any tangible benefits. We also select several strong datasets as baselines for the first stage. WebInstruct (Yue et al., 2024): a large instruction-tuning dataset with 11.6M samples mined from the Internet and refined by LLMs. OpenMathInstruct-2 (Toshniwal et al., 2024): a 14M-sample dataset focused on math problem solving, augmented from GSM8K and MATH using LLaMA 3.1 405B-Inst (Dubey et al., 2024). OpenCoder-SFT-Stage-1 (Huang et al., 2024): a 4.2M QA-pair dataset synthesized from general code data, covering diverse computer science domains. Python-Edu (Ben Allal et al., 2024): following findings that continued pre-training on code tends to enhance reasoning, we adopt its full 7.7M code corpus and train on it with a standard language modeling objective. For the other baselines, we use 3.5M subsets in most experiments to align with the size of our CODEI/O dataset, but we also report scores when training on the complete datasets for Qwen 2.5 7B Coder.

3.2. Main Results

We demonstrate the main evaluation results in Table 1. As shown, CODEI/O provides universal gains across benchmarks compared with the other datasets, even larger ones. While competing datasets may excel in specific tasks (e.g., OpenMathInstruct2 on math) but regress in others (mixed green and red cells), CODEI/O shows consistent improvements (mainly green patterns). Despite using only code-centric data, it enhances all other tasks beyond code reasoning as well, suggesting generalizable capabilities. We also observe that training on raw code files (PythonEdu) results in only minor, and occasionally even negative, improvements compared to the single-stage baseline, significantly underperforming CODEI/O; learning from such less-structured data is thus suboptimal. This further highlights that performance gains are driven not merely by data size but by thoughtfully designed training tasks that encompass diverse, structured reasoning patterns in generalized CoTs.

Additionally, CODEI/O++ systematically outperforms CODEI/O, boosting average scores without trade-offs on individual tasks. This highlights how execution-feedback-based multi-turn revision improves data quality and enhances reasoning across domains. Most importantly, both CODEI/O and CODEI/O++ deliver these gains consistently across model sizes and architectures. This further validates that our training approach, predicting code inputs and outputs, enables models to excel in diverse reasoning tasks without sacrificing specialized benchmark performance.
datasetsInput/Output Prediction We examine input and output pre-may excel in specific tasks (e.g.,OpenMathInstruct2 ondiction by training on each separately.The scores are gener-math)but regress in others (mixed green and red cells),ally similar,but input prediction excels on KorBench whileCoDEI/O shows consistent improvements (mainly greenslightly hurting GPQA,and output prediction shows greaterpatterns).Despite using only code-centric data,it enhancesbenefits on symbolic reasoning tasks like BBH.CRUXEval-all other tasks beyond code reasoning as well,suggestingI and-O also favor input and output prediction,respectively.its generalizable capabilities.We also observe that trainingon raw code files (PythonEdu)results in only minor,andsponses using rejection sampling,which removes 50%ofoccasionally even negative,improvements compared to thethe training data.However,this results in a general per-single-stage baseline,significantly underperforming whencompared to CoDEI/O,suggesting that learning from suchformance drop,suggesting a loss of data diversity.Wealso experiment with replacing all incorrect responses withless-structured data is suboptimal.This further highlightsground-truth answers through code execution (without CoT).that perfommance gains are driven not merely by data sizeWe see improvements on benchmarks like LeetCode-O andbut by thoughtfully designed training tasks that encompassCRUXEval-O designed to measure output prediction accu-diverse,structured reasoning pattems in generalized CoTs.racy,but it lowers scores elsewhere,reducing the averageAdditionally,CoDEI/O++systematically outperformsperformance.When comparing these two with training onCoDEI/O,boosting average scores without trade-offs ona~50%subset of CODEI/O where the number of samplesindividual tasks.This highlights how execution-feedback-are comparable,they still have no advantages.Therefore,based multi-turn revision improves data quality and en-to maintain performance balance,we retain all incorrect re-hances reasoning across domains.Most importantly,bothsponses in the main experiments without any modification.CoDEI/O:Condensing Reasoning Patterns via Code Input-Output Prediction3.52M 1.91M 0.96M 0.32M w/o Mid-Training6/6 4/62/61/6 w/o Mid-TrainingLiveBenchLiveBenchMATHMMLUMMLUSTEMMATH-STEM77771977DROP764ZebraZebra109 LogicDROP 799 LogicKor44Kor45.1WinoBenchGrandeBenchGrandeGPQA728BBH-ZHBBH-ZH59Crux-lBBH-ENCrux-lBBH-EN653Crux-OLeetCode-OCrux-O87.5LeetCode-OGSM8KGSM8K(a)Size of randomly sampled subset.(b)Ratio of testcases per sample compared to the full set.Figure 4:The scaling effect of CoDEI/O in the first stage training.58.0152.5LLaMA 3.1 88Some of our baselines such as WebInstruct synthesize re-57.852.2sponses with Qwen-72B (Bai et al.,2023)and Mixtral57.622Bx8 (Jiang et al.,2024),while CoDEI/O uses DeepSeek-V2.5.To ablate the effect of different synthesis models,we57.4regenerate responses for the 3.5M WebInstruct (as it covers57.2massive domains)subset using DeepSeek-V2.5,creatingan updated dataset called WebInstruct-DS25.As shown in57.051.0Figure 3,while WebInstruct-DS25 outperforms the vanilladataset on Qwen 2.5 Coder 7B and LLaMA 3.1 8B,it stillfalls short of CoDEI/O.This highlights the value of diverseFigure 5:Average benchmark scores from training on datafrom different turns of revision.reasoning patterns in code and the importance of task se-lection in training.Overall,this comparison shows thatpredicting code inputs and outputs improves reasoning be-pairs by fixing and using all unique raw code samples 
4.3. Scaling Effect of CODEI/O

We evaluate how CODEI/O scales with varying amounts of training data. Randomly sampling training instances, Figure 4a reveals a clear trend: increasing the number of training samples generally leads to improved performance across benchmarks. Using the smallest amount of data yields relatively weak performance on most benchmarks, as the model lacks sufficient training to generalize effectively, whereas training on the full dataset achieves the most comprehensive and robust performance. Intermediate amounts of data yield slightly lower but comparable results, with performance improving as more training samples are introduced. This highlights CODEI/O's scalability and effectiveness in enhancing reasoning capabilities.

We also scale the data along the dimension of input-output pairs, by fixing and using all unique raw code samples but changing the number of input-output prediction instances per sample. Figure 4b shows the results for different ratios of used I/O pairs relative to the full set. While the scaling effect is less pronounced than with training samples, we still observe clear benefits, particularly when increasing from 1/6 to 6/6. This suggests some reasoning patterns require multiple test cases to fully capture and learn their complex logic flow.

Figure 4: The scaling effect of CODEI/O in the first-stage training. (a) Size of the randomly sampled subset (0.32M, 0.96M, 1.91M, 3.52M samples, and w/o mid-training). (b) Ratio of test cases per sample compared to the full set (1/6, 2/6, 4/6, 6/6, and w/o mid-training).

4.4. Different Data Format

We investigate how to best arrange the query, reference code, and CoT in training samples. As shown in Table 3, placing the query and reference code in the prompt and the CoT in the response achieves the highest average score and the most balanced performance across benchmarks. Other formats perform worse, with the worst results occurring when the query is in the prompt and the reference code in the response, resembling a standard code generation task but with far fewer training samples. This highlights the importance of CoT and of scaling test cases for learning transferable reasoning ability.

Table 3: The effect of different data formats. We make bold the highest and underline the lowest scores in each column.

| Prompt | Response | WinoGrande | DROP | GSM8K | MATH | GPQA | MMLU-STEM | LC-O | CRUX-I | CRUX-O | BBH-EN | BBH-ZH | ZebraLogic | KorBench | LiveBench | AVG |
| Q + Code | CoT | 67.9 | 76.4 | 86.4 | 71.9 | 43.3 | 77.3 | 23.7 | 63.6 | 64.9 | 69.3 | 72.8 | 10.7 | 44.3 | 28.5 | 57.2 |
| Q | CoT | 67.2 | 76.8 | 87.2 | 70.4 | 37.5 | 77.3 | 25.2 | 62.6 | 65.3 | 69.2 | 71.1 | 11.5 | 44.9 | 28.5 | 56.8 |
| Code | CoT | 67.9 | 76.4 | 87.0 | 70.8 | 39.5 | 76.5 | 25.0 | 64.1 | 65.8 | 68.8 | 71.3 | 10.6 | 45.2 | 28.5 | 57.0 |
| Q | Code + CoT | 65.9 | 76.1 | 87.5 | 71.7 | 42.2 | 76.9 | 22.9 | 63.9 | 66.1 | 69.6 | 72.9 | 10.9 | 41.4 | 28.5 | 56.9 |
| Q | Code | 66.9 | 73.1 | 84.8 | 71.6 | 40.0 | 77.4 | 20.8 | 59.5 | 62.4 | 67.2 | 68.3 | 10.1 | 40.3 | 26.3 | 54.9 |

4.5. Multi-turn Revision

Based on CODEI/O (no revision) and CODEI/O++ (single-turn revision), we extended revisions to a second turn to evaluate further improvements by regenerating predictions for instances still incorrect after the first revision. We visualize the distribution of response types in each turn in Figure 7 in Appendix D. It shows that most correct responses are predicted in the initial turn, with about 10% of incorrect responses corrected in the first-turn revision. The second turn, however, yields significantly fewer corrections; inspecting these cases, we find that the model often repeats the same incorrect CoT without adding new useful information. After incorporating multi-turn revisions, we observe consistent improvement from turn 0 to turn 1 but minimal gains from turn 1 to turn 2 in Figure 5, with a slight improvement for LLaMA 3.1 8B but a regression for Qwen 2.5 Coder 7B. Hence, we stop at the single-turn revision, i.e., CODEI/O++, in our main experiments.

Figure 5: Average benchmark scores from training on data from different turns of revision.
4.6. The Necessity of Two-Stage Training

Lastly, we highlight the necessity of a separate training stage with CODEI/O data by testing both single-stage mixed training and two-stage training with different data mixtures. As shown in Table 4, all two-stage variants outperform single-stage training. Meanwhile, the effect of mixing data during two-stage training varies across models: for Qwen 2.5 Coder 7B, the best result comes from keeping CODEI/O and the instruction-tuning data fully separate, while LLaMA 3.1 8B performs better with mixed data, either in the first stage or in the second stage. To simplify our methodology, we use fully separated data in our main experiments, leaving optimal data-mixing strategies for future work.

Table 4: Average benchmark score under different training strategies. IT stands for our instruction-tuning data.

| First Stage | Second Stage | Qwen | LLaMA |
| - | IT | 54.8 | 49.3 |
| - | CODEI/O (10%) + IT | 56.6 | 50.5 |
| - | CODEI/O + IT | 55.9 | 49.7 |
| CODEI/O | IT | 57.2 | 51.2 |
| CODEI/O + IT | IT | 56.8 | 51.5 |
| CODEI/O | CODEI/O (10%) + IT | 57.0 | 52.7 |

5. Related Work

Learning about Code Execution. The topic of learning code execution has existed long before the era of LLMs (Zaremba & Sutskever, 2014; Graves et al., 2014). However, most related works focus solely on the output prediction task itself when learning from code execution (Nye et al., 2021; Liu et al., 2023; Ding et al., 2024c). Other works seek to utilize code execution, either through the final feedback (Ding et al., 2024a; Wang et al., 2024) or the intermediate trace (Ding et al., 2024b; Ni et al., 2024), to improve code generation abilities. There are also specific benchmarks designed to evaluate a model's ability to predict execution results, such as CRUXEval (Gu et al., 2024) and LiveCodeBench-Exec (Jain et al., 2024). Unlike the above works, which set a narrow scope within code-related tasks, we are the first to train LLMs on large-scale, diverse code input-output predictions and to demonstrate the efficacy of doing so for improving general reasoning ability beyond code.

Inference-Time Scaling. A very recent approach to enhance reasoning is inference-time scaling, exemplified by OpenAI's o1 (Jaech et al., 2024) and DeepSeek's R1 (DeepSeek-AI et al., 2025), which typically encourages models to generate ultra-long reasoning processes to solve problems through large-scale reinforcement learning. Such methods push models to new limits on massively challenging tasks, while also significantly altering their output patterns. We believe that CODEI/O is orthogonal to these methods, and we hope it can provide a better basis to further incentivize the reasoning abilities of LLMs.

6. Conclusion

In conclusion, we introduced CODEI/O, an approach to improve the reasoning abilities of LLMs by training them to predict code inputs and outputs in pure natural language CoTs. This approach leverages the structured and scalable nature of code to learn diverse reasoning patterns, including symbolic, logical, mathematical, and commonsense reasoning. Extensive experiments show that CODEI/O, as well as the enhanced CODEI/O++, consistently outperforms existing baselines, delivering balanced improvements across benchmarks without sacrificing performance in any domain, underscoring its robustness and versatility.

References

Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

Ben Allal, L., Lozhkov, A., Penedo, G., Wolf, T., and von Werra, L. SmolLM-Corpus, 2024. URL https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

DeepSeek-AI, Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Dengr, C., Ruan, C., Dai, D., Guo, D., et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024.

DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948.

Dehaene, S., Molko, N., Cohen, L., and Wilson, A. J. Arithmetic and the brain. Current Opinion in Neurobiology, 14(2):218-224, 2004.
Ding, Y., Min, M. J., Kaiser, G., and Ray, B. CYCLE: Learning to self-refine the code generation. Proceedings of the ACM on Programming Languages, 8(OOPSLA1):392-418, 2024a.

Ding, Y., Peng, J., Min, M. J., Kaiser, G., Yang, J., and Ray, B. SemCoder: Training code language models with comprehensive semantics reasoning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024b. URL https://openreview.net/forum?id=PnlCHQrM69.

Ding, Y., Steenhoek, B., Pei, K., Kaiser, G., Le, W., and Ray, B. TRACED: Execution-aware pre-training for source code. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, pp. 1-12, 2024c.

Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2368-2378, 2019.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Gemma Team, Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Rame, A., et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.

Graves, A., Wayne, G., and Danihelka, I. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

Gu, A., Roziere, B., Leather, H. J., Solar-Lezama, A., Synnaeve, G., and Wang, S. CRUXEval: A benchmark for code reasoning, understanding and execution. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 16568-16621. PMLR, 2024. URL https://proceedings.mlr.press/v235/gu24c.html.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=d7KBjmI3GmQ.

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021b.

Huang, J. and Chang, K. C.-C. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403, 2022.

Huang, S., Cheng, T., Liu, J. K., Hao, J., Song, L., Xu, Y., Yang, J., Liu, J., Zhang, C., Chai, L., et al. OpenCoder: The open cookbook for top-tier code large language models. arXiv preprint arXiv:2411.04905, 2024.

Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Dang, K., et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024.

Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.

Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.

Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., Casas, D. d. l., Hanna, E. B., Bressand, F., et al. Mixtral of experts. arXiv preprint, 2024.

Knauff, M. and Wolf, A. G. Complex cognition: the science of human reasoning, problem-solving, and decision-making, 2010.

Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V., Liu, A., Dziri, N., Lyu, S., et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024.

Lin, B. Y., Bras, R. L., Richardson, K., Sabharwal, A., Poovendran, R., Clark, P., and Choi, Y. ZebraLogic: On the scaling limits of LLMs for logical reasoning. arXiv preprint arXiv:2502.01100, 2025.

Liu, C., Lu, S., Chen, W., Jiang, D., Svyatkovskiy, A., Fu, S., Sundaresan, N., and Duan, N. Code execution with pre-trained language models. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 4984-4999, 2023.

Lozhkov, A., Li, R., Allal, L. B., Cassano, F., Lamy-Poirier, J., Tazi, N., Tang, A., Pykhtar, D., Liu, J., Wei, Y., et al. StarCoder 2 and The Stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024.

Ma, K., Du, X., Wang, Y., Zhang, H., Wen, Z., Qu, X., Yang, J., Liu, J., Liu, M., Yue, X., et al. KOR-Bench: Benchmarking language models on knowledge-orthogonal reasoning tasks. arXiv preprint arXiv:2410.06526, 2024.

Mistral-AI. Codestral, 2024. URL https://mistral.ai/news/codestral/.

Ni, A., Allamanis, M., Cohan, A., Deng, Y., Shi, K., Sutton, C., and Yin, P. NExT: Teaching large language models to reason about code execution. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=B1W712hMBi.

Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber, D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.

Qiao, S., Ou, Y., Zhang, N., Chen, X., Yao, Y., Deng, S., Tan, C., Huang, F., and Chen, H. Reasoning with language model prompting: A survey. arXiv preprint arXiv:2212.09597, 2022.

Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98.

Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Sauvestre, R., Remez, T., et al. Code Llama: Open foundation models for code. arXiv preprint, 2023.

Sakaguchi, K., Le Bras, R., Bhagavatula, C., and Choi, Y. WinoGrande: An adversarial Winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 8732-8740, 2020.

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.