The "GPT-3 moment" framing is a bit hype-y I think? GPT-3 eliminated the need for task-specific fine-tuning, but from the article RL wouldn't replace LLM-style pretraining. So this is more of an incremental advance than the paradigm shift GPT-3 represented. That said, if it unlocks RL generalization that would be huge.
The core claim that massive-scale RL will unlock generalization doesn't seem that surprising since we've seen the scaling hypothesis play out across ML. But "replication training" on software is interesting: learning by copying existing programs potentially unlocks a ton of complex training data with objective evaluation criteria.
To me, the big unanswered question is whether skills learned from replicating software would generalize to other reasoning tasks. That's a significant "if" - great if it works, pointless if it doesn't.
It's a very big "if" because other fields are comparatively underspecified. There's no equivalent to a compiler or interpreter in most cases (with spreadsheets being the lingua franca that comes even close for most industries).
It would "work" but I think it will need even more scrutiny by experts to confirm what's correct and what needs to be re-generated. Please please no vibe accounting.
Accounting, specifically book-keeping, really plays to the strengths of LLMs - pattern matching within a bounded context.
The primary task in book-keeping is to classify transactions (from expense vouchers, bank transactions, sales and purchase invoices and so on) and slot them into the Chart of Accounts of the business.
LLMs can already do this well without any domain- or business-specific context. For example, a fuel entry is so obvious that they can match it to a similarly named account in the CoA.
And for others where human discretion is required, we can add a line of instruction to the prompt, and that classification is permanently encoded. A large chunk of these kinds of entries are repetitive in nature, so each such custom instruction is a long-term automation.
You might not have been speaking about simple book-keeping. If so, I'm curious to learn.
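To make the classification flow described above concrete, here is a minimal sketch in Python. The chart of accounts, the custom rule, and the `call_llm` helper are all placeholders I'm assuming, not anything from the original comment or a specific product:

```python
# Placeholder chart of accounts and rule; `call_llm` stands in for any LLM client.
CHART_OF_ACCOUNTS = ["Fuel Expense", "Office Supplies", "Sales Revenue", "Bank Charges"]
CUSTOM_INSTRUCTIONS = [
    "Payments to 'Acme Stationers' are always Office Supplies.",  # one human judgement, encoded once
]

def classify(description: str, amount: float, call_llm) -> str:
    prompt = (
        "Classify this transaction into exactly one account from the list.\n"
        f"Accounts: {', '.join(CHART_OF_ACCOUNTS)}\n"
        f"Business-specific rules: {' '.join(CUSTOM_INSTRUCTIONS)}\n"
        f"Transaction: '{description}', amount {amount:.2f}\n"
        "Answer with the account name only."
    )
    answer = call_llm(prompt).strip()
    # Anything outside the CoA goes to a human verifier instead of straight into the books.
    return answer if answer in CHART_OF_ACCOUNTS else "NEEDS_HUMAN_REVIEW"
```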
Audit is at the heart of accounting, and LLMs are the antithesis of an audit trail.
I'm sorry, I don't follow. The fact that you use an LLM to classify a transaction does not mean there is no audit trail for that fact. There should also be a manual verifier who's ultimately responsible for the entries, so that we do not abdicate responsibility to black boxes.
If you mark data as "Processed by LLM", that in turn taints all inference from it.
Requirements for a human in the loop devolve to ticking a box by someone who doesn't realise the responsibility they have been burdened with.
Mark my words, some unfortunate soul will then be thrown under the bus once a major scandal arises from such use of LLMs.
As an example, companies aren't supposed to use AI for hiring; they are supposed to have all decisions made by a human-in-the-loop. Inevitably this just means presenting a massive grid of outcomes to someone who never actually goes against the choices of the machine.
The more junior the employee, the "better". They won't challenge the system, and they won't realise the liability they're setting themselves up with, and the company will more easily shove them under the proverbial bus if there ever is an issue.
Hiring is too nebulous, too hard to get concrete data for, and too hard to inspect outcomes to properly check.
Financial auditing, however, is the opposite of that. It's hard numbers. Inevitably, when discrepancies arise, people run around chasing other people to get all their numbers close enough to something that makes sense. There's enough human wiggle-room to get away with chaotic processes that still demand accountability.
This is possibly the worst place you could put LLMs, if you care about actual outcomes:
1. Mistakes aren't going to get noticed.
2. If they are noticed, people aren't going to be empowered to actually challenge them, especially once they're used to the LLM doing the work.
3. People will be held responsible for the LLM's mistakes, despite the pressure to sign off (and the general sense of time pressure in audit is already immense).
4. It's a black box, so any faults cannot be easily diagnosed; the best you can do is try to re-prompt in a way that avoids them.
Well put. It should always be "Created by <person>" rather than "Processed by LLM". We can already see it with Claude Code - its commit messages contain a "Generated by Claude Code" line, and it guarantees a pandemic of diffused responsibility in software engineering. But I think there is no point in railing against it - market forces, corporate incentives, and tragedy of the commons all together make it an inevitability.
Now instead of having accountants audit transactions, you will have accountants auditing LLM output for possible hallucinations. Seems counterproductive.
> Please please no vibe accounting.
Funny you should mention it; there are multiple companies in Sweden working on AI/ML-based accounting. It's not so different from AI/ML-based automated driving.
I've seen some of those, but all of the ones I've looked at also had a panel of experts who could give it a once-over (or re-work) before sending it back to the client. I'd compare it more to cruise control or driver-assist than to fully automated driving.
This article reads as complete hype. They just seem to offer an idea of "replication training", which is just some vague agentic distributed RL. Multi-agent distributed reinforcement learning algorithms have been in the actual literature for a while. I suggest studying what DeepMind is doing for the current state of the art in agentic distributed RL.
I didn’t think it was vague. Given an existing piece of software, write a detailed spec on what it does and then reward the model for matching its performance.
The vague part is whether this will generalize to other non software domains.
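As a rough sketch of what "reward the model for matching its performance" could look like as a grader, assuming you can run both the reference program and the candidate on arbitrary inputs (none of this is from the article itself):

```python
import random
import string
import subprocess

def run(cmd: list, stdin: str) -> str:
    """Run a program on stdin and capture stdout; empty string on failure/timeout."""
    try:
        return subprocess.run(cmd, input=stdin, capture_output=True,
                              text=True, timeout=5).stdout
    except (subprocess.TimeoutExpired, OSError):
        return ""

def replication_reward(reference_cmd: list, candidate_cmd: list, n_cases: int = 100) -> float:
    """Fraction of random inputs on which the candidate matches the reference byte-for-byte."""
    matches = 0
    for _ in range(n_cases):
        stdin = "".join(random.choices(string.printable, k=64))
        if run(reference_cmd, stdin) == run(candidate_cmd, stdin):
            matches += 1
    return matches / n_cases  # a dense-ish reward in [0, 1] for the RL loop
```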
Arguably, LLMs with RL have already had their GPT-3 moment, with DeepSeek R1 doing so well that it wiped out over a trillion dollars of stock value in big tech. If you see the GPT-3 moment as the moment where people definitely took notice, this was one of those moments.
This works great for software, math and games where you can have cheap validation. But what about messy real world tasks? I think hindsight learning from chat logs could fit the bill. What do I mean?
Imagine a long conversation. It is hard to judge immediately whether an AI response was useful, but if you know the following 20 messages, it might be easy to infer. Not only can you see how it went, but sometimes you get real-world validation.
For example, a user comes to an LLM with a task, takes an idea, and tries it in reality. Later they return, maybe in a new chat session, and continue iterating. You get real-world testing of LLM responses through people.
This can be used to generate "preference scores" and train a preference model, with which you can do RLHF. That way user privacy is protected.
I call this the human-AI experience flywheel. Of course, the larger the user base, the more experience the model collects. At the moment OpenAI has 500M users, who probably generate 0.5T interactive tokens/day. Those tokens go both into human brains and LLM logs.
It’s not about environment engineering anymore, it's about consequence harvesting. Meaningful validation emerges from systems actually being used by humans for real purposes.
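A minimal sketch of how such hindsight scoring might be wired up, with a placeholder `judge` callable standing in for whatever model or heuristic reads the later turns; the 20-message window is just the number used above:

```python
def hindsight_score(conversation: list, turn_index: int, judge) -> float:
    """Score conversation[turn_index] (an assistant message) in [0, 1],
    using up to the next 20 messages as evidence of how it worked out.
    `judge` is a placeholder for a model or heuristic that reads the later turns."""
    target = conversation[turn_index]
    assert target["role"] == "assistant"
    evidence = conversation[turn_index + 1 : turn_index + 21]
    # e.g. the user reports it worked, stops asking, or comes back with error messages
    return judge(target["content"], [m["content"] for m in evidence])

# Pairs of higher- vs lower-scored responses to the same prompt can then train a
# preference model, so the raw logs themselves never need to leave the provider.
```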
I worked through this for a tax company. They had a huge pile of artifacts from tax questions worked up for clients. What we did was "reverse engineer" the process: the questions that would lead to that tax memo, and the research steps to find the sources and conclusions. It worked well, and we were able to replicate the process by which the SMEs created these memos.
For a given tax question, could you come up with the same memo quoting the same sources and same conclusion?
> Each replication task consists of a detailed specification and a reference implementation. The central idea is that AI models are trained to produce an implementation that precisely matches the reference behavior. This clear-cut approach significantly simplifies evaluation, as the grading criteria are objective and direct: either the generated implementation behaves identically to the reference, or it doesn’t.
OK, but then you have to produce the detailed specification, working backward from the reference implementation. This is extremely non-trivial, and it significantly weakens TFA's parallel to pre-training, in which you don't really need inputs other than raw text corpora.
I'm not saying this eliminates the idea outright, but I do think it hobbles it badly.
I’d like to courteously disagree. I think existing models and existing tools are good enough to bootstrap this at least.
I’d propose the following architecture:
Step 1: Microsoft Phi style - read code and write specifications using a frontier model. You could use an ensemble here to nitpick the spec; it's only going to get written once. We also, of course, have many, many RFCs and codebases that conform to them, and where they do not, we have an existing record of bug reports, patches, forum complaints, etc.
Steps 2-4: implement multilayer evaluation: does it compile? Does an existing model think the code complies with the spec on inspection? When it's run in QEMU, are the key evals the same as for the original software?
I'd argue most of steps 2-4 are automatable, rely on existing tooling, and provide a framework that is, if not cheap, achievable. I'm also sure someone could improve this plan with a few more minutes of thought.
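A hedged sketch of what a layered grader for steps 2-4 might look like; the `make` build step, the `judge` reviewer, and the `run_in_vm` harness are placeholders for whatever tooling is actually wired up, and the weights are arbitrary:

```python
import subprocess

def grade(candidate_dir: str, spec: str, reference_outputs: dict, judge, run_in_vm) -> float:
    """Layered grader sketch for steps 2-4.
    `judge` is a placeholder for an LLM reviewing code against the spec;
    `run_in_vm` is a placeholder for a QEMU/container harness; weights are arbitrary."""
    score = 0.0
    # Layer 1: does it compile/build at all?
    build = subprocess.run(["make", "-C", candidate_dir], capture_output=True)
    if build.returncode != 0:
        return score
    score += 0.2
    # Layer 2: does a reviewing model think the code complies with the spec?
    if judge(spec, candidate_dir) >= 0.5:
        score += 0.3
    # Layer 3: do the key evals match the original software's behavior?
    matched = sum(run_in_vm(candidate_dir, case) == expected
                  for case, expected in reference_outputs.items())
    score += 0.5 * matched / max(len(reference_outputs), 1)
    return score
```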
To me the interesting question is - will this add capabilities at current model sizes? My prior is yes, in that the current behemoth-size models feel like they are only incrementally better than 1/10-size distills. I interpret that to mean we haven't gotten the most out of these larger scales. I will note Dario disagrees on this - he's publicly said we need at least 10x more scale than we have now.
The detailed specification is the output for a particular input. And you can use a fuzzer to augment that.
When prompted correctly, models could generate a good specification in the form of pretty exhaustive tests. While all tests have weaknesses and are not a formal specification, they could get us 99% of the way there.
I've been exploring this too, since I rely on LLMs a lot to build software. I've noticed that our dev loop (writing, testing) is often mostly human-guided, but language models frequently outperform us in reasoning. If we plug in more automation (MCP tools controlling browsers, documentation readers, requirement analysers), we can make the cycle much more automated, with less human involvement.
This article suggests scaling up RL by exposing models to thousands of environments
I think we can already achieve something similar by chaining multiple agents:
1. A “requirement” agent that uses browser tools to craft detailed specs from docs.
2. A coding agent that sets up environments (Docker, build tools) via browser or CLI.
3. A testing agent that validates code against specs, again through tooling.
4. A feedback loop where the tester guides the coder based on results.
Put together, this system becomes a fully autonomous development pipeline, especially for small projects. In practice, I've left my machine running overnight, and these agents propose new features, implement them, run tests, and push to the repo once they pass. It works surprisingly well.
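For what it's worth, here is a bare-bones sketch of that loop; the three agent callables are placeholders for whatever models and tools are wired up, not a specific framework:

```python
def autonomous_iteration(task: str, requirement_agent, coding_agent, testing_agent,
                         max_rounds: int = 5) -> bool:
    """Bare-bones version of the 4-step loop above; the agents are placeholder callables."""
    spec = requirement_agent(task)                   # 1. browse docs, produce a detailed spec
    feedback = ""
    for _ in range(max_rounds):
        code = coding_agent(spec, feedback)          # 2. set up the environment, implement
        passed, report = testing_agent(spec, code)   # 3. validate the code against the spec
        if passed:
            return True                              # e.g. commit and push here
        feedback = report                            # 4. feed failures back to the coder
    return False
```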
The main barrier is cost—spinning up many powerful models is expensive. But on a modest scale, this method is remarkably effective.
> The main barrier is cost
I very much disagree. For the larger, more sophisticated stuff that runs our world, it is not cost that prohibits wide and deep automation. It's deeply sophisticated and constrained requirements, highly complex existing behaviors that may or may not be able to change, systems of people who don't always hold the information needed, usually wildly out-of-date internal docs that describe the system or even how to develop for it, and so on.
Agents are nowhere near capable of replacing this, and even if they were, they'd change it differently in ways that are often undesirable or illegal. I get that there's this fascination with "imagine if it were good enough to..." but it's not, and the systems AI must exist in are both vast and highly difficult to navigate.
The status quo system you describe isn't objectively optimal. It sounds archaic to me. "We" would never intentionally design it this way if we had a fresh start. I believe it is this way due to a myriad of reasons, mostly stemming from the frailty and avarice of people.
I'd argue the opposite of your stance: we've never had a chance at a fresh start without destruction, but agents (or their near-future offspring) can hold our entire systems "in memory", and therefore might be our only chance at a redo without literally killing ourselves to get there.
It's not claimed to be an "objectively optimal" solution, it's claimed to represent how the world works.
I don't know where you're going with discussion of destruction and killing, but even fairly simple consumer products have any number of edge cases that initial specifications rarely capture. I'm not sure what "objectively optimal" is supposed to mean here, either.
If a spec described every edge case it would basically be executable already.
The pain of developing software at scale is that you're creating the blueprint on the fly from high-level vague directions.
Something trivial that nevertheless often results in meetings and debate in the development world:
Spec requirement 1: "Give new users a 10% discount, but only if they haven't purchased in the last year."
Spec requirement 2, a year later: "Now offer a second product the user can purchase."
Does the 10% discount apply to the second product too? Do you get the 10% discount on the second product if you purchased the first product in the last year, or does a purchase on any product consume the discount eligibility? What if the prices are very different and customers would be pissed off if a $1 discount on the cheaper product (which didn't meet their needs in the end) prevented them from getting a 10% discount 9 months later (which they think will)? What if the second product is a superset of the first product? What if there are different relevant laws in different jurisdictions where you're selling your product?
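To make the ambiguity concrete, here are two implementations that both arguably satisfy the written requirements; the field names are invented purely for illustration:

```python
from datetime import datetime, timedelta

# Illustrative only: `user.purchases`, `purchased_at`, and `product_id` are made-up fields.

def eligible_reading_a(user, product, now: datetime) -> bool:
    # Reading A: a purchase of *any* product in the last year consumes eligibility.
    return all(p.purchased_at < now - timedelta(days=365) for p in user.purchases)

def eligible_reading_b(user, product, now: datetime) -> bool:
    # Reading B: eligibility is tracked per product, so buying product 1 last month
    # doesn't block the 10% discount on product 2.
    return all(p.purchased_at < now - timedelta(days=365)
               for p in user.purchases if p.product_id == product.product_id)
```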
Agents aren't going to figure out the intent of the company's principals automatically here, because the decision maker doesn't actually even realize it's a question until the implementers get into the weeds.
A sufficiently advanced agent would present all the options to the person running the task, and then the humans could decide. But then you've slowed things back down to the pace of the human decision-makers.
The complexities only increase as the product grows. And once you get into distributed or concurrent systems even most of our code today is ambiguous enough about intent that bugs are common.
Agents quite literally cannot do this today. Additionally, I disagree with your point:
> The status quo system you describe isn't objectively optimal.
On the basis that I would challenge you or anyone to judge what is objectively optimal. Google Search is a wildly complex system, an iceberg of rules on top of rules, specifically because it is digital infrastructure surrounding an organic system filled with a diverse group of people with ever-changing preferences and behaviors. What, exactly, would be optimal here?
"deeply sophisticated and constrained requirements"
Yes, this resonates completely. I think many are forgetting that formal languages and code exist precisely because natural language is so ambiguous that it doesn't capture complex behavior.
LLMs are great at interpolating between implicit and unsaid requirements, but whether their interpolation matches your mental model is a dice throw.
Overall, I agree - it would take far more sophisticated and deterministic or "logical" AI, better capable of tracking constraints, knowing what to check and double-check, etc. Right now, AI is far too scattered to pull that off (or, for the stuff that isn't scattered, it's largely just incapable), but a lot of smart people are thinking about it.
> but language models frequently outperform us in reasoning
what
99% of the time their reasoning is laughable. Or even if their reasoning is on the right track, they often just ignore it in the final answer, and do the stupid thing anyway.
There are also two kinds of people - those who are objective enough to tell when it happens and those who will never even see when they’re outperformed because of their cognitive biases.
The best part is when a "thinking" model carefully thinks and then says something that is obviously illogical, when the model clearly has both the knowledge and context to know it's wrong. And then you ask it to double-check and give it a tiny hint about how it's wrong, and it profusely apologizes, compliments you on your wisdom, and then says something else dumb.
I fully believe that LLMs encode enormous amounts of knowledge (some of which is even correct, and much of which their operator does not personally possess), are capable of ingesting large amounts of data and working quickly, and have essentially no judgment or particularly strong intelligence of the non-memorized sort. This can still be very valuable!
Maybe this will change over the next few years, and maybe it won’t. I’m not at all convinced that scraping the bottom of the barrel for more billions and trillions of low-quality training tokens will help much.
I feel like one coding benchmark should just be repeatedly telling it to double-check or fix something that's actually perfectly fine, and watching how badly it deep-fries your code base.
The key difference between that and humans, of course, is that most humans will double down on their error and insist that your correction is wrong, throwing a kitchen sink of appeals to authority, motte-and-bailey, and other rhetorical techniques at you.
That's not any different in practice from the LLM "apologising" to placate you and then making a similar mistake again.
It's not even a different strategy. It's just using rhetoric in a more limited way, and without human emotion.
These are style over substance machines. Their cognitive abilities are extremely ragged and unreliable - sometimes brilliant, sometimes useless, sometimes wrong.
But we give them the benefit of the doubt because they hide behind grammatically correct sentences that appear to make sense, and we're primed to assume that language = sentience = intelligence.
True "interruption" requires continuous learning, and the current model is essentially a dead frog, and frozen weights cannot be truly grounded in real time.
Yeah, I don't understand how people are "leaving it running overnight" to successfully implement features. There just seems to be a large disconnect between people who are all-in on AI development and those who aren't. I have a suspicion that the former are using Python/JS and the features they are implementing are simple CRUD APIs, while the latter are using more than simple systems/languages.
I think the problem is that, despite feeding it all the context and having all the right MCP agents hooked up, there isn't a human in the loop. So it will just reason against itself, causing these laughably stupid decisions. For simple boilerplate tasks this isn't a problem. But as soon as the scope is outside of a CRUD/boilerplate problem, the whole thing crumbles.
I'd really like to know which use cases work and which don't. And when folks say they use agentic AI to churn through tokens to automate virtually the entire SDLC, are they just cherry picking the situations that turned out well, or do they really have prompting and workflow approaches that indeed increase their productivity 10-fold? Or, as you mention, is it possibly a niche area which works well?
My personal experience the past five months has been very mixed. If I "let 'er rip" it's mostly junk I need to refactor or redo by micro-managing the AI. At the moment, at least for what I do, AI is like a fantastic calculator that speeds up your work, but where you still should be pushing the buttons.
I haven't seen an LLM stay on task anywhere near that long, like...ever. The only thing that works better left running overnight that has anything to do with ML, in my experience, is training.
> 99% of the time their reasoning is laughable. Or even if their reasoning is on the right track, they often just ignore it in the final answer, and do the stupid thing anyway.
99% chance you're using the wrong model.
Effective tool use is a valuable skill, arguably the only one that still matters.
RL is a training method, and it improves the model itself. So basically one step (e.g. a successful test run, finding a search result) could create positive and negative examples for the other step (e.g. the coding agent, the search agent). Using this, the base model itself will improve to satisfy other demands, and if it reaches close to 100% accuracy (which I believe it could, as models mostly fail due to dumb mistakes in tests), you don't need the testing agent at all.
The issue is that it is a stretch to call it reinforcement learning when all we currently do (in the context of LLMs) is multiply the reward by the learning rate.
It sounds cool as marketing. It helps improve LLMs a bit. And it will never yield something like an AGI or anything that is "reasoning". Unless you also redefine the word reasoning, of course.
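For readers wondering what "multiply the reward by the learning rate" refers to: in the textbook policy-gradient (REINFORCE) setup, the reward scales the gradient of the log-probability of the sampled output, roughly as in this toy sketch (not any lab's actual training loop):

```python
import torch

# Toy REINFORCE step: a one-parameter "policy" over two tokens, showing that the
# vanilla update really is (learning rate x reward x gradient of the log-probability).
theta = torch.zeros(2, requires_grad=True)
opt = torch.optim.SGD([theta], lr=1e-2)

probs = torch.softmax(theta, dim=0)
action = torch.multinomial(probs, 1).item()   # sample a "token"
reward = 1.0 if action == 0 else -1.0         # verifiable reward from some grader

loss = -reward * torch.log(probs[action])     # REINFORCE objective for the sampled action
opt.zero_grad()
loss.backward()
opt.step()  # theta changes by lr * reward * d(log p)/d(theta)
```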
I think we will see something like this in the near future. But there are two very bold claims. One is the hope that RL will lead to generalization in other domains, which, based on current evidence, seems far-fetched. The other is that existing software and specs are good "RL-gym" data. The whole idea behind RL is that the model explores the best paths, but if the software being replicated is suboptimal for an agent-interaction paradigm (it was written for humans, not agents), there is a high chance that even with the compute, the model would be suboptimal. There is a parallel trend where current AI systems are abstracting entire workflows; not accounting for that would lead to outcomes that are not cognizant of current requirements.
I had the opportunity to meet Tamay not too long ago; very sharp guy. A lot of people I know are working on approaches to meta-RL or exploration-based RL, where the goal is to build a foundation model of sorts that has a really good world model across diverse tasks and can predict good policies (or good policy updates) from limited rollouts and/or a sparse reward signal. We're not there quite yet, but as Altman recently said, "we don't have AGI until we have something that learns continuously", and there's a huge race in this space to make that happen.
I recently wrote a post about scaling RL that has some similar ideas:
> How to Scale RL to 10^26 FLOPs (blog.jxmo.io/p/how-to-scale-rl-to-1026-flops)
The basic premise behind both essays is that for AI to make another big jump in capabilities, we need to find new data to train on.
My proposal was reusing text from the Internet and doing RL on next-token prediction. The linked post here instead suggests doing 'replication training', which they define as "tasking AIs with duplicating existing software products, or specific features within them".
They have a point about RL's increasing importance. From my outsider perspective, all major advances in model capabilities recently have come from RL, so it's natural to expect that we can "milk" RL more for performance gains.
Scaling RL is a natural way to attempt that.
What I don't necessarily see is the generalization factor - say we improve software engineering and math performance through RL (probably easier for software engineering than math, due to the available training corpus).
If that generalization factor doesn't hold, do the economics still work out? An expert-level software model would be useful to our profession, sure, but would it be enough to recoup the training costs if it's not applicable to other industries?
One detail the OP glosses over is the increasing cost of RL as the sequence length increases. If we're just reasoning through a simple arithmetic problem, it's a pretty manageable number of reasoning tokens and answer tokens.
For a complete piece of software the answer might be 10 million tokens, and that doesn’t even count the reasoning.
Now imagine that there was a mistake at some point. The model will need to go back to fix it, and understand the cascade of things the bugfix changed. It might be possible to keep that all in the context window but that seems like it won’t scale.
I'd expect that's manageable by some sort of agent-of-agent pattern. You have a high-level planning instance that calls upon fresh LLM instances (new context window!) for executing more targeted tasks or bug-fixes.
Currently, an LLM with everything under the sun in the context window behaves rather poorly and gets confused by that, even if we're not exceeding the context window length.
Although it'd certainly also be interesting to train for increasing the maximum _actually_ usable context window length; I don't know how feasible that would be.
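A rough sketch of that agent-of-agents pattern, with `planner` and `spawn_agent` as placeholders for whatever orchestration and model-session machinery is used:

```python
def run_project(goal: str, planner, spawn_agent) -> list:
    """Sketch of the agent-of-agents pattern; `planner` and `spawn_agent` are placeholders."""
    results = []
    for subtask in planner.plan(goal):        # high-level task list, kept deliberately short
        worker = spawn_agent()                # fresh LLM instance: a clean context window
        summary = worker(subtask)             # worker returns a compact summary, not its transcript
        planner.observe(subtask, summary)     # the planner only ever sees summaries
        results.append(summary)
    return results
```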
1) If only there were a cryptocurrency tied to training AI models, making crypto grinding useful rather than math that solves no real-world problem external to the token creation itself.
2) With larger and larger AI models you start to get more hallucinations. Maybe we should focus on dedicated, highly tuned models for dedicated aspects, with a higher-up conductor model that knows what to farm out to which models, combines the results, and sends further requests to other models to come to an answer. There's certainly a need for highly tuned niche models. Take language recognition as an example: a model that could identify the language, local dialect, and accent would then hand off to a language model better tuned for the speaker it is recognising. That approach feels like the way to go over one large model that does it all itself.
I remember mining something called Gridcoin over a decade ago. It's a cryptocurrency tied to the BOINC project and rewards providing computing power to science.
I don't know if never defining the acronym is a sort of passive-aggressive "if you have to ask, you're not the audience we're looking to hire" filter, but I am with you. Having AI investment dollars doesn't excuse you from the standard expectation that you clearly state what you're describing.
> Simple command-line tools that implement obscure hashing and encryption algorithms are straightforward initial targets, but this approach can easily extend to more complex software, such as websites, professional software, and games.
>Each replication task consists of a detailed specification and a reference implementation. The central idea is that AI models are trained to produce an implementation that precisely matches the reference behavior.
I really don't see the connection between the statements in the article's content and the assertion near the start that:
>Doing this effectively will produce RL models with strong few-shot, task-agnostic abilities capable of quickly adapting to entirely new tasks.
There's no clear reason outlined in the piece for why narrow, well-scoped 1-person-day tasks might scale up to 10,000-person-year projects. If they did, we should expect far more 10,000-person-year projects in the real economy, because the learning curve for firms scaling up would approximate a straight line. There are very few 10,000-person-year projects, and very many 1-person-day projects.
It seems more like this will spend an unimaginable amount of compute, in order to produce models which are incredibly good at a very precise form of IP theft, and not especially good at any generalisable skills. It's so ludicrously rare that an engineer (or author, illustrator, etc) is tasked with "create a pixel-perfect reimplementation of this existing tool".
TL;DR: The OP believes that if we train large AI models via RL to duplicate the behavior of existing software (for example, train them to duplicate the behavior of an existing spreadsheet, an existing command-line tool, or an existing application), large AI models will get good at:
* reading and understanding long, complicated, detailed instructions,
* executing those instructions meticulously and precisely, without errors,
* noticing their mistakes, if there are any along the way, and recovering from them,
* not settling prematurely for solutions that look "good enough" but aren't, and
* undertaking large, complicated projects which previously could be completed only by teams of human experts.
There's a good chance the OP is right, in my view.
With RL it's hard to define a score function in many categories.
This is especially visible in current coding capabilities.
LLMs will very often create sloppy solutions because those work well in RL.
Hardcoding API keys? Ignoring errors? Disabling lints? Those pass automated evaluation and are therefore reinforced in training.
Are they good solutions? Of course not.
It's very hard to define (in a way that can be turned into lints) what makes code readable and maintainable.
Using another LLM for this task could cause the original model to game the system by abusing weaknesses in the other model.
For other tasks, how do you even evaluate things like user experience or app design? How do you properly evaluate a pelican riding a bicycle?
> hardcoding API keys? ignoring errors? disabling lints?
These kind of "rookie mistakes" are not things that any modern LLM is likely to do. Indeed, I had to argue quite strongly with Gemini recently when I was learning a new tool (so basically just playing around with a fully local setup) and I hardcoded an API key then tried to commit it. The LLM did NOT like that! I had to carefully explain that this was a toy repo.
The argument against this (by Gemini) was that toy repos often grow into production tools so it's best to follow basic security rules from the start. Which, to be fair, is a good argument. I still committed the key though (and deleted the repo a day or so later).
> Rather than fine-tuning models on a small number of environments, we expect the field will shift toward massive-scale training across thousands of diverse environments.
This is a great hypothesis for you to prove one way or the other.
> Doing this effectively will produce RL models with strong few-shot, task-agnostic abilities capable of quickly adapting to entirely new tasks.
I am not sure if I buy that, frankly. Even if you were to develop radically efficient means to create "effective and comprehensive" test suites that power replication training, it is not at all a given that it will translate to entirely new tasks. Yes, there is the bitter lesson and all that but we don't know if this is _the_ right hill to climb. Again, at best, this is a hypothesis.
> But achieving this will require training environments at a scale and diversity that dwarf anything currently available.
Yes. You should try it. Let us know if it works. All the best!
Not black box, no. The spec presumably tells the model everything it needs to know or look up. But in contrast with Fibonacci, the exact code is unlikely to be in the training set verbatim.
When I see all the bad verbal reasoning being spewed around LLMs, it becomes easier to understand why so many people think these LLMs are intelligent.
Non-rigorous reasoning is at the root of the problem here. AI hype is often indistinguishable from AI slop, because those who believe it are also not very good at demanding and asserting rigor.
Isn't "replication training" just adversarial training, as in GANs?
One network tries to clean-room implement the hash function. The other network tries to find an input for which the reference implementation behaves differently.
I like this idea. It's adjacent to differential testing of manually created software (as in compilers) and to mutation testing for evaluation and generation of test suites.
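A toy version of that differential-testing loop, using MD5 from `hashlib` as the stand-in reference and random fuzzing where a trained adversary network would search far more cleverly; the `candidate` callable is whatever implementation the model produced:

```python
import hashlib
import random
from typing import Callable, Optional

def reference_hash(data: bytes) -> str:
    # Stand-in reference implementation (here just MD5 from the standard library).
    return hashlib.md5(data).hexdigest()

def find_counterexample(candidate: Callable[[bytes], str], trials: int = 10_000) -> Optional[bytes]:
    """Adversary role: look for an input where candidate and reference disagree."""
    for _ in range(trials):
        data = bytes(random.getrandbits(8) for _ in range(random.randint(0, 64)))
        if candidate(data) != reference_hash(data):
            return data  # disagreement found: negative signal for the implementer
    return None          # behaviorally identical on everything tried this round
```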
This makes no sense. RL training data is predicated on past behavior of the agent. Whoever wrote this doesn't seem to fundamentally grasp what they are saying.
LLMs can be trained in an unsupervised way on static documents. That is really the key feature that lets them be as smart and effective as they are. If you had every other technology that LLMs are built on, but you didn't have hundreds of terabytes of text lying around, there would be no practical way to make them even a tiny fraction as effective as they are currently.
> Whoever wrote this doesn't seem to fundamentally grasp what they are saying.
RL != only online learning.
There's a ton of research on offline and imitation-based RL where the training data isn't tied to an agent's past policy - which is exactly what this article is pointing to.
I'm not sufficiently familiar with the details on ML to assess the proposition made in the article.
From my understanding, RL is a tuning approach on LLMs, so the outcome is still the same kind of beast, albeit with a different parameter set.
So empirically, I actually thought that the leading companies would already be strongly focused on improving coding capabilities, since this is where LLMs are very effective, and where they have huge cash flows from token consumption.
So, either the motivation isn't there, or they're already doing something like that, or they know it's not as effective as the approaches they already have.
Actually, I didn't. Correct me if I am wrong, but my understanding is that RL is still an LLM tuning approach, i.e. an optimization of its parameter set, no matter whether it's done at scale or via HF.
I like to imagine a world where we embraced nuclear and renewables, energy is free, and we could throw gigawatt-hours at whatever crazy computation we can think of.
Where can I learn about the nitty-gritty of RL and RL training? For instance, I want to understand how, say, software could be used as input (tokenization/vectorization of the code?).
My understanding is that this is essentially how RLHF works, and it doesn't scale. As you run RL for longer, the model will learn how to cheat the imperfections of the grader, instead of getting better at the task at hand.
Therefore, to scale RL you really need good graders, and determinism is king.
Do you think constitutional approaches would help here? (Verifiable reward for the main score, but then asking the model to self-critique for security and quality.)
The "GPT-3 moment" framing is a bit hype-y I think? GPT-3 eliminated the need for task-specific fine-tuning, but from the article RL wouldn't replace LLM-style pretraining. So this is more of an incremental advance than the paradigm shift GPT-3 represented. That said, if it unlocks RL generalization that would be huge.
The core claim that massive-scale RL will unlock generalization doesn't seem that surprising since we've seen the scaling hypothesis play out across ML. But "replication training" on software is interesting: learning by copying existing programs potentially unlocks a ton of complex training data with objective evaluation criteria.
To me, the big unanswered question is whether skills learned from replicating software would generalize to other reasoning tasks. That's a significant "if" - great if it works, pointless if it doesn't.
It's a very big "if" because other fields are comparatively underspecified. There's no equivalent to a compiler or interpreter in most cases (with spreadsheets being the lingua franca that comes even close for most industries).
It would "work" but I think it will need even more scrutiny by experts to confirm what's correct and what needs to be re-generated. Please please no vibe accounting.
Accounting, specifically book-keeping, really plays to the strengths of LLMs - pattern matching within a bounded context.
The primary task in book-keeping is to classify transactions (from expense vouchers, bank transactions, sales and purchase invoices and so on) and slot them into the Chart of Accounts of the business.
LLMs can already do this well without any domain/business specific context. For example - a fuel entry is so obvious that they can match it into a similar sounding account in the CoA.
And for others where human discretion is required, we can add a line of instruction in the prompt, and that classification is permanently encoded. A large chunk of these kind of entries are repetitive in nature, and so each such custom instruction is a long-term automation.
You might have not been speaking about simple book-keeping. If so, I'm curious to learn.
Audit is at the heart of accounting, and LLMs are the antithesis of an audit trail.
I'm sorry I don't follow. The fact that you use an LLM to classify a transaction does not mean there is no audit trail for the fact. There should also be a manual verifier who's ultimately responsible for the entries, so that we do not abdicate responsibility to black boxes.
If you mark data as "Processed by LLM", that in turn taints all inference from it.
Requirements for a human in the loop devolve to ticking a box by someone who doesn't realise the responsibility they have been burdened with.
Mark my words, some unfortunate soul will be then be thrown under the bus once a major scandal arises from such use of LLMs.
As an example, companies aren't supposed to use AI for hiring, they are supposed to have all decisions made by a human-in-the-loop. Inevitably this just means presenting a massive grid of outcomes to someone who never actually goes against the choices of the machine.
The more junior the employee, the "better". They won't challenge the system, and they won't realise the liability they're setting themselves up with, and the company will more easily shove them under the proverbial bus if there ever is an issue.
Hiring is too nebulous, too hard to get concrete data for, and too hard to inspect outcomes to properly check.
Financial auditing however is the opposite of that. It's hard numbers. Inevitably when discrepancies arise, people run around chasing other people to get all their numbers close enough to something that makes sense. There's enough human wiggle-room to get away with chaotic processes that still demand accountability.
This is possibly the worst place you could put LLMs, if you care about actual outcomes:
1. Mistakes aren't going to get noticed.
2. If they are noticed, people aren't going to be empowered to actually challenge them, especially once they're used to the LLM doing the work.
3. People will be held responsible for the LLM's mistakes, despite pressure (And the general sense of time-pressure in audit is already immense ) to sign-off.
4. It's a black-box, so any faults cannot be easily diagnosed, the best you can do is try to re-prompt in a way that doesn't happen.
Well put. It should always be "Created by <person>" rather than "Processed by LLM". We can already see it with Claude Code - its commit messages contain a "Generated by Claude Code" line, and it guarantees a pandemic of diffused responsibility in software engineering. But I think there is no point in railing against it - market forces, corporate incentives, and tragedy of the commons all together make it an inevitability.
Now instead of having accountants audit transactions you will have accountants audit LLM output for possible hallucinations. Seems counter productive.
> Please please no vibe accounting.
Funny you mention; There are multiple companies in Sweden working on AI/ML based accounting. It's not so different from AI/ML based automated driving.
I've seen some of those but all of the ones I've looked at also had a panel of experts who could give it a once-over (or re-work) before sending it back to the client. I'd compare it more to cruise control or driver-assist but not quite automated driving.
This article stands as complete hype. They just seem to offer an idea of "replication training" which is just some vague agentic distributed RL. Multi-agent distributed reinforcement learning algorithms have been in the actual literature for a while. I suggest studying what DeepMind is doing for current state of the art in agentic distributed RL.
I didn’t think it was vague. Given an existing piece of software, write a detailed spec on what it does and then reward the model for matching its performance.
The vague part is whether this will generalize to other non software domains.
> write a detailed spec on what it does
A much harder task than writing said software
Arguably LLM with RL has already had its GPT-3 moment with DeepSeek R1 doing so well that it deleted a trillion $ + of stock value in big tech. If you see the GPT-3 moment as the moment where people definitely took notice, this was one of those moment.
This works great for software, math and games where you can have cheap validation. But what about messy real world tasks? I think hindsight learning from chat logs could fit the bill. What do I mean?
Imagine a long conversation. It is hard to judge if an AI response was useful or not immediately, but if you know the following 20 messages, it might be easy to infer. Not only you can see how it went, but sometimes you get real world validation.
For example a user comes to a LLM with a task, takes an idea, tries it in reality. Later they return, maybe in a new chat session, and continue iterating. You get real world testing of LLM responses through people.
This can be used to generate "preference scores", and train a preference model, with which you can do RLHF. So the user privacy is protected.
I call this the human-AI experience flywheel. Of course the larger the user base, the more experience the model collects. At the moment OpenAI has 500M users, they probably generate 0.5T interactive tokens/day. Those tokens go both into human brains and LLM logs.
It’s not about environment engineering anymore, it's about consequence harvesting. Meaningful validation emerges from systems actually being used by humans for real purposes.
I worked through this for a tax company. They had a huge pile of artifacts from tax questions worked up for clients. What we did is we "reverse engineered" the process of the questions that would lead to that tax memo and the research steps to find the sources and conclusions. It worked well and we were able to replicate the process which the SME's created these memos.
For a given tax question, could you come up with the same memo quoting the same sources and same conclusion?
> Each replication task consists of a detailed specification and a reference implementation. The central idea is that AI models are trained to produce an implementation that precisely matches the reference behavior. This clear-cut approach significantly simplifies evaluation, as the grading criteria are objective and direct: either the generated implementation behaves identically to the reference, or it doesn’t.
OK, but then you have to produce the detailed specification, working backward from the reference implementation. This is extremely non-trivial and it significantly weakens the TFA's parallels to pre-training, in which you don't need really need inputs other than raw text corpora.
I'm not saying this eliminates the idea outright, but I do think it hobbles it badly.
I’d like to courteously disagree. I think existing models and existing tools are good enough to bootstrap this at least.
I’d propose the following architecture:
Step 1: Microsoft phi style - read code and write specifications using a frontier model. You could use an ensemble here to nitpick the spec; it’s only going to get written once. We also have of course many many rfcs and codebases that conform to them or where they do not we have an existing repository of bug reports, patches, forum complaints, etc.
Step 2-4: implement multilayer evaluation: does it compile? Does an existing model think the code complies with the spec on inspection? When it’s run on qemu are the key evals the same as the original software?
I propose most of steps 2-4 are automatable and rely on existing tooling and provide a framework that is, if not cheap, achievable. I’m also sure someone could improve this plan with a few more minutes of thought.
To me the interesting question is - will this add capabilities at current model sizes? My prior is yes in that the current behemoth size models feel like they are only incrementally better than 1/10 size distills. I interpret that to mean we haven’t gotten the most out of these larger scales. I will note Dario disagrees on this - he’s publicly said we need at least 10x more scale than we have now.
The detailed specification is the output for a particular input.
And you can use a fuzzer to augument that.
When prompted correctly, models could generate good specification in form of pretty exhaustive tests. While all tests have weaknesses and are not formal specification, they could get us 99% there.
I’ve been exploring this too, since I rely on LLMs a lot to build software. I’ve noticed that our dev loop-writing, testing-is often mostly human-guided, but language models frequently outperform us in reasoning. If we plug in more automation; MCP tools controlling browsers, documentation readers, requirement analysers, we can make the cycle much more automated, with less human involvement.
This article suggests scaling up RL by exposing models to thousands of environments
I think we can already achieve something similar by chaining multiple agents:
1. A “requirement” agent that uses browser tools to craft detailed specs from docs.
2. A coding agent that sets up environments (Docker, build tools) via browser or CLI.
3. A testing agent that validates code against specs, again through tooling.
4. A feedback loop where the tester guides the coder based on results.
Put together, this system becomes a fully autonomous development pipeline-especially for small projects. In practice, I’ve left my machine running overnight, and these agents propose new features, implement them, run tests, and push to repo once they pass. It works surprisingly well.
The main barrier is cost—spinning up many powerful models is expensive. But on a modest scale, this method is remarkably effective.
> The main barrier is cost
I very much disagree. For the larger, more sophisticated stuff that runs our world, it is not cost that prohibits wide and deep automation. It's deeply sophisticated and constrained requirements, highly complex existing behaviors that may or may not be able to change, systems of people who don't always hold the information needed, usually wildly out of date internal docs that describe the system or even how to develop for it, and so on.
Agents are nowhere near capable of replacing this, and even if they were, they'd change it differently in ways that are often undesirable or illegal. I get that there's this fascination with "imagine if it were good enough to..." but it's not, and the systems AI must exist in are both vast and highly difficult to navigate.
The status quo system you describe isn't objectively optimal. It sounds archaic to me. "We" would never intentionally design it this way if we had a fresh start. I believe it is this way due to a meriad of reasons, mostly stemming from the frailty and avarice of people.
I'd argue the opposite of your stance: we've never had a chance at a fresh start without destruction, but agents (or their near-future offspring) can hold our entire systems "in nemory", and therefore might be our only chance at a redo without literally killing ourselves to get there.
It's not claimed to be an "objectively optimal" solution, it's claimed to represent how the world works.
I don't know where you're going with discussion of destruction and killing, but even fairly simple consumer products have any number of edge cases that initial specifications rarely capture. I'm not sure what "objectively optimal" is supposed to mean here, either.
If a spec described every edge case it would basically be executable already.
The pain of developing software at scale is that you're creating the blueprint on the fly from high-level vague directions.
Something trivial that nevertheless often results in meetings and debate in the development world:
Spec requirement 1: "Give new users a 10% discount, but only if they haven't purchased in the last year."
Spec requirement 2, a year later: "Now offer a second product the user can purchase."
Does the 10% discount apply to the second product too? Do you get the 10% discount on the second product if you purchased the first product in the last year, or does a purchase on any product consume the discount eligibility? What if the prices are very different and customers would be pissed off if a $1 discount on the cheaper product (which didn't meet their needs in the end) prevented them from getting a 10$ discount 9 months later (which they think will)? What if the second product is a superset of the first product? What if there are different relevant laws in different jurisdictions where you're selling your product?
Agents aren't going to figure out the intent of the company's principal's automatically here because the decision maker doesn't actually even realize it's a question until the implementers get into the weeds.
A sufficiently advanced agent would present all the options to the person running the task, and then the humans could decide. But then you've slowed things back down the pace of the human decision makers.
The complexities only increase as the product grows. And once you get into distributed or concurrent systems even most of our code today is ambiguous enough about intent that bugs are common.
Agents quite literally cannot do this today.
Additionally, I disagree with your point:
> The status quo system you describe isn't objectively optimal.
On the basis that I would challenge you or anyone to judge what is objectively optimal. Google Search is a wildly complex system, an iceberg or rules on top of rules specifically because it is a digital infrastructure surrounding an organic system filled with a diverse group of people with ever-changing preferences and behaviors. What, exactly, would be optimal here?
"deeply sophisticated and constrained requirements"
Yes this resonates completely. I think many are forgetting the purpose of formal language and code was because natural language has such high ambiguity that it doesn't capture complex behavior
LLMs are great at interpolating between implicit and unsaid requirements but whether their interpolation matches your mental model is a dice throw
Overall, I agree - it would take far more sophisticated and deterministic or 'logical' AI better capable of tracking constraints, knowing what to check and double check, etc... Right now, AI is far too scattered to pull that off (or, for the stuff that isn't scattered, it's largely just incapable), but a lot of smart people are thinking about it.
Imagine if...nevermind.
> they'd change it differently in ways that are often undesirable or illegal.
So...like SAP then?
> but language models frequently outperform us in reasoning
what
99% of the time their reasoning is laughable. Or even if their reasoning is on the right track, they often just ignore it in the final answer, and do the stupid thing anyway.
There are 2 kinds of people. Those who are outperformed on their most common tasks by LLMs and those who aren’t.
there are also two kinds of people - those who are excited by that and those who are not.
The result is a 2x2 matrix where several quadrants are deeply concerning to me.
There are also two kinds of people - those who are objective enough to tell when it happens and those who will never even see when they’re outperformed because of their cognitive biases.
I give you a 2x2x2 matrix.
> I give you a 2x2x2 matrix.
That'd be a tensor, no?
A rank-3 tensor, yes. Matrices are rank-2 tensors.
God I hated tensors in grad school. Give me a Taylor series any day.
Everything that seems complicated is just a "fancy matrix".
Sure, but if a person can find an easier way to do their job, they’ll usually do it. Usually the bias is towards less energy expenditure.
For many people, yes. For people who have their identity invested in being the smartest person in the room, life is considerably harder.
I'm sure if we work hard enough we can add a meta-meta-cognition level. Cognition is just 2^n series of binary states right?
Which quadrant is NOT concerning to you?
The best part when a “thinking” model carefully thinks and then says something that is obviously illogical, when the model clearly has both the knowledge and context to know it’s wrong. And then you ask it to double check and you give it a tiny hint about how it’s wrong, and it profusely apologizes, compliments you on your wisdom, and then says something else dumb.
I fully believe that LLMs encode enormous amounts of knowledge (some of which is even correct, and much of which their operator does not personally possess), are capable of working quickly and ingesting large amounts of data and working quickly, and have essentially no judgment or particularly strong intelligence of the non-memorized sort. This can still be very valuable!
Maybe this will change over the next few years, and maybe it won’t. I’m not at all convinced that scraping the bottom of the barrel for more billions and trillions of low-quality training tokens will help much.
I feel like one coding benchmark should be just telling it to double check or fix something that's actually perfectly fine repeatedly and watch how bad it deep fries your code base.
They key difference between that and humans, if course, is that most humans will double down on their error and insist that your correction is wrong, throwing a kitchen sink of appeals to authority, motte/bailey, and other rhetorical techniques at you.
That's not any different in practice to the LLM "apologising" to placate you and then making a similar mistake again.
It's not even a different strategy. It's just using rhetoric in a more limited way, and without human emotion.
These are style over substance machines. Their cognitive abilities are extremely ragged and unreliable - sometimes brilliant, sometimes useless, sometimes wrong.
But we give them the benefit of the doubt because they hide behind grammatically correct sentences that appear to make sense, and we're primed to assume that language = sentience = intelligence.
True "interruption" requires continuous learning, and the current model is essentially a dead frog, and frozen weights cannot be truly grounded in real time.
https://news.ycombinator.com/item?id=44488126
Yea I don't understand how people are "leaving it running overnight" to successfully implement features. There just seems to be a large disconnect between people who are all in on AI development and those who aren't. I have a suspicion that the former are using Python/JS and the features they are implementing are simple CRUD APIs while the latter are using more than simple systems/languages.
I think the problem is that despite feeding it all the context and having all the right MCPs agents hooked up, is that there isn't a human-in-loop. So it will just reason against itself causing these laughable stupid decisions. For simple boilerplate tasks this isn't a problem. But as soon as the scope is outside of a CRUD/boilerplate problem, the whole thing crumbles.
I'd really like to know which use cases work and which don't. And when folks say they use agentic AI to churn through tokens to automate virtually the entire SDLC, are they just cherry picking the situations that turned out well, or do they really have prompting and workflow approaches that indeed increase their productivity 10-fold? Or, as you mention, is it possibly a niche area which works well?
My personal experience the past five months has been very mixed. If I "let 'er rip" it's mostly junk I need to refactor or redo by micro-managing the AI. At the moment, at least for what I do, AI is like a fantastic calculator that speeds up your work, but where you still should be pushing the buttons.
Or - crazy idea here - they're just full of it.
I haven't seen an LLM stay on task anywhere near that long, like...ever. The only thing that works better left running overnight that has anything to do with ML, in my experience, is training.
Yes, if a LLM outperforms you, you have never reasoned in your life.
I will assume you passed high-school based on your looks and not on your abilities.
99% of the time their reasoning is laughable. Or even if their reasoning is on the right track, they often just ignore it in the final answer, and do the stupid thing anyway.
99% chance you're using the wrong model.
Effective tool use is a valuable skill, arguably the only one that still matters.
RL is a training method and it improves the model itself. So basically one step(e.g. successful test run, finding search result) could create positive and negative examples for the other step(e.g. coding agent, search agent). And using this the base itself will improve to satisfy other demands and if it reaches close to 100% accuracy(which I believe it could as models mostly fail due to dumb mistakes in tests), you don't need the testing agent altogether.
[dead]
The issue is that it's a stretch to call it reinforcement learning when all we currently do (in the context of LLMs) is multiply the reward with the learning rate.
It sounds cool as marketing. It helps improve LLMs a bit. And it will never yield something like an AGI or anything that is "reasoning". Unless you also redefine the word reasoning, of course.
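To make the point concrete, a minimal REINFORCE-style sketch (my own illustration, assuming a HuggingFace-style causal LM interface): the scalar reward really does just scale the log-probability gradient of the sampled completion.

    import torch

    def reinforce_step(model, optimizer, prompt_ids, completion_ids, reward):
        """One REINFORCE-style step: the reward is a scalar multiplier on the
        log-probability gradient of the sampled completion."""
        input_ids = torch.cat([prompt_ids, completion_ids], dim=-1)
        logits = model(input_ids[:, :-1]).logits          # predict each next token
        logprobs = torch.log_softmax(logits, dim=-1)
        target = input_ids[:, 1:]
        token_logprobs = logprobs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
        # keep only the completion tokens (the part the policy actually sampled)
        completion_logprob = token_logprobs[:, prompt_ids.shape[-1] - 1:].sum()
        loss = -reward * completion_logprob               # reward just scales the gradient
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()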
I think we will see something like this in the near future. But there are two very bold claims. One is the hope that RL will lead to generalization in other domains, which, based on current evidence, seems far-fetched. The other is that software and specs make good "RL-gym" data. The whole idea behind RL is that the model explores the best paths, but if the software being replicated is suboptimal for the agent-interaction paradigm (it was written for humans, not agents), there is a high chance that, even with the compute, the resulting model will be suboptimal. There is also a parallel trend where current AI systems are abstracting away entire workflows; not accounting for that would lead to outcomes that are not cognizant of current requirements.
I had the opportunity to meet Tamay not too long ago, very sharp guy. A lot of people I know are working on approaches to meta-RL or exploration-based RL, where the goal is to build a foundation model of sorts that has a really good world model across diverse tasks and can predict good policies (or good policy updates) from limited rollouts and/or a sparse reward signal. We're not there quite yet, but as Altman recently said, "we don't have AGI until we have something that learns continuously", and there's a huge race in this space to make that happen.
I recently wrote a post about scaling RL that has some similar ideas:
> How to Scale RL to 10^26 FLOPs (blog.jxmo.io/p/how-to-scale-rl-to-1026-flops)
The basic premise behind both essays is that for AI to make another big jump in capabilities, we need to find new data to train on.
My proposal was reusing text from the Internet and doing RL on next-token prediction. The linked post here instead suggests doing 'replication training', which they define as "tasking AIs with duplicating existing software products, or specific features within them".
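Purely as an illustration (simplified, and not necessarily the exact setup the linked post describes), a reward for RL on next-token prediction over web text could look like:

    def next_token_rl_reward(model, text_tokens, prefix_len, k=32):
        """Illustrative reward: sample a k-token continuation of a web-text
        prefix and score it by exact-match overlap with the real continuation."""
        prefix = text_tokens[:prefix_len]
        reference = text_tokens[prefix_len:prefix_len + k]
        sampled = model.sample(prefix, max_new_tokens=k)   # assumed sampling API
        matches = sum(int(a == b) for a, b in zip(sampled, reference))
        return matches / k                                  # reward in [0, 1]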
They have a point about RL's increasing importance. From my outsider perspective, all the major advances in model capabilities over the last while have come from RL, so it's natural to expect that we can "milk" RL for further performance gains. Scaling RL is a natural way to attempt that.
What I don't necessarily see is the generalization factor. Say we improve software engineering and math performance through RL (probably easier for software engineering than math, due to the available training corpus). If that generalization doesn't hold, do the economics still work out? An expert-level software model would be useful to our profession, sure, but would it be enough to recoup the training costs if it's not applicable to other industries?
One detail the OP glosses over is the increasing cost of RL as the sequence length increases. If we're just reasoning through a simple arithmetic problem, it's a pretty manageable number of reasoning tokens and answer tokens.
For a complete piece of software the answer might be 10 million tokens, and that doesn’t even count the reasoning.
Now imagine that there was a mistake at some point. The model will need to go back to fix it, and understand the cascade of things the bugfix changed. It might be possible to keep that all in the context window but that seems like it won’t scale.
I'd expect that's manageable by some sort of agent-of-agent pattern. You have a high-level planning instance that calls upon fresh LLM instances (new context window!) for executing more targeted tasks or bug-fixes.
Currently, an LLM with everything under the sun in the context window behaves rather poorly and gets confused by that, even if we're not exceeding the context window length. Although it'd be certainly also interesting to train for increasing the maximum _actually_ usable context window length, I don't know how feasible that would be.
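As a very rough sketch of that agent-of-agents pattern (all interfaces here are hypothetical):

    def run_planner(llm, project_spec):
        """Hypothetical planner: decompose the project, then give each subtask
        to a fresh LLM call with its own small context window."""
        plan = llm.complete("Break this project into small, independent tasks:\n"
                            + project_spec)
        results = []
        for task in plan.splitlines():
            if not task.strip():
                continue
            # fresh context: only the task description, not the whole project history
            results.append((task, llm.complete("Implement exactly this task:\n" + task)))
        return results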
A few things that this made me think about:
1) If only there were a cryptocurrency tied to training AI models, making crypto grinding useful rather than math that solves no real-world problem beyond the token creation itself.
2) With larger and larger AI models you start to get more hallucinations. Maybe we should focus on dedicated, highly tuned models for dedicated aspects, with a higher-level conductor model that knows what to farm out to which models, combines the results, and sends further requests to other models as needed. There is certainly a need for highly tuned niche models. Take language recognition as an example: a model that identifies the language, local dialect and accent, and then hands off to a language model tuned for that specific speaker. That approach feels like the way to go over one large model that does it all itself.
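Roughly what I mean by a conductor, as a sketch (every interface here is made up):

    def conduct(request, conductor, specialists):
        """Sketch of the conductor idea: a small model routes each request to a
        dedicated niche model, then a final pass combines or checks the answer."""
        route = conductor.classify(request)                 # e.g. "german_speech", "sql", ...
        specialist = specialists.get(route, specialists["general"])
        draft = specialist.run(request)
        return conductor.combine(request, draft)            # optional final merge/check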
There are a couple:
https://ambient.xyz/
https://www.primeintellect.ai/
I remember mining something called Gridcoin over a decade ago. It's a cryptocurrency tied to the BOINC project and rewards providing computing power to science.
I have sadly lost access to my wallet since.
For 1) isn't prime intellect doing that or something like it?
Excellent commentary from Karpathy on RL: https://x.com/karpathy/status/1944435412489171119
What is RL? Real Life? Or Reinforced Learning?
Real Life would be IRL.
Bad title. I thought RL = Real Life, instead of Reinforcement Learning. It should be clearly indicated in the title.
I don't know if not ever defining the acronym is a sort of passive aggressive "if you have to ask, you're not the audience we're looking to hire" filter, but I am with you. Having AI investment dollars doesn't excuse you from standard expectations that you clearly state what you're describing.
> Simple command-line tools that implement obscure hashing and encryption algorithms are straightforward initial targets, but this approach can easily extend to more complex software, such as websites, professional software, and games.
> Each replication task consists of a detailed specification and a reference implementation. The central idea is that AI models are trained to produce an implementation that precisely matches the reference behavior.
I really don't see the connection between the statements in the article's body and the assertion near the start that:
> Doing this effectively will produce RL models with strong few-shot, task-agnostic abilities capable of quickly adapting to entirely new tasks.
There's no clear reason outlined in the piece for why narrow, well-scoped 1-person-day tasks might scale up to 10,000-person-year projects. If they did, we should expect far more 10,000-person-year projects in the real economy, because the learning curve for firms scaling up would approximate a straight line. There are very few 10,000-person-year projects, and very many 1-person-day projects.
It seems more like this will spend an unimaginable amount of compute, in order to produce models which are incredibly good at a very precise form of IP theft, and not especially good at any generalisable skills. It's so ludicrously rare that an engineer (or author, illustrator, etc) is tasked with "create a pixel-perfect reimplementation of this existing tool".
> models which are incredibly good at a very precise form of IP theft
I smell big success. Copyright laundering has been the killer app of AI thus far.
TL;DR: The OP believes that if we train large AI models via RL to duplicate the behavior of existing software (for example, train them to duplicate the behavior of an existing spreadsheet, an existing command-line tool, or an existing application), large AI models will get good at:
* reading and understanding long, complicated, detailed instructions,
* executing those instructions meticulously and precisely, without errors,
* noticing their mistakes, if any arise along the way, and recovering from them,
* not settling prematurely for solutions that look "good enough" but aren't, and
* undertaking large, complicated projects which previously could be completed only by teams of human experts.
There's a good chance the OP is right, in my view.
We sure live in interesting times!
What is RL in this context?
Reinforcement Learning
With RL it's hard to define a score function in many categories. This is especially visible in current coding capabilities. LLMs will very often create sloppy solutions because those work well in RL. Hardcoding API keys? Ignoring errors? Disabling lints? Those pass in automated evaluation and are therefore reinforced in training. Are they good solutions? Of course not.
It's very hard to define (in a way that can be turned into lints) what makes code readable and maintainable. Using another LLM for this task could cause the original model to game the system by abusing weaknesses in the other model.
For other tasks, how do you even evaluate things like, say, user experience or app design? How do you properly evaluate a pelican riding a bicycle?
> hardcoding API keys? ignoring errors? disabling lints?
These kinds of "rookie mistakes" are not things that any modern LLM is likely to make. Indeed, I had to argue quite strongly with Gemini recently when I was learning a new tool (so basically just playing around with a fully local setup) and I hardcoded an API key then tried to commit it. The LLM did NOT like that! I had to carefully explain that this was a toy repo.
The argument against this (by Gemini) was that toy repos often grow into production tools so it's best to follow basic security rules from the start. Which, to be fair, is a good argument. I still committed the key though (and deleted the repo a day or so later).
You can project them onto a linear space by gathering enough pairwise evaluations. PelicanElo.
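For example, a plain Elo fit over (winner, loser) judgments (just a sketch; any Bradley-Terry-style fit would do):

    from collections import defaultdict

    def elo_from_pairwise(comparisons, k=16.0, base=1500.0):
        """Fit one scalar score per model from pairwise judgments,
        given as (winner, loser) tuples, via the standard Elo update."""
        ratings = defaultdict(lambda: base)
        for winner, loser in comparisons:
            expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
            ratings[winner] += k * (1.0 - expected)
            ratings[loser] -= k * (1.0 - expected)
        return dict(ratings)

    # elo_from_pairwise([("model_a", "model_b"), ("model_b", "model_c"), ("model_a", "model_c")])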
> Rather than fine-tuning models on a small number of environments, we expect the field will shift toward massive-scale training across thousands of diverse environments.
This is a great hypothesis for you to prove one way or the other.
> Doing this effectively will produce RL models with strong few-shot, task-agnostic abilities capable of quickly adapting to entirely new tasks.
I am not sure I buy that, frankly. Even if you were to develop radically efficient means of creating "effective and comprehensive" test suites to power replication training, it is not at all a given that the result will transfer to entirely new tasks. Yes, there is the bitter lesson and all that, but we don't know if this is _the_ right hill to climb. Again, at best, this is a hypothesis.
> But achieving this will require training environments at a scale and diversity that dwarf anything currently available.
Yes. You should try it. Let us know if it works. All the best!
> Simple command-line tools that implement obscure hashing and encryption algorithms
So your plan is to train an MLP to black-box-replicate complex and highly non-linear encryption algorithms through gradient descent?
Not black box, no. The spec presumably tells the model everything it needs to know or look up. But in contrast with Fibonacci, the exact code is unlikely to be in the training set verbatim.
"Upcoming", sure. Everything can be "upcoming", like the famous ASI.
When I see all the bad verbal reasoning being spewed around LLMs it becomes easier to understand why so many people think these LLMs are intelligent.
Non-rigorous reasoning is at the root of the problem here. AI hype is often indistinguishable from AI slop, because those who believe it are also not very good at demanding and asserting rigor.
Isn't "replication training" just adversarial training, as in GANs?
One network tries to clean-room implement the hash function. The other network tries to find an input for which the reference implementation behaves differently.
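In its crudest form the adversary could just be a fuzzer (sketch only, with made-up function arguments):

    import random

    def find_counterexample(reference_impl, candidate_impl, trials=10_000):
        """Search for an input on which the candidate re-implementation
        diverges from the reference; a hit is both a reward signal and a
        new training case for the implementer."""
        for _ in range(trials):
            data = bytes(random.randrange(256) for _ in range(random.randrange(1, 64)))
            if reference_impl(data) != candidate_impl(data):
                return data
        return None   # no divergence found at this budget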
I like this idea. It's adjacent to differential testing of manually created software (as in compilers) and to mutation testing for evaluation and generation of test suites.
This makes no sense. RL training data is predicated on past behavior of the agent. Whoever wrote this doesn't seem to fundamentally grasp what they are saying.
LLMs can be trained in an unsupervised way on static documents. That is really the key feature that lets them be as smart and effective as they are. If you had every other technology that LLMs are built on, but didn't have hundreds of terabytes of text lying around, there would be no practical way to make them even a tiny fraction as effective as they are currently.
> Whoever wrote this doesn't seem to fundamentally grasp what they are saying.
RL != only online learning.
There's a ton of research on offline and imitation-based RL where the training data isn't tied to an agent's past policy - which is exactly what this article is pointing to.
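At the imitation end of that spectrum, the update doesn't depend on the current policy's rollouts at all. A minimal behaviour-cloning sketch (offline RL proper, e.g. with reward reweighting, builds on the same idea):

    import torch
    import torch.nn.functional as F

    def behavior_cloning_step(policy, optimizer, batch):
        """Offline/imitation-style update: (state, action) pairs come from a
        fixed logged dataset, not from the current policy's own behaviour."""
        states, actions = batch["states"], batch["actions"]
        logits = policy(states)
        loss = F.cross_entropy(logits, actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()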
I'm not sufficiently familiar with the details on ML to assess the proposition made in the article.
From my understanding, RL is a tuning approach on LLMs, so the outcome is still the same kind of beast, albeit with a different parameter set.
So, empirically, I would have thought that the leading companies are already strongly focused on improving coding capabilities, since this is where LLMs are very effective and where they have huge cashflows from token consumption.
So, either the motivation isn't there, or they're already doing something like that, or they know it's not as effective as the approaches they already have.
I wonder which one it is.
> From my understanding, RL is a tuning approach on LLMs,
What you're referring to is actually just one application of RL (RLHF). RL itself is much more than that
Actually, I wasn't referring only to RLHF. Correct me if I am wrong, but my understanding is that RL here is still an LLM tuning approach, i.e. an optimization of its parameter set, no matter whether it's done at scale or via HF.
I like to imagine a world where we embraced nuclear and renewables where energy is free and we could throw gigawatthours at whatever crazy computation we can think of.
I wish these guys would hire me
What a relief, I was horrified this was going to be an atrocious Rocket League update.
Where can I learn about the nitty-gritty of RL and RL training? For instance, I want to understand how, say, software could be used as input (tokenization/vectorization of the code?).
Step 1. Train a VLM to supervise the RL training.
Step 2. Train the RL network. In the meantime, drink coffee or work on your plan for world domination.
My understanding is that this is essentially how RLHF works, and it doesn't scale. As you run RL for longer, the model will learn how to cheat the imperfections of the grader, instead of getting better at the task at hand. Therefore, to scale RL you really need good graders, and determinism is king.
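That's why I'd rather grade against fixed, reproducible checks wherever possible; even something as dumb as exact behavioural match (a sketch, not any particular lab's grader):

    def deterministic_grade(candidate, reference, test_inputs):
        """Deterministic grader: exact behavioural match against the reference
        on fixed inputs, leaving the policy nothing to sweet-talk."""
        passed = sum(int(candidate(x) == reference(x)) for x in test_inputs)
        return passed / len(test_inputs)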
Do you think constitutional approaches would help here? (Verifiable reward for the main score, but then asking the model to self-critique for security and quality.)
You're talking about training an LLM. I'm talking about training robotic/motor skills and haptic feedback.