This mirrors what I experienced when I enrolled in "free drawing" (no teaching) classes:
Though people have considered me a good drawer since I was a child, I remember either just repeating similar detailed drawings I had drawn before, or otherwise just taking plenty of time to draw. I believe anyone with time and patience can make a nice drawing of a scene.
The "free drawing" class had no rules or lectures: you brought the materials you wanted to work with (some brought ink, others pencils, while I brought charcoal). The only thing determined was the timing between poses for the model: for each session the first few poses were very short (say a minute), and then the pose durations would progressively lengthen until say 5 minute poses. At all times you were free to tear your picture up and retry drawing the pose again.
My drawing skills improved considerably. The short "warmups" actually force you to get proportions and outlines correct on the first tries. Conventional wisdom says haste makes waste, but when learning or refining skills, it seems natural selection has hardcoded the sensation of haste as a stressor prompting attention and learning.
I am convinced I could have produced drawings of similar quality before enrolling in those classes, except they would easily have taken me 5 or 10x as long. Being forced not to beat around the bush, and feeling the penalty of a hasty mistake (further decreasing the time left for a second try), does seem to work.
My only gripe is that the technique is termed "consistency", whereas I would reserve such a term for an improvement in performance rather than inference speed, although I understand that they mean "consistency with what would ultimately have been generated one token at a time". I would rather dub it "Proficiency LLM": the same output is expected, only without the inhibition of stuttering to the same conclusion.
DoctorOetker
12 days ago
Hi, we are the CLLM authors. Thanks for sharing your experience and insights! I can see how this drawing-skill refining process echoes the training process in CLLM; the one difference is that, at this point, the stressor in CLLM training does not get progressively more demanding.
For example, while drawing, you can set a very specific time limit on how long you are allowed to draw in each trial and make that limit progressively shorter. In CLLM, maybe we can make the learning process more and more difficult by mapping more and more distant states in the Jacobi trajectory to its final state.
We use the term "consistency" because we draw a parallel between consistency LLMs and consistency models in diffusion image generation, where the training processes are analogous.
snyhlxde
11 days ago
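To make the consistency objective described above concrete, here is a toy numpy sketch (my own illustration, not the authors' code; the random logits and tiny vocabulary are stand-ins): predictions made from an intermediate Jacobi state are pushed toward the trajectory's fixed point with a cross-entropy loss.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, n_tokens = 10, 4

def cross_entropy(logits, targets):
    """Mean cross-entropy of integer `targets` under softmax(logits)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# y*: the converged Jacobi output (a random stand-in here; in CLLM it comes
# from running Jacobi decoding to its fixed point).
fixed_point = rng.integers(0, vocab, size=n_tokens)

# Model logits at some intermediate (not yet converged) Jacobi state.
logits_at_state = rng.normal(size=(n_tokens, vocab))

# Consistency loss: predictions from the intermediate state should already
# match the fixed point. An AR loss on ground-truth text is added alongside
# this to preserve generation quality.
loss_consistency = cross_entropy(logits_at_state, fixed_point)
```

The progressive-difficulty idea from the comment above would correspond to weighting this loss more heavily for states farther from the fixed point.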
Do you use the same dataset to train / eval the model? Was the model in the example trained on the GSM8K dataset, for example?
boroboro4
11 days ago
Yes, we consider both domain-specific applications (Spider for text2SQL, GSM8K for math, CodeSearchNet for Python) as well as open-domain conversational applications (ShareGPT). We use the test set from each application to evaluate CLLMs' performance in our paper.
On the other hand, technically CLLM works on any kind of queries. But the speedup might vary. Feel free to try out our codebase for your use cases!
snyhlxde
11 days ago
Is it just me, or does this read like it was written by an LLM ... ?!
Quarrel
11 days ago
It's just much more formal than people generally speak on HN.
jasonjmcghee
11 days ago
lol I take that as a compliment. Good try but sadly no LLM in this writing :)
snyhlxde
11 days ago
I had an interesting experience in an Invertebrate Zoology lab class one summer.
We students were brought into a lab, given specimens to draw, and the only instructions we received were 'You have 30 minutes to draw this. Go.'
There was no "here's how to draw. here's what to do and not to do". It was just basically "We don't care about any insecurities you might have. We don't care if you think you can't draw. No excuses, just fucking draw it. Now."
Not only did we draw, but we (all of us) improved enormously over the course of the class as more animals were brought in and the exercise was repeated over and over and over again throughout the summer.
What it taught us is that everyone, and I mean everyone, can draw. Our collective attitude shifted from "don't know if this is even possible" to "of course we can do this. this is easy. routine. trivial."
Highly recommended approach.
It was the most freeing and amazing class I had in college.
aamargulies
11 days ago
That sounds like a pretty awesome experience. Thanks for sharing.
Version467
11 days ago
Systems generally become more efficient when under stress. They are also forced into local optima - everything has upsides and downsides.
manmal
12 days ago
Interestingly, this is the idea behind Nassim Taleb's book "Antifragile" and the concept of "anti-fragility".
In essence, it promotes dynamic/evolutionary/always-learning behaviour rather than performing the same set of steps every time, and in the process becoming stronger than before.
An example he shares: how the breakdown of muscle tissue through exercise leads to more muscle development and an increase in strength. I guess it's similar to LLM training using error/loss-reducing functions (practice makes perfect) but dissimilar in the sense that training is a one-time action.
sheepscreek
11 days ago
> They are also forced into local optima
The good ol', "under pressure, you don't rise to the occasion, but sink to the level of your training"?
TeMPOraL
11 days ago
The authors mention that Jacobi decoding is equivalent to greedy autoregressive decoding, but in practice don't we often want the sampling temperature to be above zero to avoid repetitions and excessively generic responses?
I'm completely unfamiliar with this decoding strategy so maybe I'm just missing a simple way to account for that.
miven
12 days ago
Yes, this is a great question! We are actively working on supporting sampling strategies other than greedy sampling. In the context of CLLM training, instead of mapping to a static fixed point obtained from Jacobi decoding as the training objective, we map to what we term a dynamic fixed point. You can keep an eye on our GitHub repo for new progress.
snyhlxde
11 days ago
Agreed. It's straightforward to check that a token was the argmax, but it seems difficult to check that a token appeared with the probability you wanted it to. You could still do the fine-tuning step I guess, where you train the trajectories to approach n-token completions with the statistics you want, but I can't see how you can replace the "check for a fixed point" step. Maybe "check the result was above this fixed threshold for likelihood".
matheist
12 days ago
I feel it's a pretty dangerous optimization before we REALLY understand what's going on inside of the LLM. E.g. people who believe in the geometric interpretation will have something to say, and it would probably hurt if you are using "filler" tokens.
Besides, the assumption (not a universal fact) of "forming complete sentences in mind before articulating word by word" seems to oversimplify the activity happening in our minds: do we really have a complete plan before we start talking/typing? As a Buddhist, I lean towards it being an illusion. Furthermore, what about simultaneous thoughts? Are we linear thinkers at the sentence level?
anyway, pretty neat math!
wangii
11 days ago
The optimization does not affect the result of the LLM; it's guaranteed to produce results equivalent to decoding directly. Let's not treat the LLM as some magic that resembles our mind; it's just another program that produces sentences that happen to make sense.
renonce
11 days ago
> Let's not treat that LLM as some magic that resembles our mind, it's just another program that produces sentences that happens to make sense.
"That happens to make sense" is hiding a lot of magic. It would be statistically impossible to make as much sense as LLMs do in response to prompts if it did not actually make semantic distinctions. If it makes semantic distinctions, then it does resemble the human mind in at least one way.
naasking
11 days ago
According to the original Jacobi decoding paper, it's set in machine translation tasks with an encoder + decoder, in which the parallel algorithm is applied only to the decoder part.
wangii
11 days ago
Let's not treat our mind as something magical. It's just another program that learned to speak by consuming lots of training input. The implementation might look slightly different from the outside, but from a mathematical perspective, artificial neural networks are proven to be at least as capable as the human mind.
sigmoid10
11 days ago
The best part is, your comment works both when sarcastic and completely serious.
baq
11 days ago
> artificial neural networks are proven to be at least as capable as the human mind
Do you have a source for this? I know we have models of neural networks designed to act like neurons, but those aren't what're being used.
ben-schaaf
10 days ago
See the universal approximation theorem for fully connected perceptrons.
sigmoid10
7 days ago
That's really nowhere near enough of a proof. You'd need to prove that a human brain is equivalent to a mathematical function, and that that function can be sufficiently approximated by a NN to be functionally identical.
Additionally UAT doesn't actually prove NNs can approximate any function. Non-continuous functions and infinitely large domains aren't covered.
ben-schaaf
4 days ago
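For reference, the classical statement (a paraphrase from memory, so treat the exact hypotheses as approximate) only guarantees uniform approximation of continuous functions on compact domains, which is the gap noted above:

```latex
% Universal approximation (informal): a one-hidden-layer network with a
% suitable non-polynomial activation can approximate any continuous function
% on a compact domain to any accuracy, but nothing more is promised.
\text{For } f \in C(K),\; K \subset \mathbb{R}^n \text{ compact, and } \varepsilon > 0:
\quad \exists\, g \text{ (one hidden layer) s.t. } \sup_{x \in K} \lvert f(x) - g(x) \rvert < \varepsilon
```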
Define "capable" and most of the confusion and potential controversy goes away.
xpe
10 days ago
That assumption might be useful in this context, but I think it's pretty clearly not true. Ask anyone to tell you about a complex past event with a lot of parallel branches and you'll quickly see them add bits, pieces and tangents midsentence to cover the full range of events. I don't think I've seen the sentence granularity hypothesis in any serious scientific context before.
Etheryte
11 days ago
Can't speak for everyone but I definitely don't mentally form complete sentences before talking. Sometimes I grammatically talk myself into a corner in the middle of a sentence and need to use some awkward words/phrases to finish my thought, or simply pause and restart the phrase from the beginning.
hatthew
11 days ago
I feel surprisingly disconnected from my speaking self, acting as more of an observer, who is sometimes surprised at what I come up with. It just flows. I feel I have very little need for input.
But, I also feel fairly disconnected from my thinking self. I point my attention at something and solutions usually just pop out, maybe with some guidance/context forming required, in the form of internal dialog, which is usually of a rubber ducky style format [1], or mental testing of that mostly spontaneous solution.
I feel the "real" me is the one sensing/observing, which includes the observing of those spontaneous solutions, and what I say.
Are you practicing any meditation? It's regarded as an "awakened" state in some practices! If you have any method, please share it with me! Thanks!
wangii
10 days ago
We don't appear to be forming words sequentially from underlying parts, even though in many languages they are broken down in smaller units that carry semantic meaning themselves. There doesn't seem to be any clear reason for this to break down suddenly at sentence level.
int_19h
11 days ago
What is the geometric interpretation?
causal
11 days ago
Wow, I'm mindblown this isn't getting more attention. This seems like a clear win for inference. Fine tuning cost for this is reasonable (around 0.01% of the original pre-training cost). And the performance wins seem fairly consistent.
alfalfasprout
12 days ago
Yes, seems like a huge important result for LLM performance.
I'm not aware of any other paper that has offered to increase LLM inference performance to this degree. Has there ever been one before?
At least while also:
- Maintaining output quality. The benchmarks used were somewhat narrow but so far so good.
- Improving not just query latency but also global throughput
- Not requiring more compute
- Having a relatively practical implementation and not adding big challenges and complexity
You could argue the insight is incremental, as it builds on what's been done with parallel/Jacobi decoding. Those previous results were necessary and important, but this may be the one that finally extracts real-world value from the promise of parallel decoding.
WhitneyLand
11 days ago
Similar or greater inference wins are achieved with speculative decoding which is already widely used, so while this is really interesting (and was tried before with less success AFAIK), it's not yet clear how impactful it would be.
lopuhin
12 days ago
I don't see where similar wins have ever been achieved.
Speculative decoding can reduce latency, but at the cost of using a lot more compute. The amazing thing here is that the latency and global throughput improvements would be realized because of the increase in efficiency.
From what I understand speculative decoding can also come with more challenges insofar as trying to maintain overall output quality.
WhitneyLand
11 days ago
Thanks for your interest in our work! Yes, we found that training with consistency loss + AR loss on even a subset of a dataset results in a significant speedup (0.01% of pre-training cost). Training on more data permits even further speedup: the model is able to learn from more frequently appearing collocations and phrases.
For more details, please check out our paper, where you can also see that the speedup saturates as the size of the training data grows.
snyhlxde
11 days ago
At first I thought that this was another Medusa-like paper, simply using more unembed heads for guessing subsequent tokens, but damn, not at all. This is amazing. And it doesn't even use extra parameters; it's just an auxiliary training loss.
andy12_
12 days ago
The only similarity between Medusa and CLLM is that both train and adapt LLMs for fast inference. But they use completely different training and decoding techniques, and, as you pointed out, CLLMs don't need extra parameters or an attention mask configured for tree-based verification.
snyhlxde
11 days ago
Interesting
I think soon we are going to realize that we don't really need to train the models.
We just need good indexing and sampling.
Essentially, at some level, any LLM is equivalent to a DB of the dataset with a great NLP interface on top.
Both are just different methods of navigating stored data.
nico
12 days ago
LLMs can easily produce data not in the training dataset.
LLMs do not navigate stored data. An LLM is not a DB of the training data.
tempusalaria
11 days ago
I've had the same thought as above but unfounded (just a feeling, pretty much) so I'm curious to learn more. Do you have any references I can check out that supports these claims?
carlthome
11 days ago
Come up with a novel puzzle that is guaranteed to not be in the training set, and ask GPT-4 to solve it.
int_19h
11 days ago
Controlling for that doesn't seem trivial.
carlthome
8 days ago
But indexing *is* training. It's just not using end-to-end gradient descent.
sdrg822
12 days ago
The models are multiple orders of magnitude smaller than the compressed versions of their training data, so they cannot be the equivalent of a DB of it.
PeterisP
11 days ago
The training data is ideo-semantically compressed? News to me... is it perhaps stored in kanji?
lainga
11 days ago
You might like the Infinigram paper, then. It was discussed recently.
Anyone know somewhere someone dumb like me can "Ask an AI expert"?
I want to ask, for example, how is it that an LLM when given the same prompt does not respond in the same deterministic way?
I guess I want to learn this stuff and should maybe follow one of those "write an LLM in an hour" type videos on YouTube.
JKCalhoun
11 days ago
> how is it that an LLM when given the same prompt does not respond in the same deterministic way?
In software (not in the model), there's literally a random number generator that picks from a weighted set of "next-token" choices that the model spits out. The selection process can have a series of knobs to manipulate the responses. If you want it to be deterministic (and you have direct access to the software), you can set "top-k = 1" or "temperature = 0.0" (depending on your software) and it will be deterministic.
Usually the default settings are not for determinism, because for whatever reason the quality of the results tends to not be that good when you go fully deterministic.
throwawaymaths
11 days ago
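The knobs above can be sketched in a few lines of numpy (illustrative only; real inference stacks implement fancier variants of these samplers): temperature = 0 or top-k = 1 collapses the sampler to argmax, which is what makes the output deterministic.

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=None, rng=None):
    """Pick a next token from raw logits; the knobs mirror common settings."""
    rng = rng or np.random.default_rng()
    if temperature == 0 or top_k == 1:
        return int(np.argmax(logits))            # greedy: fully deterministic
    z = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        cutoff = np.sort(z)[-top_k]
        z = np.where(z >= cutoff, z, -np.inf)    # drop everything below top-k
    p = np.exp(z - z.max())                      # softmax over what's left
    p /= p.sum()
    return int(rng.choice(len(p), p=p))          # the random draw

logits = [2.0, 1.0, 0.5, -1.0]
greedy = sample_next(logits, temperature=0)      # always token 0 (the argmax)
random_pick = sample_next(logits)                # any of the 4, softmax-weighted
```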
For that answer, you can refer to the 3blue1brown videos
The llm model outputs a vector of probabilities for tokens, and the llm user picks a token from the most likely list using a random number
8note
11 days ago
It's because an LLM is essentially a probability matrix. You type a prompt, then it calculates the probability of the next word, and so on, eventually forming a sentence. The probabilities learned are based on the training data.
Because of the underlying probability model, it's not going to be 100% deterministic. Plus, a model like ChatGPT purposefully has a "temperature" parameter that further adds randomisation to the whole process.
My answer is based on this paper if you're interested to read more: The Matrix: A Bayesian learning model for LLMs, https://arxiv.org/abs/2402.03175
zipfcharge
11 days ago
Are there any ways to show the source of the information retrieved by the model? For instance, the LLM forms a sentence and it points to a stackoverflow answer with the same or similar content.
flopriore
11 days ago
As I understand it, I'm pretty sure that is impossible. If it were fed a single datum, sure, trivial. As soon as it is fed a second one, though, the weights are already a kind of blend of the two (so to speak).
JKCalhoun
11 days ago
It's not impossible, but it's definitely difficult. There is some overlap with the methods used to detect benchmark data contamination, though it's not entirely the same thing. For the detection use case, you already know the text you're looking for, and you are just trying to demonstrate that the model has "seen" the data in its training set. The challenge is proving that it is statistically improbable that the model could stochastically generate the same tokens without having seen them during training.
Some great research exists in this area [1] and I expect much of it may be repurposed for black box attribution in the future (in addition to all the work being done in the mechanistic interpretability field)
> I want to ask, for example, how is it that an LLM when given the same prompt does not respond in the same deterministic way?
You can control that in most systems with an inference-time parameter called "temperature". But setting the temperature as low as possible tends to lead to very low-quality answers - the system can't crawl out of some local optimum and ends up repeating itself over and over. Such answers may be "deterministic" but they're also not good.
zozbot234
11 days ago
I found this to be a good start that explains things fairly methodically, but without losing the high-level perspective.
For this particular question, ask chatgpt how temperature affects llm softmax sampling.
For other things, study using Karpathy's videos.
rahimnathwani
11 days ago
> ... speculative decoding methods ... incurs extra memory cost during inference time.
Any detail on this? For speculative decoding you need a smaller model to generate "branches" which are fast but maybe inaccurate, and then verify these branches later with the larger model. However, only memory equivalent to a single token is needed for speculative decoding, and tokens in other branches are simply masked out during inference. With a context size of 1000 and ~30 branches for 5 tokens, the memory overhead would be 3%, which is negligible. And if your context size is much smaller compared to the number of branches - would someone who uses a generative LLM with a context window of just 50 tokens care about generation speed?
Also, speculative decoding techniques are not restricted to greedy sampling - it's expected to behave exactly the same as the original model and sample with the expected probabilities. Most literature on speculative decoding already reports 2.6x-3.5x speedup. The blog post here reports 2.4x-3.4x generation speed - which isn't that much of an upgrade?
While I mentioned speculative decoding above and Medusa2 and Eagle seems to be the techniques that the author compares against, the core problem remains: whatever method you use to predict tokens ahead of time, there is a specific point where the previous tokens are absolutely needed before predicting the next token. It doesn't depend on what your model is or what your techniques are, it's just about what is mathematically achievable. How can you predict 5 tokens at once if the probability distribution of the 5th next token depends heavily on the previous 4 tokens? Speculative decoding, Jacobi decoding, multi-token parallel decoding, whatever.
If only greedy sampling is supported for this, then I wonder what are the advantages of this method, not to mention that other techniques already achieve the expected speedup. Comparing greedy sampling speedups to random sampling speedups is comparing apples to oranges, and I doubt if the speedup described by the method would remain after this method is adapted to random sampling (due to the core problem mentioned above).
renonce
11 days ago
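For readers unfamiliar with the mechanics being compared, here is a toy sketch of speculative decoding's draft-then-verify loop (greedy acceptance only, with plain functions standing in for the draft and target models; real implementations verify against full probability distributions and batch the k verification calls into a single forward pass of the big model):

```python
def speculative_step(prefix, draft_next, target_next, k=5):
    """Draft k tokens with the small model, then check them with the big model.

    draft_next / target_next: fn(list_of_tokens) -> next token, standing in
    for greedy decoding of the draft and target models.
    Returns the tokens accepted this step (always >= 1, so progress is made).
    """
    ctx = list(prefix)
    draft = []
    for _ in range(k):                  # cheap small-model proposals
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    accepted = []
    ctx = list(prefix)
    for t in draft:                     # big model checks each drafted token
        t_big = target_next(ctx)
        if t_big != t:                  # mismatch: keep the big model's token
            accepted.append(t_big)      # and throw the rest of the draft away
            break
        accepted.append(t)
        ctx.append(t)
    return accepted

# Toy "models": the draft guesses last+1; the target agrees except after 3.
draft  = lambda ctx: ctx[-1] + 1
target = lambda ctx: ctx[-1] + 1 if ctx[-1] != 3 else 0
assert speculative_step([1], draft, target) == [2, 3, 0]
```

The memory point in the comment above is about the extra drafted positions (and the draft model itself) that must live alongside the normal KV cache during verification.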
> the previous tokens are absolutely needed before predicting the next token
Maybe this is the key contribution of this paper: demonstrating that, through consistency training, LLMs can predict the next n tokens even if there are incorrect guesses among the previous tokens?
On the other hand, while mathematically it is true that p(x_t | x_1, ..., x_{t-1}) depends on all of x_1 to x_{t-1}, in practice it is possible that predicting x_t only really requires x_1 to x_{t-2}, with minimal attention paid to x_{t-1}. Thus, predicting x_t from x_1 to x_{t-2} and an inaccurate x_{t-1} is possible.
cxczz
11 days ago
Speculative decoding requires you to load the smaller model into memory and run inference on it.
Palmik
11 days ago
I think the smaller model is at least 20 times smaller. If you do speculative decoding on a 70B model, a 1B model would be appropriate.
renonce
11 days ago
There's no free lunch™, so from what I can tell there's some pathway loss here. E.g. some Jacobi trajectories definitionally exclude higher-temperature paths. Which might actually be a positive for data retrieval (but a negative if we want to maximize for creativity?).
dvt
12 days ago
There are better and worse algorithms. I'm not sure "there is no free lunch" always applies in a particularly meaningful way. Some things aren't on the pareto frontier.
wrsh07
11 days ago
Kinda like the AIFF -> MP3 conversion process. A lot of data is lost, but can we humans really tell much of a difference?
factormeta
11 days ago
There's no reason to think the current next token prediction models are optimal for predicting sentences (they aren't!)
> An algorithm may outperform another on a problem when neither is specialized to the problem
I would go even further and say there isn't any indication that we are even close to what is possible. My subjective feeling is that with the current rate of progress it is entirely possible that we will have GPT-4 level performance locally on smartphone hardware within 3-10 years (unless companies decide again that they don't want to give this kind of power away)
stkdump
11 days ago
Probably. Advancements in ML algorithms, like this one, have been outpacing advancements in hardware for a while now, so both are converging on making ML faster and ubiquitous.
naasking
11 days ago
Interesting stuff. I guess the idea has occurred to many, but it was well written and presented here.
toxik
12 days ago
Yep. My roommate and I were talking about this a year ago. You can also do something similar for LLM steering.
programjames
11 days ago
> Our research shows this process - mimicking human cognitive process of forming complete sentences in mind before articulating word by word
This is not how I work. Is there something wrong with me?
doctor_eval
12 days ago
Nor is it how I work; I think that's normal enough. I do have an idea of what I'm going to say before I say it; I think that's closer to what they meant. I think and speak in increments of ideas, not words.
jerbear4328
12 days ago
> I think and speak in increments of ideas
Extremely common among (but not unique to) people with ASD; those "increments of ideas" are called "gestalts".
In some conversations, maybe it's easier to form complete sentences. In others, the best we can do is have a rough draft in mind of what to say, and then refine it word by word while speaking.
snyhlxde
11 days ago
You might not have an internal monologue. A lot of us don't, and the ones who do are equally shocked every time they find out. For what it's worth, I'm in the same boat: I can form sentences, but why would I? It'd slow me down.
People who don't have inner monologues tend to assume that all that stuff is some form of analogy or metaphor. It's not. It's entirely literal.
Filligree
12 days ago
Do you mean in a real time conversation?
Because I definitely don't "have an internal monologue about what I'm going to say" in the 100ms between when someone asks a casual question and when I respond to it.
oceanplexian
12 days ago
Yes, it is possible to maintain an internal monologue in real time conversation. That is one of the reasons why some people usually take longer than 100ms to respond.
int_19h
11 days ago
Are you sure? It might not be the whole sentence, but I would find it hard to believe that in practice the way you speak or write is like
hello
<think>
May
<think>
be
<think>
I'll
<think>
go
<think>
get
<think>
break
<think>
fast
throwawaymaths
11 days ago
They probably do not mean people form entire sentences before expressing them; I am not aware of anybody doing that. I assume it refers to people first coming up with a global outline of what they want to say before they start speaking.
DrSiemer
12 days ago
"Rem tene, verba sequentur" (you hold the matter, then words come) is largely "how it works".
You form logical ideas as you speak, as you speak your speech develops, so the translation is from ideas to sentences. It is not clear in which phase one would mentally form a complete sentence, nor why it should be relevant. You "see something [that makes sense]", then you describe it - iteratively.
mdp2021
12 days ago
You are probably pretty far from the LLM extreme, though, of thinking one token at a time.
causal
11 days ago
Probably.
giardini
12 days ago
> Surprisingly, we find such an objective is analogous to that of consistency models
This is why numerical methods should be part of the ML curriculum.
programjames
11 days ago
Can't wait to see something like this merged into ollama (I'm sure there would be plenty of people fine-tuning models for it).
rcarmo
12 days ago
Ollama doesn't have their own inference engine, they just wrap llama.cpp. But yes, it will be awesome when it's more generally available.
Me1000
12 days ago
The lab is tied to the vLLM project. I would say it might get picked up sooner by vLLM than other inference frameworks.
helloericsf
12 days ago
from CLLM authors:
Thank you guys for the great questions and insights! We have made a Twitter post with some more details, and we invite you to engage with us on Twitter as well.
Is this how Groq (https://groq.com/) is so fast, or are they doing something different?
paulclark
12 days ago
Groq is serving an LLM from (hundreds of chips' worth of) SRAM, so the effective bandwidth, and thus token generation speed, is an order of magnitude higher than HBM. This would 3.5x their speed as well; it is orthogonal.
buildbot
12 days ago
I'm surprised no one has done this for a GPU cluster yet - we used to do this for RNNs on GPUs & FPGAs at Baidu.
My understanding is that theirs is a pure hardware solution. The hardware is flexible enough to model any current NN architecture.
(Incidentally, there are black box optimization algorithms, so a system as good as grok at inference might be useful for training even if it can't support gradient descent)
wrsh07
11 days ago
According to someone I talked to at a Groq event I was invited to (I did not sign an NDA), they are putting ~8 racks of hardware per LLM. Of course, coordinating those racks to have exact timings between them to pull tokens through is definitely "part of the hard part".
throwawaymaths
11 days ago
They could quickly try it with one of the open-source models, then show a side-by-side demo.
m3kw9
12 days ago
Could someone please explain the intuition behind this technique in more layman's terms?
ec109685
12 days ago
For all of these "how can we batch-predict the next n tokens?" methods, the intuition is basically that it takes a buttload of math to predict some of the tokens, but most tokens are actually easy to guess. For example, if I asked "What was that phone number from that 80's song?", as soon as a model generates 867- it shouldn't take that much math at all to finish predicting 5309.
TomatoCo
12 days ago
A bit more intuition on how training works: in natural language processing, some phrases/collocations, for example "remind ... of ...", "make a decision", "learn a skill", etc., are used together. We can ask LLMs to learn such collocations & frequently appearing n-grams. After learning, the model can use parallel decoding to predict many tokens that frequently appear together in one forward pass.
snyhlxde
11 days ago
[deleted]
12 days ago
"Try to fix all the words in a sentence at once. Keep iterating until you don't think it needs fixing."
programjames
11 days ago
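That one-liner can be sketched directly (with a toy deterministic stand-in for the model, not a real LLM): guess all n tokens at once, re-predict every position in parallel from the current guess, and stop at the first fixed point, which matches what greedy one-token-at-a-time decoding produces.

```python
def greedy_next(context):
    return sum(context) % 7          # deterministic toy "model"

def jacobi_decode(prompt, n):
    """Re-predict every position in parallel (each position sees only the
    prompt plus the current guesses to its left); stop at a fixed point."""
    guess = [0] * n
    while True:
        new = [greedy_next(prompt + guess[:i]) for i in range(n)]
        if new == guess:             # nothing changed: no more "fixing" needed
            return guess
        guess = new

def autoregressive_decode(prompt, n):
    """Standard greedy decoding, one token at a time."""
    out = []
    for _ in range(n):
        out.append(greedy_next(prompt + out))
    return out

# The fixed point of the parallel iteration equals the greedy AR output.
assert jacobi_decode([3, 1], 5) == autoregressive_decode([3, 1], 5)
```

The win comes when the parallel loop converges in far fewer iterations than n; CLLM's training is what pushes convergence toward that regime.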
Would something like this apply to MAMBA/JAMBA too?
fermuch
12 days ago
I think any next-token predictor will benefit. IIUC, Mamba is a next-token predictor.
I just skimmed the Gradient article, but if their only change is swapping out the transformer block for the Mamba block, I don't think it's already using this optimization.
This mirrors what I experienced when I enrolled in "free drawing" (no teaching) classes:
While people considered me a good drawer since I was a child, I remember just repeating either similar detailed drawings I drew before, or otherwise just taking plenty of time to draw. I believe anyone with time and patience can make a nice drawing of a scene.
The "free drawing" class had no rules or lectures: you brought the materials you wanted to work with (some brought ink, others pencils, while I brought charcoal). The only thing determined was the timing between poses for the model: for each session the first few poses were very short (say a minute), and then the pose durations would progressively lengthen until say 5 minute poses. At all times you were free to tear your picture up and retry drawing the pose again.
My drawing skills improved considerably. The short "warmups" actually force you to get proportions and outlines correct on the first tries. Conventional wisdom says haste makes waste, but when learning or refining skills, it seems natural selection has hardcoded the sensation of haste as a stressor prompting attention and learning.
I am convinced I could have drawn similar quality drawings before enrolling in those classes, except they would have taken me easily 5 or 10 x as long to draw. Being forced not to beat around the bush and feeling the penalty of making a hasty mistake (further decreasing time left for the second try in the remaining time) does seem to work.
My only gripe is that the technique is termed "Consistency" whereas I would reserve such a term for an improvement in performance not inference speed, although I understand that they indicate "consistency with what would ultimately have been generated one token at a time". I would rather dub it "Proficiency LLM", where the same output is expected, only without the inhibition of stuttering to the same conclusion.
DoctorOetker
12 days ago
Hi, we are the CLLM authors, and thanks for sharing your experience and insights! I can see how this drawing-skill refining process echoes the training process in CLLM; the one difference is that, at this point, the stressor in CLLM training does not get progressively more demanding.
For example, while drawing, you can set a very specific time limit on how long you are allowed to draw in each trial and make the time progressively shorter. In CLLM, maybe we can make the learning process more and more difficult by mapping more and more distant states in the Jacobi trajectory to its final state.
We are using the term "consistency" because we draw a parallel between consistency LLMs and the consistency model in diffusion image generation, where the training processes are analogous.
snyhlxde
11 days ago
Do you use the same dataset to train / eval the model? Was the model used in the example trained on the GSM8K dataset?
boroboro4
11 days ago
Yes, we consider both domain-specific applications (Spider for text2SQL, GSM8K for math, CodeSearchNet for Python) as well as open-domain conversational applications (ShareGPT). We use the test set from each application to evaluate CLLMs' performance in our paper.
On the other hand, technically CLLM works on any kind of queries. But the speedup might vary. Feel free to try out our codebase for your use cases!
snyhlxde
11 days ago
Is it just me, or does this read like it was written by an LLM ... ?!
Quarrel
11 days ago
It's just much more formal than people generally speak on HN.
jasonjmcghee
11 days ago
lol I take that as a compliment. Good try but sadly no LLM in this writing :)
snyhlxde
11 days ago
I had an interesting experience in an Invertebrate Zoology lab class one summer.
We students were brought into a lab, given specimens to draw, and the only instructions we received were 'You have 30 minutes to draw this. Go.'
There was no "here's how to draw. here's what to do and not to do". It was just basically "We don't care about any insecurities you might have. We don't care if you think you can't draw. No excuses, just fucking draw it. Now."
Not only did we draw, but we (all of us) improved enormously over the course of the class as more animals were brought in and the exercise was repeated over and over and over again throughout the summer.
What it taught us is that everyone, and I mean everyone, can draw. Our collective attitude shifted from "don't know if this is even possible" to "of course we can do this. this is easy. routine. trivial."
Highly recommended approach.
It was the most freeing and amazing class I had in college.
aamargulies
11 days ago
That sounds like a pretty awesome experience. Thanks for sharing.
Version467
11 days ago
Systems generally become more efficient when under stress. They are also forced into local optima - everything has upsides and downsides.
manmal
12 days ago
Interestingly - this is the idea behind Nassim Taleb's book "Antifragile" and the concept of "anti-fragility".
In essence, it promotes dynamic/evolutionary/always-learning behaviour rather than performing the same set of steps every time, and in the process, becoming stronger than before.
An example he shares is how the breakdown of muscle tissue through exercise leads to more muscle development and an increase in strength. I guess it's similar to LLM training using error/loss-reducing functions (practice makes perfect) but dissimilar in the sense that training is a one-time action.
sheepscreek
11 days ago
> They are also forced into local optima
The good ol' "under pressure, you don't rise to the occasion, but sink to the level of your training"?
TeMPOraL
11 days ago
The authors mention that Jacobi decoding is equivalent to greedy autoregressive decoding, but in practice don't we often want the sampling temperature to be above zero to avoid repetitions and excessively generic responses?
I'm completely unfamiliar with this decoding strategy so maybe I'm just missing a simple way to account for that.
miven
12 days ago
Yes, this is a great question! We are actively working on supporting sampling strategies other than greedy sampling. In the context of CLLM training, instead of mapping to a static fixed point obtained from Jacobi decoding as the training objective, we term it a dynamic fixed point. You can keep an eye on our GitHub repo for new progress.
snyhlxde
11 days ago
Agreed. It's straightforward to check that a token was the argmax, but it seems difficult to check that a token appeared with the probability you wanted it to. You could still do the fine-tuning step I guess, where you train the trajectories to approach n-token completions with the statistics you want, but I can't see how you can replace the "check for a fixed point" step. Maybe "check the result was above this fixed threshold for likelihood".
matheist
12 days ago
I feel it's a pretty dangerous optimization before we REALLY understand what's going on inside of the LLM. E.g. those who believe in the geometric interpretation will have something to say, and it would probably hurt if you are using "filler" tokens.
Besides, the assumption (not a universal fact) that we "form complete sentences in mind before articulating word by word" seems to oversimplify what happens in our minds: do we really have complete planning before we start talking/typing? As a Buddhist, I lean towards it being an illusion. Furthermore, what about simultaneous thoughts? Are we linear thinkers at the sentence level?
anyway, pretty neat math!
wangii
11 days ago
The optimization does not affect the result of the LLM; it's guaranteed to produce the same results as decoding directly. Let's not treat that LLM as some magic that resembles our mind, it's just another program that produces sentences that happens to make sense.
renonce
11 days ago
> Let's not treat that LLM as some magic that resembles our mind, it's just another program that produces sentences that happens to make sense.
"That happen to make sense" is hiding a lot of magic. It would be statistically impossible to make as much sense as LLMs do in response to prompts if it did not actually make semantic distinctions. If it makes semantic distinctions, then it does resemble the human mind in at least one way.
naasking
11 days ago
According to the original Jacobi decoding paper, it was set in machine translation tasks, with an encoder + decoder, in which the parallel algorithm was applied only to the decoder part.
wangii
11 days ago
Lets not treat our mind as something magical. It's just another program that learned to speak by consuming lots of training input. The implementation might look slightly different from the outside, but from a mathematical perspective, artificial neural networks are proven to be at least as capable as the human mind.
sigmoid10
11 days ago
The best part is, your comment works both when sarcastic and completely serious.
baq
11 days ago
> artificial neural networks are proven to be at least as capable as the human mind
Do you have a source for this? I know we have models of neural networks designed to act like neurons, but those aren't what're being used.
ben-schaaf
10 days ago
See the universal approximation theorem for fully connected perceptrons.
sigmoid10
7 days ago
That's really nowhere near enough of a proof. You'd need to prove that a human brain is equivalent to a mathematical function, and that that function can be sufficiently approximated by a NN to be functionally identical.
Additionally UAT doesn't actually prove NNs can approximate any function. Non-continuous functions and infinitely large domains aren't covered.
ben-schaaf
4 days ago
Define ācapableā and most of the confusion and potential controversy goes away.
xpe
10 days ago
That assumption might be useful in this context, but I think it's pretty clearly not true. Ask anyone to tell you about a complex past event with a lot of parallel branches and you'll quickly see them add bits, pieces and tangents midsentence to cover the full range of events. I don't think I've seen the sentence granularity hypothesis in any serious scientific context before.
Etheryte
11 days ago
Can't speak for everyone but I definitely don't mentally form complete sentences before talking. Sometimes I grammatically talk myself into a corner in the middle of a sentence and need to use some awkward words/phrases to finish my thought, or simply pause and restart the phrase from the beginning.
hatthew
11 days ago
I feel surprisingly disconnected from my speaking self, acting as more of an observer, who is sometimes surprised at what I come up with. It just flows. I feel I have very little need for input.
But, I also feel fairly disconnected from my thinking self. I point my attention at something and solutions usually just pop out, maybe with some guidance/context forming required, in the form of internal dialog, which is usually of a rubber ducky style format [1], or mental testing of that mostly spontaneous solution.
I feel the "real" me is the one sensing/observing, which includes the observing of those spontaneous solutions, and what I say.
[1] Works with any problem space, not just coding "debugging": https://rubberduckdebugging.com/
nomel
10 days ago
Are you practicing any meditation? It's regarded as an "awakened" state in some practices! If you have any method, please share with me! Thanks!
wangii
10 days ago
We don't appear to be forming words sequentially from underlying parts, even though in many languages they are broken down into smaller units that carry semantic meaning themselves. There doesn't seem to be any clear reason for this to break down suddenly at the sentence level.
int_19h
11 days ago
What is the geometric interpretation?
causal
11 days ago
Wow, I'm mindblown this isn't getting more attention. This seems like a clear win for inference. Fine tuning cost for this is reasonable (around 0.01% of the original pre-training cost). And the performance wins seem fairly consistent.
alfalfasprout
12 days ago
Yes, seems like a huge important result for LLM performance.
I'm not aware of any other paper that has offered to increase LLM inference performance to this degree. Has there ever been one before?
At least while also:
- Maintaining output quality. The benchmarks used were somewhat narrow but so far so good.
- Improving not just query latency but also global throughput
- Not requiring more compute
- Having a relatively practical implementation and not adding big challenges and complexity
You could argue the insight is incremental, as it builds on what's been done with parallel/Jacobi decoding. Those previous results were necessary and important, but this may be the one that finally extracts real-world value from the promise of parallel decoding.
WhitneyLand
11 days ago
Similar or greater inference wins are achieved with speculative decoding which is already widely used, so while this is really interesting (and was tried before with less success AFAIK), it's not yet clear how impactful it would be.
lopuhin
12 days ago
I don't see where similar wins have ever been achieved.
Speculative decoding can reduce latency, but at the cost of using a lot more compute. The amazing thing here is latency and global throughput improvements would be realized because of the increase in efficiency.
From what I understand speculative decoding can also come with more challenges insofar as trying to maintain overall output quality.
WhitneyLand
11 days ago
Thanks for your interest in our work! Yes, we found that training with consistency loss + AR loss on even a subset of a dataset results in significant speedup (0.01% of pre-training cost). Training on more data permits even further speedup: the model is able to learn from more frequently appearing collocations and phrases.
For more details, please check out our paper and you can also see speedup saturates as the size of training data grows.
snyhlxde
11 days ago
At first I thought that this was another Medusa-like paper, simply using more unembed heads for guessing subsequent tokens, but damn, not at all. This is amazing. And it doesn't even use extra parameters, it's just an auxiliary training loss.
andy12_
12 days ago
The only similarity between Medusa and CLLM is that both train and adapt LLMs for fast inference. But they use completely different training and decoding techniques, and as you pointed out, CLLMs don't need extra parameters or configuring an attention mask for tree-based verification.
snyhlxde
11 days ago
Interesting
I think soon we are going to realize that we don't really need to train the models
We just need good indexing and sampling
Essentially at some level any LLM is equivalent to a DB of the dataset, with a great NLP interface on top
Both are just different methods of navigating stored data
nico
12 days ago
LLMs can easily produce data not in training dataset.
LLMs do not navigate stored data. An LLM is not a DB of the training data.
tempusalaria
11 days ago
I've had the same thought as above but unfounded (just a feeling, pretty much) so I'm curious to learn more. Do you have any references I can check out that supports these claims?
carlthome
11 days ago
Come up with a novel puzzle that is guaranteed to not be in the training set, and ask GPT-4 to solve it.
int_19h
11 days ago
Controlling for that doesn't seem trivial.
carlthome
8 days ago
But indexing *is* training. It's just not using end-to-end gradient descent.
sdrg822
12 days ago
The models are multiple orders of magnitude smaller than the compressed versions of their training data; they cannot be the equivalent of a DB of it.
PeterisP
11 days ago
The training data is ideo-semantically compressed? News to me... is it perhaps stored in kanji?
lainga
11 days ago
You might like the Infinigram paper then. It was discussed recently:
https://news.ycombinator.com/item?id=40266791
nsagent
12 days ago
[flagged]
JoannaWongs
11 days ago
Anyone know somewhere someone dumb like me can "Ask an AI expert"?
I want to ask, for example, how is it that an LLM when given the same prompt does not respond in the same deterministic way?
I guess I want to learn this stuff and should maybe follow one of those "write an LLM in an hour" type videos on YouTube.
JKCalhoun
11 days ago
> how is it that an LLM when given the same prompt does not respond in the same deterministic way?
In software (not in the model) there's literally a random number generator that picks from a weighted set of "next-token" choices that the model spits out. The selection process can have a series of knobs to manipulate the responses. If you want it to be deterministic (and you have direct access to the software) you can tell it to set "top-k = 1" or "temperature = 0.0" (depending on your software) and it will be deterministic.
Usually the default settings are not for determinism, because for whatever reason the quality of the results tends to not be that good when you go fully deterministic.
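To make that concrete, here's a toy sketch of that selection step (the function name and shape are made up for illustration, not any real library's API):

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Pick a token id from raw logits, the way a decoding loop might."""
    if temperature == 0.0 or top_k == 1:
        # Fully deterministic: always take the argmax token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Scale logits by temperature, then softmax into probabilities.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Optionally keep only the k most likely tokens.
    candidates = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k is not None:
        candidates = candidates[:top_k]
    # Draw from the (renormalized) distribution over the candidates.
    mass = sum(probs[i] for i in candidates)
    r = rng.random() * mass
    for i in candidates:
        r -= probs[i]
        if r <= 0:
            return i
    return candidates[-1]
```

With temperature = 0.0 (or top-k = 1) the random draw never happens, which is why those settings make the output reproducible.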
throwawaymaths
11 days ago
For that answer, you can refer to the 3blue1brown videos
The LLM outputs a vector of probabilities for tokens, and the LLM user picks a token from the most-likely list using a random number
8note
11 days ago
It's because an LLM is essentially a probability matrix. You type a prompt, then it calculates the probability of each possible next word, and so on, eventually forming a sentence. The probabilities learned are based on the training data.
Because of the underlying probability model, it's not going to be 100% deterministic. Plus, a model like ChatGPT purposefully has a "temperature" parameter that will further add randomisation to the whole process.
My answer is based on this paper if you're interested to read more: The Matrix: A Bayesian learning model for LLMs, https://arxiv.org/abs/2402.03175
zipfcharge
11 days ago
Are there any ways to show the source of the information retrieved by the model? For instance, the LLM forms a sentence and it points to a stackoverflow answer with the same or similar content.
flopriore
11 days ago
As I understand it, pretty sure that is impossible. If it were fed a single datum, sure, trivial. As soon as it is fed a second one, though, the weights are already a kind of blend of the two (so to speak).
JKCalhoun
11 days ago
It's not impossible, but it's definitely difficult. There is some overlap with the methods used to detect benchmark data contamination, though it's not entirely the same thing. For the detection use case, you already know the text you're looking for and you are just trying to demonstrate that the model has "seen" the data in its training set. The challenge is proving that it is statistically improbable that the model could stochastically generate the same tokens without having seen them during training.
Some great research exists in this area [1] and I expect much of it may be repurposed for black box attribution in the future (in addition to all the work being done in the mechanistic interpretability field)
[1] https://arxiv.org/abs/2311.04850
spmurrayzzz
11 days ago
> I want to ask, for example, how is it that an LLM when given the same prompt does not respond in the same deterministic way?
You can control that in most systems with an inference-set parameter called "temperature". But setting the temperature as low as possible tends to lead to very low-quality answers - the system can't crawl out of some local optimum and ends up repeating itself over and over. Such answers may be "deterministic" but they're also not good.
zozbot234
11 days ago
I found this to be a good start that explains things fairly methodically, but without losing the high-level perspective.
https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...
int_19h
11 days ago
For this particular question, ask chatgpt how temperature affects llm softmax sampling.
For other things, study using Karpathy's videos.
rahimnathwani
11 days ago
> ... speculative decoding methods ... incurs extra memory cost during inference time.
Any detail on this? For speculative decoding you need a smaller model to generate "branches" which are fast but maybe inaccurate, and then verify these branches later with the larger model. However, only memory equivalent to a single token is needed per speculated position, and tokens in other branches are simply masked out during inference. With a context size of 1000 and ~30 branches of 5 tokens, the memory overhead would be 3%, which is negligible. And if your context size is much smaller compared to the number of branches - would someone who uses a generative LLM with a context window of just 50 tokens care about generation speed?
Also, speculative decoding techniques are not restricted to greedy sampling - it's expected to behave exactly the same as the original model and sample with the expected probabilities. Most literature on speculative decoding already reports 2.6x-3.5x speedup. The blog post here reports 2.4x-3.4x generation speed - which isn't that much of an upgrade?
While I mentioned speculative decoding above and Medusa2 and Eagle seems to be the techniques that the author compares against, the core problem remains: whatever method you use to predict tokens ahead of time, there is a specific point where the previous tokens are absolutely needed before predicting the next token. It doesn't depend on what your model is or what your techniques are, it's just about what is mathematically achievable. How can you predict 5 tokens at once if the probability distribution of the 5th next token depends heavily on the previous 4 tokens? Speculative decoding, Jacobi decoding, multi-token parallel decoding, whatever.
If only greedy sampling is supported for this, then I wonder what are the advantages of this method, not to mention that other techniques already achieve the expected speedup. Comparing greedy sampling speedups to random sampling speedups is comparing apples to oranges, and I doubt if the speedup described by the method would remain after this method is adapted to random sampling (due to the core problem mentioned above).
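For reference, the accept/reject loop in speculative decoding can be sketched like this, in the greedy-only case (the two model callbacks are stand-ins; real implementations verify against the full sampled distribution, not just the argmax):

```python
def speculative_step(draft_next, target_greedy_batch, prefix, k=5):
    """One round of (greedy) speculative decoding.

    draft_next(seq) -> the draft model's next-token guess.
    target_greedy_batch(prefix, guesses) -> the target model's greedy
        choice at each of the k+1 positions, computed in one parallel pass.
    Returns the tokens actually accepted this round.
    """
    # 1. The small draft model proposes k tokens autoregressively (cheap).
    guesses = []
    seq = list(prefix)
    for _ in range(k):
        t = draft_next(seq)
        guesses.append(t)
        seq.append(t)
    # 2. The large target model checks all positions in a single forward pass.
    target = target_greedy_batch(prefix, guesses)  # length k + 1
    # 3. Accept the longest prefix where draft and target agree; on the
    #    first mismatch, keep the target's correction and stop.
    accepted = []
    for i, g in enumerate(guesses):
        if g == target[i]:
            accepted.append(g)
        else:
            accepted.append(target[i])
            return accepted
    accepted.append(target[k])  # bonus token when everything matched
    return accepted
```

The output is identical to running the target model greedily one token at a time; the speedup comes from how many draft tokens survive step 3 on average.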
renonce
11 days ago
> the previous tokens are absolutely needed before predicting the next token
Maybe this is the key contribution of this paper: demonstrating that LLMs can predict the next n tokens even if there are incorrect guesses in previous tokens, through consistency training?
On the other hand, while mathematically it is true that p(x_t|x_1,...,x_t-1) depends on all x_1 to x_t-1, in practice, it is possible that predicting x_t only requires x_1 to x_t-2, and the attention to x_t-1 is minimal. Thus, predicting x_t with x_1 to x_t-2 and inaccurate x_t-1 is possible.
cxczz
11 days ago
Speculative decoding requires you to load the smaller model into memory and run inference on it.
Palmik
11 days ago
I think the smaller model is at least 20 times smaller. If you do speculative decoding on a 70B model, a 1B model would be appropriate.
renonce
11 days ago
There's no free lunch™, so from what I can tell there's some pathway loss here. E.g. some Jacobi trajectories definitionally exclude higher-temperature paths. Which might actually be a positive given data retrieval (but a negative if we want to maximize for creativity?).
dvt
12 days ago
There are better and worse algorithms. I'm not sure "there is no free lunch" always applies in a particularly meaningful way. Some things aren't on the pareto frontier.
wrsh07
11 days ago
Kinda like the AIFF -> MP3 conversion process. A lot of data is lost, but can we humans really tell much of a difference?
factormeta
11 days ago
There's no reason to think the current next token prediction models are optimal for predicting sentences (they aren't!)
> An algorithm may outperform another on a problem when neither is specialized to the problem
https://en.m.wikipedia.org/wiki/No_free_lunch_in_search_and_...
wrsh07
11 days ago
I would go even further and say there isn't any indication that we are even close to what is possible. My subjective feeling is that with the current rate of progress it is entirely possible that we will have GPT-4 level performance locally on smartphone hardware within 3-10 years (unless companies decide again that they don't want to give this kind of power away)
stkdump
11 days ago
Probably. Advancements in ML algorithms, like this one, have been outpacing advancements in hardware for a while now, so both are converging on making ML faster and ubiquitous.
naasking
11 days ago
Interesting stuff. I guess the idea has occurred to many but was well written and presented.
toxik
12 days ago
Yep. My roommate and I were talking about this a year ago. You can also do something similar for LLM steering.
programjames
11 days ago
> Our research shows this process - mimicking human cognitive process of forming complete sentences in mind before articulating word by word
This is not how I work. Is there something wrong with me?
doctor_eval
12 days ago
Nor is it how I work, I think that's normal enough. I do have an idea of what I'm going to say before I say it, I think that's closer to what they meant. I think and speak in increments of ideas, not words.
jerbear4328
12 days ago
> I think and speak in increments of ideas
extremely common among (but not unique to) people with ASD, those "increments of ideas" are called "gestalts".
https://kidtherapy.org/helpful-articles/what-is-gestalt-lang...
paulmd
12 days ago
In some conversations, maybe it's easier to form complete sentences. In some others, the best we can do is: have a rough draft about what to say in mind and then refine it word by word while speaking.
snyhlxde
11 days ago
You might not have an internal monologue. A lot of us don't, and the ones that do are equally shocked every time they find out. For what it's worth, I'm in the same boat - can form sentences, but why would I? It'd slow me down.
People who don't have inner monologues tend to assume that all that stuff is some form of analogy or metaphor. It's not. It's entirely literal.
Filligree
12 days ago
Do you mean in a real time conversation?
Because I definitely don't "have an internal monologue about what I'm going to say" in the 100ms between when someone asks a casual question and I respond to it.
oceanplexian
12 days ago
Yes, it is possible to maintain an internal monologue in real time conversation. That is one of the reasons why some people usually take longer than 100ms to respond.
int_19h
11 days ago
Are you sure? It might not be the whole sentence, but I would find it hard to believe that in practice the way you speak or write is like
hello <think> May <think> be <think> I'll <think> go <think> get <think> break <think> fast
throwawaymaths
11 days ago
They probably do not mean people form entire sentences before expressing them, I am not aware of anybody doing that. I assume it refers to people first coming up with a global outline of what they want to say before they start speaking.
DrSiemer
12 days ago
"Rem tene, verba sequentur" (you hold the matter, then words come) is largely "how it works".
You form logical ideas as you speak, as you speak your speech develops, so the translation is from ideas to sentences. It is not clear in which phase one would mentally form a complete sentence, nor why it should be relevant. You "see something [that makes sense]", then you describe it - iteratively.
mdp2021
12 days ago
You are probably pretty far from the LLM extreme, though, of thinking one token at a time.
causal
11 days ago
Probably.
giardini
12 days ago
> Surprisingly, we find such an objective is analogous to that of consistency models
This is why numerical methods should be part of the ML curriculum.
programjames
11 days ago
Can't wait to see something like this merged into ollama (I'm sure there would be plenty of people fine-tuning models for it).
rcarmo
12 days ago
Ollama doesn't have their own inference engine, they just wrap llama.cpp. But yes, it will be awesome when it's more generally available.
Me1000
12 days ago
The lab is tied to the vLLM project. I would say it might get picked up sooner by vLLM than other inference frameworks.
helloericsf
12 days ago
from CLLM authors:
Thank you guys for the great questions and insights! We have made a Twitter post with some more details and we invite you to engage with us on Twitter as well.
https://twitter.com/haoailab/status/1788269848788869299
snyhlxde
11 days ago
Is this how Groq (https://groq.com/) is so fast, or are they doing something different?
paulclark
12 days ago
Groq is serving an LLM from (100s of chips' worth of) SRAM, so the effective bandwidth, and thus token generation speed, is an order of magnitude higher than HBM. This would 3.5x their speed as well; it is orthogonal.
buildbot
12 days ago
I'm surprised no one has done this for a GPU cluster yet - we used to do this for RNNs on GPUs & FPGAs at Baidu:
https://proceedings.mlr.press/v48/diamos16.pdf
Or better yet - on Cerebras
Kudos to groq for writing that kernel
gdiamos
11 days ago
My understanding is that theirs is a pure hardware solution. The hardware is flexible enough to model any current NN architecture.
(Incidentally, there are black-box optimization algorithms, so a system as good as Groq at inference might be useful for training even if it can't support gradient descent)
wrsh07
11 days ago
According to someone I talked to at a Groq event I was invited to (I did not sign an NDA), they are putting ~8 racks of hardware per LLM. Of course, coordinating those racks to have exact timings between them to pull tokens through is definitely "part of the hard part".
throwawaymaths
11 days ago
They can quickly try with one of the open source models, then show a side by side demo
m3kw9
12 days ago
Could someone please explain the intuition around this technique in more layman's terms?
ec109685
12 days ago
For all of these "how can we batch predicting the next n tokens?" the intuition is basically that it takes a buttload of math to predict some of the tokens, but that most tokens are actually easy to guess. For example, if I asked "What was that phone number from that 80's song?" as soon as a model generates 867- it shouldn't take that much math at all to finish predicting 5309.
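The "guess everything, then check in parallel" loop behind Jacobi decoding can be sketched like this (toy code: a real model would score all n positions in one batched forward pass rather than a Python loop):

```python
def jacobi_decode(greedy_next, prompt, n, pad=0, max_iters=None):
    """Fixed-point ("Jacobi") decoding of n tokens at once.

    greedy_next(seq) -> the argmax next token for a sequence.
    Starts from an arbitrary guess for all n positions, then repeatedly
    re-predicts every position given the current guesses before it,
    until the guesses stop changing (the fixed point).
    """
    guess = [pad] * n  # arbitrary initial guess for all n tokens
    for _ in range(max_iters or n):
        # Refresh every position in "parallel" from the current guesses.
        new = [greedy_next(prompt + guess[:i]) for i in range(n)]
        if new == guess:  # fixed point: identical to greedy AR decoding
            break
        guess = new
    return guess
```

Each iteration fixes at least the first still-wrong position (its inputs are already final), so this converges in at most n iterations, and often far fewer when later tokens are easy to guess, which is where the speedup comes from.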
TomatoCo
12 days ago
A bit more intuition on how training works: in natural language processing, some phrases/collocations, for example "remind ... of ...", "make a decision", "learn a skill" etc., are used together. We can ask LLMs to learn such collocations & frequently appearing n-grams. After learning, the model can use parallel decoding to predict many tokens that frequently appear together in one forward pass.
snyhlxde
11 days ago
"Try to fix all the words in a sentence at once. Keep iterating until you don't think it needs fixing."
programjames
11 days ago
Would something like this apply to MAMBA/JAMBA too?
fermuch
12 days ago
I think any next-token predictor will benefit. IIUC, Mamba is a next-token predictor.
I just skimmed the gradient article, but if their only change is swapping out the transformer block for the mamba block, I don't think it's already using this optimization
wrsh07
11 days ago
[dead]
Linda231
12 days ago