Doughnut Reader

Consistency LLM: converting LLMs to parallel decoders accelerates inference 3.5x

hao-ai-lab.github.io

zhisbug

102 comments

This mirrors what I experienced when I enrolled in "free drawing" (no teaching) classes:

While people considered me a good drawer since I was a child, I remember just repeating either similar detailed drawings I drew before, or otherwise just taking plenty of time to draw. I believe anyone with time and patience can make a nice drawing of a scene.

The "free drawing" class had no rules or lectures: you brought the materials you wanted to work with (some brought ink, others pencils, while I brought charcoal). The only thing determined was the timing between poses for the model: for each session the first few poses were very short (say a minute), and then the pose durations would progressively lengthen until say 5 minute poses. At all times you were free to tear your picture up and retry drawing the pose again.

My drawing skills improved considerably. The short "warmups" actually force you to get proportions and outlines correct on the first tries. Conventional wisdom says haste makes waste, but when learning or refining skills, it seems natural selection has hardcoded the sensation of haste as a stressor prompting attention and learning.

I am convinced I could have drawn similar quality drawings before enrolling in those classes, except they would have taken me easily 5 or 10 x as long to draw. Being forced not to beat around the bush and feeling the penalty of making a hasty mistake (further decreasing time left for the second try in the remaining time) does seem to work.

My only gripe is that the technique is termed "Consistency" whereas I would reserve such a term for an improvement in performance not inference speed, although I understand that they indicate "consistency with what would ultimately have been generated one token at a time". I would rather dub it "Proficiency LLM", where the same output is expected, only without the inhibition of stuttering to the same conclusion.

DoctorOetker

12 days ago

Hi we are CLLM authors and thanks for sharing your experience and insights! I can see this drawing skill refining process echoes with the training process in CLLM, the only thing is at this point stressor in CLLM training is not getting progressively demanding.

For example, while drawing, you can set very specific time limit on how long you are allowed to draw in each trial and make the time progressively shorter. In CLLM, maybe we can make this the learning process more and more difficult by mapping more and more distant states in Jacobi trajectory to its final state.

We are using the term "consistency" because we draw parallelism between consistency LLM and the consistency model in diffusion image generation where the training processes are analogous.

snyhlxde

11 days ago

Do you use same dataset to train / eval the model? Was the model used for example trained on GSM8K dataset for example?

boroboro4

11 days ago

Yes, we consider both domain-specific applications (spider for text2SQL, gsm8k for math, codesearchnet for python) as well as open-domain conversational applications (ShareGPT). We use test set from each application to evaluate CLLMs’ performance in our paper.

On the other hand, technically CLLM works on any kind of queries. But the speedup might vary. Feel free to try out our codebase for your use cases!

snyhlxde

11 days ago

Is it just me, or does this read like it was written by an LLM ... ?!

Quarrel

11 days ago

It's just much more formal than people generally speak on HN.

jasonjmcghee

11 days ago

lol I take that as a compliment. Good try but sadly no LLM in this writing :)

snyhlxde

11 days ago

I had an interesting experience in an Invertebrate Zoology lab class one summer.

We students were brought into a lab, given specimens to draw, and the only instructions we received were 'You have 30 minutes to draw this. Go.'

There was no "here's how to draw. here's what to do and not to do". It was just basically "We don't care about any insecurities you might have. We don't care if you think you can't draw. No excuses, just fucking draw it. Now."

Not only did we draw, but we (all of us) improved enormously over the course of the class as more animals were brought in and the exercise was repeated over and over and over again throughout the summer.

What it taught us is that everyone, and I mean everyone, can draw. Our collective attitude shifted from "don't know if this is even possible" to "of course we can do this. this is easy. routine. trivial."

Highly recommended approach.

It was the most freeing and amazing class I had in college.

aamargulies

11 days ago

That sounds like a pretty awesome experience. Thanks for sharing.

Version467

11 days ago

Systems generally become more efficient when under stress. They are also forced into local optima - everything has upsides and downsides.

manmal

12 days ago

Interestingly - this is the idea behind Nassim Taleb’s book “Antifragile” and the concept of “anti-fragility”.

In essence, it promotes dynamic/evolutionary/always learning behaviour than performing the same set of steps every time, and in the process, becoming stronger than before.

An example he shares is: how the breakdown of muscle tissue through exercise leads to more muscle development and an increase in strength. I guess it’s similar to LLM training using error/loss reducing functions (practice makes perfect) but dissimilar in the sense that training is a one—time action.

sheepscreek

11 days ago

> They are also forced into local optima

The good ol', "under pressure, you don't rise to the occasion, but sink to the level of your training"?

TeMPOraL

11 days ago

The authors mention that Jacobi decoding is equivalent to greedy autoregressive decoding, but in practice don't we often want the sampling temperature to be above zero to avoid repetitions and excessively generic responses?

I'm completely unfamiliar with this decoding strategy so maybe I'm just missing a simple way to account for that.

miven

12 days ago

Yes this is a great question! We are actively working on supporting other sampling strategies other than greedy sampling. In the context of CLLM training, instead of mapping to a static fixed point obtained from Jacobi decoding as the training ojbective, we term it dynamic fixed point. You can keep an eye on our github repo for new progress.

snyhlxde

11 days ago

Agreed. It's straightforward to check that a token was the argmax, but it seems difficult to check that a token appeared with the probability you wanted it to. You could still do the fine-tuning step I guess, where you train the trajectories to approach n-token completions with the statistics you want, but I can't see how you can replace the "check for a fixed point" step. Maybe "check the result was above this fixed threshold for likelihood".

matheist

12 days ago

I feel it's a pretty dangerous optimization before we REALLY understand what's going on inside of the LLM. e.g. guys believe in the geometric interpretation will have something to say, and it would probably hurt if you are using "filler" tokens.

Besides, the assumption (not a universal fact) that "forming complete sentences in mind before articulating word by word" seems overly simplifies activities happens in our mind: do we really have a complete planning before start talking/typing? as a Buddhist I lean towards it's an illusion. further more, what about simultaneous thoughts? are we linear thinker in the sentence level?

anyway, pretty neat math!

wangii

11 days ago

The optimization does not affect the result of LLM, it's guaranteed to produce equivalent results as decoding directly. Let's not treat that LLM as some magic that resembles our mind, it's just another program that produces sentences that happens to make sense.

renonce

11 days ago

> Let's not treat that LLM as some magic that resembles our mind,it's just another program that produces sentences that happens to make sense.

"That happen to make sense" is hiding a lot of magic. It would be statistically impossible to make as much sense as LLMs do in response to prompts if it did not actually make semantic distinctions. If it makes semantic distinctions, then it does resemble the human mind in at least one way.

naasking

11 days ago

According to the original Jacobi decoding paper, it's set in the machine translation tasks, with encoder + decoder, in which parallel algo applied only to the decoder part.

wangii

11 days ago

Lets not treat our mind as something magical. It's just another program that learned to speak by consuming lots of training input. The implementation might look slightly different from the outside, but from a mathematical perspective, artificial neural networks are proven to be at least as capable as the human mind.

sigmoid10

11 days ago

The best part is, your comment works both when sarcastic and completely serious.

baq

11 days ago

> artificial neural networks are proven to be at least as capable as the human mind

Do you have a source for this? I know we have models of neural networks designed to act like neurons, but those aren't what're being used.

ben-schaaf

10 days ago

See the universal approximation theorem for fully connected perceptrons.

sigmoid10

7 days ago

That's really nowhere near enough of a proof. You'd need to prove that a human brain is equivalent to a mathematical function, and that that function can be sufficiently approximated by a NN to be functionally identical.

Additionally UAT doesn't actually prove NNs can approximate any function. Non-continuous functions and infinitely large domains aren't covered.

ben-schaaf

4 days ago

Define ‘capable’ and most of the confusion and potential controversy goes away.

xpe

10 days ago

That assumption might be useful in this context, but I think it's pretty clearly not true. Ask anyone to tell you about a complex past event with a lot of parallel branches and you'll quickly see them add bits, pieces and tangents midsentence to cover the full range of events. I don't think I've seen the sentence granularity hypothesis in any serious scientific context before.

Etheryte

11 days ago

Can't speak for everyone but I definitely don't mentally form complete sentences before talking. Sometimes I grammatically talk myself into a corner in the middle of a sentence and need to use some awkward words/phrases to finish my thought, or simply pause and restart the phrase from the beginning.

hatthew

11 days ago

I feel surprisingly disconnected from my speaking self, acting as more of an observer, who is sometimes surprised at what I come up with. It just flows. I feel I have very little need for input.

But, I also feel fairly disconnected from my thinking self. I point my attention at something and solutions usually just pop out, maybe with some guidance/context forming required, in the form of internal dialog, which is usually of a rubber ducky style format [1], or mental testing of that mostly spontaneous solution.

I feel the "real" me is the one sensing/observing, which includes the observing of those spontaneous solutions, and what I say.

[1] Works with any problem space, not just coding "debugging": https://rubberduckdebugging.com/

nomel

10 days ago

are you practicing any meditation? it's regarded as "awaken" state in some practice! if you have any method, please share with me! thanks!

wangii

10 days ago

We don't appear to be forming words sequentially from underlying parts, even though in many languages they are broken down in smaller units that carry semantic meaning themselves. There doesn't seem to be any clear reason for this to break down suddenly at sentence level.

int_19h

11 days ago

What is the geometric interpretation?

causal

11 days ago

Wow, I'm mindblown this isn't getting more attention. This seems like a clear win for inference. Fine tuning cost for this is reasonable (around 0.01% of the original pre-training cost). And the performance wins seem fairly consistent.

alfalfasprout

12 days ago

Yes, seems like a huge important result for LLM performance.

I’m not aware of any other paper that has offered to increase inference LLM performance to this degree. Has there ever been one before?

At least while also:

- Maintaining output quality. The benchmarks used were somewhat narrow but so far so good.

- Improving not just query latency but also global throughput

- Not requiring more compute

- Having a relatively practical implementation and not adding big challenges and complexity

You could argue the insight is incremental, as it builds on what’s been done with parallel/jacobi decoding. Those previous results were necessary and important, but this may be the one that finally extracts real world value from the promise of parallel decoding.

WhitneyLand

11 days ago

Similar or greater inference wins are achieved with speculative decoding which is already widely used, so while this is really interesting (and was tried before with less success AFAIK), it's not yet clear how impactful it would be.

lopuhin

12 days ago

I don’t see where similar wins have ever been achieved.

Speculative decoding can reduce latency, but at the cost of using a lot more compute. The amazing thing here is latency and global throughput improvements would be realized because of the increase in efficiency.

From what I understand speculative decoding can also come with more challenges insofar as trying to maintain overall output quality.

WhitneyLand

11 days ago

Thanks for interesting in our work! Yes we found training with consistency loss + AR loss on even a subset of a dataset results in significant speedup (0.01% pre-training cost). Training on more data permits even further speedup: the model is able to learn from more frequently-appearing collocations and phrases.

For more details, please check out our paper and you can also see speedup saturates as the size of training data grows.

snyhlxde

11 days ago

At first I thoght that this was another Medusa-like paper, simply using more unembed heads for guessing subsequent tokes, but damn, not at all. This is amazing. And it doesn't even use extra parameters, it's just an auxiliary training loss.

andy12_

12 days ago

The only similarity between Medusa and CLLM is both train and adapt LLMs for fast inference. But they use completely different training technique, decoding technique and as you pointed out CLLMs don't need extra parameters or configuring attention mask for tree-based verification.

snyhlxde

11 days ago

Interesting

I think soon we are going to realize that we don’t really need training the models

We just need good indexing and sampling

Essentially at some level any LLM is equivalent to a DB of the dataset, with a great NLP interface on top

Both are just different methods of navigating stored data

nico

12 days ago

LLMs can easily produce data not in training dataset.

LLMs do not navigate stored data. An LLM is not a DB of the training data.

tempusalaria

11 days ago

I've had the same thought as above but unfounded (just a feeling, pretty much) so I'm curious to learn more. Do you have any references I can check out that supports these claims?

carlthome

11 days ago

Come up with a novel puzzle that is guaranteed to not be in the training set, and ask GPT-4 to solve it.

int_19h

11 days ago

To control for that doesn't seems trivial.

carlthome

8 days ago

But indexing *is* training. It's just not using end-to-end gradient descent.

sdrg822

12 days ago

The models are multiple orders of magnitude smaller than the compressed versions of their training data, they can not be the equivalent of a DB of it.

PeterisP

11 days ago

The training data is ideo-semantically compressed? News to me... is it perhaps stored in kanji?

lainga

11 days ago

You might like, the Infinigram paper then. It was discussed recently:

https://news.ycombinator.com/item?id=40266791

nsagent

12 days ago

[flagged]

JoannaWongs

11 days ago

Anyone know somewhere someone dumb like me can "Ask an AI expert"?

I want to ask, for example, how is it that an LLM when given the same prompt does not respond in the same deterministic way?

I guess I want to learn this stuff and should maybe follow one of those "write an LLM in an hour" type videos on YouTube.

JKCalhoun

11 days ago

> how is it that an LLM when given the same prompt does not respond in the same deterministic way?

In software (not in the model) here's literally a random number generator that picks from a weighted set of "next-token" choices that the model spits out. The selection process can have a series of knobs to manipulate the responses. If you want it to be deterministic (if you have direct access to the software) you can tell it to set "top-k = 1" or "temperature = 0.0" (depending on your software) and it will be deterministic.

Usually the default settings are not for determinism, because for whatever reason the quality of the results tends to not be that good when you go fully d.

throwawaymaths

11 days ago

For that answer, you can refer to the 3blue1brown videos

The llm model outputs a vector of probabilities for tokens, and the llm user picks a token from the most likely list using a random number

8note

11 days ago

It's because an LLM is essentially a probability matrix. You type a prompt, then it calculates what's the probability of getting a next word and so on, eventually forming a sentence. The probability learned is based on the training data.

Because of the underlying probability model, it's not going to be 100% deterministic. Plus a model like ChatGPT purposefully have "temperature" parameter that will further add randomisation to the whole process.

My answer is based on this paper if you're interested to read more: The Matrix: A Bayesian learning model for LLMs, https://arxiv.org/abs/2402.03175

zipfcharge

11 days ago

Are there any ways to show the source of the information retrieved by the model? For instance, the LLM forms a sentence and it points to a stackoverflow answer with the same or similar content.

flopriore

11 days ago

As I understand it, pretty sure that is impossible. When it is input a single datum, sure, trivial. As soon as it is fed a second one though the weights are already a kind of blend of the two tokens (so to speak).

JKCalhoun

11 days ago

Its not impossible, but its definitely difficult. There is some overlap in the methods used to detect benchmark data contamination, though its not entirely the same thing. For the detection use case, you already know the text you're looking for and you are just trying to demonstrate that the model has "seen" the data in its training set. The challenge is proving that it is statistically improbable that the model could stochastically generate the same tokens without having seen them during training.

Some great research exists in this area [1] and I expect much of it may be repurposed for black box attribution in the future (in addition to all the work being done in the mechanistic interpretability field)

[1] https://arxiv.org/abs/2311.04850

spmurrayzzz

11 days ago

> I want to ask, for example, how is it that an LLM when given the same prompt does not respond in the same deterministic way?

You can control that in most systems with an inference-set parameter called "temperature". But setting the temperature as low as possible tends to lead to very low-quality answers - the system can't crawl out of some local optimum and ends up repeating itself over and over. Such answers may be "deterministic" but they're also not good.

zozbot234

11 days ago

I found this to be a good start that explains things fairly methodically, but without losing the high-level perspective.

https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...

int_19h

11 days ago

For this particular question, ask chatgpt how temperature affects llm softmax sampling.

For other things, study using Karpathy's videos.

rahimnathwani

11 days ago

> ... speculative decoding methods ... incurs extra memory cost during inference time.

Any detail on this? For speculative decoding you need a smaller model to generate "branches" which are fast but maybe inaccurate and verify these branches later with a larger model. However, only memory equivalent to a single token is needed for speculative decoding, and tokens in other branches are simply masked out during inference. With a context size of 1000 and ~30 branches for 5 tokens, the memory overhead would be 3% which is negligible. If your context size is much smaller compared to the number of branches - would someone who use a generative LLM with a context window of just 50 tokens care about generation speed?

Also, speculative decoding techniques are not restricted to greedy sampling - it's expected to behave exactly the same as the original model and sample with the expected probabilities. Most literature on speculative decoding already reports 2.6x-3.5x speedup. The blog post here reports 2.4x-3.4x generation speed - which isn't that much of an upgrade?

While I mentioned speculative decoding above and Medusa2 and Eagle seems to be the techniques that the author compares against, the core problem remains: whatever method you use to predict tokens ahead of time, there is a specific point where the previous tokens are absolutely needed before predicting the next token. It doesn't depend on what your model is or what your techniques are, it's just about what is mathematically achievable. How can you predict 5 tokens at once if the probability distribution of the 5th next token depends heavily on the previous 4 tokens? Speculative decoding, Jacobi decoding, multi-token parallel decoding, whatever.

If only greedy sampling is supported for this, then I wonder what are the advantages of this method, not to mention that other techniques already achieve the expected speedup. Comparing greedy sampling speedups to random sampling speedups is comparing apples to oranges, and I doubt if the speedup described by the method would remain after this method is adapted to random sampling (due to the core problem mentioned above).

renonce

11 days ago

`the previous tokens are absolutely needed before predicting the next token'

Maybe this is the key contribution of this paper: demonstrating that LLMs can predict the next n-tokens even if there are incorrect guesses in previous tokens through consistency training?

On the other hand, while mathematically it is true that p(x_t|x_1,...,x_t-1) depends on all x_1 to x_t-1, in practice, it is possible that predicting x_t only requires x_1 to x_t-2, and the attention to x_t-1 is minimal. Thus, predicting x_t with x_1 to x_t-2 and inaccurate x_t-1 is possible.

cxczz

11 days ago

Speculative decoding requires you to load the smaller model into memory and run inference on it.

Palmik

11 days ago

I think the smaller model is at least 20 times smaller. If you do speculative decoding on a 70B model an 1B model would be appropriate.

renonce

11 days ago

There's no free lunch™, so from what I can tell there's some pathway loss here. E.g. some Jacobi trajectories definitionally exclude higher temperature paths. Which might actually be a positive given data retrieval (but a negative if we want to maximize for creativity?).

dvt

12 days ago

There are better and worse algorithms. I'm not sure "there is no free lunch" always applies in a particularly meaningful way. Some things aren't on the pareto frontier.

wrsh07

11 days ago

Kinda like the aiff -> mp3 conversion process. A lot of data is lost, but we human can really tell the too much of a difference?

factormeta

11 days ago

There's no reason to think the current next token prediction models are optimal for predicting sentences (they aren't!)

> An algorithm may outperform another on a problem when neither is specialized to the problem

https://en.m.wikipedia.org/wiki/No_free_lunch_in_search_and_...

wrsh07

11 days ago

I would go even further and say there isn't any indication that we are even close to what is possible. My subjective feeling is that with the current rate of progress it is entirely possible that we will have GPT-4 level performance locally on smartphone hardware within 3-10 years (unless companies decide again that they don't want to give this kind of power away)

stkdump

11 days ago

Probably. Advancements in ML algorithms, like this one, have been outpacing advancements in hardware for awhile now, so both are converging on making ML faster and ubiquitous.

naasking

11 days ago

Interesting stuff. I guess the idea has occurred to many but was well written and presented.

toxik

12 days ago

Yep. My roommate and I were talking about this a year ago. You can also do something similar for LLM steering.

programjames

11 days ago

> Our research shows this process – mimicking human cognitive process of forming complete sentences in mind before articulating word by word

This is not how I work. Is there something wrong with me?

doctor_eval

12 days ago

Nor is it how I work, I think that's normal enough. I do have an idea of what I'm going to say before I say it, I think that's closer to what they meant. I think and speak in increments of ideas, not words.

jerbear4328

12 days ago

> I think and speak in increments of ideas

extremely common among (but not unique to) people with ASD, those "increments of ideas" are called "gestalts".

https://kidtherapy.org/helpful-articles/what-is-gestalt-lang...

paulmd

12 days ago

In some conversations, maybe it's easier to form complete sentences. In some others, the best we can do is: have a rough draft about what to say in mind and then refine it word by word while speaking.

snyhlxde

11 days ago

You might not have an internal monologue. A lot of us don't, and the ones that do are equally shocked every time they find out. For what it's worth, I'm in the same boat—can form sentences, but why would I? It'd slow me down.

People who don't have inner monologues tend to assume that all that stuff is some form of analogy or metaphor. It's not. It's entirely literal.

Filligree

12 days ago

Do you mean in a real time conversation?

Because I definitely dont "have an internal monologue about what I'm going to say" in the 100ms between when someone asks a casual question and I respond to it.

oceanplexian

12 days ago

Yes, it is possible to maintain an internal monologue in real time conversation. That is one of the reasons why some people usually take longer than 100ms to respond.

int_19h

11 days ago

Are you sure. It might not be the whole sentence, but I would find it hard to believe that in practice the way you speak or write is like

hello <think> May <think> be <think> I'll <think> go <think> get <think> break <think> fast

throwawaymaths

11 days ago

They probably do not mean people form entire sentences before expressing them, I am not aware of anybody doing that. I assume it refers to people first coming up with a global outline of what they want to say before they start speaking.

DrSiemer

12 days ago

"Rem tene, verba sequentur" (you hold the matter, then words come) is largely "how it works".

You form logical ideas as you speak, as you speak your speech develops, so the translation is from ideas to sentences. It is not clear in which phase one would mentally form a complete sentence, nor why it should be relevant. You "see something [that makes sense]", then you describe it - iteratively.

mdp2021

12 days ago

You are probably pretty far from the LLM extreme, though, of thinking one token at a time.

causal

11 days ago

Probably.

giardini

12 days ago

> Surprisingly, we find such an objective is analogous to that of consistency models

This is why numerical methods should be part of the ML curriculum.

programjames

11 days ago

Can't wait to see something like this merged into ollama (I'm sure there would be plenty of people fine-tuning models for it).

rcarmo

12 days ago

Ollama doesn't have their own inference engine, they just wrap llama.cpp. But yes, it will be awesome when it's more generally available.

Me1000

12 days ago

The lab is tied to the vLLM project. I would say it might get picked up sooner by vLLM than other inference frameworks.

helloericsf

12 days ago

from CLLM authors:

Thank you guys for the great questions and insights! We have made a Twitter posts with some more details and we invite you to engage with us on Twitter as well.

https://twitter.com/haoailab/status/1788269848788869299

snyhlxde

11 days ago

[deleted]

11 days ago

Is this how Groq (https://groq.com/) is so fast, or are they doing something different?

paulclark

12 days ago

Groq is serving an LLM from (100s of chips worth of) SRAM, so the effective bandwidth thus token generation speed is an order of magnitude higher than HBM. This would 3.5x their speed as well, it is orthogonal.

buildbot

12 days ago

I'm surprised no one has done this for a GPU cluster yet - we used to do this for RNNs on GPUs & FPGAs at Baidu:

https://proceedings.mlr.press/v48/diamos16.pdf

Or better yet - on Cerebras

Kudos to groq for writing that kernel

gdiamos

11 days ago

My understanding is that theirs is a pure hardware solution. The hardware is flexible enough to model any current NN architecture.

(Incidentally, there are black box optimization algorithms, so a system as good as grok at inference might be useful for training even if it can't support gradient descent)

wrsh07

11 days ago

According to someone I talked to at groq event I was invited to (I did not sign an nda), They are putting ~8 racks of hardware per llm. Of course coordinating those racks to have exact timings between them to pull tokens through is definitely "part of the hard part".

throwawaymaths

11 days ago

They can quickly try with one of the open source models, then show a side by side demo

m3kw9

12 days ago

Could someone please explain the intuition around this technique in more lament terms?

ec109685

12 days ago

For all of these "how can we batch predicting the next n tokens?" the intuition is basically that it takes a buttload of math to predict some of the tokens, but that most tokens are actually easy to guess. For example, if I asked "What was that phone number from that 80's song?" as soon as a model generates 867- it shouldn't take that much math at all to finish predicting 5309.

TomatoCo

12 days ago

A bit more intuition on how training works: in natural language processing, some phrases/collocations, for example "remind ... of ...", "make a decision", "learn a skill" etc. are used together. We can ask LLMs to learn such collections & frequently appearing n-grams. After learning, the model can use parallel decoding to predict many tokens that are frequently appear together in one forward pass.

snyhlxde

11 days ago

[deleted]

12 days ago

"Try to fix all the words in a sentence at once. Keep iterating until you don't think it needs fixing."

programjames

11 days ago

Would something like this apply to MAMBA/JAMBA too?

fermuch

12 days ago

I think any next token predictor will benefit. Iiuc mamba is a next token predictor.

I just skimmed the gradient article, but if their only change is swapping out the transformer block for the mamba block, I don't think it's already using this optimization

wrsh07

11 days ago

[dead]

Linda231

12 days ago