I first started this hobby almost a year ago. Llama 3 8B had been released a day or so prior. I had finally caught on and loaded up a llamafile on my old ThinkPad.
It only ran at 0.7-1 t/s. But it ran. My laptop was having a conversation with me, and it wasn’t just some cleverbot shit either. I was hooked, man! It inspired me to dig out the old gaming rig collecting cobwebs in the basement and actually understand my specs. Machine learning and neural networks are fascinating.
From there I rode the train of higher and higher parameter counts, newer and better models. My poor old Nvidia 1070 8GB has its limits though, as do I.
I love Mistral models. Mistral Small 24B at Q4_K_M was the perfect upper limit for balancing quality against speed, at just over 2.7-3 t/s. But with DeepHermes in CoT mode spending thousands of tokens thinking, that got very time consuming.
Well, I had neglected to try DeepHermes 8B, the one based on my first model, Llama 3. Until now. I can fit the highest Q6 quant on my card completely. I’ve never loaded a model fully in VRAM before, always partial offloading.
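For anyone curious what that difference looks like in practice, here’s a rough sketch with llama-cpp-python. The filenames, layer count, and context size are just placeholders standing in for my setup, not exact values:

```python
# Sketch of partial vs. full GPU offload with llama-cpp-python.
# Model paths and layer counts below are placeholders, not my real files.
from llama_cpp import Llama

# Partial offload: only some layers fit in the 1070's 8GB,
# the rest run on the CPU (what I'd been doing with 24B models).
partial = Llama(
    model_path="mistral-small-24b.Q4_K_M.gguf",  # hypothetical filename
    n_gpu_layers=20,   # however many layers actually fit in VRAM
    n_ctx=8192,
)

# Full offload: the 8B Q6 fits entirely in VRAM.
# n_gpu_layers=-1 offloads every layer to the GPU.
full = Llama(
    model_path="DeepHermes-3-Llama-3-8B.Q6_K.gguf",  # hypothetical filename
    n_gpu_layers=-1,
    n_ctx=8192,
)
```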
What a night and day difference it makes! Entire paragraphs in seconds instead of a sentence or two. I thought an 8B would be dumb as rocks, but it has bravely tackled many tough questions and leveraged its modest knowledge base plus R1-distilled CoT to punch above my expectations.
It’s absolutely incredible how far things have come in a year. I’m deeply appreciative, and glad to have a hobby that makes me feel a little excited.
I’ve tried the official DeepSeek R1 distill of Qwen 2.5 14B and a few unofficial Mistral finetunes trained on R1 CoT. They are indeed pretty amazing, and before this released I found myself regularly switching between a general-purpose model and a thinking model.
DeepHermes is a family of thinking models with R1-distilled CoT that you can toggle: either standard short output, or spending a few thousand tokens thinking through a solution.
I found that pure thinking models are fantastic for certain kinds of problem-solving questions, but awful at following system prompt changes for roleplay scenarios or adopting complex personality archetypes.
This lets you have your cake and eat it too, by making CoT optional while keeping regular system prompt capabilities.
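Here’s a minimal sketch of how that toggle works through llama-cpp-python’s chat API. The deep-thinking system prompt below is paraphrased from memory, so grab the exact wording from the model card on Hugging Face, and the filename is again just a placeholder:

```python
# Minimal sketch of toggling DeepHermes between standard and thinking mode
# via the system prompt. The thinking prompt is paraphrased, not exact.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepHermes-3-Llama-3-8B.Q6_K.gguf",  # hypothetical filename
    n_gpu_layers=-1,
    n_ctx=8192,
)

# Paraphrase of the deep-thinking system prompt from the model card.
THINKING_SYSTEM_PROMPT = (
    "You are a deep thinking AI. You may use extremely long chains of thought "
    "to consider the problem, and should enclose your internal reasoning in "
    "<think></think> tags before giving your final answer."
)

def ask(question: str, think: bool = False) -> str:
    """Ask a question; flip `think` to trade speed for long CoT reasoning."""
    messages = []
    if think:
        messages.append({"role": "system", "content": THINKING_SYSTEM_PROMPT})
    messages.append({"role": "user", "content": question})
    out = llm.create_chat_completion(messages=messages, max_tokens=4096)
    return out["choices"][0]["message"]["content"]

# Quick direct answer vs. letting it burn a few thousand tokens thinking:
print(ask("What's the capital of France?"))
print(ask("How many primes are there between 100 and 200?", think=True))
```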
The thousands of tokens spent thinking get time consuming when you’re only getting 3 t/s on the larger 24B models. So it’s important to choose between getting a direct answer or spending five minutes to let it really think. Its abilities are impressive, even if it takes 300 seconds to fully think out a problem at 2.5 t/s.
That’s why I’m so happy the 8B model is pretty intelligent with CoT enabled: I can fit a thinking model entirely in VRAM, and it’s not dumb as rocks in its knowledge base either. I’m getting 15-20 t/s with the 8B instead of 2.5-3 t/s partially offloading a larger model. A roughly 6.4x speed increase on the CoT is a huge W for the real-life human time I spend waiting for a complete output.
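To put that in real-life-waiting terms, here’s the back-of-the-envelope math. The 3000-token thinking budget is just an illustrative guess, not a measured number:

```python
# Back-of-the-envelope wait times for a thinking-mode answer.
# The CoT token budget below is an illustrative guess.
thinking_tokens = 3000

slow = 2.5   # t/s, 24B partially offloaded
fast = 16.0  # t/s, 8B fully in VRAM

print(f"24B partial offload: {thinking_tokens / slow / 60:.1f} minutes")  # ~20 minutes
print(f"8B fully on GPU:     {thinking_tokens / fast / 60:.1f} minutes")  # ~3 minutes
print(f"speedup:             {fast / slow:.1f}x")                         # 6.4x
```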