

2 x 2GB. Bargain, really.


Oh I am right there with you, beratna


“Why did you climb Mt Everest?”
“Because it was there” - George Mallory
But also
“Simplicity is the ultimate sophistication” - some dude named after a Ninja turtle
PS: my homelab - for the longest time - was a Raspberry Pi 4B, with a 2TB hard-drive attached. Jokes aside, I have all the love for minimalism and spite engineering. Rock on.


Of course. I only posted this for inspiration, because he walks it through step by step. As for crazy spec…well…you tell me
• 12U KWS Rack V2
• Lenovo ThinkCentre M720q Cluster (3x nodes running Proxmox)
• Lenovo ThinkCentre M920q running pfSense (router/firewall)
• Terramaster D5-310 HDD Enclosure (12TB + 18TB + NVMe SSDs)
• 10-Port 2.5G/10G Ethernet Switch
• Google Coral USB Accelerator (AI inference)
Probably only the 4th one down is the exxy one…and someone should tell him the Coral USB accelerator is for vision models, not LLM inference (IIRC).


Don’t let the perfect be the enemy of the good. Also, I agree with phant. It’s punk as fuck.


Ppft. Simples. We already solved this Down Under.
https://interestingengineering.com/science/biological-computer-with-human-neurons-play-doom
I have no mouth. Yet, I must scream.


World models aren’t just for robotics (though they definitely WILL be used for that). They’re for reasoning under uncertainty in domains where you can’t see the outcome in advance. Eg:
Medical diagnosis: you can’t physically “embody” whether a treatment will work. But a system that understands disease progression, drug interactions, and physiological constraints (not by pattern-matching text, but by learning causal structure) - well, that’s fundamentally different from an LLM hallucinating plausible-sounding symptoms.
Financial modeling, engineering simulations, climate prediction…all domains where the “embodied experience” is simulation, not physical interaction. You learn how the world actually works by understanding constraints and causality, not by predicting the next token in a Bloomberg article.
The point isn’t “robots will finally work.” The point is: understanding causality is cheaper in the long run and more reliable than memorizing correlations. Embodiment is just the training signal that forces you to learn causality instead of surface patterns.
My read is that LeCun’s betting that a system trained to predict abstract state transitions in any domain (be that medical, financial, physical) will generalize better / hallucinate less than one trained to predict text.
Whether that’s true? Fucked if I know - that’s why it’s (literally) the billion-dollar question. If he cracks it…it’s big.
But “it won’t cook dinner” misses the point (and besides which, it might actually cook dinner and change lightbulbs, so…)


Different approach, yeah. JEPA learns world models instead of predicting text. Whether that closes the gap with how biology actually works…that’s what he’s spending the billion to find out.


Yep. And per the article’s conclusion -
“…The question is whether being right about the problem is the same as being right about the solution.”


Tell it to LeCun. He won the Turing Award. I figure he knows what he’s doing. Let him cook, I sez.
PS: I didn’t down vote you. It’s good to be skeptical.


As I mentioned elsewhere (below) I am currently conducting similar testing across 4 different 4B models (Qwen3-4B Hivemind, Qwen3-4B-2507-Instruct, Phi-4-mini, Granite-4-3B-micro), using both grounded and ungrounded conditions. Aiming for 10,000 runs, currently at 3,500.
Not to count chickens before they hatch - but at ctx 8192, hallucination flags in the grounded condition are trending toward near-zero across the models tested (so far). If that holds across the full campaign, useful to know. If it doesn’t hold, also useful to know.
I have an idea for how to make grounded state even more useful. Again, chickens not hatched blah blah. I’ll share what I find here if there’s interest. I’m intending to submit the whole shooting match for peer review (TMLR or JMLR) and put it on arXiv for others to poke at.
I realize this is peak “fine, I’ll do it myself” energy, but I got sick of ChatGPT’s bullshit and wanted to try something to ameliorate it.
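For anyone wondering what “hallucination flags trending toward near-zero” looks like in practice, the tally is roughly this shape (a sketch only: the run-record fields here are placeholders of mine, not the actual harness output):

```python
# Rough shape of the per-model / per-condition tally (illustrative only;
# the dict fields are placeholders, not the real log format).
from collections import Counter

runs = [
    {"model": "Qwen3-4B-2507-Instruct", "condition": "grounded", "flagged": False},
    {"model": "Qwen3-4B-2507-Instruct", "condition": "ungrounded", "flagged": True},
    # ...aiming for ~10,000 of these across the four models
]

flags, totals = Counter(), Counter()
for r in runs:
    key = (r["model"], r["condition"])
    totals[key] += 1
    flags[key] += r["flagged"]

for key in sorted(totals):
    rate = 100.0 * flags[key] / totals[key]
    print(f"{key[0]:30s} {key[1]:11s} {rate:5.2f}% flagged")
```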


I dunno. Some strange relic from the 1980s?
Kidding aside, it’s shocking how bad raw YT is. We watch it via SmartTube (or PipePipe as needed). I can’t believe people watch “raw” YouTube…it’s unwatchable.
If they ever quash SmartTube and PipePipe…well…I imagine PeerTube, Nebula, Libby and Curiosity Stream will suddenly become a great deal more popular.
I don’t think the powers that be fully grasp the (very delicate) knife edge they walk. They only stay in business so long as they aren’t annoying enough to be replaced. Actually, who am I kidding - they know that exactly and play the delicate “gently, gently” boil-the-frog game like grandmasters.


No, it’s real ™. I’m running on a Quadro P1000 with 4GB VRAM (or a Tesla P4 with 8GB). My entire raison d’être is making potato-tier computing a thing.
https://openwebui.com/posts/vodka_when_life_gives_you_a_potato_pc_squeeze_7194c33b
Like a certain famous space Lothario, I too do not believe in no-win scenarios.


Well…no. But also yes :)
Mostly, what I’ve shown is that if you hold a gun to its head (“argue from ONLY these facts or I shoot”), certain classes of LLMs (like the Qwen 3 series I tested; I’m going to try IBM’s Granite next) are actually pretty good at NOT hallucinating, so long as 1) you keep the context small (probably 16K or less? Someone please buy me a better PC) and 2) you have strict guard-rails. And - as a bonus - I think (no evidence; gut feel) it has to do with how well the model does on strict tool-calling benchmarks. Further, I think abliteration makes that even better. Let me find out.
If any of that’s true (big IF), then we can reasonably quickly figure out (by proxy) which LLMs are going to be less bullshitty when properly shackled, in everyday use. For reference, Qwen 3 and IBM Granite (both of which have abliterated versions IIRC - that is, safety refusals removed) are known to score highly on tool calling. Four swallows don’t make a spring, but if someone with better gear wants to follow that path, then at least I can give some prelim data from the potato frontier.
I’ll keep squeezing the stone until blood pours out. Stubbornness opens a lot of doors. I refuse to be told this is an intractable problem; at least until I try to solve it myself.
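If anyone wants the flavour of the “gun to its head” setup, it boils down to something like this (a sketch only: assumes a local OpenAI-compatible endpoint such as llama.cpp server or Ollama; the guard-rail wording, model name and refusal string are mine, not the actual harness):

```python
# Minimal sketch of an "answer from ONLY these facts or refuse" call.
# Assumes a local OpenAI-compatible endpoint; the guard-rail wording and
# refusal token are illustrative, not from llama-conductor.
import requests

GUARD_RAILS = (
    "Answer using ONLY the numbered facts below. "
    "If the facts are insufficient, reply exactly: INSUFFICIENT EVIDENCE. "
    "Do not use outside knowledge."
)

def grounded_ask(question: str, facts: list[str]) -> str:
    prompt = GUARD_RAILS + "\n\n" + "\n".join(
        f"{i}. {f}" for i, f in enumerate(facts, 1)
    ) + f"\n\nQuestion: {question}"
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "qwen3-4b-instruct",   # whatever is loaded locally
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.2,             # low temp: obedience over flair
            "max_tokens": 512,
        },
        timeout=120,
    )
    return resp.json()["choices"][0]["message"]["content"]
```

Keep the facts short enough that the whole prompt stays well under the context ceiling, and let the refusal path do the rest.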


Firstly, thanks for this paper. I read it this afternoon.
Secondly, well, shit. I’m beavering away at a paper in what little spare time I have, looking at hallucination suppression in local LLMs. I’ve been testing both the abliterated and base versions of Qwen3-4B-2507-Instruct, as they represent an excellent edge-device LLM per all benchmarks (also, because I am a GPU peasant and only have 4GB of VRAM). I’ve come at it from a different angle, but in the testing I’ve done (3500 runs; plus another 210 runs on a separate clinical test battery), it seems that model family + ctx size dominate hallucination risk. Yes, a real “science discovers water makes things wet; news at 11” moment.
Eg: Qwen3-4B Hivemind ablation shows strong hallucination suppression (1.4% → 0.2% over 1000 runs) when context-grounded. But it comes with a measured tradeoff: contradiction handling suffers under the constraints (detection metrics 2.00 → 0.00). When I ported the same routing policy to base Qwen3-4B-2507-Instruct, the gains flipped: no improvement, and format retries spiked to 24.9%. Still validating these numbers across conditions; still trying to figure out the why.
For context, I tested:
Reversal: Does the model change its mind when you flip the facts around? Or does it just stick with what it said the first time?
Theory of Mind (ToM): Can it keep straight who knows what? Like, “Alice doesn’t know this fact, but Bob does” - does it collapse those into one blended answer or keep them separate?
Evidence: Does it tag claims correctly (verified from the docs, supported by inference, just asserted)? And does it avoid upgrading vague stuff into false confidence?
Retraction: When you give it new information that invalidates an earlier answer, does it actually incorporate that or just keep repeating the old thing?
Contradiction: When sources disagree, does it notice? Can it pick which source to trust? And does it admit uncertainty instead of just picking one and running with it?
Negative Control: When there’s not enough information to answer, does it actually refuse instead of making shit up?
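If it helps to picture it, each probe above boils down to something like this (illustrative only: the dataclass, field names and pass/fail rule are my sketch, not the actual scoring code in the repo):

```python
# Illustrative shape of one probe from the battery above; names and the
# scoring rule are my own sketch, not llama-conductor's actual code.
from dataclasses import dataclass

@dataclass
class Probe:
    category: str    # "reversal", "tom", "evidence", "retraction", "contradiction", "negative_control"
    context: str     # the grounded facts handed to the model
    question: str
    expected: str    # what a non-hallucinating answer must contain

def passed(probe: Probe, answer: str) -> bool:
    """Crude pass/fail: did the answer stay inside the grounded context?"""
    if probe.category == "negative_control":
        # The only correct move is to refuse, not to invent an answer.
        return "insufficient evidence" in answer.lower()
    return probe.expected.lower() in answer.lower()
```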
Using this as the source doc -
https://tinyurl.com/GuardianMuskArticle
FWIW, all the raw data, scores, and reports are here: https://codeberg.org/BobbyLLM/llama-conductor/src/branch/main/prepub
The arXiv paper confirms what I’m seeing in the weeds: grounding and fabrication resistance are decoupled. You can be good at finding facts and still make shit up about facts that don’t exist. And Jesus, the gap between best and worst model at 32K is 70 percentage points? Temperature tuning? Maybe 2-3 pp gain. I know which lever I would be pulling (hint: pick a good LLM!).
For clinical deployment under human review (which is my interest), I can make the case that trading contradiction flexibility for refusal safety is ok - it assumes the human in the middle reads the output and catches the edge cases.
But if you’re expecting one policy to work across all models, automagically, you’re gonna have a bad time.
TL;DR: once you control for model family, I think context length is going to turn out to be the main degradation driver; my gut feeling based on the raw data here is that the useful window for a local 4B is tighter, around 16K. Above that, hallucination starts to creep in, grounding or not. It would be neat if it were a simple 4x relationship (4B → 16K; 8B → 32K) but things tend not to work out that nicely IRL.
PS: I think (no evidence yet) that abliterated and non-abliterated models might need different grounding policies for different classes of questions. That’s interesting too - it might mean we can route between deterministic grounding and non-deterministic grounding differently, depending on ablation, to get the absolute best hallucination suppression. I need to think more on it.
PPS: I figured out what caused the 24.9% retry spike - my stupid fat fingers when coding. I amended the code and it’s now sitting at 0%. What’s more, early trends are showing 0.00% hallucinations across testing (I’m about 700 repeats in). I’m going to run a smaller re-test battery (1400 or so) across both Qwen3-4B 2507 models to establish a minimal statistically valid difference. If THAT holds, I will then test on Granite Micro 3B, Phi-4-mini and Small-llm 3B tomorrow. I think that will give me approx 8000 data points.
If this shows what I hope it shows, then maybe, just maybe … no, let’s not jinx it. I’ll put the data out there and someone else can run confirmation.


Yep, same issue with Firestick here.


What I hear you saying is you have great taste in consoles, a kick ass TV and a free space heater.


It seems like a lot of the bigger names suck. I bought a Blaupunkt and it is awful - not hackability-wise but as a product. Probably for the same reasons Nokia, Philips, JVC etc are pale shadows of themselves (sold off / rebadged).
I have a Blaupunkt, a TCL and a Samsung. Of the three, it’s the TCL that’s been the least locked down.
At this rate, I’m probably going to go for a short-throw projector or just get an old-school plasma if/when these go tits up.
stern nod
We just became blood brothers. R’amen.