  • Problem with the “benchmarks” is Goodhart’s Law: once a measure becomes a target, it ceases to be a good measure.

    The AI companies’ obsession with these tests causes them to maniacally train on them, making them better at those tests, but that doesn’t necessarily map to real-world usefulness. Occasionally you’ll see a guy who interviews well but is pretty useless on the job. LLMs are basically that guy all the time, but at least useful because they are cheap and fast enough to be worth it for the super easy bits.


  • On code completion, I think it can do like 2 or 3 lines in particular scenarios. You have to have an instinct for “are the next three lines so blatantly obvious it is actually worth reading the suggestion, or should I just ignore it because I know it’s going to screw up without even looking”.

    Very, very rarely do I find prompt-driven coding to be useful, and then only for stuff that is pure boilerplate but also very tedious. Like “allow the user to specify these three parameters in this CLI utility”, and poof, you get reasonable argv handling pretty reliably (see the sketch after this comment).

    Rule of thumb: if a viable answer could be expected during an interview from a random junior applicant, it’s worth giving the LLM a shot. If it’s something a junior developer could only get right after learning on the job a bit, then forget it, the LLM will be useless.
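
    For a concrete sense of that boilerplate level, here’s a minimal sketch of the kind of argv handling I mean, using Python’s standard argparse module; the three parameters are made-up stand-ins:

    ```python
    import argparse

    def parse_args():
        # Plain-vanilla argparse boilerplate: tedious, well-trodden code
        # that an LLM suggestion usually gets right on the first try.
        parser = argparse.ArgumentParser(description="Example CLI utility")
        parser.add_argument("--input", required=True, help="path to the input file")
        parser.add_argument("--count", type=int, default=10, help="number of items to process")
        parser.add_argument("--verbose", action="store_true", help="enable verbose output")
        return parser.parse_args()

    if __name__ == "__main__":
        args = parse_args()
        print(args.input, args.count, args.verbose)
    ```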


  • after 5 mins talking to an LLM about said field.

    The insidious thing is that LLMs tend to be pretty good at 5-minute first impressions. I’ve repeatedly seen people set out to evaluate an LLM and fall back to “OK, if this were a human, I’d ask a few job interview questions: well known enough that they have a shot at answering, but tricky enough to show they actually know the field”.

    As an example, a colleague became a true believer after being directed by management to evaluate it. He asked it to “generate a utility to take in a series of numbers from a file and sort them and report the min, max, mean, median, mode, and standard deviation”, and it did so instantly, with “only one mistake” (a minimal version of that utility is sketched below). Then he tried the exact same question later in the day, it happened not to make that mistake, and he concluded it must have ‘learned’ how to do it in the intervening hours. Of course that’s not how it works; generation is probabilistic, and any perturbation of the prompt can produce unexpected variation. But he doesn’t know that…

    Note that management frequently never gets beyond tutorial/interview-question fodder on the technical side of their teams, and you get to see how they might tank their companies because the LLMs “interview well”.
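
    For reference, a minimal sketch of what that interview-style utility looks like, using Python’s standard statistics module; the one-number-per-line file format is my assumption:

    ```python
    import statistics
    import sys

    def report(path):
        # Read one number per line (assumed format) and sort them.
        with open(path) as f:
            numbers = sorted(float(line) for line in f if line.strip())
        print("min:   ", min(numbers))
        print("max:   ", max(numbers))
        print("mean:  ", statistics.mean(numbers))
        print("median:", statistics.median(numbers))
        print("mode:  ", statistics.mode(numbers))
        # Sample standard deviation; needs at least two numbers.
        print("stdev: ", statistics.stdev(numbers))

    if __name__ == "__main__":
        report(sys.argv[1])
    ```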


  • The overall interface can, which leads to fun results.

    Prompt for image generation and you have one model doing the text and a different model doing the image. The text model pretends it is generating the image, but it has no idea what that image looks like, so you can make the text and image interaction make no sense (or it will do that all on its own). Have it generate an image, then lie to it about the image it generated, and watch it show it has no idea what picture was ever produced, all the while pretending it does, without ever admitting that it’s actually delegating the image. It just says “I am correcting that for you”. Basically talking like an executive at a company, which helps explain why so many executives are true believers.

    A common thing is for the ensemble to recognize mathy stuff and feed it to a math engine, perhaps after LLM techniques to normalize the math.
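
    A toy sketch of that delegation pattern (the routing rule and the “math engine” here are invented for illustration, not how any particular product does it):

    ```python
    import ast
    import operator

    # Whitelisted operators for the deterministic "math engine" side.
    OPS = {
        ast.Add: operator.add,
        ast.Sub: operator.sub,
        ast.Mult: operator.mul,
        ast.Div: operator.truediv,
    }

    def safe_eval(expr):
        # Evaluate plain arithmetic without any model involved.
        def walk(node):
            if isinstance(node, ast.Expression):
                return walk(node.body)
            if isinstance(node, ast.BinOp) and type(node.op) in OPS:
                return OPS[type(node.op)](walk(node.left), walk(node.right))
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            raise ValueError("unsupported expression")
        return walk(ast.parse(expr, mode="eval"))

    def call_llm(prompt):
        # Stand-in for the text model; a real system would call an API here.
        return f"(text model would answer: {prompt!r})"

    def route(user_input):
        # Crude router: if the input looks like pure arithmetic,
        # bypass the text model and hand it to the math engine.
        if user_input.strip() and all(c in "0123456789.+-*/() " for c in user_input):
            return safe_eval(user_input)
        return call_llm(user_input)

    print(route("2 * (3 + 4.5)"))   # -> 15.0, no LLM involved
    print(route("why is the sky blue?"))
    ```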



  • Not a single one of the issues I brought up years ago was ever addressed, except one.

    That’s the thing about AI in general: it’s really hard to “fix” issues. You can maybe try to train a problem out and hope for the best, but then you might play whack-a-mole, as the fine-tuning that fixes one issue makes others crop up. So you pretty much have to decide which problems are the most tolerable and largely accept them. You can apply alternative techniques to maybe catch the egregious issues, with strategies like a non-AI check that stuffs the prompt to nudge the model in a certain general direction, as sketched below (if it’s an LLM; other AI technologies don’t have this option, but they aren’t the ones getting crazy money right now anyway).

    A traditional QA approach is frustratingly less applicable, because you more often have to shrug and say “the attempt to fix it would be very expensive, isn’t guaranteed to actually fix the precise issue, and risks creating even worse issues”.
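
    A minimal sketch of that prompt-stuffing workaround (the failure trigger and steering text are invented for illustration):

    ```python
    import re

    # Known-bad pattern spotted by ordinary, deterministic code.
    KNOWN_BAD = re.compile(r"\bdelete\s+all\b", re.IGNORECASE)  # made-up trigger

    STEERING = (
        "Reminder: never emit destructive commands; "
        "ask the user to confirm scope first.\n"
    )

    def build_prompt(user_input):
        # Non-AI pre-filter: cheap and testable compared to fine-tuning,
        # but it only nudges the model; it guarantees nothing.
        prefix = STEERING if KNOWN_BAD.search(user_input) else ""
        return prefix + user_input

    print(build_prompt("please delete all rows older than a week"))
    ```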







  • Same thing happened with 5G: claims that categorically new stuff would be possible with 5G that just couldn’t be done at all with LTE. IoT and VR were buzzwords thrown around as simply demanding 5G and utterly impossible without it.

    Then 5G came, and it was welcome but much more mundane. IoT applications are generally so light that even today such devices often only bother to ship with LTE hardware. VR didn’t catch on that hard, and to the extent it has, 5G doesn’t matter: the headsets have no cellular modems, and internet speeds are too slow to stream content directly even with 5G.

    The same thing is happening with AI in pretty much every technology right now: claiming that AI absolutely requires whatever the hell it is they want to push, leaning hard on AI FOMO to sell their tech.


  • The reason for the volatility is that any such concept at scale is subject to the messiest lump of evolving opinions on everything. It will deflate, inflate, and deflate again wildly, because it’s utterly subject to the whims of the people, with no mechanism to counter a lack of mass consensus on what ‘value’ is.

    We noticed as things scaled up that there needed to be some regulatory management to counter the whimsical populace. It’s hard to fight mass inflation or deflation when you can’t do anything to manage the “money supply” to offset a panic.




  • If you had, hypothetically, AR glasses that weighed 25 grams, with a 12-hour battery runtime, transparent (or equivalent) real-world visuals, and perfectly opaque virtual content across the entire field of view, you’d have even broader adoption than earbuds have today.

    Being able to pull up your phone apps without holding your phone, the ability to have real-world subtitles in any language. If they go the camera-and-reproduce route, they could also be a nice solution to presbyopia (having to switch out reading glasses sucks).

    Unfortunately, current headsets weigh as much as twenty pairs of eyeglasses, have much improved but still terrible passthrough, and wouldn’t last more than a couple of hours even if you wanted to try. The Bigscreen Beyond gets down to 100 grams, but it still looks weird and requires an external battery and processor.