  • Problem with the “benchmarks” is Goodhart’s Law: once a measure becomes a target, it ceases to be a good measure.

    The AI companies’ obsession with these tests causes them to maniacally train on them, making them better at those tests, but that doesn’t necessarily map to real-world usefulness. Occasionally you’ll see a guy who interviews well but is pretty useless on the job. LLMs are basically that guy all the time, but at least useful because they are cheap and fast enough to be worth it for the super easy bits.


  • On code completion, I think it can do like 2 or 3 lines in particular scenarios. You have to have an instinct for “are the next three lines so blatantly obvious it is actually worth reading the suggestion, or should I just ignore it because I know it’s going to screw up without even looking”.

    Very, very rarely do I find prompt-driven coding to be useful, and then only for stuff that is pure boilerplate but also very tedious. Like “allow the user to specify these three parameters in this CLI utility”, and poof, you get reasonable argv handling pretty reliably (see the sketch after this comment).

    Rule of thumb: if a viable answer could be expected during an interview from a random junior applicant, it’s worth giving the LLM a shot. If it’s something a junior developer could only get right after learning on the job a bit, then forget it, the LLM will be useless.
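
    For a concrete sense of that boilerplate level, here’s a minimal sketch of the kind of argv handling I mean, using Python’s standard argparse module; the three parameters are made-up stand-ins:

    ```python
    import argparse

    def parse_args():
        # Plain-vanilla argparse boilerplate: tedious, well-trodden code
        # that an LLM suggestion usually gets right on the first try.
        parser = argparse.ArgumentParser(description="Example CLI utility")
        parser.add_argument("--input", required=True, help="path to the input file")
        parser.add_argument("--count", type=int, default=10, help="number of items to process")
        parser.add_argument("--verbose", action="store_true", help="enable verbose output")
        return parser.parse_args()

    if __name__ == "__main__":
        args = parse_args()
        print(args.input, args.count, args.verbose)
    ```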


  • after 5 mins talking to an LLM about said field.

    The insidious thing is that LLMs tend to be pretty good at 5-minute first impressions. I’ve repeatedly seen people set out to evaluate an LLM and fall back to “OK, if this were a human, I’d ask a few job interview questions: well known enough that they have a shot at answering, but tricky enough to show they actually know the field”.

    As an example, a colleague became a true believer after being directed by management to evaluate it. He asked it to “generate a utility to take in a series of numbers from a file and sort them and report the min, max, mean, median, mode, and standard deviation”, and it did so instantly, with “only one mistake” (a minimal version of that utility is sketched below). Then he tried the exact same question later in the day, it happened not to make that mistake, and he concluded it must have ‘learned’ how to do it in the intervening hours. Of course that’s not how it works; generation is probabilistic, and any perturbation of the prompt can produce unexpected variation. But he doesn’t know that…

    Note that management frequently never gets beyond tutorial/interview-question fodder on the technical side of their teams, and you get to see how they might tank their companies because the LLMs “interview well”.
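
    For reference, a minimal sketch of what that interview-style utility looks like, using Python’s standard statistics module; the one-number-per-line file format is my assumption:

    ```python
    import statistics
    import sys

    def report(path):
        # Read one number per line (assumed format) and sort them.
        with open(path) as f:
            numbers = sorted(float(line) for line in f if line.strip())
        print("min:   ", min(numbers))
        print("max:   ", max(numbers))
        print("mean:  ", statistics.mean(numbers))
        print("median:", statistics.median(numbers))
        print("mode:  ", statistics.mode(numbers))
        # Sample standard deviation; needs at least two numbers.
        print("stdev: ", statistics.stdev(numbers))

    if __name__ == "__main__":
        report(sys.argv[1])
    ```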


  • The overall interface can, which leads to fun results.

    Prompt for image generation and you have one model doing the text and a different model doing the image. The text model pretends it is generating the image, but it has no idea what that image looks like, so you can make the text and image interaction make no sense (or it will do that all on its own). Have it generate an image, then lie to it about the image it generated, and watch it show it has no idea what picture was ever produced, all the while pretending it does, without ever admitting that it’s actually delegating the image. It just says “I am correcting that for you”. Basically talking like an executive at a company, which helps explain why so many executives are true believers.

    A common thing is for the ensemble to recognize mathy stuff and feed it to a math engine, perhaps after LLM techniques to normalize the math.
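
    A toy sketch of that delegation pattern (the routing rule and the “math engine” here are invented for illustration, not how any particular product does it):

    ```python
    import ast
    import operator

    # Whitelisted operators for the deterministic "math engine" side.
    OPS = {
        ast.Add: operator.add,
        ast.Sub: operator.sub,
        ast.Mult: operator.mul,
        ast.Div: operator.truediv,
    }

    def safe_eval(expr):
        # Evaluate plain arithmetic without any model involved.
        def walk(node):
            if isinstance(node, ast.Expression):
                return walk(node.body)
            if isinstance(node, ast.BinOp) and type(node.op) in OPS:
                return OPS[type(node.op)](walk(node.left), walk(node.right))
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            raise ValueError("unsupported expression")
        return walk(ast.parse(expr, mode="eval"))

    def call_llm(prompt):
        # Stand-in for the text model; a real system would call an API here.
        return f"(text model would answer: {prompt!r})"

    def route(user_input):
        # Crude router: if the input looks like pure arithmetic,
        # bypass the text model and hand it to the math engine.
        if user_input.strip() and all(c in "0123456789.+-*/() " for c in user_input):
            return safe_eval(user_input)
        return call_llm(user_input)

    print(route("2 * (3 + 4.5)"))   # -> 15.0, no LLM involved
    print(route("why is the sky blue?"))
    ```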



  • Not a single one of the issues I brought up years ago was ever addressed, except one.

    That’s the thing about AI in general: it’s really hard to “fix” issues. You can maybe try to train a problem out and hope for the best, but then you might play whack-a-mole, as the fine-tuning that fixes one issue makes others crop up. So you pretty much have to decide which problems are the most tolerable and largely accept them. You can apply alternative techniques to maybe catch the egregious issues, with strategies like a non-AI check that stuffs the prompt to nudge the model in a certain general direction, as sketched below (if it’s an LLM; other AI technologies don’t have this option, but they aren’t the ones getting crazy money right now anyway).

    A traditional QA approach is frustratingly less applicable, because you more often have to shrug and say “the attempt to fix it would be very expensive, isn’t guaranteed to actually fix the precise issue, and risks creating even worse issues”.
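
    A minimal sketch of that prompt-stuffing workaround (the failure trigger and steering text are invented for illustration):

    ```python
    import re

    # Known-bad pattern spotted by ordinary, deterministic code.
    KNOWN_BAD = re.compile(r"\bdelete\s+all\b", re.IGNORECASE)  # made-up trigger

    STEERING = (
        "Reminder: never emit destructive commands; "
        "ask the user to confirm scope first.\n"
    )

    def build_prompt(user_input):
        # Non-AI pre-filter: cheap and testable compared to fine-tuning,
        # but it only nudges the model; it guarantees nothing.
        prefix = STEERING if KNOWN_BAD.search(user_input) else ""
        return prefix + user_input

    print(build_prompt("please delete all rows older than a week"))
    ```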







  • Same thing happened with 5G: claims that categorically new stuff would be possible with 5G that just couldn’t be done at all with LTE. IoT and VR were buzzwords thrown around as simply demanding 5G and utterly impossible without it.

    Then 5G came, and it was welcome but much more mundane. IoT applications are generally so light that even today such devices often only bother to ship with LTE hardware. VR didn’t catch on that hard, and to the extent it has, 5G doesn’t matter: the headsets have no cellular modems, and internet speeds are too slow to stream content directly even with 5G.

    The same thing is happening with AI in pretty much every technology right now: claiming that AI absolutely requires whatever the hell it is they want to push, leaning hard on AI FOMO to sell their tech.


  • The reason for the volatility is that any such concept at scale is subject to the messiest lump of evolving opinions on everything. It will deflate, inflate, and deflate again wildly, because it’s utterly subject to the whims of the people, with no mechanism to counter a lack of mass consensus on what ‘value’ is.

    We noticed as things scaled up that there needed to be some regulatory management to counter the whimsical populace. It’s hard to fight mass inflation or deflation when you can’t do anything to manage the “money supply” to offset a panic.




  • If you had, hypothetically, AR glasses that weighed 25 grams, with a 12-hour battery runtime, transparent (or equivalent) real-world visuals, and perfectly opaque virtual content across the entire field of view, you’d have even broader adoption than earbuds have today.

    Being able to pull up your phone apps without holding your phone, the ability to have real-world subtitles in any language. If they go the camera-and-reproduce route, they could also be a nice solution to presbyopia (having to switch out reading glasses sucks).

    Unfortunately, current headsets weigh as much as twenty pairs of eyeglasses, have much improved but still terrible passthrough, and wouldn’t last more than a couple of hours even if you wanted to try. The Bigscreen Beyond gets down to 100 grams, but it still looks weird and requires an external battery and processor.