How to recognize AI snake oil [pdf] (princeton.edu)
907 points by longdefeat on Nov 19, 2019 | 344 comments



I don't have time to read the entire paper, but I would like to share an anecdote. I worked at a company with a well-staffed, well-funded machine learning team. They were in charge of recommendation systems - think along the lines of YouTube's "up next" videos. My team wanted better recommendations (really, less editorially intensive ones), so the ML team spent weeks crafting 12 or more variants of their recommendation system for our content. We then ran those variants for months in an A/B testing system that judged user behaviour by several KPIs. The result was that all variants performed equally well, with differences that were statistically insignificant. The best-performing variant happened to be random.

Talking to other groups that had gone through the exact same process, our results were pretty typical. These guys were all very intelligent and the code and systems they had implemented were pretty impressive. I'm guessing the system they built would have cost a few million dollars if built from scratch. We did use this "AI/ML" in our marketing, so maybe it was paid for by increased sales through the use of buzzwords. But my experience was that in most limited use cases the technology was ineffective.


This reminds me of a job interview I was on. I was asked about how I would use AI/machine learning for their problem space. Since they seemed to be smart and level-headed, I answered honestly, "Pick something unimportant, use a machine learning algorithm just to get familiar with the tools, ignore the result unless it happens to work, then put machine learning in your marketing materials. But keep track of it, and if it is useful to you in 5 years, be sure to use it for real."

They said, "That's about what we concluded except that we didn't get around to actually doing that pilot project yet."

I got the job. :-)


Speaking of jobs and interviews: I have yet to find a job board which does not show JavaScript jobs when searching for Java jobs. Some of them claim to use AI. :)


My wife once watched me as I did a job search, carefully setting all the available parameters to match my requirements. She said "don't forget, engineering jobs are posted by admin staff", typed "software engineer location" into Google, and found way more than I ever did.


Let's not forget the job application systems that make you chronologically list every previous employer, their location, your title, and dates of employment, your education history including dates and degrees received, skillsets/technologies you have experience with, etc. All information that is on your CV and/or LinkedIn profile and they want you to manually re-type it into their late 1990's era job application system rather than using some basic NLP to extract it. Personally, the moment I start applying and the system asks me for more than my email/mobile and a CV upload, I bail on the application.


I guess that is intentional, to throttle down the number of people applying. It works on me: except when I desperately need a job, I don't apply. I primarily only know "this" style of applying - it's all big-corp companies, though, I must admit.


In fairness I have this issue with a number of human intelligence based (allegedly) recruiters.


Of course, that's where machines get it from as well. Your training set is usually which candidates were shortlisted by recruiters. At least that is what I have seen from people who are building one of these AI-based recruitment systems. When I asked whether they were worried about such bad-quality data, their answer was: humans know best. Ah well, then why are you building this again?


The exact same applies to blockchains.

Blockchain is mostly a marketing tool, not something you would want to use in production for anything.


I think generally that is how many products and features work.


That sounds odd, like they don't really need machine learning (unless it is to snare investors?).


Snaring customers too. I swear, people are obsessed with "machine learning" even when the domain really isn't suited for it.


I sometimes wonder if management and engineers don’t have more in common than acknowledged. Publications such as Harvard Business Review have huge coverage of things like managing AI, and being able to say you managed an “AI project” might mean something.


I wonder if people are becoming more savvy and anti-ML after all the issues people have with Facebook and Google collecting a lot more data than people are comfortable with.


I suspect what people dislike is big data. Anecdotally, I'd love to hear about a success where some lone genius codes a better tool using publicly available datasets. I don't like the idea of BigCorp using their larger servers to dominate the space.


Since we are sharing anecdotes, I can report it's been 20 years of buying stuff on the internet and the combined billions of ad tracking research dollars spent by Amazon and Google have not yet come up with a better algorithm than to bombard me with ads for the exact same thing I just bought.


I just spent 15m on Amazon trying to prod the recommendation algorithm into finding something I actually wanted to buy so I could get above the "free delivery threshold".

Think about that. I wanted to spend money. I wasn't too fussy what it was. Amazon has a decade of my purchasing and browsing history.

And they still failed.


Amazon infuriates me. I regularly buy gadget-y bits like electronic components and peripherals. Probably at least once a month. Never see adverts for similar.

That ONE TIME I buy a unicorn dress for my 2 year old daughter? That's a lifetime of unicorn related merchandise adverts and recommendations for you!


Actually that's really interesting because it exposes a bias in their recommendation system. They must be heavily biased towards things with mass appeal instead of specifically targeting user preferences, which is funny because it goes against the grain of the whole "targeted advertising" promise of ML. You'd think if anyone could get that right, it would be Amazon, yet..


It's not impossible for that to be a clever decision by Amazon (although I'm not saying it's likely, I have no idea about the numbers).

The ultimate goal of the advertising is return on investment, not making you feel interested in the adverts. If, to exaggerate the possibility, 100% of "people who look at tech" are 0% influenced by adverts, but 10% of "people who bought a unicorn thing" will go on to buy another if they're constantly reminded that whoever they bought it for likes unicorns, all of a sudden it would make sense despite being counterintuitive to viewers.

A more commonly discussed example of a similar thing is that it's easy to think "I just bought a (dishwasher, keyboard, etc), I obviously already have one so why am I seeing adverts for them?" Sure, it might be that the company responsible has an incomplete profile and doesn't know you bought one already. But it's also possible that the % of people who just bought the item and then decide they don't like it, return it and buy a different type is high enough to be worth advertising to them.


This is basically what comes from a mindset increasingly common among ML practitioners of abdicating thinking and assuming "the machine will find what features are important". They throw a junkyard full of features at the algorithm (or, even worse, automated feature generation). These days, at least once a fortnight I get the opportunity to show folks how much better they could have done if only they had thought for 10-15 minutes or simply charted their data in a few different cuts before modeling :-)


I suspect Amazon has learned that some feature labels are easy to recognize and correlate (dress color, size, style, etc) and others are hard and lead to useless results (an electronic device's: computer, format, port type, protocol, etc).

So they gave up trying to match CPU with GPU and went back to connecting beer to diapers.


Ha. Last time I wanted to buy something on Amazon, their search page kept freezing every time I load it: 100% CPU load. Because I really wanted to buy that thing, I spent 30 mins debugging their silly scripts and found one for-loop that tries to find a non-existent element. Unfortunately, I couldn't figure out a way to enable my fix in a minified script, as reloading the page kept loading the original script.


Put a breakpoint on the offending line, manually add the element when it breaks, hit resume and hope for the best?


With uMatrix, you could have created a rule to block that script.


Turn off Javascript - Amazon still works OK.


Amazon still suffers from those "This guy really loves washing machines!" type recommendations


I hope 'm' here is minutes, not millions )


Don't gift cards count anymore?


Right? Recommenders are almost counter productive for me in most cases. I want a recommender to remain broad, not give me increasingly niche recommendations a la youtube. About the only halfway decent recommender system I've interacted with is the various forms of curated lists on Spotify. They seem to actually take a decent stab at it with related but sufficiently different and interesting content.

For all Facebook knows about me they've always been exceptionally bad at advertising to me, which is remarkable considering what they've got. Google is only very marginally better. Actually, now that I think about it, Amazon's 'customer's also bought' is also pretty bad at the recommendation itself since it not uncommonly recommends incompatible things! ...but it does often succeed at getting me to think more about what else I might need and sometimes leads me to buying other things. At least it's not always recommending the same thing, but rather related things, which is probably a much better way to advertise.


What I've always assumed - though now this thread has me doubting myself - is that these systems, even though they appear to suck at specifically targeting me, must somehow be pretty good on average, still netting big profits overall even though they don't seem to live up to the promise of getting me to buy stuff. But if everyone has this impression, maybe it doesn't work? I mean, I assume companies like Amazon and Facebook would be pretty good at optimizing the total rewards from using recommendation systems, but maybe you can't tell from your own anecdotal use. I have no idea. I'd love to see an analysis that includes aggregate numbers.


I think you've hit the nail on the head. It's sort of a tragedy of the commons, which in terms of recommenders is a really easy thing to accidentally optimize for. For example, pop music is popular. In the early days, recommenders would just recommend pop music, because on average that was a decent recommendation. We've come a long way since then. Well, everyone but Facebook has.


> For all Facebook knows about me they've always been exceptionally bad at advertising to me

Because it's not Facebook really, it's the advertisers who choose targeting criteria. You as an advertiser have a myriad of options. For example if you've built a competitor to X, you can target users who've visited X recently, aged N-M, residing in countries A, B and C, and so on. There are options with broader interests too. Poorly targeted ad means poorly selected criteria by the advertiser (or sometimes just advertiser experimenting) and consequently money wasted. Facebook doesn't care though.

Then there is retargeting/remarketing (targeting bounced traffic), already mentioned here, which is probably the stupidest-looking invention that actually works.


But they are also really, really bad at curating content I like to see. Even worse than the advertising, actually. My feed is just garbage and has been ever since they switched over from being chronological. But I sometimes get on a wild hair about some particular person, and I'll look at their page directly and see actually interesting content there that was never shown to me. Facebook's only real guess is then to say "Oh! You must like this person, let me show them to you all the time." The reality is, I'm interested in stuff like when a biased person I don't like shares a more neutral or inclusive opinion, or when someone does something interesting. Facebook is just unusably bad at selecting for that. Their algorithm just pushes really shallow crap and buries anything with substance or depth. I keep it around just to stay in touch with hard-to-contact people.


Likewise Pandora does a decent job of picking songs for me based on what I previously liked. It's not perfect but far better than random.


It's easier to recommend things you probably like than things they want you to like / may like.


Hey, personal experience here, and to be fair to them: Google does occasionally give me ads for a Haas CNC machine. I really want one. But I don't have the disposable $100k for one and I don't have the space... nor do I have 3-phase power. But I do want one and haven't bought one. So, good on them, right?


This can happen when a retargeting campaign doesn't have a 'burn pixel' or conversion event trigger. It's a common oversight, which can cause a re-targeting program to kick-off unnecessarily (or cause ads that have followed you around to become obvious)


You'd think with all the "AI" out there, they could match a sales DB entry with the CRM DB entry, but, in fact, they basically can't.


The AI innovation must be that they can figure out which marketers are likely to forget to have a burn pixel because those marketers drive more revenue.


And yet I haven't seen one ad that has a 'not interested' feature similar to YouTube.

I mean, if I could stop seeing washing machines or whatever, I'd probably click it.


do you clear your cookies and html5 storage? That should wipe any personalization that's happening, but the ads will become very generic then.

You can also block ads (Ublock/Umatrix)


Oh yeah I can do that, I'm just wondering why they're not too interested in direct user feedback from my end. Surely it would be useful to adjust their algos.


That's because:

* They don't have suppression set up

* They're using a conversion tracking platform that is slow

* They're testing the returns conversion hypothesis: you have expressed concrete intent, you have bought the product. If it has 5% return rate, you probably still want it, and there's a 5% chance they need to be in the mix.


Not to mention these systems are also completely useless at recognizing one product someone bought already includes the product they're recommending. When you buy Dark Souls 2 with all DLCs you can be sure Steam will suggest Dark Souls 2 without any DLCs to you for at least a full year.


Sometimes it also recommends other brands of the thing I just bought. Like a cell phone: I didn't buy Samsung, but now I see their ads. I guess that is a very very small improvement- Maybe I'm the type of consumer who gets a new phone every week?


No, but you might be unhappy with your phone, return it and buy a new one.


You are luckier than me: I've spent 10 years bombarded with ads for things I never bought, never wanted to buy and that are directly insulting at this point. (Buying something would maybe give me a two day break though.)


I have been looking for a shelf that's as close to 50" wide and 10" deep as possible. As far as I can tell, no site has a search that lets you filter on dimensions like that.


You can do this at wayfair.com. It's kind of hidden though, when you get to the shelves, click "Sort & Filter" and then scroll down until you see the dimension sliders.


I think amazon works pretty well.

They show me similar things to the thing I just put in my cart. Sometimes better choices than the one I made.

They also show related things that other people have bought, that many times I end up purchasing.


Dirty secret: "The customers who bought also bought" algorithm doesn't require ML/AI.

You can accomplish that with relational algebra on a precomputed data warehouse job, and only for products with strong correlation. The customers' own intelligence is enough to instil a semblance of intelligence in the data.
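
To make it concrete, here's a toy sketch in Python (the orders and product names are made up, and this is nowhere near a real warehouse job):

    from collections import Counter, defaultdict

    # toy orders: (customer_id, product_id) pairs, e.g. dumped from a warehouse job
    orders = [("c1", "dishwasher"), ("c1", "rinse aid"),
              ("c2", "dishwasher"), ("c2", "rinse aid"),
              ("c3", "dishwasher"), ("c3", "unicorn dress")]

    # group products by customer
    baskets = defaultdict(set)
    for customer, product in orders:
        baskets[customer].add(product)

    # count how often each pair of products shows up in the same basket
    co_bought = defaultdict(Counter)
    for products in baskets.values():
        for p in products:
            for q in products:
                if p != q:
                    co_bought[p][q] += 1

    # "customers who bought this also bought": top co-occurring products
    print(co_bought["dishwasher"].most_common(2))
    # -> [('rinse aid', 2), ('unicorn dress', 1)]

In SQL it's basically one self-join on the orders table plus a GROUP BY; no model training anywhere.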


Yes, this is the straightforward ‘collaborative filtering’ algorithm. I suppose the line between ‘algorithm’ and AI/ML is not well defined though. At what point does a technique become ‘AI’? I don’t know a good answer.


As an utterly cynical layperson, algorithm means directly querying data. AI means feeding systems with training data and sprinkling them with magic obfuscation dust.


I wonder if the dirty secret for lots of stuff is: "They aren't using AI anywhere"


Well, for problems that are simply looking for a ranked relationship, if you have human input it could be used to train an ML model that attempts to find similar correlations... or you could just use the human mechanical turks that are already informing you. The good ML problems are the ones not trying to approximate reality, like CWBAB (customers who bought also bought).

I'm doubtful we'll see an AI that makes a serious jump without directly interacting with the world we live in. By that measure, cars might be the closest, since learning to interact with the bounds of where a car can and can't go is similar to a toddler learning to crawl.


The marketing budgets are there to be spent :)


Yes - just in case you want two or stopped the purchase along the way.


I have had a similar experience, but I really do think it points more to the team than it does to the efficacy of machine learning. I was on a team of extremely intelligent people, but they were very academic, with minimal practical coding skills, refusals to use version control etc. By academic, I mean graduates from the number one university in their respective fields, top publications, etc. They produced very little actual value. Great theoretical ideas, in depth understanding of different methods of optimization, etc.

The team I was on before that one was a bunch of scrappy engineers from Poland, India and the USA with no graduate degrees, but 20 years coding and distributed systems experience each. The difference in problem solving ability, the speed at which they moved, broke down problems, tried out different methods, was staggering.

I think ML is suffering from a prestige problem, and many companies are suffering for it. The wrong people are being hired and promoted, with business leaders calling the shots on who runs machine learning projects without fully understanding who can actually deliver.


The San Jose Mercury News had a weather-forecasting contest. It was won one year by a 12-year-old, whose algorithm was "The weather tomorrow will be the same as the weather today". A kind of AI, I guess.


Brilliant. I think YouTube has arrived at the same algorithm - it picks the videos I watched yesterday to recommend today.


Well to be fair that's how all employers also hire.

If you did a good job at the last company you'll probably do a good job here.

If you did a good job yesterday, you'll probably do a good job today.

For the most part they are usually correct.


I hope you are joking since the industry collectively knows how much employers/interviewers value algorithms-based coding interview, which doesn't correlate strongly with performance. Even if you are talking about senior positions where they don't matter, then you should know that people hire someone they know+like who did decently well, rather than the truly best on the market.


That's SF/Big Tech. The rest of the world basically works like the stated algorithm.


The coding interview comes usually after they vet your resume, any profiles, get you talk to an HR, and potentially check your references.

Even the coding interviews are just a signal of overall performance given a short amount of time. It's only a sample of data, but if you did your interview right you should be able to protect somewhat against bad people getting very lucky. Just like driving a car: bad drivers tend to stay pretty bad, and good drivers tend to stay safe. Even though there are a lot of ways to define what a good driver is, there are clearer ways to define what a bad driver is, and if someone was a bad driver yesterday they are still probably a bad driver.


well, that's the thing though - coding interview is pretty much a yes/no thing. What rank you get is typically based on "what rank do you have now?" and "how much do we respect your current employer?"


I remember reading Steve Jobs used a "different" technique to figure out if someone was good.

He would go around to people and say "I heard Joe sucks". If the people strongly defended Joe, he was probably pretty good. If nobody stuck up for him, Joe might indeed suck.


Probably everyone would be silent as well if someone said "Steve Jobs sucks". This anecdote is meaningless TBH.


Completely ignoring the depth of your subscriptions as well. Amazon music was much better for me. Even Google music just repeats.


Seriously. Why would I want to watch a video that I've already watched (unless it's music maybe)?


They have a lot of videos. You're statistically unlikely to care about any randomly selected one. By watching a video, you establish that it's interesting enough for you to watch. It's much more likely that you'll want to rewatch it than that you'll want to view a random video.

I'm only half-joking here. To me, YT algorithm seems to be a mix of "show random videos you've already watched" + "show random videos from channels and users you watched" + "show the most popular videos in last few hours/days/weeks". It's pretty much worthless, but what are we expecting? Like all things ad economy, the primary metric here isn't whether you like the recommended content, or whether that content challenges you to grow - it's maximizing the amount of videos you watch, because videos are the vehicles for delivering ads to you.


I prefer a user-driven random walk - like a multi-armed bandit over a hierarchical graph instead of a stuck-in-a-local-minimum + noise recommender. But no one does it.

Years ago, there was an app called StumbleUpon. I always found it engaging. Hard to get bored.


I'm guessing things like 5-year-olds randomly watching the same video 300 times on their parent's account have irreversibly tainted the Googletron into thinking people like to watch the same video over and over again.

Hundreds of times I have told YouTube I'm not interested in a recommended video that I have already watched; it seems to be completely ignored.


Why wouldn't you? Do you never re-watch a film or re-read a book or re-order the same food at a restaurant, or re-drive the same route to work?

I often watch the same video that I've already watched, many times music, many times comedy, many times something I want to link to another person but end up watching some/all again as I find it, often if I remember it being interesting (e.g. a VSauce video or a Dan Gilbert TED talk), sometimes if it was a guide or howto that I want to follow - e.g. a cooking instruction.


Music is probably the main driver, but I've definitely clicked some recommended videos from a channel I'm subscribed to with infrequent long uploads.


I have a friend who lived in San Jose who just wrote the forecast on his whiteboard and left it there, because it never changed. It was funny, because he came from Minnesota where the weather is never the same two days in a row.


Ha! I came here to say something similar. Here in San Jose, the weather tomorrow will be the same as today, most of the time. My joke when I lived in Minnesota was: "If you don't like the weather, just wait 10 minutes. It will be different."

Gotta say though, I don't miss my snow blower even just a wee little bit.


To be fair San Jose weather is heavily biased towards sameness.


Yes, but why did all the 'smart' people using models and training networks get a worse result? A condemnation of modelling, to be sure.


Using a markov chain with your stochastic matrix set to I...


Using “a I”


AR(1) models are commonly employed to model time series, and the "same-as-yesterday" model is the case where the AR(1) coefficient equals 1. There is some mean reversion in seasonally-adjusted temperature, so an AR(1) coefficient less than 1 should work better.
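
A quick sketch of that comparison on synthetic anomalies (numbers invented; persistence is the phi = 1 case):

    import numpy as np

    rng = np.random.default_rng(0)

    # synthetic seasonally-adjusted temperature anomalies with some mean reversion
    phi_true, n = 0.8, 2000
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi_true * x[t - 1] + rng.normal()

    # least-squares estimate of the AR(1) coefficient: x[t] ~ phi * x[t-1]
    phi_hat = x[1:] @ x[:-1] / (x[:-1] @ x[:-1])

    # one-step forecasts: persistence (phi = 1) vs the fitted AR(1)
    persistence_mse = np.mean((x[1:] - x[:-1]) ** 2)
    ar1_mse = np.mean((x[1:] - phi_hat * x[:-1]) ** 2)
    print(round(phi_hat, 2), round(persistence_mse, 2), round(ar1_mse, 2))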


Just showing a customer's previously viewed items accounted for 85% of the "lift" credited to the recommendations team I worked on.


What about Benjamin Franklin's moving weather?


>>The best performing variant happened to be random.

Some years ago I heard an anecdote from a developer who had worked on a video game about American football. The gist of it was that they had tested various sophisticated systems for an AI opponent to choose a possible offensive/defensive play, but the one that the players often considered the most "intelligent" was the one that simply made random decisions.

In certain domains, I think, it's quite difficult to beat the perceived performance of an AI system that merely makes committed random decisions (i.e. carried out over time) within a set of reasonable choices. If we don’t understand what an agent is doing, we often assume that there is some devious and subtle purpose behind its actions.


It's quite common in games for AI to pick a random decision. Simply put, a good AI is a character/NPC that appears to have a mind of its own, and its own life. Nothing beats random at explaining someone's behaviour based on a personal history you don't know.

If the AI responded/acted based on a predefined set of patterns that could be recognized, the player would automatically feel it (pattern matching), and that makes the NPC far less interesting.


Right - but often an NPC is merely scenery or traffic, in the sense that its behavior does not compete with your interests. What I found interesting about the football example is that the random strategy of the NPC coach both suggested a deeper intelligence AND proved to be an apparently effective opponent.

Beyond reinforcing our tendency to project, as you say, a personal history on random behavior, it also highlights what a few other people have commented: that in many non-cooperative situations a committed random strategy is extremely effective, and perhaps more effective than a biased, seemingly "rational" strategy. (For another example, I believe Henrich's "The Secret Of Our Success" discusses the possible adaptive benefits of divination as a generator for random strategies among early societies.)


It's how in Rock Paper Scissors you can probably do better against a smart opponent by playing randomly than by trying to trick them. At least they can't get in your head, because there's nothing there. You won't do better than chance, but at least you can't do much worse.


So, use a random strategy and measure the entropy of your opponent's behavior. And modify accordingly, as needed.

A lot of the best ml right now is effectively about making better conditional probability distributions. You always get random output, but skewed according to the circumstances, and sharp according to confidence in the result.


I'm not sure what the term for it is, but humans have an uncanny ability to ascribe meaning to pure randomness. I'm not surprised a random AI can appear smart.


When your audio player picks these random tracks https://i.imgur.com/QRoGQRy.png (example from earlier in the day) you start believing in a Higher Power..


Unless you consider that randomness, whatever source, is some form of intelligence. (I am only half-joking here).


Humans are great at finding "patterns" in random noise.


We are very good at finding patterns where there aren't any, but it's important to remember that random is actually the best answer a lot of the time. Maybe it's as simple as random plays being the hardest to predict, and that's the best you can do until you can get an AI trained on the meta of how to plan for what the current player is thinking you'll do.


Sometimes being unpredictable makes for a good strategy


It is well known that tit-for-tat is the best strategy for iterated prisoners' dilemma. See https://en.wikipedia.org/wiki/Tit_for_tat


It may sound similar but offensive vs. defensive behavior in (video) game strategy is a much different concept than cooperation in game theory.


I had a similar experience at one point. A team put energy into building a recommendation system, and they were able to demonstrate that the "Recommended for you" content performed better than all other editorial content. After getting challenged a bit, though, it turns out anything performs better when put under the header "Recommended for you."


Well, that’s a good lesson to learn, just put everything under “recommended for you” header! :)


You aren’t talking about dcom are you?


> put energy put energy

And they say English doesn't have reduplication!


I worked at a larger services marketplace, helping data scientists get their models into production as A/B experiments. We had an interesting and related challenge in our search ranking algorithms: we wanted to rank order results by the predicted lifetime value of establishing a relationship between searcher and each potential service provider. In our case, a 1% increase in LTV from one of these experiments would be...big. Really big.

Improving performance of these ranking models was notoriously difficult. 50% of the experiments we'd run would show no statistically significant change, or would even decrease performance. Another 40% or so would improve one funnel KPI, but decrease another, leading to no net improvement in $$. Only 10% or so of experiments would actually show a marginal improvement to cohort LTV.

I'm not sure how much of this is actually "there's very little marginal value to be gained here" versus lack of rigor and a cohesive approach to modeling. The data scientists were very good at what they do, but ownership of models frequently changed hands, and documentation and reporting about what experiments had previously been tried was almost non-existent.

All that to say, productizing ML/AI is very time- and resource-intensive, and it's not always clear why something did/didn't work. It also requires a lot of supporting infrastructure and a data platform that most startups would balk at the cost of.


If you have historical data to validate against, you can set up a leaderboard of models run against older data, and always leave part of the data held out and unavailable for testing.

https://gluebenchmark.com/leaderboard/

This encourages a simple first version and incremental complexity, rather than starting very complex 6 months in, and never having an easy baseline to compare to. A simple baseline can spawn off several creative methods of improvement to research.

The other case is that the models should be run against simple cases that are easy to understand and easy to confirm. This way there's always a human QA component available to make sure results are sensible.
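
A minimal sketch of that setup with scikit-learn and synthetic data (nothing here is from a real pipeline): models compete on the validation set, a dummy baseline sets the bar, and the holdout stays locked away.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.dummy import DummyClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 10))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=5000) > 0).astype(int)

    # train / validation (the "leaderboard") / holdout that no one gets to tune against
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_val, X_hold, y_val, y_hold = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    candidate = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    print("baseline  val:", accuracy_score(y_val, baseline.predict(X_val)))
    print("candidate val:", accuracy_score(y_val, candidate.predict(X_val)))
    # the holdout score gets reported once, for whatever model you actually ship
    print("candidate holdout:", accuracy_score(y_hold, candidate.predict(X_hold)))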


IMH (and biased) O, a lot of great coders are implementors, or let's say applied computer scientists.

That is great for building incredible open source software and a lot of other things that I would not be able to do given a thousand years. However (again IMHBO), a specific application of ML, or of any other branch of statistics or mathematics, becomes really tricky once your use case is explicitly defined.

You then need intimate and deep knowledge of the tools that you are using (e.g.: Should I even use a NN? Should I even use genetic algorithms? Should I even use x?), but for most people ML is shorthand for NNs and their variants, or maybe shorthand for something else specific, rather than ML in principle.

A well aimed shot at PCA [1] can often solve your problem. Or at least, tell you what the problem looks like. This is just an example, but IMHBO people waste their time learning ML and not learning mathematics and statistics.

IMHBO I still think that self-driving cars can be solved by defining a list of 1000 or so rules, by hand, by humans, and by consensus. The computer vision part is the ML part.

[1] https://en.wikipedia.org/wiki/Principal_component_analysis
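
To make the PCA point above concrete, a toy example in plain numpy (fake data; a well-aimed SVD tells you what the problem looks like):

    import numpy as np

    rng = np.random.default_rng(0)

    # fake data: 500 samples of 20 noisy features that really live on 2 latent factors
    latent = rng.normal(size=(500, 2))
    X = latent @ rng.normal(size=(2, 20)) + 0.1 * rng.normal(size=(500, 20))

    # PCA via SVD of the centered data
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)

    print(np.round(explained[:5], 3))
    # the first two components carry nearly all the variance: the problem is ~2-dimensional
    scores = Xc @ Vt[:2].T   # project onto the first two principal components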


I could not agree more about self-driving cars. They will disrupt and cause us to actually look at our terrible transportation infrastructure, not learn to survive in it.


I wonder if this could be a case of mismatch between what the recommendations system was designed to do and what the business actually needed it to do. Your team evaluated the models based on live KPIs in an A/B testing environment, but did the recommendations team develop the system specifically with those KPIs in mind? Did they ever have access to adequate information to truly solve the problem your team needed solved? And was the same result observed for other uses of their recommendation systems?


> did the recommendations team develop the system specifically with those KPIs in mind?

Yes they did - in fact they had input on defining them and helped in tracking them.

> Did they ever have access to adequate information to truly solve the problem your team needed solved?

They believed so. Their team was also responsible for our company data warehousing so they knew even better than me what data was available. Basically any piece of data that could be available they had access to.

> And was the same result observed for other uses of their recommendation systems?

I did not have first-hand access to the results of their use in other recommendation contexts. As I mentioned in my original post I only had second-hand accounts from other teams that went the same route. They reported similar results to me.


Some ideas seem to attract smart people like moths to a flame.

It seems like everyone who joins my company to shake things up follows the same path of wanting personalized content to acquire new customers.

But in reality we just don't have enough data points on people before they become customers to segment people that way. Even if we could, being able to accurately

Every time, I see people go through the motions of attempting to implement this until they eventually give up.

This idea looks like an obvious win and big companies have done them before with success, but is extremely hard to impossible to pull off for our small company.


That's surprising to hear. Comparing model performance to a randomized baseline model is a "must-have" on my team before we feel comfortable presenting to management.


An old team I advised for a while also compared model performance to a randomised baseline model.

What they didn't seem to get, however, was that a randomised baseline model will beat another randomised baseline model in a naive comparison 50% of the time, so their understanding of randomness/statistical significance/performance metrics was way off. So while they believed they were also testing their models before presenting to management, none of them were implementing their comparisons/measurements properly, and huge parts of their work were just p-hacking and pulling random high-performing results out of the tails of the many models they built and compared.

So while it's good your team makes comparisons to baselines (it's alarming how many don't even do that), my experience also suggests a huge number who think they're comparing to reasonable baselines and using metrics to measure their performance aren't actually doing so properly.
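
To make that concrete, a small simulation (synthetic labels; the "models" are pure guesses):

    import numpy as np

    rng = np.random.default_rng(0)
    n_users, n_trials = 1000, 1000
    y = rng.integers(0, 2, size=n_users)   # the outcomes we pretend to predict

    def random_model_accuracy():
        return np.mean(rng.integers(0, 2, size=n_users) == y)

    # 1) one random model vs another: it "wins" about half the time
    wins = sum(random_model_accuracy() > random_model_accuracy() for _ in range(n_trials))
    print(wins / n_trials)                                  # ~0.5

    # 2) keep the best of 50 random models: it looks reliably "better than random"
    best = [max(random_model_accuracy() for _ in range(50)) for _ in range(n_trials)]
    print(np.mean(best))                                    # ~0.53-0.54

That second number is the trap: nothing real was built, yet the leaderboard shows a consistent "winner".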


I am confused: if a new model beats a randomly selected randomised model 100% of the time in each experiment, why does it matter that a randomised model beats other randomised models? Are they only comparing against the subset of worst randomised models?


I think he's saying something like the following:

1/ the team implemented a naive baseline

2/ they implemented a more sophisticated model that depended on some parameter p

3/ for 100 different values of p, they examined its performance, and picked the model with the best performance

Now they're not quite subject to the multiple comparisons problem there, since the models with different values of p aren't independent from one another. But they're not not suffering from it either. It mostly depends on the model. But it's a very easy mistake to make. I'd say many many academic papers make the same mistake.


Short answer: if you do it right, it doesn't matter.

Long answer: I have a saying in statistics: "nature abhors two numbers: 0 and 100". In the real world, there is no 100%; you have a number of models and a (finite) number of trials/comparisons against whatever metric, and then you have to make a decision.

My point was that their "non-randomised" models may in fact have had performance equivalent to a random model, and if that was the case, you would expect them to beat a randomised comparison roughly half the time. If you have repeated trials of multiple models, the odds of one consistently beating the others (even if its properties are essentially equivalent to a random model) over a small finite number of trials are much higher than most people realise. Essentially, they're flipping a large number of coins to determine their performance, and choosing the coins that consistently come up heads.

Another observation I'd make is that in the real world, random or average is almost the most facetious thing to be comparing performance against. We aren't generally in a state of ignorance or randomness, but you see this kind of metric all the time, even from "respected" sources. Two if/then/else statements will generally outperform randomness in a huge number of fields/subject-matter areas.

What's interesting is not that one can build a robot that beats/meets the average human at tennis (the average human is probably incapable of serving out a single game), but that one can build one that performs better than a relatively cheap implementation of our current state of knowledge of the game.

Moving from 2 if/then/else statements to an n-parameter complicated model that requires training data, that no one understands, and that requires huge amounts of power and time to train is not only not progress, it's actually a regression on the current state of affairs. In almost all fields, random or average is the last thing you want to compare against.


Getting a team to publish their results (after patenting) is also a good way to get them to do these sorts of things. Significance, baselines, and other things are asked for by reviewers for the better journals and conferences.


Recommender systems are notoriously hard because it's difficult to do better than just recommending the most popular content. You can recommend more personalized content at the expense of KPIs like click-through rate.


If the team can’t even beat random, then I think that says more about your team (or perhaps your features) than about ML as a whole.


Today ML can solve some problems. In the future it might solve some problems with advances in the field. Yet other problems will likely remain unsolved, such as the stock market, or the weather, or predicting /dev/rand

"Up Next" problem can easily fall into any of the three buckets.


YouTube's "Up Next" recommendations do (significantly) better than random, therefore "Today ML can solve some problems".


>YouTube's "Up Next" recommendations do (significantly) better than random, therefore "Today ML can solve some problems".

IMO the YT AI is the opposite of intelligent - it still recommends things I disliked. For some reason this basic rule of not showing something that I explicitly disliked was too hard for it to learn. I am wondering if it is truly an AI behind it or just statistics.


If the goal is simply to maximize engagement, there is no hard requirement that the algorithm should never show you things that you dislike. Essentially, what I am saying is that your belief of what their objective function is may be different from their actual objective function, and that is in no way an indication that their model is a failure.


Isn't AI statistics?


Modern AI/ML is more like: we throw a lot of data at it and we generate a model; we know that it works, but we have no idea how or why.


It's intentional. Controversy is a strong signal for the youtube algorithm.


I don't think so; the videos I was referring to were music videos. I engaged with it, in a way, by hitting dislike on a music genre I don't want to listen to (I normally don't dislike things just because they're not my genre) in the hope the algorithm would learn, but it got even worse: it did not learn that I disliked artist X and genre Y, and it continued to play the exact video that I disliked.

A bad algorithm will force the unhappy user to fall back on manually created playlists, leaving fewer people to engage with the algorithm, and probably making the algorithm worse over time as more users avoid it.

And even more interesting: trying to google "how to make youtube not show X" is a complete fail; it will just show you YouTube video results.


I'm pretty sure it does better than 'next video = random(from all of YouTube)' would, but would it be much better than 'next video = random(videos with the same subject or tags as the one playing now)'?


Yes. They've published quite a number of papers on their recommendation algorithms. You can take a look and decide for yourself whether it's snake oil or not.


Or not enough communication with the team, discussing what the objectives are, providing them with good, enough, relevant data to work with etc. I guess in many cases a data science team is expected to just "do their magic", build the AI and then come back and meanwhile not bother anyone else. In other cases, nobody really cares anyway, they just want the buzzword label to be includable in the brochures.


A relevant Twitter thread: https://twitter.com/NeuroStats/status/1192679554306887681

At the risk of projecting, this has the hallmark of bad experimental design. The best experiments are designed to determine which of many theories better account for what we observe.

(When I write "you" or "your" below, I don't mean YOU specifically, but anyone designing the kind of experiment you describe.)

One model of gravity says the position/time curve of a ball dropped from a height should look like X. Another model of gravity says it should look like Y.

You drop many balls, plot their position/time, and see which of the two models' curves match what you observe. The goal isn't to get the curve; the goal is to decide which model is a better picture of our universe. If the plotted curve looks kinda-sorta like X but NOTHING like Y, you've at least learned that Y is not a good model.

What models/theories of customer behavior were your experiments designed to distinguish between? My guess is "none" because someone thinking about the problem scientifically would start with a single experiment whose results are maximally dispositive and go from there. They wouldn't spend a bunch of time up-front designing 12 distinct experiments.

So it wasn't really an experiment in the scientific sense, but rather a kind of random optimization exercise: do 12 somewhat-less-than-random things and see which, if any, improve the metrics we care about.

Random observations aren't bad, but you'd do them when you're trying to build a model, not when you're trying to determine to what extent a model corresponds with reality.

For example, are there any dimensions along which the 12 variants ARE distinguishable from one another? That might point the way to learning something interesting and actionable about your customers.

Did the team treat the random algorithm as the control? Well, if you believe some of your customers are engaged by novelty then maybe random is maximally novel (or at least equivalently novel), and so it's not really a control.

What about negative experiments, i.e., recommendations your current model would predict have a NEGATIVE impact on your KPIs? If those experiments DON'T produce a negative impact then you've learned that some combination of the following is the case:

   1. The current customer model is inaccurate
   2. The model is accurate but the KPIs don't measure what you believe they do (test validity)
   3. The KPIs measure what you believe they do but the instrumentation is broken
Some examples of NEGATIVE experiments:

What if you always recommend a video that consists of nothing but 90 minutes of static?

What if you always recommend the video a user just watched?

What if you recommend the Nth prior video a user watched, creating a recommendation cycle?

Imagine if THOSE experiments didn't impact the KPIs, either. In that universe, you'd expect the outcome you observed with your 12 ML experiments.

In fact, after observing 12 distinct ML models give indistinguishable results, I'd be seriously wondering if my analytics infrastructure was broken and/or whether the KPIs measured what we thought they did.


This is a very good comment. Is this line of reasoning fleshed out and written up somewhere so I could point people to it? (Also, I would like to think more deeply about its implications)

> What models/theories of customer behavior were your experiments designed to distinguish between? My guess is "none" because someone thinking about the problem scientifically would start with a single experiment whose results are maximally dispositive and go from there.

This is how science is (at least, ought to be) done. This way, the goal is to always be improving your understanding of objective reality.

> They wouldn't spend a bunch of time up-front designing 12 distinct experiments. [...] So it wasn't really an experiment in the scientific sense, but rather a kind of random optimization exercise: do 12 somewhat-less-than-random things and see which, if any, improve the metrics we care about.

The problem is that a lot of AI salesmen tend to hype the "model-free" nature of "predictive" AI towards optimizing outcomes/goals, and people who don't know better get carried away with the bandwagon. Overly business-oriented people are susceptible to the ostrich mentality of not wanting to understand problems with bad tools -- they are too focused on the possibility of optimizing money-making. I find the movie "The big short" to be a fantastic illustration of this psychology.

It's probably going to lead to a very bad hangover, but for the moment the party's still going on and nobody likes the punch bowl being yanked away.


Nope, I just typed the above off-the-cuff. I could tweet storm it. Would that be useful?


Personal recommendation systems all have tradeoffs. It's just the nature of curation as an intangible endeavor. You can love "Scarface", "Heat" and "LA Confidential" but still find "Casino" boring ;)

More on such tradeoffs in a recent case study from DeepMind on Google Play Store app recommendations. Even they acknowledge the same techniques that surface 30% cost efficiencies in data center cooling, may not be completely applicable to "taste"

https://deepmind.com/blog/article/Advanced-machine-learning-...


Isn't sparse recommendation for videos kind of solved by the Netflix Prize, where the winner used SVD to extract characteristic signatures and recommend videos based on them?


There are a lot of ways of formalizing the problem of recommendation. Perhaps the variant of the problem used by Netflix is "solved", but it's kind of an odd one. Basically, they built a system to answer questions of the following form: "Given that user X watched media Y, what rating would they give it?" They trained and tested on media that users have already rated. Some of the ratings are masked and thus need to be "predicted" for the test.

The issue is that the Netflix dataset has a baked-in assumption that a recommender system should show media that a user is likely to have ranked highly. It may be more important to show the user media they wouldn't have found (and thus ranked) at all. Or perhaps a user will be more engaged with something controversial rather than generically acceptable. Who knows?
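
For what it's worth, matrix factorization was the workhorse behind the strong Netflix Prize entries (the actual winner was a large ensemble). A toy sketch of the idea, with made-up ratings and plain SGD:

    import numpy as np

    rng = np.random.default_rng(0)

    # (user, item, rating) triples; a couple of ratings are masked off as the "test" set
    train = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 2), (2, 1, 5), (2, 2, 1)]
    test = [(1, 1, 5), (0, 2, 1)]

    n_users, n_items, k = 3, 3, 2
    P = 0.1 * rng.normal(size=(n_users, k))   # user factors
    Q = 0.1 * rng.normal(size=(n_items, k))   # item factors

    lr, reg = 0.05, 0.02
    for _ in range(200):
        for u, i, r in train:
            err = r - P[u] @ Q[i]
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])

    for u, i, r in test:
        print(f"user {u}, item {i}: predicted {P[u] @ Q[i]:.1f}, actual {r}")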


Probably not. I say this because for many months, i would visit netflix and not want to watch anything. Eventually I cancelled my subscription after many years.

I think I'd rather have a random collection of titles than a recommended list for me.


Just because the medium is the same doesn't mean the customer wants the same types of recommendations in two different contexts.


I feel like if the Netflix Prize results had truly solved the problem, then they’d still be using them. It seems like the video recommendations aren’t as good as they were ten years ago during the prize competition, and they’re no longer based on what I might like but rather what Netflix wants me to watch.


Interestingly, Steam's recommendation algorithm to show "similar to a game" by their 'learning machines' works very well. I have found really good games via that, and none of them did show up in the regular recommendations/carousel on top.


Pretty sure it just uses the tags players give it: strategy, sci-fi, open world, etc.


Excuse me for butting into a dead conversation, but your team applied ML to an inapplicable case (at least they acquired experience).

Up-next recommendations won't work without advanced image recognition and topic gathering - basically, titles/tags for most videos are garbage and clickbait, and most of YouTube works by watching buzzed videos: some well-known (to a big number of watchers) "influencers" push a video on some topic (thing/brand), then it gets traction from other content creators - they produce videos about it, and watchers tend to stick to buzzed topics. It's like news about news.

If your team used ML to recommend up-next videos on your own video hosting, your result simply means your videos are all equally off-topic or uninteresting to your service's audience; or they are garbage.


Doesn’t it depend on the size and variety of your dataset?

I can see “random” performing well in a set of <1000 videos, all on similar subject spaces (eg “memes”, or “python”), but recommending relevant stuff gets much harder as the amount of content grows...


I feel like A/B testing isn’t a great way to determine correctness either.

IMO even in interface designing you should be arguing from first principles rather than relying on telemetry and other empirical data.


Over the years my heuristic has turned into: "Did the team formulate their problem as a supervised learning problem?" - If not it's probably BS.

In longform if anyone is interested https://medium.com/@marksaroufim/can-deep-learning-solve-my-...

EDIT: I would consider autoencoders, word2vec, Reinforcement Learning examples of turning a different problem into a supervised learning problem

EDIT 2: Social functions like happiness, emotion and fairness are difficult to state - you can't have a supervised learning problem without a loss function


You miss the point of the slides. His point isn't about supervised vs unsupervised, it's about the general areas where AI seems to excel and fail. It's being used to predict social outcomes where it does very poorly, may be inscrutable, and is unaccountable to the public.

Your examples (deep learning applied to perception) are what he argues AI is generally good for.


Auto-encoders have been more successful in fraud and anomaly detection than supervised methods. For the uninitiated: the basic concept is to reduce the feature space (i.e. the things you know) to a lower-dimensional space, then decode back into the original space. When enough differences arise between the original and reconstructed variables, the event may be flagged for a human to review (or some triage process).
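
A minimal sketch of that (PyTorch, made-up data; a linear bottleneck stands in for a real autoencoder):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # "normal" events secretly live on a 2-dimensional structure inside 10 features
    normal = torch.randn(2000, 2) @ torch.randn(2, 10)
    anomalies = 3 * torch.randn(20, 10)          # points that don't fit that structure

    model = nn.Sequential(
        nn.Linear(10, 2),    # encode: compress the feature space
        nn.Linear(2, 10),    # decode: reconstruct the original variables
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)

    # train on normal traffic only
    for _ in range(500):
        opt.zero_grad()
        loss = ((model(normal) - normal) ** 2).mean()
        loss.backward()
        opt.step()

    # flag events whose reconstruction error is far above what's typical for normal data
    with torch.no_grad():
        err_normal = ((model(normal) - normal) ** 2).mean(dim=1)
        err_anom = ((model(anomalies) - anomalies) ** 2).mean(dim=1)
    threshold = err_normal.mean() + 3 * err_normal.std()
    print("fraction of anomalies flagged:", (err_anom > threshold).float().mean().item())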


I wonder if a similar approach can be used for a classification task where one or more classes have only few training examples (those would be similar to "anomalies", I suppose).


It's hard to verbalize this, most of it is "intuition" but I think it boils down to "supervised learning is BS."

Humans are smarter than computers. How can a human teach a computer how to do something when the human itself can't teach another human that something?

We haven't solved that problem. The snake is eating its tail.

You can't teach a human how to do something when the methodology to do that is the student trying something and the teacher saying "Yes" or "No".

Well.... why? Why is it yes or why is it no? What is the difference between what the human or the computer, or in general, the student, did and what is good or correct? And then you still have to define "good" and many times that means waiting, in the case of the PDF linked to above, perhaps many years to determine if the employee the AI picked, turned out to be a good employee or not.

And how do you determine that? How do you know if an employee is good or not? We haven't even figured that out yet.

How can we create an AI to pick good employees if human beings don't know how to do that?

Supervised learning isn't going to solve any problem, if that problem isn't solved or perhaps even solvable at all.

In other words, over the years, my heuristic has turned into, "Has a human being solved this problem?" If not, then AI software that claims to is BS.


Supervised learning in machine learning is nothing remotely like a human teaching anyone anything. It's a very clear mathematical formulation of what the objective is and how the algorithm can improve itself against that objective.

The closest analogy for humans would be to define a metric and ask a human to figure out how to maximize that metric. That's something we're often pretty good at doing, often in ways that the person defining the metric didn't actually want us to use.


> Supervised learning in machine learning is nothing remotely like a human teaching anyone anything.

I disagree, I think it's exactly the same. As an example, a human teaching a human how to use an orbital sander to smooth out the rough grain of a piece of wood.

The teacher sees the student bearing down really hard with the sander and hears the RPM's of the sander declining as measured by the frequency of the sound.

The teacher would help the student improve by saying, "Decrease pressure such that you maximize the RPM's of the sander. Let the velocity of the sander do the work, not the pressure from your hand."

That's a good application of supervised learning. Hiring the right candidate for your company is not.


But that's not at all how "supervised learning" works. You would do something like have a thousand sanded pieces of wood and columns of attributes of the sanding parameters that were used, and have a human label the wood pieces that meet the spec. Then you solve for the parameters that were likely to generate those acceptable results. ML is brute force compared with the heuristics that human learning can apply. And ML never* gives you results that can be generalized with simple rules.

* excepting some classes of expert systems
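
A toy version of exactly that setup (synthetic sanding data, invented numbers, just to show the shape of it):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # a thousand sanded pieces: the columns are sanding parameters,
    # the label is a human inspector's "meets spec" judgement
    pressure = rng.uniform(0, 10, 1000)                        # hand pressure
    rpm = 12000 - 800 * pressure + rng.normal(0, 500, 1000)    # RPM drops under pressure
    meets_spec = (rpm > 8000).astype(int)

    X = np.column_stack([pressure, rpm / 1000])    # scale RPM to keep the solver happy
    clf = LogisticRegression().fit(X, meets_spec)

    # the fitted model recovers the rule the teacher stated in words:
    # keep the RPM up, keep the pressure down
    print(clf.predict([[2.0, 10.5], [9.0, 4.5]]))  # expect [1 0]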


One of the columns of sanding parameters is the sound of the sander.


Machine learning really has almost nothing in common with most types of human learning. The only type of learning that has similarities is associative learning (think Pavlov's dog studies).

The human learning situation you describe works quite differently, though: the student sees either the device alone or the teacher using the device to demonstrate its functionality. This is the moment most of the actual learning happens: the student creates internal concepts of the device and its interactions with the surroundings. As a result the student can immediately use the device more or less correctly. What's left is just some fine-tuning of parameters like movement vectors, movement speed, applied pressure, etc.

If the student worked like ML, she would hold the device in random ways - by the cord, the disc, the actual grip. After a bunch of right/wrong responses she would settle on mostly using the grip. Then (or in parallel) the student would try out random surfaces to use the device on: her own hand (wrong), the face of the teacher (wrong), the wall (wrong), the wood (right), the table (wrong), etc. After a bunch of retries she would settle on mostly using the device on the wood.

It's easy to overlook the actual cognitive accomplishments of us humans in menial tasks like this one because most of it happens unconsciously. It's not the "I" that is creating the cognitive concepts.


That is such a horrible metaphor


> You can't teach a human how to do something when the methodology to do that is the student trying something and the teacher saying "Yes" or "No".

Strangely, I recently had to complete a cognitive test that was essentially that process. I was given a series of pages, each of which had a number of shapes and a multiple choice answer. I was told whether I chose the correct answer, then the page was flipped to the next problem. The heuristic for the correct answer was changed at intervals during the test, without any warning from the tester. I'm told I did OK.


You're touching on the "difficulty" in verbalizing it. I see what you mean, because you did learn that the heuristic was changing with just a yes or no. I said you can't teach that way, but you clearly learned that way, so I wasn't exactly correct, though I still don't think I'm practically wrong.

I wonder how an AI would perform on the same test.

What is the mathematical minimum number of questions on such a test, subsequent to the heuristic change, that could guarantee that the new heuristic has been learned?

I'm curious about the test. Did it have a name? What were they testing you for?


> I wonder how an AI would perform on the same test.

This situation is called Multi-armed Bandit. In this setup you have a number of actions at your disposal and need to maximise rewards by selecting the most efficient actions. But the results are stochastic and the player doesn't know which action is best. They need to 'spend' some time trying out various actions but then focus on those that work better. In a variant of this problem, the rewards associated to actions are also changing in time. It's a very well studied problem, a form of simple reinforcement learning.
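A minimal epsilon-greedy sketch of the stationary version (the arm probabilities below are made up; real solvers like UCB or Thompson sampling are more refined, and the changing-rewards variant would use a constant step size instead of the running mean):

    import random

    true_p = [0.2, 0.5, 0.7]          # hidden reward probability of each arm

    def pull(arm):
        # hypothetical environment: 1 with probability true_p[arm], else 0
        return 1 if random.random() < true_p[arm] else 0

    counts = [0, 0, 0]
    values = [0.0, 0.0, 0.0]          # running estimate of each arm's reward
    epsilon = 0.1                     # fraction of pulls spent exploring

    for t in range(10_000):
        if random.random() < epsilon:
            arm = random.randrange(len(true_p))        # explore
        else:
            arm = values.index(max(values))            # exploit best estimate
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm] # incremental mean

    print(counts)   # most pulls should end up on the best arm (index 2)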


If the rewards are changing, then isn't it a moving target problem?


Doesn’t it depend on what you mean by guarantee? The test can’t get 100% certainty, since theoretically you could be flipping a coin each time and miraculously getting it right, for 1000 times in a row. The chance of that is minuscule (1/2^1000), but it’s nonzero. So we’d have to define a cutoff point for guaranteed. The one used generally in many sciences is 1/20 chance (p = 0.05), so that seems like a plausible one, and with that cutoff, I think you’d need five questions passed in a row (1/2^5 = 1/32). Generally, if you want a chance of p, you need log2(1/p) questions in a row passed correctly.

However, that only works if your only options are random guessing and having learned the heuristic. If you sorta know the heuristic (eg. right 2/3 of the time), then you’d get the 5 questions right ~13% ((2/3)^5) of the time, which isn’t inside the p = 0.05 range. So you also need to define a range around your heuristic, like knowing it X of the time. Then you’d need log(1/p)/log(1/X) questions. For example, if you wanted to be the same as the heuristic 19/20 times and you wanted to pass the p = 0.05 threshold, you’d need log(1/0.05)/log(1/(19/20)) ~= 59 questions.
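A quick sketch of that arithmetic for the pure-guessing case (the cutoff and choice counts are just the ones discussed above):

    import math

    def questions_needed(p_cutoff=0.05, n_choices=2):
        # Need (1/n_choices)**k <= p_cutoff, i.e. k >= log(1/p) / log(n_choices)
        return math.ceil(math.log(1 / p_cutoff) / math.log(n_choices))

    print(questions_needed())              # 5: binary choice, p = 0.05
    print(questions_needed(n_choices=4))   # 3: more options per page, fewer needed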


There were more than two possible answers to choose from on each page, so the odds of being right were considerably lower.


I'm sure the test was a standard with a name, but I was never told. It was a small part of a 3 hour ordeal, evaluating my healing progress since suffering a brain injury in March.

I would agree that it's a very inefficient way of teaching something. It gave me an unexpected insight into machine learning though.

I'm sure the test was designed so that picking the same answer each time or picking one at random would result in a fail.


Sounds like Raven's progressive matrices.


Similar but not the same.


Well... why is it necessary that we can teach a human to do something in order to teach a machine to do it?


Teaching a human is a heuristic for understanding the problem well enough to teach a machine.


I agree, and rather than post a sibling response, I'll add that I think it's necessary today, simply because we don't have AGI yet. I'll also point out that we are talking about determining whether AI is snake oil or not. There may be some scenarios where we can teach a computer to do something we can't teach a human to do (I can't think of any off the top of my head), but if we can't teach the human, then I'm going to be super doubtful that AI software can do it better than a human, if at all.

AGI, in the singularity sense, will be solving problems before we even identify them as problems. Experts in a field can do this for the layman already and I think it's possible. Some don't. I do.

It'll be super interesting when it flips! When the student becomes the master and we, as a species, start learning from the computer. You can kind of get a sense of this from the Deep Mind founder's presentation on their AI learning how to play the old Atari game Breakout. He says when their engineers watched the computer play the game, it had developed techniques the engineers who wrote the program hadn't even thought of.

Even still, the engineers could teach another human how to play Breakout, so yes, I do believe they did in fact create a software to play Breakout better than they could.


Same for AlphaGo, but it only works when you have access to cheap simulation (Breakout being a game that's easy to run, Go being just a board). It doesn't work as well in situations where you don't see the full state of the system, or where there is randomness.


AlphaStar did pretty well in StarCraft 2, even though it is still pretty far from the best players in the world.


This simply isn't true. We know your intuition here is mistaken, as we have plenty of counter-examples.

The best chess AIs can beat any human chess player. They use techniques that were never taught to them by a human.

Another example: a machine-learning-driven computer-vision system predicting the sex of a person based on an image of their iris. No human can do this. [0]

[0] Learning to predict gender from iris images (PDF) https://www3.nd.edu/~nchawla/papers/BTAS07.pdf


I don't follow that. The recidivism predictor was supervised. Conversely, AlphaZero is unsupervised and certainly not BS.


AlphaZero is not unsupervised. It is a reinforcement learning algorithm, it knows exactly what the outcome of the game is.


The terms "supervised machine learning" and "unsupervised machine learning", by their ordinary English meaning, make it sound like all machine learning is partitioned into one or the other. But a lot of the literature in machine learning considers reinforcement learning to be neither 'supervised learning' nor 'unsupervised learning'. See, e.g., section 1.1 of [1].

[1] Richard Sutton and Andrew Barto, Reinforcement Learning: An Introduction, second edition. MIT press, 2018.


I basically agree with this rule. I find that my colleagues who overly hype unsupervised approaches typically don't have much experience working on ML problems without labeled data. My suspicion of this comes from the fact that whenever I give a talk on ML I always have a wealth of personal experience to draw on for examples. My colleagues almost always reuse slides from projects they never worked on.


I'm a little surprised to see this sentiment. Some of the most important advances in the field have come from unsupervised tasks:

- OpenAI: Dota 2 (PPO), GPT-2...

- NVidia: StyleGAN, BigGAN, ProGAN...


Those are certainly important advances, but they don't really apply to most business needs for AI or ML.


I work in the industry on NLP tasks. Unsupervised learning has been behind the largest developments in the last decade in the field.


I don't disagree with your point, but the unsupervised aspect of NLP typically isn't useful on its own. Usually it's a form of pre-training to help supervised models perform better with less data.

From Google in 2018:

"One of the biggest challenges in natural language processing (NLP) is the shortage of training data. Because NLP is a diversified field with many distinct tasks, most task-specific datasets contain only a few thousand or a few hundred thousand human-labeled training examples. However, modern deep learning-based NLP models see benefits from much larger amounts of data, improving when trained on millions, or billions, of annotated training examples. To help close this gap in data, researchers have developed a variety of techniques for training general purpose language representation models using the enormous amount of unannotated text on the web (known as pre-training). The pre-trained model can then be fine-tuned on small-data NLP tasks like question answering and sentiment analysis, resulting in substantial accuracy improvements compared to training on these datasets from scratch."


As I said, I'm an NLP researcher and practitioner, so you don't need to quote this at me.

The unsupervised aspect is the engine driving all modern NLP advancements. Your comment suggests that it is incidental, which is far from the case. Yes, it is often ultimately then used for a downstream supervised task, but it wouldn't work at all without unsupervised training.

Indeed, one of the biggest applications of deep NLP in recent times, machine translation, is (somewhat arguably) entirely unsupervised.


I didn't mean to make it sound incidental although I do see your point. Just wanted to chime in with how important having a labeled dataset is for a successful ML project.


I think the point is labeling itself is very difficult except for special and limited domains. Manually constructed labels, like feature engineering, are not robust and do not advance the field in general.


That makes sense. I'm coming from the angle of applied ML where solutions need to solve a business problem rather than advance the field of ML. In consulting many problems can't be solved well without a labeled dataset and in lieu of one, less credible data scientists will claim they can solve it in an unsupervised manner.


For sure. There are counter-examples however - fully unsupervised machine translation for resource poor languages comes to mind and is increasingly getting business applications.

I think that in the future, more and more clever unsupervised approaches will be the path forward in huge AI advances. We've essentially run out of labeled data for a large variety of tasks.


Echo the other commentator. Unsupervised techniques are the only reason NLP works as well as it does.


I would argue that GANs by definition aren't unsupervised; they just aren't supervised by humans. Additionally, OpenAI's game stuff also has similar arguments against it.


> I would argue that GANs by definition aren't unsupervised

You can define the terms how you want - but in terms of how they're understood in both industry and academia, you are incorrect.


The discriminator definitely is supervised but the generator is unsupervised. I.e., it has no labels on its targets.


I'm not sure that's correct. The discriminator and the generator both learn to match a training set. You don't need to label the training set at all. You can just throw 70,000 aligned photos at it.

I think I see what you're saying, but that might be a different definition of "supervised". It seems impossible for one half of the same algorithm to be supervised and the other to be unsupervised. But I like your definition (if it was renamed to something else) because you're right that the discriminator is the only thing that pays attention to the training data, whereas the generator does not.
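For what it's worth, a toy sketch of where the labels come from in a GAN training loop (a deliberately tiny PyTorch example on made-up 1-D data, not a real architecture): the discriminator's real/fake targets are generated mechanically by the loop itself, and the generator's loss never touches a label on the data.

    import torch
    import torch.nn as nn

    # Toy "training set": samples from N(3, 1). Nobody annotates these.
    real_data = torch.randn(256, 1) + 3.0

    G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
    D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCELoss()

    for step in range(2000):
        # Discriminator step: its real/fake targets (ones/zeros) come from
        # the training loop, not from a human annotator.
        fake = G(torch.randn(256, 8)).detach()
        d_loss = bce(D(real_data), torch.ones(256, 1)) + bce(D(fake), torch.zeros(256, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator step: no labels on its targets at all; the objective is
        # just "make the discriminator say 'real'".
        g_loss = bce(D(G(torch.randn(256, 8))), torch.ones(256, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()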


There's a lot of gray area between unsupervised and supervised learning. For example self-supervised learning: https://www.facebook.com/722677142/posts/10155934004262143/


Ironically, the algorithm you pose in that comment is a BS algorithm in and of itself.

"Formulate the problem as X" - what is your input for how a problem is formulated? That you personally like how it was formulated?

"Probably," - OK, so you assign probability scores? Or do you mean, "likelihood based upon my guess?"

Finally, how do you measure performance? Your own assessment of how good you were at it?


The author says "AI is already at or beyond human accuracy in all the tasks on this slide and is continuing to get better rapidly" and one of his examples is "Medical diagnosis from scans". That is an example of precisely the sort of snake oil hype he's berating in the social prediction category.

In an extremely narrow sense of pattern recognition of some "image features", i.e. 5% of what a radiologist actually does, he's probably right. But context is the other 95%, and AI is nowhere close to being able to approach expert accuracy in that. It's a goal as far away from reality as AGI.

"AI" tools will probably improve the productivity of radiologists, and there are statistical learning tools that already kind of do that (usually not actually widely used in medical practice, you can say yet, I can say who knows but nice prototype). But actual diagnosis, like the part where an MD makes a judgement call and the part which malpractice insurance is for? Not in any of our lifetimes.

A radiologist friend complains that they've been using speech recognition instead of a human transcriptionist for 10+ years, and all the systems out there are still really bad. Recognizing medical lingo is something you can probably achieve with more training data, but software that sometimes drops "not" from a scan report is a cost-cutting measure, not a productivity tool. It makes the radiologist worse off because he's got to waste his time proofreading the hell out of it, but the hospital saves money.


Author here. I appreciate your criticism. What I had in mind was more along the lines of Google's claims around diabetic retinopathy. I received feedback very similar to yours, i.e. that those claims are based on an extremely narrow problem formulation: https://twitter.com/MaxALittle/status/1196957870853627904

I will correct this in future versions of the talk and paper.


Then I shall write to you directly. I don’t know how you can make the claim that automated essay grading is anything but a shockingly mendacious academic abuse of students’ time and brainpower. To me, this seems far worse than job applicant filtering, firstly because hiring is fundamentally predictive, and secondly because many jobs have a component of legitimately rigid qualifications. An essay is a tool to affect the thoughts of a human. It is not predictive of some hidden factor; it stands alone. It must be original to have value; a learned pattern of ideas is the anti-pattern for novelty. If the grading of an essay can be, in any way, assisted by an algorithm, it is probably not worth human effort to produce. If you personally use essay grading software, or know of anybody at Princeton that does, you have an absolute obligation to disclose this to all of your students and prospective applicants. They are paying for humans to help them become better humans.


Thanks for the .pdf and the research in general, great stuff!

One thing I'd love is a look at 'noise' in these systems, specifically injecting noise into them. Addons like Noiszy [0] and trackmenot [1] claim to help, but I'd imagine that doing so with your GPS location is a bit tougher. I'd love to know more about such tactics, as it seems that opting out of tracking isn't super feasible anymore (despite the effectiveness of the tracking).

Again, great work, please keep it up!

[0] https://noiszy.com/

[1] https://trackmenot.io/


FYI, I too looked at his categorisation of what was / was advancing towards / was not snake oil and didn't exactly agree with all his decisions either.

Medical imaging diagnosis was one of them.

Speech recognition/transcription was another. I don't know if it's my accent or my speech patterns (though foreigners regularly compliment my wife and me on our pronunciation), but the tech hasn't gotten noticeably better for me since the days of Dragon NaturallySpeaking, and that was, what... 10 years ago?

Sure, I can "hey Google/Siri/Alexa" a handful of predefined commands, but I still have to talk in a sort of staccato "I am talking to a computer" voice, it still only gets it right 90% of the time, and God help you if you try anything new/natural not in the form of "writing Boolean logic programs with my voice".


I feel like it’s gotten loads better. You can watch as the voice recognition on your phone changes the words it recognizes to match the context in the sentence. Sometimes it gets a word wrong and fixes it after a half second. Google translate does magical things recognizing common phrasing constructions and bad accents, stuff that Dragon could never do. I built a lipsync pipeline for video games based on Dragon a decade ago, and it most definitely was not as good as what I have on my phone today.


Again, just anecdotally, I don't know if it's just me, but most of my experience is of Google/Apple translate and auto-correct going from the word I want to a wrong one.

It is one of my most frustrating everyday software experiences.

Not only is it not getting better, it's actually getting worse, because before I at least had the correct sentence. Now my correct sentence is mangled as it tries to force corrections/substitutions, and I have to continually go back and manually correct the auto-correct.

It seems to work for me on short pre-formed sentences and toy examples (if you communicate using pre-formed phrases and well-worn cliches in your writing, it seems to pick up on and predict them). I wonder whether the "increased accuracy" of modern solutions isn't just functionally having access to a larger library of lookup rules, stored common/popular phrases, and direct translations effectively mined from the training data (a huge part of practical 'AI' advancement has been in scaling infrastructure and collecting data at new scales rather than in the AI techniques themselves, IMO), but the moment I try to write or dictate anything new, original, or lengthy, it turns absolutely pear-shaped.


That does sound frustrating. I didn’t mean to discount your experience; it certainly is possible that the additional AI has made it worse for some people, especially if there’s an accent involved. Not to mention that the meme of autocorrect mistakes when texting somewhat backs up your experience on a larger scale. I wonder if the scale and complexity of what they’re doing now compared to last decade is the cause of the regressions: is the problem being solved much harder now that it tries to factor in context and autocorrect based on it, and is that causing worse results than pure phoneme detection?


My brush with AI snake oil:

I interviewed at a startup that seemed fishy. They offer a fully AI powered customer service chat as an off the shelf black box to banks. I highly suspect that they were a pseudo AI setup. LinkedIn shows that they are light on developers but very heavy on “trainers”, probably the people who actually handle the customers, mostly young graduates in unrelated fields, who may believe that their interactions will be the necessary data to build a real AI.

I doubt that AI will ever be built; it's just a glorified Mechanical Turk help-desk. I guess the banks will keep it going as long as they see near human level outputs.



Very common AI startup. They use some pre made AI tools (e.g. Watson bot) and resell it to specific industries where they already have the "intent trees" made (common questions/actions the user wants). The trainers are nothing more than analysts that will identify an intent not listed on the tree and configure it there. The devs probably are API and frontend devs, not much AI stuff going on.


I don't think that is in principle problematic (unlike the social problem statements pointed out in the talk). A system which amplifies human resources by filling in for their common activities over time could use sophisticated tech drawing on the latest in NLP. The metric would be a ratio of the number of service requests they handle per day / the number of "trainers" (or whatever name given) compared to the median for a purely human endeavour where every service request is handled by a customer-visible human.

In the Mechanical Turk analogy there is no such capability amplification happening.


My experience with automated "help" desks is that I have to let the automatons fail one after the other until I finally get connected to a real human. Then I can start to state my problem. All that those automated substitutes do is discourage customers from calling at all.


I have a feeling I know _exactly_ which company you're talking about...so it's either just that obvious, or there's more than one of these, or both!


It seems to be the latter. In fact, the trick (should I say, fraud?) is so common that there were even several articles about it in the press over the past two or three years. Even the famous x.ai had (and I guess still has) humans doing the work.

https://www.bloomberg.com/news/articles/2016-04-18/the-human...


I was going to say the same thing!

A couple of weeks ago such a startup based in London contacted me on LinkedIn - the product really hyped AI, but it all seemed very dubious. My guess was it was really a mix of a simple chatbot with a Mechanical Turk-style second line.


I guess the idea would be to get a few contracts, pull down some money from those and then go bust as the costs of the Mechanical Turk become evident?


> go bust as the costs of the Mechanical Turk become evident

I'm afraid you have misspelled "raise a humongous round from SoftBank". It's an easy typo to make, don't feel bad.


Hugh mongous what?


My company is sourcing AI from MTurk. It's actually cheaper than running fat GPU model training instances. The network learns fast and adapts well to changes in inputs.

I envision the sticker "human inside" strapped on our algorithms.


You should emphasize that this is Organic AI. It's low carbon and overall greener.


Or keep calling it AI, and concede that AI stands for "actual intelligence" if someone asks you directly.


AI now is like Cyber was in the 1990s: it seems to be nothing but a buzzword for many organizations to throw around.

The term AI is used as if humanity now has figured out general AI or artificial general intelligence (AGI). It's quite obvious organizations and people use the term AI to fool the less tech inclined into thinking it's AGI - a real thinking machine.


Remember 5-ish years ago when IBM's marketing department was hawking their machine learning and information retrieval products as AI, and everyone in the world rolled their eyes so hard we had to add another leap second to the calendar to account for the resulting change in the earth's rotation?

I suppose their only real sin was business suits. Everything seems more credible if you say it while wearing a hoodie.


Cyber still happens to be a buzzword; it just shifted meaning to the defense sector.


It has been like that for quite a while; it has ups and downs, with the 90s marking a winter season for AI, and now the hype machine is at full steam again, until people find out once more that a lot of it is marketing BS to get funding. Then the research that is worthwhile gets mixed up with that, falls into a lack of funding, and in 30 years or so it's back on full hype again.


Call it Al, and only hire Alans, Alberts, Alphonses, Alis, etc


Each unit uses about 100W continuously and emits about 1kg of CO2 per day before adding impact of supporting infrastructure.

These things better be smart, because they are not low-footprint.


We just have to hope our robot overlords will not be overly environmentally conscious


20+ years ago I used to refer to this as artificial artificial intelligence (AAI), specifically as part of a pitch to MGM for an MMORPG to run their non-player characters. Not surprisingly, it didn't catch on...


That was MTurk's slogan on launch.


Totally should have trademarked it along with "IGlasses" in the very same pitch. The pitch was apparently rejected because our level design ideas were better than the actual episodes of the show upon which it was based: "Poltergeist: The Legacy."


You mean high carbon, low silicon? Because humans usually have a higher carbon footprint than computers, it takes a lot of computers to match one human. Plus we're made of carbon.


I'm not sure, if you factor in the CO2 footprint of computer manufacture and the fact that AI needs powerful computers & networking to be delivered. Our body carbon is almost 100% recyclable.


If only the carbon footprint of a human was the body carbon.

Modern humans have a very heavy carbon footprint, especially in the US. Think of all the things you do and consume and all the carbon involved all thorough the chain. It's a big number. Computers are extremely efficient compared to that.


People you hire on MTurk don't generally work in the US, and have a small fraction of the carbon footprint of an average American.


Aren't most of those arguments here moot, since they also need a computer (+ networking) to be available on MTurk?

computer < human + computer

Of course not every computer has the same specs and footprint, but they should be in roughly the same ballpark.


A typical computer used by a typical person has only a fraction of the energy use of the GPU farms utilized to train DNN models. We're not at the point where you can pick an existing DNN off the shelf and just use it; you have to train (at least partially) a new one for each new task, often many times.


I think simply OI (Organic Intelligence) would be the more appropriate term since it's no longer artificial ;)


Eh, every existing instance was created by some people working together.


I like calling them MeatBots


Even companies like Facebook, Apple and Google employ humans to do work that people believe is done by "computers", and none of the companies seem keen on informing the public that they do in fact have humans scanning through massive amounts of data. So perhaps it is in fact cheaper, or the problems they face remain too hard for current types of AI.

Given the number of people Facebook employs to censor content and the mistakes they make, I would label most of Facebook's AI claims as snake oil.


Well, just about any ML task needs people to prune and correct the training data.


I'm aware of moderation. What else?


Recaptcha is probably one you've actually interacted with, but even then you're mostly reinforcing existing predictions. Other applications within Google Maps are things like street number recognition, logo recognition, etc. Waymo contracts out object detection from vehicle cameras and LIDAR point clouds. Google even sells data labeling as a service.

I believe Google Maps has a lot of humans who tidy up the automated mapping algorithms (such as adjusting roads).

Annotation is time consuming and therefore extremely expensive if you have a $100k engineer doing it.

https://www.forbes.com/sites/korihale/2019/05/28/google-micr...


Yes you can report problems with the road network and people update GMaps manually. And up until a couple of years ago users could do it themselves, but they took it down for some reason.

Changes to other types of places can still be done manually by GMaps users themselves, and other users can evaluate that, and I guess if it's "controversial" (a low-rep user made the change or people voted against it), a Google employee evaluates it. And if you're beyond a certain level as a GMaps user you can get most changes published immediately.


What about data privacy? Do your customers know that random Turkers look at their data?



I worked at a place that was selling ML powered science instrument output analysis. It did not work at all (fake it till you make it is normal, I was told). So there was a person in the loop (machine output -> internet -> person doing it manually pretending to be machine -> internet -> report app). The joke was “organic neural net.” Theranos of the North! ML is a great and powerful pattern matcher (talking about NN, not first order logic systems) right now, but I fear we are going into another AI winter with all the overpromising.


We won’t ever have an AI winter like in the 70s again. A lot of ML is already very useful across many domains (computer vision, NLP, advertising, etc). Back then, there was almost no personal computing, almost no internet, smol data, and so on. Stuff you need for ML to be useful and used.

So what if some corporate hack calls linear regression “AI”? The results speak for themselves. The ML genie is too profitable to go back in the bottle.


Didn't linear regression used to be called "AI" as recently as a decade ago?


It's still better in many cases than modern ML (especially if you incorporate explainability and efficiency as metrics of "better" next to the predictive power), so I wouldn't object much if a company called it "AI". In fact, if I learned that an "AI" behind some product was just linear regression, I'd trust them more.


I personally don’t see a problem with this. Where do you draw the line at model simplicity? Are decision trees too simple to be AI? What about random forest? Are deep neural nets the only model sophisticated enough to be “AI”? It’s not the model, it’s how you use it.


it still is, but the people who mean "regression" when they say "AI" will generally not admit this


All fair points. I have used it to do amazing things. It’s not going away. Just that AGI seems very far away. I think CS is like biology before the discovery of the microscope (why are we getting sick? Microorganisms etc). Or DNA. Once that big breakthrough happens we will quickly transition to a new local maximum.


There's a recent xkcd about your company: https://xkcd.com/2173/

"We trained a neural network to oversee the machine output"


That sounds at least achievable, unlike examples in OP.


What I dislike far more than the idea of using such systems to predict social outcomes is that the usage of such systems is done behind closed doors. I would be much more willing to accept such systems if the law required any system to be fully accessible online, including the current neural network, how it was trained, and the training data used to train it (if the training data cannot be shared online, then the neural network trained from it cannot be used by the government).

Independent companies using AI is far less a concern for me. If they are snake oil, people will learn how to overcome them. Government (especially parts related to enforcement) is what I find scary.


In my country, a relatively recent law added an obligation for the government to give on request a detailed and joe-six-pack-understandable explanation of how an "algorithm" has reached a decision pertaining to the person requesting it.

I've therefore started stockpiling popcorn since this law was announced for the inevitable clusterfuck that was going to happen when this law would have to apply to a decision taken using machine learning.

(Which is pretty much impossible to explain the way the law requires, because even those who made the neural network would be quite at a loss to understand how exactly it came up with that decision themselves, let alone explain it to your average person!)


Maybe they can use something like this?

https://blog.acolyer.org/2019/11/01/optimized-risk-scores/

They optimise a simple set of decision rules which has reasonable accuracy in their application, quite cool really
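Not the actual method from the linked paper, just a toy of the kind of artifact it produces: a small integer scoring rule that a human can audit by reading it (the features, points, and calibration offset below are invented):

    import math

    POINTS = {
        "prior_missed_payment": 2,
        "debt_to_income_over_40pct": 3,
        "employed_under_1_year": 1,
    }
    OFFSET = 4   # made-up calibration constant

    def risk(person):
        # Add up the points for whichever features apply...
        score = sum(pts for feat, pts in POINTS.items() if person.get(feat))
        # ...and map the total to a probability with a logistic curve.
        prob = 1 / (1 + math.exp(-(score - OFFSET)))
        return score, prob

    print(risk({"prior_missed_payment": True, "debt_to_income_over_40pct": True}))

The whole model fits on an index card, which is exactly what makes it explainable in a way a neural network isn't.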


Using an association between features to make a prediction about something, rather than measuring the thing itself, is exactly what’s meant by “prejudice.” Even when the associations are real and the model is built with perfect mathematical rigor. ML is categorically unsuitable for government decisions affecting lives.


You seem to think this level of prejudice for prediction is wrong. Why?

If someone has killed 12 people, being prejudiced about their chance of killing another and using that to determine the length of a sentence seems reasonable.

Even with something like a health inspection. Measuring how they store and cook raw chicken is about predicting the health risks to the public eating it, not about measuring the actual number of outbreaks of salmonella. And even if they were to measure the previous outbreaks of salmonella and use them to predict future outbreaks, those are still two different things.


I understand and agree with the outcry around using ML areas like criminal justice, but there are some really compelling examples of ML being used by governments to help citizens[1].

[1] https://www.kdd.org/kdd2018/accepted-papers/view/activeremed...


The problem with leaving this to independent companies is that some of the most natural application areas are dominated by independent companies operating as a cartel -- think about credit scoring. The fact that the data and models are a natural "moat" (in the sense of Warren Buffett) is all the more worrying.


The difference is the level of threat between a private company denying you a loan and the government deciding you should spend 10 years in prison. Please don't misunderstand, I'm not saying the former is good, only that it isn't nearly as bad as the latter.


The "independent company" part doesn't work though. If Facebook comes up with anything useful, the US government walks in, grabs the data, then issues a gag order so nobody knows. It simply wouldn't matter that the government was officially restricted to "open AI".


A judge using his experience and judgement to subjectively set a jail sentence is as opaque as a proprietary algorithm. He or she may cite reasons for the sentence, but nobody is verifying that judges' sentences are consistent with the criteria they cite.


You are right and I see that as a flaw in the current system that should be fixed.


I read "Why are HR departments apparently so gullible?" and as someone who has worked in a corporate for 20 years I spotted my underwear.

The identification of facial recognition as problematic because of accuracy doesn't match my thinking. I believe that the key issue is that, given a set of targets, facial recognition systems will find near misses from the wider population of all faces offered as candidates, which they then flag as potential matches. This leads to real world problems (like innocent folks being arrested).

Automated essay grading and content recommendation are both very problematic because they do not account for originality and novelty. A lecturer grading an essay that is written by a strong mind from a different culture might be able to recognise and credit a new voice; a learned classifier never will. Similarly content recommenders have us trapped in the same old same old bubble, nothing strikingly new can get through.


...I spotted my underwear?

What does this mean?


Interpretation 1: they pooped their pants

Interpretation 2: they introduced a non sequitur in their excitement of finding their underwear (perhaps they lost it)

Interpretation 3: when they had their flashback to having worked in corporate 20 years ago they recalled where they misplaced their underwear


English humour - being taken unaware by a sudden very funny thought causing one to involuntarily do a small wee in your pants. This should not be taken as a literal admission of incontinence.


spotted = soiled


Top textual feature predicting snake oil: calling the product AI rather than ML.


Sometimes you just need to do it that way, because the people buying do not know what machine learning is - even though they've heard about AI.

For example, I was networking for some jobs in data science - and was approached by some energy company. Struck up a conversation with the guy (older exec), and he said "so I hear you have a background from AI, correct?" to which I replied "I have a degree in Machine Learning, and have worked with etc." - he just replied "Oh, we don't really need any mechanical engineers now"

So I asked him, mechanical engineering? He pointed out that I said MACHINE, and assumed it had something to do with, you know, physical machines and stuff - so in the domain of mechanical engineering.

So, from then on, I went easy on using Machine Learning unless I was fairly confident the other guy had some domain knowledge. If I talk with non-technical salespeople, or older executives, I just leave it at AI.


Seems like there was a time when "machine" was the term grant committees wanted to hear; they were perhaps fed up with the theory and wanted more tangible stuff. So the theorists branded their stuff as machines. See support vector machine, kernel machines...

Similar to how "dynamic programming" was coined to please funding agencies.


Further signal: claiming general intelligence.


That doesn't work. Everything is called AI these days and in mountains of bullshit there are also some actually useful results, these few are not snake oil.


Somewhere out there, a biotech R&D company has developed an effective penis enlargement treatment. Unfortunately they have been having some trouble reaching potential customers.


This is a tragedy for half of humanity. Actually, all humanity come to think of it


Yes, I know folks who worked for Pfizer (which made viagra). They said they had all sorts of spam problems.


yes, and? That doesn't mean the feature isn't useful, it's just not 100% predictive. It's still the top feature.


Dynamic Yield was practically pushed into AI. They never promoted themselves as doing AI, but investors and clients liked it more. So they simply gave up and went with it.


Machine learning is AI.

AI is a large field, composed of a lot of more specific things. ML is one of those more specific fields.


This presentation categorized AI-related tech aimed at social outcomes, such as predicting criminal recidivism, predicting terrorist risk, or predictive policing, as fundamentally dubious. The rationales are that

(1) technically, the technology is far from perfect.

(2) Social outcomes like fairness are fundamentally difficult to state.

(3) Its inherent problem is amplified by the ethical/moral problems.

Of course many find it unacceptable due to the ethical problems in a typical Western liberal democracy. But what if I'm an authoritarian looking for tools of suppression, and I don't care about the false positives or the ethical problems? Is it going to help my regime, or is it going to have the opposite effect?

I highly suspect that the answer is the former one. Fortunately, the technical limitations of "AI" mean it's still more or less ineffective today, but it can only get better.

Therefore, I don't think AI with social outcomes is fundamentally dubious, but rather fundamentally dangerous.


Lots of AI is actually large numbers of humans working on small bits of problems that are beyond our ability to automate. Not infrequently these are passed off on the outside as 'AI' startups. There are some good examples too of companies that use machine learning properly and to good effect. Interestingly, they don't blab about it because it is their edge over the competition, and often just knowing that something can be done is enough to inspire someone else to copy it.

So here is my own theory on how to recognize AI snake oil: if it requires advertising, it is probably fake; if it is very quiet and successful, it is likely genuine.


> if it requires advertising it is probably fake, if it is very quiet and successful it is likely genuine.

This is true of almost any product being offered for sale. Good advice.


Agreed, and I personally use the following heuristic in my product evaluation: the heavier the intensity of advertising, the worse the product. After correcting for rough company size (bigger company = bigger baseline advertising budget), I found it to be quite accurate.


The economic point of AI decision systems is that they can make an automated decision for $0.0001 of computer time, instead of $10 for a few minutes of an expert's time. For something like spam filtering, you obviously need the cheap solution. But you don't need that for the social interventions where you're making a decision about parole or something. You can spend $10 (or $1000) of people's time to make those decisions, because both their volume and impact are at human scale.


Moreover, the big advantage of AI is consistency. I.e. it does not get tired, bored, lose motivation, etc.

Hence it might be better to get a 70% correct answer all of the time than a 90% (human) one some of the time.


On hard problems these AIs are doing no better than chance. Nor is a badly trained AI in any way consistent.


Perhaps ML could be applied here to help filter out the barrage of AI snake oil. Funding, anyone?


You should fund it with an Initial Coin Offering for maximum credibility


Where can I buy in?


The key claim of the presentation:

“For predicting social outcomes, AI is not substantially better than manual scoring using just a few features”

Suggests that only a simple heuristic is needed: if the AI salesman claims their product predicts a social outcome, like a candidate’s job performance or a person’s future criminality, it is snake oil.


And I bet I can beat that with a trivial algorithm: if(marketing copy contains the word 'AI') return 'bullshit'.

WRT the presentation, I didn't read it as claiming that only predicting social outcomes is hard; I read it as saying that perception is the only category where "AI" actually works somewhat (from the "automating judgement" part, I'd argue only spam detection works; I'm yet to hear of any effective ML solution for the rest of the bullet points there).


That filter would be ludicrously easy to "train". Just grep(1) for the letters 'AI'. I'll blithely take millions and billions of funding for this brilliantly profound vision of a mission to improve life for every single human being on earth for all time to come.


You can get high 90% accuracy with a single feature. If it has AI in the title it's snake oil.


Seems legit


I really wish we could stop using AI or ML for things in the "predicting social outcomes" category. Naming them more like "computational astrology" or "machine alchemy" would be a better fit.


"Astrology" (where mathematicians used to hang out before the scientific revolution, so no "computational"/"mathematical" qualifier needed) has been going on for decades in the financial and economic fields, so this one seems to be promised a bright future too!


Apple was all over that in 1979.

https://i.imgur.com/ssFqU3i.jpg


I've worked on data related to healthcare and security (and sometimes both) for quite a while now, and I think there are a few general contextual themes which, if present, mean that you have to be extremely careful about applying "AI" (some kind of ML in most cases):

(a) where there's a high cost for incorrect predictions (e.g. criminal recidivism, educational attainment, terrorist attacks, etc.)

(b) where causation is important (e.g. drug efficacy and safety, educational attainment, almost all of healthcare)

(c) where you're in an adversarial domain (e.g. fraud, cybersecurity, security in general)

(d) where high technical performance (precision/recall/F1/etc.) isn't correlated with predictiveness of what you're actually looking for (much of healthcare)

In healthcare and security, there's starting to be an awareness of the snake-oil that's out there, but I still run into people regularly who ask for a magic algorithm that predicts patient outcomes or a security breach.


Looking for a job as a "deep learning" PhD soonish, the amount of BS in this field at the moment has all my meters maxing out. Going through job listings is pretty exhausting when I'm constantly torn between laughing and crying...


Excellent takeaways:

>AI excels at some tasks, but can’t predict social outcomes.

>We must resist the enormous commercial interests that aim to obfuscate this fact.

>In most cases, manual scoring rules are just as accurate, far more transparent, and worth considering.


From the slides:

Harms of AI for predicting social outcomes

• Hunger for personal data

• Massive transfer of power from domain experts & workers to unaccountable tech companies

• Lack of explainability

• Distracts from interventions

• Veneer of accuracy

Human behavior is not IID and these models will struggle and fail due to the fundamental statistical assumptions of modern AI techniques. I also agree that as a result, we will normalize the collection of personal data in the name of social progress.


Isn't "AI" pretty much snake oil. IIRC Artificial Intelligence used to mean a computer that could think like a person. But that just is not the case. Even with IBM ads with a computer talking to people saying it's going to fix the network and stop cyber attacks that is just complete nonsense. And it will probably always be nonsense because of course there's no way a computer can think like a person because a computer is not a person. It did not grow up and fall of it's bike and skin it's knee and take a road trip to the rock concert and meet someone and so on. AI is being used as a marketing term to compensate for the fact that sophisticated pattern recognition algorithms and the like are not particularly marketable even if they are useful.


Once upon a time, I used to do research in the field of Wireless Sensor Networks, which used to be abbreviated into WSN. You can still find lots of research papers using the abbreviation in the title, circa 10-12 years ago.

It went through a hype cycle, and then people sort of moved on - into IoT. And IoT, naturally, just sounds better and is more accessible, plus the tech did catch up, so it became a lot more popular than WSN (the term).

I told a friend of mine that no one paid attention when the field was called WSN, and now it is the exact same thing, but IoT is taking off like crazy. (At that point, I had left research altogether). He said "Yes, and that's perfectly reasonable. If you don't invent new terminology every few years and manufacture some kind of new hype, the funding agencies stop sending you money."

This is a community which is based on the mantra of "making something users want". Once you realize that the upstream user here is the funding agency, and what they really want is to make bets on "cool stuff for tomorrow" rather than boring old 3-5 year old tech (such as WSN), the hype actually makes perfect sense.

Sure, there is still a need to separate the snake oil from the reality. But that is true of tech in general. I am not sure if AI/ML is particularly bad in some way.


Here's the compressed decision tree:

Does it claim to have high performance on a task? Can humans quickly and cheaply verify that claim? If yes->no, then it is snake oil.


Several references to simple linear models and even improper linear models, so I will recommend this paper: Dawes, The Robust Beauty of Improper Linear Models.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.188...
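A toy illustration of the paper's point on synthetic data (not a reproduction of Dawes' analyses): fit regression weights on a small sample, compare against simply giving every standardized predictor a weight of one, and check which predicts better out of sample.

    import numpy as np

    rng = np.random.default_rng(0)
    p = 4
    true_w = np.array([0.4, 0.3, 0.2, 0.1])    # made-up "real" effect sizes

    def make_data(n):
        X = rng.standard_normal((n, p))         # standardized predictors
        y = X @ true_w + rng.standard_normal(n)
        return X, y

    X_tr, y_tr = make_data(30)                  # small sample, as in many applied studies
    X_te, y_te = make_data(5000)

    beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)  # "proper" fitted weights
    unit = np.ones(p)                                    # "improper": unit weights

    def validity(w):
        return np.corrcoef(X_te @ w, y_te)[0, 1]

    print("fitted weights r =", round(validity(beta), 3))
    print("unit weights   r =", round(validity(unit), 3))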


It seems like at least half the tech people I talk to work with AI now... no matter their field.


Any type of heuristic search = AI. Hand-made decision tree or lookup table = AI. Naive Bayes = AI. KNN = AI. Uses numpy = AI. One employee in a mobile app startup has a graduate thesis in ML = AI startup. Approximate string matching in an SQL query = AI. Not clear what the product is going to be = AI.

"Information is not knowledge. Knowledge is not wisdom. Wisdom is not truth. Truth is not beauty. Beauty is not blockchain. Blockchain is not AI. AI is THE BEST.”


In my space (cybersecurity) I’ve heard “more than 2 joins” (in an SQL context) is ML.


So it's turning into a meaningless buzzword.


I worked for a company that had a "big data machine learning" product. Yeah, it was mostly elasticsearch aggregations.


well, aggregations are just leveraging how the data fields are analyzed and labeled, so if those were based on ML/NLP techniques then it could be legit


> so if those were based on ML/NLP techniques then it could be legit

Occam's Razor is whispering that it probably wasn't legit.


I believe the only NLP/ML stuff we really had was at the level of positive / negative sentiment analysis for social media posts, which as far as I know is pretty basic and mostly just involves dictionary checks for +/- words. The elasticsearch aggregations & indexing was not ML/NLP based at all.


It turns out a significant amount of what I did in the past overlaps with ML now: for example, hyperparameter exploration. I created a system years ago to do exploration by fitting polynomial surfaces, finding predicted maxima, and exploring those areas. Also, many of the algorithms underlying the area I did my PhD research in (molecular dynamics) are very similar to gradient descent.
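Something like this 1-D sketch of the fit-and-jump-to-the-predicted-maximum idea (the objective and numbers are hypothetical stand-ins for an expensive experiment):

    import numpy as np

    rng = np.random.default_rng(0)

    def run_experiment(lr):
        # stand-in for an expensive run: score as a function of one
        # hyperparameter (say, a learning rate), plus noise
        return -(lr - 0.3) ** 2 + rng.normal(0, 0.01)

    tried = np.array([0.05, 0.1, 0.5, 0.8])             # points explored so far
    scores = np.array([run_experiment(lr) for lr in tried])

    a, b, c = np.polyfit(tried, scores, 2)              # fit a quadratic
    next_lr = np.clip(-b / (2 * a), tried.min(), tried.max())  # predicted maximum
    print("next point to evaluate:", round(float(next_lr), 3))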


I mean, if the software is smart enough to make sense of long sequences of 1's and 0's then it must be artificially intelligent to some extent. /s


The classic Bone Tumor Differential Diagnosis program that Apple published for the Apple ][ in the original Apple Software Bank Volume 1 from 1978 sounds legit. But beware: it requires 32K of RAM, bigger than any other program in that volume! (Available on both cassette and 5-1/4" floppy disk for free from your local Apple dealer.)

https://archive.org/stream/Apple_Software_Bank_Vol_1-2/Apple...

https://archive.org/details/a2_Biology_19xx_

https://mirrors.apple2.org.za/ftp.apple.asimov.net/documenta...

Program Name: BONE TUMOR DIFFERENTIAL DIAGNOSIS

Software Bank Number: 001 1 4

Submitted By: Jeffrey Dach, M.D.

Program Language: APPLESOFT II BASIC

Minimum Memory Size: 32K Bytes

This program is intended for use by qualified medical practitioners. While the specific data are of interest only to those familiar with bone pathologies, the programming techniques may well interest a wide range of computer users.

INSTRUCTIONS

LOAD the program into APPLESOFT II BASIC, and type RUN. Follow the instructions displayed on the screen. The program asks a series of questions concerning radiographic and clinical details of the bone tumor in question. For each question, type the number of the appropriate answer and press the RETURN key. Finally, the program uses Baye's rule and a predetermined probability matrix from Lodwick (1963) to calculate the relative probabilities of 9 different diagnoses.

Some knowledge of descriptive terms for bone tumors is needed to answer the questions. Only a qualified physician should attempt to use this program as a diagnostic tool.
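For the curious, the Bayes' rule machinery described there is simple enough to sketch in a few lines; the diagnoses, findings, and probabilities below are invented, not Lodwick's 1963 matrix:

    PRIORS = {"tumor A": 0.5, "tumor B": 0.3, "tumor C": 0.2}
    LIKELIHOOD = {   # P(finding present | diagnosis)
        "tumor A": {"sharp_margin": 0.9, "age_over_40": 0.2},
        "tumor B": {"sharp_margin": 0.3, "age_over_40": 0.7},
        "tumor C": {"sharp_margin": 0.5, "age_over_40": 0.5},
    }

    def posterior(findings):
        # Bayes' rule: P(dx | findings) is proportional to prior * likelihood
        scores = {}
        for dx, prior in PRIORS.items():
            p = prior
            for finding, present in findings.items():
                p_f = LIKELIHOOD[dx][finding]
                p *= p_f if present else (1 - p_f)
            scores[dx] = p
        total = sum(scores.values())
        return {dx: round(s / total, 3) for dx, s in scores.items()}

    print(posterior({"sharp_margin": True, "age_over_40": False}))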


Can we do blockchain next?


We don’t even need AI for that. Here’s some flawless pseudo-code:

fn is-snake-oil: return !(coin is BTC or coin is in stable-coin-list)


Sadly, that gives a false negative for when coin is BTC or coin is in stable-coin-list.


I think that was the joke


Read some of the authors other work, he's thrown plenty of cold water on various blockchain hype.


> Actually, Major Major had been promoted by an I.B.M. machine with a sense of humor almost as keen as his father's.

Catch 22


I once figured I can use ML to predict stock market movements and I went about buying 5 years of data from CBOE and started working on modeling. At one point I managed to get fairly accurate predictions of movements and I thought I had struck a lottery.

Turns out my training data had tomorrow’s movement indicator in it, which I accidentally added. Despite that dumb mistake it was only accurate up to 80% or so. That’s when I realized I'm better off with a random number generator.


I went down that path briefly too. After having trouble getting results, I realized I needed to check the fundamentals. What did it in for me was finally just doing a simple correlation analysis of earlier prices to later prices, and realizing there is NO CORRELATION. I realized, if there's nothing to predict, no amount of ML will magically figure it out for me, no matter how complex the indicator.

(I was just looking at single products.. I suppose this may be completely different when considering multiple products and markets and integrating external information like news sources.. not at all saying that ML can't be applied to the stock market, just that it's not nearly as simple as looking at previous prices to predict future prices, as everyone says... the point being, always check basic correlations!)
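Something like this is all it takes (the price series below is a synthetic geometric random walk standing in for real data):

    import numpy as np

    rng = np.random.default_rng(1)
    # Placeholder price series; swap in your own data here.
    prices = 100 * np.exp(np.cumsum(0.01 * rng.standard_normal(2000)))

    returns = np.diff(np.log(prices))
    # Does today's return tell you anything about tomorrow's?
    lag1 = np.corrcoef(returns[:-1], returns[1:])[0, 1]
    print("lag-1 autocorrelation of returns:", round(lag1, 4))
    # If this hovers around 0 on your real data too, there's nothing for a
    # price-only model to learn, however many layers you stack on top of it.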


There is a particular form of statistical inference which is not well-understood as problematic. I will use my “smart” scale as an example. It takes weight and several impedance measurements, which are taken from a large population along with body composition truth measurements from DEXA, and the coefficients of the algorithm are determined by regression. It should be no surprise that the weight measurement overpowers all other measurements in the regression, and the impedance covariance is ill-conditioned, so really it’s just a height and weight formula. My information gain from the impedance measurement is zero. The correct way to do the regression is in reverse, from more (relevant) information to less, so the parameters should be well-conditioned. This is assuming that DEXA contains all useful information to predict impedance. If not, forget about the whole thing. If that model works, then the reverse can be found through Bayesian optimization. You still have bias set by the covariance prior, but it is at least known, and you can give information about how it is affecting the result.
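A toy numerical illustration of that point, with made-up numbers in which impedance is almost entirely a function of weight (so the design is nearly collinear) and the DEXA "truth" depends only on weight:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000
    weight = rng.normal(80, 15, n)                       # kg
    # Toy assumption: impedance is almost entirely explained by weight.
    impedance = 500 - 2.0 * weight + rng.normal(0, 1.0, n)
    dexa = 0.4 * weight + rng.normal(0, 4.0, n)          # made-up "truth" from DEXA

    def r2(X, y):
        X = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return 1 - (y - X @ beta).var() / y.var()

    print("R^2, weight only        :", round(r2(weight[:, None], dexa), 3))
    print("R^2, weight + impedance :", round(r2(np.column_stack([weight, impedance]), dexa), 3))
    # The second number is essentially the first: under these assumptions the
    # impedance column adds nothing once weight is in the regression.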


It's a bit of a harsh heuristic, but after working now on several projects involving ML/AI and reading and watching about the experience of others in the industry too, I've come to associate most claims of ML with snake oil.

In industry today, I believe very few businesses are reaping much benefit from ML as compared to trivial statistical/analytical tools (linear regression, most popular recommenders, common sense improvements/optimizations, etc.). The only real benefit I would argue ML has brought for businesses has been in marketing to the general lay audience and misleading investors.

The main reason for this in my opinion is you can't really just come in and make recommendations/improvements to a given problem domain without deeply understanding that domain back to front - and that's an understanding that academic types that get hired to build ML systems almost never have. You can't stand at an arms length from real business problems and just throw maths at them and expect to make good (or even sensible) recommendations.


>The fast way forward involves being critical, being pragmatic, not overselling, and not drinking the Kool-aid.


Does the company mention AI in their name, marketing material, or investor pitch? Then it is snake oil.


I tend to find the following few questions a quick way to evaluate an ML/AI pitch in an elevator:

Ask yourself: could a human given the inputs reasonably produce the outputs you are looking for? — this helps avoid/identify the pure magic pitches.

Then ask the person pitching: 1. What’s your training data and how is it collected? 2. What’s your validation data and how is that constructed, and how does your system perform on that set? 3. What are the blind spots and biases in your model and how are you mitigating them?

If they don’t have succinct and competent answers, or there are major red flags like no validation or claiming no biases, then run away.


This was a really interesting read. In relation to the discussion of the predictive accuracy of a dataset with 13,000 features, I thought it might be worthwhile to bring up the idea of the "Curse of Dimensionality" for anyone unfamiliar: https://en.wikipedia.org/wiki/Curse_of_dimensionality

The "tl;dr" is basically that more features is not necessarily a panacea and can actually cause more problems.


Thanks for adding that to the discussion. I'd like to point out a couple of things:

(1) that adding features can create problems is well known among good ML practitioners (I daresay, esp. to those who have a fair amount of exposure to non-deep-learning techniques). With deep learning you can afford to worry less since with enough data and compute cycles, the network can figure out what to ignore. Which is convenient. Throwing out uninformative features, however, may still have a practical benefit: fewer features -> smaller dataset size -> faster training.

(2) This is probably a minor nitpicky point: adding more features can lead to no improvements not only because of the curse of dimensionality, but sometimes simply because the feature has absolutely no bearing on the label; that is to say you might not be adding noise, but you might not be adding information either.


Easiest way to cut through AI BS: ask “what’s their dataset?” If it’s not obvious, there’s a problem. AI is only as powerful as the data it learns from.

There is one exception to this: if an exhaustive simulation of the problem exists. This is why AI is so successful at sandboxed games like chess and Go. It can generate its own data with zero ambiguity.

So: what’s your dataset? What simulation are you inverting? If neither, you’re just writing an expert system based on heuristics.


It would be interesting to try to train an algorithm to detect snake oil in companies claiming AI, but I don't know how it would work with no negatives.


How to recognize AI snake oil: the salesman calls it AI and claims it is not simply statistical inference.


That's ridiculously easy: If it contains the letters "AI", it's snake oil. There is nothing in man-made software that would merit the term "intelligence". Real intelligence would be creative and unpredictable. We would run for cover.


"Lack of explainability" might be seen for some to be a feature, not a bug. LEOs don't care how they have to justify an action, just that the justification exists for them to operate the way they want to.


This got me thinking - are there any cases of literal snake oil (or any snake-derived products) having better-than-placebo effects? (I assume that's different from snake venom?)


Company founders have learned that simply by being present in a big, growing industry, they are virtually guaranteed to receive investment. A rising tide will lift all boats.


Love this, many problems can be solved with regression analysis.


How to use AI to recognize AI snake oil?


1. If it purports to have a "self".


This is a good framework and categorization for evaluating the effectiveness of AI.

TLDR:

AI is good at: Perception (observing how things are)

AI is ok at: Judgement (characterizing how things are)

AI is poor at: Prediction (how things are going to be)

I.e., AI is effective to varying degrees at observing and characterizing how things are, but not at predicting how they're going to be.

For prediction, AI with hundreds of variables fares no better than simple linear regression over a handful of variables, simply because the future is not reliably predictable by any means.
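A toy sketch of that claim (synthetic data and invented numbers, not the studies cited in the talk): when the outcome is mostly irreducible noise, a random forest over hundreds of variables ends up at roughly the same low ceiling as a linear regression over three.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n, d = 2000, 200

    X = rng.standard_normal((n, d))
    # the outcome depends weakly on 3 variables; the rest is irreducible noise
    y = 0.4 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2] + 2.0 * rng.standard_normal(n)

    simple = cross_val_score(LinearRegression(), X[:, :3], y, cv=5, scoring="r2").mean()
    fancy = cross_val_score(RandomForestRegressor(n_estimators=100, random_state=0),
                            X, y, cv=5, scoring="r2").mean()

    print(f"linear regression, 3 variables:  R^2 = {simple:.3f}")
    print(f"random forest, 200 variables:    R^2 = {fancy:.3f}")
    # With a weak signal and heavy noise there is nothing extra for the fancy
    # model to find, so both scores sit near the same (low) ceiling.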


This is good. I co-founded Futurescaper[1], a company that does work in strategic foresight systems -- collective mapping of complex systems, and analytical tools to help people and organisations understand them. This put us in a prime position to be vendors of AI snake oil. Due to a stubborn overabundance of ethics, we've refused to do so, which has undoubtedly cost us a lot of business. People want to buy snake oil. It goes down much smoother than hard truths.

To amplify what this presentation says, here are the hard truths about predicting social outcomes: either they

1. Are simple and obvious, and can be easily understood and predicted via regressions and trendlines.

2. Are complex, non-obvious, and neither can nor should be predicted. In fact, attempting to predict them is often dangerously wrong. But this doesn't mean that they can't be understood.

In the first instance, you don't need any fancy software, so there's no snake oil to be sold. The second instance, however, gets caught in a cognitive bias: we think that the future can be predicted. This is because, in simple systems, it can: drop a glass above a hard floor, and you can accurately predict that it will fall and shatter. That is a simple system. We think that complex systems must be similar... just more complex.

But complex systems are fundamentally different. Consider a double-pendulum. Its movement can't be predicted for more than the next few swings. Even if you know the exact starting configuration of the pendulum -- and by exact, I mean not just every single sub-atomic particle in the pendulum arms, but the gravitational influence of literally every object in the universe -- you still wouldn't be able to greatly extend your foreknowledge of its movements. This is because the feedback loops are driven by chaos, and chaos is baked directly into the mathematical fabric of the universe itself.

For the mathematically inclined, consider the Mandelbrot set: it is, essentially, the equivalent of a double pendulum. It asks a question: "by starting with this number and iteratively exponentiating it, will it trend towards zero or infinity?" When you ask this question on a simple number line, the answer is obvious: below 1, it trends towards zero; above 1, it trends towards infinity. However, when you ask this question on the complex number plane -- with two numbers feeding back into each other as they exponentiate -- then the only way to answer the question is to keep iterating and find out. There's no shortcut. In some places, the answer is found quite quickly. In others, it takes thousands of iterations. In still others, it takes infinite iterations: you could build a computer the size of the galaxy and you still wouldn't be able to answer whether a given coordinate trends towards zero or infinity. That's the complexity you get from just two number lines and a very simple feedback loop.
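For anyone who wants to poke at this themselves, here's a minimal sketch of the standard escape-time check for the Mandelbrot iteration z -> z^2 + c: the only way to classify a point is to keep iterating and see whether |z| ever escapes past 2.

    def mandelbrot_iterations(c, max_iter=10_000):
        """Iterate z -> z*z + c from z = 0.

        Returns the number of steps it took |z| to exceed 2 (a proof the
        orbit escapes to infinity), or None if it never escaped within
        max_iter steps -- at which point you simply don't know yet.
        """
        z = 0
        for i in range(max_iter):
            z = z * z + c
            if abs(z) > 2:
                return i
        return None

    print(mandelbrot_iterations(1 + 0j))         # escapes after a couple of steps
    print(mandelbrot_iterations(-0.75 + 0.05j))  # takes dozens of iterations near the boundary
    print(mandelbrot_iterations(-1 + 0j))        # periodic: never escapes, returns None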

The real world of psychological and social cause-and-effect is far more complex than a double pendulum or a Mandelbrot set, and attempting to predict it is even more futile. In fact, it's dangerous.

What makes this dangerous is that you can throw statistics at complex futures, and make statements like "Future A has a 40% probability; futures B, C, and D all have a 20% probability". But the human mind is terrible at making good judgements based on this kind of information. Hearing it, people tend to think: "right, that's settled then: Future A is twice as likely as any other scenario, so that's what we'll plan for." We fixate on what looks like the most likely future, and disregard the rest.

The problem is that if the preparations for Future A are contrary to what you'd want to do in Futures B, C, and D, then betting everything on Future A means that 60% of the time, you'll lose.

What's interesting is that increasing the accuracy of those percentages doesn't necessarily help, and in many cases can hurt, since it only reinforces our tendency towards target fixation. In a 40/20/20/20 scenario, we might still make some concessions towards planning for the "20s". But in a 70/10/10/10 scenario, those "10s" will simply be discarded. Which means that 30% of the time you won't just lose: you'll be blindsided. Utterly fucked.
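A back-of-the-envelope simulation of that fixation problem (the probabilities here are invented for the sketch): planning only for the single most likely future pays off exactly as often as that future's probability, no matter how sophisticated the forecast that produced it.

    import random

    def plan_for_modal_future(probs, trials=100_000):
        """Simulate always preparing for the single most likely future."""
        futures = list(range(len(probs)))
        modal = max(futures, key=lambda f: probs[f])
        hits = sum(random.choices(futures, weights=probs)[0] == modal
                   for _ in range(trials))
        return hits / trials

    print(plan_for_modal_future([0.4, 0.2, 0.2, 0.2]))  # ~0.40: you planned wrong ~60% of the time
    print(plan_for_modal_future([0.7, 0.1, 0.1, 0.1]))  # ~0.70: sharper forecast, but the ~30% of
                                                        # misses now arrive completely unprepared-for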

Unfortunately, most of the predictive AI that I've seen is focused on either increasing the "certainty" via extremely dubious means, or simply hiding the non-dominant answers altogether. So the AI does the target fixation for you. That's not a good thing.

The entire discipline of Scenario Planning[2] evolved to help people and companies "un-predict" the future. Rather than putting percentages on probabilities of outcomes, people need to understand the possibilities of outcomes. Even if complex systems can't be predicted, they can be better understood, and that can be very valuable for helping to navigate them in real-time. Rather than giving people the easy (and usually wrong) answer of "here's what is going to happen", you can give them a range of futures that could happen, and understanding those possibilities can aid navigation and lead to better outcomes.

It may even be that AI has a legitimate role to play in this process -- but it won't be in falsely predicting the outcome of complex systems, nor will it be in absolving people of the responsibility of thinking for themselves. This, unfortunately for my company's ability to raise capital, is a much less sexy sales pitch than the AI snake oil. (Although it does get us smarter and less annoying clients, so there's that!)

1: https://www.futurescaper.com/

2: https://en.wikipedia.org/wiki/Scenario_planning


This seems contradictory:

> AI can't predict social outcomes

> In most cases, manual scoring rules are just as accurate

So manual scoring rules don't work either for predicting social outcomes? There is some magic sauce that humans use for prediction that we haven't cracked yet? Nothing can predict social outcomes?

AI is perfectly capable of predicting social outcomes, and only in very few cases are manual scoring rules as accurate as black-box AI. The ethical concern is not about accuracy, but about our sensibilities when it comes to protected classes. The author cherry-picked examples where simpler approaches also worked, but says nothing of practical feasibility or the increase in variance. Try actually doing face recognition or spam detection with manual rules.

Face recognition being way more accurate is just as much an ethical concern as a gun that is way more accurate. It all depends on who you point it at. Accurate face recognition at the border helps save lives as much as equipping the police with more accurate handguns.

The talk of AGI is misguided. Everybody can see that the economy will be increasingly automated with narrow AI. Just because "big data" was a hype word does not mean companies haven't been monetizing their big data (and were thus right to collect it).

We can predict probabilities about the future. The author is attacking these systems for not being 100% sure. Predictive policing is automated resource management. Militaries have been doing this for decades. It has its drawbacks, but also benefits (wiser use of tax money, protecting low-income neighborhoods from falling into the hands of gangs).

The author also claims that algorithms automatically turn away people at the border for posting or liking or being connected to terrorist propaganda. But these systems just give a score and a human border guard makes the (more informed) decision.

A system not being 100% accurate is not an ethical concern, as long as we do not treat those systems as 100% accurate and give proper recourse.

Just a spelling check can and does weed out poor candidates. Why does HR want to automate? Because they get 1000+ resumes for a single position. The manual glance they give them pales in comparison to what an automated system can do.

What is more likely? That these HR systems show promise? Or that the VC market has completely lost it (despite working with software and automation for decades, and having AI experts on staff) and is pumping billions into tea-leaf reading because now it's called "AI"?

If you cheat the system by adding "Cambridge" or "Oxford" in white letters to your CV, is that ethical? Why not add it to your education section in black letters? Would you hire a good potential candidate, if you knew they acted like 90s search engine spammers? Maybe a candidate from Oxford or Cambridge really deserves to be on the top of the pile, or is it now unethical to look at education when hiring?

This presentation likes to mix ethics with technical success. Just say that an HR system is unethical, without calling it bogus with zero proof other than "some AI experts agree that this is impossible".

Yes, there is a lot of snake oil AI, and this will only increase. But these systems can and do work. I am sure there are AI experts building these systems right now.


Uhu seems good


I feel like if AI were so easy, someone would be making billions beating the stock market. It's got way more historical data and features than any of the problematic examples in this slide deck, but it's essentially unsolved.


Some are making billions beating the stock market

https://en.wikipedia.org/wiki/Renaissance_Technologies


[flagged]


Not to indulge the troll, but Arvind Narayanan is an (associate) professor of CS at Princeton and is one of the foremost researchers in the field on topics of ML/data privacy and ethics [0]. His papers/talks/tweets regularly attract attention on HN [1]. That you're judging the talk based on which conferences the author hasn't published in says more about your ignorance of the STS field than it does about the author's knowledge of the topic. This is top-notch content!

[0] https://scholar.google.com/citations?hl=en&user=0Bi5CMgAAAAJ...

[1] https://hn.algolia.com/?q=random_walker


It seems his main research focus is poking holes in popular tech, especially when he is the main author.


There are a lot of holes to poke, and not enough pokers.


[flagged]


Congratulations on your success! I'm actually familiar with the author's past talks and research, and am not just assuming he's competent because he lists his affiliation with Princeton.

I encourage you to familiarize yourself with the field of socio-technical systems. It is related to (but not the same as) "ML/DL", and it is important to know about if you are doing research in CS. A good place to start is the FAT* conference [0] (which was previously a workshop at NeurIPS).

Regarding manual scoring: The author cites this study [1] and specifically says: "This is a falsifiable claim. Of course, I’m willing to change my mind or add appropriate caveats to the claim if contrary evidence comes to light. But given the evidence so far, this seems the most prudent view." so by all means, do reach out to him with better evidence.

[0]: https://fatconference.org/2019/program.html

[1]: https://arxiv.org/abs/1702.04690


You seem to be knowledgeable on the matter, then. Why hide behind a throwaway account and ad hominem attacks against the author, though? You could articulate your perspective better; we would appreciate it. (I hope you're doing okay)



