
Voices from the Valley


  So I think a lot of the strong AI stuff is like that. A lot of data science is like that, too. Another way of looking at data science is that it’s a bunch of people who got Ph.D.s in the wrong thing, and realized they wanted to have a job. Another way of looking at it—I think the most positive way, which is maybe a bit contrarian—is that it’s really, really good marketing.

  As someone who tries not to sell fraudulent solutions to people, it actually has made my life significantly better because you can say “big data machine learning,” and people will be like, “Oh, I’ve heard of that, I want that.” It makes it way easier to sell them something than having to explain this complex series of mathematical operations. The hype around it—and that there’s so much hype—has made the actual sales process so much easier. The fact that there is a thing with a label is really good for me professionally.

  But that doesn’t mean there’s not a lot of ridiculous hype around the discipline.

  I’m curious about the origins of the term “data science”—do you think that it came internally from people marketing themselves, or that it was a random job title used to describe someone, or what?

  As far as I know, the term “data science” was invented by Jeff Hammerbacher at Facebook.

  The Cloudera guy?3

  Yeah, the Cloudera guy. As I understand it, “data science” originally came from the gathering of data on his team at Facebook.

  If there were no hype and no money to be made, essentially, what I would say data science is, is the fact that the data sets have gotten large enough that you can start to consider variable interactions in a way that’s becoming increasingly predictive. And there are a number of problems where the actual individual variables themselves don’t have a lot of meaning, or they are kind of ambiguous, or they are only very weak signals. There’s information in the correlation structure of the variables that can be revealed, but only through really huge amounts of data.

  So essentially, there are n variables, right? So there are n-squared potential pairwise correlations, and n-cubed potential cubic interactions, or whatever. Right? There are a ton of interactions. The only way you can solve that is by having massive amounts of data.
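
  The combinatorics behind this point can be sketched in a few lines of Python. This is an illustrative calculation, not anything from the interview: it simply counts how quickly pairwise and three-way interactions grow with the number of variables.

```python
# Illustrative sketch: counting pairwise and three-way interactions
# among n variables using itertools.combinations.
from itertools import combinations

def interaction_counts(n):
    """Return (pairwise, three-way) interaction counts for n variables."""
    names = range(n)
    pairs = sum(1 for _ in combinations(names, 2))    # grows like n^2 / 2
    triples = sum(1 for _ in combinations(names, 3))  # grows like n^3 / 6
    return pairs, triples

# Even a modest 100 variables yields thousands of pairs and hundreds of
# thousands of triples, which is why estimating them needs so much data.
print(interaction_counts(100))  # (4950, 161700)
```

  Estimating each of those 161,700 three-way terms reliably requires many observations apiece, which is the speaker's point about scale.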

  So the data scientist role emphasizes the data part first. It’s like, we have so much data, and so this new role arises using previous disciplines or skills applied to a new context?

  You can start to see new things emerge that would not emerge from more standard ways of looking at problems. That’s probably the most charitable way of putting it without any hype. But I should also say that the hype is just ferocious.

  And even up until recently, there’s just massive bugs in the machine-learning libraries that come bundled with Spark.4 It’s so bizarre, because you go to Caltrain [Bay Area commuter rail line], and there’s a giant banner showing a cool-looking data scientist peering at computers in some cool ways, advertising Spark, which is a platform that in my day job I know is just barely usable at best, or at worst, actively misleading.

  I don’t know. I’m not sure that you can tell a clean story that’s completely apart from the hype.

  For people who are less familiar with these terms, how would you define “data science,” “machine learning,” and “AI”? Because as you mentioned, these are terms that float around a lot in the media and that people absorb, but it’s unclear how they fit together.

  It’s a really good question. I’m not even sure if those terms that you referenced are on solid ground themselves.

  I’m friends with a venture capitalist who became famous for coining the phrase “machine intelligence,” which is pretty much just the first word of “machine learning” with the second word of “artificial intelligence,” and as far as I can tell is essentially impossible to distinguish from either of those applications.

  I would say, again, “data science” is really shifty. If you wanted a pure definition, I would say data science is much closer to statistics. “Machine learning” is much more predictive optimization, and “AI” is increasingly hijacked by a bunch of yahoos and Elon Musk types who think robots are going to kill us. I think “AI” has gotten too hot as a term. It has a consistent history, since the dawn of computing, of overpromising and substantially underdelivering.

  So do you think when most people think of AI, they think of strong AI?

  They think of the film Artificial Intelligence, that level of AI, yeah. And as a result, I think people who are familiar with bad robots falling over shy away from using that term, just because they’re like, “We are nowhere near that.” Whereas a lot of people who are less familiar with shitty robots falling over will say, “Oh, yeah, that’s exactly what we’re doing.”

  The narrative around automation is so present right now in the media, as you know. I feel like all I read about AI is how self-driving trucks are going to put all these truckers out of business. I know there’s that Oxford study that came out in 2013 that said some insane percentage of our jobs are vulnerable to automation.5 How should we view that? Is that just the outgrowth of a really successful marketing campaign? Does it have any basis in science, or is it just hype?

  Can I say the truth is halfway there? I mean, again, I want to emphasize that historically, from the very first moment somebody thought of computers, there has been a notion of, “Oh, can the computer talk to me, can it learn to love?” And somebody, some yahoo, will be like, “Oh, absolutely!” And then a bunch of people will put money into it, and then they’ll be disappointed.

  And that’s happened so many times. In the late 1980s, there was a huge Department of Defense research effort toward building a Siri-like interface for fighter pilots. And of course this was thirty years ago and they just massively failed. They failed so hard that DARPA was like, “We’re not going to fund any more AI projects.”6 That’s how bad they fucked up. I think they actually killed Lisp as a programming language—it died because of that. There are very few projects that have failed so completely that they actually killed the programming language associated with them.

  The other one that did that was the—what was it, the Club of Rome or something?7 Where they had those growth projections in the 1970s about how we were all going to die by now. And it killed the modeling language they used for the simulation. Nobody can use that anymore because the earth has been salted with how shitty their predictions were.

  It’s like the name Benedict.

  Yes, exactly, or the name Adolf. Like, you just don’t go there.

  So, I mean, that needs to be kept in mind. Anytime anybody promises you an outlandish vision about what AI is, you just absolutely have to take it with a grain of salt, because this time is not different.

  Is there a point at which a piece of software or a robot officially becomes “intelligent”? Does it have to pass a certain threshold to qualify as intelligent? Or are we just making a judgment call about when it’s intelligent?

  I think it’s irrelevant in our lifetimes and in our grandchildren’s lifetimes. It’s a very good philosophical question, but I don’t think it really matters. I think that we are going to be stuck with specific AI for a very, very long time.

  And what is specific AI?

  Optimization around a specific problem, as opposed to optimization on every problem.

  So, like, driving a car would be a specific problem?

  Yeah. Whereas if we invented a brain that we can teach to do anything we want, and we have chosen to have it focus on the specific vertical of driving a car, but it can be applied to anything, that would be general AI. But I think that would be literally making a mind, and that’s almost irresponsible to speculate about. It’s just not going to happen in any of our lifetimes, or probably within the next hundred years. So I think I would describe it as philosophy. I don’t know, I don’t have an educated opinion about that.

  Money Machines

  One hears a lot about algorithmic finance, and things like robo-advisers.8 And I’m wondering, does that fall into the same category of stuff that seems pretty over-hyped?

  I would say that robo-advisers are not doing anything special. It’s AI only in the loosest sense of the word. They’re not really doing anything advanced—they’re applying a formula. And it’s a reasonable formula, it’s not a magic formula. They’re not quantitatively assessing markets and trying to make predictions. They’re applying a formula about whatever stock and bond allocations to make—it’s not a bad service, but it’s super hyped. That’s indicative of a bubble in AI that you have something like that where you’re like, “It’s AI!” and people are like, “Okay, cool!”
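
  As a rough illustration of “applying a formula,” here is a sketch of the kind of simple rule-based allocation the speaker describes. The “110 minus age” rule is a common rule of thumb chosen here purely as an assumption; it is not any particular robo-adviser’s actual model.

```python
# Hedged sketch of a rule-based stock/bond allocation. The "110 minus age"
# heuristic is an assumption for illustration, not a real product's formula.
def allocation(age: int) -> dict:
    """Split a portfolio between stocks and bonds using a rule of thumb."""
    stocks = max(0, min(100, 110 - age))  # clamp to the range [0, 100]
    return {"stocks_pct": stocks, "bonds_pct": 100 - stocks}

print(allocation(30))  # {'stocks_pct': 80, 'bonds_pct': 20}
```

  The point is that nothing here is predictive or adaptive; it is a fixed mapping from inputs to an allocation, which is the speaker's complaint about calling it “AI.”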

  There’s a function that’s being optimized—which is, at some level, what a neural net is doing.9 But it’s not really AI.

  I think one of the big tensions in data science that is going to unfold in the next ten years involves companies like SoFi, or Earnest, or pretty much any company whose shtick is, “We’re using big data technology and machine learning to do better credit score assessments.”10

  I actually think this is going to be a huge point of contention moving forward. I talked to a guy who used to work for one of these companies. Not one of the ones I mentioned, a different one. And one of their shticks was, “Oh, we’re going to use social media data to figure out if you’re a great credit risk or not.” And people are like, “Oh, are they going to look at my Facebook posts to see whether I’ve been drinking out late on a Saturday night? Is that going to affect my credit score?”

  And I can tell you exactly what happened, and why they actually killed that. It’s because with your social media profile, they know your name, they know the names of your friends, and they can tell if you’re black or not. They can tell how wealthy you are, they can tell if you’re a credit risk. That’s the shtick.

  And my consistent point of view is that any of these companies should be presumed to be incredibly racist unless presenting you with mountains of evidence otherwise. Anybody that says, “We’re an AI company that’s making smarter loans”: racist. Absolutely, 100 percent.

  I was actually floored, during a recent Super Bowl, when I saw this SoFi ad that said, “We discriminate.” I was just sitting there watching this game, like, I cannot believe it—it’s either they don’t know, which is terrifying, or they know and they don’t give a shit, which is also terrifying.

  I don’t know how that court case is going to work out, but I can tell you in the next ten years, there’s going to be a court case about it. And I would not be surprised if SoFi lost for discrimination. And in general, I think it’s going to be an increasingly important question about the way that we handle protected classes generally, and maybe race specifically, in data science models of this type.11 Because otherwise it’s like, okay, you can’t directly model if a person is black. Can you use their zip code? Can you use the racial demographics for the zip code? Can you use things that correlate with the racial demographics of their zip code? And at what level do you draw the line?

  And we know what we’re doing for mortgage lending—and the answer there is, frankly, a little bit offensive—which is that we don’t give a shit where your house is. We just lend. That’s what Rocket Mortgage does.12 It’s a fucking app, and you’re like, “How can I get a million-dollar loan with an app?” And the answer is that they legally can’t tell where your house is. And the algorithm that you use to do mortgages has to be vetted by a federal agency.

  That’s an extreme, but that might be the extreme we go down, where every single time anybody gets assessed for anything, the actual algorithm and the inputs are assessed by a federal regulator. So maybe that’s going to be what happens. I actually view it a lot like the debates around divestment. You can say, “Okay, we don’t want to invest in any oil companies,” but then do you want to invest in things that are positively correlated with oil companies, like oil field services companies? What about things that in general have some degree of correlation? How much is enough?

  I think it’s the same thing where it’s like, okay, you can’t look at race, but can you look at correlates of race? Can you look at correlates of correlates of race? How far do you go down before you say, “Okay, that’s okay to look at”?

  I’m reminded a bit of Cathy O’Neil’s book Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy [2016]. One of her arguments, which it seems like you’re echoing, is that the popular perception is that algorithms provide a more objective, more complete view of reality, but that they often just reinforce existing inequities.

  That’s right. And the part that I find offensive as a mathematician is the idea that somehow the machines are doing something wrong. We as a society have not chosen to optimize for the thing that we’re telling the machine to optimize for. That’s what it means for the machine to be doing illegal things. The machine isn’t doing anything wrong, and the algorithms are not doing anything wrong. It’s just that they’re literally amoral, and if we told them the things that are okay to optimize against, they would optimize against those instead. It’s a frightening, almost Black Mirror–esque view of reality that comes from the machines, because a lot of them are completely stripped of—not to sound too Trumpian—liberal pieties. It’s completely stripped.

  They’re not “politically correct.”

  They are massively not politically correct, and it’s disturbing. You can load in tons and tons of demographic data, and it’s disturbing when you see percent black in a zip code and percent Hispanic in a zip code be more important than borrower debt-to-income ratio when you run a credit model. When you see something like that, you’re like, Ooh, that’s not good. Because the frightening thing is that even if you remove those specific variables, if the signal is there, you’re going to find correlates with it all the time, and you either need to have a regulator that says, “You can use these variables, you can’t use these variables,” or, I don’t know, we need to change the law.

  As a data scientist I would prefer if that did not come out in the data. I think it’s a question of how we deal with it. But I feel sensitive toward the machines, because we’re telling them to optimize, and that’s what they’re coming up with.

  They’re describing our society.

  Yeah. That’s right, that’s right. That’s exactly what they’re doing. I think it’s scary. I can tell you that a lot of the opportunity those fintech companies are finding is derived from that kind of discrimination, because if you are a large enough lender, you are going to be very highly vetted, and if you’re a very small lender you’re not.13

  Take SoFi, for example. They refinance the loans of people who went to good colleges. They probably did not set up their business to be super racist, but I guarantee you they are super racist in the way they’re making loans, in the way they’re making lending decisions.

  Is that okay? Should a company like that exist?

  I don’t know. I can see it both ways. You could say, “They’re a company, they’re providing a service for people, people want it, that’s good.” But at the same time, we have such a shitty legacy of racist lending in this country. It’s very hard not to view this as yet another racist lending policy, but now it’s got an app.

  When we talk about fintech in general, does that refer to something broader than advising investors when to buy and sell stocks, and this new sort of loaning behavior? Or is that the main substance of it?

  Fintech may most accurately be described as regulatory arbitrage: startups are picking up pieces that a big bank can’t do, won’t do, or that are just too small for it to pick up. And I think fintech is going to suffer over the next five years. If there’s a single sector that people are going to be less enamored with in five years than they are now, fintech is definitely the one.

  The other side of it is that they’re exploiting a hack in the way venture capitalists think. Venture capital as an industry is actually incredibly small relative to the financial system. So if you were starting, I don’t know, a company that used big data to make intelligent decisions on home loans—which is probably illegal, but whatever, you’re small enough that it’s no big deal—and you say, “Hey, we’re doing ten million dollars a year in business,” a venture capitalist will look at you like, “Holy shit, I’ve never seen a company get up to ten million dollars in business that fast.” The venture capitalist has no idea that the mortgage market is worth trillions of dollars and the startup essentially has none of it. The founder gives a market projection like, “Oh, this is a trillion-dollar industry,” and the venture capitalist is like, “Oh, that market is enormous. I’ve never seen numbers like that before.”

  It’s much more of a clever hack than an actual, sustainable, lasting, value-creating enterprise. One of the biggest flagship fintech companies, LendingClub, is in a ton of trouble.14 SoFi is probably illegal. And those are the flag bearers for the sector.

  The other thing that happened was the San Bernardino shootings—apparently the guns that were used were financed by a loan from Prosper, which is another peer-to-peer lender.15 And you just think about where this is going to go. Are we eventually going to get to the point where we have the credit models to assess and not give that guy a loan because of the risk that he could be a Muslim terrorist? Is that the society that we will be living in?