NY Information Security Meetup (NYIS) Fireside Chat
with Vid Jain,
Founder and CEO of Wallaroo Labs
June 25, 2020
David Raviv: Welcome to the New York Information Security Meetup/Cyber Guild. I appreciate you taking the time to chat with me today.
David: You're a PhD. I’ve had several PhDs who are founders in recent years, and it's always fascinating because you had a great corporate job. You were successful working for someone else. Why did you decide to move from the light side to the dark side and create your own business?
Vid Jain: I actually like working for myself. There are pros and cons to that, but at the end of the day I feel like I’ve accomplished something and I have things that are under my control. My frustration with working at a large company is really the slowness at which it moves, the lack of transparency, and the difficulty of understanding how the whole thing fits together.
As you mentioned, I have a PhD from UC Berkeley and I did research. It's not a giant corporation that's doing research; it's a small team, and you collaborate. You iterate quickly, and I like that. So, for me it's a very natural fit to go to a startup. The one time that I did work for a big company which was Merrill Lynch, it was like an internal startup. My job there was to work alongside eight other people to build a high frequency trading business within a very manual trading business which involved thousands of people. So even though I was working in an organization that had 100,000 plus employees we were treated as a startup internally. We had funding like a startup and our job was to move quickly and we did that. We built a huge business in three years, a billion dollar business and that was exciting. I enjoy building new things and disrupting old ways of doing things and that's what we're trying to do now.
David: In terms of trading, are there skill sets that you took away and were able to apply to Wallaroo Labs? Some of that knowledge, the ability to scale and process data?
Vid: The trading business that we built at Merrill Lynch is a data business using computerized trading strategies. Nowadays we call them machine learning models. You develop models and then you run on tons of data points every second. We had 25 data scientists or quants that developed these models, but we had hundreds of other people that were doing productionization, that were supporting, that were operating, and scaling. We spent 100 million dollars a year on infrastructure. So the biggest takeaway was that you can relatively quickly develop some very powerful models towards extracting value from the data. But getting that into production against live data at speed and scale is a hard, expensive task.
When we started Wallaroo Labs our mission was to simplify that. Our mission was: you have a model, you have an algorithm, you have some analytics idea and we wanted to take that experience and some of the technological tricks to really get you up and running very quickly and to let you run at large scale and very low cost. That was born out of that trial by fire in the trading business.
David: What's the origin of the name? What is a Wallaroo?
Vid: It's an Australian animal, it's like a small Kangaroo. Our developers came up with Wallaroo as an idea for our product name and the rest is history. It seemed if Wallaroo resonated internally it would resonate as a name to data engineers and data scientists externally.
David: People are saying data is the new oil. Can you touch upon the problem of aggregating big data and making sense of it? There's all kinds of jargon out there that people use.
Vid: You're right - there's so much jargon it's hard to keep track. Fundamentally where we fit in are three trends: one is around the movement away from batch processing to what's now known as stream processing and the reason that's important is because there's a lot of value in the data as soon as it’s generated; the second is AI; and third the explosion of data.
David: It'd be great if you can give us a batch 101.
Vid: So batch processing is what everybody is usually very familiar with. Data is generated. It might be generated from your mobile phone, or from all the interactions that you're doing on the web. It might be generated from your credit card transactions. Data is generated in real time as life happens. But traditionally that data is not analyzed straight away. It's dumped somewhere in a data store or database, and then at the end of the night, or the end of the week or the month, all those transactions, all those events, are analyzed in one go, in one batch. So you're not analyzing the data or acting on the data as soon as it's generated, you're batching it up. That batch could run every hour, at the end of the day, once a week, or once a month. That's batch processing.
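As an illustration of the batch pattern described above, here is a minimal Python sketch. The event fields, the sample values, and the function name run_nightly_batch are hypothetical and chosen only for this example; it is not a description of any particular system.

```python
from collections import defaultdict

# Hypothetical events: (hour_of_day, customer_id, amount_in_dollars).
events = [
    (9, "alice", 20.0),
    (11, "bob", 35.5),
    (15, "alice", 12.5),
    (22, "bob", 4.5),
]


def run_nightly_batch(stored_events):
    """Analyze a full day's stored events in one pass."""
    totals = defaultdict(float)
    for _hour, customer, amount in stored_events:
        totals[customer] += amount
    return dict(totals)


# During the day, events are only appended to storage; nothing is analyzed.
data_store = []
for event in events:
    data_store.append(event)

# At the end of the day, the whole batch is processed at once.
daily_totals = run_nightly_batch(data_store)
print(daily_totals)
```

Nothing is learned about the 9 a.m. transaction until the nightly job runs, which is the delay the conversation goes on to contrast with stream processing.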
David: Why has that traditionally been the way, and what are some of the drawbacks?
Vid: Well there's a lot of different reasons. First, there was no expectation that you had to do something straight away. People and systems didn't operate at that speed so you didn't need it, there was no requirement. Number two, in a lot of enterprises you had very monolithic databases that stuff went into and it was hard to get stuff out very quickly and operate on it very quickly. So it was just very expensive and complicated to do anything other than let me write a query on some database, let me get a million records, and let me run some analysis on those million records.
David: What is stream processing?
Vid: Stream processing means you're acting on those events and understanding those events as they happen. For example, there's a transaction. You buy with a credit card, and you are looking at that transaction not at the end of the day, not at the end of the month; you're looking at it right then. You're trying to figure out if it's fraud, or you're trying to figure out what else to recommend to the customer if they just bought blue jeans.
Stream processing is being driven by two things. One is that there's a lot of value in that data straight away. If you're looking to buy blue jeans and I want to recommend something that goes with those blue jeans, it's far less compelling for me to recommend it at the end of the month, because you've probably already bought whatever you were going to buy. Whereas when you're shopping at Banana Republic or wherever, if I can recommend something based on what you're putting in your shopping cart, there's some chance that you might actually buy it. Same with financial trading. Same with security: if there's an intrusion, there's value in knowing that straight away. So one reason is that there's a lot of value in that data that you need to get to quickly.
The second reason is that there's more and more data. Batch processing was okay when you started with megabytes of daily data and then went to terabytes, but now you're in the terrain of petabytes of data a day. The problem with trying to do petabyte-scale processing the batch way is: how can you keep up? The other problem is storage, the cost of storage. So it becomes increasingly difficult. With what's going on in security and in IoT in particular, you are seeing use cases where the amount of data generated in a day is 100X more than even what we had a few years ago. The data coming from connected cars, the data coming from manufacturing facilities, the data coming from a million IoT devices far exceeds the amount of data that we've ever been used to. So what typically happens where people don't know how to deal with that data straight away is that they store it and then at some point they flush it, because they can't afford to save it unless there are regulatory reasons. But it's not getting monetized. It's not getting used. It just becomes hoarding. If you actually want to do anything with the data, you have to look at that data straight away, not just stick it somewhere and forget about it.
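To make the streaming idea concrete, the following Python sketch scores each credit card transaction the moment it arrives and keeps only a small running state per customer, so raw events never need to be stored. The class name, the thresholds, and the "five times the running average" rule are assumptions invented for illustration; this is a toy rule, not a real fraud model and not Wallaroo's implementation.

```python
from collections import defaultdict


class StreamingFraudCheck:
    """Flags a transaction as it arrives, keeping only small running state."""

    def __init__(self, ratio_threshold=5.0, min_history=3):
        self.ratio_threshold = ratio_threshold  # hypothetical toy threshold
        self.min_history = min_history          # events needed before flagging
        self.count = defaultdict(int)           # events seen per customer
        self.mean = defaultdict(float)          # running average amount

    def on_event(self, customer, amount):
        n = self.count[customer]
        avg = self.mean[customer]
        # Decide immediately, using only what has been seen so far.
        suspicious = n >= self.min_history and amount > self.ratio_threshold * avg
        # Incremental mean update: no raw events are retained.
        self.count[customer] = n + 1
        self.mean[customer] = avg + (amount - avg) / (n + 1)
        return suspicious


checker = StreamingFraudCheck()
stream = [
    ("alice", 20.0),
    ("alice", 25.0),
    ("alice", 15.0),
    ("alice", 400.0),
    ("alice", 22.0),
]
flags = [checker.on_event(customer, amount) for customer, amount in stream]
print(flags)
```

The memory used depends on the number of customers, not on the number of events, which is one way a streaming design sidesteps the storage problem described above.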
David: Knowing that most organizations can't even keep up with it, what's the point of a device spitting out so much data?
Vid: Well, the question of what to do with the data was an afterthought. In many places where we're talking to customers, they've invested in data science and they've got little research projects, but they haven't figured out how to put them into production. In the automotive industry there are hundreds of sensors in the car. There are thousands of data points that are continuously generated in a modern connected car, and several terabytes of data in one day. There are all these different computing units on the car, there's actually a networking bus - modern cars are very complicated things. You have tire pressure, you have the common things, but you have all sorts of other crazy things. So there's a lot of data, and the thought of what the hell to do with it has for the most part been left as an afterthought. People read these McKinsey reports where they say that all this data will generate 750 billion dollars in value, but nobody's really thought through how we actually generate that value. So they're sinking a lot of cost and infrastructure investment into all this IoT stuff, but they've left the other part of it - how do we actually monetize that data - to later. At least that's my experience in talking to folks.
David: So your goal is to help these organizations mine value out of that ocean of data as fast and at as low a cost as possible. What are the issues aside from capturing it and scaling it?
Vid: For most organizations, it's a little haphazard so it depends on the maturity of the organization and on how long they've been dealing with data. So if you look at a large enterprise that has historically been data-driven like a bank you have different groups within the bank. You've got the security group, you've got the credit card group, you've got the trading group, you've got the risk management group. These individual groups have done their independent thing and they're technically quite advanced and they started doing some data science stuff. The first place they need help is the fact that they've got disparate data. The data isn't normalized into a standard way and they're using different tools and different technologies to get their analysis and science models launched.
David: What do you mean by that, what does that look like and why is that even a problem?
Vid: Let's take a very simple example. Let's say you want to build a model around credit card fraud. That seems relatively straightforward, right? But forget even the model; you have all sorts of data problems. First of all, you have a transaction. Great, but a good model probably doesn't just look at the transaction, it looks at the customer's history and profile. So now you need a database which holds a bunch of consolidated customer information, and as you know, within these large enterprises there are typically many different data stores with customer information in different formats. So you immediately have to solve the problem of bringing all that data together and creating some kind of unified view of the customer. Another thing: it's not just a matter of putting these models into production. You have to validate these models after the fact and audit them, and be able to reproduce results. (The reason I use a credit card example is that if you're in security you can understand it, and if you're in consumer marketing you can understand it. It's an example everybody understands.) Let's say you have to be able to explain later on whether this model is working and whether a transaction which was denied was correctly denied. If that transaction was denied a month ago, you need to know all the information that went into that transaction and that scoring as of a month ago. If you don't have that, then you have to go to all the different source systems again and reconstruct that data: this was the transaction, this is what the model saw, this is what the prediction was, this is what the customer profile was a month ago.
Fundamentally, before you can actually do anything with the data, you have to solve these problems around data being fragmented, sitting in different places, and being structured in different ways. You have to solve that to make any progress.
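One way to picture the audit requirement in the credit card example is to snapshot everything the model saw at scoring time, so that a denial can be explained later without revisiting the source systems. The Python sketch below is a minimal illustration under stated assumptions: ScoringRecord, toy_model, score_and_record, the in-memory audit_log, and all field names and values are hypothetical, and a real system would use durable storage and a real model.

```python
import copy
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ScoringRecord:
    """Everything needed to explain one decision after the fact."""
    transaction_id: str
    scored_at: str          # ISO 8601 timestamp supplied by the caller
    model_version: str
    transaction: dict
    customer_profile: dict  # profile as it was at scoring time
    score: float
    decision: str


def toy_model(transaction, profile):
    """Hypothetical stand-in for a real fraud model; returns a score in [0, 1]."""
    ratio = transaction["amount"] / max(profile["avg_amount"], 1.0)
    return min(ratio / 10.0, 1.0)


audit_log = {}  # transaction_id -> serialized ScoringRecord (in-memory stand-in)


def score_and_record(transaction, profile, scored_at,
                     model_version="toy-v1", threshold=0.5):
    score = toy_model(transaction, profile)
    decision = "deny" if score >= threshold else "approve"
    record = ScoringRecord(
        transaction_id=transaction["id"],
        scored_at=scored_at,
        model_version=model_version,
        transaction=copy.deepcopy(transaction),
        customer_profile=copy.deepcopy(profile),
        score=score,
        decision=decision,
    )
    # Serialize immediately so later profile changes cannot alter the record.
    audit_log[record.transaction_id] = json.dumps(asdict(record))
    return decision


profile = {"customer": "alice", "avg_amount": 20.0}
decision = score_and_record({"id": "tx-1", "amount": 400.0}, profile,
                            "2020-05-25T10:00:00Z")

# The live profile changes after the decision was made...
profile["avg_amount"] = 115.0

# ...but the audit record still shows what the model saw at scoring time.
replayed = json.loads(audit_log["tx-1"])
print(decision, replayed["customer_profile"]["avg_amount"], replayed["score"])
```

Re-running toy_model on the replayed transaction and profile reproduces the stored score, which is the kind of reproducibility the answer above asks for.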
The second problem: let's say you solve that, and let's say you have good data to work with. The problem that we're attacking is that once you have a model, bringing that model to live data can be a very hard thing, whether you're a large bank or any of a whole host of companies. The simplest way to understand that is to take Twitter as an example. One of the core applications of Twitter is trends. If I give any developer a million tweets, they can easily write some code that tells you that the top five trending topics are Trump, Black Lives Matter, etc. But then I say, okay, can you take that and bring it to the live data, the tens of thousands of tweets a second, in a way that works, doesn't fall down, scales, and is resilient? If I told you that's an hour's work you would kick me out of the room. It's not even a week of work. If I told you it was a month of work you might say, well, I'm not sure it's a month of work, but let me see if you can actually do it. Twitter took years to scale that infrastructure. So there are a lot of problems around building automated systems, around running machine learning algorithms, around automating things like security or fraud.
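The gap in the Twitter example can be sketched in Python. The first function is the "easy" version over a fixed list of tweets; the class is a single-process sliding-window version for a live stream. Both are illustrative assumptions rather than how Twitter or Wallaroo works: tweets are represented as lists of hashtags, timestamps are assumed to arrive in order, and nothing here addresses distribution across machines or failure recovery, which is where the hard work described above lies.

```python
from collections import Counter, deque


def top_trends_batch(tweets, k=5):
    """Count hashtags over a fixed list of tweets (each a list of tags)."""
    counts = Counter(tag for tweet in tweets for tag in tweet)
    return counts.most_common(k)


class SlidingWindowTrends:
    """Trends over the last `window_seconds` of a live stream, in one process."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, tag) in arrival order
        self.counts = Counter()  # tag -> count inside the window

    def add(self, timestamp, tags):
        for tag in tags:
            self.events.append((timestamp, tag))
            self.counts[tag] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_tag = self.events.popleft()
            self.counts[old_tag] -= 1
            if self.counts[old_tag] == 0:
                del self.counts[old_tag]

    def top(self, k=5):
        return self.counts.most_common(k)


# Batch: all tweets are available up front.
batch_result = top_trends_batch([["#a", "#b"], ["#a"], ["#a", "#b"], ["#c"]], k=2)

# Stream: tweets arrive one at a time and old ones age out.
trends = SlidingWindowTrends(window_seconds=60)
trends.add(0, ["#a"])
trends.add(10, ["#a", "#b"])
trends.add(65, ["#b"])  # the tweet from t=0 has now expired
stream_result = trends.top(1)
print(batch_result, stream_result)
```

Even this toy version has to manage expiry and state; making it survive machine failures and tens of thousands of events per second is a different order of effort.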
There are a lot of different problems that customers have to solve, and we're not trying to solve all of them, but we're hoping to help them solve at least a good chunk of them.
David: Do you also provide help with the algorithms, or does the customer typically do that on their own?
Vid: Typically the customer has data scientists or they have analytics developers and we work with those teams to deploy at scale into our platform.
David: What is the secret sauce? How do you take that massive amount of data and process it in real time without the system just going belly up?
Vid: We wanted to achieve several different things with our platform: on the one hand simplicity and power, and on the other hand performance and scalability. They are often at odds, and that's where the secret sauce comes in. When I say simplicity and power, think of things like Python, which is just exploding. It's a simple language that doesn't require you to know distributed computing, and it allows you to write relatively simple code. There are fewer and fewer developers working in Java or C, and more and more working in Python. Part of the reason is the powerful toolkits like TensorFlow, etc. So there's a lot of power, there's a lot of simplicity. But Python is not very performant, and it doesn't scale. When you're going to throw a boatload of data at it, that's really hard. A fast distributed system that's resilient is what you want in production.
What we wanted is to let people write simple code, use the toolkits that they are familiar with. And then on the back-end, get this very scalable very fast system. The idea is that you write at the level of Python, you don't worry about all this distributed stuff, you use your TensorFlow or whatever, standard analytics stuff. And then it's going to run in a distributed environment which has the performance of C running across multiple CPUs on multiple machines. The Python that works on your laptop on a batch with a million records will in typical ML cases run without changing your workflow on a trillion records and it will run really fast and really cheap. That's our vision, it's what we've been building.
And why is low cost a concern? The cost of processing an event has to fall by 10x from where it is right now, because if you look at IoT and project out four years, there's 20 zettabytes of data per year altogether. That sounds made up, but a zettabyte actually exists. It's way more data than we're used to dealing with. And if you apply today's cost structure to it, nobody's going to process it. If your compute bill is $10M now, you’re not spending $100M in 4 years. What's going to happen is you're going to descope your project. You're going to just throw away most of this data. That's what's already going on. Did you know that major airlines like American Airlines generate 40 petabytes of data a day from their flights? The percentage of that data they use is more like two percent. So if you can't reduce the cost and you can’t increase the speed, people are just going to throw away their data, or they're going to stick it somewhere and forget about it and then at some point wipe it clean. So in order to actually do all these crazy things, like when you read the McKinsey report where AI predicts you have a health problem and the drone comes and flies you off to the hospital, you need to be able to process data much faster and at much higher scale with much lower cost. Otherwise none of it will happen, because people can't deal with the data: it's too complicated, it's too costly, it takes too long.
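The arithmetic behind the $10M and $100M figures can be checked with a short Python calculation. This sketch assumes, as a simplification not spelled out in the conversation, that data volume grows 10x over the period and that the bill scales linearly with volume; the function name projected_bill and its parameters are hypothetical.

```python
def projected_bill(current_bill, data_growth, unit_cost_reduction=1.0):
    """Future compute bill if cost scales linearly with data volume.

    data_growth: factor by which data volume grows (assumed, e.g. 10).
    unit_cost_reduction: factor by which cost per event falls (1.0 = no change).
    """
    return current_bill * data_growth / unit_cost_reduction


current_bill = 10_000_000  # the $10M bill from the example above

# Same cost per event, 10x the data: the bill grows 10x.
bill_at_todays_cost = projected_bill(current_bill, data_growth=10)

# Cost per event falls 10x: the bill stays where it is today.
bill_with_10x_cheaper_events = projected_bill(
    current_bill, data_growth=10, unit_cost_reduction=10.0
)

print(bill_at_todays_cost, bill_with_10x_cheaper_events)
```

Under these assumptions, a 10x drop in cost per event is what keeps the budget flat as the data grows 10x.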
David: In the cyber security space you mentioned IoT. Where do you think that the applications are right now, the next 12 to 18 months? Where do you think that Wallaroo is an applicable solution?
Vid: Everything spits out a lot of data points. If you think about the military, it's an environment with a lot of IoT on the edge and you're going to have to do a lot of processing at the edge, not just in a centralized cloud. Then think about the move to cloud. Cloud infrastructure at one large bank generates a million messages a second and they haven't figured out how to analyze it in real-time. So there's a lot of this new infrastructure, whether it's IoT or cloud or whatever, generating huge amounts of data that people are trying to figure out how to analyze in real-time. Firms are extracting some value out of their data, but there's a lot more value, and if you could figure out how to analyze that quickly you could do a lot of things with it. That's not always a security thing, but it's the same infrastructure, and the data has a lot of different applications: not just security but traffic routing, customer analytics, etc. So the interesting thing is that the same data can support a variety of very valuable applications, and it doesn't necessarily make sense to attack these as completely separate groups doing completely separate things, because you're replicating a lot of the work.
David: You know it sounds to me that there's a place for someone to tackle a specific cyber security problem - take your platform and then build a service around it, package it, write the algorithms to solve that problem and then go to market with it where it’s powered by Wallaroo. You're dealing now with some very large companies. But it might be a place where mid-size businesses do not have the capacity to tackle this problem on their own; someone might be able to fill that gap using Wallaroo.
Vid: We're very open to working with partners, we can't do everything.
David: Where do you see the company in two or three years?
Vid: We're hiring business development people and technologists. In the last two years the amount of data has exploded, and the desire for people to act on that data in real time vs. batch has changed: doing it at the end of the day or at the end of the week used to be good enough, but it's no longer good enough for a lot of people, and they're starting to try and figure out what to do. The other thing that's happened in the last two years is that AI has really exploded. Two years ago you wouldn't see data scientists in many of these organizations. Now they all have some exploratory or mid-level maturity of data science, and the problem is getting it live. So I think there are a lot of different trends that are converging: the data science, the move from batch to real time, and the explosion of data. We're actually very well situated, with those trends converging, to help other companies get value from their data.
Our simple task is to let our customers focus on what they're good at and we take care of the plumbing, that's it. We're plumbers fundamentally. We want you to monetize the data that you have. We want you to be the best in that category and optimize your resources.
David: I agree that the trend of companies focusing on their core competencies is being established now. I think companies realize that they should focus strictly on what generates money for them and then let somebody else deal with the rest. Parting words, the past several months - how did it affect your company, your business, and maybe some positive changes in your life in particular?
Vid: I don't think there are any positive changes in my life. Some folks say that they’ve learnt how to cook or established new hobbies. Unfortunately I just end up working more, so I could learn from these other folks. When Covid happened we decided to take a step back and reevaluate the marketplace, and so there are a few things that we're doing. We used to be more horizontal and we're focusing now on a few verticals: financial services, IoT, and security. Secondly, we're leaning into AI right now.
We just had a call today with a company that develops security and other risk applications for their customers, and they have huge amounts of data that they are gathering from the different customers they service. The customers are saying we want to do this with the data and we want to do that with the data. What's interesting, and what we're now seeing over and over, is that there are a lot of companies that are aggregating different sources of data from their customers and internal sources. They're creating data lakes - and this is actually even true within the military - and they're trying to figure out how to give their customers, their end users, more access to the data and a better ability to actually run analytics or a model on it. So the other trend that is going to be big is a kind of self-service AI. There's all this data, and there's only so much that three data scientists or two analytics developers within one group can do with it. So if you can open up access to the data and give people tools to analyze it, that's great internally for an enterprise organization, and that's great for providing solutions to customers, because then the customers don't have to deal with the infrastructure and they don't have to deal with the scale issues. They can hire one data scientist, and that data scientist can go in and create tons of value on the customer's own data.
David: The whole space is super interesting, and I think there are going to be more and more applications as organizations realize what they can do. Self-service makes total sense. It will allow people to just go in and ask a query using NLP, etc., and the machine will just spit out the answer without their needing to know what the underlying plumbing is.
Vid: Exactly. I mean who knows how the model works. You spin up something, you run some TensorFlow on it and so you don't need to know how the model works. It just gives you a result that's a pretty good result. You shouldn't need to know how the infrastructure works either.
David: Vid, it's been a real pleasure. I learned a lot. I hope the folks on this call learned a lot as well. I think you tackle a real big data problem. It's a worthwhile endeavor and I appreciate that. So thank you very much for joining.
Vid: Thanks for having me. I invite folks to please connect with me on LinkedIn. I’m happy to hear their thoughts or talk about use cases. I’m always willing to learn from other people's experiences.