Snorkel AI

Sep 22, 2022

Introduction

Snorkel landed on my radar in early 2021, before their last two raises. It is not often that you see a founding team of Stanford Ph.D.s emerge from stealth with an *estimated* $3.25M seed raise, as well as a fresh $15M cheque from Accel, Greylock, Lightspeed, GV, and many more.

At that time, the company was valued at ~$70M post-money. Since then, the team has raised another two rounds: their Series B was led by Lightspeed in April for a total of $35M at $135M post-money, and their Series C in August was led by Addition and BlackRock, totaling $85M at an estimated valuation of $1Bn post-money (according to PitchBook). The company roughly doubled its valuation from Series A to B (about 1.9x), then multiplied it another 7.4x to reach unicorn status only four months later.
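For anyone who wants to check the multiples, here is a quick back-of-the-envelope sketch. The variable names are my own, and the inputs are the estimated post-money valuations quoted above (in $M), so treat the outputs as rough.

```python
# Estimated post-money valuations quoted above, in millions of USD.
series_a_post = 70
series_b_post = 135
series_c_post = 1000

# Step-up multiple between consecutive rounds.
a_to_b = series_b_post / series_a_post
b_to_c = series_c_post / series_b_post

# Overall step-up from Series A to Series C.
a_to_c = series_c_post / series_a_post

print(f"Series A -> B: {a_to_b:.2f}x")
print(f"Series B -> C: {b_to_c:.2f}x")
print(f"Series A -> C: {a_to_c:.2f}x")
```

So "doubled" is a slight round-up of the A-to-B step, while the 7.4x figure for B-to-C holds up.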

What compelled these investors in the space to write these cheques?

I believe there are multiple factors at play here. Sure, in 2021 we were in one of the biggest bull markets this millennium. But if Snorkel were raising this year, I still believe they would have little problem with their fundraising process (albeit probably at a lower valuation). The sections below describe my rationale:

What problem does Snorkel solve?

To provide a good answer, we need to understand the two components needed to utilize Machine Learning. Let’s say you wanted to train an Autonomous Vehicle to recognize “STOP” signs. At a basic level, you will need:

Training Data: First, you need training data to instruct the ML algorithm (i.e. thousands of pictures of “STOP” signs)
ML Algorithm: Second, you need an ML algorithm, developed and defined by you, that the training data is fed into

You would be wise to assume the second component here is much harder to get right, since that’s where all of the “ML magic” takes place. And you would indeed be correct: the development process is long and complex (which leads to the creation of some very high-paying jobs for extremely intelligent individuals).

However, that assumption slightly misses the mark. It turns out that many companies aren’t able to creatively (and slyly) produce mountains of training data by getting you and me to click on Stop signs when we’re filling out CAPTCHAs. Your ML engine is only as good as the training data you feed it, and you cannot expect great results without providing a large volume of high-quality training data. Without a sustainable way to source your data, you may end up spending 95% of your time labeling data for an ML engine’s consumption.

In fact, the problem goes much deeper than tedious hand-labeling. Here are some other issues one will encounter on the journey to build an ML solution:

Subject Matter Experts
What if we graduate from identifying “STOP” signs to developing an ML-based predictor for lung cancer? You’ll need MDs to spend hours labeling each X-ray and medical record so that an ML engine (which was supposed to help doctors in the first place) can come up with an accurate diagnosis that the doctor could’ve easily made themselves.
Sensitive Data
Sticking with the same example, you also can’t outsource these medical records to another firm to label. Otherwise, you’ll be stuck dealing with a mountain of paperwork and regulatory issues that very quickly nullifies the ROI.
Velocity
What if you’re HBO and you want to use ML to perform social media sentiment analysis on the House of the Dragon premiere? That sounds amazing until you realize that it’ll take at least four months to label all the relevant tweets to produce training data, whereas the show is on air for only 2.5 weeks.
Auditability
Snorkel’s CEO mentioned this one, and I thought it made a lot of sense. Whether you hand-label data yourself or outsource it to someone else, the labels remain prone to human error, and it is incredibly difficult to audit their correctness.

Given all of the above obstacles (and more that I’ve left off), the problem becomes clear: hand-labeling data is expensive, time-consuming, and incredibly difficult at scale. And that is where Snorkel comes into play.

What does Snorkel do?

In a nutshell, Snorkel claims its product is able to label training data at scale, which can then be fed to an internal ML model built by the customers themselves. They make this possible by letting the subject matter expert define several rules and filters, which Snorkel then uses to systematically label the data.
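To make the idea of “rules and filters” concrete, here is a minimal sketch in plain Python, using the social media sentiment example from earlier. To be clear, this is not Snorkel’s actual API: the rule functions, the keyword lists, and the simple majority vote are all my own assumptions for illustration, and a real product presumably combines rule outputs in a more sophisticated way.

```python
import re
from collections import Counter

# Label values (hypothetical convention for this sketch).
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1


def tokens(text):
    """Lowercase the text and split it into simple word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


# Each rule encodes one piece of expert intuition and may abstain.
def rule_praise_words(text):
    return POSITIVE if tokens(text) & {"love", "amazing", "great"} else ABSTAIN


def rule_complaint_words(text):
    return NEGATIVE if tokens(text) & {"boring", "hate", "awful"} else ABSTAIN


def rule_excited_punctuation(text):
    return POSITIVE if text.count("!") >= 2 else ABSTAIN


RULES = [rule_praise_words, rule_complaint_words, rule_excited_punctuation]


def weak_label(text, rules=RULES):
    """Combine rule votes by majority; abstain on no votes or a tie."""
    votes = [v for v in (rule(text) for rule in rules) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # rules disagree evenly, so leave it unlabeled
    return ranked[0][0]


posts = [
    "I love this show, amazing premiere!!",
    "So boring, I hate the pacing",
    "Watched the premiere last night",
    "Great cast but awful CGI",
]
labels = [weak_label(p) for p in posts]
print(labels)
```

The appeal is that the expert writes a handful of rules once, and those rules can then be run over millions of posts. Posts that no rule covers, or where rules conflict, simply stay unlabeled rather than getting a wrong label.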

The dreamer part of me can’t help but get excited about the capabilities Snorkel can unlock for its customers. Most enterprises have the operational and financial means to acquire the talent, software, and hardware needed to build effective ML models. It is often the training data that becomes an insurmountable bottleneck, given the number of person-hours needed to label it manually or to manage an outsourced labeling effort.

Snorkel claims this method can accelerate model development for its customers by 10-100x, a not-so-humble claim no doubt. In an ideal world, I’d definitely spend a night out with some ML friends of mine to verify these claims, but given my lack of free time (and recent resentment of alcohol), I choose to take these claims at face value. It is also helpful that Snorkel lists a variety of case studies to help demonstrate its value.

The impact of Snorkel, according to themselves.

To Bring it Back, What Makes Snorkel Compelling to Investors?

Putting my investor hat on, my thesis to invest/meet the founders boils down to the following few reasons:

Competent Team
I wish I had more time to research the team, but it’s really hard to miss all the “Ph.D”s in PitchBook’s team section… which is exactly what I expect when it comes to an ML Enterprise business. It also helps to see a long list of great investors express faith in the company.
Big Recurring Problem
We covered the problem to an extent above, so I’d like to offer another perspective here. While there’s a great degree of “zero-to-one” difficulty associated with creating the ML algorithm, there is little (if any) upkeep associated with maintaining the algorithm itself. Training data, on the other hand, is a constant “operating expense” and the fuel that becomes the bottleneck if left unchecked. The point is, both problems need solutions, and Snorkel takes care of the annoying one so you can focus on the more challenging one.
Bigger TAM
Snorkel’s TAM is not as well-defined as that of your usual Enterprise SaaS market disruptor. It is a greenfield of customers that haven’t even realized they could use Snorkel yet! As the product evolves, Snorkel’s use cases extend horizontally to almost any industry (given the right sales team, of course). Maybe this fact can in part explain the company’s rumored 30x+ valuation.
Excellent Product
Competent teams solve big problems with great products. I also have a whole section on this above, but did I mention that Snorkel was started at Stanford, where it was constantly tested through rigorous academic evaluation and expert opinion? That’s another stamp of approval that is rare to come by.

Last Thoughts

While I remain optimistic about Snorkel, the company is still in its early stages and will face many challenges before reaching true product-market fit. In fact, I am still not convinced that Snorkel is a platform business, at least not yet. At the current moment, Snorkel provides an extremely helpful tool to its customers, which makes it prone to competitive pressures (or quite frankly, a swift acquisition). That being said, thank you for reading, and I look forward to learning more and seeing how Snorkel develops!

Source: The majority of my sources come from interviews, company websites, and specific research articles I read on the web. I do not have a degree in computer science/engineering, so if you are someone with subject matter expertise, I’d love to hear your thoughts! If you need a source for a particular statement, message me and I’ll share :).

AK Think
