Building the Data Infrastructure Layer for AI
Protege is building the infrastructure that connects holders of valuable real-world data to the companies building frontier AI. Training the next generation of AI models requires far more than text scraped from the internet. It requires data generated from people living their normal lives: de-identified hospital records, sports footage, podcast recordings, surgical videos, and motion capture data.
But this data is scattered across thousands of institutions, bound by privacy regulations, and has never been packaged for AI use. Protege aggregates it, curates it, handles de-identification and licensing, and delivers it to AI companies in a ready-to-use format.
The new capital will fund expansion beyond Protege's current four domains (healthcare, video, audio, and motion capture) into more than ten new verticals, while deepening its data partner network and scaling the team.
Two Exits, One Thesis
Protege is not a first-time founder story. It is the product of two operators who already built and exited two of the largest data companies of the software era, then came back because they saw the biggest data opportunity of their careers sitting wide open.
Protege's co-founders
Bobby Samuels and
Travis May built and exited two of the largest data companies of the software era:
LiveRamp (NYSE: RAMP, market cap approximately $1.8 billion) and
Datavant ($7 billion merger). This is the third time they are working together, and their entire thesis for Protege comes from what they learned building data networks across both companies.
"One of the things that LiveRamp taught me was in the world of data networks, it is supply. You gotta figure out the supply and then you figure out demand. I'd say for us now, supply is probably our key constraint, not demand."
Bobby Samuels
CEO of Protege
At Datavant, Bobby led the privacy business called Privacy Hub before the $7 billion merger. That experience gave him something less obvious than domain expertise: a gut-level framework for handling sensitive data that now runs through everything Protege does.
"One of the big values that we have is the loved ones test, which is if a loved one saw what we were doing, would they be proud of it?"
"Anytime we run into any sort of question, whether that's dealing with a partner or a data type, or a model builder or an investor, we always run it through that lens."
"We're dealing with really sensitive stuff and we owe it to everyone to work with the data accordingly."
Bobby Samuels
CEO of Protege
The founding team includes Chief Scientific Officer Engie Zon, a professor Bobby met through Datavant, and CTO Richard Ho, who leads engineering. Bobby is blunt about the division of labor: "I'm not an engineer. I am not a data scientist. I can usually ask the right questions or something close to the right questions. But without Engie and Richard, we wouldn't have a business."
The founding moment was simple. Travis had seen demand for AI training data in healthcare. Bobby believed the opportunity extended far beyond a single vertical. Once that clicked, speed became everything.
Public data was running out, AI companies were desperate for new sources, and waiting even a few months meant losing the window. As Bobby recalls, when a couple of major 2025 deals materialized that would not have been possible had they started three months later, the urgency proved justified.
"When we were coming up with this idea, Travis said, look, I've seen this need for training data in healthcare. And then I was like, I think this is everywhere. And as soon as we got there, it was just like, let's go."
Bobby Samuels
CEO of Protege
The Internet Is 0.0000002% of the World's Data
2024 was the year of scraping publicly available internet data. AI companies consumed massive volumes of text, images, and code from the open web to train their models.
But those sources were quickly exhausted. In 2025, manufactured data rose to fill the gap: companies like
Mercor hired doctors at $170 an hour to create mock patient cases, while platforms like
Scale AI expanded their data labeling and annotation operations to serve foundation model builders.
Bobby sees a hard boundary on how far that can go. He draws a line between three categories:
"Synthetic data is data that is created by AI to train AI. Manufactured data is data that is created by humans, specifically for AI. Real-world data is data that is created by humans living their lives, and that is being used to train."
Bobby Samuels
CEO of Protege
"Ultimately, you want data that is most representative of the human experience, and real-world data is that subset."
"Synthetic data is like reading a book, having the model then write a book, and trying to learn more from what you wrote. It just doesn't make sense."
Bobby Samuels
CEO of Protege
The scale of what's still untapped puts this in perspective. A single
Common Crawl release, a snapshot of the entire open web, contains hundreds of terabytes of uncompressed content. Meanwhile, total global data is set to surpass 175 zettabytes (175,000,000,000 terabytes). The entire internet accounts for a fraction of a percent of all data in existence.
Throughout 2024, the prevailing narrative in AI was that the industry was running out of data. Bobby watched that conversation unfold and thought it was fundamentally wrong.
"There was all this coverage of, 'we're out of data, we're out of data,' and we firmly believe that that's wrong. We're not out of data, we're out of public data."
"If you think about all the data that's generated, we're just data machines. Most of it is not captured. And so when you say we're out of data, it's like, no, we're not."
"We're out of publicly available scrapable data, but the vast majority of the world's data is still to be gleaned."
Bobby Samuels
CEO of Protege
Why AI Companies Can't Source Real World Data Alone
AI companies know real-world data matters. So why can't they get it themselves? Because there are structural bottlenecks on both sides of the market.
On the AI company side,
even the largest foundation model builders lack the internal resources and time to source, negotiate, transform, audit, and deliver data from thousands of partners, each with their own industry-specific nuances. Direct deals like the ones
between Reddit and OpenAI or Google exist, but they are the exception, not the norm.
On the data holder side, the problem is the mirror image. Most organizations sitting on valuable data have no AI expertise, no sense of what their data is worth, and no intention of building an AI sales team. Bobby explains with a real example.
"For a model builder, we are a one-stop shop for their data. For a data holder, their business may be something totally different."
"We work with a Polish soccer league and AI data's not their business, but we can be an extension of their business to help drive this totally new line of revenue in a way that's a very small lift for them."
Bobby Samuels
CEO of Protege
Bobby explains the practical difference Protege makes on the buyer side with another example.
"We've worked with partners where they had these huge complex acts around healthcare that would've taken them a year plus to figure this out on their own."
"And because we had built all the relationships, we understood how the data could be married together, we understood the privacy constraints, we understood how the delivery needed to happen."
"We were able to do all this in the course of weeks instead of years."
Bobby Samuels
CEO of Protege
"I think the industry that Scale and others are participating in is really significant, but it's not our area."
"We're good at data and real world data and building those networks. We're not an annotation or labeling firm and we really stick to what we do."
Bobby Samuels
CEO of Protege
This is also where the distinction from companies like
Scale AI and
Mercor becomes clear. Those companies primarily generate manufactured data, hiring humans to create data to specification. Protege connects existing real-world data to buyers through licensing and revenue-sharing agreements.
$30 Million GMV and a Flywheel Taking Shape
In 2025, Protege recorded over $30 million in GMV, growing more than 20x year over year. The company now works with hundreds of data partners and dozens of customers across industries and around the globe, including the majority of the Magnificent Seven. An early concern was whether AI data deals would be one-time purchases. Instead, the majority of early customers signed multi-year deals or expansion contracts within six months.
The business has started to exhibit a flywheel dynamic. Foundation model companies that originally came to Protege for healthcare data now source audiovisual and other data types through the platform as well. More data attracts more buyers. More buyers attract more data providers.
"So much of our business is around trust from data partners and model builders that we're doing right by them. And it's something that we hold sacred."
"The trust is sacred. We've been able to establish that quickly, and now we need to just continue to build on that, because it's the core of the secret sauce."
Bobby Samuels
CEO of Protege
The revenue model reinforces that trust. Rather than one-time data sales, Protege uses a royalty-based revenue share with its data partners. Partners earn payouts each time their data is used. Bobby says the structure was as much about speed as principle: "We chose the royalty model in part because it really aligns incentives: you make money when we make money. And it made it easier for folks to just move faster." Protege has already distributed tens of millions of dollars to its data partners.
Real World Data Will Power Every Layer of AI
Protege currently operates across four domains: healthcare, video, audio, and motion capture. Existing customers are already requesting data types outside those verticals. The plan for 2026 is to expand into more than ten new domain areas, deploying general managers to run each as an independent business unit and making strategic investments to quickly build a critical mass of data supply.
"I fundamentally believe that AI's impact on the world can be and hopefully is extremely positive. If we can play a role in accelerating the development of healthcare and its utilization of AI, that's huge."
"People live longer and have richer lives. And it's not just healthcare."
"If you're an 18-year-old kid, you don't have much resources, but you have a great idea for a story, 20 years ago you'd have to figure out how to cobble together the resources and you probably couldn't. Now the ability for folks to tell stories is so much more at people's fingertips."
Bobby Samuels
CEO of Protege
The team's view is that real-world data will underpin every layer of the AI stack, from pre-training to post-training, fine-tuning, evaluation, RAG, and continuous learning, across every industry. What started as a validation of AI training and evaluation data in a few domains is expanding into something much larger.
Protege is targeting a nine-figure GMV in 2026 and $1 billion in GMV within three years. When asked how he will know the company has succeeded, Bobby resists the idea of a finish line.
"Protege will have succeeded if we can meaningfully accelerate the development of responsible AI."
"But ultimately, the work is never done. Microsoft started 50 years ago and they are still pushing hard. When you're building a company, you don't really think about the finish line."
"It's just how do you build something impactful and hopefully more and more impactful over time."
Bobby Samuels
CEO of Protege
He pauses, then offers a simpler version.
"To build great AI, you need great data, and that's what we make possible."
Bobby Samuels
CEO of Protege
Further Reading
About This Funding Round
Bobby Samuels on a16z's Raising Health Podcast
Understanding the AI Data Landscape