Grey Beards on Systems - 62: GreyBeards talk NVMeoF storage with VR Satish, Founder & CTO Pavilion Data Systems
Episode Date: June 16, 2018
In this episode, we continue on our NVMeoF track by talking with VR Satish (@satish_vr), Founder and CTO of Pavilion Data Systems (@PavilionData). Howard had talked with Pavilion Data over the last year or so and I just had a briefing with them over the past week. Pavilion Data is taking a different tack to …
Transcript
Hey everybody, Ray Lucchesi here with Howard Marks.
Welcome to the next episode of the Greybeards on Storage monthly podcast,
a show where we get Greybeard storage system bloggers to talk with storage system vendors
to discuss upcoming products, technologies, and trends
affecting the data center today.
This Greybeards on Storage episode
was recorded on June 8th, 2018.
We have with us here today, VR Satish,
founder and CTO of Pavilion Data.
VR, why don't you tell us a little bit about yourself
and Pavilion Data?
Thanks a lot, guys.
I'm VR Satish, founder here at Pavilion Data.
I've actually been a lifer at Veritas,
and then it got acquired by Symantec.
So I ended up there.
Towards the end, I was the CTO for the storage division.
That's where I am.
So here I am, excited to be doing my own company,
working on bleeding-edge technology around NVMe and NVMe over Fabrics.
Well, bleeding-edge is the best technology.
So what is Pavilion Data, and what's going on with it from the perspective of availability and that sort of stuff?
So in terms of the product, I mean, the basic premise is what interested me, let me start with that. You know, we have had storage systems that have been pretty much stagnant in their design for
the last 25 to 30 years, and they've been designed for spinning media. You know, as time goes by,
with new innovations from Intel with NVMe and the whole consortium of companies,
I mean, we have gotten to the era of solid state everything. Now here we have media,
which is tremendously fast, reaching as fast as network speeds, if I may use that word.
And it's time for us to just look at, is it still valid to do what we were doing for the last 30
years in this new environment? Or does the entire storage system need to be completely redesigned
from the ground up so that we can keep up with the speeds of the media?
And that's precisely what we're doing. We believe we have a storage system
that has been designed from the ground up with the faster medias
in mind. And we have a great product and the product
is available right now and it's shipping and
people can buy it and use it and try it out for themselves. So are you using NVMe storage SSDs
and that sort of stuff?
That's exactly right. It's not only just on the backside using the SSDs alone.
See, let me get a little bit philosophical here. You know, I am a big believer that great companies are formed
because they came up with a very simple question
and answered it with a very simple answer, a profound answer.
So we ask this fundamental question saying, okay, like I said,
media is reaching the speeds of network.
I mean, if you take a bunch of NVMe drives,
they can actually
sink up to a terabit per second. And that's pretty much what you have on the network side.
So if that's the case, you know, if you go walk into a data center and ask the question,
what is the one device which you see in the data center that can ingress a terabit per second and
egress a terabit per second at line rate?
These are not the servers.
These are the networking devices.
They figured out how to do this at scale.
So we said, okay, if that's the case,
why isn't storage made like a network device and why is it made like a server?
And isn't all that network stuff a lot of hardware-intensive development,
that sort of thing, to get to those speeds?
Well, that was the case long back when Cisco originally started.
There are lots of ASICs and FPGAs and all that stuff.
But if you go to the world today, you know, you have the Quanta guys who are using merchant silicon from Broadcom and all these people
and building up a system.
It's just good hardware design, but not necessarily at the component level.
You can go get the components at Fry's, if I may use that word.
You know, it's the local department store for electronics. And that's just fair. You have to just put them together and
have a really good design, a system design. I'll give you that the network gear can carry
packets at essentially line rate at a couple of terabytes per second for a switch.
But that's just plumbing.
They're not actually doing anything.
They're not storing the data.
They're not retrieving the data.
They're just moving it from port to port.
That's the easy part.
Okay.
That's an excellent question.
The first question you have to ask is, how do I move the data?
Now, once you know you can move the data, you have to now ask the second question,
the tougher one, according to you,
is how do I process this data?
How do I store this effectively
with whatever format you want to store it
at that line rate?
Maybe give or take a little bit.
I'm not saying, look, you're going to do
a terabit per second at line rate.
I mean, I don't think that's needed.
But you take the fundamental concepts
of how they were moving data so fast
and interject the storage concepts in between and say, maybe I will not move it at line rate. I'll be close
enough. So what we essentially did was borrowed some of the basic underpinning concepts of the
network design, network switch design, but we incorporated a lot of storage concepts on top
of that. Because I'll give you an example. Network world is very forgiving.
You can drop packets and it's okay.
Somebody will retry it.
In storage, you can't do such things.
Depends on the network you're talking about.
In the typical Ethernet network, just drop the packet.
In storage, you can't do that.
The quality of service is much more profound.
I mean, the resiliency requirements are much more profound.
So it's not that, I mean, if you look at the concept there,
you have line cards, you have port processors, you have ports.
People can buy more ports as they want.
We try to bring in similar concepts in this world.
Just as you have line cards here, we have controller cards.
You know, controller line cards, as we call them.
And you have a whole bunch of CPUs, and you can expand them as you want.
You can add more as you want.
Now we put in a larger CPU
when compared to a very, very low powered one,
which you may get in the network world
or sometimes driven by ASICs.
We put a Xeon D, you know,
a little SOC, which is low powered
inside our line cards.
And we had enough juice there.
Xeon D has got a usable amount of horsepower.
It's got a usable amount of horsepower.
That's a big step up from an ARM.
Yep.
And that was good enough for us
to actually do some of the RAID calculations that we wanted to do,
and some of the metadata computations that we wanted to do.
And we found out, we were surprised
when we did this as a prototype in our lab long back,
we were surprised that we could actually clock speeds
up to 120 gigabytes per second.
128 gigabytes a second?
120 gigabytes a second is what we could clock.
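(For reference, a quick back-of-the-envelope conversion, simple arithmetic rather than a figure from the show, ties that prototype number to the terabit-per-second framing used earlier:

# Back-of-the-envelope check (not from the show): how does 120 GB/s
# compare with the "terabit per second" line rate mentioned earlier?
gigabytes_per_second = 120
bits_per_byte = 8
terabits_per_second = gigabytes_per_second * bits_per_byte / 1000  # decimal units

print(f"{gigabytes_per_second} GB/s is {terabits_per_second:.2f} Tb/s")
# -> 120 GB/s is 0.96 Tb/s, i.e. roughly the terabit per second
#    used in the networking comparison.
)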
And then we added data management.
Reads are actually really fast
because data management really does not affect that.
When you do writes, obviously you're not going to get that.
You're going to have some impact
because you're doing RAID computation and all that.
But at the end of the day, what it proved to us was looking at it from the outside in, not following the traditional model
of buy a server, put some disks in it, put some software on top of that, and voila, this dual
proc server is now going to become your storage box. That's not what we did. We did a complete
outside-in thinking. And we said, we're going to look at it in a completely different way.
Not only do it for the speeds of today,
but how is this going to be for the speeds of tomorrow?
You know, you see Intel making announcements
on the 3D XPoint stuff.
Very interesting work that's going on
in the industry, in the media.
This is, in my word,
this is the age to be innovating in storage.
And it's exciting for me.
I'm a storage guy.
I'm a storage lifer.
Yeah, well, it got boring there at the end of the disk drive era for about three or four years.
It didn't last.
Things haven't been boring for a while, though.
Absolutely.
And I love what I'm doing, by the way.
Otherwise, you know, when we started this company, you know, everybody used to come and say, oh, another storage company?
Why are we doing this?
You know, how many skeletons are there in the closet?
But when you have a compelling value proposition, when you think you can disrupt the market,
when you have a compelling business model, all these three start to come together and we think we have something here and we want to make this big. Okay, so we haven't actually come out and
said it yet, but we're talking about a storage appliance, right?
That's correct. We're talking about a 4U storage appliance which is based off of NVMe over Fabrics. That means the connectivity is NVMe over Fabrics.
Okay.
We do two kinds of fabric. We can talk over regular RDMA, RoCE v2 or RoCE, or we can also do TCP, both. So for those hosts which want to connect to our box,
which don't have the luxury of having any of the new fancy RDMA cards,
that's game, too.
We can actually change the personality of a controller to say,
you will now talk TCP and everything will be over TCP.
So legacy compatibility is also there. Well, I can see where you're coming from with legacy,
but the NVMe over RoCE code is in the latest Linux kernel.
That's been through the standards organization.
The NVMe over TCP hasn't yet.
That is correct.
So do you have a driver for that, or are you sending people to SolarFlare?
No, we have a driver for that.
We have actually ported a Linux version, and we go all the way back a few versions.
And those of you who want to install our driver
on the host side for TCP, that's fair.
And for RoCE, we need nothing.
It's completely agentless.
Agents are evil.
So just do a Linux 7.4, boom, power it up,
and you're done to connect.
As long as you have a RoCE NIC.
As long as you have a RoCE NIC, you're done.
Exactly.
Wow.
There's the configuring the network magic,
but this is a storage podcast, so we ignore that part.
So the storage appliance has line cards?
Yes.
So see, when we went about designing this, we had, you know, this was the tough part.
We had five guiding principles for us.
The first one was we wanted to go after the new trend that is happening in data centers at scale, which is rack scale design.
In the sense, people are starting to look at the rack as the unit of computing. If I may call a glorified server, right?
And they do hyper-convergence across racks, if I may use that word.
I mean, they want to deal at the rack level.
Now, there's also a lot of talk about disaggregation inside the rack.
That means keep the CPU separate, keep the memory separate, keep the storage separate.
Unfortunately, technology for the CPU and memory is not out of the labs yet,
but storage definitely is out of the labs with NVMe-oF. So we wanted to do something for the
rack scale. This means we wanted to build a storage box, which was fast, small, and dense.
Why fast? Because, I mean, think about it. In a rack, you could have 20 servers. And assume that
each of these 20 servers
have two NVMe drives.
Let's say, take two NVMe drives per server.
Now, if you want to give the equivalent of that,
that's two NVMe drives times 20
worth of performance you need to give from your box.
So it has to be fast.
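(As a rough illustration of that sizing argument, using my own assumed per-drive number rather than anything Pavilion published:

# Rough rack-scale sizing sketch. The per-drive figure is an assumption,
# roughly what a PCIe Gen3 x4 NVMe drive can stream sequentially.
servers_per_rack = 20
drives_per_server = 2
gb_per_sec_per_drive = 3.2

aggregate = servers_per_rack * drives_per_server * gb_per_sec_per_drive
print(f"DAS-equivalent bandwidth to match: ~{aggregate:.0f} GB/s")
# -> ~128 GB/s, which is why a rack-scale consolidation box has to live
#    in the same ballpark as the ~120 GB/s prototype figure above.
)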
The second one was it had to be small
because what good is a rack scale architecture storage
when you're going to use three-fourths of the rack?
It's completely dumb, right?
So we had the challenge that we had to keep it small
and inconspicuous within a rack.
So we came up with 4U.
In my opinion, personally, even 4U is kind of just about there.
If I can do it in 1U, I want to do it in 1U.
But then you can do density in 1U,
but not necessarily performance with data management in 1U, right? And then it has to be dense,
because you have to take all the storage that is there in each of these servers and consolidate
them at the bottom of the rack. So it has to be dense also. The second guiding principle we used
was it had to use commodity components, because hey, at the end of the day, we don't want to build
an exotic piece of a system and then charge an arm and a leg, like $3, $4 million per box.
That's kind of a non-starter.
We wanted to democratize this whole notion of what we were trying to do.
When you say commodity components, you're talking about chips and that sort of stuff, rather than something you had custom designed?
That's correct.
All the chips we use are readily available.
You can go and get them. All the chips we use are readily available. The third one is...
Okay, but you guys did do your own circuit board and mechanical engineering?
That is correct. We did the system design. We didn't do any of the silicon design.
We also talk to people who say, you know, commodity components, meaning Dell servers. Well, that's because, for them,
anything that is software-defined has to run only in a server.
I think it's a little crooked thinking, in my opinion.
The third one was reliability.
I mean, what good is a fast system
and your consolidating storage for the rack if it's not reliable?
So we had to build availability, reliability, fault tolerance into the system, which was key for us.
The fourth one was we had to add data management, sufficient data management,
because, you know, the blast radius obviously increases when you consolidate.
So we had to make sure that, you know, we had, you know, RAID 6.
We supported some of the things which were wanting in the DAS world.
For example, how do you do a backup?
How do I do a consistency group backup
for a distributed application?
How do I do thin provisioning?
How do I do snapshots?
These things were a little foreign in the DAS market.
Though we have known this in storage.
You don't do thin provisioning in DAS.
You stick an SSD in and that's how much that server has.
Exactly.
Because we had to give the economics of pooling, we support all that stuff. The fifth big
guiding principle was ease of use and frictionless deployment.
At the end of the day, if you have a product that requires
let's say, you know, an exotic purple cable, if I may use that word.
I mean, you're lost. You don't want that.
It had to fit into what people normally do. So frictionless deployment,
not necessarily put agents across all the 1,000, 2,000
servers that are out there and upgrade them and keep the lifecycle on. So it has to be
plug and play. And ease of use was very important for us.
So these are the five. Let me get back to one of the things you said. You mentioned data
management. Not a lot of the NVMe over fabric startups these days
have data management at all.
Other than, you know, it handles multiple NVMe namespaces and that sort of stuff.
But that's about it. Snapshots and things like that
are kind of foreign concepts to this world.
You know, like I said, I'm a storage lifer.
For me, storage without this data management doesn't exist, in my opinion,
as a shared storage.
So you're absolutely right.
You really were in charge of volume manager for a while, weren't you?
Yes, I was a CTO for volume manager, file system, clustering, VCS,
all that stuff.
So, you know, look,
that's where the real challenge is.
When you start doing data management is when the CPU starts
getting involved. When the CPU
starts getting involved, that's when,
once you start to tax the CPU,
you have to look at the number of PCIe
lanes and how it scales,
all that stuff starts coming in.
One way to scale with data management is to say,
look, I'm not going to do any data management in my box,
I'm going to put agents on all the hosts,
use one core of every host.
If there are 20 servers, I'm going to use 20 cores,
one per server, and I'm going to get the lanes out,
I'm going to do all that stuff.
That's one way of doing it.
But doing data management inside the box at speed is rocket science. It's not easy. And that's exactly what made us
look at this entire design outside in. You know, how do we do this at speed? What's the
number of CPUs we need? That's why we have 20 controllers to do all this work inside.
20 controllers?
Absolutely.
The box actually has, you know,
so the way we designed it was,
you know, you don't want to build a God box here, right?
It's a petabyte and 20 controllers
and you go try to sell that,
you'll never get anywhere.
What we did was,
it had to be consumable in the sense,
you should be able to increase the number of controllers as you want.
That's increasing the performance.
You should be able to increase the amount of disks inside as you want.
So they are independent from each other.
So you can start off with four controllers
and go all the way to 20 controllers.
And that allows us to scale the performance
independent of the capacity.
So there's 20 controllers.
That's what we do.
And each controller is one Xeon D?
That's correct.
And then there's a PCIe switching fabric
connecting them to all the SSDs?
That's correct.
So that was another challenge.
This is not as easy as it seems.
One of the classic problems is, if you look at PCIe,
the PCIe device tree is actually rooted at the CPU complex.
Now, if you have 20 controllers, which one?
This is a challenge.
Well, there's MR-IOV, but nobody ever did that.
It has been a challenge. The specs are kind of there, kind of not there.
People tried doing this.
It was not easy.
In fact, I'll tell you honestly, we don't do MR-IOV.
We actually found out an ingenious way to do this with what is standards today.
And I will not go into the details of that because that's a lot of the IP that we have.
But in essence, we've got the ability
for all these controllers to look at all the drives
and do rewrites to these drives,
and we have achieved that.
Dual path?
Oh, yeah, dual path.
Actually, any number of paths.
There's no limitation.
We support dual paths today from the external host side.
But between the controllers and the media itself,
we provide availability. So one of the things we did inside, this would be interesting,
is we use standard U.2, you know, NVMe drives. Okay, so it's commodity drives
you can go buy from any of the big vendors that are out there, and we don't require dual-path drives at all.
We don't need that.
We can actually work with single-path drives,
and that's perfectly okay for us
because we have a small, you know, mezz card kind of...
Oh, like a SATA to SAS interposer?
Yeah, it's an interposer.
That's a perfect word.
So interposer, which actually detects if a fabric fails
and then switches over to the next fabric.
So in essence, you can actually get all four lanes active
even though you have availability.
It's not divided by two, one for each side, and we don't do that.
So get full bandwidth.
And it's cheaper. It's not redundant to use a single path one.
So I've now got this petabyte of storage and 20 controllers.
And I want to provision an NVMe namespace to a server.
Is there a GUI, a CLI, a REST API, God willing?
Excellent question.
So before we go there, I actually want to correct something that he said.
It's not an all or nothing.
So you can start with 14 terabytes, depending on the drive capacity.
Okay, I have 500 terabytes and 12 controllers.
12 controllers.
So we have a web-based UI.
So the first principle is, the way we did this stuff was,
we had an API first model.
That means we developed the web services API.
On top of that, we built a CLI,
a web user interface on top of the API,
as well as expose the API.
If you go to our web user interface,
there's a small link out there that says,
go read the API.
So it's self-documenting.
You can go and use it and all that good stuff.
So there's a full-fledged web user interface.
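(Just to make the API-first idea concrete, here is a minimal sketch of what driving such a web services API from a script could look like. The address, endpoint paths, and field names are hypothetical, invented for illustration; the real API is whatever the self-documenting link in Pavilion's UI describes.

# Hypothetical sketch of scripting an API-first array. The host name,
# endpoint paths, and JSON fields below are invented for illustration;
# they are NOT Pavilion's documented API.
import requests

BASE = "https://array.example.com/api/v1"   # hypothetical management address
auth = ("admin", "password")

# Create a thin volume in an existing RAID group, then read back its details.
vol = requests.post(f"{BASE}/volumes", auth=auth, verify=False, json={
    "name": "mongo-data-01",
    "raid_group": "rg1",
    "size_gib": 2048,
    "thin": True,
}).json()                                   # verify=False only because arrays
                                            # often ship self-signed certs
print(requests.get(f"{BASE}/volumes/{vol['id']}", auth=auth, verify=False).json())
)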
So if you look at the workflow, it's as simple as you create an equivalent of a RAID group inside
with a minimum of 18 disks.
That's how we do wide striping inside.
And then once you create a RAID group,
you create a volume, assign it to a controller,
and on the host side, you just do an NVMe discover
with that IP address for the controller
and then connect to the namespace that you want
and you're done and you're running.
So that's how easy it is.
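(On the host side, with a recent Linux kernel and the standard nvme-cli package, the discover-and-connect step he describes looks roughly like the sketch below; the IP address and NQN are placeholders.

# Host-side sketch driving the standard nvme-cli tool from Python.
# RoCE uses transport "rdma"; an NVMe/TCP setup would use "tcp" instead,
# with the vendor's host driver, since NVMe/TCP was not yet standardized
# at the time of this episode. Needs root privileges.
import subprocess

PORTAL = "192.168.10.50"                          # placeholder controller IP
NQN = "nqn.2018-06.com.example:subsystem1"        # placeholder subsystem NQN

# Ask the target what subsystems it exposes.
subprocess.run(["nvme", "discover", "-t", "rdma", "-a", PORTAL, "-s", "4420"], check=True)

# Connect; the namespaces then show up as /dev/nvmeXnY block devices.
subprocess.run(["nvme", "connect", "-t", "rdma", "-n", NQN, "-a", PORTAL, "-s", "4420"], check=True)

subprocess.run(["nvme", "list"], check=True)
)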
I mean, we also have security authentication
and all that stuff if you want to do all that. But in essence, see, this is the beauty of NVMe.
It's as simple as using something like an NFS, right? You know, the IP address,
and you know, the share export from the NFS box, you just mount it and you're good to go.
That's how simple it should be to consume.
That's how simple it should be to failover.
That's what we do.
You have a 20 controller cluster in 4U.
It's amazing.
That's correct.
I mean, these are like racks of storage here.
I mean, 20 controller.
Exactly.
By comparison, we're used to their little
controllers, right?
I understand. I understand. But they're
also blazingly fast controllers.
Yeah. I mean, look, that's the
beauty of using commodity parts,
right? You get Intel
doing great innovation around
the Z on Ds, and the
ARM guys are also catching up. So
you know, why build my own NAND and my own ASICs for controllers
when these are readily available?
Okay, and in terms of data services,
I've already heard thin provisioning, which is just logical.
Yeah, so thin provisioning, we also do RAID 6.
We do snapshots and clones, writable clones.
We don't do compression yet.
It's in the works.
We don't do deduplication.
Because, you know, look, at this speed,
people who really care about going to NVMe at this speed,
they don't want you to do deduplication.
In fact, that's what we've heard.
I think that's probably true for the next two years or so.
And we'll catch up.
As it becomes normal, then, I mean,
you deduplicate to make it more cost-effective.
Yes, exactly.
And as that performance level
becomes more normal,
then people will start putting
more cost-sensitive applications on.
And, you know,
maybe there'll be a new company
and maybe it'll be us
which will do, you know,
dedupe for NVMe storage first?
I don't know, you know,
but it's a natural progression.
You know, here's a joke I use all the time:
the best part about being a CTO in the IT industry is
you really just have to look back to see the future.
History has been repeating itself.
Well, you don't have to tell us, we've been around a while.
It's basically the premise of this show.
Okay, so you do RAID 6, and is it always 16 plus 2, or can I add drives to that?
So it's always 16 plus 2.
Okay, so expansion is in 18s, which is fine.
In 18s, yeah.
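(For readers keeping score, the 16+2 geometry works out as follows; straightforward arithmetic, with an example drive size assumed for illustration:

# RAID 6 16+2 arithmetic: usable fraction and expansion increment.
data_drives, parity_drives = 16, 2
group_size = data_drives + parity_drives          # 18 drives per group
usable_fraction = data_drives / group_size        # ~0.889

drive_tb = 2.0   # example drive capacity; any equal-sized drives work
print(f"Each {group_size}-drive group adds {group_size * drive_tb:.0f} TB raw, "
      f"{data_drives * drive_tb:.0f} TB usable ({usable_fraction:.1%})")
)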
And how do you handle controller resiliency?
Good question.
So, you know, what we have built, we serve blocks.
We don't do files, by the way.
We serve blocks.
That's NVMe over Fabrics, right?
But in order to do things like the provisioning and snapshots and clones,
we pretty much had to build the equivalent of a cluster file system without the namespace.
Oh yeah, there's always metadata.
Each controller, yeah, there's always metadata.
Now when we built that, when we built something
like a cluster file system inside,
we built such a way that things are active-active.
Even a volume can actually be assigned to multiple controllers,
and you should be able to do IO on both the controllers.
That's what we built for.
We have not enabled that in this release.
Active-active on a per-volume basis is not enabled,
but even though the controllers are all active,
they can be doing other work.
Okay, so it's kind of a broad-spread ALUA?
Yeah, it's a broad-spread ALUA.
So when one of the controllers die,
we have an internal heartbeat mechanism
which detects a controller going away,
or the host side detects that there's a link breakage
and starts the connection over to the other side,
and we do a failover and I/O continues.
And we have to do the failover in a small window of time,
otherwise you start getting timeouts.
So we've taken care of all that stuff.
Do you need custom path selection plugins for operating systems and hypervisors, or no?
Right now, see, look, right now NVMe-oF is in its infancy. We work with Linux,
and the Linux community is very active with multi-pathing, and we have made some enhancements to the current Linux open source,
and we put it back on the open source
with multi-pathing enabled.
So the next version of Red Hat,
you'll probably have full multi-pathing enabled.
So we've gotten some flags.
Some people say, oh, they're doing multi-pathing,
so it's no longer agentless.
You've got to put an agent.
My thing is, an agent, in my opinion, is not all that bad,
but I don't want to maintain the agent.
Next release, if Red Hat and Windows and VMware catches up,
I'm not in that business, right?
My technology should allow for that.
Yeah, as soon as the standard covers those things,
you don't want to be doing it.
That's correct.
That's correct, yeah.
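(As a side note, not Pavilion-specific: on stock Linux kernels built with CONFIG_NVME_MULTIPATH, native NVMe multipathing is a kernel knob you can inspect, roughly like this.

# Quick check of the stock Linux native NVMe multipath setting.
# On supporting kernels this sysfs parameter reads "Y" or "N"; it can be
# set at boot time with nvme_core.multipath=Y.
from pathlib import Path

param = Path("/sys/module/nvme_core/parameters/multipath")
if param.exists():
    print("native NVMe multipath:", param.read_text().strip())
else:
    print("nvme_core has no multipath parameter (older kernel or module not loaded)")
)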
So the clone stuff, you mentioned that you support read-write access.
Is clone a physical copy, or is it more or less another version of Snapshot?
And I assume Snapshot is space efficient, right?
That's correct.
So we just duplicate the metadata that's supported, not the data.
And we use a redirect on write model.
So clones are very efficient.
They have no impact on the performance of the
original. And you can write as long as you have space and then you'll get a bunch of alerts when
you're getting close. And that's about it.
And the granularity on that is in the order of kilobytes, not gigabytes?
In the order of four kilobytes.
Okay, I'm sorry, you can snapshot a four-kilobyte block? I thought this would be per volume.
No, no, no, no. Yeah, that's the granularity.
I gotcha, redirect on write. I understand.
By the way, just a disclaimer, I'm not saying it'll always be four kilobytes. We could go to 16 KB as the drive sizes increase, right?
Yeah, and small changes like that should have small effects.
That's correct.
We have dealt with systems with 16-megabyte page sizes,
and they go and do snapshots.
It's hard to call that space efficient, that's all.
Yeah, yeah.
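(To illustrate what redirect-on-write at a fixed block granularity means in general terms: the sketch below is a toy model for illustration only, not Pavilion's metadata design.

# Toy model of redirect-on-write snapshots at 4 KiB granularity.
BLOCK = 4096

class Volume:
    def __init__(self):
        self.table = {}        # logical block number -> physical location
        self.snapshots = []    # each snapshot is a frozen copy of the block map

    def snapshot(self):
        # Snapshot duplicates only the (small) block map, never the data.
        self.snapshots.append(dict(self.table))

    def write(self, lbn, new_location):
        # Redirect-on-write: new data always lands in a new physical block,
        # so snapshot tables keep pointing at the old, untouched blocks.
        self.table[lbn] = new_location

vol = Volume()
vol.write(0, "phys-100")
vol.snapshot()                  # snapshot still maps LBN 0 -> phys-100
vol.write(0, "phys-215")        # live volume now maps LBN 0 -> phys-215
print(vol.table[0], vol.snapshots[0][0])
)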
So, I mean, internally, in order to maintain four kilobyte block snapshots,
pointer redirects, and that sort of stuff,
typically you would need some sort of shared storage,
shared memory across the cluster of controllers.
You don't seem like you have anything like that.
Yeah.
So one thing, in this version, we don't use any NVRAM or anything like that.
But one of the beauties of what we have done is we have a PCIe fabric in between,
which connects the controllers to the drives and also connects the controllers to the controllers.
So in essence, each controller can actually watch a portion of memory of another controller. So one of the ways we do for really fast metadata
is we actually replicate our metadata
across multiple memory systems inside the box
so that we have high availability there.
And then we have enough hold-up time through supercaps
to actually give us enough time to flush the metadata
in case there's a power outage or a failure.
So that's how we get resiliency.
And performance is, look, we're writing at memory speed,
you know, pretty much with the latency of PCIe.
So look, a typical write across from one controller to another controller
will be in the order of 500 to 600 nanoseconds.
Can't beat that.
Can't beat that.
So this system is designed to, hey, cost less, number one,
because we didn't use fancy parts.
Second one is designed for speed and designed for the new class of media
from the ground up.
This is not a retrofit.
And we did not pick open-source ZFS and just hack away at it
and then call it a file system for NVMe media.
We didn't do that.
We designed the system,
we designed the software for our hardware,
for the media that we are trying to address,
for today and for what's going to happen tomorrow.
So, I mean, have you guys done any testing
with Optane and that sort of stuff,
the storage class memories?
We have, but I think, you know, look,
we're not serious about it today.
I'm sure we'll get serious about it very soon.
The price has to drop a little bit more.
I think even...
Intel has to be able to produce enough
for the price to drop a little bit more.
Yeah, so it's a chicken and egg thing.
So, I mean, look, just now,
I think the U.2s have started rolling out,
which have decent capacity.
If any U.2, I mean, look, the box is open,
so you can pop out one of the drives,
a bank of 18 drives,
pop in your Optane drives
and measure the performance,
and it should be fast.
I mean, look, we're not publishing numbers
on the Optane.
Do the drives in a RAID 6 group
have to be the same capacity?
They have to be the same capacity,
but not across different groups.
They have to be the same capacity, that's one thing.
We recommend they be from the same vendors,
having the same characteristics
in terms of performance.
Because if even one guy slows down,
it's going to slow down all the others.
Because we wide stripe it.
Yeah, so all the usual RAID concerns.
That's correct.
So, I mean, what sort of market are you trying to go after with this solution?
Well, excellent question.
So even though the solution, as it stands, is more horizontal in nature and can go after anything,
as a startup, the most important thing for us is to have focus.
Focus both in our strategy as well as the market we go after.
We see a big gap where there's a need for performance.
And if you look at the last five, ten years,
people are writing applications with predominantly
the open systems architecture, right?
In the sense, they pick up a MySQL from somewhere
or a MongoDB or a CouchDB.
And it's all scale-out in nature,
Cassandra and all that stuff.
And they start building this in the department level
and they transition over to the IT.
And IT says, oh my God, what am I dealing with here?
I have, you know, a big checklist of 20 things
before something is compliant in my IT world.
And they find out, you know,
how these things are not able to fully comply
because they're all distributed,
sharded all over the place.
And they're trying to grapple with it.
So we believe that we can, A, go to the market by saying,
look, we're going after these new age applications
where there's an absolute need for speed.
There's an absolute requirement for being compliant.
And you want to, you know, data is growing so much,
the cost of NVMe drive is still not as cheap as disks, not yet.
So they're expensive.
So you want to conserve and increase the utilization on these things
and not put three copies and all that stuff.
So that's what we're going after.
And we're seeing this, you know, go after the Splunk world,
go after the MongoDB world, go after the, you know,
Couchbase worlds and Cassandra's.
And that's the market we're going after, the new age application.
We're not going after the traditional place where you have Fibre Channel SANs,
and that's a heavy lift.
And, you know, there's too many cooks there,
and they have great account control,
and we don't even see them where we go to.
Okay, and because those applications have their own replication models
and replication isn't that important?
Yeah, they have their own replication model. Absolutely.
I mean, one of our customers actually made this basic
statement saying that within the rack, it's your
job to provide resiliency, right?
So they call that local resiliency. Local resiliency is
the job of Pavilion Data and its appliance to provide. Global resiliency, as they call it, which is going across racks or
going across data centers or whatever, they actually put the onus more on the apps. You know,
it was the old days where people were doing SRDF and replication with the box. Today, the apps are much,
much more efficient in doing, you know, replication. Look at Oracle Data Guard.
They do it directly at the SQL level, transaction level.
Well, you call it the old days,
but most checks in the world still get generated
out of applications that work that way.
I mean, I'm sure if I'm talking to a bunch of guys in mainframe,
they would say the same thing too if I call them old days.
Oh, yeah.
A lot of business is still run on that.
So, you know, I mean,
I'm talking about, you know, chronologically.
I mean, the new era,
as we call it, the modern era.
And so, you guys, I mean, as a product,
you know, startups typically have a challenge trying to go outside of
the U.S. or things of that nature.
I mean, are you available globally,
or how does that play out?
Well, we are... well, it depends.
So we're nimble.
Let me put it this way.
Not Nimble as in the company, but we're nimble in our actions.
And if we see a big opportunity in a particular area,
we're nimble enough to go and stand up something out there.
We're currently available in Europe, and we are available in USA.
Okay, and you're shipping now?
We're shipping now.
And when a customer purchases a product, is it on a
capacity basis, on a controller
basis, or how does that work out?
So right now, it's on a capacity basis. And that
includes like a line card,
so they get a certain
number of line cards per terabyte or something
like that, is that how it would work?
That's correct.
We sell in units of five controllers.
Five controllers?
How can you deal with an odd number of controllers?
There's something illogical here.
Well, yeah.
So actually, we have 20 controllers.
And we have four virtual arrays.
Right.
No, I understand.
So there would be four sets of five controllers.
That's correct.
We'll take the hit on the one.
Yeah, yeah, I guess it's easy to explain to them: one fourth is what you get. So it's one fourth, half, three fourths, and one to grow on.
Or actually one as a spare.
Yeah. So when you go to a, and I'm not sure I've got the right term,
the second subsystem that would be in a subsequent rack.
Does your management system command all those rack storage systems,
or is it a separate management system per rack?
Yeah, today, look, we're all walking in the other way.
Today, we are one rack at a time.
Unfortunately, that's how it is. And with my background in Veritas, I've built several of the
global distributed systems to manage. You know, one thing we did was we, we built it such a way
that the entire management or the API layer works on a PubSub model, right? So it's easy for us to
federate across multiple boxes.
In fact, there's a component here which
sends proactive
telemetrics over to a cloud
component that we have so that
we can do support, intelligent support.
We have that also in the product.
So we can find
out if a drive is failing and we can
call up and we can tell you, hey,
did you know that you've not configured your volumes correctly? We can actually do that.
No data, we don't send data. It's just telemetrics about, you know, some configuration stuff,
and if there's any outage going on or potential components not working, we send that over.
So VR, do you know how many, I'll call it sensors, you're collecting on a periodic basis like that?
I can't even count.
I mean, there's like IO, throughput, bandwidth, latency, temperature.
Yeah, yeah.
For different various components.
There's a lot of things we collect.
There's a lot.
Right, right.
And we don't send it in real time.
I mean, we bunch them together and send it, unless it's an alert.
If it's an alert, we send it in real time.
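(The batch-versus-alert behavior he describes can be sketched in a few lines; this is purely illustrative, with invented names and thresholds, not Pavilion's telemetry code.

# Illustrative sketch: routine metrics are buffered and uploaded in batches,
# alerts bypass the buffer and go out immediately.
import time

class TelemetryUploader:
    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.buffer = []

    def send(self, payload):
        count = len(payload) if isinstance(payload, list) else 1
        print("uploading", count, "record(s) to the cloud component")

    def record(self, sensor, value, alert=False):
        sample = {"ts": time.time(), "sensor": sensor, "value": value}
        if alert:
            self.send(sample)            # alerts are sent in real time
            return
        self.buffer.append(sample)
        if len(self.buffer) >= self.batch_size:
            self.send(self.buffer)       # periodic bulk upload
            self.buffer = []

t = TelemetryUploader(batch_size=3)
t.record("drive7.temp_c", 41)
t.record("ctrl3.latency_us", 130)
t.record("drive7.temp_c", 72, alert=True)   # sent immediately
)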
So we also have the other way.
We can actually push firmware updates from the cloud
if you so want it.
So we can just do installation from there.
You know, those things are fancy stuff.
We don't do that in our release one.
But all the mechanics are there.
We're actually not releasing that feature yet.
But the sense is, you know,
these should be autonomous entities
which are out there, and we should be able to
update the firmware on demand from a customer
or allow a portal for a customer into our global website
and they can see their arrays and they can manage it themselves.
So that's the thought process that we have.
You mentioned that you don't have any NVRAM.
I assume that means you're not
using memory caching for
the data? So you're just going directly?
No, we don't.
NVMe is so fast.
I understand. I just wanted to make sure I clarified
that up front.
When you write something and we
acknowledge that it's written, it's really written.
Yeah, I understand.
So the beauty of that is it allowed all our controllers
to be completely stateless.
There's no state stored in any of the controllers.
That's the beauty of something which we did.
When you have 20 of them,
each controller can be serving volume one right now.
It could just disappear.
Next time it comes up, it could be serving volume two,
so on and so forth.
Yeah, but you do have to optimize your RAID algorithm.
Yeah, but there's a group of controllers
that are assigned to a particular RAID group, you know.
So, and you can float around in that,
and that's perfectly just possible.
But the essence of the design
where the controllers were very stateless,
and part of that guiding principle
is what made us say,
look, we're not going to have cache,
we're not going to have NVRAM inside the controllers,
then you crash.
What kind of 4K write latency are we talking about?
Let me put it this way.
Roughly end-to-end, we're looking at, for write latencies,
around 100 to 125, 150 microseconds.
That's what we're looking at.
That's device-level latency.
I mean, that's what you get from an NVMe device.
No, no, no, no, no, no.
NVMe writes are actually pretty fast.
You can actually do it at under 40 microseconds.
Because, you know, most of these medias, they have DRAM inside.
Yeah.
It would be interesting to see in your architecture the performance difference between DRAM heavy and DRAM light SSDs.
Hmm.
Interesting. Interesting.
Yeah.
Because it looks like you're the perfect use for a smart SSD.
Smart SSD?
Give us more ideas.
Give us more ideas.
You know, we love it.
I mean, this is the best part of being in a startup.
You know, we are so nimble.
Right.
I mean, again, nimble, right? I mean, again, flexible.
We're so nimble that we could actually do interesting stuff which is just
not possible in large companies.
So this has been great. Howard, are there any last questions for VR before we sign off?
No, I just had started having a very strange thought about this architecture and host-managed SSDs, where you could direct all of your writes to one of the four banks and have the other ones doing garbage collection at the moment.
But I need to think that through a little bit.
So, you know, now that you mentioned that, one interesting thing we do here
is we actually manage space inline.
In the sense that we don't go around in the background
doing garbage collection at our layer.
Because every layer that does some form of redirect on write
has to do some form of garbage collection.
The SSDs do that, and sometimes the software,
the host software or the controller software that runs; if it does redirect on write, it also has
to do it on top. In our layer, for our software, we don't do it. The SSDs may still be doing it. One of the
big reasons why we did that is we could actually interface with and understand the SSDs more, so that
we know exactly where they're doing garbage collection
and actually avoid going to those areas when we have to
to give predictable performance, right?
So we can do that.
And also, it sets the stage very nicely
for integrating with things like Project Denali,
which is coming out.
One last question for me.
Are you guys using a log-structured file system
on the back end of this, on the NVMe devices?
That is correct.
Hey, VR, is there anything you'd like to say to our listening audience before we sign off?
Thanks, by the way.
Thanks for giving this opportunity to talk to you guys, by the way.
And I hope I answered all the questions that you have.
Do come over and visit our site, www.paviliondata.com.
We have a Twitter handle as well as we're on LinkedIn.
Please subscribe to that stuff.
I think we have a great product,
and our initial customer traction has been pretty good.
The feedback has been pretty good.
And I'm hoping that this fits in pretty much every enterprise
that we go out today.
So thank you very much for this time.
Okay.
Well, this has been great.
Thank you very much, VR, for being on our show today.
Thank you.
Next month, we will talk to another system storage technology person.
Any questions you want us to ask, please let us know.
And if you enjoy our podcast, tell your friends about it,
and please review us on iTunes as this will also help get the word out.
That's it for now.
Bye, Howard.
Bye, Ray.
Until next time.