Programming Throwdown - 121 - Edge Computing with Jaxon Repp
Episode Date: October 22, 2021

What is "The Edge"? The answer is that it means different things to different people, but it always involves lifting logic, data, and processing load off of your backend servers and onto other machines. Sometimes those machines are spread out over many small datacenters, and sometimes they are in the hands of your customers. In all cases, computing on the edge is a different paradigm that requires new ways of thinking about coding. We're super lucky to have Jaxon on the show to share his experiences with edge computing and dive into this topic!

00:00:23 Introduction
00:01:15 Introducing Jaxon Repp
00:01:42 What is HarperDB?
00:08:10 Edge Computing
00:10:06 What is the "Edge"
00:14:58 Jaxon's history with Edge Computing and HarperDB
00:22:35 Edge Computing in everyday life
00:26:12 Tesla AI and data
00:28:09 Edge Computing in the oil industry
00:35:23 Docker containers
00:42:33 Databases
00:48:29 Data Conflicts
00:55:43 HarperDB for personal use
01:00:00 MeteorJS
01:02:29 Netflix, as an example
01:06:19 The speed of edge computing
01:08:43 HarperDB's work environment and who is Harper?
01:10:30 The Great Debate
01:12:17 Career opportunities in HarperDB
01:18:56 Quantum computing
01:21:22 Reach HarperDB
01:23:53 Raspberry Pi and HarperDB home applications
01:27:20 Farewells

Resources mentioned in this episode:

Companies:
HarperDB: https://harperdb.io/
MeteorJS: https://www.meteor.com/

Tools:
Raspberry Pi: https://www.raspberrypi.org/
Docker: https://www.docker.com/

If you've enjoyed this episode, you can listen to more on Programming Throwdown's website: https://www.programmingthrowdown.com/

Reach out to us via email: programmingthrowdown@gmail.com

You can also follow Programming Throwdown on Facebook | Apple Podcasts | Spotify | Player.FM. Join the discussion on our Discord. Help support Programming Throwdown through our Patreon.
Transcript
Hey everybody, this is an awesome episode. I'm really looking
forward to this. Edge computing is one of these things where when I first learned about it,
I thought it was just client-side computing. I thought it was something on the browser
or something on your mobile device or something like that. When you think about it, there's
actually a whole gradient between that and some backend server, right? So imagine Netflix releases a new episode that they know is going to be super popular, their most popular show, and everybody
just starts downloading it, and that would just completely blow them up, right? They also just
can't put a two-gigabyte show on everybody's phone proactively. They can't do that either, right? So there has to be an answer there.
And edge computing is a big part of that answer.
And so there's a lot of complexity
around how this all works.
And I'm so happy to have, you know,
head of product at HarperDB, Jaxon Repp here
to really dive into edge computing
and learn as much as we can about this topic.
So thanks for coming on the show, Jaxon.
Thank you very much for having me.
I really appreciate it.
Cool.
So we always kind of start this off.
How has COVID kind of changed Harper
and changed your kind of work style?
What do you feel has been the sort of big salient point from that?
Well, we're a relatively young company formed in 2017.
We were in a co-working space, a bunch of different little isolated spaces together, and we had finally gotten enough traction, gotten our product where we wanted it, that we took another round of funding and leased an office space in January of 2020.
Oh, wow. One month later, everybody was like, maybe that was a bad
idea. And we became a fully distributed company in much the same way we are a distributed product,
a distributed database. So after a few months, we took a company survey and every single person was
more productive than they had been in the office. I think we all realized we had a much better work-life balance.
My cats are much less lonely, although lonely was kind of their jam.
And we decided sort of as a company that the money we would have spent on rent, we think
we're going to spend on annual or biannual retreats where employees can bring their families
and we'll go somewhere like Mexico and
do morning, you know, planning meetings, and the afternoon is yours. Because to be honest,
we're still small enough, we can pull that off. And part two is, you know, a lot of us,
at least at our company, are pretty okay with, you know, being able to better divide that line
between, you know, software development is often one of those
all-encompassing things that takes over your soul and all of your available hours. And I don't know
about you, but I'm not as young as I used to be and I have kids and I would like to see them.
They are not terrible people. I agreed with you up until the end. No, just kidding. But no,
I think you're totally right. I think, you know, having a get-together at some cadence, maybe every year or semi-annually or something like that, that is super nice. And that can, you know, keep that bond going and start that bond up with new folks. But coming into the office every
day, I mean, you know, Patrick and I used to have huge commutes, like hour each
way commutes. And that just eats so much of your day. And so many times, you know, you're there.
And there were days where I went in and left and didn't even really talk to anybody. It's kind of
like, well, what was I doing for those two hours? Right. So, yeah, I think, you know, you can get
a lot done. You can get by with not coming into the office every day.
I think that's definitely something we've all taken away. So what happened with your lease? Like, were you able to break that lease, or how does that work?
In general, I'm always fascinated with what happened to corporate property at this point.
I think we got out of it and I think it was because basically they had entire multiple floor tenants who were trying to fight the same battle.
And to be honest, I believe the real estate agents' lawyers were probably too busy fighting real battles to worry about a little startup who took a corner unit with almost no windows.
Honestly, I look at it and I say, to your point about commutes, I used to commute an hour every day. And then I moved to HarperDB and my commute was 15 to 20 minutes. And I was
like, that is so much better. And then I switched to a five second commute from my bedroom to the
living room. And there is something to be said for the decompression that comes with a commute. Oh, that's true. Yeah. From sitting there and staring at your code like that kid who doesn't know why it works and doesn't know why it doesn't work, and all of a sudden it works again, and you're like, okay, I'm done. And you walk out into the living room, and kids don't respond that way. You cannot debug them. They are just a constant challenge. And to be able to be
present with them after you've spent, you know, a whole day banging your head against your desk,
that is a skill in and of itself, being able to walk out and be present and not be focused on work
still. Yeah. Oh man, you hit on something that's such a, such a good point. I mean,
so a couple of things to riff on there is when I started taking a walk, so I started
walking to work where I'd basically just walk in a circle around our neighborhood and that's
my walk to work.
And I just feel like mentally it kind of puts me at a different place.
So far, it seems to be working.
Maybe it's a little bit of placebo effect there or something, but I feel like it's doing
something.
And then the other thing is, yeah, I feel like even at work, there'll be a situation that's totally on fire and you have to make some really hard decisions very quickly, and it's a zero-sum game and everyone's really upset. And then you go from that to, let's say, a one-on-one meeting with somebody who's doing an amazing job. And you have to kind of switch gears from, you know, your sort of debate face to someone who's extremely excited and happy and appreciative. And then that meeting ends and now you're with
your family, which is like another dimension. So I feel like being able to toggle all of these different personas, that has been really, really difficult over VC. Whereas in a real office, you would at least walk from one room to the other and you'd have time to sort of, you know, reframe yourself over and over again.
Yeah. Just sitting down to dinner and, you know, kind of crossing your hands and saying,
all right, let's talk about your performance today.
That's right.
It's my understanding that you spilled food on your shirt at the beginning of the day. And then you had to walk around with that stain. That's clearly not the image we want to project, Grayson.
Yeah, that's right.
You might have to go live next door.
Oh man. So cool. It sounds like a, yeah, it really worked out for the best. And I think that this acceleration, I mean, definitely, you know, you wouldn't ever wish COVID on anybody or any country or anything like that. But there has been some real silver lining. I think this has been one of them where we've started to understand the working relationship better. Cool. So yeah, let's dive into edge computing. So initially, I thought edge computing was just on the browser, on the mobile app, right? And that's definitely
like the extreme edge. I mean, there's definitely things you want to do in that space, but there's
a whole bunch of stuff in between that and let's say an EC2 instance you have running.
And so kind of walk us through what that really is,
like what is available there and what can people do on the edge?
Well, the edge is defined loosely, to say the least. We used to think of the edge as
not so far out as the browser because there's so many limitations in terms of what you can do. You're in a sandbox.
So the next smallest compute unit that we would focus on, and this is both at HarperDB
and my prior company, which is an IoT platform, things like Raspberry Pis, small microcomputers,
you know, Jetson boards, stuff like that, where you can run code that handles a workload
and will perform some smaller
tasks than could be handled on a larger server up in the cloud. What you find though, is that
that hardware is not super ready for a lot of dynamic programming and workloads. Take, for example, an autonomous vineyard just outside of Tucson, Arizona, that we ran at my previous company. We are automatically watering the vines and using compute to analyze soil moisture
content and humidity and temperature and canopy and shade and infrared and all of that stuff.
And we are still using ruggedized Raspberry Pis out there, because it's hard to find something that will give you the flexibility to install a platform or a database or whatever you might want to store or run on it, and then likewise handle the reality that it rains outside, or that it's 150 degrees sometimes when you're down there measuring temperature. So the hardware was a huge challenge, and we were banging our head against that at HarperDB, because we knew that distributed computing would require distributed data.
And we knew the benefits of distributed computing. Running an AI model at the edge on a small data
set or stream data set is much more efficient than shipping it all up to the cloud, especially
when you have intermittent connectivity, which is often the case out at what we call the edge.
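That intermittent-connectivity point is a concrete constraint you can design for: compute locally, queue the (much smaller) results, and ship them when the uplink returns. Here's a minimal store-and-forward sketch in JavaScript; all names (`EdgeBuffer`, `record`, `flush`) are illustrative assumptions, not HarperDB's API.

```javascript
// Minimal store-and-forward sketch for an edge node with an intermittent uplink.
// Results computed locally are queued and flushed in batches when connectivity
// returns. Names here are illustrative, not any real HarperDB API.
class EdgeBuffer {
  constructor(maxEntries = 10000) {
    this.maxEntries = maxEntries;
    this.queue = [];
  }

  // Queue a locally computed result (a summary, not the raw sensor stream).
  record(entry) {
    this.queue.push({ ts: Date.now(), ...entry });
    // On a small device, drop the oldest entry rather than run out of memory.
    if (this.queue.length > this.maxEntries) this.queue.shift();
  }

  // Try to ship everything queued so far. sendFn should throw (or reject)
  // while the uplink is down, in which case the queue is left intact.
  async flush(sendFn) {
    const batch = this.queue.slice();
    await sendFn(batch);                  // rejects on network failure
    this.queue.splice(0, batch.length);   // only clear what was actually sent
    return batch.length;
  }
}
```

The key design choice is that `flush` only clears what it actually sent, so a failed upload during an outage never loses queued results.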
Can you describe it?
So I thought that the edge was like maybe the ISP or something. Like, what exactly is the edge? Is the edge like your house, or is the edge some server in between you and the internet, or what exactly is that? The edge is everything outside
what I would call a major colo facility,
like a major cloud provider.
So we've partnered with Lumen,
the old CenturyLink slash Level 3.
They're pushing micro-edge data centers,
which are still data centers
and still much more server capacity than
I have ever had in my closet at home. But to them, that's the edge. And that was really the change
for us to realize that the edge isn't a wearable because we can't be installed on it, but it is
the edge to some people who can compile apps and put it on a watch. For some people, that's the edge. For a lot of people,
you know, where your sensors are and where you're collecting that data is the edge. That's how you
define it. So you fight the battle and you find the hardware to pick and survive there and collect
that data. But for other people, the edge is simply my customers have a really fast connection.
I can trust that that connection will exist.
I just want to move my application closer to them so that the round trip to the API
is a millisecond instead of 300 milliseconds.
That's the sort of performance that I want to get back.
So we find that lots of people are defining the edge. I mean, the people who own giant cloud
data centers are definitely defining the edge as a slightly smaller data center, slightly closer to
the users. And the people who are building apps that collect sensor data and make use of that
with machine learning, they're pushing it out further and solving those hardware challenges.
I don't think we really need to truly define it more than that because tomorrow we're
going to invent some new technology that's inside me. Yeah, that's right. We're all the edge.
Okay. Oh, that totally puts it in perspective because I did a bit of research prior to the
show. People who listen to the show know I'm an AI person, so I don't have any background in full
stack, but I did a bit of research, and what I saw was things like Cloudflare Edge and AWS Lambda@Edge.
And that sounded like, as you said, just a smaller data center.
And there's just a lot of them.
But it sounds like Edge is much bigger than that.
I mean, it's also your vineyard is a great example.
So in this case, you have this swarm of Raspberry Pis on this vineyard, and now that's the edge.
So with Harper, for example, are you concerned with all of those different types of edge computing, or are you focused more on the former or the latter? When I first joined HarperDB, we were very, very much focused on, let's go into AI-powered
classifications in the mining industry. And that's a very small, it might be a Dell Edge device or a
Raspberry Pi for proof of concept. And let's do our calculations and let's provide real benefit.
And we could absolutely do that. The client loves it. The result is great. And they're like, cool, we ran that, but we can't really run a Raspberry Pi in this hot smelting environment. So what hardware solutions do we have? And inevitably, you run across budget concerns, because if we've got 150 of those across a plant and you're going to spend $3,000 on a ruggedized piece of hardware, well, now all of a sudden the costs balloon. So one option is to create a HarperDB device with our product on it,
and we incur that capital cost. And now they can pay monthly so they can run it as OpEx.
And from a budgetary perspective, it's very, very challenging to have that be the edge.
And we found that the big guys with all the money who are absolutely pushing their giant cloud service offerings to
these smaller data centers, they are just as desperate to move data and compute and functionality
and capability to the edge. And they have much bigger pocketbooks and they can make it happen
more quickly. So, I mean, not a lot more quickly. We can do POCs all day long for smaller companies,
but there's still a real hardware problem out there for true edge computing.
Yeah, that makes sense. So let's step back a little bit. Now we've defined edge computing,
which I think is super, super useful to set the frame. So what got you into edge computing? Give
us kind of a bit of a background on your kind of story and what
led you to HarperDB. Sure. I was a partially reformed software developer.
Partially reformed. Partially reformed. This is my eighth startup. I was at a
communications startup where we were joining together all your phone calls, texts, emails into threads for customer service.
And for the UI for that, I basically wrote React, like literally a year before React came out.
Wait, how is that possible?
There's like a preview release or something?
No, no, I wrote effectively the functionality
of these modularized HTML because there wasn't one,
but I knew that that's what we needed. Wow, isn't that amazing? It's literally great minds think alike. Like, you know, there's this idea, and a lot of people kind of come to the realization at the same time. Yeah. It just seemed so obvious.
And then I immediately moved, uh, well, my wife was pregnant at the time and she,
she kept getting more pregnant.
And I didn't want to pay cash for that baby.
So I had to get a job with insurance.
And I worked for DirecTV.
And it was one of those jobs.
I've never worked at a company where I had to wear khakis and a button down before.
And it just didn't fit.
So I started looking for my next opportunity.
And I found an IoT platform, sort of a low-code drag-and-drop: drag in your sensor block, and now, for the data that comes off of that, capture it from port three, divide it in two, and keep a running average in your memory buffer. And, by the way, fetch a threshold from a database, and if it goes above that limit, then send an email. So it was a super easy-to-use platform a la Node-RED, only sort of enterprise grade. And it was a great product. Costs for any given operation dropped dramatically because we didn't have to have
massively powered servers on the floor and run cables to everything. These could be
wireless connections. We could use low energy Bluetooth if we had the ability.
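The kind of rule just described (capture a reading, divide it in two, keep a running average, compare against a fetched threshold, alert) is easy to picture as code. This is a hypothetical sketch, not the platform's actual block model:

```javascript
// Hypothetical version of the drag-and-drop rule described above: scale each
// reading, keep a running average over a small window, fetch a threshold, and
// alert when the average exceeds it. Names and shapes are illustrative.
function makeSensorRule({ windowSize = 10, getThreshold, onAlert }) {
  const window = [];
  return function onReading(raw) {
    const value = raw / 2;                  // "divide it in two"
    window.push(value);
    if (window.length > windowSize) window.shift();
    const avg = window.reduce((a, b) => a + b, 0) / window.length;
    if (avg > getThreshold()) onAlert(avg); // e.g. send an email
    return avg;
  };
}
```

Running this per reading on the device is exactly the filtering win described here: only the alert (not every sample) ever needs to leave the edge.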
So there was a tremendous opportunity for us to capture and refine data, and then not send every single piece of sensor data up to the cloud. You very quickly, when you're in an installation with
a thousand sensors, realize how much of your pipe you're taking up, sending everything up there to
be analyzed. So it made sense from a filtering perspective to me. And then it was just about
how can I make this easier, faster, more stable? What are the
challenges that I have and how can I overcome them? That makes sense. And so that company that
you're at, which had the IoT devices, that was from there you went to Harper. Is that the step
before Harper? Correct. I was looking for a network fault tolerant data solution. And I actually, I found HarperDB just, I think may have actually searched for that. We had great SEO back then.
Nice. And I integrated it, built a block for it for our platform. But I ran into some documentation issues.
Their Postman collection appeared to be a little outdated. So I rewrote that Postman collection and just sent them an email at hello@harperdb.io. I'm like, hey, your documentation appears to be a little out of date. Here's a new file. I rewrote it. So that worked for me. So I could use the Postman collection basically locally.
I said, thanks.
I continued to implement it, ran into a couple of things that I thought would be cool to
have.
And I wrote to them and eventually they just wrote me back and said, we would really like
perhaps you to work for us.
Cool.
It would be cool if you could consult or just help out.
I'm like, ironically, this office is about to go virtual and I have kids at home and I don't think that I can survive that.
And so they invited me to office with them half time, which turned into full time.
And here I am two and a half years later.
That is awesome.
I mean, that's I think it's a really great story.
We get a lot of folks asking, how do I get into the field?
Right.
And I mean, there's a perfect example where they,
well, I mean, they might have looked you up, but let's assume they didn't. I mean, they might not
know your college degree or whether you went to this bootcamp or that bootcamp, but you're
providing real value to them. And they knew that you knew what you were doing.
And so they reached out to you and got that process going. That's a real sign. I think
it's inspiration for people out there who want to get into the field of you just get in there and
start using things and be a part of the vocal part of the community. And that can go a really long
way. Yeah, I think one of the things I've noticed when I work with newly onboarded employees is they're very much of two types: the ones that need to wait to be told, you know, what to do and how to solve a problem, or always have questions. And then the ones that bring me two solutions to a problem they've encountered, that they Googled or Stack Overflowed and found, and say, I don't know which, but both of these would solve the problem.
All day long, that makes me do a happy dance inside as opposed to the other.
Likewise, when you're working with a product, I know that they don't want their documentation to be outdated.
Nobody wants that.
I hate documentation, but I was willing to do it because I needed that Postman collection to work for me anyway.
As soon as I got it to work for me, I'll just send it to them. And that way,
nobody else has to have that problem. We have enough problems as programmers.
Yeah, that's right.
Let's not let them fester out there in the world.
Yeah, totally. Cool. Yeah, that's awesome. And so, yeah, that's great. So you were using HarperDB as part of this IoT project. You're communicating with them.
And you said, wow, this is actually a really cool piece of technology. I want to go there full time.
And also the idea of going virtual appealed to you. So let's jump into, how do people write code for the edge?
And how is that different from, you know, building a regular
server like in PHP or something like that? Like what is, what makes the edge environment different
to work with? It's a very good question, because based on your previous question about what is the edge, I think that's changed for me a lot. Lots of people have been programming at the edge for a long time. And that's, you know, microprocessors programmed in low-level C, basically super constrained.
I mean, you guys were rocket scientists, right?
And you worked on things that went into space.
So theoretically, you've worked on extremely resource constrained devices. And that
always felt to me like what edge was. Edge is, it doesn't do a lot, but it's very purpose-driven.
It's not super dynamic. And to be honest, once you put it on that board, it's never going to change.
Yeah. Yeah. I think resources have changed, and the Raspberry Pi kind of opened everybody's mind to what was possible. You've got Arduinos and all of these little things where you can add your custom code and even update it over time to continually adjust to changing workloads. Yeah, really to double click
on that, our car connects to the internet. I mean, it's not a fancy car, it's just a Honda Odyssey,
but it connects to the internet when we get home. And one day we were out driving
and I stopped at a red light and the car shut off and I panicked, but it turned out that this was
just a new feature that had rolled out where the car literally turns off when you hit a red light.
And then when you let go of the brake, it turns itself back on. And I guess that's somehow more
economical, but it just randomly happened.
So we've kind of moved from, when Patrick and I were doing embedded work, there'd be this firmware update and you'd have to carry a briefcase with a laptop in it and a cord, and
you'd go to the site and you'd plug it in and update the firmware. And it took thousands of
dollars for you to fly halfway across the world to do that.
And now my car just does it.
I don't even know, right?
I mean, it's just so different nowadays.
I mean, I do feel like maybe they should send you an email
telling you your car is going to shut off starting Friday.
You would think, right?
It just randomly started happening.
Don't freak out, but this is going to start happening.
Yeah, but it shows a dramatic change.
And to your point, Raspberry Pi is also just a massive game changer because it puts it
in everybody's hands.
I mean, I needed a lot of handholding to do embedded work, having no background in C or
anything like that.
And now with Raspberry Pi, you have an entire Debian OS at the edge, which gives you a ton of flexibility.
And so I think as we look at where the edge has moved, it becomes a question of resources. And now you look at AWS Lambda, and you look at our new feature, Custom Functions.
Realistically, JavaScript is the... and I don't want to start a flame war, I don't want to get you guys like a million downvotes, but JavaScript is an exceptionally easy language to learn. And to be honest, if sandboxed properly, it can be as non-dangerous as you want it to be, and it can be as performant as you architect it. So I feel like it's certainly the future of,
I think, at least edge prototyping.
I would be hard pressed to say
that there is still not going to be a use case
once you figure out what you want to do
with data at the edge to continually lower costs
and perhaps solve that hardware problem permanently.
You are going to be on more constrained devices
and maybe you're going to use something like a Kotlin
that can compile down and run in the JVM
or something that's going to be able to function out there
but still has that, I don't know,
the mental clock cycles of the developer in mind
and the ease of use,
the, I don't know, call it the usability,
I guess. I want it to be usable. Because when we're talking about collecting sensor data and doing workloads at the edge, right now it's so new. Everybody's been talking about edge
computing for years, but we are collecting so much data. And I call it the Rumsfeldian challenge
because we just, we don't know what we don't know.
So we better collect everything. And obviously transporting that all to the cloud is not ideal.
So we will solve this problem, but I think it takes a lot of experimentation in the very
beginning and you need something flexible for that. And so these wholly capable standalone
Ubuntu environments like a Raspberry Pi are the ideal place for us to figure out what it is we're even going to do when we're out there at the edge. That reminds me of a talk by the person who runs the AI for the Tesla autopilot. And he was effectively saying, well, we just need to get enough data and then we'll be done. So it's really a data problem.
I mean, I agree with him in principle, but practically the challenge is some data is
more important than others, right? So for example, if you're driving on a road here in Texas
and there's nobody on the road
and it's 70 mile an hour speed limit
and you're just going on a straight road by yourself,
that is a lot less interesting
than when you're part of a 17-car pileup, right?
And so that second thing doesn't happen very often,
but when it does, it's really important to collect that data
because you want every single time that happens, you want to learn as much as possible, right?
Anytime there's a black swan event, you want to learn as much as possible. But as you said,
you can't collect everything all the time, or even half the time, or even a tenth of the time. So you need something that's smart,
that's saying, you know, is what's happening right now interesting? If it is, then start
collecting it. If it's not, then throw it away. And that smart thing has to live on the edge,
you know, by definition. And so I think that the differentiator, and I'm not an autonomous vehicle
guy or anything like that, but, but just looking at the Tesla idea, I feel like the differentiator
there is, can they do smart things at the edge? That's going to make or break that whole idea.
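That "is what's happening right now interesting?" check has to run on-device. As a toy illustration (not Tesla's actual approach), an edge filter could flag readings that deviate sharply from a running baseline, using Welford's online mean/variance so it needs only constant memory:

```javascript
// Toy "is this interesting?" filter for the edge: flag a reading that deviates
// sharply from the running baseline. Uses Welford's online mean/variance so it
// runs in constant memory on a small device. Purely illustrative.
function makeInterestFilter({ threshold = 3, warmup = 10 } = {}) {
  let n = 0, mean = 0, m2 = 0;
  return function isInteresting(value) {
    n += 1;
    const delta = value - mean;
    mean += delta / n;
    m2 += delta * (value - mean);
    if (n < warmup) return false;          // not enough history yet
    const stddev = Math.sqrt(m2 / (n - 1));
    if (stddev === 0) return value !== mean;
    return Math.abs(value - mean) > threshold * stddev;
  };
}
```

Steady readings come back uninteresting; a sudden outlier trips the filter, which is the signal to start capturing.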
And so I, I think there's probably a hundred other examples where, where edge computing is going to
make or break a lot of the next generation of
tech and of ideas. I agree. And the classic example that I always talk about, I was working
with the oil and gas industry, the turbines that are processing at their refineries, they spend
with the oil and gas industry. The turbines at their refineries spin at 20,000 RPM, and if something goes wrong, it goes really wrong. Those things shut down, and at peak natural gas prices it's a million dollars a day that they're losing for just one turbine. It shuts down and it's not able to refine, and they're losing a million dollars a day, or in some cases a lot more than that. And they were collecting data from sensors and pushing it into an old-school data historian.
And their resolution was every five seconds.
And you can look at the data points leading up to a failure and you can say, well, there's probably something there.
Yeah, that's right.
And they had one guy.
They introduced me to the one guy.
And he comes in and he looks at it,
and he's like, well, yeah, here's what happened, and everybody else in the room, we were all
looking at the exact same screen, and we had no idea what he was talking about, and he's like,
well, I remember once in this other place 20 years ago, I saw something like this. We didn't
have sensors back then, but it was a lot like this.
And he just had all of his tribal knowledge and he was 60 and he desperately wanted to retire,
but he could not. He would get called up in the middle of the night and have to fly off somewhere
in the world to analyze something that had gone wrong. And so our mission in this consulting
project with this company was to provide, you know, five-millisecond resolution. But you cannot record that and put it all into a historian, because you're going to overwhelm it.
So we built a system that basically kept a rolling buffer of five minutes and would wait
for an anomaly to occur, would wait through that anomaly or shut down if that's what happened,
and then capture five minutes on the other side, wrap that up into a package, put that into a local
edge instance of our product, which would then be transported up to the cloud for analysis.
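The capture scheme described here (a rolling pre-anomaly window plus a post-anomaly window, packaged per event) can be sketched as a small ring buffer. This is an illustrative reconstruction, not HarperDB's implementation:

```javascript
// Sketch of the rolling-buffer capture described above: keep the last
// `preSamples` readings in a ring, and when an anomaly fires, package that
// pre-window plus the next `postSamples` readings into one event, ready to
// ship to the cloud. An illustrative reconstruction, not HarperDB's code.
class AnomalyRecorder {
  constructor({ preSamples = 300, postSamples = 300 } = {}) {
    this.preSamples = preSamples;
    this.postSamples = postSamples;
    this.ring = [];        // rolling pre-anomaly window
    this.capturing = null; // event currently collecting its post-window
    this.events = [];      // completed packages awaiting upload
  }

  push(sample, isAnomaly) {
    if (this.capturing) {
      this.capturing.post.push(sample);
      if (this.capturing.post.length >= this.postSamples) {
        this.events.push(this.capturing); // package complete
        this.capturing = null;
      }
      return;
    }
    this.ring.push(sample);
    if (this.ring.length > this.preSamples) this.ring.shift();
    if (isAnomaly) {
      // Snapshot the pre-window (including the anomalous sample itself).
      this.capturing = { pre: this.ring.slice(), post: [] };
      this.ring = [];
    }
  }
}
```

With five-millisecond samples and five-minute windows, `preSamples` and `postSamples` would each be 60,000; the steady-state cost stays bounded no matter how long the turbine runs clean.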
So you need the ability to understand what an event is and what data led up to it, and capture that in higher resolution, versus your normal steady-state analysis at five-second resolution, where you're like, as long as the line is flat, everything's great. But as soon as that line starts to move a little bit, and infinitesimally
so, and you're talking about vibration, it's something that's spinning at 20,000 RPM.
It's important to know all of those fluctuations and important to be able to
look at them because not everybody has that tribal knowledge to understand that, you know, when it
goes up by a fraction and down by a fraction every five seconds, that what you're really looking at
is very real problems with vibration in the 24 hours leading up to this thing flying through
the wall and ruining everybody's day. Yeah. I think that problem of, you know, am I seeing something interesting, is a really phenomenal one. I think it's a type of active learning problem. And for something like that... so I definitely, you know, I think JavaScript is... actually, I truly enjoy writing TypeScript,
which compiles down to JavaScript.
I feel like it's a solid language.
Do you think we'll get to a point
where the edge will be language independent?
Or is there something about the edge
where if you were to add more languages,
it's just a lot of work?
Is there something...
Could you describe a little bit,
what's this machine or VM, what's this sort of box that runs at the edge in terms of
software and what actually is going on there? Well, I mean, it depends. What we found is that the opportunity in POCs is that you can put anything out there. It could be any language, and the box doesn't necessarily even need to survive the elements. So you're not really hardware limited until you go out of
proof of concept and into production, right? So once you figure out what your functionality is,
then you start to look at what is the cost-effective hardware that can be run and how
do I replicate this functionality out there? And from a resource
perspective, will I have the benefit of an OS that supports Python? Could I run my statistics
in a Python script? Or could I use JavaScript because I want to do a bunch of API calls to
third-party resources? Or what language accomplishes my goal for the proof of concept
may well be different than the language that accomplishes the goal in final production.
So in the same way TypeScript compiles down to JavaScript, Kotlin, you know, compiles
down to something that runs in a JVM.
I'm sure that somebody very smart, you know, is going to take some super easy language that hasn't even been
invented yet that my children are going to learn how to write that will compile down
to C and ultimately be able to be shipped off through chip as a service.
And you're going to send me the thermostat program that my kid wrote to keep her room
cooler in the middle of the summer and adjust the thermostat automatically. So I feel like I want to believe that the language shouldn't matter, that the programming languages, and to whatever end there are, you know, tribes that adhere to and fight desperately for one language over the other, that probably all eventually goes away.
And we all just drag and drop some boxes onto a screen and say,
that's what I want it to do and go make it happen where I want it to happen.
Containerization is obviously a huge movement and it has been in the cloud for
applications. We see certainly in these smaller edge data centers,
everything is containerized. They're only pushing containers out there; nobody's installing on bare metal. And even at the edge, out at the vineyard, we are running little Docker containers on those Raspberry Pis. So it's entirely possible to have, you know, a Kubernetes cluster pushed out to the edge. K3s is a great, really, really minimalist container
management platform. But again, I don't know how the code gets out to the edge. And to be honest,
I try not to care. I try to work on how does the part of the puzzle that I'm working on,
how does it make everybody's life easier rather than more of a
problem? Yep. Yep. That makes sense. I'm trying to make it just work. And somebody with a lot
more money and a lot more time and who hasn't spent as much of their life banging their head
against the hardware problem is probably going to have to solve that one because I think I may
have given up on it. Yeah, that makes sense. Yeah. I think maybe, you know, I believe a lot of these Lambda functions started as sort of JavaScript kind of front-end servers. And so you have the browser running JavaScript. And so I think maybe it's a starting point that a lot of this, the Cloudflare edge, I think is JavaScript only. And that's probably just because of their pedigree, like where they came from and their inspiration.
But to your point, it's all run on VMs.
And so it's just a matter of time before they'll say,
look, you can point us to your Docker Hub location
that could be running just about anything
and then just really open it up.
It teaches you a lot.
As you start to move into real-life deployments where containerization is the standard, it also teaches you how important it is to build a good Docker image. Because that's one of the things... I feel like Donald Knuth's axiom about premature optimization is one of those things that can absolutely kill a company, but man, if you're going to spend it anywhere, a good Docker container is key. I think when I got to
HarperDB, our Docker container was 350 megs. And I was like, it feels too big. I mean, it literally
feels like it might be too big given that our actual installer, our actual installed binary
is under a hundred, and all we needed was Node.js. I feel like we could do this better. So we spent a lot of time just recently, actually with our
new release, working on that and making it what I guess the industry term is a first-class citizen. Because truly, I think with containerized applications and workloads, and to be honest, ones dynamically distributed by large service providers that
provide rapid access to your application on demand, they don't want your Docker container
running out on their edge servers all of the time. They may only want to follow the sun.
And right now, the best framework we have for that is something like a Kubernetes cluster that can shut down and spin up and have access to persistent disks, certainly for data storage.
But a lot of it is ephemeral.
Yeah, that makes sense.
I had an issue recently with AWS Lambda, where I think Lambda can only be 200 meg.
That's their limit.
And so I wanted to run some machine learning. And as anyone knows,
who's tried to install PyTorch or TensorFlow or any of these things, like you type, you know,
pip install PyTorch, and then you get this in-console progress bar telling you you're downloading like 900 megabytes. And you're like, what? But I think it's because it has all these different optimizers. If you're
running on Intel hardware, there's this thing called MKL, which is some kind of linear algebra
thing. And if you're running on the GPU, they have that. And so it ends up being this massive
thing that really can't be, at least I don't know how to decompose it. And so I think I ended up
getting around that with some elastic
file system. So now Lambda function mounts this file system that Amazon is just holding onto for
you. And you can have a bunch of Lambda functions all using this. But yeah, I think you start to hit
a lot of limitations for good reason, because you're fanning this out now. It's not just some server that could be uber powerful sitting in the Midwest somewhere, but you're fanning this out to many, many different nodes,
potentially all over the world.
And so that just creates a lot of limitations
that people might've not had to deal with otherwise.
It creates a tremendous number of limitations.
I mean, it also creates a lot of opportunity
for the challenges around
logistics. And that's the other thing that Kubernetes, for better or worse, is very good at. It's like, I have an atom, and I want this atom to do some work, and I want it to do some work here, across all of these places. And it's very easy to script it, and it's very easy to spin it up, and it's very easy to spin it down. And truly, as the container sizes get smaller and as the edge compute resources become more powerful, you're just going to continue to push out. And I don't see any change in containerized architectures coming, because I can't imagine a better, more atomic way to send out
a core piece of functionality than in that container. I mean, would I like it to have
less overhead? Sure. Would I like it to be a little less complex? Yes. But it does a great
job. And that's why DevOps people are so angry all the time. Yeah, that's right. I think, and you
could please fill in the gaps here, but I think the way that the container system works is, it's kind of like you start with some base image, and then it keeps a record of your commands, like download Node.js and install it.
You know, those commands are run starting from some frame of reference.
Maybe it's an Ubuntu install or something like that.
And so, you know, you don't have to actually copy the Ubuntu install because that's sort of your base image that everyone has agreed on.
This is the Ubuntu image.
But you're copying over basically what you've done to that image.
And so I guess,
and walk us through this, but like shrinking the Docker container in this case, I guess means just,
does it mean doing less things to the base image so that there's less to keep there?
Well, there's a process called like a sequential build where you could bring in the Ubuntu image and then you install Node.js. But
all you really need is Node.js because on top of Node.js, which is the only prerequisite for
HarperDB, you install HarperDB. And so rather than carry the whole Ubuntu image, because again,
these are going to be installed over a Linux OS, right? That's what's running Docker, or the Linux subsystem on Windows.
So you've got Linux. You don't need all of Ubuntu. You definitely need Node.js because we require
that. So ultimately you want to install Node.js and there are Node.js based images, and then you
can install HarperDB on top of that. And then in our case,
because we persist to disk, and we're not just reading from inbound streaming data,
we need to persist something, your data and your config. We, on container start,
will do the first-time install, i.e. reach out to that persistent disk, set up all the files that
we need, set up your config, set up your data store, set up your data files. And then ultimately
that becomes your install. So it's not plug and play because otherwise we wouldn't be able to
persist any data, but it is as quick as it can be that first time. And then if it were to shut down and start
up, it will look at its persisted disk, its file mount basically, and say, oh, well, all of those
install files are there, so I'm good. I'm just going to basically spin up the APIs that HarperDB
has and wait for somebody to try to talk to me. Got it. I see. And if it reads those files and
it says this is version eight of HarperDB, but it's on version nine, then it has some migration logic and all of that? Exactly.
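The slimming-down Jaxon describes can be pictured as a Dockerfile along these lines. This is only a sketch: the base image tag, package name, and start command are illustrative assumptions, not HarperDB's documented install.

```dockerfile
# Start from a slim Node.js base image rather than a full Ubuntu image,
# since Node.js is the only prerequisite mentioned.
FROM node:18-slim

# Install only the application on top of the Node.js base.
# (Package name and command are illustrative assumptions.)
RUN npm install -g harperdb

# Data and config live on a mounted persistent volume; on first start the
# container sets up its files there, and on restart it just reuses them.
VOLUME /data
CMD ["harperdb", "run"]
```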
Got it. Cool. Cool. That makes sense. So cool. So let's dive into databases now. So we have,
I think we've given a really good overview of edge computing. And so you kind of, you can see
kind of how this can follow. You have all these machines running, let's say, in the vineyard, and they want to do
things without having to phone home.
So they don't want to have to give all the data to some server, which could be a thousand
miles away.
They want to do some processing locally, a lot of processing locally, and then just send
back the most important things. And so to do that and to
coordinate that, we need to have some centralized place where we can have information. So to use
the vineyard as an example, you know, maybe we want a centralized place where we store
what we consider to be like anomalous temperatures. And that could change as the season changes.
And so we want to keep some information
in all of these Raspberry Pis
so that they're all kind of on the same page
and they can kind of make decisions
kind of as a unit, right?
And so what you end up having to do is,
if you were to write this by hand,
is do a lot of message passing.
And anyone who's... you know, I wrote MAMEHub a long time ago.
So it's a peer-to-peer kind of video game thing.
You know, anyone who's ever done peer-to-peer
knows how hard it is.
You know, getting two Raspberry Pis
to talk to each other,
you know, unless they have a public IP address
is super difficult. And just having a mesh network,
even a mesh network of public computers, is really difficult. So kind of walk us through,
like, what is kind of HarperDB? How does it solve this problem? And why is it able to do what it does?
Sure.
So HarperDB was built by developers.
I love the phrase, by developers, for developers.
It feels like every product is that, really, isn't it?
Well, maybe not for developers, but definitely by developers.
You got to have that.
Ultimately, it was built to solve a lot of the pain points that we found when we were building distributed applications. So I know that I have workloads that I want to run on disparate devices. I know that I'm going to collect some sensor data. I know that I want to
run some calculations on that. I know that sometimes the sensor data comes in
more quickly than I can run those calculations. And sometimes it comes in less
quickly. I need to persist that in some way. So I'm writing it to a file or I'm holding it in RAM,
except I lose power. And now all of a sudden I've lost all the data I was holding
or, or I was able to make my calculation. And now I've reduced that stream of data to
the every 10 minute running average that I really want.
And I have that 10 minute running average.
And now I want to transmit that 10 minute running average off to the server that's going to analyze
all of the 10 minute running averages
across all of my data sensors that I'm collecting,
except I lost my network connection.
So now I need to build a buffer to hold that.
And oh, wait, somebody shut off the power again.
So I lost my buffer.
So that's a giant pain.
And HarperDB is designed to function
and to push that data storage out to the edge.
So you can run your calculation,
or sorry, you can collect your data
from the sensor with an app process. And then you can simply put it into the database, and it is persistent and it is ACID compliant. And you
know that it's been stored. And then you can have a second process that will pull those things out
and aggregate your 10 minute averages. And then it runs every 10 minutes and then it puts the result
into a second table. And that table is now persisted and it is there and we know that we
have it. And the fact that you want to now move that data over to, say, the cloud node for analysis,
we have what we call clustering, which is not traditional database clustering, but what we call bi-directional, table-level data replication. So you don't have to replicate an entire database with HarperDB.
You can literally choose within a schema or a table what records, sorry, what tables are going
which direction. I can publish it up. I could subscribe to say a thresholds table that might
bring the thresholds for an alert down to the edge. And then when I create my 10 minute running
average, I can publish that table up to the cloud. So my application gets a lot simpler because I only need to make localhost
calls. I don't need to worry about network connectivity. I don't need to worry about
holding in a memory buffer. I don't need to worry about what happens if the power goes off because
I know it's persisted. I don't need to worry about Wi-Fi going out, or whatever, the mesh network collapsing for a few seconds because somebody kicked a power cord. It's there. And when it gets plugged back in and HarperDB boots back up, it's going to say, oh, I've got these messages, I have not sent them, I'm going to send them now. And it'll send that. So really, if you look at what HarperDB does, it allows you to simplify your programming
by just sitting there being an always on, always connected data fabric. So you can move your data
wherever you need, and you can do operations on it, and you don't need to move all of it.
But ultimately, it reduces your application code to just making local calls. So it also,
to some degree, allows you to bolt
down that box a little more because you don't need your application code to be making calls out to
third-party APIs. You could have a cloud server making those calls, putting the results of those
calls into HarperDB, and then subscribing those calls, the results of that data, back down to a third-party API call table. And so I don't need to make those calls from the edge.
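The pattern Jaxon lays out, one process that only writes locally and another that aggregates on a timer, can be sketched like this. It's an in-memory stand-in: in HarperDB the two arrays would be persistent tables reached over localhost, and all names here are hypothetical.

```javascript
// Two-process edge pattern: ingest writes locally, aggregator reduces.
// Plain arrays stand in for the persistent local tables described above.
const readings = [];   // "raw readings" table
const averages = [];   // "10-minute averages" table

// Process 1: ingest — write each sensor sample locally and move on,
// regardless of how fast or slow the samples arrive.
function ingest(sample) {
  readings.push({ value: sample, ts: Date.now() });
}

// Process 2: aggregator — runs on a timer (every 10 minutes), reduces
// whatever has accumulated into one row in the second table.
function aggregate() {
  if (readings.length === 0) return;
  const sum = readings.reduce((acc, r) => acc + r.value, 0);
  averages.push({ avg: sum / readings.length, count: readings.length });
  readings.length = 0; // raw rows are no longer needed locally
}
```

Replicating the `averages` table up to the cloud would then be the data fabric's job, not the application's.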
Wow, that's cool.
So how does the developer handle conflicts, right?
So power goes out.
You say that the rolling average is X, but because the power went out,
someone else got there first, and they think the rolling average should be Y.
Power comes back on, and now you have a conflict. Is there some API to handle that? Or is there
something about the way the transactions are specified that there's always a logical way
that gets resolved? How does that work? There is. In most of the instances that I'm
kind of describing, a node of HarperDB with a piece of compute, maybe some sensors hanging off the side of it.
There won't be a conflict because the rolling average that I'm calculating is based on the sensors that I'm attached to this particular node.
So there may be a thousand nodes, but their rolling average is going to be basically tied to their sensors.
So there wouldn't be a conflict.
However, when you're looking at other applications
where perhaps my edge unit is not a Raspberry Pi
in a field collecting sensor data,
but instead is an edge node in one of those smaller data centers
and a user logs on and because of their IP address
and their location, they're steered to one
and they input some data into a form and that immediately is replicated up to the cloud,
which powers the massive UI for the core application.
However, somebody else is elsewhere and they may have entered a number into that value
a little bit after,
but their network connection was a little bit faster,
and they get it there first.
So the way HarperDB handles that is we have timestamps and we look at a unified time server and say,
this timestamp versus this timestamp,
whoever wrote last, the last writer wins, and we can overwrite that.
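A minimal sketch of that timestamp comparison, not HarperDB's actual implementation: each write carries a timestamp from the unified time source, and the merge keeps whichever write is newer, with a deterministic tie-break so every replica converges the same way.

```javascript
// Last-writer-wins merge between a locally stored record and an
// incoming replicated record. Field names are hypothetical.
function lwwMerge(local, incoming) {
  // Newer timestamp wins outright.
  if (incoming.ts > local.ts) return incoming;
  // On an exact tie, break deterministically by node id so that every
  // replica applying this rule ends up with the same record.
  if (incoming.ts === local.ts && incoming.node > local.node) return incoming;
  return local;
}
```

The order of arrival stops mattering: merging A then B gives the same result as B then A, which is the property that makes replication converge.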
But there are often times where that will cause a further
conflict that you can get into. Ultimately, it's an age-old problem in distributed computing.
And that is the conflict between data happening over a slow versus a fast network connection.
And so in our next chapter, or our next version, or maybe two versions away, I forget, we're looking at CRDTs, which are conflict-free replicated data types.
So they have a bunch more metadata associated with them.
And they can therefore make comparisons and say, all right, I know you wanted to do this, but I have to handle this transaction first, even though it came in later.
So while I have persisted you, I'm going to unwind you and rerun this and now run you.
And the end result is going to be what was intended. Right now, you can do that with HarperDB simply through intelligent architecture. But if our motto is it should just work, and, you know, simplicity without sacrifice, which is our tagline, ultimately we should handle that for people automatically.
So we always have an eye on that.
And we're working, we're new enough that we're working with customers and we help them architect these solutions because, you know, distributed computing is a challenge for a lot of people that is new to them.
And so we're not the only subject matter experts,
but we feel like we have a good handle on how we can architect around some of the limitations
of existing solutions. And we're always looking forward to try to figure out what the best
long-term solution is going to be. Yeah, I remember reading about this with Bitcoin, where it's called the double spend problem,
where basically two people are in different geographic regions, or maybe a better way of
saying it, who are somehow far apart in the internet space, can both spend the same money
at the same time. And then it might take a really long time for that to get resolved.
And until it's fully resolved,
if someone actually executes that spend,
then now you've both been able to buy a coffee for the same price or something like that.
And again, I'm not a crypto expert either,
but I think that what's going on there is there's,
I guess like there's just,
it becomes a kind of popularity contest
where there's this big battle over who is right.
And eventually there's a consensus.
And so I would imagine, yeah, something like if you're using HarperDB for e-commerce... You push out further and further and closer to the edge because you want to get low response times and you want to get that request.
But then for most architectures, at least currently, there's one master database because that's how you solve that problem. And it's a giant vertically scaled instance
that costs hundreds of thousands of dollars a month
sitting in Oregon.
And ultimately you're going to overwhelm it
with a thousand different servers running your Lambdas
that are all going back to the same place.
And we did a proof of concept with a large social
company. And if you were in Buenos Aires and you hit their API, the ping to the endpoint,
like the connection was almost instantaneous because they had a lambda running in South
America in a data center. The data that would come back, your friends list, took sometimes upwards of 11 seconds, because it was all the way back in... So we ultimately realized that our benefit was, we can handle pushing the data out to the edge, and we can handle, with our new custom functions, Lambdas that are at the edge also. So basically your data is right next to your Lambda that's trying to access it.
And then we handle moving the data around and the data that we move around and replicate to all the other instances in a globally replicated cluster of HarperDB is the transaction.
It can be as large as the initial operation, but it can also be smaller because it might
not change everything at the end of the day. So we can move less data around and we can move it on pipe that we control
because we understand the internal IP addresses, which are going to be faster than traditional
external IP addresses. And it becomes a homogenous data set with very, very low latency for everybody who's interacting on it.
And now literally the only challenge that remains is to make sure that multiple actors acting on data at the same time are resolved correctly.
So we say we are ACID compliant at the node and we are eventually consistent.
So right now we can't do, for example, financial services, right?
We're not going to
solve that double spend problem. But there are a lot of places where that's not critical. Social
media is certainly one of us, but we're working on a solution that would make us able to solve
that problem. Very cool. So we talked about kind of mining equipment and some of these like really
specialized environment. What about if someone is, here's a good example. What if someone's just building an email app? So an email
iPhone app, right? So, you know, they would want to have access to their emails. Obviously the
server has a copy of their emails. It's kind of caching, but it's also really more like a database.
I mean, you could imagine someone wanting all of their emails on their device, right?
And so could someone use HarperDB for something like that?
I mean, that's more of like a consumer facing,
you know, like on their consumer device
running an instance of HarperDB.
Is that part of the sort of use space for that?
Absolutely.
We don't run, we need Node.js,
so we're not going to run on an iOS device. We were able to use
UserLAnd, which is an Android app that actually installs a Linux subsystem, a full Ubuntu copy,
and we could run it there. It was not a recommended implementation, but you certainly
can do it. You can get it running on an Android tablet. I built a vehicle telemetry app on a tablet that was completely self-contained.
And it would store local data in HarperDB.
And then when the tablet came within Wi-Fi range of the office, it would then replicate
that data into the cloud.
And you would see the vehicle and its path and any violations from its thresholds immediately represented.
So if it had cell service, it would be doing that in real time.
If I shut off cell service, it would still collect that data, still persist that data.
And when it had a network connection, it would push that up.
So it's very, very possible to persist that data without maintaining that connection.
And I think I forgot what the literal question was. Oh, yeah. So the question was, could you run HarperDB on an iPhone if you're
building some app that needs a window of the data locally? So imagine I'm building an email app,
I go to airplane mode, I still want to see my emails, I delete a few, I come off airplane mode,
it needs to sync. All of that
sounds, I would put that in the hard category in terms of being able to do that correctly,
where I don't delete the wrong email or have a double delete or something. And so it'd be amazing
if there was, and there might, I haven't done a survey on this, but it'd be amazing if there
was some technology out there where I could just use some library, and I would have some, not a snapshot, but some slice of the data locally on my phone, and it would take care of everything else, which sounds like what HarperDB is doing.
And then that's when you brought up the restriction around Node.js and all
of that.
Yeah.
And there are, there are pure like client side, JavaScript browser level, JavaScript
libraries that, that can accomplish a lot of what we do.
They'll make use of IndexedDB as an underlying key value store.
We have an underlying key value store that we use called LMDB, which is the Lightning Memory-Mapped Database, which is extremely fast, very performant, written in C, but obviously
it's just a key value store.
So it doesn't have all of the properties that you'd want in a database, SQL querying and
indexing and stuff like that.
So we've built all of HarperDB's functionality on top of that.
However, underlying that is a key value store.
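The layering Jaxon describes, database features built over a plain key-value store, can be illustrated with a toy secondary index. A Map stands in for the key-value store; this is not LMDB's or HarperDB's actual design, and all names are hypothetical.

```javascript
// Layering a secondary index over a plain key-value store, so a
// "SELECT * WHERE city = ?" becomes an index lookup instead of a full scan.
const kv = new Map();     // stand-in for the underlying key-value store
const byCity = new Map(); // secondary index: city -> set of record keys

function put(key, record) {
  // (A real store would also remove a stale index entry if an update
  // changes the record's city — omitted here for brevity.)
  kv.set(key, record);
  if (!byCity.has(record.city)) byCity.set(record.city, new Set());
  byCity.get(record.city).add(key);
}

function queryByCity(city) {
  // Index lookup, then fetch each matching record from the KV store.
  return [...(byCity.get(city) || [])].map((k) => kv.get(k));
}
```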
So could we, if we had unlimited time and resources, replicate all of
that into just a client-side library that you could include in a browser app, have it sync data
down from a cloud and be completely performant, self-standalone. And if your browser on your phone
then reconnected to a network later, execute the exact sort of syncing that HarperDB does currently
from, say, Raspberry Pi or a smaller data center edge node. Absolutely. You could 100% do that.
And there are a few solutions that do that. The challenge is they maintain those subscriptions.
Maintaining those subscriptions is expensive on the server. So continually syncing that data back and forth and holding what is in effect a socket
open so that you can subscribe to a specific query from, say, a server-side entity is very
expensive.
Subscribing to a table is a lot less specific because you're going to have a lot less individual subscriptions. It's not that customized. Once you start getting those
query level subscriptions, it can become very expensive. Meteor.js was a great platform that
did that. And it was built on top of MongoDB, and it looked at the transaction log to figure out what
real-time data needed to be pushed down, but it was incredibly resource inefficient.
Oh, interesting.
I was wondering because I remember when Meteor.js came out,
I did try the demo.
I think we talked about it on the show years ago and it looked magical.
Like it looked like, okay, well, you know,
I have this slice of user data
and I just want it to exist over here.
And it just magically worked, but then it never took
off. And it sounds maybe like this is why, like it just, it just at scale, it just fell apart.
It was truly magical. It was just, as soon as that group
started to move to include other databases, they realized how incredibly challenging that was
because they integrated it so closely. And so they ended up building an entire library that
moved away from Meteor.js and ultimately became Prisma. That's what it was, Prisma.io.
Oh yeah. I've heard of that too. Yep.
Yeah. So that was the next iteration of that. That was the next iteration of how do we sync data between a client and a
server and do that in a more efficient way and not necessarily overwhelm with
individual subscriptions.
And they're all great use cases and they are truly magical for users,
but they become incredibly resource intensive.
So we are focusing on, I'd say less
the long tail of simplicity and providing the bulk of functionality we can within what we know to be
the limits of data replication between every single client on earth and one central data
store. Because obviously the other challenge is if I give you access to every single piece of data,
then you could update that data. And now I have a billion clients that are all trying to resolve,
you know, who did what, when, what was your network timestamp? You know, who came first?
What's the right answer? And then you'll never get into financial services, which as you know,
is where all the money is. Yeah, that's right. Closer to the money supply. Yeah. So it sounds like the sort of center of mass for HarperDB, just using Netflix as an example: Netflix wants to
push its most popular videos to the edge so that you don't have to go all the way to Los Gatos or
wherever Netflix's data center is to get that video, right? And so you can imagine all over
the world, there's a ton of these like small data centers hosting whatever the most popular Netflix
video is. And so you have this cache and so people will go to the server. The server will say,
oh, yep, I have that video. It's one of these super popular videos for your region. Here it is.
Or, oh, I don't have this really esoteric video about leopards or something. I'm going to have
to go to the main data center and go fetch that. But any time you write any kind of logic or really do anything with computer with a computer, you're going to want to keep some records.
Right. You're going to want to keep track of how many people watched each video.
And so now you could every time someone goes to watch a video, you could phone home to the main server. But now you hit a whole bunch of other issues, as we talked about
with that main server now getting bombarded with tons of requests all the time, and it doesn't
scale. So what HarperDB could do is sit on these edge nodes, collect all of those statistics,
so that tomorrow Netflix knows what videos are the most popular tomorrow and it can keep that fresh.
And then all of that gets replicated as all these machines are ticking up this histogram
of videos.
And then at some point, maybe at the end of the day, someone or some process at Netflix
can get a copy of this database that all these aggregators are sharing and read it and learn
some intelligence from it. Did I explain a use case pretty well or is there any?
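One way to picture that histogram-ticking without write conflicts, a hedged sketch with hypothetical names: each edge node only increments its own slot, so the central process can combine replicas by simple per-node summation, with no two nodes ever writing the same field.

```javascript
// Per-node view counters. Each edge node keeps its own local table and
// only ever increments its own nodeId slot, so the replicas never conflict.
function increment(counter, nodeId, videoId) {
  counter[videoId] = counter[videoId] || {};
  counter[videoId][nodeId] = (counter[videoId][nodeId] || 0) + 1;
}

// Central aggregation: merge the tables collected from every edge node
// by summing each node's independent count.
function totalViews(counters, videoId) {
  let total = 0;
  for (const counter of counters) {
    for (const node of Object.keys(counter[videoId] || {})) {
      total += counter[videoId][node];
    }
  }
  return total;
}
```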
You did. And I'd go one step further to say, you would run an AI machine learning model to actively compress all of the individual data points that maybe come
through a Netflix UI, a user experience. I might hover over a movie. I might watch the trailer for
it. I might only get halfway through the trailer. If you've ever thumbed through your Netflix queue, gone past a row of films and gone
back up, you'll see cover art change for films as they try to test different cover art to see if
you'll click on that. So a lot of these decisions are simply like, we want to try A, B test this
thing automatically. But at some point, somebody is going to realize that there's an advantage to one of those covers versus the other cover, at which point that is going to become a policy that is rolled down to
every single client. We're saying, this is the best cover for this. This is what gets people
to click on this. Or based on this profile, we're going to show this cover and the demographic data
that we've classified. And we're going to run a machine learning model that will basically
classify all of our users into one of three archetypes and the cover art is defined by
that archetype. All of that happens at the edge. The only part that doesn't, you know, the larger aggregation, or probably the strategies derived from that knowledge, is going to happen in the cloud. But most of it you want to have happen out there.
Otherwise, you run into the same problem everybody runs into
before distributed computing was even a thing,
which is, my God, we need this server to be literally the size of the planet.
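Counting views this way is essentially a grow-only counter: each edge node only ever increments its own tallies, so merging replicas is just a commutative per-key sum and the nodes never conflict. A minimal JavaScript sketch of the idea (the video IDs and shape of the data are invented for illustration, not Netflix's or HarperDB's actual schema):

```javascript
// Sketch: each edge node keeps its own view-count histogram per video.
// Merging replicas is a per-key sum, so nodes can replicate lazily and a
// nightly job can read the combined totals.

function mergeHistograms(...nodeHistograms) {
  const total = {};
  for (const hist of nodeHistograms) {
    for (const [videoId, count] of Object.entries(hist)) {
      total[videoId] = (total[videoId] || 0) + count;
    }
  }
  return total;
}

// Two edge nodes ticking up counts independently during the day:
const nodeA = { 'stranger-things': 120, 'the-crown': 40 };
const nodeB = { 'stranger-things': 75, 'squid-game': 200 };

const daily = mergeHistograms(nodeA, nodeB);
// daily -> { 'stranger-things': 195, 'the-crown': 40, 'squid-game': 200 }
```

Because the merge is associative and commutative, it doesn't matter in what order, or how often, the edge nodes replicate to each other; the end-of-day totals come out the same.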
Right. Yeah. Yeah, that makes sense.
I think, too, there was a study, and I'm sure you're more familiar with this, Ian, that I think Google did back around 2011, which said that basically, for every millisecond it takes their site to load, their product gets hit in some significant way, or maybe it was every 10 milliseconds. So there are real economic advantages. It's one of those things that's probably innate, probably subconscious.
You're not sitting there looking at your watch saying, oh, this was 80 milliseconds.
I'm out.
But subconsciously, the product gets hit hard every 10 milliseconds it takes to return a
result.
And so anything you can push to the edge just will turn into material dollars and cents.
Exactly.
And I mean, ultimately, we say, at least in gaming and computing, that 16 milliseconds is the threshold at which a human being can perceive a delay. So you want to be down around 16 milliseconds. And I mentioned a case study earlier where users in Buenos Aires were spending,
you know, a few milliseconds connecting to a local API,
but then data would take anywhere
between 300 milliseconds and 11 seconds
to bring back a friends list.
And when we started running our tests
with our custom functions and the data,
which had been replicated out,
your friends list doesn't change all that often,
but we'd replicated the data out right to where the endpoint was.
And you were seeing response times of five to 10 milliseconds. We knew that under load we would see that push up, but our objective was under a hundred milliseconds, and we beat that easily, which you were never, ever going to do if all of the data still lived in Seattle. Yep. Yep. Totally makes sense. Cool. Yeah. I think we
covered a ton of really good material here. I think we opened all the bookmarks, which is good.
Let's jump into HarperDB as a company. So what's something that is kind of unique about HarperDB? It could be
the way you play in your off sites. It could be the layout of the office. Or what's something
where when you showed up at Harper or maybe through your tenure there, it's really made
Harper stand out in terms of the work environment? Well, I think if you go to the site,
you'll see our logo is a dog. Harper's actually our CEO's dog's name.
Oh, wow. Okay. All of our demo datasets, if you go to our postman collection, if you go to
docs.harperdb.io, you'll see we have a ton of demos and our demo data sets are all the dogs owned by the people in the office and then a breeds table.
So you can do a join of those data sets.
So all of our demos are based on the concepts of dogs.
And at the end of the day, it's about somebody who is hopelessly loyal to you, always there.
And ultimately, they make your life better. And so if that is the driving
architecture of every employee we hire, every feature we look at on our feature up board and
say, do enough people want this? Is it going to make people's lives better? And a lot of us are
multidisciplinary software guys. So
we've seen lots of problems over time. And to be honest, this product was built to solve problems
that the founders were having in specific application at their former company. But they
solve a lot of problems that I've had too. And there's no limit to problems you face as a programmer.
And we call our approach ultimately collapsing the stack.
So we now have effectively Lambda functions, or old school, you might call them stored procedures, but they're written in JavaScript, and they're super easy to deploy, and they make your life easier and better.
make your life easier and better.
And hopefully you can spend less time working on that and more time
outside playing with your dog, which is all they really want.
Yeah. Do you let dogs in the office? This is a great debate. I've worked at places where dogs
are in the office. I never had an issue with it. Definitely some people didn't like it.
And I've worked at places where dogs were banned and people really didn't like that either.
What's Harper's take on dogs at the office?
Well, when we had an office-
Oh, that's true too.
When we had an office, dogs were absolutely welcome. The irony is that Harper
was not a nice dog and Harper was the only dog. If Harper wanted to come to the office,
no other dogs could come in the office. But otherwise, you could. We talked about conflict resolution and replicated data types. Ultimately there were also conflict-resolution dog types, where certain mixes of dogs were allowed in, but if that dog was going to come, we definitely knew you can't bring this dog, because they will not get along. Yeah, you need operational transforms for dogs. This person
has to get transformed across the hallway or something.
The SQL query with NOT IN. Yeah, that's right. WHERE dog NOT IN the set of dogs that disagree with this dog.
I think in the last episode, we were talking with the CEO of Pinecone, which is a vector database.
And Patrick was bringing up R trees.
I think this would be a perfect example where we could have rectangles for each zone of influence for each dog.
And if we get an overlap, that throws an alert or something.
Yes.
The Venn diagram of dogs that don't get along is just a circle. We just can't put all these dogs in one room. It's just too many dogs.
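For what it's worth, the overlap test that an R-tree would accelerate is easy to sketch for two axis-aligned rectangles: they overlap unless one is entirely to one side of the other. (The dogs and coordinates here are invented, in the spirit of the joke.)

```javascript
// Playful sketch of the "zone of influence" overlap check: two
// axis-aligned rectangles overlap unless one is entirely left of,
// right of, above, or below the other.

function zonesOverlap(a, b) {
  return a.xMin < b.xMax && b.xMin < a.xMax &&
         a.yMin < b.yMax && b.yMin < a.yMax;
}

const rex = { xMin: 0, yMin: 0, xMax: 4, yMax: 4 };
const fido = { xMin: 3, yMin: 3, xMax: 7, yMax: 7 };
const lassie = { xMin: 10, yMin: 10, xMax: 12, yMax: 12 };

zonesOverlap(rex, fido);   // true  -> throw the alert
zonesOverlap(rex, lassie); // false -> peace in the office
```

An R-tree just makes this fast at scale, by letting you skip entire groups of rectangles whose bounding box can't possibly overlap yours.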
So, okay, it's distributed. Are you hiring interns or full-timers, and where are you hiring? What kind of people are you hiring? Can you walk us through it? I'm sure you have a careers page people can check out, but ostensibly, at a high level, what are you looking for at HarperDB on the engineering side? What kind of persona are you looking for?
We just had our first hiring round in a couple of years. We built out all of our core functionality. We got through what I'd call the first versions of this, where we're figuring out, what is the thing supposed to do? How is it supposed to work? And what is our technical debt left over from that learning process?
And we've cleaned that up. So now we're out there looking for a new full stack developer, a designer, and an infrastructure developer. Once the product is sound, we want to increase the size of the team that's helping build cool new features. But right now our product makes it super easy to deploy, and we think it meets the needs of most of our customers. Now the bulk of the challenge becomes the services layer that we put in place to help big customers solve their architecture problems, because distributed computing is a new paradigm for many of them.
So an infrastructure developer, somebody familiar with taking a Kubernetes cluster and extending it across public private clouds, figuring out how to make it work with edge devices and script all of the intranode connectivity.
So we're looking for obviously very smart people in DevOps, a full stack software engineer.
Node.js is what we're written in.
So Node.js is a prerequisite there. We are also not solving what I would call traditional programming problems. We're in a very, very specific space. So we're not looking for extremely experienced programmers.
We're looking for people who sort of get it and understand the goal is to build something that's a joy to use.
And as such, there might be a little more heavy lifting on our side so that there's a little less heavy lifting,
you know, on the parts of developers who are using our product. So to that end, we really, really like to be able to take somebody in
and make sure that they care what the customer thinks.
Because there's a lot of developers who want a functional spec
and they want to build out according to code.
And then they want to check out.
And I met the objective and I'm like, guess what?
The objectives are going to change every day,
but there's one core and that's it'll only change
if it makes it easier to use, more stable, smaller,
tighter, faster, whatever.
And the other part is you're free to bring suggestions
to the table just as much as our CTO or myself
or our director of marketing
who's out there on Dev.to, you know, reading all the articles and all the feedback
on our blog posts. And it's like, you know what? Everybody hates this thing. Yep. Like they talk
about how we don't solve it, but nobody solves it and everybody hates it. And maybe we should look at
that. And that's just as valid as an idea, um, as the idea that, you know, our clustering engine
should perhaps change to something written in a lower level language so that it's faster.
Yeah, totally makes sense. So the job isn't just, you know, inverting the binary tree or solving some really tricky dynamic programming problem or something like that.
That's not actually the job.
That might be something you have to learn as a rite of passage, but it's not the job.
Yeah, I mean, you'll totally do those things.
You'll totally 100% do that. But we've got so much of the core written that at this point, our patented data model and our indexing and all the things we do are really, really solid. I think we'd love somebody to become familiar enough with it that our CTO could take a day off.
That'd be nice every once in a while, right?
But I think the other,
the other part is just the flexibility to say, I don't know the answer,
but also nobody knows the answer.
So let's figure out a way to write it two or three or 10 times,
test it all and figure out what the right answer is right now.
One of the things I realized is that the most authoritative paper on resolving conflicts in distributed computing was written in 1984 by a woman at Microsoft. That's the paper that all of the articles eventually go back to and cite as the primary influence. We've known it was a problem for a very long time, and people still end up at that article, because we have not solved that problem yet.
Yeah. I don't know if that is the same as Paxos. I've heard the name Paxos a lot. I think that's a way to do leader election and resolve conflicts.
That paper, at least in my circle, seems to come up a lot.
But what everyone tells me, and I'm sure you've seen this too,
is it's great in theory.
Everything's great in theory.
And then in practice, you have to find out the right corners to cut
so that something doesn't take three months to be consistent.
And also, on the flip side, doesn't have massive errors. And so it's playing
that game, I think, is a question of what do the customers really value? I think at the end of the
day is what really matters. I think somewhere down the line, the idea of Raft consensus, leader election, all of that will fade away.
And the data itself will contain bits of metadata that allow you to have a
leaderless distributed system.
So inherently, all of the information each node needs in order to know what to do is present for it to execute on, rather than having a central broker that directs traffic. Because you could have a cluster, or two leaders, or failover, or whatever, but inevitably it's going to be a single point of failure if you wait for that one node to make that decision.
You're going to have lots of people doing things and it won't scale.
So it is my thought that ultimately it will be able
to decide in a deterministic manner by itself,
just based on the data itself.
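One common way to make "the data itself contains the metadata" concrete is last-writer-wins: every write carries a timestamp and an origin node ID, and any node that sees the same set of conflicting versions deterministically picks the same winner, with no broker involved. A sketch of that idea (this is a generic illustration, not HarperDB's actual conflict-resolution algorithm; real systems layer more on top):

```javascript
// Each write carries a timestamp and an origin node id. Given the same
// set of versions, every node picks the same winner deterministically:
// highest timestamp wins, with node id as a tiebreaker.

function resolve(versions) {
  return versions.reduce((winner, v) => {
    if (v.ts > winner.ts) return v;
    if (v.ts === winner.ts && v.nodeId > winner.nodeId) return v; // tiebreak
    return winner;
  });
}

const conflicting = [
  { value: 'blue',  ts: 1700000001, nodeId: 'edge-us-west' },
  { value: 'green', ts: 1700000005, nodeId: 'edge-eu' },
  { value: 'red',   ts: 1700000005, nodeId: 'edge-apac' },
];

resolve(conflicting).value; // 'green'
```

The important property is that the decision is a pure function of the versions themselves, so it doesn't matter which node runs it, or in what order the versions arrived.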
And it has some interesting applications down the road
for quantum computing where, you know,
you can make probabilistic determinations across massive
data sets. And obviously databases are supposed to be deterministic. And there's a lot of debate
about whether or not quantum computing could ever be used for data persistence or data logic. But
I think there's a tremendous opportunity there to find the lowest energy solution or the probable lowest
energy solutions for a query. It just requires a lot more qubits than we have right now.
But 10 years down the line, man, my patent is going to be awesome.
Yeah. And to the point you brought up earlier, I mean, there is a double spend problem and there are like very specific niche cases where you do have to spend that time and you do have to have sort of that
perfect answer. But the vast, vast majority of the time, you don't. And so in all of these instances,
you can use edge computing, you can use things like HarperDB and all these like edge computing services that kind of make it easy to deploy to the edge.
Docker, all the things we talked about will be extremely, extremely important.
And then that one time when you actually click the checkout button, that time it can go to the server and take a long time. And people kind of expect that. They expect, okay, if my credit card is going to get charged, I expect to wait a little bit. And you can have the best of both worlds just by being smart about when to use A or B, and both of them are extremely important.
Exactly. It's the challenge that you want. You want it to be as fast as possible, but not too fast. Yeah, that's right. As fast as possible without hubris, right? Exactly. That is the Sisyphean struggle. We push the rock up the hill every day. Very cool. Cool. So let's jump into how people can reach you and how people can learn more about HarperDB. What are some good resources for folks out there? And alongside that, we have a lot of folks in university who would love to know: can they try out HarperDB for free? Is there a permanent free tier?
Or what are some of the opportunities for them?
It'd be great to kind of cover some of those bases.
Sure.
Our URL is harperdb.io.
On there, we have a docs tab, which will teach you everything you need to know from getting
started.
You can install HarperDB locally.
It just requires Node.js and npm. You can just run npm i -g harperdb. Super easy. We also have a management studio, which is web-based. It works even with your local instances, because obviously your browser is capable of making local network connections. You can manage local, cloud, and other instances that you might have installed through our studio. That allows you to connect
instances to each other, set up intra-node data replication at a table level, pub and sub,
as well as our new custom functions feature where you can hang your lambdas basically off the side of HarperDB
at its own API endpoint. So you can not just use our operations API, but set up something with
third-party authentication that makes a query, perhaps inserts some data, then runs another
query, calculates an average, and inserts it into a time series table. It's a super cool piece of
functionality that then you can package up a project, click a button and send it to any of
the other HarperDB instances in your organization. So it's very easy to deploy these as well.
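As a rough sketch of that flow, HarperDB's operations API takes JSON bodies with an `operation` field. The schema, table names, readings, and endpoint below are invented placeholders, so check docs.harperdb.io for the real shapes:

```javascript
// Hypothetical example of the flow described above: insert raw sensor
// readings, query an average with SQL, then insert the result into a
// time-series table. These are plain operations-API payload builders.

const insertReadings = (readings) => ({
  operation: 'insert',
  schema: 'sensors',
  table: 'readings',
  records: readings,
});

const averageQuery = {
  operation: 'sql',
  sql: 'SELECT AVG(temp) AS avg_temp FROM sensors.readings',
};

const insertAverage = (avgTemp) => ({
  operation: 'insert',
  schema: 'sensors',
  table: 'hourly_avg',
  records: [{ ts: Date.now(), avg_temp: avgTemp }],
});

// Usage against a real instance (endpoint and credentials are placeholders):
// await fetch('https://my-instance.example.com', {
//   method: 'POST',
//   headers: {
//     'Content-Type': 'application/json',
//     Authorization: 'Basic ' + Buffer.from('user:pass').toString('base64'),
//   },
//   body: JSON.stringify(insertReadings([{ id: 1, temp: 21.5 }])),
// });
```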
That system is backed by our own AWS hosted collection of Lambdas. And obviously, if you've
worked with Lambdas before,
you know that deploying them and writing them
and getting them out everywhere
is not necessarily always the easiest.
So we took a note on that and we tried to make it easier.
We think we accomplished it.
We do have, within our studio,
the ability to spin up a HarperDB Cloud instance,
which is our database-as-a-service product,
which is your own EC2 node
with HarperDB running on it. We have a free tier. There's also a free tier for the locally installed
instances. And you can effectively network all of these things together, watch the data move
around a system, run your local instance on a Raspberry Pi, collect some sensor data,
watch that get replicated up.
It's really, really easy to set up a very,
you know, a comprehensive distributed computing application
in only a few minutes using HarperDB and the studio.
That's super, super cool.
So folks at home, you can,
most people have a Raspberry Pi.
We've been telling people to buy
Raspberry Pis for what, half a decade or something. So you have a Raspberry Pi, you have a computer,
you can run HarperDB on the Pi, run HarperDB on the computer. And then whenever you, you know,
insert foo into table bar, it just shows up on the computer, which is pretty cool. I mean,
there's a lot of really fun stuff you can do with that. For example, we're working on a Raspberry Pi water fountain, using the Pi to control a water pump. That's the latest project the family's been doing.
And so we could just have a database, which is just saying, when should I turn on the water
fountain or should I have it on right now? And then from our computer, we could just, you know, change, change a value in that database
and boom, the water fountain shuts off. So, so there's a whole bunch of really fun stuff you can
do with this. And then as you learn that technology and you go to a company that, you know, is moving
a lot of bits and needs to do things at the edge for all the reasons we talked about, you'll have that
experience. You'll be ready to go and you'll have sort of a leg up there. Absolutely. I wrote a
thermostat program for my own house. I have old fan coil units that have either hot water or cold water going through them. And depending on that, if you want it colder, you need to know what the temperature of the water is, because you don't want it to just turn on blindly. So you need this piece of data. And I wrote it on a Raspberry Pi with a little seven
inch monitor on top of it. And it runs HarperDB that stores it. And it does a little predictive
temperature curve. It does third-party calls to the weather service. It will
turn on the cooling if there's cold water in there and it knows it's going to be hot later.
It may turn on the air conditioning a little early. So it's a super easy and simple system
that then actuates the power button on any given fan coil unit in the house, based on which window that unit is near, which direction the room faces, and is it time for it to be colder in here now?
Or can I wait until later in the afternoon?
It's a very, very simple proof of concept,
but it's one that, you know,
a commercial thermostat was never going to meet my needs.
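The control loop behind projects like the fountain or the thermostat can be sketched as: poll a row in the local database, and only touch the hardware when the desired state changes. Here `readControlRow` and `setActuator` are hypothetical stand-ins for the HTTP call to the local HarperDB instance and the GPIO write:

```javascript
// Minimal poll-and-actuate loop. The two injected functions are
// placeholders: readControlRow would query the local database
// ("should the pump/fan be on?"), and setActuator would drive the
// GPIO pin. We only actuate when the desired state changes.

function makeController(readControlRow, setActuator) {
  let lastState = null;
  return async function tick() {
    const row = await readControlRow();   // e.g. { on: true }
    if (row.on !== lastState) {
      await setActuator(row.on);          // touch hardware only on change
      lastState = row.on;
    }
    return lastState;
  };
}

// Usage: poll every few seconds.
// const tick = makeController(readControlRow, setActuator);
// setInterval(() => tick().catch(console.error), 5000);
```

Keeping the desired state in the database is what makes the "change a value from my laptop, the fountain shuts off" trick work: the Pi never needs to be reachable directly, it just watches its local replica.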
Yeah.
Wow, that's super cool.
And yeah, and this way they can all talk to each other
and they can all be aware of each other.
So if one of them is going, you know, all out, the other ones know that, okay, maybe
the temperature is going to drop and they can kind of bootstrap off of each other.
Exactly.
Yeah.
Very cool.
I'm just trying to save $2.
I just want to save $2.
Yeah.
That's the engineer thing, right?
It's like, well, I could purchase this product for $99.
I could purchase a Bugsnag subscription for $10 a month, but I think I'll write my own and spend three years on it. No, we'll write it from scratch under the guise of, someday I'll productize this and I'll get all that money back.
Yeah, that's right. Never works. Cool. Jaxon, it was so awesome having you on the show. I learned a ton. I know Patrick and I have learned a ton about edge computing from you, and I really appreciate it. Folks at home have learned a bunch. If you want to reach out to HarperDB, they're on Twitter; we'll post a link. Reach out on social media and show off what you've built.
I think they'd love to see that.
And I'll also post the site and everything else.
Thank you so much for coming on the show.
I really appreciate it.
You're welcome.
I had a great time.
Cool.
And for everyone out there, thanks for subscribing to us on Patreon and checking out Audible
on our behalf.
We really appreciate that.
And we will catch everyone in a couple of weeks.
See you later. Music by Eric Barnwell.
Programming Throwdown is distributed under a Creative Commons Attribution Share Alike 2.0 license.
You're free to share, copy, distribute, transmit the work, to remix, adapt the work,
but you must provide an attribution to Patrick and I and share alike in kind.