Coding Blocks - Search Driven Apps
Episode Date: June 11, 2018. We're talking databases, indexes, search engines, and why they're basically microwaves in this episode, while Joe wears a polo, Allen's quick brown fox jumps over whatever, and Michael gives out fake URLs.
Transcript
Discussion (0)
You're listening to Coding Blocks, episode 83.
Subscribe to us and leave us a review on iTunes, Stitcher, and more using your favorite podcast
app.
And you can go to the website, codingblocks.net, where you can find show notes and examples
and discussion and other stuff.
Send your feedback, questions, and rants to comments at codingblocks.net, follow us on
Twitter at Coding Blocks, or head to www.codingblocks.net and find all our social links there at the
top of the page.
With that, I'm Alan Underwood.
I'm Joe Zeck.
And I'm Michael Outlaw.
All right.
And for today's topic, we're going to be talking about search engines and how they offer highly scalable solutions and make certain types of problems very easy to solve.
How they solve these problems and examples of applications that may use this approach.
Yeah.
And first up, a little bit of news.
As always, got to say a big thank you for all the reviews we got.
Reading from iTunes, we got the all ID aller.
Bonergs, Matthias was already taken, and Galus.
And on Stitcher, we have Vavis, Java Runs 100 Billion Devices, Soul Survivor, and 21.
Dude, I love the Java Runs 100 Billion Devices.
Nice stab back at us.
I love it.
That's why we've changed our name to CodingBjava.
That's right. That's right. The site will be updated soon.
So real quick, this one has absolutely nothing to do with coding, but if you're ever in the metro Atlanta area and you make your way on out towards Carrollton, there's a place out there called Historic Banning Mills. If you ever want to take your family out and have some fun, they have zip lining, and not just what you probably think of as zip lining. They have the Guinness World Records' longest zip line. So it's a ton of fun. We had some killer, great guides, Tyler, Skyler, and Peyton. Told them I'd give them a shout-out on the next show. So there you go. A lot of fun.
Everybody should go check it out.
Have a blast.
The next thing, also, our buddy Sean from our Slack chat with our other buddy, James from the Cynical Developer, just did a show, Episode 79.
We'll have a link to that.
And they gave us a huge shout out.
So here's one back to them. You know, James has been killing it with the podcast.
And Sean, as always, is just one of our super awesome Slack participants.
So, you know, if you haven't joined us already in Slack, go do it because there really are just awesome people in there.
Yeah, and Sean did a great job.
He really sounded like a pro.
I had like one of those NPR driveway moments where I'm sitting in the parking lot of, like, Destination XL, listening to the show, and I didn't want to get out.
Very nice.
Yeah.
The end of it got pretty funny too.
So I highly recommend giving that one a listen.
Yeah.
Good stuff.
Uh, and I want to mention again, if you're in Orlando, June 21st, I'm going to be speaking at the Backend Devs meetup, talking about, hey, the same topic as tonight, kind of, except a slightly different slant. And we'll have some code samples and some other stuff to show you some apps. So, June 21st, Orlando, Backend Devs meetup.
And as part of that, actually kind of grew out of that a little bit, been working on
a little app, all open source with people in the Slack.
Things have been really great, been collaborating there.
And if you are into programming podcasts, I think you might be
if you're listening to this, I want to encourage you to check
out QIT. It's a
working title, I know: qit.cloud.
And you can go
on there and what you can do is you can search for
topics like say GraphQL and
it'll return all the podcasts that I
know about in our search engine that
refer to GraphQL. So you can
queue them up and play them right there and actually focus in on a
topic rather than the show.
So just kind of a slightly different slant on listening to podcasts.
And I think you might be interested, and I'm very interested for you to listen to it, so you can give us some feedback on how you like it and make sure that it works like you think it should.
And we'd love to hear that feedback and we'd love for you to consider
collaborating too.
So come on in.
Yeah. On the consider collaborating thing. Like seriously, there've been a lot of people that ask like, Hey, how do I get started in programming or how do I
do this or whatever? You know, there's some people leading the charge with a few different like UIs
and things on there. Come over there, right? Learn how to fork the repo, learn how to put in a pull
request and all that kind of stuff. This is a perfect opportunity to play with something that, you know,
you'll be interacting with just some awesome people from the community.
Yeah. And it was so weird. Like, I never know whether to say people's real names or Slack names or whatever. But anyway, a huge thank you to Nicholas, abysmal person, Arlene, Mads.
So I mixed it up, some people's real names, some people's Slack names, and you just have to go to the QIT channel in the Coding Blocks Slack and hop in there and you'll figure it out.
Awesome. I want to add, though... oh, an apology here. I think we
overlooked one of the reviews. So I want to make sure that this person is included.
VR6 Apparatus.
So I'm really hoping that that's a
Volkswagen reference because when I think VR6
I definitely think Volkswagen.
Maybe you do too.
How did we miss that? Was that Stitcher?
No, it was iTunes review.
So thank you for
taking the time.
Thank you very much. Sorry about that, Ron.
Yep.
And so on to the topic.
We're talking about search engine powered apps, which is something that we've all had some experience and various kind of depths with that.
And we're all kind of in love with them.
So I thought it'd be really cool to talk about on the show.
And first, I want to talk a little bit about kind of what's the problem that they solve. Like, why does this thing even exist? And I want to kind of point out that search is really a core tenet of modern computing and usability. Like, you know, Google's kind of the obvious example there, but you can't even make a phone call nowadays without searching your phone, right? You go to Contacts or whatever on your phone, you start typing "da," and it's like, oh, Dad, there we go, hit that.
Well, can you remember the days when you just knew the phone numbers of the people that you interacted with regularly? Like, you didn't even have to write them down.
Yep. Now it's like, I don't even... you don't even know your own number, let alone the people that you have to call.
Yeah, how crazy. I can't use the phone. I can't listen to music. I can't buy underwear. I can't watch TV without searching.
Yep.
It's true.
That's just crazy to me.
Yeah, man.
And so, you know, I can tell you right now that there are not SQL databases behind all of those things.
And we'll get into the reasons why.
Yep.
And, you know, for some things, like, you can get away with that. Like, maybe that Contacts app there, where maybe, max, you have 500. Like, you can probably get away with a couple simple LIKEs there. Maybe a SQLite database. Like, no big deal. You know, that's probably fine.
But when it comes down to it, people aren't really very good at searching. Like, even if you know exactly what you're searching for, it's still really easy to typo or just get something wrong. And it's not even typos; sometimes you just don't know how to spell the thing, and Google has spoiled us so bad.
Yeah, I mean, you've seen that "Did you mean...?" and you're like, yes, of course I did. I mean, I can't be the only one that uses Google as my spell checker.
Sometimes? All the time, right? Wait, how do I spell this word? Let me go try it in Google and see what it comes back with. Oh yeah, that's what I meant.
Anyway, in this episode, I have a feeling
that we're going to keep comparing kind of different
use cases to SQL databases, and it doesn't mean
that we don't like SQL databases at all.
I think we all love, especially Alan really loves, the SQL databases. It's just something that kind of contrasts against search. We're going to be talking about use cases that don't really jibe with it very well, like the typo thing. Like, how do you say, hey, give me, you know, select all the records that are spelled kind of like this?
Well, there's actually a feature in SQL Server, just so you know. It's called SOUNDEX.
But is it really?
Yeah, no, it really is.
So that's the way this entire episode is going to go. Alan's going to be like, well, you know that there's a feature for that.
Yeah, sorry.
And if there is, let us know in the comments.
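For the curious, SOUNDEX (the SQL Server feature Alan mentions, also available in MySQL and others) encodes a word as a letter plus three digits so that similar-sounding names collide. Here's a sketch of the classic American Soundex algorithm in Python; actual database implementations may differ in edge cases:

```python
def soundex(name: str) -> str:
    """Classic American Soundex: first letter plus three digits."""
    mapping = {}
    for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")]:
        for letter in letters:
            mapping[letter] = digit

    name = name.upper()
    encoded = name[0]
    prev = mapping.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":
            continue  # H and W are skipped and don't break a run of equal codes
        code = mapping.get(ch, "")  # vowels map to "" and reset the run
        if code and code != prev:
            encoded += code
        prev = code
    return (encoded + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163 -- they "sound alike"
print(soundex("Smith"), soundex("Smyth"))    # S530 S530
```

In SQL Server you'd write `WHERE SOUNDEX(last_name) = SOUNDEX('Smith')` to get the same effect.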
Yeah, Alan.
In the comments.
I've got this
inside line to you guys.
So my comments are going to be
inlined.
So people aren't very good at searching.
You've got to write in for your chance to win just like everybody else.
It's hard to get away with a simple SQL LIKE clause, where you put the little percent signs in there to kind of do the wildcard or star-type searches.
Or if you want to do a regular expression.
Oh, yeah.
Ooh.
I hate that word.
I mean, what if there's just too much data?
Like, obviously, Google is a prime example there.
Amazon, too.
Like, I think I've got even some stats on how many products...
Yeah, Amazon sells over 500 million products.
Like, there's probably not a tblProduct in their database, because you start thinking about all the products that are out of stock, or not for sale, or that they sold five years ago. That's stuff they need to run reports on. And I pretty much guarantee you that things are a little bit more sophisticated in how they store that data.
Yeah.
That was, like, super interesting to look at. The Amazon stats especially, I found, were interesting. Like, I never would have guessed. They had the top 10 categories that they were selling things in, and the top one was clothing, shoes, and jewelry. Like, it was by far and away the number one category. And I'm like, this is a bookstore.
Right.
Like, the thing that they were known for, that they built upon, and it's, like, only one-third of their top category.
Yep.
And now AWS is the real moneymaker anyways.
Yeah.
So 500 million products for sale.
And by the way, we'll have links to the articles where we got these numbers from.
I think the article is from like 2015.
And I'm pretty sure it's only gotten bigger.
'18. The Amazon one was from 2018.
Okay, well, some of these are from 2015.
God, you guys. You've got to leave a comment at the end of the show.
Oh, dang it. You're like, I'm wearing the polo, so you guys don't, you know, challenge my...
Oh, good call. Good call. All right, we shall silence.
Man. Google serves over 40,000 searches per second.
That's crazy.
Man.
That's just nuts.
I don't know what all that counts, but there's so many.
If you go to a lot of websites, a lot of the time it'll be Google powering the search for their website. It's just too much to even think about.
This was interesting: Splunk indexes hundreds of terabytes per day. And that's kind of unfair, because who knows if that's all in one place or distributed, someone's on premises, someone's in the cloud, who knows. But just the notion that there's this much data flying around per day, even just dealing with the telemetry and event-type data that Splunk traditionally deals with, is pretty nutso to me.
I'm actually shocked it's not in the petabytes, to be completely honest.
I would have thought it would have been, but yeah, still a lot of data.
Yeah, and we got 5 billion videos watched on YouTube every day, and that stat is particularly old. Just think about all those people either searched or got a video suggested or something along those lines in order to kind of find those things in order to watch.
Because there's so many videos on YouTube that you're constantly finding needles and haystacks, and it works surprisingly well.
I just remembered I had some interesting data I wanted to talk about related to big data.
Dang it.
We'll find it.
I will.
You guys go on.
I'll find it.
We will circle back.
All right.
In the meantime, could you imagine writing a query for something like that?
I know I searched for camera on Amazon, and I counted, I think, 23 different kinds of aggregations and buckets.
So, like stars, like four stars and up or three stars and up or price ranges.
That's what I mean with bucketing like $100 to $200.
There are different categories you could drill into.
You could pick colors and sensor sizes,
all sorts of different stuff in order to just kind of filter it on cameras.
And that was just totally random.
And you could imagine writing a SQL query, you know, kind of real time, based on a kind of traditional normalized database, what that would look like, especially for those one-to-manys, like, say, color or categories or, you know, stars, stuff like that. Like, it gets pretty freaking nuts.
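For a sense of what those buckets involve, here's a toy sketch (product data invented) of computing facet counts the way a search engine typically does, in a single pass over the matching documents, rather than one SQL GROUP BY per facet:

```python
from collections import Counter, defaultdict

# Hypothetical search results for the query "camera".
products = [
    {"brand": "Canon", "stars": 4, "price": 549},
    {"brand": "Nikon", "stars": 5, "price": 899},
    {"brand": "Canon", "stars": 3, "price": 129},
    {"brand": "Sony",  "stars": 4, "price": 1299},
]

def facet_counts(items, fields):
    """Count how many items fall into each bucket, for every facet at once."""
    facets = defaultdict(Counter)
    for item in items:
        for field in fields:
            facets[field][item[field]] += 1
    return facets

facets = facet_counts(products, ["brand", "stars"])
print(facets["brand"])     # Counter({'Canon': 2, 'Nikon': 1, 'Sony': 1})
print(facets["stars"][4])  # 2 products with 4 stars
```

Real engines like Elasticsearch do this against an inverted index, but the shape of the work, every facet counted in one sweep, is the same idea.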
Okay. Imagine this for your, your SQL server instance.
I found, I found what I was looking for.
Let's start with a small one, right? Just give me a number real quick, off the top of your head. To put it to scale, how many petabytes do you think the NSA collects per day?
Petabytes per day?
Per day.
The NSA. Oh, they're the ones that are sitting on the... that's right, never mind. Petabytes? I'm going to say 200 petabytes.
Well, you're going to be disappointed then. 29 petabytes. And that's on the small end of this spectrum.
That's still a lot.
Yeah, I mean, I was just being ridiculous. Yeah.
how do you store all that like you've got to have people like throwing in hard drives as fast as they can day and night just to keep up with that, right?
Now you know why hard drives are still so expensive.
Yeah.
They must have ordered them by the shipping container.
Seriously.
How about I got Google, Facebook, Twitter.
Let's go Facebook.
All right.
I've got two numbers, collects and stores, so I'm not sure what the difference would be there. This is all from The Impostor's Handbook, which has a section called "What is Big Data?" And it goes over some stats on data collection from the big companies, all from data that came from followthedata.com.
Okay, so check this out.
Oh, shoot.
I forgot to say this part, too, is important.
This is data from 2014.
Oh, so this is a while back.
A while back, four years ago.
I think the difference between collects and stores is when you start looking at big data in practice, typically you aggregate anything that's within a windowed timeframe of five seconds or something like that, whatever.
Collects.
Let's go with collects.
2014, man.
That's actually a long time ago in terms of the amount of data.
I'm going to say 500 gigs a day.
500 gig?
No, gig.
No, terabytes.
Sorry, terabytes.
I was – terabytes a day.
500 terabytes.
All right.
Joe, I don't know if you wanted to weigh in.
As you might have figured out by me referring to shipping containers of hard drives, I'm terrible with these kinds of sizes.
So I'm going to say 20.
20.
Megagigabytes.
Megagigabytes, son.
I'm going to burn through these quick.
Facebook collects 600 petabytes per day.
Per day? Per day?
Per day.
That's more than NSA.
Right.
Right.
I told you NSA was going to be on the small end of the spectrum here.
They store 300 petabytes.
So they store half as much as they take in.
So you're right.
They're probably doing some aggregation.
Now, here's where it gets weird, though.
Because I like your aggregation theory.
I'm with you there. I know where
you're going with that, so it makes sense.
And it sounds logical, but Google
is about to blow the doors off that theory
because...
And actually, so will the NSA.
Google collects 100
petabytes per day,
but they store
15,000 petabytes
or 15 exabytes.
I don't understand how they... yeah.
Well... oh wait, you know what? They don't say per day on that one. I'm sorry. You're right.
Ooh. And actually, that's a good catch too, because the Facebook one doesn't say if it's per day either.
Yeah. I mean, that's a lot.
I wonder what that means, then. If that was the total at the time, in 2014, when this was taken?
Probably, yeah.
Well, okay.
Now, you know, Twitter, this was back in the old days before we had the wide lanes, you know, so we didn't have 280 characters. We only had 140, and it was 100 petabytes per day. So Twitter was on the same scale as Google in terms of data collection in 2014.
Now, your job as a SQL Server administrator
is to build out a server instance that can hold that.
Like, that's your task.
Just to finish out the numbers here real quick, when it went to the data storage, since I mentioned the NSA, it was 10,000 petabytes for the NSA. That was their storage. So, 10,000 petabytes.
How long would that take to zip? That's what I do with my gigabytes.
Man, that's ridiculous.
Yeah, like, that would put Backblaze and all these guys out of business, with their unlimited backups for 50 bucks.
Honestly, though we joke about the number of hard drives and stuff, that's where I feel like we're eventually going to run into a problem. It's just the electricity to run all this stuff. I mean, seriously, from what I understand, a lot of storage places are typically put in colder regions of the world for cooling purposes. I may be wrong on that, but I thought I'd heard that at some point.
I've heard that Google has some. Not, I wouldn't say a lot, but I've heard that they had some in there. But then I've kind of wondered, like, hey, do we need to be worried about that from, like, an impact-on-the-globe perspective? Right. Think about the heating and cooling.
I mean, it's got to be water or, like, melting ice to cool our servers. It's got to be crazy.
Got to be crazy.
This was, um... now, that hundred petabytes per day Google was collecting, that was the number from 2014. So six years before, in 2008, they were processing 20 petabytes per day through an average of 100,000 MapReduce jobs spread across their cluster, which on average ran across approximately 400 machines in 2007.
Well, I'm mining 0.2 bitcoins on my wine cooler here with cheese dust.
Well, the reason why I bring that quote back up, and again, this is all from the Impostor's Handbook,
and he's quoting somebody too, but it was just that this isn't happening across a single machine, right?
Right. Can't.
This isn't happening across two machines. Like, so when we talk about traditional databases,
you know, or database servers, you know, the old way of doing it is like you, you know,
you'd build up one beefy machine and you might have a second one just like it as a failover. But when you start scaling to data of this size, then that type of solution falls apart.
Yeah.
Even sharding databases, which, if you've heard of it or not...
It's a clean show, man.
Yeah, sharding. Some Jack Black references there. No, sharding, just for those who haven't heard of it: basically, that's when you decide that you are yourself going to manage how data is partitioned across multiple servers. Right. So let's say that you have, I don't know, 10,000 employees, and this is a really small number. What you might do is say, okay, well, I have four database servers, and I'm going to put 2,500 on one, 2,500 on two, etc. Right? So that's sharding. But then that means you have an application layer that knows how to put all that stuff back together. Right. And it's a very manual thing. As a matter of fact, I think Instagram, that's how they ended up scaling initially with Postgres: they started sharding their database and doing it at the application layer. So anyways, moving on.
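The manual routing described above can be sketched in a few lines. This is a simplified illustration (server names invented) using a hash of the key, so every record lands on a deterministic shard:

```python
import hashlib

# Hypothetical pool of database servers.
SHARDS = ["db-server-1", "db-server-2", "db-server-3", "db-server-4"]

def shard_for(employee_id: str) -> str:
    """Route a key to a shard deterministically via a stable hash."""
    # A stable hash (not Python's built-in hash(), which varies per process).
    digest = hashlib.sha256(employee_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The application layer must use the same routing for reads and writes,
# and must fan out and merge results for queries that span shards.
print(shard_for("employee:12345"))
```

One catch this sketch illustrates: adding or removing a shard changes `len(SHARDS)` and reshuffles almost every key, which is why real systems reach for consistent hashing or directory-based schemes.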
Yeah, the point is, like most databases, just the origin story there, is that they weren't really designed for horizontal scaling. That's kind of a new, cloudy thing that's come about in the last, I don't know, 10 or 20 years, and databases traditionally have not scaled horizontally well. Like Alan mentioned, Postgres has got some stuff for it, MariaDB and, you know, Cassandra. The databases are working on it, but ultimately most databases that you think of that are relational are designed around this notion of ACID compliance. And ACID is an acronym. You've probably seen it before if you've done much with databases. It stands for atomicity, consistency, isolation, durability. And that's it.
I was waiting to see where you were going to go with that one.
Yeah.
Was I going to say about the ad one?
Is there like a sentence you guys could give to like kind of sum that up?
It just ensures transactions, right? Like, that, in a nutshell. If you say that, hey, I'm going to save an order, it will make sure either the entire order is saved or it'll back it out, right?
Yeah, all or nothing, in a nutshell. That's it. I mean, there's a lot of complexity behind it, but that's typically what your ACID is supposed to be.
Well, that's the atomicity, the all-or-nothing part. Consistency ensures that the transaction will bring the database from one valid state to another valid state.
Isolation ensures that the concurrent execution of the transactions results in a state that would be obtained if the transactions were executed sequentially. And durability ensures that once a transaction has been committed,
it will remain so even in the event of a power loss or a crash or errors.
And it's really important because this is really core to what databases,
traditional relational databases, how they work.
Like you write something, say like a payment confirmation,
like when you read that back out, like you better get that payment confirmation.
And the relational databases jump through all sorts of hoops, like writing to logs first and doing this and that in case the power goes out or whatever, to make sure that things
are always what you expect them to be. And, you know, deadlocking is something you might have
heard of where the database will kind of lock certain data sets while it's writing and updating
data just to guarantee that consistency. And so they're constantly at war with being ACID compliant.
And that's part of the reason why it's so hard to make them scale horizontally.
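The all-or-nothing behavior just described is easy to see with any relational database. Here's a minimal sketch using SQLite from Python's standard library (table and data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL)")

# Atomicity: if any statement in the transaction fails, the whole thing rolls back.
try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("INSERT INTO orders (id, item) VALUES (1, 'camera')")
        conn.execute("INSERT INTO orders (id, item) VALUES (1, 'lens')")  # duplicate PK fails
except sqlite3.IntegrityError:
    pass

# Neither row was saved: the failed insert backed out the first one too.
count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)  # 0
```

The same guarantee at distributed scale is exactly the hard part the hosts are getting at.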
And NoSQL, the notion of not only SQL, whatever you want to call it these days,
non-relational databases, is that they kind of chucked that out the window and said,
you know what, what if we gave a little bit?
What if we went for, what's it called, BASE instead? So basically available, soft state, eventual consistency.
We said, you know what, it's okay maybe in some cases for us to write something
and if it takes a second or two for it to show up in the database,
maybe that's okay for like 90% of data or maybe some percentage of our data.
And that's, go ahead.
No, I was going to say the interesting thing there is the eventual consistency,
which you said in that base part of it,
because that's where the horizontal scaling comes into play when you have,
and it's not just NoSQL, it's basically scalable databases,
ones that will scale horizontally,
because the eventual consistency is if you update, you know,
Michael's name on server one and server 100 out
here on the edge, it's eventually going to get that update, which means that if you query these
databases, you know, one through 50 might be returning, you know, Michael's old name, whereas,
you know, 51 through 100 are getting his new name. But eventually, as that stuff goes throughout all the nodes,
then it comes up to date.
And that's why relational databases don't typically function that way
because they need to guarantee if you save this,
the next time you query it, you're going to get the new value, right?
Like period, that's it.
So there's a difference in how you have to use the things.
Am I the only one that's curious to know what you were going to rename me?
No, man.
I don't know why I went with the name
thing. You were just in front of me.
That was the easiest thing to say.
One example I like here
is on Reddit. Sometimes I'll post something to Reddit
and I'll call one of these guys and be like,
Hey Outlaw, how many downvotes have you seen?
He's like, 11.
Oh, I got 13 over here.
Maybe things are getting better. Maybe
getting worse. I don't know. The number is different.
And, you know, over time, eventually things will catch up. The caches are invalidated, and things will kind of get in sync over time. But for a few minutes there, depending on when you write or when you read... like, most applications that deal with this are really good about making sure that you read back from the machine that you wrote to. But you can imagine, with a Reddit somewhere, where they've got, like, a distributed system,
you click that like button and it's got maybe, I don't know, 100 nodes containing its data.
And it's going to write to some percentage of those nodes, right?
Because we can't assume that all the data is replicated on each of those nodes.
So maybe it's only on 10 of them.
Then when you request that data, you might get any one of those 10 and it may not have that data yet.
So it's not consistent yet, but it'll get there.
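To make that Reddit scenario concrete, here's a toy simulation (all names and numbers invented) of a write landing on a few replicas while reads may still hit a stale one:

```python
import random

random.seed(7)  # deterministic for the example

# Ten replicas, all starting with the old vote count.
replicas = [{"upvotes": 11} for _ in range(10)]

def write(nodes, key, value, replication_factor=3):
    """Apply the write to only a subset of replicas (the rest catch up later)."""
    for node in random.sample(nodes, replication_factor):
        node[key] = value

def read(nodes, key):
    """A read may land on any replica, stale or fresh."""
    return random.choice(nodes)[key]

write(replicas, "upvotes", 13)
values = {read(replicas, "upvotes") for _ in range(50)}
print(values)  # almost certainly both 11 and 13 until the replicas sync

# Eventual consistency: once replication finishes, every read agrees.
for node in replicas:
    node["upvotes"] = 13
print(read(replicas, "upvotes"))  # 13
```

The final loop stands in for the anti-entropy or replication process a real store runs in the background; the window between the write and that sync is the "for a few minutes there" the hosts describe.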
Gosh darn it, I wasn't going to go too deep. But if you guys are familiar with the notion of the CAP theorem, there was this paper that came out, the Dynamo paper, many years ago now, but this is still talked about all the time. And it's really interesting. It's not so simple in the paper itself, but the talks and stuff that people have done after the fact really hammer home the main points, where you can kind of imagine a triangle. I apologize in advance for the diagram description again.
Here's a good one.
Yeah.
You've got a triangle, and each node of the triangle is one of the letters of CAP.
So it's either consistency, availability, partition tolerance.
And I always get mixed up on availability and partition tolerance.
But the kind of main point there is that you get to kind of pick two.
You can be consistent and available, or you can be available and partition tolerant, or consistent and partition tolerant.
But you can never get all three of those in a distributed database and maybe
even relational. I should look that up.
Yeah, isn't it funny how all things seem to come down to the triangle, right? Like the time versus amount of work versus quality. Like, there's always a triangle. There's something that's pulling away from the other two resources, and that's basically what we're saying here, right?
Relational databases are heavily, heavily based on being consistent.
Immediate consistency, that's the number one thing. It may not be available, because if, say, one of your servers goes down, or you're log shipping, or you're doing something, you may just not have your data available because it can't trust itself. Or... yeah, I always get the availability and partition tolerance mixed up. So if you know a lot about the CAP theorem and you can speak to me plainly about the differences between availability and partition tolerance in such a way (here's the trick) that I can then go on to explain it easily to other people, I would love you forever. And I mean it this time, not like the whole tattoo thing. Visual Studio... just move on.
Well, you know, the CAP theorem reminds me of that thing, you know, like the joke about how you can have two. Do you want it? Your choices are cheap, fast, or good. And the CAP theorem is kind of like that: you get consistency, availability, and partition tolerance. Pick two, right? That's what it always is.
Yep.
And, um, now, this notion of shredding. Actually, this is something I picked up from Santosh. Hey, Santosh! I just saw a talk from him about NoSQL at the Backend Developers group, and he mentioned the word shredding. He talked about taking a data set, and what we typically do in, like, a relational database is taking that data and normalizing it. So if you've got one-to-many data, or you've got data that's, like, optional, you would take that record and shred it into multiple pieces and store it in multiple tables. And then when you query that data, you kind of put it all back together. So if you say, like, let me see the user record, then you might go get my name from the user table, my addresses from the address table, my emails from the emails table, and order history from the order history tables, and get all that stuff, and you may reconstruct my record from that.
But every time you get any user's data, you're doing all that work to bring that stuff back together.
But real quick, just curious.
So, yeah, it sounds inefficient.
What would be the purpose of it?
Why would you shred it?
Yeah.
I mean, for doing relational-type stuff, reporting, for being able to write and return data, for normalizing data so you're minimizing the amount of data that you're storing on disk.
Okay, so that's what I was getting at. Like, the whole point of the shredding. So the other, more standard name that most people would be familiar with is normalized data, when you actually have your data broken up into its constituent parts, so that you're only storing each piece once. So the shredding, that's what I was getting at, is this sounds like a new term for the fundamental kind of database-y way of looking at this, right?
Yep. And it's really nice, too, for normalization, because a lot of times, say you've got a database full of users, and you might have, like, a user type of customer, and a user type of customer service rep, or software engineer, or whatever kind of data in there. And you can store an ID that points to that, rather than storing the plain English over and over again. Like, it seems kind of crazy, on one hand, to say, well, we've got all these users and we're storing the words "customer service" or "customer" over and over and over again for each one. Why don't we just store the number 14? And if we change the name of something in one place, it just kind of works when we put this stuff back together.
Yep.
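That shredding/normalization looks something like this minimal SQLite sketch (schema and values invented): the user type is stored once and referenced by ID, and a join "un-shreds" the record on the way back out:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_types (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT,
                        type_id INTEGER REFERENCES user_types(id));
    INSERT INTO user_types VALUES (14, 'customer service');
    INSERT INTO users VALUES (1, 'Michael Outlaw', 14);
""")

# Reads put the shredded pieces back together with a join.
row = conn.execute("""
    SELECT users.name, user_types.name
    FROM users JOIN user_types ON users.type_id = user_types.id
""").fetchone()
print(row)  # ('Michael Outlaw', 'customer service')

# Renaming the type in one place updates every user that references it.
conn.execute("UPDATE user_types SET name = 'support rep' WHERE id = 14")
```

A document store would instead keep the type name inline on every user record: cheaper to read back, but the rename now touches every copy.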
And that's kind of the opposite approach that NoSQL applications have taken. And there's all sorts of flavors; we could do a whole, we should do a whole episode on NoSQL. And there's a lot of leeway and gray areas, and different things do it different ways, and there's various shades of gray in there.
But we should probably say too,
though,
that that whole normalization came from, you know, its roots weren't from a performance perspective, but from a storage perspective.
Data consistency, right?
Well, not consistency, but just to reduce storage, to reduce redundant data storage. See, I always thought that might be true too,
but I always thought it was,
you remember hearing back in the day,
like I have dirty data or whatever.
I can't even think of the name of it.
Was it consistency?
It was something else.
Data integrity.
Integrity, data integrity.
I thought that was,
so storage might've been a thing back in the day,
but I thought integrity was always the more central purpose of the relational
database.
Because if,
if I have a Michael Outlaw record and, you know,
somebody changed his last name in another record,
like now all of a sudden you've got dirty data.
And so it's hard to reconcile.
No, no, relational databases are fine for that.
And that's why you have the ACID compliance for the data integrity.
I'm saying normalization, like the different normalization forms.
That was all about, the roots of that were not about performance.
It was all about...
Smaller bytes.
Saving storage for the data by reducing redundant data down to be like,
you know, okay, all of these people, uh, you take Amazon, Amazon could have multiple customers that
live at the same house, right? So you could have, you know, each user, each customer record could
have their address on the record, or you could just have a pointer for two users that point to
the same address,
right? Right. So that's where it started from, you know, way back when. That was the original,
you know, desire for it. But it's not, you know, like Joe said, it's not very efficient in terms
of performance. And that's why you often will denormalize things ahead of time in SQL worlds.
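To sketch the idea being discussed here in code (Python with SQLite; the table and column names are invented for illustration), normalizing a shared address out of the customer records means two customers in the same house point at one address row, so a correction made once shows up for both:

```python
import sqlite3

# In-memory database; "customers" and "addresses" are hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE addresses (
        id INTEGER PRIMARY KEY,
        street TEXT,
        city TEXT
    );
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        name TEXT,
        address_id INTEGER REFERENCES addresses(id)
    );
""")

# Two customers share one address row instead of each repeating
# the street and city as plain text on their own record.
conn.execute("INSERT INTO addresses VALUES (1, '123 Main St', 'Atlanta')")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, 'Alan', 1), (2, 'Michael', 1)])

# Fix the street name once; every customer pointing at it "just works".
conn.execute("UPDATE addresses SET street = '123 Main Street' WHERE id = 1")

rows = conn.execute("""
    SELECT c.name, a.street
    FROM customers c JOIN addresses a ON a.id = c.address_id
""").fetchall()
print(rows)  # both customers see the corrected street
```

That join at the end is the cost side of the trade: to reassemble the "shredded" record you pay for a lookup at read time, which is exactly why denormalizing ahead of time comes up.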
Well, I think that's where this next statement that Joe's getting ready to say, probably
with all kinds of love.
Yeah, I'm waiting for it.
I was actually just Googling database normalization.
I was going to read the definition of it if you guys wanted.
Sure.
Spit it out. Or rather, what the kind of stated
goals of
normalization beyond
first normal form was basically to free the collection
of relations from undesirable insertion,
update, and deletion dependencies.
So if you're updating one
small thing, you don't have to kind of lock the whole deal.
So like if we're changing the price of a product or something,
we don't have to go back through and
change every order or something like that. To reduce the need of restructuring the
collection of relations as new types of data are introduced. So if you change, maybe split a
category or something like that, you don't have to go and do a ton of work to every single record
that touches it. To make the relational model more informative to users and to make collection
of relations neutral to query statistics. So you can take statistics and use those for informing the optimizers
independently. So maybe categories, that's the one where you'd want to treat those a little bit differently than users, which are... I don't know, our users? Good call. We could do a whole episode on normalization, and then I would actually be prepared for that. Sorry about that.
So, so wait, wait. You've got something up here that I know that you have to read, out of all of us.
Oh, yeah. So yeah, I'm constantly whining about this. Writing dynamic ANSI SQL is the pits. It stinks.
So you could throw away the word ANSI there. Writing dynamic SQL, period, is the pits, right?
Yeah, there's lots of reasons for that. Like we mentioned the locking, we mentioned its kind of ACID ancestry, you know, and it means all sorts of weird things. Like it means it's difficult to, uh, order by things dynamically. Um,
paging is kind of weird and not really built in and,
you know,
updating and like,
say like functions in SQL server,
you just can't do it because of the way things are kind of optimized and kind
of built in order to maintain this,
this ACID compliance.
It's really funky and it's not how you would expect it to work as like a
kind of conceptual model.
And what about ordering by just a passed-in column?
How's that work out?
It's a nightmare.
You can't do it.
And it has to do with how the kind of the query optimizer needs to be able to do certain things and set stuff up and do all sorts of work in order to be performant.
And that's because the databases weren't originally designed to be, you know, stupid, fast performant.
They're not optimized for getting back tons of data like the Googles and the Amazons of the world need to do.
All right. So let's back up one step before we go into this next section.
So we've been throwing out the term NoSQL, and then we've been talking about relational databases.
Now, one thing to keep in mind, like this episode, we're basically talking about search engines,
why they're relevant, how you use them, what the difference is, right? In a nutshell, search engines, the, you know, the Elasticsearches, Lucenes, Solrs, all those,
which are all based on the same underlying technology, they are NoSQL databases, right?
In a nutshell.
So just know when we're saying NoSQL, we're not talking about MongoDBs and those other ones right now.
We're focusing specifically on search engines because it is technically a NoSQL database.
It was NoSQL before NoSQL was NoSQL.
Yeah, before it became the thing that it grew into.
NoSQL before it knew that it wasn't SQL?
It was NoSQL, correct.
So it really shares those benefits with all those other kinds of NoSQL, like the Mongos and stuff.
It's got a lot of the same benefits.
It's easy to scale horizontally.
It stores the document or the kind of the record all together in one spot.
And that's why we're going to keep kind of talking about these things.
It's going to keep coming up.
Yep.
And then here's going to be the big question.
And I love this because this is where context comes in. So Mr. Black asked me a
question the other day on Slack in regards to a comment that I made in an episode a couple back
where I was like, your SQL server or your SQL, your relational database is not your reporting
tool. It shouldn't be. And he was like, well, what do you mean by that? And this is where we're kind of going to dig into my thoughts on what that was. So DB versus search engine, why not just do it all in SQL? So for those who are
aware or aren't aware out there, SQL server specifically, and I'm sure that Oracle and
other ones have something similar. You can turn on a feature called full text search,
which gives you some of what we're about to talk about. So why not just flip
that switch, right? Why, why not just turn on full text indexing in SQL server and call it a day? Why,
why look at Elasticsearch or Lucene or Solr or any of these other ones, AWS Cloud Search or Azure
Search? So have you ever started with a query, right? And you're like select star from table
products and someone says, Oh, I want to search. So you'd be able to type in car and you should show me cars.
And so you kind of do that and you put a little like in there.
So if they type car or cars, it works.
I'm like, well, yeah, but it's getting carpets back.
So you get a little bit more creative.
Maybe you do an IN.
Maybe you try to kind of sort stuff creatively and try to figure out what they mean.
And you maybe bring in some dynamic joins because now they want to sort by price or by category or something.
And so you don't need to do those joins if you don't need the data.
So you only need to bring in certain types of filters that you're doing.
And sometimes you need to join the same table twice.
And sometimes that's a duplicate, right?
So it's unnecessary.
So you don't want to join if you've already got it.
And sometimes you need to join a second time because maybe it's like a manager and user type of relationship where you've got the same data for two different meaning things in the same table.
So now you do need different aliases for these same tables.
Joe's just gone down a rabbit hole, a very deep one.
He's seeing this in his head.
He's drawing out this query.
That's right.
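The start of that rabbit hole is easy to sketch (SQLite here, with a made-up products table): the naive substring search that handles "car" and "cars" is the same one that drags carpets into the results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT)")
conn.executemany("INSERT INTO products VALUES (?)",
                 [("car",), ("cars",), ("carpet",), ("camera",)])

# Naive substring search: matches 'car' and 'cars'... and 'carpet'.
hits = [r[0] for r in conn.execute(
    "SELECT name FROM products WHERE name LIKE '%car%'")]
print(hits)
```

A search engine sidesteps this by tokenizing and stemming at index time, so "car" matches "cars" without matching "carpet".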
So let's take one step back real quick.
I'll give you a very simple scenario where a search engine versus something like SQL is extremely effective.
So we've all seen these global search boxes at the top of applications.
Think Amazon.
Think whatever, right?
You start typing something.
Let's say that you type God. All right. You go into Amazon, you type God, the Bible might come
up in category books, right? You might also get God of war in video games category. It's going
to give you those suggestions because it has a search engine behind
the scenes that,
Hey,
this guy's typed in the word God.
I know what to go find right now,
right?
If you were to do the same thing in a SQL database,
maybe it's all in a product thing,
or maybe let's say that typically,
um,
you search for that.
Let's say that you have multiple things.
Let's back up and take a different example.
Cause I think that one's not going to go the right place.
Just Amazon. Say what?
Godzilla.
Godzilla. But so, Amazon aside, say you have maybe a computers table,
maybe you're taking asset inventory of your entire network, right? So you have computers,
you have software, you have users, you have all kinds of things, right? And these are probably
all separate tables because users don't map to computers. Like that's not the same type of data.
If you were going to do that search in something like a database, you basically have to union all
those, right? Select star from users where name like whatever I typed in, union all, select, you know, star from computers where name like this, union all, select, et cetera, right?
You get the point with a search engine, you can literally say,
hey, run this search against all the indices. Done. You don't have to know about the table names.
You don't have to know about, you might need to know about the field names, possibly,
but it really opens it up where you can say, go find this keyword and tell me where it exists.
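To sketch the contrast (table and index names invented; the Elasticsearch-style request is an illustrative shape, not a verbatim API call), the relational version hand-builds one UNION ALL branch per table, while the search-engine version just fans one query out across all indices:

```python
# Relational approach: one UNION ALL branch per table, written by hand.
# (String-building like this is exactly the dynamic-SQL pain from earlier.)
tables = ["users", "computers", "software"]
term = "godzilla"
sql = "\nUNION ALL\n".join(
    f"SELECT name, '{t}' AS source FROM {t} WHERE name LIKE '%{term}%'"
    for t in tables
)
print(sql)

# Search-engine approach, sketched as a request body: one query against
# every index. Each hit comes back tagged with the index it lives in,
# so the "which indexes does this belong to?" metadata is free.
es_request = {
    "path": "/_all/_search",  # search every index
    "body": {"query": {"query_string": {"query": term}}},
}
```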
Which indexes does this thing belong to? And you get that metadata back. Right. There's one really simple use case. Uh, we've already talked about it a little bit: scale. Like, they're built for it. You know, if, if SQL Server all of a sudden gets to a petabyte of data, how are you going to query that thing, right? I mean, that could be fun times.
I've been in environments where we had SQL Server, the main data file, on like a $20,000 SSD. This was back, this was a few years ago, so they were really expensive for this particular type of SSD.
But have you ever seen those SSDs where it was the PCI cards?
Yeah.
So we had the main data file on that drive, and then we would have like something else for the tempdb file,
just so that we could get the type of throughput that we
wanted out of SQL server.
But at some point you capped,
right?
Like at some point,
I mean,
I think when you're spending $20,000 on a single drive,
right?
I would call that a cap.
But I mean,
even then that's the crazy part.
Like what we were talking about earlier,
when you're processing petabytes of data at some point,
that's not even going to get it done,
right?
Like you're going to eventually go over what that thing can handle.
And that's where something like a search versus something like a SQL server,
an Oracle or whatever,
like that database got that big.
You have hundreds of millions of records.
You can't efficiently get it back out right now.
Search engines are made to do that.
You want that thing to scale?
It can scale up five servers, 10, 20, 30, whatever.
Now there is an intrinsic overhead with that because now you have to deal with eventual consistency for that data.
It has to make its way out to all those nodes, right?
So there's a cost to it, but you can do it, right?
That's one of the costs.
We'll get to see the others.
I want to take it back to the SQL real quick. You remember I mentioned searching for camera on Amazon?
I mentioned there's some nice aggregates there. I'm seeing
over 60,000 results, which is up from 40,000 when I did it a month ago.
60,000 results in the camera and photo category, so they picked it for me.
In the meantime, I've got things like average customer review,
four and up. It's giving me numbers of how many there are for those.
Price, it's got it nicely bracketed.
So it's like under 25, 25 to 50.
It goes up to 200 and above.
So you can kind of imagine if there was like a program there, it was just in charge of like camera searching.
Like you can imagine the query that you would start to write for that.
Be like, okay, well, here's the price ranges that make sense for cameras. And they would have to kind of pick that
in there. There's the sensor type, stuff like that, video capture, resolution. There's also
interesting things like what we call buckets for resolution. So you could do like 12 to 23
megabytes or megapixels or 24 to 35. There's colors, there's brands. If you were writing this query by hand,
you could absolutely do this.
You could query for all this stuff
from a relational database.
You could select like,
here's all my 60,000 products.
Let me go get all the megapixels for all of them.
And then let me, you know,
count those guys and group by the megapixels.
And you could do that as one query.
And I think I kind of mentioned
there's like 20-something aggregates here showing on the left. You'd be doing 20-something categories, uh, sorry, 20-something queries in addition to fetching those cameras. And you could do that, you know, you could write 20-something, whatever, and you could write the query to actually get those products and page through them and change the numbers when you filter. But now what if I want to search video games or cars
these custom things for each one of those. This has got to be highly dynamic because Amazon sells
500 million products. So what you're talking about are the attributes that show up on the
left side, right? If you go to an Amazon or a Best Buy or something like that, when you click into a category, it gives you all those attributes and the ability to check off the ones you want, right?
Yep. Ratings, price range, you know, size of sensor, every one of those, you'd have to do a separate query that gave you the group-by with the counts to get that stuff, right? Now, let's take that
into the real world of that stuff's all probably coming off multiple different tables, right?
Some sort of attributes on these products or whatever. This is what SQL Server or any relational database server is not great at.
It's not a reporting system because as you start joining these things
and grouping these things and all that, that's a lot of work, right?
That's a lot of indexing, and it's very specific indexing
to be able to sort by particular columns. Because when you start looking at
indexing database systems, you typically have to index ascending or descending on the columns that
you're trying to sort by. And so now you've got very specific needs. Yeah. And if I tell you that
I want a gray camera or a red camera with 11 to 23 megapixels, that means that all those 20 something
queries need to take into account those other filters that I've got on these other categories.
So it's like you need to go and query all the products first
and then kind of cross-reference that with those particular queries.
Now, we're programmers, so I think one of the things,
you might build that camera the first time,
and when they come back and say,
hey, now it's motor scooters, we need to do the same thing.
So yeah, I'm not rewriting all those queries
for all these little categories.
What I'm going to do is I'm going to write
some custom code here that's going to,
you know, take in a table name
and take in a this and that,
and it's going to kind of generate those queries
that I need to do.
But as we kind of mentioned,
that's where I started going down the rabbit hole
of like multiple aliases and then dynamic joins
and then next thing you know,
you're doing temp tables and CTEs
and pivoting and querying across databases,
and stuff just gets kind of wacky, and you're going to be spending a lot of time with that sort of stuff,
which will be fine.
Dangerously, like you might do just string concatenation to build your SQL.
Oh, for sure.
That's probably all you can do at that point, right, for the most part.
So that problem that he just talked about,
that's how you would kind of go about solving it in a somewhat generic way in a database. Whereas in a search engine, you have the ability to say, give me the aggregations and aggregate on this field, right?
And so you can literally do your search and tell it, hey, this is the field I want this search to match if you want to go that deep.
And then you say, give me aggregations based off category or aggregations based off color.
And it will actually count that stuff up and give it back to you.
Right.
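What that looks like in practice, sketched as an Elasticsearch-style aggregation request (the field names here are made up for the example): one request returns the matching products and the per-bucket counts together.

```python
# One request body: the search, the page of products, and the facet counts.
# Field names ("category", "color") are hypothetical for this sketch.
request_body = {
    "query": {"match": {"name": "camera"}},
    "size": 50,  # the page of products to show
    "aggs": {
        "by_category": {"terms": {"field": "category"}},
        "by_color":    {"terms": {"field": "color"}},
    },
}

# The engine answers with hits plus per-bucket counts, roughly:
# {"hits": {...}, "aggregations": {"by_color": {"buckets":
#     [{"key": "black", "doc_count": 812}, ...]}}}
print(request_body)
```

Compare that to the 20-something hand-written GROUP BY queries from the camera example: the counting happens engine-side, against the already-filtered result set.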
And I think it's worth talking about how this data is stored a little bit because we talked
about in SQL server, you're going to have to join the different tables to get these
things and filter on them and bring that stuff back.
Right.
When this data is stored in a search engine application,
it's all stored in a single document, right? So if you have a camera, all the attributes of that
camera, you're going to store on the camera. It's not going to be in a separate record somewhere
that it has to go look up. It's all going to be on there. So if it has a color, if it has a sensor
size, if it has all that stuff, it's in that one document. And that is why that search engine is going to be fast about being
able to go back and get that data out of it. And I know you specifically, Outlaw, worked with
facets at one point with search engines, which I think is similar to what they're calling
aggregations now in Elasticsearch. You want to give like a little synopsis on that?
I mean, yeah, I've worked with a few search engines that do it. It's, you know, I've heard
it called different things too. I've heard it like way back when called guided navigation,
where the idea was like by giving you those counts or those aggregations, it was trying to like guide
you down like, oh, this is what I meant.
This is, these are the things that I wanted to see. Right. Like, um, you know,
so, uh, yeah, I mean, I, I think you pretty much summed it up pretty well.
You know, those,
those facets are basically going to be the attributes that are common among
whatever your search result is. So if you search for camera,
like going to expand on Joe's example,
you search for camera.
And then he mentioned,
you know,
there'd be a price range facet.
There could be,
you know,
based off the megapixels,
that would be another facet and it would have it,
you know,
whatever the attribute for the megapixel was and then the number of them.
So, yeah.
I mean, it's pretty awesome stuff. And here's another one. So, thinking about that in SQL is sort of mind bending and you can already start to see how that's going to kind of suck, right?
Like, I mean, just being completely frank, you're going to have all these things that build up and
it's probably going to all end up being a bunch of dynamic stuff that is going to be hard to follow and hard to maintain, right? Then the next thing, think about this, like, um, and this only came up because somebody, Ryan, our buddy, mentioned it today or yesterday: if you need a range aggregation. So for instance, you search by cameras, but then you say, hey, I want these things kind of dropped
into buckets. And I want ones that were introduced in the last 30 days, the ones that were introduced
in the last 60, you know, 30 to 60 days. And then the ones that were introduced in 60 or more days
ago. Right. So that way I can only find the newest ones or, or ones that might be going on discount
because they're being phased out or something, right? Like who knows, but you can do this thing with a search engine where you can drop them sort of into buckets.
You can tell it, hey, give me everything that was zero to 30 days that just showed up, everything
that was 30 to 60, whatever.
It's really easy to do.
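That bucketed-by-age idea maps directly onto a range aggregation; sketched in Elasticsearch's style, with an invented days_since_introduced field:

```python
# Drop products into 0-30, 30-60, and 60+ day buckets in one request.
request_body = {
    "query": {"match": {"name": "camera"}},
    "aggs": {
        "age_buckets": {
            "range": {
                "field": "days_since_introduced",  # hypothetical field
                "ranges": [
                    {"to": 30},              # just showed up
                    {"from": 30, "to": 60},  # 30 to 60 days old
                    {"from": 60},            # older; maybe being phased out
                ],
            }
        }
    },
}
print(request_body)
```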
Well, think about it.
Think about it this way.
Let's flip the script here a little bit.
Just go, let's build on Joe's camera example.
Go to Amazon.com, search for camera.
And now you tell me how you would write the query, a SQL query that would return back
that left rail, that left navigation, where it has all the categories with the ranges
and the common options.
You know, Amazon might not be the greatest example in that case
because they don't tell you the counts, right?
They used to, yeah.
Newegg does.
But, yeah, so fine.
Let's go to Newegg then and type in – does Newegg still do it?
No, I don't guess they don't do it either.
They don't do the counts either.
The point was trying to find one where you would write the SQL query where you could
return back that left rail, including the aggregate
counts. So check it out. If you go to Newegg, type in camera, and then
on the very left, the top one,
click on DSLR camera, then you'll see that on the left rail, you've got manufacturer,
then you got Canon 999 plus, Nikon, you know, 851 plus, then you have package types, useful links,
price, condition, seller, customer insights, image sensor, etc. There's like six or seven more.
Those are all independent SQL queries is what we're saying.
So, well, no, but I'm saying it's like,
don't write them as independent SQL queries.
The challenge is the task that you're being asked,
write a single query, right?
Write a single query that returns back all the data
that you need for that left-hand side,
plus all the camera results that are being shown, you know, in the main, the main content area. Right. And that's where the search engines,
you know, whether, you know, you go back from the old days of calling it guided navigation to where
fast forward to calling it faceted search, uh, to now we're calling it aggregations,
um, you know, whatever you choose to call it, like that's where the power of, of using a search engine is going to come in because
I would love to see the SQL statement. If you could write such a thing and I don't care,
server or whatever, I, I, I don't think that can be done without it being a gargantuan effort to even
server or whatever, I, I, I don't think that can be done without it being a gargantuan effort to even
try it.
Even if you could.
And then the performance.
Yes.
Even if you could, how long is it going to take for that to come back?
Right.
With the counts, with the aggregate counts.
In a search engine, this page loads in milliseconds.
It's nothing.
And when was the last time you saw Amazon, you shopped for a shirt or something and you
click blue on the left and it says
no results found.
Right.
You click black and no results.
Never.
It's because it's always,
it's getting rid of those.
It's only returning things that have data.
Yep.
Yeah.
So it,
all right,
so now let's go to another one.
And,
and it's easy to pick on SQL server for this one.
Uh,
because the paging that's in there is not the greatest in the world.
But let's say for...
Better than it used to be.
It is way better than it used to be, which means it used to be non-existent.
Well, it was non-existent before.
It's marginally better.
But let's think about you just have a grid of data, right?
And you do a search.
And let's stick with SQL Server because it's the devil we know. When you get back a number of results, there's two things that have to happen. You have to know
the total number of results because you need to know how to set up your paging in the first place,
right? Or do you have 10 pages? Do you have a hundred? How many are you showing at a time?
You need to know how much data you would get back potentially.
And then you need to be able to get back the slice of data that you were looking for.
So that right there is two separate queries, right?
There's really not much of a way around it.
In a search engine, it's built in, right?
You literally say search.
All right, here's the total number of records we found for that.
And we gave you 50 because that's what you wanted back.
Done.
It's hard. Ask anybody who's ever written any of the paging queries in SQL Server or any other database.
Now, granted, MySQL allows you to do it a little bit easier, and there are some that do.
They're still not greatly performant because it still has to scan the indexes,
order everything the way that you wanted it to,
to be able to get you back that slice of data, right?
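The two-round-trips point can be sketched like this (SQLite standing in for the relational side; the results table is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO results (name) VALUES (?)",
                 [(f"item {i}",) for i in range(137)])

# Query 1: total count, needed just to draw the pager.
total = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]

# Query 2: the actual slice for page 2, at 50 per page.
page_size, page = 50, 2
rows = conn.execute(
    "SELECT name FROM results ORDER BY id LIMIT ? OFFSET ?",
    (page_size, (page - 1) * page_size)).fetchall()

print(total, len(rows))  # 137 50
# A search engine returns the equivalent of both in one response, roughly:
# {"hits": {"total": 137, "hits": [ ...the 50 documents... ]}}
```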
And you imagine, too, like every time Amazon adds a new category or something,
if someone had to go and add a bunch of tables and then index those and then worm those into the queries, that would be a real nightmare,
especially when they come along and say, now video cameras, now cell phone camera accessories.
And there's so much crossover with those things, but they're also very specific.
So trying to write like one big master query to power Amazon is just nuts.
But what we were kind of going with with that kind of like dynamic approach where you say the table names
and you kind of configure the metadata so you know how to write that query dynamically
rather than doing it by hand for all categories.
What you're kind of describing there is like a kind of interface to a search engine where you say, this is what I want.
Now you go figure out how to grab it.
Yep.
And I think we pretty much covered everything up here.
So I think we definitely hit on why SQL is not so great for huge amounts of data and for a lot of search use cases.
And we talked about how search engines are kind of like specialized tools for this sort of thing.
And we're going to take a quick break.
But afterwards, we're going to tell you why and how search engines are so good at this.
Yep.
All right.
So with that, we do it every episode. We ask if you haven't had a chance
already and you've thought about it, you've been, you know, driving in your car, listening to us for
months or whatever, and you know, you want to do something for us, you know, please go leave us a
review. We really do read them all. And we've gotten a lot of great reviews lately of people
saying that, you know, I changed career paths and, you know, I went from doing X and now
I'm trying to be a developer and you guys have helped us, man. Like it's seriously, I know it
does for these guys too, but I mean, that is awesome to hear that we're truly helping people,
you know, live the life that they want to live and really, you know, get better at what they're
doing and gain the confidence they need moving forward. So, you know, if you get a chance, please do go up to codingblocks.net slash review and,
you know, leave one with whatever your, you know, your preferred choice is.
And with that, it's time for my favorite portion of the show.
Survey says, all right. So last episode we asked, do you regularly evaluate your weaknesses in an effort to strengthen them?
And your choices are, oh, my God, daily.
My personal favorite.
Or I try to pick up a new skill or get better at an existing one every few months.
Or, yeah, but realistically, probably only once or twice a year.
Or, I learn what I want to learn when I want to learn it.
Or, no, that's why I listen to you guys.
Also one of my favorites.
And lastly, why?
I already know everything I need to know.
All right, let's go Joe first.
I think the winner with 33% is going to be one hour for my commute.
What?
That's us.
Remember, we had a bug with the poll this time.
So we apologize if you went to go vote and you saw some weird answers.
We had an issue with a plug in there.
Darn you, WordPress.
So, my real answer then is OMG Daily with 30%. Really?
Nah.
Yeah.
Nah, no way.
I think that I'm going to go with yeah,
but realistically probably only once or twice
a year. And I'll
go with 30% on that.
Okay. Yeah,
but realistically probably once or twice a year,
30% and oh my
God, daily 30%.
Right? I got those numbers right? That is correct.
You both lose.
Really?
Yeah.
All right.
By both Price is Right rules and just being wrong.
Welcome to Loserville.
Yeah.
The top answer was by 38% of the vote, I try to pick up a new skill or get better at an
existing one every few months.
Okay.
All right.
And I got to say, I'm super impressed with our audience because I thought for sure, especially
after I double dog dared them to pick, why?
I already know everything I need to learn.
There wasn't any.
Nobody picked it.
Awesome.
Oh, that's killer.
And that said, everybody will probably go pick it.
Where did we fall?
No, that's why I left it.
Okay, well,
you barely had Joe
beat then. Okay.
That was the third
best answer. Third answer.
Yeah. I learn what I want to learn
when I want to learn it. That was the next.
Yeah.
Comical.
But that was that was the next choice.
And then.
Yeah.
And then.
Oh, my God.
Daily.
And then lastly, that's why they listen to us.
Well, I guess lastly would be why I already know everything.
Yeah.
So with that said, this episode survey is,
now that you've had some time to digest the news,
how do you feel about Microsoft's acquisition of GitHub?
And your choices are,
very excited, looking forward to the awesome things
Microsoft will add to GitHub.
Or, I'm concerned, but not enough to do anything about it yet.
Or, I don't care at all.
Should I?
Or, oh my God, the sky is falling.
Why?
How could we let this happen?
And lastly, already packed up my code and moved to GitLab.
Yeah, this one should be fun.
This episode is sponsored by Datadog.
Datadog is a software as a service monitoring platform
that provides developer and operations teams
with a unified view
of their infrastructure, apps, and logs. Thousands of organizations rely on Datadog to collect,
visualize, and alert on out-of-the-box and custom metrics to gain full-stack observability
with a unified view of all their infrastructure, apps, and logs at cloud scale.
Yeah, and they've got 200 plus turnkey integrations,
including AWS, PostgreSQL, Kubernetes, Slack, and Java.
Check out the full list of integrations
at datadoghq.com slash product slash integrations.
Datadog's key features include real-time visibility
from built-in and customizable dashboards,
algorithmic alerts like anomaly
detection, outlier detection, forecasting alerts, end-to-end request tracing to visualize app
performance, and real-time collaboration. And Datadog is offering listeners a free 14-day
trial with no credit card required. And as an added bonus for signing up and creating a dashboard,
they will actually send you a Datadog t-shirt.
So head to www.datadoghq.com slash codingblocks to sign up today.
All right, so we're back.
Now we're going to talk to you about why you need a search engine.
Stop using SQL Server to do everything.
Yeah.
I think we kind of talked about a lot of the problems,
but now we're actually going to tell you
how the search engines actually solve that problem
and why they're even a thing, right?
And so we talked about the problems with SQL Server,
and so we've kind of imagined that search engines don't exist.
We could take a NoSQL
implementation and just denormalize the heck out of it. So if you've got a sentence, break it out
into words. If you've got a product, break out every little piece of that data. And I say break
out, it kind of implies that I'm shredding, but I'm not. We still want to store this document
whole, but take every little piece of metadata out about it and throw it into a big hash table.
So if I search, if I go to this hash table and I go to blue, I'm going to have an array
of every product that relates to the word blue. And so if you tell me blue camera, I go to blue,
I get that big old array. I go to camera, I get that big old array. And then I join those guys
together, get the intersection and and return you the data.
We do that via MapReduce.
Maybe we can spread those things out, horizontally scalable, so we've got different things on different nodes.
We have a big MapReduce job that kind of runs that across multiple jobs in parallel.
So we're doing the blue search and the camera search at the same time, and then we throw them all together with that map function.
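What Joe's describing, a big hash table mapping each token to the products that contain it, with a multi-word query answered by intersecting those sets, can be sketched in a few lines. This is a toy illustration, not any real engine's internals; the tokens and product IDs are invented.

```python
# Toy inverted lookup: token -> set of product IDs (all values invented).
index = {
    "blue":   {101, 102, 305},   # every product that mentions "blue"
    "camera": {102, 305, 410},   # every product that mentions "camera"
    "tripod": {410},
}

def search(query):
    # Look up each word's posting set, then take the intersection.
    postings = [index.get(word, set()) for word in query.lower().split()]
    if not postings:
        return set()
    result = postings[0]
    for p in postings[1:]:
        result &= p
    return result

print(sorted(search("blue camera")))  # → [102, 305]
```

In a real engine each of those posting-set lookups could run on a different node in parallel, which is the MapReduce angle Joe mentions next.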
And then we just described a really good solution for our problem. But that's basically just a really basic search engine, you know, kind of like what we had with Lucene and some of these other solutions back in like 1999 or 2003 or something. And things have gotten a lot better since then.
Is Lucene really that old?
Yeah, I don't know.
Okay, I wasn't sure if you were actually stating fact or just making a joke.
I'm going to assume it was a joke.
Maybe move on.
I was using Solr a long time ago.
1999.
Oh, really?
Initial release, 1999.
I didn't realize it was that long ago.
Yeah, man.
Been a while.
And everything's based off of it just about,
it seems like, right?
Was the initial implementation based on ActiveX?
No, I'm just kidding.
You know, I heard a funny story about it.
I think it was Lucene.
It might have been Elasticsearch.
Where the story was that a guy wanted to write an app for his wife to help organize her recipes.
And so he started kind of building it.
Started working on it, kept thinking of better things.
He wanted to scale it up to a billion users.
And three years later,
he had Elasticsearch,
but never did finish that recipe app.
That's awesome.
Yeah, I thought you guys would appreciate that story.
Were you pointing any fingers
at the people that you thought would really appreciate the story?
I did say a billion, so there was a hint there.
Oh, you know, I mean, it happens.
Yep.
Got to do big.
Oh, yeah. And I forgot to mention, if we throw some sort of declarative language on top of things to let that drive our MapReduce, then we've described a search engine.
And so I want to talk a little bit about that notion there of taking that hash table and kind of storing those keys like that. And I'm going to call those tokens. So like when I said
blue or camera, you can imagine prices in there or the star rating or whatever. And actually,
I'm going to defer to Outlaw here because there's a lot to know about indexes and he knows a lot
more than I do. Well, I wouldn't say I know a lot more, but there was this conversation, as we were putting these show notes together between episodes, Joe had said something about,
I think it was a reverse index or an inverted index. And it made me want to go back and just
put some information there because I thought, well, we're talking about indexes this whole time, but we never described any of these, right? And so I thought, well, let's spend a minute just to talk about indexes, you know, real quick. So
if you hear the term a forward index or an inverted index, I think the best way to describe
either of those is to consider both of them at the same time using a book analogy.
So the table of contents at the beginning of the book would be the forward index.
It's telling you where to go into that book for a particular document or what we would call a chapter.
Versus the index at the back of the book would be an inverted index.
It's telling you where to go for all the uses of the relevant words that you want to see.
Right. So you see the difference there is that, you know, the inverted index at the back is very
specific about, you know, the use of something, versus the forward index is jumping into a chapter or a
document. Yeah. And an example I like here is if you think about the book of Joe, if you were to
flip open the table of contents, you might find a chapter there titled Joe's Favorite Metallica
Albums. Obviously, Justice, number one. And that might be the only entry for that. Now you flip at the back and you look up Metallica in the index,
you're going to see a surprising number of pages, like 3, 17, 78, 114 through 123.
You're going to see a lot more references there, and that's because the table of contents is generally shorter and it's
organized kind of around like the topic,
but the index is supposed to point to every reference to that thing
so that you can go and find everything about that subject.
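The book analogy maps onto code pretty directly. Here's a toy sketch, with the chapter titles and IDs invented for illustration: the forward index goes from document to content, and the inverted index goes from term back to every document that mentions it.

```python
# Invented "book of Joe" chapters for illustration.
chapters = {
    1: "joe's favorite metallica albums",
    2: "board games joe likes",
    3: "metallica concerts joe attended",
}

# Forward index: the "table of contents" — doc id -> document.
forward_index = chapters

# Inverted index: the "index at the back of the book" — word -> doc ids.
inverted_index = {}
for doc_id, text in chapters.items():
    for word in text.split():
        inverted_index.setdefault(word, set()).add(doc_id)

print(forward_index[1])                     # seek straight to one chapter
print(sorted(inverted_index["metallica"]))  # every place the word appears
```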
And then Mike's book would tell you why Joe's choice was wrong.
Clearly it would be Master.
No.
All right, and so then, so I mentioned forward and inverted index,
but then there was this conversation
of reverse index, which was really like part of what sparked the whole thing.
You know, curiosity was like in me about like, well, wait a minute, you know, reverse index,
forward index, inverted index.
So if we were to consider that a row in a database is the quote document that we want, right? Then in a typical database
index, the indexes that we would create on that database would be a forward index. So in other
words, if I want a row with ID 123, we can seek immediately to it. Just like in our book analogy, if I want from the book of Joe, favorite Metallica albums,
I know exactly which page to turn to. It's the one where Master of Puppets was listed first.
And then, or to put this another way, the typical database index acts like the table of contents for the document, i.e. the row, that we want, right? When you do this, though, depending on how your indexes are created, for very busy databases and/or very busy indexes, your performance can suffer from what's called index block contention.
So especially for those tables where you have what's, now this is a trippy word,
but a monotonically increasing sequence. So if you have a table with like, say,
a numerical based primary key, like an integer based key, and each time you put a new row in, it increments the value by one, right?
So it's always increasing.
Those types of structures can suffer from the index block contention.
So let's consider...
They change from right to left, which is...
Is that the deal?
Well, hold on.
We haven't gotten there.
So if we consider that our table has this integer key that sequentially increases, so we have keys like 123, 124, 125, etc.
So then let's say that our index is created such that it's going to put 100 keys in each block, right?
So our first block will be 0 to 99. Our second
block would be 100 to 199. Our third block would be 200 to 299, right? If we have 100 write requests
come in at the same time, then what's going to happen is those write requests are going to queue
up and it's quite possible they're going to go to the same block. You know, best case scenario, they're going to go to two of those index blocks, but worst case, they're going to
go to just one, right? So the reverse index is trying to solve that problem. And so what it does is, for key 123, it would reverse it. So it would go in as 321, into our 300 block, and key 124 would go into another block, which would be our 400 block, as 421, right? So we're just taking the key and flipping it, we're reversing it. This way, our reads and writes for that index are distributed across the various index blocks.
So in the example that I gave with 100 concurrent write requests, right, then in that scenario, they'd be spread across 10 different index blocks.
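A quick sketch of that reverse-key trick, using the same 100-keys-per-block numbers from the example. The helper names are ours; this is just to show why sequential keys pile into one block while their reversed forms spread out.

```python
# Block size of 100 matches the example: block 0 holds keys 0-99, etc.
def block_for(key, block_size=100):
    return key // block_size

def reversed_key(key):
    # 123 -> 321, 124 -> 421, and so on.
    return int(str(key)[::-1])

# 100 sequential inserts, keys 100..199:
normal_blocks   = {block_for(k) for k in range(100, 200)}
reversed_blocks = {block_for(reversed_key(k)) for k in range(100, 200)}

print(len(normal_blocks))    # sequential keys all land in 1 block
print(len(reversed_blocks))  # reversed keys spread across 10 blocks
```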
Yep.
Okay. Yep. Okay, so the idea there is it's going to kind of spread things out,
so it's going to be able to do things in a quicker manner because it's kind of playing a number game there.
So yeah, so this is basically like trying to distribute your load
within the same server there.
I think my takeaway from that section is that I should not try to build
my own kind of search engine because I'd be doing the 1999 version.
And apparently the world has moved on
and come up with some pretty nifty ideas
for doing things in a smart way
that's going to also be really fast.
And I know this kind of came up
because I was getting really confused
about reverse indexes and inverted indexes.
And if you do some Googling,
you'll see people talking about search engines
in reference to both.
And they'll get a little loosey-goosey with referring to things.
And I think that a lot of people, myself included, like initially mixed up reverse and inverted index because, you know, reverse and inverted are pretty similar words, right?
So we're going to be careful to talk about inverted indexes when we're talking about search engines.
Yeah, I mean, as best as I could find, when we talk about search engines,
we're talking about inverted indexes, not reverse indexes.
Everything that I could find talked about...
When we talk about reverse indexes,
we're talking about how we would index
typically a relational database like an Oracle
or a SQL Server or something like that.
So I would be curious if there is something out there about like, you know, search engines using
a reverse index. Maybe it's a thing. I don't know.
I know they do really smart things in order to kind of optimize just that sort of thing. And
even how they kind of split the data because you don't want to like keep pounding the same
node if you've got 100 nodes, but all your reads are going, you know, to nodes one through three because of some quirk of how you have the data organized.
And that's not going to be efficient.
You've got 97% unutilized.
So they're really smart under the hood.
I wanted to give you a quick example here of an inverted index.
And I'm just going to do an English sentence here.
You can imagine this could apply for anything, though, like numbers, products, categories, whatever.
I'm going to take the sentence, the quick brown fox jumps over the lazy dog. And it's got every letter in the alphabet.
So you've probably heard that sentence before. But what we're going to do is we're going to go
ahead and take that entire sentence, quick brown fox jumps over the lazy dog. We're going to throw
that over in a NoSQL database or some sort of maybe a flat file, who cares. We're going to
store a reference to it. And if we're going to throw away the words that don't really matter,
because we're smart.
So we're going to get rid of the, because people searching the,
you know, they're not getting any value out of that really for English anyway.
So we're going to toss those two words there.
And then we're going to break that sentence down into tokens.
So we'll end up with like quick brown fox jumps over lazy dog.
And we'll throw those guys into like a hash table like structure.
You know, it's probably going to be a lot more complex and really smart underneath the covers.
But you can think of it like a kind of like a normal hash table or a map in JavaScript or something like that.
And then each token contains a record or an array of a pointer back to that record so that we can reconstruct or sorry, we do not reconstruct the sentence.
We look up that whole sentence because we never broke it apart.
And then if the user goes along and then searches for lazy fox, then we should find a sentence because we're going to have two tokens that contain a pointer back to our sentence.
So in that case, we'd probably have in just a real kind of simplified version,
we're going to give it a score of two,
because we see that there's two tokens that point back to the sentence.
So now if you've got sentences in there, like, you know,
lazy dog or lazy river or other things, like we might find some of those,
but they're going to only score one.
They're going to be sorted lower than our sentence because that had two references to the same thing.
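That tokenize-and-score walk-through might look something like this as a simplified sketch. The stopword list and sentences come straight from the example; a real engine's scoring is far more sophisticated than counting matched tokens.

```python
# Throw away words that don't really matter for search.
STOPWORDS = {"the", "a", "an"}

# Documents stored whole; we only index references back to them.
sentences = {
    1: "the quick brown fox jumps over the lazy dog",
    2: "the lazy river",
}

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

# Build the inverted index: token -> ids of sentences containing it.
index = {}
for sid, text in sentences.items():
    for token in tokenize(text):
        index.setdefault(token, set()).add(sid)

def score(query):
    # Score each sentence by how many query tokens point back at it.
    scores = {}
    for token in tokenize(query):
        for sid in index.get(token, set()):
            scores[sid] = scores.get(sid, 0) + 1
    return sorted(scores.items(), key=lambda kv: -kv[1])  # best first

print(score("lazy fox"))  # → [(1, 2), (2, 1)]
```

Searching for "lazy fox" scores sentence 1 a two (both tokens match) and sentence 2 a one, so the fox sentence sorts first, exactly as described.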
And then you can imagine things get really complicated, and there's all sorts of really cool plug-ins
and ways for being smarter about how to do stuff.
And even the words themselves get really crazy with synonyms.
So maybe, I don't know, brown and beige would be synonyms in our database because we want people to search for beige to find brown.
Or gray with an E and gray with an A.
Yeah, or fox and foxes, right?
A lot of times they'll actually trim those kind of pluralizing English rules down in order to make those search engines do better. And some of the examples, you know, I mentioned working on a little app here where you can listen to podcasts by topic. And so I put in some synonyms for things like PWA and
progressive web apps, because people commonly refer to those either way. And so if I want someone
who types progressive web apps to find podcasts that only talked about or referenced PWA.
And another big one is .NET.
How do you spell .NET?
D-O-T-N-E-T, right?
Yeah.
I can't predict what the user's going to do.
Right.
You're doing it wrong.
Right.
So what I did is I want people to be able to find it either way.
So I put a synonym there that basically said .NET, dot-net, however you spell it, either way, same thing. JS and JavaScript is another example. Those are all examples of
synonyms. But you can also have antonyms where you say like, hey, you know, Fox, F-O-X and F-A-U-X,
or maybe Fox and Foxes would be a better example. You say, listen, those are not the same. And you
can imagine too, if you've got like a real fuzzy search going on, you know, kind of like Google will do with the "did you mean", then, you know, there are definitely technical examples where things can mean very different things. Like the Qt framework is like a web framework for, or not web framework, a UI framework for doing like Linux-y UI apps. And then, you know, Tcl is another one, or QIT is the app that I'm working on. So these things all have similar names, but we don't want them getting confused.
So we might have like an antonym set up that says like these are not similar, even though they look similar.
Java and JavaScript.
Yep, Java and JavaScript.
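One hedged sketch of the synonym idea being described: normalize every variant spelling to a single canonical token, applied at both index time and query time, so either spelling finds the same documents. The synonym table here is made up for illustration; real engines configure this declaratively.

```python
# Invented synonym table: variant -> canonical token.
SYNONYMS = {
    "dotnet": ".net",
    ".net": ".net",
    "js": "javascript",
    "javascript": "javascript",
    "pwa": "progressive web apps",
    "progressive web apps": "progressive web apps",
}

def canonical(term):
    # Unknown terms pass through lowercased; known variants collapse.
    return SYNONYMS.get(term.lower(), term.lower())

print(canonical("dotnet"))  # → ".net"
print(canonical("JS"))      # → "javascript"
```

The antonym case is the mirror image: an explicit rule saying two look-alike terms (Java and JavaScript, Qt and QIT) must never collapse into the same token.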
And you can also configure a lot of engines to be really smart about things like special pluralization cases or language-specific things.
We've been focusing on English, but that's not your only option.
These engines have all sorts of plug-ins and abilities to support other languages.
You can do your own custom stuff even around scoring.
Say, for example, if I wanted to score coding blocks higher,
any time it shows up in the search, I could probably do some tweaking stuff
in the search engine I'm using
in order to kind of cheat, although it's open source.
You put that in there, right?
Very well.
There's an issue.
But, you know, along this thing to when you said plugins,
like you said it goes beyond just, you know, words.
There are.
There's tons of plugins.
You can have it search zip files.
You could have it index PDFs.
You could have it index, you know, Word docs, all kinds of stuff. There's all kinds of crazy stuff that these do for you behind the scenes. Like I've heard of those music identification apps that work by creating a thumbprint based on a couple seconds of the song, and it goes and compares that thumbprint to a search engine of other thumbprints. It basically uses multiple thumbprints for each song and it tries to match it up.
I wouldn't be surprised if it worked like that. I don't know if that's the case.
You should let us know if it does.
If you work on the product.
Give away all your company
secrets and just leave
us a comment.
We talked about inverted indexes, but there are some downsides too.
Like it's really slow to write because you can imagine when you take that – oh, yep.
So yes, yes and no.
It is slower to write, but they've made some major headway in the past couple years over this.
So if you look at Solr or Elasticsearch, they have near real-time indexing on documents now.
So literally, it could be, you know, depending on how big the document is, right?
Like if you're trying to shove a one-gig document in there,
that's probably going to take a little bit longer, right?
But they have come up to speed with this near real-time indexing.
Well, I know for going along the lines of the slow to write, though,
as it relates to Elastic, there was a presentation that I was watching,
and if I recall that, they were saying that the segments
that they're written in are immutable.
So if you had to change any of that,
it would require changing the whole segment.
So that's why, and I'm definitely going way back, you know,
in my uses of search engines,
you would have to recreate the index if you wanted to update it at all.
Now, I have seen, yeah, going like decades back.
Yeah, I was going to say, yeah, they've changed a lot.
It did used to be the way.
But, I mean, I have seen cases where you can, like, delta, like, hey, here, change
this one thing. But I don't know, behind the scenes, though, how that's working
there. Right. It might still be like a facade to be like, yeah, okay, fine, we're
going to change out the whole thing. But I don't know. Yeah, they might just be temporarily
switching it, right? Like, almost doing, like, a rename on something.
I don't know. I don't know how the implementation works.
Yeah, I'm taking some notes here from the presentation. I kind of want
to talk a little bit about indexes, shards,
and segments, and a couple of these
other things that I'm going to have to look
up. But one thing
that I think is a really bad use of a search
engine or something you
wouldn't really want to do here that SQL is much better at
is data that changes frequently. It's like a quantity for a product. If you've got a million widgets or
t-shirts or something, and you're constantly as people buy that multiple times a minute or per
second or per hour or whatever, and you're constantly changing that quantity, and that's
something that you want in your search engine, you're going to be opening yourself up to a world
of pain because they try to be immutable and they try to be really smart and index that sort of stuff.
So if they've got to kind of rejigger these indexes every time you change a number,
like that's going to stink. And that's something where you wouldn't necessarily want your search
engine to be your only data source. Like it's usually not. In fact, it's usually a lot of times
companies will store their products and whatever normalized in their database and they'll kind of export to the search engine.
And they'll do things like because sometimes you just need to recreate an index.
Like if you're making like column types changes or schema changes, Elastic's pretty good about this.
It's really good about it.
But some of the other ones aren't so good where you actually do need to recreate indexes. And what that usually means is like kind of creating a whole new index,
doing your thing, and then kind of swapping the pointer so that you're looking at a different one,
then you can delete the old one. Hey, go ahead. Well, I was going to say, you've seen this,
you know, going back to the quantity example, you know, if you go back to our Amazon case,
and you search for a product, you find that product, and if the inventory levels are low, you actually see they'll say like only three left in stock,
right? Because they're trying to create that sense of urgency.
And also, I want to go back to what Joe said, because the frustrating part is if you start
talking to people and you're working on a project and you want to introduce something like a search
engine because it meets your needs,
a lot of times people will be like, well, if we have that, why do we need our database, right?
They're not the same thing, right?
Like that is the hardest part to communicate.
When you start talking about this and you start talking about the benefits of, well, it's going to be faster,
you're going to get the aggregates, you're going to get, you know, blah, blah, blah, blah, blah.
They're going to be like, well, why do we even have a database? And what Joe just said is critically important. Search engines are great at retrieving
data that's been indexed, right? It's made for searchability. It's not a transactional system.
You're not going to use a search engine to say, okay, well, I sold X number of widgets,
reduce our widget count by 10. It's not how that works. It's not what it's for. You're not
guaranteed writes. Like the ACID stuff that's built into a database is there for a reason,
right? So this is more an augmenting technology for other things that you're using.
So I wanted to point that out because I've actually had many discussions
with people about that.
Well, if it does all this, why do I even need it?
I was like, no, man, that's not it.
That's not the way to look at it.
And you can think of those indexes
as being kind of ephemeral too.
Like those things don't always stick around
because you do end up having to re-index.
Like right now, I only have one index set up
for the QIT app.
So I ended up having to recreate the index.
So it was actually down for like, I don't know, 15 seconds the other day while I deleted
and then brought up a new index just because I was being cheap.
Or maybe if we were to think about this in different terms, right?
I'm kind of thinking about back to like maybe some Uncle Bob speak, right?
You know, the search engine is the search interface, and the SQL Server or, you know, whatever your SQL-based database is, and not necessarily picking on that one, but that's the storage interface, right? Yeah. And just like we mentioned, these things scale really well. You know, you imagine that big hash table analogy, you know, just think about English
really well. You know, you imagine that big hash table analogy, you know, just think about English
again. Like you can imagine if you've got 26 nodes, you could have the A's over here, the B's over there, the C's, D's, you know, that's not a good
example because there's less like Z words than there are A words, for example. So you can, you
know, see how you can be much smarter about it, but you can also imagine how you add a new node
and they all kind of reach your, those lines. And so that things are kind of split up evenly and
you can have an optimizer there. And I don't know how it works underneath the hood, but I assume that these search engines are really smart about kind of knowing how to divide things up onto nodes
so that they're being accessed equally.
And it's all optimized around search.
So it's not that inserts are that terrible.
They're not great.
And, you know, updates too, deletes not so much.
But it's all optimized for getting that result quick. And you can imagine
too, because we're storing that document whole, every document you store is at least as big as
the document. First of all, in full text, you know, not a lot of keys or no relations, like
no optimization or compression there. But also, you know, every kind of important word or term there is also stored.
So your indexes are probably close to at least as big as the data you're storing.
So you can imagine you load a 50-gigabyte SQL Server database into Elasticsearch,
and now you've got 100 gigabytes.
But that's not really true.
I'm oversimplifying because there's all sorts of compression algorithms
and 20 years of evolution on that concept.
So I've heard in particular that Elasticsearch is really good at compressing that in order to have that not be the case.
But compression and caching and all the other stuff, they've all got their own kind of, you know, overheads and concerns and stuff.
So, you know, it's complicated.
But it is true, though.
I mean, what you said is the key, right? You're denormalizing data and purposely repeating it, because everything you need for the search needs to be in the same document. Like it's not like a relational database where you have, you know, your products table,
your product attributes, your pricing table, all that. That's not how it works. You cram it all
into, you basically map reduce it into a single object. And then that way, anything that you need
out of that document at the time that you do the search is right there in that document. It doesn't
have to go look up something else, right? So that is really key. And so it can be
bigger for that very reason, because you can a lot of times repeat a ton of data because you're
not pointing to the color blue somewhere. Literally, that camera is going to have the color
blue on that one. And then the next camera, if it's blue, it's also got the color blue in that
document,
right? Yeah. Now you can think of like a SQL Server. If you want to know the number of cameras that you've got in stock, you would say something like SELECT * FROM products WHERE product_type_id = 123, or whatever points to that camera type, and you do a COUNT(*) on it. With something like an inverted index search engine, you would say, hey, type camera, and it would go and look at its index, look at the hash table, and say, hey, where's my key for type camera? Got it. What's the length of my array? Because it's all got these kind of pointers back to the array. And that's it. And that's an O(1) lookup that's wicked fast.
So that's really great. But that also
means, just keep in mind behind the scenes,
you could totally do this in SQL Server or any other database, right?
If you broke apart all the data in all your tables and you tokenized it
and you saved that information,
that's what's happening in these search engines, right?
When you save that thing, it's doing what he said earlier
with the lazy brown fox jumps over whatever.
Jumps over whatever.
Yeah, it breaks that apart, right?
And it puts it all in places.
So there's work happening when you save that document to index those things and break apart the pieces to store those arrays and all that.
So it's not a free operation,
but your reads on the other end of it are going to be crazy fast.
Yeah.
And depending on the search engine that you're using,
like Elastic's going to keep coming up
because I think they're kind of the kings right now.
They support crazy complex filters.
So you can have multiple indexes.
You can join those together.
I can't even imagine a kind of crazy situation.
But if you look at it,
there's something like 60 different types of filters.
I think there's a ton of them,
and they're all very specific, and they have their own kind of things and you can mix and match them in multiple different ways. And so it could be really
intimidating when you look at the docs to like try and figure out if you want some sort of complex
situation. A lot of it's like really math heavy. But for most use cases, if you're just doing like
a simple bookstore, you know, you can just start out really basic. And as you grow your company in Amazon, then you can kind of keep going, keep going with that. But let's also,
let's, let's add on to that. They support crazy complex filters with a declarative syntax
that is still a thousand times simpler than the SQL you'd have to write to get
those same results, right? So I think that's the key here. So for instance, let's talk Elasticsearch. There's a JSON payload. So if you want the aggregates, you have an attribute called aggs, and then you tell it the field that you want it to aggregate on, right? And then
you tell it the type of aggregation you want,
whether it's a count or an average or some sort of other thing.
That syntax is so much easier to understand and read
than what that equivalent SQL would end up being in a database.
And it's fast.
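A hedged sketch of the declarative aggregation payload being described. This mirrors Elasticsearch's "aggs" request syntax, but the field names here are made up for illustration, and a real request would be POSTed to your index's search endpoint.

```python
import json

# Illustrative Elasticsearch-style aggregation request (field names assumed).
query = {
    "size": 0,  # skip the hits themselves; we only want the aggregates
    "aggs": {
        "products_by_type": {
            "terms": {"field": "product_type"}  # bucket counts per type
        },
        "average_price": {
            "avg": {"field": "price"}           # a metric aggregation
        },
    },
}

# The entire request is just this small JSON document; compare that
# with the GROUP BY SQL it would replace.
print(json.dumps(query, indent=2))
```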
Now, there are things you can do that are crazy
that aren't necessarily beneficial to try and do.
And you can blow anything up, right?
So, for instance, one of the things I was reading in a best practices or avoid type scenarios is, let's say, for instance, that you want to take the top 10 basketball players in the NBA right now, right?
And then you also want to get their top 10 supporting basketball players.
Right. So there's probably some sort of algorithm that says, Hey,
when these two people are on the floor, it's the best combination, you know,
the top 10 combinations, that's a hundred permutations right there already.
Your 10 with their 10 supporting players, right?
Which would be really weird because there's only five players on a court at the time.
But so you see what I'm saying.
So you can actually write things.
You could basically blow up the search engine, right?
Like, give me all these combinations of things and you'll just run out of memory, right?
So there are things that you can do that are crazy, but they blow up anything that you
try to do.
Yeah.
And search engines do love memory.
Yes. They want lots engines do love memory. Yes.
They want lots and lots of memory.
And so this is not necessarily a solution for every case.
You need to consider all sorts of factors
if you're thinking about going this direction
with an existing SQL search that you're doing.
So it's definitely not something that you can just kind of drop in willy-nilly
and ship to production.
You're definitely going to want to do some testing
and really think about things and give it a shot.
It's not hard to do, though.
I built a little board game prototype that had 70,000 games,
and I don't think I saw any queries that took over 0.2 seconds or something to run.
And it had all sorts of cool aggregations,
and so it had the mechanics, so you could click on dice rolling
or card deck building or whatever and kind of click on those.
It had different ratings from BoardGameGeek and stuff.
And it just gave me all those aggregates.
So it kind of gave me that shopping type of experience that you have on Amazon.
And the way I specified that, it was stupid simple.
It was basically like rating.
I had an aggregations array.
And the first element there, single quote rating, single quote.
Second one was like mechanics.
Third one was, you know, and then I eventually did get into the like kind of custom bucketing for like times.
Because I said, you know what?
I don't want them to bucket this for me.
I don't want them to try and come up with every game time under the sun.
If this game says 27 minutes, this one says 30.
I wanted to define some buckets here.
So I think I did 0 to 30 minutes, 30 minutes to 90 minutes, and 90 minutes to 2 hours, and 2 hours up.
And I just kind of defined that.
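Those custom buckets (0 to 30, 30 to 90, 90 minutes to 2 hours, 2 hours and up) might look something like this as an Elasticsearch-style range aggregation. The field name "playing_time_minutes" is an assumption for illustration.

```python
# Illustrative range aggregation matching the buckets Joe described.
query = {
    "aggs": {
        "play_time": {
            "range": {
                "field": "playing_time_minutes",  # assumed field name
                "ranges": [
                    {"to": 30},               # 0 to 30 minutes
                    {"from": 30, "to": 90},   # 30 to 90 minutes
                    {"from": 90, "to": 120},  # 90 minutes to 2 hours
                    {"from": 120},            # 2 hours and up
                ],
            }
        }
    }
}

print(len(query["aggs"]["play_time"]["range"]["ranges"]))  # → 4
```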
But the way I said that is the structure that I passed in a simple JSON message. And so like the total message I sent over to the server to get all the
aggregates to come up with like maybe 10 different kinds of filters was maybe
like 20 lines.
And you know what JSON's like, it's all fricking brackets.
Yeah.
Right.
There's no meat to it.
You know, there's like five or seven actual English lines in there.
Yep.
You know, one thing though that I feel like we would be remiss if we didn't at least address is, let's live in a world where we're updating our index in batch, right? Like, in the environments that I've been in, you've had to create denormalized data to send to the search engine in some, whatever the format might be, right? So something had to go ahead and denormalize that data to get it into that format. And, you know, maintaining that code, you know, could be a thing. You know, that process, depending on how much data you're trying to dump in, if you are doing it in batch like that, could be an ordeal.
So, you know, we're talking about all the strengths of the search engine, but you're getting those strengths because you're taking a lot of hits up front,
right. Through that denormalization process and then giving it, handing it over to the search
engine and letting it do its indexing. Um, you know, and even, even once you have sent all the
data, you know, the actual process of the index getting built and ready for use can take some time too.
Yeah, so we're talking about the data pipeline there.
For instance, let's give the real example like Joe with the QIT app right now.
You basically have, is it a command line or is it just a call that goes and pulls stuff?
It's a process that runs, right?
Yeah, it's like a little node process.
Runs in Azure Functions.
Okay, so that's similar to what Mike's saying, right?
Like you'll have some sort of, let's say that you are doing a batch,
you'll have some sort of process that kicks off maybe hourly,
maybe every two hours it's going to run, get that data and push it over, right?
And so there is an overhead cost to that. And there's also a maintainability cost to that as well, because as new attributes get added or whatever, you need to make sure they go in there.
Um, so that's truly real. That's something you need to be aware of. And if you need something
more real time, then there are other solutions. Like one of the things that I've worked on in my,
in my personal time playing around with is like Kafka, right? So Apache Kafka allows you to push stuff. It's just a huge queue.
It's like a, um, a persistent queue for, for lack of a better term, but they also have the ability
to do streams, which means that as data enters a topic, you can have that process kick off automatically.
So if data comes in, like let's say, for instance, if you had, what's it called, change data capture, CDC, in SQL Server.
But you can actually have it to where when data changes, that automatically pushes into
Kafka, let's say.
Kafka says, hey, let me watch for changes, push that stuff in. When that data comes in,
you have it automatically process that data. So denormalize it, right? Your map reduce function
right there, denormalize that data, and then automatically push that into something like
elastic or some other indexing engine. So you can get close to real time by just having a data pipeline set up.
But again, that is additional overhead.
None of this comes for free, right?
You still got to get it from point A to point B.
And there is overhead in that.
And then there's also overhead
once it gets point B to index that stuff.
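To make that concrete, here's a hedged sketch of the denormalization step in that kind of pipeline. The change-event shape, field names, and lookup tables are all hypothetical; the point is just that foreign keys get swapped for the values users actually search on before the document is handed to the indexer:

```python
# Sketch of the map step in a change-data-capture pipeline: a raw change
# event (shape is invented) is flattened into the document that actually
# gets indexed by the search engine.
def denormalize(event: dict, lookups: dict) -> dict:
    """Join a normalized change row against reference data."""
    row = event["row"]
    return {
        "_id": row["product_id"],
        "name": row["name"],
        "price": row["price_cents"] / 100,
        # Replace foreign keys with the human-readable values
        # that people filter and search on.
        "category": lookups["categories"][row["category_id"]],
        "brand": lookups["brands"][row["brand_id"]],
    }

event = {"op": "update",
         "row": {"product_id": 7, "name": "Widget",
                 "price_cents": 1999, "category_id": 2, "brand_id": 5}}
lookups = {"categories": {2: "Tools"}, "brands": {5: "Acme"}}

doc = denormalize(event, lookups)
print(doc["category"], doc["price"])  # Tools 19.99
```

In the real-time version being described, a consumer would run this function for each event coming off the Kafka topic and bulk-push the resulting documents into Elasticsearch or whatever indexing engine sits at the end.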
Yeah, I just wanted to make the point
to make it fair that like, you know,
hey, this is not free.
Yeah, exactly.
This isn't free.
There's some work. There's some't free. There's some work.
There's some effort there.
There's some code that's going to have to be maintained to get the data
because you're probably, you know, more often than not,
you're probably going to have some kind of a relational database,
you know, behind the scenes with an index in front of it
that you're using as your search engine.
Now, I'm talking about, you know, big enterprise-y kind of apps, right, where you might do that. Um, obviously there are apps where you might not need that. Hint, hint: QIT. Um,
Well, yeah. And actually, um, I've gotten asked that a few times, like, why didn't you just do it that way? Like,
I think right now I'm indexing 3,500 podcasts.
It's like that easily could have been a SQL server or something like that.
It's really not a big deal.
I could have thrown some LIKEs in there.
There's not a ton of like categories and stuff.
I'm not building the next Amazon.
Why did I use a search engine?
And for me, the killer feature was like fuzzy searching, which it sounds like RDBMSs have anyway.
But we kind of talked about that inverted index
and so you can imagine if I misspelled
Docker, then the search engine
is going to take
D-O-K-C-E-R.
It's going to see, hey, there's no index
for this name
in this hash table. But it's going to
run some sort of algorithm there and say, well, let me look for
ones that are similar. And it's going to probably score, you know, Docker, it's going to
see the same letters or whatever, however it figures out. So, you know, this is probably what
they meant. Let me just go ahead and return those. And it's got that kind of stuff built in. I don't
want to have to write that kind of query. And it sounds like there's some other options in SQL
server I didn't know about, but it just kind of made sense to me to do that. And I like the fact
that the code, and I put this in quotes, the code that I write from the front end is a URL with like a
question mark S equals on the end of it. In my case, I'm using it as a search, which doesn't
have like the big JSON kind of stuff. And it's not nearly as flexible, but it's just stupid simple.
And I can kind of configure that stuff when I create the index. And so the app is really dumb.
It just says like, hey, misspell Docker.
And it just kind of gets stuff back and you don't have to think about it.
I'm not writing queries.
I don't even have a backend.
It's just a website with JavaScript, totally static and a search engine.
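The fuzzy behavior being described maps onto Elasticsearch's `fuzziness` option on a `match` query. A minimal sketch, assuming invented index and field names (QIT's actual configuration may differ):

```python
# A match query with "fuzziness": "AUTO" tolerates small edit-distance
# typos, so "dokcer" can still score documents containing "docker".
query = {
    "query": {
        "match": {
            "title": {
                "query": "dokcer",      # the user's misspelling
                "fuzziness": "AUTO",    # allowed edits scale with term length
            }
        }
    }
}

# The URL-style equivalent of a simple search, roughly what a bare
# front end ends up issuing (the endpoint here is illustrative):
url = "https://search.example.com/episodes/_search?q=dokcer"
print(url)
```

The JSON form is the flexible one; the query-string form is the stupid-simple one, where most of the behavior is configured once on the index rather than per request.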
Yeah, that's kind of crazy.
Hey, so two things that came to mind here.
One, we talked about things being horizontally scalable.
And when I first heard about search engines that could do that,
I was thinking, oh, man, like the whole point of like AWS
was you can run on commodity hardware, right?
Like that's how they kind of basically buy really cheap hardware
and they don't care if it fails because they've got all these abilities
to fail over and just move on to a new node or whatever. You can't think of search engines that way because Joe said earlier,
like they, they like a lot of memory, right? There is a cost to where if you have a, let's say it's
a huge search engine search index or many indexes. If you have 10 nodes set up, it's got to replicate that data
out across the nodes, or at least figure out how to distribute the data evenly across the nodes,
right? So as you get more of that stuff in, you've got that network latency that comes in.
So they actually, in their suggestions, they say fewer nodes with more RAM is typically better.
So that's one good thing to just kind of keep
in your mind. If you go to set this thing up, right? Like, don't think, Hey, I could scale to
a thousand nodes and you'll be all right. You know, a thousand nodes with a gig of RAM on each, that's not the same as, you know, 10 nodes with, uh, however much that works out to. I can't do my math right now, but it's not the same, right? Um, and then the other thing is, this reminded me,
this has actually been years ago. This was before we did the podcast. Uh, I think at the time I was
playing with WordPress and some guy had written me on, on one of the sites that I was playing with.
And he's like, Hey man, I created this WordPress plugin, uh, search engine. Because if you go into WordPress and
you search for a plugin, it doesn't really give you the ability to sort by popularity.
Think like iTunes store, right? You go into iTunes and you want to look at games. You can't say,
hey, give me the ones that are only rated four stars or anything. It won't let you do that.
And he's like, man, I wanted a better way to be able to search this stuff. Well, he did it all
in MySQL. And so he ran into the same problem that we've talked about where you don't get a total count back, which he needed. And when you start trying to sort by different columns, then indexes start coming into play because you have to index each column independently. And he couldn't get the
performance up. Like he got it to where, yeah, if he was the only guy running the site, it would
come back pretty decently fast. But as soon as he got any user load, and this is where I think a key
differentiator comes into play. You can't think about, Hey, my database returns pretty quick when
you're the only person running the query. As soon as you start getting a hundred people on there,
uh, you start getting table contention, right? You start getting that index contention that you were talking about earlier.
As users scale up, databases don't scale up properly that way, right?
A search engine does.
It's like Joe talked about a minute ago.
You know, you go look up that key, it's got the count right there on the array,
and the aggregations are pretty simple.
So you also have to think about it in that context
that it's not just the one-off searches.
It's when you start getting some load on the system
that you'll start seeing these things shine
versus where they could bring down your core database.
And speaking of search engines,
there's a couple out there
where we definitely talked about Elasticsearch a lot
because they're kind of the big dogs in the space right now.
But it's actually built
on Lucene. So each shard or each kind of node that it runs, and even that's grossly simplifying it, but it's built on Lucene,
which is a little Java library.
Solr is also built on Lucene.
They basically wrap that and add
all sorts of complexity on top of it.
There's AWS CloudSearch,
AWS Elasticsearch,
which we just found out about today, which is a hosted Elasticsearch by Amazon, and Azure Search.
Oh, you know, and one thing we didn't mention, too, is, um, with our, you know, hundred nodes or however many nodes, if you want high availability, those nodes still need to exist. So you typically have replicas, too. So if you're talking about 10 nodes and you've got two or three replicas, you're looking at 30 servers.
Yeah, you've got multiple clusters set up.
An interesting note on the Amazon Elasticsearch that we did just learn about today, they call it AWS ES.
Yeah.
An interesting thing is Elasticsearch has their own cloud platform as well.
And the big difference between that and AWS's version is you get XPAC on the Elasticsearch cloud.
So I thought that was pretty interesting.
Yeah, we didn't even talk about that.
But Lucene is totally open source.
Solr is totally open source.
It's Apache.
Elasticsearch is mostly open source.
Like, the actual Elasticsearch is completely, 100%, open source with a permissive license, but they do have some kind of, uh, upsell, kind of, um, things that do require a license, namely their X-Pack, which offers some really nice functionality. But it's also stuff you could theoretically build yourself if you wanted to. And I think that's where a lot of the niceties come in, like plugins for, um, application performance monitoring and some other stuff, where you can just kind of add a line of code, uh, or a little line of config, to your search engine, rather than, you know, rewriting and doing a whole bunch of custom work.
Yeah, one of the interesting things, too, to point out between these, because I think Joe said earlier, like, we all kind of like Elastic. One of the things that I really like about it over the others is, even though they seem to have feature parity, at least Solr and Elastic do, the declarative syntax on Elastic just seems to be prettier and easier to read and reason about.
I mean, that shouldn't dictate anybody's decision in the long run, I don't think,
because, like I said, it looks like they have feature
parity to a certain degree, but
sometimes it is nice to be
able to just look at something, oh, okay, that makes sense,
right? Yeah, and Elasticsearch, it's got a REST API, it's all built around JSON. So if you want to add a different filter or you want to add a different search, you're just kind of building this JSON object, which, you know, I've gone on to do. The other ones, you're typically working with query params.
So you're doing, you know, question marks and ampersands and URL encoding.
And you can imagine as like things get more complex,
like that starts getting really nasty and really long.
And I'm sure you've seen that on like older shopping sites.
They've kind of hidden it now,
but you used to see like question mark
and then like gobbledygook forever
because that was all the stuff
that needed to be passed on to the search engine.
Yep.
So examples of search engine powered apps.
Like we've talked a lot about these, so I'm going to kind of rush through some of them.
But there's a couple of things that we didn't really get to that I wanted to mention.
So free text type search, like things like Google, Wikipedia.
If you've got a blog like WordPress and you've got a search up in the top right corner,
I don't know.
It's probably doing free.
No?
No.
WordPress.
So if you're talking about WordPress.com, it's probably using a search engine.
If you're talking about a WordPress site, it's MySQL.
Okay.
So it's doing some sort of free text, something like that.
I don't even know if it's free text.
I think it's a like because WordPress has a simple database table behind the scenes that stores all your posts.
So I think it's just a simple LIKE. I bet you can find a plugin that'll do it, though. Probably. Good. Yeah. Yep. We mentioned aggregation filtering, like a BoardGameGeek or Amazon, Newegg, that sort of thing, where you can kind of do that guided navigation thing, where, like I was saying, hey, show me four stars and up.
Now show me this price range.
Now show me red.
Now show me whatever.
And as you go in, you see all those options.
It doesn't force you to go any one path,
but it's all kind of leading you down to one product.
And that's really awesome when you know what kind of product you're kind of looking for.
Not so great for browsing, though. Now, the third category we haven't really talked about at all, and I'm not so much familiar with these capabilities, and that's probably why I'm not talking about it too much, but it's the notion of this kind of logging and APM. And that seems
like it's come around more recently in like the last couple of years. And so this is kind of like
a new addition to the family of Elastic as far as I can tell. And they've added a lot of products
around this. Like we've mostly talked about Elastic Stack or Elasticsearch, but it's really a part of the Elastic Stack, which includes
other products like Logstash and Beats and Kibana, which is like an admin UI where you get
graphs and all sorts of cool stuff. And they've got this whole ecosystem around Elasticsearch
being the center of it. And what they've kind of tacked in now is this notion of a logger or APM or application performance monitoring.
So what that'll do is it will kind of lead you down a good path and smartly create search indexes for you that are specifically designed for, how would you say, like temporal data.
So for example, it might create an index for every day and throw like events from all your servers,
all your employees' computers, heartbeats,
CPU percentages, web activity,
things that might be happening on those computers.
And it's going to throw those all into the database
or into Elasticsearch,
but you'll be doing it through typically some sort of plugins
or some APMs that will know how to kind of divide those things up smartly
into different indexes, which is all about performance
so that you can go there and look at those things very quickly in aggregate.
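The per-day index idea can be sketched in a few lines. The `metrics-` prefix and the event shape here are made up, but the naming mirrors the common Logstash-style daily-index convention:

```python
from datetime import datetime, timezone

def index_for(event_time: datetime, prefix: str = "metrics") -> str:
    """Route an event to a per-day index, e.g. metrics-2018.06.11."""
    return f"{prefix}-{event_time:%Y.%m.%d}"

# Hypothetical heartbeat event from one of the monitored machines.
doc = {
    "host": "web-01",
    "cpu_pct": 12.5,
    "ts": datetime(2018, 6, 11, 14, 30, tzinfo=timezone.utc),
}
print(index_for(doc["ts"]))  # metrics-2018.06.11
```

Because each day's events land in their own index, old days can be dropped or archived wholesale, and queries over a time window only touch the handful of indexes that cover it, which is where the aggregate speed comes from.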
And that's where companies like Splunk come in, Airbrake, who's sponsored the show in the past, Prometheus, Datadog. These are all companies that have kind of built a lot of really good tools, or in Prometheus's case, open source.
But they're all about aggregating information.
So you can go and say, like, hey, how is my entire computer architecture doing?
And you can look at that graph and see, like, hey, these servers are under a lot of load, suspiciously, or there's some kind of nefarious activity going on over here, or here's the rate at which we're writing data. So someone
like the NSA or whatever, how do they know how much data they're writing? Well, they probably
have some sort of APM system that's telling them exactly and graphing how much data they are saving.
And so they can see like over time, like, oh crap,
we're saving too much or too little. We need to kind of change some things. And those are
kind of applications that are also typically built on search engines and especially in the
last couple of years. And so I want to make sure to call those out. Cool. Hey, there's one other
thing we don't have in the show notes. Actually, I need to put it in here because it's the next
section. So resources we like, man, I think I bring this thing up probably once every couple of months in Slack.
So anytime we're talking about performance and how things are set up like infrastructure stack,
I can't help but point people to the stack exchange performance page because, man, it's amazing.
So they have a fairly simple infrastructure, that they expose here, and it is amazing.
They get 1.3 billion page views per month.
That's insane.
They transfer 55 terabytes of data per month. They're doing
all right there. They only have nine web servers. Dude, that's not a lot for the...
That's crazy.
I mean, how much time do we spend on Stack Overflow and those, right? That's not a lot.
They typically handle 300 requests per second, peaking up to 450 requests per second.
And their CPU usage is at 5%.
It peaks at 12.
It's nothing.
Here's where things get really interesting for me.
They only have for Stack Overflow, two SQL servers.
One's live and one's a hot standby.
They only have two for all of Stack Overflow.
Now, the specs on them, a little crazy,
1.5 terabytes of RAM per server.
So that's a little bit of memory.
Well, wait, that's per server.
I think these are the specs per server, because the two that you mentioned, one is for Stack Overflow and the other one's for...
No, you're talking about two. You're including the hot standby as the second one.
Okay, I'm sorry.
And then they have another one for Stack Exchange as well.
Okay, because that's where I got confused, because there's Stack Exchange and the others, the meta sites, which have a different set of servers.
Yeah, I'm only talking about Stack Overflow
because that's the one I spend all my time on.
I'm not really usually messing with the Stack Exchange,
Career stuff.
But the site that we probably all visit a lot,
each SQL server has 1.5 terabytes of RAM, pretty decent,
and their database size is 2.8 terabytes.
That's big, right?
Like that's getting into big data.
Now here's another interesting thing.
Their CPU usage, 4%.
4%.
Their peak is 15%.
They do 528 million queries a day, it looks like.
Their peak is 11,000 queries per second,
right?
Pretty impressive stuff.
But the reason why this stuff all works is this right here.
So they have some Redis servers sitting in the middle that each have 256 gigs of RAM that are the cache for their entire thing, right? It does 3.75 billion operations per day.
Their CPU usage is 1%.
It peaks at 2%.
It's almost nothing.
But then down here, they have three Elasticsearch servers.
Three.
Each one of them has 196 gigs of RAM, and they're load balanced.
That's a lot of RAM.
That's a ton of RAM.
But think about Stack Overflow, right? Like how many, this, I actually heard, I think it was Nick, Nick Craver,
who had done a podcast interview with somebody. And he's like, if you think about what stack
overflow is, it's mostly reads, right? It's mostly people coming there to find answers.
So I forget, at the time he had thrown percentages on there in terms of how many writes there were versus reads, and it was really low. Um, so yeah, man, three of them with 196 gigs of RAM is probably doing most of their serving back to their site, right? Stack Overflow, when was the last time it was down? When was the last time it ran slow?
Like, I've never noticed it. Their Elasticsearch servers, 7% CPU usage, 34 million searches per day.
34 million searches a day.
And their index size is pretty large, 528 gigs.
So it's really cool when you think about using technologies to augment your stack where you need it can truly enable something to run.
And then down here at the very bottom, it's really cool.
All of the things in the stack, and we'll include this link in the show notes, their homepage takes 12.2 milliseconds to render, or I guess to pull back the data, and their questions page takes 18.3 milliseconds.
That is phenomenal.
Straight up amazing.
So, you know, again, it's not magic, right?
Like what we've talked about with search engines isn't magic.
But if you utilize the tools properly and you set up the pipelines properly, you can really enhance your application, the performance, the usability. And we gave several examples: a Google or Wikipedia, a shopping site, or an APM type thing, where you could really leverage a search engine and then really focus on the front end and let it do its thing.
You know, I tried to think of an example.
I tried to take this out of context and try to think of a way to describe how you would, like, quote, augment that functionality, right? Like, um, by bringing in other tools, right? And so this one example came to mind. Okay, so you have your stove, right, and you have your oven, and that's the traditional way, you know, how you would cook, right? Over a fire, you know, going way back when, until eventually we got gas or electric stoves and ovens.
And then at some point we got microwaves, right?
And the microwaves were like the faster way to do something, but you can't just give up the oven and stove.
You still need those. The microwave can augment your cooking experience, but it doesn't replace your cooking experience.
Right?
And as much as you might love that cooking on that stove or that oven, it's slower.
Right?
It's going to take more time.
So that's why you augment some of that experience with
the microwave
so that you can have your meal when you want it.
I like the analogy.
That's the title, man. Search engines. They're basically microwaves.
I love it.
That's awesome.
Is that like "ASP.NET 5? It's basically Java"?
Oh, man, that's amazing.
So, yeah, we have some other resources in here we like.
So we've obviously got the GitHub thing that Joe has started up that is pretty awesome.
There's going to be a ton of links in this one, I'll go ahead and warn you. A lot of the links, uh, when I was putting my stuff in the notes, I didn't even put them in the resources we like. I sprinkled them in next to the relevant areas as we were talking about them. So
there's going to be a ton of links in this episode. I'm warning you now, you're probably
going to want to go check it out. Yep. Awesome. And I won't go through them all. They're,
they're all highly related to this stuff. So really good stuff in there.
Yeah. And come bug me in Slack if you want to talk about it. I built four
search-based prototypes in the last
two months here, so I'm really
super hot on this topic.
Also super bored, and that's why he's...
I totally forgot.
So we recorded something before this.
We don't know when it's going to be released.
And tease it.
Yeah, we've got to tease it a little bit.
Even though you may not be able to watch it whenever you hear this.
So if you haven't and you're halfway interested in this, this goes back to people that are, you know, either new to the field or switching or maybe you just want to see a different perspective on how things are done.
So we're kind of going to start recording a series, I guess, sort of behind the scenes of an app that we're going to try and build to help us with the podcast.
And so we're going to, I guess, publicly air our thoughts and our work process or, you know, what we want to call it, just our flow.
Yep.
We're going to do our planning.
We're going to talk about the MVP.
We're going to talk about the decisions we make and, like, how we build it and, like, kind of how we try to organize around it. And we really just kind of, um, hit record and saw what came out.
But yeah. So if you're not already,
if you haven't subscribed to the YouTube channel and that sounds halfway
interesting to you,
go up there.
We're going to have a playlist that we put these things in.
So you'll be notified when these things drop.
So,
all right.
I've been doing the same thing for QIT too.
So we actually have been putting out a lot,
a lot of content.
I would say like,
we're probably averaging like four to five videos a month right now. Yep. Not, not terrible as Charles Barkley
would say. Terrible. Not terrible. All right. So with that, let's head to Alan's favorite portion
of the show. This is the tip of the week. Yeah, baby. I got two of them because I can never
remember any of them and I'm too lazy to put them in documents when I think about them.
So the first one is a C sharp seven feature, which may not sound like a huge deal, but I really like it.
So back in the day, prior to C# 7, if you wanted to check to see if something was of a type of something else, you could cast something with as. So you didn't actually have to do an explicit cast. You could say var myVar equals, you know, something as, you know, some class, right? And if it
could cast it as a class, then you would get that thing back
and it would be good. You could use it, right? If it didn't come back because it couldn't do the
cast or it didn't work for whatever reason, then it would be null. So, so basically what you get
is you do that as, and then you'd have a null check afterwards, see if you could use it, right?
Well, C# 7 has a really cool feature that is called an is pattern match.
So one thing that you might've done in the past, you might've said, hey, if myVar is SomeClass, then do something, right? So what you can do now is you can have it auto-assign the variable. So you could say, if myVar is SomeClass myNewVar, then do something. And then you can reference myNewVar inside that block of code.
And that way it automatically did the null check for you. And it will only go into that if block,
if that thing didn't come back null. So can I say this in a different way? Sure.
Because I'm getting kind of lost here,
so I want to make sure I'm following what you're saying.
Yep.
So let's go back to the old ways, the old days.
Yep.
You have some variable, i,
and you want to take some input
and you want to cast it as an int.
And so you would say i equals,
and then in parentheses, int,
and then whatever the passed-in variable was. So let's just say that it was z, right? So that's casting z to an int and storing that value to i. And that would work, except the problem is it would throw an exception if z wasn't able to be cast as an int.
And that was the advantage of the as keyword.
Right.
Because as could do a safe cast.
So you could say, you know, i equals z as int,
and then it would attempt to cast Z to an int,
and if it couldn't, then it would just assign the value null.
Correct.
Right?
So then your next line, you'd be saying like,
hey, if I not equal null, do something or whatever.
Right.
Then there's this is keyword that was introduced
where you could say, if Z is int,
then inside of your if statement,
you might say i equals, and then in parentheses, int, z, right?
Do the casting back the old way, right?
Right.
Now what you can do is you could just say if z is int i,
then in the rest of your if block,
you now can use i as the int that z was cast into right there.
Yep.
Right?
So basically you didn't reduce a ton of code,
but you got rid of a bunch of null checking garbage.
You increased the readability.
And you increased the readability, in my opinion.
And I love that.
Anytime that you increase the readability and it's not ambiguous, I love it.
So, yes, that's exactly what it does. And then I've got a link in
there directly to this, how you can use it because it's called the is keyword pattern matching.
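A minimal sketch of the before-and-after (class, method, and variable names are invented; note also that as only works on reference and nullable types, which is part of why the is pattern is handy for value types like int):

```csharp
using System;

class IsPatternDemo
{
    static void Describe(object z)
    {
        // Old style: cast with "as" (reference/nullable types only),
        // then null-check before using the result.
        // string t = z as string;
        // if (t != null) { Console.WriteLine(t.Length); }

        // C# 7 "is" pattern match: type test and assignment in one step.
        // The body only runs when the test succeeds, so no null check.
        if (z is int i)
        {
            Console.WriteLine($"int: {i}");
        }
        else if (z is string s)
        {
            Console.WriteLine($"string of length {s.Length}");
        }
    }

    static void Main()
    {
        Describe(42);       // prints "int: 42"
        Describe("docker"); // prints "string of length 6"
    }
}
```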
So pretty interesting. The next thing is, I had to do a git cherry-pick.
Okay.
Let me,
let me back up.
So I had a problem where along my development process,
I had a C# project that we typically push up as NuGet packages.
Right.
And one of the things that kind of sucks is once you create a version in your NuGet and you push it up to, in this case, we're doing on-prem NuGet packages using VSTS.
If you do 1.0, you can't reuse that again.
It doesn't matter if you delete 1.0, you can't overwrite it.
You now have to do 1.0.1 if you want to push up a new one, right?
And the problem is I would publish these things
thinking that, hey, I'm in a good spot. This is where it is. And then after further testing,
you know, maybe there was an edge case that came up. I was like, oh man, I need to make another
change. And so now my source code that was referencing the NuGet had 1.0 in it, right?
And then now I've got a 1.0.1, so I had to update my source code to now do 1.0.1 and push that up there.
Well, the problem is 1.0,
I never wanted anybody to use, right?
So I didn't want that NuGet package available.
Like if you went and searched NuGet,
I didn't want you to see it.
And Outlaw pointed out to me, he's like,
oh man, well, that's kind of going to stink because now you've got source code out there that references 1.0,
but it's not available in NuGet. And I was like, oh man, this, this hurts my head, right? Like I,
what do I need to do? So I came up with this thing that would allow me essentially what I want to do is I wanted to squash my commits so that 1.0 never showed up in my source code.
Right.
Only ever would it reference 1.0.1.
So I ended up doing and this might have been a really long way to go about it.
But I checked out a new branch from the code.
So master is what I wanted to update.
Right.
So I was in my branch. Let's call it feature A over here. I was making these changes where I screwed up and I had way too many
versions of my NuGet package that I was referencing. And I only wanted the last version to be the one
that was done. So what I did is I committed those, and then I switched back over to master and I
checked out a new branch. And then I cherry picked the commits
from the previous branch I was working on, but I did a dash N, which means don't auto-commit it.
So then when I brought in the various different commits from the original branch,
they all just came in as staged files. And then I could just commit it as one commit.
And then I pushed it up.
I looked for ways to go about doing that, and I could not find anything that made sense.
So you're supposed to be able to do a git rebase dash i, interactive, and potentially be able to squash them.
So you could do like a head, tilde, you know, 1, 2, 3, 4, 5, whatever.
I never could get that thing to work properly, so I ended up doing the git cherry-picks with multiple commit hashes with a dash N. And that way it just brought the changes in, and that was a single commit that I could push up. So maybe that helps somebody. Maybe I just confused the heck out of everybody. I don't know that you needed the dash N, though.
I did.
That just stages it, right?
Yeah.
If you bring in multiple commit hashes in the cherry pick, it commits each one individually.
So in this case, I was bringing in three commits that I was trying to cherry pick over.
It created three separate commits, which is what I wanted to avoid because I didn't want that first
and second versions of the NuGet package to be in there.
But your other commits didn't already get into master?
Correct.
Correct.
I'd never push them up.
So basically creating a separate branch, cherry picking the changes over
with a dash in so that I only had one commit at the very end of it.
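That cherry-pick workflow might look roughly like this as a self-contained sketch; the repo, branch names, and commit messages are all invented for the demo:

```shell
#!/bin/sh
# Throwaway repo: a feature branch with three commits (three package
# versions) that we want to land on master as one commit.
set -e
dir=$(mktemp -d); cd "$dir"
git init -q -b master
git config user.email demo@example.com
git config user.name Demo
echo base > file.txt; git add file.txt; git commit -qm base

git checkout -qb feature-a
for v in 1 2 3; do
  echo "nuget v$v" >> file.txt
  git commit -qam "package v$v"
done

# Back on master, start a clean branch and cherry-pick the three commits
# with -n (--no-commit): the changes are applied and staged, but no
# commits are created yet.
git checkout -q master
git checkout -qb feature-a-squashed
git cherry-pick -n master..feature-a

# Everything arrived as staged files, so one commit captures all of it.
git commit -qm "feature A as a single commit"
git log --oneline
```

The key bit is the `-n`: without it, cherry-picking a range of hashes commits each one individually, which is exactly what the episode is trying to avoid.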
I got you.
Worked out pretty well, honestly,
after I figured out messing around with it for an hour and a half.
So those are my two very long tips of the week.
Well, mine aren't going to be quite that, I don't think.
Just one real quick one.
So we've talked about how there's like fiddles for everything, and about my obsession with Python, or desired obsession with Python.
So there's pythonfiddle.com that I found.
But I don't know if you've ever used it, if you're listening and you've ever used it.
But I found it to be problematic, like especially if you wrote something that it didn't like.
But pyfiddle.io was just amazing.
Like that was a great fiddle for Python development out there.
And then we were talking about, you know, last episode about some of the offerings that Google had or has.
And they have an entire class
offering just for Python. Um, so I thought I would include a link on that, just to kind of
go along with, you know, some of the stuff that we talked about in the last episode.
Um, then as far as, like, you know, other tips, I thought, well, okay, there's our love of Visual Studio Code, right?
Like, that IDE just keeps getting better and better and better. There's a plugin for Visual Studio Code where you can connect to and execute
SQL against a SQL
Server.
It's a really cool plugin.
Forget all that search engine stuff we talked about before.
Go back to SQL because now you can just stay
within Visual Studio Code
and query everything you need
to query. They'll come out
with an Elasticsearch plugin next week and then
you can go back to all the things we talked about before.
But for right now, you can use that plugin.
So I'll include a link for that.
It's really nice.
And then I totally forgot to make a note of who brought it up in the Slack channel,
but somebody brought up git bisect.
It was like, oh, hey, we should talk about git bisect. So Alan's comments
kind of reminded me about this. Because git bisect, what that does is, it's a tool for when,
let's say, you have some known working commit, but your current commit that you're working on is broken.
Something's broken. It doesn't compile, or functionally it's broken, whatever.
What you can do with git bisect is you can give it the known good commit and the bad commit,
and it will do basically like a binary search going back through the commits in
between to eventually get to the commit that introduced the problem, right? And so it's a
really cool tool. I'll include the documentation for it. But where it relates back to what Alan was saying, though, is in his example: if he had let that commit get into the code base, the one that introduced the NuGet package that
was no longer available, and you were using git bisect, my comment to Alan was like, well,
you're going to break git bisect functionality, because now you won't even be able to compile
the code. And you won't know the real reason:
it's not because you found what you were investigating,
but instead it's because, oh, this thing being referenced
isn't available anymore.
So there are some things to be careful about.
That's why it's kind of important, when you maintain the history of your code, to have some diligence about it. Because if you want to use a tool like git bisect, which can be a very powerful tool, it's important that you not put garbage into your history, right?
So garbage in, garbage out. It's also another reason to consider maybe
squashing your commits. Because if you're the type of developer who,
and I'm not judging, but if you like to commit things that are not in a working state or a final state, then you will bring problems on yourself if you
wanted to use git bisect, for similar reasons as I mentioned with Alan's scenario,
right?
Where, like, if you are using git bisect, you won't know that, oh, it's
broken not because of what I'm investigating, but because last month, when I was working
on this feature, I just wanted to get the code checked in for the night and, you know, go to bed or get on
another computer or whatever. And so I just went ahead and committed it and pushed it.
Right? Then, you know, it's going to make it difficult for you.
Git bisect is important because, like, if you have a hundred commits between when you know it
was good and when you know it was bad, at worst you're looking at about seven checks,
right? Because that binary search is going to take the logarithm of that. So it's really powerful
if you've got a whole bunch of stuff going on and you need to find something fast. I never use it, though,
because a lot of times it's, like, a slow build. So, like, you still have to go and check it on each
one of those steps. So I'd rather just not make mistakes.
Roger that.
Well, there is that option. But yeah, sure. Yeah.
Uh, oh, guess it's me. Um, so I wanted to
mention Netlify, which is something that Swyx told me about a long time ago. Miss you, buddy. Fudge
also, Syntax.fm, that they talk about all the time. And I gotta say, it's as awesome as everyone says.
And what this is, I'm just going to say, is free hosting.
And
now I'm going to play a little game with myself called
Yeah, but.
And I'm going to say, yeah, but I want a custom domain.
Like, oh no, that's free.
HTTPS, free.
And I want to be able to, say, deploy
automatically whenever master is updated.
Like, yeah, that's free. Like, no, but I want to do it
quickly after master. Yeah, that's free. It's like, oh, okay. Well, so how the heck do these guys make their money?
They're going to be, yeah, but I want to pay them money. Yeah, well, you could do that too. Oh, so
they've got a function, uh, or they've got, uh, the ability for functions, where they'll basically wrap
AWS Lambda functions for you and deploy over to AWS Lambda. That'll give you a little upcharge there, where they kind of charge you, like, 20 bucks a month
or whatever.
And they'll kind of abstract those costs in AWS for you.
So it's really amazing if you just want to get this thing going.
And I don't even think I mentioned the build step.
Like the project I'm deploying there, it actually does like an NPM build.
It's got versions and stuff.
And it's stupid simple how you set this up, though.
This isn't some big complicated web form. You basically sign up for the site.
It's like, hey, where's your repo?
And like right here, it's like, all right, here's your fake domain name.
Oh, okay.
And if you want to set up a real domain name, you just kind of plug that in there.
But it gets you up and hosted and running like immediately.
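As a sketch, the build setup being described could also live in a `netlify.toml` at the root of the repo; the build command and publish directory here are hypothetical and depend entirely on your project:

```toml
# Hypothetical netlify.toml for a repo with an npm build step.
[build]
  command = "npm run build"   # whatever your build script is
  publish = "dist"            # the folder your build outputs
```

With that in place, Netlify rebuilds and deploys automatically on every push to the production branch, and things like the custom domain and HTTPS are configured in the UI.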
And functions aren't the only way.
They also have the ability to manage your
identity.
So like signups,
logins,
password and stuff like they'll handle that for you.
There's also some stuff with forms.
So if you want to like take in information,
So there's a couple of different ways that you can pay them money if you
really want to, like teams,
but it's amazing to me what they offer for free.
And I really like the upsells. Split testing is another one that they do some charging for.
So it's just really cool.
It's a really cool product.
You can get started for free.
And chances are that you're going to find something that you are actually going to want to pay them for to just make your life easier.
That's Netlify.com.
It almost seems like an incubator for projects.
Yeah, I like that. It's like
Erlich Bachman's incubator.
I started watching Silicon
Valley again, by the way.
Back in the beginning. It's great.
It really is.
And
that's about it. So we talked about search engines, and we
talked about how they offer highly scalable solutions
that make certain types of problems
very easy to solve, and why in certain
situations they can be really good
compared to something like a SQL database.
And we talked about how they
solve these problems mainly with inverted
indexes.
And we gave you a couple different examples
of applications and a whole bunch
of tips that were like super
good.
They're super awesome.
Super awesome. Alright, so with
that, subscribe to us on iTunes,
Stitcher, or more using your favorite podcast
app. It's probably Spotify.
Be sure
to leave us a review. You can visit
www.codingblocks.net
slash review.
And while you're up there, check out our show
notes, which are amazing.
Thank you, Michael. Examples,
discussions,
and more.
And send your feedback,
your questions and rants to the slack channel,
which you can temporarily access by like emailing us or tweeting us or
something.
We're working around something.
So follow us on Twitter too,
at Coding Blocks, and head over to codingblocks.net,
where you can find all our social links at the top of the page.
Oh,
and by the way,
if you haven't already,
you,
we mentioned some videos.
You should probably head to codingblocks.net/youtube so that you can go
to our YouTube channel and subscribe there.
Oh,
good call.
Yeah.
Does that exist?
It does.
Oh,
cool.
I just made it up right now.
I was just seeing if you were paying attention, and, you know...
Oh, that would be the first time that has happened.
We're going to get a bunch of 404s.