Drill to Detail - Drill to Detail Ep.41 'Developing with Google BigQuery and Google Cloud Platform' With Special Guest Felipe Hoffa
Episode Date: October 30, 2017. Mark is joined in this episode by Google Cloud Platform Developer Advocate Felipe Hoffa, talking about getting started as a developer using Google BigQuery along with Google Cloud Dataflow, Google Cloud Dataprep and Google Cloud Platform's machine learning APIs.
Transcript
So my guest this week is Felipe Hoffa, someone whose name many of you will recognize from his posts about Google Cloud Platform on things like Stack Overflow and on Twitter and Reddit and so on, and of course his role as a developer advocate at Google.
So, Felipe, it's really good to have you on the show, and thank you for coming on as a
guest.
Great to be here, finally.
Thank you for inviting me.
Thank you.
So, Felipe, do you want to just kind of give us a bit of a background really for yourself
and how you ended up working at Google and I suppose the route you had coming out of
Chile and over to the States and so on?
Exactly. I grew up in Chile and had most of my professional life there, until seven years ago, when I got my Google interview. I moved to San Francisco and started at Google six years ago as a software engineer.
And then two years later,
someone thought that I would be a great developer advocate.
They invited me.
And yes, I've been doing this since then.
Okay.
And so what is a developer advocate at Google Cloud Platform?
And what do you do there?
What's your kind of role and your focus and so on?
For me, a developer advocate is a software engineer with a license to speak. So basically my job is to tell other software engineers, data scientists, doers, the cool things they can do with our platform. My main topics are big data and especially BigQuery, so my mission is basically: how do I communicate to people that BigQuery is a product they can use, and that it will solve a lot of problems for them?
What would be quite good at the start, really, would be to just paint a picture of the landscape of products that run on Google Cloud around big data and analytics, just at a high level, and then we'll go into some of the details later on.
Yep.
So to start off, let me just go back to my time in Chile, when I was a software engineer there. First, I was pretty impressed when we got the first cloud offerings, just getting virtual machines on Linode and other providers.
So I started using them.
Then I was pretty happy when I saw Amazon getting in the game.
So the startup I was working at, I started developing our services there.
And then one day I realized there was this thing called App Engine from Google
where the whole serving infrastructure was managed.
I was able to just write my scripts,
write my code,
and let Google host it
without me having to worry about that.
That was like, you know,
eight years ago, nine years ago now.
So as a software engineer outside of Google,
I evolved from renting my own VMs to learning a new language.
At that time, I was doing mostly Java and Ruby.
But I saw this App Engine offering, and it really resonated with how I wanted to do things.
So at that time, App Engine worked only with Python, so I learned Python,
and it really helped me to get things done.
And a lot of Google Cloud platform offerings come from that world where the question is,
how do I get Google to do most of the work for me?
I can focus on adding my logic, adding my ideas, not needing infrastructure.
I don't need to worry about infrastructure anymore.
Yeah, definitely.
I mean, that was the thing that struck me.
So I came into the world of this through a startup I'm working at, and I was very much used to running servers on premise. And even when we ran stuff in the cloud before, you know, you were managing VMs, and it was maybe a cluster
of VMs that you could spin up using a kind of a tool that would spin them up and bring them down
and so on. But you were still effectively working with VMs, and you're having to deal with things
like scaling, you're having to deal with, you know, just deal with kind of faults and dealing with kind of capacity and so
on. But I guess a common thread in Google Cloud Platform's products in this area is that typically they're what's called serverless, rather than just infrastructure as a service. I mean, just, I suppose, spell out what that means in terms of scaling, in terms of where the time goes.
And as a developer, you know, what does that mean in terms of
where your focus is in this kind of world, really?
Yeah, so my focus goes into developing the ideas that I have.
Like, I'm extremely lazy, as good software engineers should be.
So we really want to automate everything.
If someone that is not me can automate everything
that can be automated, I want to leverage that.
So the origins of Google Cloud Platform come from that. From the beginning, we wanted to do completely managed solutions.
And as we did that with App Engine, when I was still not working here,
but then I saw, I was still outside when I saw the first announcement for BigQuery,
which, well, as you know, I'm pretty linked to now.
And with BigQuery, we have a similar story.
You have data to analyze.
You can load it into BigQuery.
You can run your queries.
And that's it.
There is nothing to set up.
There are no servers to turn on.
There's not much to tune.
You just have data, you have queries, and this will solve that problem for you.
Okay. I mean, as we talk, really, I want to talk to you later on about porting, say, a data warehouse workload into this environment, and I guess some of the practicalities of that. But also, BigQuery isn't the only database service that Google offers in this area. You've got things like Cloud SQL and this thing called Spanner. I mean, just maybe outline what those are, and how they differ from and how they're similar to BigQuery, really.
Yes.
So, between BigQuery, Cloud SQL, and Spanner. Let's go back to the App Engine world, where I was able to write my scripts and have them completely managed, scaling included.
At the same time, as we were doing services like that, not everyone was ready to jump into this new world.
And a lot of people just wanted VMs, virtual machines. And Google also started offering them for people that just need
raw machines. Now, in big data, we have a similar story. Some people might want to jump to BigQuery
and do everything there, but some might want to keep using the tools that they know, that they're experts in, and also get help running them.
So Cloud SQL solves that problem.
Cloud SQL is our managed either MySQL or Postgres servers.
So if what you need is a MySQL server,
if what you need is a Postgres server,
we can help you run it. We will automate a lot of the infrastructure: backups, scaling, etc. But at the end of the day, what you're getting
is the MySQL or PostgreSQL servers that you probably already know that are compatible with
other things that you use.
So do they autoscale in the same way as things like BigQuery?
I mean, I think, is it the point, again,
that it's completely elastic sort of infrastructure?
Or is it more a case of they run as a managed service,
but they have more limits and so on?
How does that kind of work?
Yes.
So they are a managed service that is not as magic as some products that are just
natively magic.
With Cloud SQL, we try to do our best with the products that already exist.
But they're still MySQL, they're still Postgres, so there are some inherent limitations to those products,
while at the same time being pretty easy to pick up and use
and start building your applications.
Yeah, I mean, I actually use Cloud SQL, because there are some tools out there, some BI tools, like, say, Superset, for example, that don't yet natively connect, in my experience, to things like BigQuery.
So I tend to kind of like ship data into your Cloud SQL environment and then use that to
connect.
So the integration between the two is quite good, but it is effectively kind of like MySQL
running in that environment.
But what about Spanner?
I mean, I know I'm very conscious Spanner isn't your kind of product area, but just
at a high level, there's been a lot of kind of talk about Spanner and maybe some people aren't aware of it.
What is Spanner, really?
And what's the background to that?
So Spanner comes as part of Google's history
of trying to solve our own data problems.
With BigQuery, the question we are trying to solve is: how do we run full table scans, without indexes, for analysis, in an extremely fast way?
Now, with Spanner, what we needed is a database like MySQL,
but that would be able to grow to a Google scale.
When you're growing MySQL, when you have multiple nodes and endless requests, or when you use a NoSQL database instead, you start finding some limitations: either eventual consistency or limits on scaling up. Spanner was our answer to all of these problems. How do we get a SQL database that we can scale massively?
And the most magic part: what people had to decide when they were growing into these new NoSQL databases was, how do we handle eventual consistency? How do we handle partitions in our network, from the famous CAP theorem? And with Spanner, we feel we solved that problem.
In broad terms, I mean, how was that solved? Or was it kind of Google magic that's described in the various white papers that are out there? Was there a key innovation there, do you think?
Yeah, well, one of the key innovations there is TrueTime: the ability to keep all of the servers synchronized, so that time is being accounted for accurately in every server. That's a very hard problem.
That's a very hard problem, isn't it?
It's one of those classic things in
clustered systems and whatever
that actually getting synchronized time between the services
is an unexpectedly hard thing to do, isn't it?
Yes, it's a super, super hard problem
unless you can bring hardware into the mix.
And that's part of the Google magic here,
is our ability to bring atomic clocks
that synchronize all of these servers and give the system an accurate picture of what time it is, and in what order the transactions in a distributed system are happening.
Okay, okay. So what I'd like to do next, really, without going through every product, is to walk through what would be involved in, say, moving an on-premise workload, a data warehouse workload, into BigQuery. I'm not going to get into the details of individual step-by-steps, but some of the conceptual things there, really.
Before we do that, I just thought it would be useful, because some people wouldn't really know: what is Google Cloud Platform? They might know Google from search and from docs and so on. So just maybe take a couple of minutes: what is Google Cloud Platform, how does it relate to the internal stuff Google does, and, in a broad sense, what's the kind of differentiator for it?
Yes. So, well, you know, there are other alternatives to Google Cloud. I've used them too.
At Google, what I've seen that really makes me happy,
even before joining Google, was the ability of Google
to manage to solve problems as magically as possible for you
with App Engine, with BigQuery.
And at the same time, when Google entered this world,
we needed to catch up at first with traditional services
that other clouds were offering.
And we came from a position of pretty awesome internal tech. Our networking abilities are really impressive,
and all of these ideas make us believe that we can be pretty competitive and offer
differentiating features. Yeah. I mean, my experience,
I mean, again,
coming into an environment here
that was based around Google Cloud,
the thing that struck me was,
I suppose Google in a way,
maybe it reflects their business
to consumer B2C kind of background,
but it tends to be products they build.
So that, you know,
BigQuery is a product
that's a managed service.
It's kind of, you know,
it's kind of no ops.
It's very much kind of finished off
and so on. Comparing it to, say, other cloud platforms that are more, I suppose, components for you to put together, it strikes me that the bits we work with, things like Cloud Dataflow and BigQuery, are finished as products and are a more complete solution. It's almost like there's one solution for a problem that's finished off, rather than, I suppose, lots of takes on it from other providers that are half finished, where it's your job then to bring them together. And this fact that it is finished off, that it is a service you can rely on, that it is virtually no-ops, means, taking our example, that we've got a team of engineers that work on this platform, and they're just innovating all the time.
They're not trying to manage servers, they're not trying to scale stuff up.
They're still very technical, and it's still a lot of innovation,
but it's not around trying to keep servers running.
It's around building a service on that
that is itself innovative in terms of a business.
Exactly.
So many of our products come from our own needs of how do you run Google.
The scale of Google, the company, is impressive,
like everything that we do as consumer products.
Then the problem we've always had to solve is how do we run this?
How do we scale internally?
And that's where BigQuery comes from. That's where the ideas for Dataflow come from,
starting with MapReduce.
MapReduce is the paper that started it all.
Exactly.
So let's think about moving the workload over.
So let's imagine that I am a developer or DBA
that's running an on-premise, let's say, for example,
SQL Server or Oracle Data Warehouse.
And somebody says to them, we're now moving this, or can you test out or learn how to start to run
this kind of a workload on a kind of Google Cloud environment instead? And how would that person
approach that? And let's be more specific. If you are a developer, how would you think about moving the workload, moving the tables?
You know, is BigQuery conceptually exactly the same as an on-premise system, or would
you maybe structure the database differently?
Maybe, what would your thoughts be on that as an initial bit of advice to somebody?
So my favorite way of approaching this is asking people what their pain points are. Like, when I go to a conference and I'm speaking in front of everyone, I ask them, who is working on big data? And not a lot of people are sure if they are working with big data or not, depending on the conference. But then if I ask them, who knows SQL? Everyone raises their hand. And then I ask them, can you keep your hand up if you've ever had a query that took hours to run, days to run? And they have seen this pain. This is a pain that they have felt.
So when I can offer them a solution that will take that pain away, where they can just load everything and run queries in seconds, without going through optimizations or committee meetings about what indexes we should or shouldn't add to our database, that's a really good starting point.
If everything is going well for you,
you don't need to switch anywhere.
But a lot of these products only make sense
when you're feeling the pain.
Same thing with Spanner.
If you're doing well with your MySQL server,
Spanner might not be attractive for you.
But then if you know how hard it is to run MySQL at scale,
if you know how hard it is to run a NoSQL database at scale,
when I'm able to offer you a solution
that will take that pain away,
people really start listening.
Okay, okay.
So let's take a concrete example. I've worked at things like gaming companies in the past where, let's imagine, we're getting transactions, bets and so on, coming through, and they're at the point where the volume of these things coming through is more than the traditional kind of relational database can handle. So they're having to deal with, I suppose, a greater throughput now than they used to have, and they're thinking, I've got maybe a table structure running in, kind of like, Oracle or something, that's got dimensions and facts and so on. Would you suggest that they move that across as it is, or does, I suppose, the distributed nature of BigQuery mean that you might think about data modeling differently?
Yes. So a lot of the ideas behind your current designs come from designing around current limitations. All of the data cubes, etc.: the whole idea is to make things easier to compute later, to be able to process them. With BigQuery, that's not a problem anymore. So why keep those restrictions in your design?
So yeah, my favorite way of telling people to start playing with this, like there's nothing better than getting your own hands into it, is to get an export of their data into BigQuery.
If they can get just one huge dump, a huge file into BigQuery, and they can start running queries.
They can feel the difference.
It's a typical one-day-long, one-afternoon-long proof of concept
that converts people to this.
Yeah, yeah.
So BigQuery is a columnar database, isn't it? So I guess if you come from a world of row-based databases or storage, some things are easy and some things are hard. But does the fact that it's a column store mean anything different? And you mentioned a little while ago there are no indexes. So I guess, what kind of queries work well in that environment, what kind of design works well there, and what would be a wrong one, really? I mean, certainly not having indexes is interesting, really.
Yes. So BigQuery's biggest strength, which is also one of its biggest weaknesses, if you want to see it that way, is that BigQuery can process terabytes of data in seconds. Now, it will also process small data requests in seconds, because everything is optimized to just do a full columnar scan. Usually, you want your database to
reply in less than a second.
Yeah, you have a key,
you want your value, that has to take
milliseconds. BigQuery is not that.
BigQuery is not that kind of database.
So,
if you already
have a solution that brings you answers in less than a second, don't switch
to BigQuery.
But then there's all these problems you have, all these queries that are taking you way
longer than a second.
I'm sure you've run all night processes where you come back the next day to see if it failed
or not.
Well, that's the kind of workflow you bring here.
What I was going to say, actually, there was an interesting one, in that you and I had a conversation on Twitter recently where I'd come across a problem, you know, in my day job, where we had problems joining tables. And I think there's a perception sometimes that BigQuery can't do joins; maybe some people coming into BigQuery wouldn't even know that joins could be an issue. I mean, maybe just talk to us a little bit about why joins could be an issue with BigQuery, but also about a strategy for setting up your table structures, and the way you use BigQuery, that means you'll be successful in that. Because you certainly corrected me on that online, and I think there was an interesting kind of story there, really.
Yeah. So BigQuery has the ability to do joins. When we first released the product, it was not
able to do joins; it just ran full column scans. Then a year later, two years later, we had the ability to join with small tables, where we basically copied the small table to every distributed BigQuery node, and then we were able to join it with the big one.
But today, BigQuery has the ability
to join arbitrarily large tables,
and it does a pretty good job at it.
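The small-table strategy Felipe describes is essentially a broadcast join. A toy sketch of the idea in Python follows; the round-robin partitioning and all the names here are illustrative, not BigQuery's actual internals:

```python
# Toy broadcast join: the small table is copied to every "node"
# (here, every partition of the big table), then joined locally.

def broadcast_join(big_rows, small_rows, key, num_nodes=3):
    # Build a lookup from the small table once; every node gets a copy.
    lookup = {row[key]: row for row in small_rows}
    # Split the big table across nodes (round-robin, for illustration).
    partitions = [big_rows[i::num_nodes] for i in range(num_nodes)]
    joined = []
    for partition in partitions:          # each node works independently
        for row in partition:
            match = lookup.get(row[key])  # local hash lookup, no shuffle
            if match is not None:
                joined.append({**row, **match})
    return joined

orders = [{"user_id": 1, "amount": 10}, {"user_id": 2, "amount": 20},
          {"user_id": 1, "amount": 5}]
users = [{"user_id": 1, "country": "CL"}, {"user_id": 2, "country": "UK"}]
print(broadcast_join(orders, users, "user_id"))
```

Because each node holds a full copy of the small table, the big table never needs to be shuffled, which is also why the technique only works while the small table stays small.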
Now, sometimes what happens is when you're doing a join,
sometimes you end up writing a SQL query
that creates an exploding join.
And those are really bad.
Like, let's say you're doing a cross join.
You have a 5 million row table
and you do a cross join with itself.
That cross join can produce 25 trillion rows.
That's a bad idea.
You probably don't want to do that.
But with SQL, it's pretty easy to write queries that do that.
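The arithmetic behind that warning is worth spelling out: a cross join of a table with itself produces the square of its row count.

```python
# A cross join pairs every row with every other row, so a table
# cross-joined with itself yields rows ** 2 output rows.
rows = 5_000_000
exploded = rows * rows
print(f"{exploded:,}")  # 25,000,000,000,000 -> 25 trillion rows
```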
How important is it to do things like nested columns?
I mean, again, that's something when you learn, when you read about BigQuery, you hear about
nested columns and so on.
Is that perhaps kind of over-engineering the solution or does that kind of come into it
really as a general day-to-day solution to things?
Yeah.
So I was listening to your interview with Dan earlier.
So yes, nested data,
our ability to do arrays inside SQL,
I find it a very beautiful solution, but one that you still need to wrap your mind around.
Daniel Mintz at Looker wrote an excellent blog post about how beautiful he finds the ability to do nested data in BigQuery. I don't know if you saw that one. Yeah. But it's a great modeling technique: data that should be together, instead of living in different tables, just lives nested inside one row.
One of the typical examples we have here is from...
Well, you know Google Analytics.
Mm-hmm. Yes.
So it's pretty easy for Google Analytics 360 customers
to export their data into BigQuery.
So instead of going through the Google Analytics web UI or API, you can just dump everything in BigQuery and start asking any
query that you may want or join it with your own datasets, which is pretty cool.
Now, if you go and see how this data is modeled, each row represents a session.
And that means you will have a certain number of columns
describing the session, who the user is.
But one column contains multiple rows: each page view hit. And instead of duplicating all of the session-level data around many rows, you can find it compressed into only one row, plus an array with every hit.
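The session-row-plus-hits-array shape Felipe is describing can be sketched in plain Python; flattening it back out is roughly what BigQuery's UNNEST does. The field names below are made up for illustration:

```python
# One row per session; the hits live in an array inside the row,
# instead of being duplicated across many rows.
sessions = [
    {"session_id": "s1", "user": "alice",
     "hits": [{"page": "/home"}, {"page": "/pricing"}]},
    {"session_id": "s2", "user": "bob",
     "hits": [{"page": "/home"}]},
]

def flatten(sessions):
    # Roughly what UNNEST(hits) does: repeat the session-level
    # columns once per element of the array.
    return [{"session_id": s["session_id"], "user": s["user"], **hit}
            for s in sessions for hit in s["hits"]]

for row in flatten(sessions):
    print(row)
```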
And when writing queries, well, that's a pretty good solution, though it still creates problems for people when they are just wrapping their minds around the idea.
Yeah. So with that, I mean,
I mean, I guess this is one thing that leads on leads on to is a lot of people moving a workload onto Google Cloud and BigQuery and so on for various kind of valid reasons would be thinking like dimension load processes.
So in data warehousing, as you kind of know, there's this kind of concept of you've got a dimension table that maybe gets updated and it's joined back to the fact table and so on.
I don't know if you know the answer to this, but would things like nested columns be a solution
for that? Or when someone says to you, I need to replicate this kind of dimension joined to fact
table kind of setup in BigQuery, forgetting the update part at the moment, but how would you
typically approach that really? Or is it completely the wrong thing to think about?
So what I always try to do with my tables
is to design around what queries I want to run.
So I want to optimize my tables for query
because that's where we are going to spend most of the time working.
So if all of my sample queries have a join between three, four tables,
it might be much better if I offer my users a table
that has all that data pre-joined.
Having copies of data, having duplicated data, might make you nervous, but that shouldn't be a big problem if you're able to regenerate these tables at any time; storage is cheap.
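That advice, optimizing tables for the queries you actually run, amounts to materializing the join once at load time. A hypothetical sketch, with made-up tables:

```python
# Denormalize once at load time: pre-join the dimension onto the fact
# table so every query reads a single wide table, with no join at all.
facts = [{"product_id": 1, "qty": 3}, {"product_id": 2, "qty": 1}]
products = {1: {"name": "widget", "category": "tools"},
            2: {"name": "gadget", "category": "toys"}}

# Regenerable at any time from the sources, so duplication is safe.
wide_table = [{**f, **products[f["product_id"]]} for f in facts]
print(wide_table)
```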
And the question now here is: what queries do your users normally run? Let's optimize for those.
Okay, okay. And that's an interesting lead into how someone might get access to the tools to do this, really. So your title is actually, you know, developer advocate, and so it'd be interesting to understand, well,
someone who comes from the on-premise world
that maybe has kind of, I don't know,
SQL access to the command line,
or they have maybe kind of data integration tools
or whatever, what would be the typical toolkit
for somebody who is a developer working with this?
And how can Google help with that?
What do you have in terms of kind of like tooling
and I suppose command line interfaces and so on?
Yeah, so if we focus around BigQuery,
we have at least two different types of users.
On one hand, we have the people with questions, the analysts,
the data scientists.
So they have a tool set to query this data. And on the other hand, we have our heroes
that are the people keeping the pipeline alive: how do I keep fresh data inside BigQuery, so people are able to query it? In the analysts' tool set, the first tool you find here is the BigQuery web UI.
Yes, the ability to open up a browser, log into BigQuery,
and see all the tables you have access to
and start running SQL queries without any more preparation.
That's the first place we go.
Beyond that, I don't know if you are already using the BigQuery Mate Chrome extension?
Yeah, just explain what that is actually.
So one of our customers, users, one of my favorite BigQuery users in LA created this extension because a lot of people in their company are using BigQuery
and sometimes they wanted our UI to do different things. So instead of waiting for us to create
the different features to the UI, Mikhail just started adding his own features and now
this is released as a Chrome extension.
We use it every day here, yeah.
Yeah.
Yeah, I'm impressed.
If you go to the statistics for the extension, it has something like 3,500 active users.
One of the things it was good at as well, I think,
one thing we like about that Chrome add-in, sorry,
is that it tells you, it actually predicts what the price of the query will be.
It will tell you, this query will process
this much data and whatever.
But actually, what's interesting is how many of the queries that we generate are actually free. And BigQuery has got a different charging model, hasn't it, to maybe what people are used to, in that you charge based on queries, is that right? Rather than kind of like what you're storing? I don't know, how does that work?
Yes. So, as BigQuery is a completely managed solution where you don't need to turn on servers or anything, the pricing model is built around the queries you write.
If you're not doing anything with BigQuery,
then the cost is zero.
When you want to write a query, you pay per query.
That's a different mental model to approaching these problems.
But at the end of the day,
a lot of people find out that
it improves their cost a lot.
But it's...
Yeah.
It's kind of interesting.
Certainly, it means that we have to be very cognizant
of making...
In the old world of on-premise databases, doing select star from something was quite straightforward.
But, you know, doing select star from a BigQuery table brings back maybe all the columns.
And how would that affect charging, really?
I mean, how does that affect maybe the SQL you write or your kind of carefulness about bringing back all columns rather than less columns, really?
Yes, so, select star limit 10 is not the most efficient BigQuery query, because basically, we're asking BigQuery to read all of our columns and then just give us 10 results.
There is a free operation for that: if you want to see your table, you can just preview it without running a query.
With that said, the problem you want to solve is always: how do I get to the results I want?
On one hand, BigQuery shouldn't be expensive, but if you have massive tables,
if you have one terabyte table and you're querying that one,
querying a terabyte of data will be a thousand times more expensive than querying a gigabyte, because the cost is linear.
But I suppose the fact that it's a column store means that if you just get the columns you want, suddenly it's a lot cheaper, isn't it? So rather than, in a way, paying for the servers and the storage to store a big table and querying it, you're just paying for the columns that you actually request.
So in some respects, that is better, isn't it?
Exactly.
So what I'm able to do when writing a query
is just to bring the columns that I'm interested in
And just the ability to know the cost of a query before I run it helps a lot with understanding that kind of problem.
So I can limit the cost of my queries in two ways.
One is choosing the columns that I will query
instead of querying all of them,
the typical select star.
And then I can also develop my queries
over samples of data.
For instance, let's say I have one petabyte of logs. The question is: before running all my queries, how do I extract the data I'm interested in from those tables?
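Since on-demand billing is linear in bytes scanned, and a columnar engine reads only the columns a query references, a back-of-the-envelope cost estimate is simple. The $5-per-terabyte figure below was roughly BigQuery's on-demand price around the time of recording; treat it, and the table sizes, as assumptions:

```python
# On-demand pricing is linear in bytes scanned, and only the columns
# referenced by the query are read. Assumed price: $5 per TiB scanned.
PRICE_PER_TIB = 5.00
TIB = 2 ** 40

def query_cost(column_sizes_bytes, selected_columns):
    """Estimate the cost of a query that reads only selected_columns."""
    scanned = sum(column_sizes_bytes[c] for c in selected_columns)
    return scanned / TIB * PRICE_PER_TIB

# Hypothetical 1 TiB table split across three columns.
columns = {"user_id": 200 * 2**30, "url": 700 * 2**30, "ts": 124 * 2**30}

full_scan = query_cost(columns, columns)        # SELECT * reads everything
narrow = query_cost(columns, ["user_id", "ts"]) # reads only two columns
print(f"SELECT *: ${full_scan:.2f}, two columns: ${narrow:.2f}")
# SELECT *: $5.00, two columns: $1.58
```

This is the arithmetic behind both points Felipe makes: projecting fewer columns cuts the bill, and the cost is knowable before the query ever runs.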
Yeah. I mean, actually, we're talking here about charges and costs, but one of the things that really attracted me to the Google platform, when I was, I suppose, transitioning from, say, Oracle, for example, is that as a developer playing with things, learning the technology, it's actually pretty rare for you to incur any costs at all. And, you know, while we've been talking, I've been bringing up my transaction history on Google Cloud here, and there have been many, many months where I've paid like a pound or something. As a developer, I guess you guys have been very good at making it possible for developers to learn this in a way where you're not going to get these big bills. All the tools are there, and you're using an environment that is exactly the same environment as you'd use at work, just with a small data set, and you've got these free tiers as well. Is that correct? Am I correct on that?
Exactly. So everyone has, every month, one terabyte
to run queries for free. You don't even need a credit card for that. And we have a lot of public data sets available. That means that you can log into BigQuery, find any of these public data sets, all of the GitHub history, a copy of Hacker News, a copy of Stack Overflow, the worldwide weather, etc., etc., and just start writing queries, finding out how different this is. Like, oh, suddenly you have a place where you can come and start querying data right away, without any charge at all.
Yeah.
Now, once you get your feet wet, you start loving it.
Like, bringing it into your own life.
I do.
I mean, and I land all my personal data into BigQuery.
I actually have loads of feeds coming in,
and it goes into BigQuery, and so on. Because I think, again, as a developer, if you're learning something, the fact that your environment stays there month in, month out... I mean, a lot of vendors will give you trials for one month, two months, or whatever, and then suddenly it becomes chargeable, or you might have 30-day trials and so on. But the fact that you guys make the environment available, you know, it's pretty hard to incur a cost as a developer at home. I mean, obviously at work it's different.
And that environment stays there as well.
And I think the other thing that's interesting is,
you cover Google Cloud in general,
not just so much BigQuery,
and you've got all these APIs that you can use for things like sentiment analysis and stuff like that.
I mean, tell us a bit about those and what they are
and kind of how easy they are to use, really.
By the way, in BigQuery we also have a free storage tier.
So, yes, the data you're storing there, the data you're querying in BigQuery, it's all
free up to a certain limit.
But that gives you a lot of freedom to play with this.
And then, as part of the Google Cloud offering, we are investing heavily in machine learning.
Machine learning is one of the strengths of Google as a company,
and we want to share our tool set with the world.
So sometimes, when we go outside of structured data,
let's talk about images, let's talk about text.
Let's talk about understanding a lot of data that people,
institutions, companies are collecting
that they are not able to understand.
We want to help them understand that.
You have a huge collection of pictures.
You may have a huge collection of videos.
You might want to transcribe
all of these podcasts to text. You
might want to know what the concepts were, how people are speaking about your product. You might
have a data center full of recordings and you want to understand all of them. We offer you APIs that can help you do exactly that.
If you have a video and you want to know
everything that's happening inside the video,
either the audio or the pictures,
you can put your videos through this API,
extract all of the metadata,
and then store it in places like BigQuery
to just analyze them later.
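The pattern Felipe describes here, run unstructured content through a Cloud ML API and land the result in BigQuery for later analysis, can be sketched in Python. The `google-cloud-language` calls below follow the real client library, but the table name and row layout are assumptions for illustration, and `analyze_and_store` is not called here because it needs GCP credentials.

```python
# Sketch: text -> Cloud Natural Language sentiment -> BigQuery row.

def to_bq_row(doc_id: str, score: float, magnitude: float) -> dict:
    """Shape one sentiment result as a BigQuery streaming-insert row."""
    return {"doc_id": doc_id, "score": round(score, 3), "magnitude": round(magnitude, 3)}

def analyze_and_store(doc_id: str, text: str, table: str = "mydataset.sentiment"):
    # Requires `pip install google-cloud-language google-cloud-bigquery` + credentials.
    # `table` is a hypothetical dataset.table in your default project.
    from google.cloud import bigquery, language_v1
    doc = language_v1.Document(content=text,
                               type_=language_v1.Document.Type.PLAIN_TEXT)
    sentiment = language_v1.LanguageServiceClient().analyze_sentiment(
        request={"document": doc}).document_sentiment
    bigquery.Client().insert_rows_json(
        table, [to_bq_row(doc_id, sentiment.score, sentiment.magnitude)])
```

Once the sentiment scores are sitting in a BigQuery table, they can be joined and aggregated with SQL like any other structured data.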
Yeah.
So actually on that point, one of the things that I found is the APIs are fantastic, and I
use them for sentiment analysis on incoming emails and tweets and all that
kind of stuff.
But one thing I found was it was still a little bit hard to link that to kind of BigQuery.
And I was trying to think about could I maybe create a function that would call the sentiment
API, NLP API?
I mean, is there anything around that, around integration with the APIs in BigQuery that
is coming along that will make it even easier to access those?
Or is this still a case of using the Ruby client or the Python client and so on?
Yes, so there is an impedance mismatch between what you can do with
BigQuery versus what you can do with an API. With BigQuery we can analyze a billion comments
in the next ten seconds, or three seconds. We would kill any API if we sent it
a billion requests per second.
Oh, right, yes.
So yeah, the question there is,
how do we bring things to the scale and speed
that BigQuery has?
Yeah, but it's more, I suppose,
because as you say, it's like an impedance mismatch and so
on. I mean, maybe for someone who is looking to get started, maybe they're a Python developer
or that sort of thing, where would they go, or how would they
start to understand the APIs that are available, and how would they run them?
What would your initial advice be to someone who's looking at that, really?
Yeah, so a nice way to start with APIs, and my teammate Sarah Robinson has a lot of examples of this, is to start listening to events happening in real time. Let's say with a Twitter feed. As you
collect and read each tweet, you can pass them through the API and then store them in BigQuery.
Now, at what speed are you doing this?
You are doing this at the speed that you are getting tweets.
So you're not going over a billion tweets in the next 10 seconds.
You are just collecting data from the outside, going through an API, understanding it, and then storing that data.
So that's a good place to start.
And there you don't have the impedance mismatch.
The API will go at the real speed.
Then you also have Dataflow to extract data
from BigQuery, run any process,
call an API, and bring the data back in. And then if you want to stay in the BigQuery
world, what I do many times is instead of going through the most advanced sentiment analysis API,
you can find cheaper solutions that you can run with SQL.
So, for example, yes,
Google Cloud Sentiment Text Analysis,
it's pretty powerful.
It's really awesome.
But then if you want to run some cheap
sentiment analysis
tools, maybe you can take each word and score it according to a dictionary that tells you
whether words are positive or negative, and you will get a quick answer to how the sentiment has
evolved over time.
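The cheap dictionary-based scoring Felipe describes can be sketched in a few lines. The tiny `LEXICON` below is illustrative only; a real run would load a published word list (AFINN, for example) into a BigQuery table and express the same word-by-word scoring as a SQL join against your text data.

```python
# Sketch: dictionary-based sentiment scoring, the "cheap" alternative to
# calling the full sentiment analysis API for every row.

# Illustrative mini-lexicon: positive words score +1, negative words -1.
LEXICON = {"love": 1, "great": 1, "awesome": 1,
           "bad": -1, "hate": -1, "terrible": -1}

def sentiment_score(text: str) -> int:
    """Sum the lexicon scores of the words in `text` (unknown words count 0)."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())
```

Because this is just a lookup and a sum, BigQuery can apply it to billions of comments in one query, which is exactly the scale mismatch with per-request APIs discussed above.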
Okay, okay. So you mentioned Dataflow there. And one thing I have found with the Google world is that Cloud Dataflow is a pretty
integral part of most systems I see. But it's a lot more complex, or appears to be more complex, than say BigQuery. Do you want to maybe just,
at a very high level, paint a picture of what Cloud Dataflow is? And I suppose, you know, as a developer like yourself who's used to BigQuery,
how did you approach learning Dataflow? And how much do you use it really in
real life? Yep. So with Dataflow, we are solving at least three different problems. One of the most important ones
is how we were running
batch analysis systems
versus stream analysis systems.
And there you have
the typical Lambda architecture
that gave you fast,
inaccurate answers
in real time.
And then you were able
to have a batch process later that would give you
the correct results.
So with Dataflow, one of the important problems it's solving is how do we have a unified system
that does both, where you can get correct results and in real time.
And many of these ideas we put together in what we call now the Beam API.
And the Beam API is now an Apache project
and a lot of other systems, like Spark and Flink,
have been adopting these ideas, and we are developing a unified programming
model where you can write to the Beam API and have your problem solved by different runners.
Now that's one problem. The other problem that Dataflow solves is, once you have your pipelines defined in this way, where do
you run them.
So you have awesome open source runners, but you might want a managed solution.
You might want just to deploy your logic somewhere that will scale up, scale down, and take care
of running this.
And Dataflow is our runner for the Beam SDK.
So you can write your programs against the Beam SDK, and then Google can take care of running them
in a managed way. Now the third problem we have here is the people coding.
Is this SDK easy to use?
The first Dataflow API was implemented in Java,
and maybe not everyone likes coding in Java,
especially outside a corporate environment.
So there Google has been moving forward
with our Python Beam SDK, for people that love Python.
And other people that have adopted Dataflow
and the Beam SDK,
for the ability to have Google take care of running all of this,
have also built their own APIs in their own languages. And there you have Spotify that
is using Dataflow extensively, but they also developed an SDK in Scala. Now you can write Dataflow programs
in the Scala language,
and they also added a lot of interesting design decisions
to the SDK from the programmer's viewpoint.
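The unified model Felipe describes can be sketched with the Python Beam SDK he mentions: the same pipeline code runs locally on the DirectRunner or on Cloud Dataflow, selected only through pipeline options. The element logic is kept in a plain function (names and the toy data are illustrative), and `build_and_run` is not invoked here because it assumes `pip install apache-beam`.

```python
# Sketch: a minimal Beam pipeline in the Python SDK. The transforms are the
# same whether the runner is local (DirectRunner) or managed (Dataflow).

def parse_event(line: str):
    """Turn a 'user,score' line into a (user, score) pair."""
    user, score = line.split(",")
    return user.strip(), int(score)

def build_and_run():
    import apache_beam as beam  # DirectRunner by default; Dataflow via options
    with beam.Pipeline() as p:
        (p
         | beam.Create(["alice,3", "bob,5", "alice,2"])   # toy bounded input
         | beam.Map(parse_event)
         | beam.CombinePerKey(sum)   # same code for batch or streaming input
         | beam.Map(print))
```

Swapping `beam.Create` for a streaming source such as Pub/Sub is what turns this batch sketch into a streaming job, without rewriting the transforms in between.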
Okay.
So just before we get on to the last thing we'll talk about,
which is Google Data Studio,
one of the things that you haven't mentioned, but which is a massively useful resource that I think you've been involved in,
is that there's a lot of examples and tutorials and things on the Google website
that take you through getting started with all these things and BigQuery and Dataflow and so on.
I mean, maybe just kind of explain what they are and kind of, I suppose, the content that's in there
and how they might be able to help people to adopt this new technology.
Yep. So here the question for us is how do we help people get started, from zero to becoming experts. I try to write interesting data analysis posts that bring people to realize that we have all of this
data available and that they can do similar analysis pretty quickly. But that's just a part
of telling people that if they learn BigQuery, if they try it out, their reward is pretty good,
especially if you are fascinated by analyzing data.
Then we have all of these offerings,
for example, our code labs
for people that want to start in a guided way.
So you can find many BigQuery code labs,
Dataflow code labs
that will guide you through the whole process
of getting started,
setting up your environment,
and finally arriving at an interesting result.
So if you just search for Google BigQuery Code Lab,
you will find pretty good step-by-step guides into this world.
Another resource that I found quite useful
was Google Cloud Platform's GitHub repository.
And there's a whole bunch of really good examples in there
of, I suppose, linking, for example,
BigQuery to sentiment analysis or things like that.
I mean, that's a useful resource as well,
but I don't think it's very well publicized.
But I found it very useful as well.
Oh, yeah. My teammates like Sarah,
Amy, and many more,
they publish
all of their code
on GitHub.
If you are Googling for those
results, you will find them.
I also try to collect all of these
results on Reddit.
I'm the admin for reddit.com/r/bigquery.
So everything I know about BigQuery,
everything interesting that I find out there,
I collect it there and I let people just upvote and downvote
what they find interesting.
The same with the subreddit Google Cloud.
When things go beyond BigQuery, I also put them on Google Cloud
or the subreddit
App Engine, etc. So if you want to follow this collection of resources, at least you will find
them all collected on Reddit and ready for you to follow. I wasn't aware of that, actually. So I'll
actually take a look at that now. So that's kind of useful. The last thing I want to talk to you about is, I suppose, you've got a tool, Google Data Studio,
which is getting very popular now within the world that I kind of work in.
Google's free, or free-to-use I suppose, BI tool.
Tell us what that is really and what it's for and how people might start to get enabled with that as well.
Yeah, well, Data Studio is one of my new favorite products
in the sense that it just solves a problem for you
and you do not need to care about any infrastructure.
With Data Studio, you can create interactive dashboards
following the Google Drive model.
Like, if you want to create a document
and share it with anyone,
you can open up Google Drive,
you can start your document,
you can make your document public,
you can share it with me
if you want to share something privately.
And with Data Studio, we have a similar model.
Data Studio can connect to multiple data sources,
hundreds of data sources now.
We have this new connector program
so anyone can write a connector for Data Studio.
But it also connects to Google traditional data sources
like Google Analytics, YouTube, and of course, BigQuery.
Now, once you can connect to data sources
with a couple of clicks,
you can start creating visualizations,
adding interactive controls for them,
and once your visualization, your dashboard is ready,
you can share it with anyone you would like to.
Again, following the Google Drive sharing model.
One thing with Google Data Studio
that I was a little surprised about was,
if you connect to, say, BigQuery,
you can't bring in multiple tables and join them.
It's either a single table or a single view
or a SQL statement.
Was that a deliberate design decision
or would that be kind of broadened in the end?
I mean, it seemed a bit of a restriction
that I was surprised about.
Any thoughts on that?
Well, Data Studio is a new product.
So all the features you see there are the initial features.
All right.
Yeah.
Will it get more features?
Probably.
Yeah.
I don't want to spoil the surprises.
Yeah.
But yeah, to start with,
if you want to run joins,
you can run any custom query that you want.
Exactly, yeah.
Things are going to get interesting.
Yeah, good.
I mean, it's a very good tool.
I mean, we kind of use it here
internally for reporting and so on.
And a lot of our customers use it
especially as a common standard,
given the fact it connects to
Google Analytics and so on.
And it's freely available.
It's almost like the Microsoft Access
in our world of kind of,
I suppose, cloud BI.
And there's actually one other tool
that you guys brought out recently,
which I think actually is in beta
at the moment, which is Cloud Data Prep as well.
I don't know if you know about that at all, but maybe if you do, just maybe explain what
that is as well, and we'll finish on that.
Yeah, of course I know about Data Prep.
Okay, I just sprung that on you, yeah.
Yeah, as with Data Studio, the question is always how do we reduce the friction?
Data Studio is amazingly low friction.
With Data Prep, the question is how do you start loading data?
You might have these huge dumps of data
that you have never looked into,
but you're still collecting. Or someone shares some CSV or JSON files with you,
and now you want to load them,
you want to clean up this data.
It's the first step.
The first step for analyzing data
is understanding what data you got.
And with Data Prep,
you have this interactive environment
that will go over a sample of this data, will show you the shape of it, will allow you to
interactively click around, decide how you want to clean it up, what data do you want to keep,
what do you want to split in different columns, or what columns you want to drop.
And while you are designing this recipe for your data, Data Prep will store the recipe and transform it into a Dataflow job, one that you don't need to really care about; it's just the recipe that you interactively build.
And then once you have your recipe ready,
Dataflow is ready to run these jobs,
not only in a sample of your data,
but on as much data as you have,
and it will scale up and down
depending on how complex your data solution is.
Yeah, I mean, I suppose it's a data-wrangling tool, isn't it?
And for me, the thing that was always disproportionately hard
with working on BigQuery was actually just moving data
around.
It was incredibly easy to load data in, to process it,
to do whatever.
But actually creating
a sequence of steps, for example, that would take some data and maybe transform it and
maybe aggregate it and so on, there was no solution for that really apart from scripting
it. And I mean, I don't know whether this is the solution for it, but you know, that was
always a hard bit really. And the fact that Cloud Data Prep connected to BigQuery, for example,
was a massive bonus.
I mean, do you tend to use this tool now sort of day to day or what, really?
Personally, I'm not using it that heavily,
but I know of a lot of people that do so.
And you're touching on another super interesting point,
which is how do we just build our
pipelines. I know, and that's the bit where suddenly you're on your own,
typically. And you know, there are solutions like Airflow and stuff like that, but what's your
solution for this, and how do you build these kind of data pipelines?
I love relying on my data heroes that bring me the data in.
But Airflow is, I would agree,
is one of the most adopted solutions right now by different customers.
We do that, yeah.
Yeah.
That's the one I would recommend people using.
Maybe, I know you're not here to publicize Airflow,
but what is Airflow?
And how, I suppose, in a way, what we're really interested in
is how does it integrate with kind of BigQuery,
or how would you use it with BigQuery?
And what problem does it solve for you, really, or for customers?
Yeah, well, I'm not an Airflow expert,
but I do know that it's one of the best tools
to define how you want to move data around
and run your data pipelines.
I've seen some great work, for example, from WePay
that have described their whole solution
of how they move their live data that lives in MySQL,
how they use Airflow to synchronize
a whole workflow pipeline of moving this data
as quickly as possible to BigQuery for analysis,
for example.
Yeah, I mean, we use it to do aggregation, yeah.
For aggregation, we use it as well.
Yeah, so if you want to do things like that,
like how to keep two databases synchronized
either daily or every 10 minutes,
Airflow is a great place to define those steps.
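The kind of scheduled BigQuery step discussed here can be sketched as an Airflow DAG. This is a hedged illustration, not the WePay setup: the operator and import path come from the Airflow Google provider package and vary across Airflow versions, and the project, dataset, and table names are placeholders. `build_dag` is not called here because it assumes Airflow is installed.

```python
# Sketch: a daily Airflow DAG that runs a BigQuery job to refresh an
# aggregate table, the "define the steps and run them on a schedule" pattern.

AGGREGATE_SQL = """
CREATE OR REPLACE TABLE mydataset.daily_counts AS
SELECT DATE(ts) AS day, COUNT(*) AS n
FROM mydataset.events
GROUP BY day
"""

def build_dag():
    # Requires `pip install apache-airflow apache-airflow-providers-google`.
    import pendulum
    from airflow import DAG
    from airflow.providers.google.cloud.operators.bigquery import (
        BigQueryInsertJobOperator,
    )

    with DAG("bq_daily_aggregate", schedule="@daily",
             start_date=pendulum.datetime(2017, 10, 1, tz="UTC")) as dag:
        BigQueryInsertJobOperator(
            task_id="refresh_daily_counts",
            configuration={"query": {"query": AGGREGATE_SQL,
                                     "useLegacySql": False}},
        )
    return dag
```

Keeping the transformation itself in SQL and letting Airflow handle only the scheduling and dependencies is a common way to get the "synchronized daily or every 10 minutes" behavior mentioned above.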
Just to kind of round up,
so just to recap again,
how do people find you on the internet?
How do people find the content you've been producing
and where do they go to just to kind of get
started with as a developer in this platform?
Well, I'm on Twitter
as Felipe Hoffa.
I put a lot of
content on Medium. I love
collecting
all of these.
For example, this podcast, as soon as
you publish it, it's going straight to
reddit.com.
And for people that have questions,
people that are, if you're running into technical problems,
if you have a programming question,
Stack Overflow has this awesome community.
I'm usually looking there at every question that is posted.
Some I can answer, some I see how the community
just comes out and answers.
And we have the engineers from the product
also working there.
Stack Overflow is just this awesome resource.
It's usually you that's answering the question
at the end there with the correct answer,
which is good.
So it's been fantastic to speak to you, Filipe. So, I mean, it's been fantastic to speak to you, Philippe.
I mean, yeah, it's been good to talk to you, someone I've kind of seen on those places before
and has been so helpful with the answers and so on.
And, yeah, it's been great to speak to you.
And thank you very much for coming on the show.
Oh, it's been great finally meeting you.
And I've been a fan of you on Twitter this whole time.
I love seeing your evolution
through the cloud world.
So yeah, thank you very much for having
me, and let's stay connected.
Okay, thank you.
Thank you.