Grey Beards on Systems - 130: GreyBeards talk high-speed database access using Apache Arrow Flight, with James Duong and David Li

Episode Date: March 23, 2022

We had heard about Apache Arrow and Arrow Flight as being a high-performing database format with access speeds to match for a while now, and finally got a chance to hear what it was all about with James Duong, Co-Founder of Bit Quill Technologies/Senior Staff Developer at Dremio, and David Li (@lidavidm), Apache Arrow PMC member and software engineer at Voltron Data.

Transcript
Starting point is 00:00:00 Hey everybody, Ray Lucchesi here with Matt Leib. Welcome to the next episode of the Greybeards on Storage podcast, a show where we get Greybeards storage bloggers to talk with system vendors and other experts to discuss upcoming products, technologies, and trends affecting the data center today. And now it is my pleasure to introduce James Duong, co-founder of Bit Quill Technologies and senior staff developer at Bit Quill and Dremio, and David Li, Apache Arrow PMC member and software engineer at Voltron Data. So why don't you two tell us a little bit about yourselves and what Apache Arrow Flight is all about. Thanks, Ray. It's great to be here. So as mentioned, I am a PMC member for Apache Arrow. That means I'm one of the maintainers.
Starting point is 00:00:58 I can vote on decisions. I'm also a software engineer at Voltron Data, which also contributes a lot towards the Arrow project. So before we introduce Arrow Flight, I think we should introduce Apache Arrow first. So just to be brief, Apache Arrow is an in-memory format, a cross-language standardized in-memory format for columnar data. It also comes with other things like an IPC format for serialization and Arrow Flight, which is an RPC framework built on top of the in-memory format and the IPC format. So Arrow Flight is an RPC framework. It's specialized for transferring columnar data
Starting point is 00:01:42 in Arrow format across the network. It's built on top of gRPC and protobuf. It is just a framework. It's included with many of the Arrow libraries. But of course, you need to take it and do interesting things with it. And one of the things recently is a project called Flight SQL, which I think James can introduce. Thanks, David.
Starting point is 00:02:04 So yeah, as I've previously mentioned, I'm James Duong. I'm a senior staff developer at Bit Quill Technologies and Dremio Corporation. So a while back at Dremio, we decided to introduce this layer on top of the Arrow Flight project, called Flight SQL, which is a standardized way of accessing SQL databases using the Arrow Flight protocol and server framework. So Flight SQL has a single client: a single Flight SQL client can connect to any Flight SQL server. And so, maybe just, I'm not a database expert, and some of the storage guys are not necessarily either, but they work with database experts all the time. What's the distinction between row-based
Starting point is 00:02:56 data and columnar-based data? Sorry, I can't even pronounce the thing. One of you, right? David, maybe? Yeah, so, well, I guess, yeah, so databases, there are both row-based databases and columnar databases. It's really just how you shape the data. If you look at a table of data, do you, when you flatten out, do you flatten out along the rows or along the columns? And each of those has trade-offs and advantages. Arrow focuses on columnar data because we think that has advantages for data science. For instance, columnar data compresses better because the values within a column are of
Starting point is 00:03:42 the same type. You can use different compression techniques, or even if you're just using a general-purpose compression technique, that'll probably work better. If you're doing processing on the data, columnar has advantages because, again, all of the data in a column is adjacent in memory now, and you can apply things like
Starting point is 00:04:04 SIMD or vectorization to get a speed boost. Of course, row-based has its advantages too. If you're seeking through data row by row, columnar is not necessarily going to be a good fit. But for a lot of applications, we think columnar has advantages. And the other thing you mentioned about Arrow was that it was an in-memory solution.
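To make the row-versus-columnar flattening trade-off above concrete, here's a small sketch in plain Python; the table, column names, and values are invented for illustration:

```python
# Hypothetical three-row table, as a list of records.
table = [
    {"id": 1, "city": "NYC", "temp": 70},
    {"id": 2, "city": "NYC", "temp": 71},
    {"id": 3, "city": "NYC", "temp": 70},
]

# Row-major flattening: each record's fields sit next to each other,
# which is convenient for "fetch me all of row 2".
row_major = [value for row in table for value in row.values()]

# Column-major flattening: each column's values sit next to each other.
# Every value in a run has the same type, which is what makes scans,
# compression, and SIMD-style processing work well on columnar data.
column_major = {name: [row[name] for row in table] for name in table[0]}

print(row_major)
print(column_major["city"])  # homogeneous and repetitive: compresses well
```

Arrow standardizes exactly this column-major layout, so every library that speaks Arrow agrees on where each value lives in memory.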
Starting point is 00:04:27 You want to talk a little bit about that? Yeah, so Wes McKinney, one of the co-founders of the project, when he introduces Arrow, he kind of likens it to breaking up the layers of a traditional analytic database so that you can use all of its components separately.
Starting point is 00:04:48 So Arrow provides an in-memory format, which is basically if you have, say, a column of integers, how is that supposed to be laid out in memory? But also, once you have a column of integers and you have a few columns and you want to write that out to disk or transfer it over to the network, how should you serialize that data? How should you encode that data on the network? And actually, for Arrow, we also try to focus on avoiding encoding and copying as much as possible
Starting point is 00:05:18 for efficiency. And speed and performance and that sort of stuff. That's interesting. We talk a little bit from the storage side about in-memory advantages, and particularly lately about the inherent advantages of expanding that memory, leveraging Optane, for example, are there benefits to Arrow by increasing the capacity of what's available in memory? Yeah, so I'm not super familiar with Optane specifically, but one of the advantages, one of the properties that the Arrow in-memory and file formats try to maintain is being able to just
Starting point is 00:06:06 memory-map a file and start working with the data right away. So you can work with larger-than-memory data sets by just memory-mapping an Arrow file, without having to load it all into memory and without having to decode it first. And so the challenge, of course, is that there's only so much memory in the world. Even with Optane, maybe, I don't know, 64 terabytes per server might be a reasonable maximum or something like that. So if your Arrow database exists and it's, I don't know, let's say a couple hundred terabytes, you page that in and out of memory? Is that how it would work? I mean, just for in-memory processing; we're not even starting to talk about Flight yet, which is the other side of this coin. Sure. Right. Yeah, so at that point,
Starting point is 00:06:59 how do I want to say this? So columns don't have to be entirely contiguous, in a sense. You can break up your columns into little chunks called record batches. Inside a record batch, everything's contiguous. But at a higher level, you can then stream or iterate through record batches and process those incrementally. So yeah, in a sense, you're paging data in and out. But this is accounted for in the file format, in the in-memory format, and so on. And when you say a record batch, that would be all the columns for, let's say, the database
Starting point is 00:07:40 across 1,000 records, or however many would fit into the record batch buffer, something like that? Or would it just be, like, column one, down to how much of column one actually fits into that record batch? It's the former. It's a 2D, rectangular chunk of data: all the columns, all with the same number of rows. Okay. So bringing hundreds of terabytes of data back and forth into memory and writing it out must be quite I/O intensive. Where do Arrow Flight and Arrow Flight SQL really improve that sort of overhead? So it used to be, you know, you'd write out something from a database.
Starting point is 00:08:31 It moved from a database buffer to a memory cache buffer, and the memory cache buffer to, let's say, a NIC, and then the NIC out to the storage, which has its own set of buffers and all that stuff. Right? I mean, those are the old days. Or maybe it's today, I don't know. So what does Arrow Flight bring to the table? Right. So Arrow Flight basically tries to make all that more convenient and faster, especially if you're working with Arrow data, or really only if you're working with Arrow data. So if you have Arrow data in memory and you want to transfer it over the network by using Arrow Flight, you don't have to go implement all that
Starting point is 00:09:10 yourself. You get these high-level methods that let you just say, I want to read the next record batch, or I want to write the next record batch. And Arrow will take your data and punch it through the layers of all the networking stuff it uses, gRPC and protobuf, to avoid copying data as much as possible, get that onto the NIC, and get that across the network as fast as possible. Sorry. Go ahead.
Starting point is 00:09:39 And then Flight SQL is taking those benefits, and taking the benefits of Arrow, and trying to bring them towards SQL databases. I always thought SQL databases, James, I guess, SQL databases were always row-based, and you'd sit there and you'd do, like, an if, you know, column X is Matt Leib's social security number, then bump his pay raise by 10, or something like that. It was, you know, row-oriented: bring it in, do something with it, and put it out, kind of stuff. So how does Flight SQL work with a common database? Well, Flight SQL provides the protocol for high-performance transport by making the data sent in a columnar format.
Starting point is 00:10:29 Traditional APIs like ODBC and JDBC are row-oriented. When you have a JDBC driver, when you're accessing data from a SELECT query, you get a ResultSet interface, and what you do is check if there's a row using ResultSet.next(), and then you get values on that single row using, say, getObject or getString on each column. Right? So, one at a time. If you're using Flight SQL now, using Flight SQL's interface, you could just get a single record batch and pull out a vector representing a column for that batch. And you could go through the stream getting record batches until you've gotten all the data. Now, if your application layer is working with Arrow data, that's when you really get the benefits out of Flight.
Starting point is 00:11:25 You're already working with vectors that do not have to be deserialized. You mentioned serialization and deserialization before. Can you explain to me what that sort of process is or what that means in this sense? Yeah, so say you have a JDBC driver. Well, JDBC has its own formats for integers, strings, and timestamps, for example.
Starting point is 00:11:57 So when you build a JDBC driver, you have to convert from the database's wire-format representation of those to JDBC's format. Potentially the database also needs to convert from its own internal representation to the wire format as well. So you've got a transfer from the database to the wire format, and from the wire format to JDBC format. Whereas if you're using Arrow Flight, and say your database uses Arrow internally, it's just copying data to the wire and then having the client not even deserialize the data, but just be able to operate with it. So there's very little format conversion required, unlike
Starting point is 00:12:48 ODBC and JDBC, which would require multiple format conversions during the data transfer. Is that what you're saying? That's correct. Yeah. Okay. Okay. There was some mention of parallelization as part of Arrow Flight. Could you explain how that plays out in this game? I can talk about this. So modern computing engines often support multi-node systems. Most systems are distributed nowadays.
Starting point is 00:13:24 Yeah, yeah. And you're not talking multi-core, you're talking multi-server nodes, right? Multi-server, yeah. I'll use Dremio as an example. We have a coordinator node, and then we have several executor nodes for processing a query: the coordinator does planning, and then we delegate the work to the executors, and they individually process the query, execute the query plan. What Arrow Flight provides is a way to, as a response to a request, report each endpoint where the data is being served
Starting point is 00:13:56 so that your client layer can then start consuming data at multiple different endpoints at once. Potentially, if your client itself is also distributed, you could have your client working with data on multiple nodes on its side as well. So each core is its own compute engine, effectively? If I've got parallel access to the data, can I have all the cores effectively working on their own columns of a record batch separately? Or would it be, within a server, I guess, a record batch that this server gets, and a record batch that some other server gets? So the unit of granularity
Starting point is 00:14:45 for parallelization is record batches? So I would say Arrow and Arrow Flight give you all the tools to parallelize and split things up whichever way is the best for your application. For instance, so you can have multiple clients making requests to the same server. You can have one client making multiple requests
Starting point is 00:15:11 to multiple servers. You can split data up. So there's a little detail here: the Arrow Java and Arrow C++ libraries conceptualize things slightly differently, but effectively a record batch, or vector schema root, is like a unit of data. Yes, you can have each thread working independently on its own chunk of data, and process that, and send it back over
Starting point is 00:15:45 Arrow Flight independently. Is there any specialized hardware involved in creating these Arrow nodes? Arrow, as an in-memory format, is intended to be hardware-agnostic. It's intended to be designed in a way that it's efficient to
Starting point is 00:16:04 implement, but it's not tied to particular hardware. For instance, Arrow's CI infrastructure tests Arrow on x86 machines, it tests Arrow on Macs, it tests Arrow on an S390X from IBM, and some PowerPC machines. Mainframe? Did you say mainframe? I said S390X, yeah. Ah, okay. And I guess the other side of this is it's all open source, right? I mean, it's an Apache project, right?
Starting point is 00:16:39 Yeah. Apache Arrow is under the Apache umbrella. It's open source. We have contributors from many companies; we have contributors from all over the world. We have Arrow projects in all sorts of different languages. The Julia project recently joined the main Arrow umbrella as well. So we have lots of things that are supported. Yeah. And so, I mean, the reason I really wanted you guys to come on the show was because there's not a lot of high-performance access mechanisms or access protocols that exist out there,
Starting point is 00:17:25 especially in the open source community. I mean, most of the high-performance access protocols are either proprietary or, yeah, they're POSIX-based, effectively. So you would have a POSIX client for vendor X, and they'd have their own servers to support their own parallelization. Now, NFS is coming out with some parallelization in 4.2, I believe, but this is something different. I'm trying to think what the question is here. So do you have any performance statistics on what Arrow Flight could potentially deliver?
Starting point is 00:18:11 As far as gigabytes per second, or record batches per second, I guess, would be the other metric. One of the things that occurs to me is that the software side, which is where you're working, is highly hardware-dependent. It could be, who knows, line-level speeds, but it's going to require fast networking. And all of these hardware capabilities are variables that are going to be hard to compare apples to apples. Yeah, I agree. That is a good point.
Starting point is 00:18:49 So I think I briefly mentioned Arrow Flight uses gRPC underneath. gRPC is an RPC framework from Google, and it's been pretty well optimized for TCP communication. But recently, we're also looking at integrating the UCX networking library into Arrow Flight as well, because Arrow Flight abstracts away the underlying networking library it's using. So UCX is a library that's designed to take advantage of specialized hardware like InfiniBand interfaces. Oh, okay. InfiniBand interfaces. The tests were conducted on a cluster that I can't
Starting point is 00:19:28 disclose exact numbers from, but UCX does quite well when it has access to specialized hardware. And this would be an InfiniBand solution. So let's talk about the hardware configuration here.
Starting point is 00:19:44 Arrow Flight obviously requires client software sitting on the client, and there's server software as well, sitting on some server someplace. And then behind that server would be SSDs directly attached, or disks directly attached, or do you support other storage systems behind that? Yeah, so think of Arrow and Arrow Flight as more of a toolkit and a set of standards and protocols. So, again, we're not trying to make particular requirements on the kind of hardware setup you have or anything like that. But basically, Arrow Flight at the network level is a set of APIs based on gRPC. And then we also ship client and server libraries
Starting point is 00:20:35 that any application can use, in a variety of different programming languages, to build higher-level things like Flight SQL on top of these libraries. I see James has some performance figures, if you want to mention those. Yeah, please do, James. Yeah, so when we did some testing of this at Dremio, we saw throughput rates of 20 gigabytes per second without using Flight's parallelization features. Without Flight's parallelization.
Starting point is 00:21:03 So you potentially could see 20 gigabytes per second per parallel transfer? That's right. That's pretty speedy. Yeah, if you had the hardware that could support it. And that's over Ethernet TCP, right? I mean, you don't require any special switching or anything like that, right? Right. And yet you did mention InfiniBand. So is there some reliance on InfiniBand as a protocol? No.
Starting point is 00:21:36 So once we have UCX fully integrated, we'll be able to take advantage of InfiniBand hardware if you have it. But if you don't, you can continue using gRPC, and everything will just work over TCP. And you mentioned IPC as well, as another protocol that you use. Maybe just for my own edification, can you tell me the distinction between IPC and gRPC? Yeah, so we have the Arrow in-memory format.
Starting point is 00:22:08 That's just how the data gets laid out in RAM. If we want to serialize it and then send it to another process, or write it to disk or something, that's when you use the IPC format. The IPC format basically specifies how you pack the buffers on the wire, the message headers, stuff like that. That all gets sent over gRPC. gRPC is an RPC framework from Google and the Cloud Native Computing Foundation. It handles all the networking details. So those are like the three layers here.
Starting point is 00:22:43 Oh, okay. I got you. Not alternate layers; they all combine to support the transfers and that sort of stuff. So where do you see in-memory databases being used these days? Columnar in-memory databases, I mean. What sort of clients or customers would be using them, and what would they be doing with them? So I'd say in-memory databases are really good at doing batch
Starting point is 00:23:16 analytics. Dealing with large fact tables and being able to produce meaningful data using BI tools. And you don't see that being applicable to things like machine learning or anything of that nature? I can see this being used in machine learning. One of the big use cases for Arrow is to be able to efficiently process data using Spark, be able to load data into Spark, Arrow data into Spark without serialization.
Starting point is 00:24:02 Right, right, right. Normally when you work with Spark jobs, you'd write a Python script and then you send the work to a JVM that processes it. But if you're using Arrow, there's no serialization required to go from the Python data to the JVM data, because it's just Arrow data.
Starting point is 00:24:20 Right. Yeah, that's a good example. So I guess think of Arrow as kind of like a bridge between all these different systems. So Spark uses Arrow to implement its Python user-defined functions. And other systems like BigQuery and Delta Lake also use Arrow to transfer data at different points, in the client libraries in these cases, I think. Kafka could be a potential solution here as well? Or I know Kafka has some Spark support. Off the top of my head, I can't think of anyone combining Kafka and Arrow per se, but there's nothing stopping you if you need to get columnar data from point A to point B.
Starting point is 00:25:12 Right, right, right. So what about high availability? That sort of thing is sometimes a required attribute, especially of databases, quite frankly, because they become so critical to BI and other critical corporate functionality. Does Arrow Flight offer high availability, or is that something you just kind of configure? So ultimately that's up to the application being built on top, but Arrow Flight does provide things to try to make it easier
Starting point is 00:26:03 to implement reliable applications. So again, because we're building on top of gRPC, that means we inherit a lot of the tooling. gRPC is its own rich ecosystem. So Arrow Flight building on top of that means we inherit all the tooling, all the best practices that have been built up around gRPC over the years. All of the observability, monitoring, and logging tooling, all of the knowledge of how to debug things, all of that still applies to Arrow Flight, because it is gRPC underneath.
Starting point is 00:26:37 Mm-hmm. Okay. So you get the advantage of gRPC and that sort of stuff. I'd like to mention that Arrow Flight's ability to report multiple endpoints can be used for data redundancy as well. So if, say, one of the endpoints with that source data has gone down, you can go to another one. Right, if you've got a copy of it at that other endpoint. I see. That's interesting. And forgive me if I'm making too much of an assumption, Ray, but it seems as if you and I are thinking in terms of how hardware might resolve a lot of these issues. Right. But ultimately, with a database language or a file format, we're really looking here at how those problems
Starting point is 00:27:27 are actually resolved by software. So, you know, split brain taking place, that's a transaction that doesn't necessarily complete. And is there cache coherency from site to site? And you're saying that's not really a function of Arrow; that's really a function of the overriding architecture that actually handles the transactions. Would that be a correct statement? Yeah, that's correct.
Starting point is 00:28:00 I got you. I got you. But you mentioned that you could automatically replicate or mirror Arrow Flight data onto different storage servers, if I'm using the correct terminology, just by configuring it that way, I guess? Kind of. So, well, sorry. So I guess there's always more layers to peel back here, right? So Flight just defines a protocol and some RPC methods that you can use to build things like that.
Starting point is 00:28:40 And it kind of tries to be suggestive and corral you into doing stuff like that. So, for instance, when you're requesting data from a Flight service, the recommended pattern is that you first make a metadata call, called GetFlightInfo. And that tells you where this data set can be fetched from and how it's partitioned, and, as James mentioned, alternative endpoints that you can fetch data from if the primary endpoint is down. And as long as your application implements that,
Starting point is 00:29:15 as long as your client implements that, then yeah, you can build in redundancy. You can build in parallelism. It's still up to your application to actually implement those details, but Flight tries to encourage you to do that and make it easy for you to do that. Right, right, right.
Starting point is 00:29:33 The other challenge that open source has had historically is operations, or configuration, and that sort of stuff. I would say open source is typically developed by technical development teams, and there aren't necessarily usability teams associated with that. How hard is it to configure
Starting point is 00:29:55 and make use of something like Arrow, Arrow Flight, and Arrow Flight SQL? That's something that the community is actively working on, I would say. So we've been trying to improve the documentation, especially in languages like Java. We recently started an Arrow cookbook initiative to try to provide simple, reusable recipes for accomplishing common tasks with the Arrow libraries. Now, this is maybe a common cop-out of open source projects, but if there's something that's not clear, if there's something that you want improved,
Starting point is 00:30:40 please let us know. At least for me, because I've been in the project for a few years now, it can be hard to see where things are confusing or unclear. So having these questions really helps me as a contributor know where to focus my efforts, know where we need to explain more, basically. Right. That's a very valid point. You know, the forest-for-the-trees conversation. But the difficulty with open source in general has always been a lack of support.
Starting point is 00:31:23 Which is the other side of this. Yeah, I agree. Yeah. So I imagine that community support, though, is quite robust in a project of this magnitude. Yeah, I'd say so. Well, I guess there are a couple of ways to approach it. So Dremio, as far as I can see, is actually fairly active in the project itself. And one of
Starting point is 00:31:50 Dremio's co-founders was also a co-founder of the Arrow project. But also, yeah, I and many of the other contributors do our best to monitor Stack Overflow, our mailing lists, GitHub issues, all that, to try to provide support as best we can. And of course, that's not guaranteed, but I think we try our best to address everyone's questions. Nobody's denying that. I think that the historic need has been finding a community of practitioners who actually understand the product and actually
Starting point is 00:32:31 understand the avenue that the end user is attempting to go through to resolve these questions, et cetera. In my mind, when you've got a product of this significance, you've more than likely got people that have faced similar issues in the past and can set you into a decent direction in terms of even if it's ad hoc support. So, you know, I think that we're not seeing the same issues we used to see. Mm hmm. Mm hmm. Right. Yes.
Starting point is 00:33:15 I think the Arrow community has grown a lot, and is still growing. So yeah, there is a fairly active community around it now, across all of these different languages. And of course, there's also commercial support; that's always an option. Oh, and there is commercial support for Arrow and Arrow Flight?
Starting point is 00:33:36 Yeah, it's through my employer, Voltron Data. Okay, I won't speak too much to that, but it's also an option. Right. Well, that's good. We were kind of probing to see if that was available as an option, and that's good. I mean, obviously, if you are a modern institution, you're relying on the data and its accessibility in the long term. You want to know if you can gain greater levels of support.
Starting point is 00:34:14 And obviously you can. Can you guys speak to some of, let's say, your bigger installations? You don't have to actually name the company, but you might talk about, you know, what they're doing from a vertical perspective with Arrow, Arrow Flight, and perhaps Arrow Flight SQL. Well, Dremio is the obvious candidate here. James? Right. So Dremio has recently made Dremio Cloud available. And with Dremio Cloud comes support for Arrow Flight through a centralized service now. So that's one of the big changes. We adopted Arrow Flight into the Dremio Enterprise Edition about a couple of years ago. So we added support for Arrow Flight on its own
Starting point is 00:35:07 and then started the initiative to do Flight SQL. We're currently building up Flight SQL support. Can you tell us just a little bit about Dremio as a company, what you guys are doing? Because, to tell you the truth, I've heard about them, but I don't know exactly what you're doing. Dremio is a query engine for accessing data lakes efficiently and executing SQL
Starting point is 00:35:35 using an Arrow-based execution engine. So we take advantage of the features that David's mentioned, like being able to do vectorized computations on data for the purpose of processing SQL, as well as exposing data to users using Arrow Flight. So Dremio can connect to a lot of different sources, including Azure Data Lake, Amazon S3, and Google Cloud Storage.
Starting point is 00:36:11 So data lakes, as well as more traditional sources like relational databases, such as SQL Server, Postgres, or Redshift, for example. Yeah, MySQL's included. Oracle as well. Yeah. That's the first mention we've had thus far, and it had occurred to me.
Starting point is 00:36:38 But if you've got raw data sitting in an Oracle database, it seems logical that there be an interpreter of that data into Arrow. So what Dremio does is provide a connector, based on JDBC, to suck data in from a traditional database and get it into Arrow format so that Dremio can work with it. It tries to push as much work as possible down to the backend database, though. Right, right, right. You mentioned vectorization, and I would assume that because it's having this data sitting in columnar format in memory, vector operations would be very useful here. So, I mean,
Starting point is 00:37:20 are you using things like GPUs to do those sorts of things, or are you using, I'll call it, the vector instructions of, like, x86, et cetera? So we use a component of Arrow that was developed at Dremio called Gandiva. I'm sorry, Gandiva? Yeah, so Dremio is a Java-based server, and we use Gandiva to be able to access some of Arrow's
Starting point is 00:37:44 lower-level features, including its SIMD operations. Single instruction, multiple data operations, right. I just want to translate. So this is vectorization. But, I mean, vectorization could occur at the CPU level. It could occur in a GPU. It could occur in an FPGA. Am I assuming that you're using primarily the SIMD instruction sets of the CPUs that you're
Starting point is 00:38:13 operating on? I'm actually not sure about the answer to that, David. Do you know? This is kind of abstracted from me. Yeah, no worries. So Gandiva is based on the LLVM compiler framework. As far as I know, it targets CPUs mostly. The interesting thing there is Gandiva is written in C++, even though much of Dremio uses Java. But because Arrow is a standardized memory layout, those two languages can share data between them without having to copy it all.
Starting point is 00:38:44 They can just pass pointers around. So that's a big advantage of Arrow here: a JVM-based system can take full advantage of native C++ capabilities. But you mentioned GPUs and FPGAs, and I want to say, so the NVIDIA RAPIDS ecosystem has a library called cuDF, which implements data frame operations using the Arrow memory layout on GPUs. So we do see Arrow usage with GPUs as well. And the name escapes me at the moment, but there is also a project that works with FPGAs and Arrow.
Starting point is 00:39:26 Basically, you can give it an Arrow schema, basically the data types, and it'll generate, I think, VHDL or Verilog to work with that data on an FPGA.
Starting point is 00:39:41 Wait a minute. Wait a minute. So you can give an arrow schema to this process and it will generate the hardware design language to program an FPGA to process it? Is that what you're saying?
Starting point is 00:39:57 Yes. You still have to bring your own... You still have to write the actual processing part, but it will generate... The interfaces or something like that? It'll generate all the interfaces for you, yeah. So it reduces the amount of work you have to do to program the FPGA.
Starting point is 00:40:14 So yeah, again, that's called Fletcher, if you want to look at it. Fletcher. Okay. Yes. There's lots and lots and lots of arrow-based puns. Yeah. Yeah. Okay. All right.
Starting point is 00:40:27 Well, hey, this has been great. David and James, any last comments for Matt and myself? If you want to get involved, please reach out on the mailing list, dev@arrow.apache.org, or you can file GitHub issues or pull requests at apache/arrow. Okay, great. Matt, anything you'd like to ask before we leave? No, no questions. But I just want to thank you guys. This is a very interesting project you're working on, and I learned a lot. Yeah, yeah. Well, this has been great. David and James,
Starting point is 00:41:19 thank you for being on our show today. Thanks for having us. Thank you. That's it for now. Bye, David. Bye, James. Bye. Bye, Matt. Until next time. Next time, we will talk to another system storage technology person. Any questions you want us to ask, please let us know. And if you enjoy our podcast, tell your friends about it, and please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out.
