StorageReview.com Podcast #145: Why Edge AI Matters with Oregon State University
Episode Date: December 29, 2025

If you follow StorageReview regularly, you might have seen the piece on the OSU ocean study in real time. We spoke with Chris Sullivan, Director of Research and Academic Computing at Oregon State University, to better understand how technology helps researchers study ocean life and how that impacts the global environment.
Transcript
Brian Beeler with StorageReview, and here we've got a Supercompute live interview with Chris Sullivan from Oregon State University.
Chris, thanks for sitting down.
Yeah, thanks, Brian.
This is kind of your Disneyland, isn't it, or Disney World?
It really is. This is Candyland for me.
This is where you get to find all the new technologies that really change the way we do the work that we do.
And you do science, which is cool, but you use technology in new and creative ways to do the science, which
is a little bit different than the marketing that we're all subjected to on our side of the fence.
You're the doer side of the fence, and it means something to you.
I really care about technology that changes the way I do science.
And if you change the way I'm doing science, I'm really interested in your technology
because all technology actually produces a bias and limits the scope of the work that we're trying to do.
And I need that to be removed.
And every day I look out at new technology that's being changed.
And we can barely see six months down the road at what's going to be coming
to us, but we know it's going to change the way we're doing the work we're doing.
When you say bias, what does that mean?
Because we hear that a lot in AI, like there's an inherent bias or past history or something,
but what do you mean in your sense?
So when we talk about scientific data, we usually do lots of statistical analysis to take the information
that we've got in limited numbers and try and apply it in a more macro world.
And so as we try to scale that up, sometimes it's right and sometimes it's wrong.
But the more data points I can actually put into the statistics as I'm making those kind of
calculations, the less bias I find in the answers that I'm getting back.
And so what we really want to do is collect massive quantities of data around each one of
the questions that we're asking so that way we don't introduce a bias based upon the sampling.
If I sample one area, I may get a bias to the species that are in that area because of other
environmental factors.
But if I sample lots of areas and become more diverse in that context, I overcome that bias.
and I can actually see a broader context of what's happening.
And so that's part of what we've been talking about.
We just published a paper last week on some of the work you're doing around plankton research,
how you're using Dell's new GPU servers, the RTX Pro 6000s, dense Solidigm storage,
high-speed interconnect, all these things to get that done.
And I guess those components all come together, especially on the capacity side with the storage,
to help you get more data and do more work.
I mean, in the end, we've been fighting this concept of processing power versus storage.
I could never really have both.
We had small NVMe drives and SSDs that were really addressing the localized speed.
But when we talk about massive datasets, I couldn't even use some of that that we would put on those HPC machines.
So when we started to try and do processing, I really had to lean into that Tier 1 storage versus what we'd consider Tier 0, which is on the node itself.
And so that Tier 1 is the only way out,
and that's what you see people buying from DDN and all these other groups:
a Tier 1 storage associated to their SuperPODs
to get that performance that they need to match the GPUs.
And that's because there weren't large-capacity SSDs,
and there wasn't performance coming from that side,
to cover the amount of data that we're talking about.
Well, but there's a trade-off, right?
Because those clusters are not inexpensive.
And when you go to sea, like you intend to do with Taani
and some of these other research vessels that you're working on
to collect this data and process it on site,
like the math changes.
You've got all sorts of new constraints of a data center on a ship.
So that's the point: I can't take that Tier 1 out to the ship.
It just doesn't really work, both in terms of power footprint,
and I've got a small data center on the ship.
And so really what we're trying to do is compact a massive amount of stuff
into a small footprint.
And I need to be able to capture it for just one experiment,
like my plankton experiment on that ship.
10 days at sea will bring you about 100 terabytes of data back.
And after I process that, segment it and classify it,
we're probably looking at about 300 to 400 terabytes for one experiment,
and I need to get through that in about a couple weeks to a month.
That's just one thing on the boat.
I mean, you're running, what, 15, 20 experiments at a time?
It's 15 experiments at one time, and all of them are capturing that same amount of data.
Plus, I have all the sensors that the ship is actually running also,
because I have to listen to the motors, I've got to have all the ship operations data.
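Taking the figures quoted above at face value (100 terabytes raw per 10-day cruise, 300 to 400 terabytes after processing, 15 concurrent experiments), the whole-ship math works out roughly like this. This is a back-of-envelope sketch of the interview's numbers, not OSU's actual capacity plan:

```python
# Back-of-envelope sizing from the figures quoted in this interview.
RAW_TB_PER_EXPERIMENT = 100          # ~100 TB raw per 10-day cruise
PROCESSED_TB_RANGE = (300, 400)      # after segmentation and classification
EXPERIMENTS = 15                     # concurrent experiments on the ship

raw_total = RAW_TB_PER_EXPERIMENT * EXPERIMENTS
low, high = (tb * EXPERIMENTS for tb in PROCESSED_TB_RANGE)
print(f"Raw capture, whole ship: ~{raw_total} TB per cruise")
print(f"After processing: ~{low}-{high} TB ({low/1000:.1f}-{high/1000:.1f} PB)")
```

Several petabytes per cruise, in a shipboard footprint, is the constraint everything else in this conversation is working around.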
And so when we look at this brand new ship, it's really like making the Star Trek Enterprise.
We're building this new ship, and it's really designated for research.
And the Enterprise was like a research vessel in the future.
And so this is like the very first research vessel that we're moving in that direction.
And you can see that we're stacking it with technology.
It has 200 feet of cabling on a 200-foot boat.
Right.
200 miles of cable.
There we go.
200 miles of cabling on a 200-foot boat.
And I apologize for any background noise.
We're actually on the floor as they're setting up.
There will be horns, there will be rattles, and there will be other noises.
All right, so all those miles of cabling.
So you're talking about research vessels.
All I know about big boats is what I learn on Below Deck,
and it's mostly debauchery. Is that what happens on a research vessel as well,
or do they actually do work there?
So what we have to do is we actually have to take a lot of equipment onto the ship.
So I used to have to bring actual compute onto the ship
and run fiber cables down the hall.
Because we didn't have a really well-defined data center.
The new ship we're building, with NSF's help, by the way, is really about changing that and allowing us to collaborate and interact better.
So I bring equipment onto the ship.
The plankton, for example, I have to drop a device into the ocean that goes up and down in the ocean and will be towed at five knots, and it collects 162 liters of water per second.
Yeah, I think it's per second.
And then it generates 10 gigabytes of data every minute.
Right.
Okay.
It's an alarming amount of data.
It's all image data, right?
It's all these photographs.
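For a sense of scale, that rate compounds fast. A quick sanity check on the quoted numbers, assuming only for the arithmetic that the imager ran nonstop:

```python
GB_PER_MIN = 10                                   # quoted capture rate
tb_per_day = GB_PER_MIN * 60 * 24 / 1000          # 14.4 TB/day if towed nonstop
tb_per_cruise = tb_per_day * 10                   # 144 TB over a 10-day cruise
print(f"{tb_per_day:.1f} TB/day, ~{tb_per_cruise:.0f} TB per 10-day cruise")
# The ~100 TB Chris quotes implies the imager is actually in the water for
# roughly 100/144, or about 70%, of the cruise.
```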
And so the scientists are all working together in a huge data center trying to manage those processes and actually understand what's happening.
Now, if I take that ship out there and I can't do semi-real-time or even real-time analysis, there's a potential I come back with no data.
And then you've wasted how many dollars?
It's a million dollars to run the ship.
So 10 days at the ocean is about a million dollars.
Wow.
And that's why everybody who's on this ship is really engaged
to make sure everything is working.
It is really a team effort.
And you have to be trained how to go out on those ships and all those pieces.
And so semi-real-time really allows us to move the ship into the appropriate places to find plankton.
Plankton don't swim around.
They go by current.
How do they get those little feet or cilia or something on them?
Right.
And so they use that to actually understand the world around them, but they're not using them to swim.
Okay.
And so they actually...
Well, you know a lot about plankton for an IT guy.
Right.
Yeah, I mean, I'm a researcher.
And so as a researcher, you have to understand the science at hand.
And what I try and do is marry technology with research.
In the end, plankton produces 50% of the oxygen we breathe,
and it's 25% of the carbon sink on this planet.
It's 17% of the food you eat per capita, and it's the basis of the food web.
The Chum Bucket?
Is that, I mean, how many awful SpongeBob jokes do you get about the plankton?
So many. I wish SpongeBob hadn't portrayed plankton in a negative
way, because it's probably the most important species on the planet.
I would say he's one of the more loved characters in an ironic sort of way, right?
Yeah, absolutely.
So you've got all these experiments going on, and your vision is to get systems like
the power edge on the ship with GPUs, with dense storage.
I know you picked up some gadgets.
I'm sitting here playing with a 122TB drive.
Yeah.
This kind of stuff will change what you can do on the ship.
This is a paradigm change.
Until this moment, we never could put enough SSD or NVMe on a single box.
I would always have to take disk arrays out to try and create that performance, which again,
back to a Tier 1.
And I don't want to have a Tier 1.
I want a Tier 0.
And with this, we were able to put hundreds of terabytes onto the device, not only land the data
in real time on the device, process in semi-real time, but never have to move it to storage again.
So that was one of the pieces that gave us an increase in processing time:
we didn't have to move it off.
When we went out to test before with the smaller NVMe drives and SSDs, we had to move the
data off of the device so I could process the next set.
And in that process, I was losing my capabilities, and I was losing bandwidth and losing
I/O, okay, my read I/O.
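To put rough numbers on what that shuffling costs, here is a hedged sketch of the ideal time to evacuate one experiment's raw capture off the node over a fast link. The dataset size comes from the interview, but the link speeds are illustrative assumptions, not measurements of OSU's setup:

```python
# Ideal time to evacuate one experiment's raw capture (~100 TB) off the node,
# ignoring protocol overhead and contention. Link speeds are assumptions.
def hours_to_move(dataset_tb: float, link_gbit_per_s: float) -> float:
    return dataset_tb * 8_000 / link_gbit_per_s / 3600   # 1 TB = 8,000 Gbit

for gbit in (25, 100):
    print(f"{gbit} GbE: ~{hours_to_move(100, gbit):.1f} hours just moving data")
```

Every one of those hours comes out of the read bandwidth the GPUs could have been using, which is the I/O loss Chris describes.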
Well, you're also IT staff limited out there.
I mean, I know you've got guys out there.
There is no IT staff.
There's no one?
I thought you had-
There's one guy, and he has to be able to run the ship, and he disassociates himself from the
experiments so he doesn't affect the experiments negatively.
And so he's really there for ship operations.
When we bring equipment onto the ship that's specific for an experiment, a lot of times the
group is going to have to manage that themselves.
So I have to train my postdocs, and I have to train my graduate students how to go out
there and run these algorithms and deploy this hardware.
And so I have to write the pipelines in that context.
It can't be something that's too complicated, and I need a single device, not multiple
devices that I'm trying to mount volumes around.
This is a home run.
This is truly a paradigm shift.
I see this as a pathway to eliminating spinning disk.
I think I could rent you Kevin for like seven to ten days to go out on your ship and be
your IT guy.
He's going to have to pass all of the...
Kevin, you're going to go out on the ship?
He says he's in.
You're going to have to pass background checks.
You're going to have to...
All these things.
Well, that was a good idea until you just said that, so we're not going to do that.
So how do you, when you think about your data collection, it's almost all edge, whether
it's on the ship or whether you're out at ranger stations. You've got an owl project,
you've got 15 or 20 projects that you're looking at, and you're examining how technology
can impact those outcomes.
What's different about your edge challenges versus other data collection, analysis, AI-type
problems that are out there?
So I'm really trying to change to where I'm processing at the edge, because I don't want
to have to take all of that energy and time to bring it back to the data center.
Right now, people are bringing eight boxes of spinning disk.
How do I get the compute further towards where the data is being collected?
So I don't actually have to bring it back to the data center.
The data centers are better spent training new models that I can push to the edge.
And I'm getting really good information without having to send labor out there.
So we take the Owl Project, for example.
I've got 5,000 autonomous recording units to do sound, and we can identify 130 species right
now just from sound in the forest.
But I've got to send a person out there to collect all the cards.
It's so cumbersome that by the time you get the data back, I mean, you're probably
weeks in arrears.
It's not as meaningful.
Right.
The temporal aspect of this is what we're really after.
And we're losing that when we talk about all the data.
I don't care if you're talking about a concert venue or a sporting game.
What we could be doing with these types of drives is putting that stuff at the edge and, in real
time, not having to go back to the cloud, telling you what's happening.
If there's a problem and something's happening, we would know instantly.
That temporal coordinate is really what we're after.
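As a minimal sketch of that edge pattern: classify locally, transmit only detections. Everything named below is a placeholder; classify(), uplink(), and the species labels are hypothetical stand-ins, not the OSU pipeline:

```python
import json
import random
import time

SPECIES = ["spotted_owl", "barred_owl", "unknown"]   # hypothetical labels

def classify(audio_chunk: bytes) -> tuple[str, float]:
    """Stub standing in for an on-device acoustic model."""
    return random.choice(SPECIES), random.uniform(0.5, 1.0)

def uplink(event: dict) -> None:
    """Stub for a low-bandwidth link back to a ranger station."""
    print(json.dumps(event))

for _ in range(5):                       # a few seconds of simulated operation
    chunk = bytes(16_000)                # stand-in for one second of audio
    species, confidence = classify(chunk)
    if species != "unknown" and confidence > 0.8:
        # A detection event is a few dozen bytes; the raw audio never leaves.
        uplink({"t": time.time(), "species": species, "conf": round(confidence, 2)})
    time.sleep(1)
```

The point is the ratio: a detection is bytes, the audio behind it is megabytes, and that difference is what makes a constrained uplink workable in real time.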
Well, in the ship, though, you've got almost a real data center.
It is, it's real, it's just small.
It's small.
And you've got plenty of power, and you've got cooling figured out, so you're in good shape there.
In some of your other locations, ranger stations, for instance, you've got a power problem,
you've got a cooling problem, you've got a no-staff problem, you've got a lot of different
problems: ruggedness, air quality, all these things.
So how do you find the right hardware for those solutions?
So we have worked with so many vendors, okay, and it's important to realize that, you know,
the Solidigm that we were testing, we wanted to test across multiple vendors. I wasn't going to sit here
and test it on a single vendor, because each vendor provides us a different solution. And when we
looked at Dell, they are just a home run for us when it comes to our data centers, putting things
in data centers, and putting something on the ship, for example, where I have a compliance
problem. Okay. So I have multiple government groups trying to use that ship, and I've got to meet some
compliance, and Dell hits a home run there. Okay. But they don't always hit a home run when it comes
to bleeding edge.
It takes them a while to GA a product.
Well, they're a volume business, to be fair, right?
Right.
Whereas Supermicro is a little bit faster at getting bleeding-edge technology onto the table
for us.
And so we've been working with Supermicro for the ranger stations, and this is why we chose
them.
They were able to hand us pieces of equipment that had liquid cooling built into them and
allowed us to deal with the cooling problems.
So it handles the thermals, right?
You got it.
And then they were also able to meet a power footprint that worked inside the ranger
station, so that I'm able to put four
liquid-cooled A100s into a ranger station with 300 terabytes of Solidigm and process at speeds
greater than a Grace Hopper.
Wow, yeah, it's pretty wild.
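For a rough sense of that envelope, here is an illustrative power budget; every wattage below is an assumption for the sake of arithmetic, not a measurement of the actual build:

```python
# Illustrative power budget; every wattage here is an assumption.
A100_W = 300        # PCIe A100 board power; SXM parts draw more
HOST_W = 800        # CPUs, SSDs, fans, coolant pump (assumed)
total_w = 4 * A100_W + HOST_W
print(f"~{total_w/1000:.1f} kW sustained")   # ~2.0 kW, roughly a space heater
# A common 120 V / 20 A branch circuit delivers 2.4 kW peak, or 1.92 kW
# continuous under the usual 80% rule, so the fit is tight; this is why the
# vendor's power footprint mattered as much as the liquid cooling.
```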
And now, I mean, you're already seeing the early returns on the RTX Pro 6000 server
cards.
What do you think that's going to do for your, I almost said, business, for your research,
just fundamentally compared to the other cards that you've been playing with?
Because not everyone needs an NVL144.
Correct.
And I mean, in the end, what I found is
people, I think, misuse some of the bigger cards that we pay for.
We spend a lot of money on Grace Hoppers and other types of
NVIDIA products, H100s and stuff like this, DGXs.
And I really want to see people leveraging those in a way that changes the way we do science
and can change the scope again, remove that bias.
Whereas I have...
Let me restart that one.
It's okay.
We'll just leave it all in.
It's kind of funny.
The voice of Oz.
I don't even hear what he's talking about.
Okay, so different cards, though, could be used for different reasons.
Right, so if I'm using a big, expensive card, like a Hopper, for just inference, I'm not really getting the value back out of that Hopper GPU, where I could be training on it.
And so what I really want to be doing is buying inference-based cards for inference workloads and training-based cards for training workloads, really focusing where that's going to go.
I'm training on the big cards, and I'm going to do tons of inference on these 6000s.
Well, I think there was a lot of market confusion, not just in research, but in enterprise AI, too, where all these cards were available.
There was a period of time where NVIDIA had six or eight server cards: A30, A40, L4, L40.
Like, it was too much.
And I get it.
At that time, I mean, it doesn't seem that long ago, but everyone was finding their way, right, about what we needed and what do we need these cards to do.
Now, I mean, the swim lanes are getting tighter, and I think it's a little
more clear as to what organizations need.
But still, I mean, you're right,
there's a segmentation there that takes a little while
to learn to make sure that you're putting
the right product to work.
Yeah, I don't want to be spending tons of money
on something I could have gotten done at a cheaper price.
You know, the other piece is that we really look
at how the rendering occurs.
So we buy these big, big GPUs to go in the data center,
but I can't do Omniverse on them, and I can't render on them.
And I really look at that RTX 6000 as a dual-purpose piece.
We are building an Omniverse representation of the Taani ship.
Right.
So our students can actually be trained on the ship with the ship out at sea still.
And by the time the ship comes back, they can actually go to the ship and be ready to go much more rapidly.
I don't have to use ship time to train my students and postdocs and stuff.
And I could also use this Omniverse to learn how to place things on the ship
and how we are going to actually have to load all the equipment.
Because space is limited, right?
It is.
And I have to make sure it doesn't fall off the ship.
The ship's going up and down,
and it kind of moves, and so we don't want things to fall off.
And so we have to get a packing order associated to this,
and that packing order is hard.
You're going to have to actually use screws on your rails
and stuff to lock them into the rack.
It does.
We never do that in our office.
The data center is going to have to be really set,
so things don't jostle around, yeah.
And the other thing is you've got to realize
that anything that has moving parts
is going to start to have problems,
and that's another reason that we love to have the SSDs out there.
And we've been testing different flash arrays
from different groups, but again, it was always a secondary piece of equipment.
And I always have to move things around, and that costs me in terms of processing.
And so we were only getting limited visibility.
Being able to put all of it onto a single device and not have to move data once it's landed is actually a huge win.
Well, I'd guess you'd rather have another GPU array than blow those two or four U on a storage product.
Correct. And you've got to realize that we've been able to get this large capacity in such a small
actual power footprint that I'm not burning the power that I would have burned otherwise.
And so I had to bring all these other arrays out and pieces of equipment to run them.
Guess what? I'm consuming power. And so ultimately, the power footprint is amazing on the ship
using large-capacity SSDs.
So now that you're starting to put these things to work, I mean, you've talked a little bit
about the change in workflow that it's given you. Any other benefits or side effects that you
didn't expect? I mean, in the end, it's this performance. So I was always concerned that as we
went to larger capacity, we were going to be losing the performance. And that's why I was really
interested in testing this. That was the early knock on these high-cap drives, that they're not
going to go fast enough. I've been saying for maybe almost two years, like we almost have enough
performance now. Now there's other challenges at play. I agree. And I think that we've reached that time
where the performance levels of things are at the human level where it can think as fast
as we are. And so we're not going to recognize performance changes because we're not going
to be able to feel it. We were feeling it for the last 20 to 30 years. We were starting to reach
a time when somebody put something in, it's coming back so fast. And so when we looked at the performance
on these drives, it was incredible. It matched up to what we saw on normal NVMe drives that we're working
with right now. The read performance was incredible. The write performance was still there.
We didn't lose anything, but we just gained this massive capability in a very small footprint,
and we saved a lot of power.
And so really the question is, how are we going to start leveraging this in other ways?
I believe that we should be having a Tier 0 on all the HPCs.
I think that these should be put in all the machines on the HPCs because it really will
change the way that we can get the performance out of those machines.
Well, you walk around out there, I mean, Dell's got their new system, Supermicro, HPE, everyone's
out here on the expo floor. And there's still quite a bit of diversity in the way that they're
going to market and trying to hit these things. We're seeing some giant 10U air-cooled systems.
We're seeing 3U and 4U liquid-cooled systems. We're seeing U.2 like this, E3.S.
I mean, it's all over the board, E1.S. Dell's got their new one with 16 E1.S. The storage configuration,
I know you had a bag full from Solidigm. I saw her hand you that bag full of toys before.
Yeah. All these form factors and all these different
design decisions now really make it, I think, kind of fun, but also a challenge.
I think it is a challenge, but I need people to realize that AI is a data problem.
And so if we're not addressing how we're handing the data off to these machines,
we're never going to overcome that problem, right?
We don't need the GPUs and the processing without the data, okay?
It's just not needed.
But if I have this amount of data, I can't do it without the GPUs.
And so when we look at this, we used to store data in the 1990s in these things
called filing cabinets. We had tons of data. It literally weighed 2,000 pounds, and we were
information poor because you could not search that.
It was also not replicated anywhere in most cases.
It was not redundant. And so, you know, there's been, you know, this concept that processing
is what made that come to life. It's not. It was the cheap hard drive. The cheap hard drive
brought the data into a position for the computer. To do something with them. Yeah. And that
was the magical moment. When we made the cheap hard drive in the 1990s, we took all the data out of
the filing cabinets, put it onto the drives, and all the math for AI is coming from the
50s, 60s, and 70s. We've had it. We just never had the data and the processing.
And so we're starting to watch the two meet each other. And it's always been an arms race.
Processing went faster. Now we got a little bit more data. And so the ball is back in the
court of processing when we look at these large-capacity drives. And now processing is
going to be able to do things that it wasn't able to do before.
And that's what I want is the enablement.
The data and the storage create an enablement for us to do science and processing
we never did before.
So your work has pushed you guys out to the edge.
You've got more data coming in from all sorts of different sensors, devices, video, cameras.
In all of this, and as we look at this, this conference in particular is a little more education heavy,
but it's leaning more enterprise every year as we come here.
What have you learned that would be practical for a large organization,
that they don't know about the edge
or that they don't know about AI?
I think that you're consuming time and labor
when you're not working at the edge, okay?
So I have to put a tremendous amount of labor
behind people bringing us the data,
pushing that data up onto an infrastructure,
processing that data on the infrastructure.
If we really talk about AI and agents,
well, why wouldn't we put them at the edge and have them do that work for us?
Now I don't have to ingest all of that labor, and our ability to understand what's happening massively increases.
So I'm not trying to remove labor because I don't want to get rid of labor.
I want to change that labor to work on other things.
I think we've seen that be successful at retail.
I think it's probably the closest analog to what you're talking about,
where they're pushing more stuff out to the grocery stores or Home Depots or whatever to do
more analytics, more customer flow tracking, more recommendations.
Hey, notice you've got the hammer, here's the thing, a nail, or whatever, right?
To try to get more interaction, more sales, more commercial use.
Although I guess you're trying to help get the trees without getting the birds
so we can make more lumber and keep those places.
So we are trying. Like, you look at the Forest Service, where there's a project to monitor
the spotted owl for the Endangered Species Act.
What we're really trying to do there is enable the lumber companies to take down trees without affecting habitat.
None of the lumber companies want to hurt animals.
None of them.
I've never met one that did.
They really look for us to help them.
And what we really want to do is give them immediate information.
If I can only sample two times a year, that's not good enough.
Because they spend most of their summer working in the forest and taking down trees and farming them.
If I'm giving them information that's two months old, that's not of value.
The birds could have moved around and habitats
could have changed and things like this.
And so the accuracy goes up massively
when we're starting to do things at the edge.
And that really does affect industries like the plankton
or the crabs, for example.
Our crab industry looks for us to actually put cameras
and camera traps onto people's fishing boats.
So we can actually just tell them what's in the trap
before they bring it up.
See, I watch enough TV.
I watch the... I know where they run them up.
And it's all dramatic when they get the cage up.
And then it's like, oh, there's one crab.
No crab.
We would already be able to tell them A and B.
Well, you could have told them to put it somewhere else, probably.
That's true, too.
The other one that they're trying to get us to do is to put buoys out there so they don't have to take the boat out.
They're like, just tell me in the morning if there's crabs in the trap.
We'll go out and get them.
We'll go get them.
And I'm like, that's a brilliant idea, but we're going to have to build the buoys and put the buoys out there.
Now, we do have buoys.
Not impossible, though.
We absolutely do this already.
We have the Ocean Observatories Initiative, which is
ultimately about a billion-dollar project with the NSF and Woods Hole Oceanographic, and we do
this with UW. So we're the cyber infrastructure. Oregon State's just amazing at
computing. And we're the cyber infrastructure for that entire grant. And we hold about
nine petabytes of storage for that one experiment and one set. And we're getting minute-
interval data coming from the ocean. So I have gliders and I have buoys and we have
moorings in the ocean. And this is leading to us being able to tell you, there was a volcano
going off outside of the state of Washington a couple of months ago. And so an undersea volcano was going
off. And you can look it up. It was out there. And that was our data. And we're holding that
data for everybody. I have a 200-gig fiber line. It's been upgraded by Link Oregon now to 400
gig. And that's what's important to understand: we have fiber plant, okay. And Link
Oregon helps us do that. And that allows me to get data from the ocean to my data center in real time.
I've also been able to put a fiber plant down the dock
so I can plug the ship in when it shows up
on 100 gig and just start offloading data.
And so the idea is that if it was already processed,
I'm backing it up at that point.
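Those link speeds set the offload window. A hedged comparison at ideal line rate, with no protocol overhead; the Starlink throughput is an illustrative assumption, not a measured figure:

```python
# Ideal transfer time for a ~100 TB capture; line rate only, no overhead.
def hours(tb: float, gbit_per_s: float) -> float:
    return tb * 8_000 / gbit_per_s / 3600     # 1 TB = 8,000 gigabits

print(f"100-gig dock link:        ~{hours(100, 100):.1f} hours")   # ~2.2 h
print(f"400-gig campus fiber:     ~{hours(100, 400):.1f} hours")   # ~0.6 h
print(f"Starlink at ~0.2 Gbit/s:  ~{hours(100, 0.2)/24:.0f} days") # ~46 days
```

Which is why the dockside fiber plant matters: the same capture that drains in an afternoon over 100 gig would take weeks over a satellite link.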
So even with Starlink, I know we've talked about this before,
that gets you something, but it's still not enough
to move that much data?
It's not.
We're already using Starlink on the ship
for a whole bunch of other things.
We have been testing Starlink with the Owl Project, okay?
But with 5,000 autonomous recorders,
I cannot scale that up for the cost.
Right.
And so this is where we're looking at
other wireless meshes. We were testing LoRaWAN. We're testing other wireless
meshes to bring the data back to a ranger station. Going to a ranger station is easy for us.
We can drive there, we can pick up the drives and swap them out. Right. And then we bring
the data back. But we've already got the information, because that one ranger station
can talk Starlink without much cost. And so the question really is how many of the devices
can we collect the data back from to the same ranger station, and how many of them do we need to put out,
and how often do I need to have them for that visibility?
Well, you've got so much on your plate.
I know you've got all these new toys you're playing with all the time.
I started by saying, you know, this is like a toy store for you.
Truly.
In particular, as you look out at this expo floor, if you could sneak one piece of hardware in your backpack
and take it home with you, what would it be?
What have you seen today that would be your favorite piece to steal?
Yeah, so I'm still on that XE7745.
I love that machine.
Well, you have one already.
I know, but I want more of them.
I mean...
So I'm building
a virtual reality space in the building that we're building with Jensen Huang.
Jensen and Lori Huang made an amazing donation to Oregon State University.
He's here this evening.
Why don't you ask him for more 7745s?
I would love to.
I'm not sure I'll get access to him.
Okay.
I'll bring it up.
Our guys will be there.
We'll make sure to ask them for you.
So that machine is crucial for us to be able to do lots of the edge stuff.
And when I talk about putting stuff out into the field,
that machine does fit a power footprint that fits into a lot of our
spaces out there.
And then it's able to do the rendering.
It's able to do the inference.
It's able to do so many things.
And I can reconfigure it in multiple ways.
The XE9680, for me, wasn't the greatest box.
And I think that it had its own set of issues and things like this.
The 7745 is a home run.
It's an absolute one.
It's way more flexible.
It's so flexible.
And for me, when I'm talking both edge or non-edge, it's both.
Yeah, you can do both.
I can do both.
And this is what you should see people doing:
using the cloud like we see, and putting these types of things in their data centers right now to be considered edge.
You should turn your on-prem data centers into edge data centers, leveraging cloud services for the bigger lift.
And that would be the new paradigm that I would see Solidigm moving people to.
Good.
Well, that's awesome.
We've got a detailed paper on what Chris is doing with plankton and many other things.
We'll link to that in the description.
You can check out that whole conversation.
Chris, this is awesome.
Glad to finally see you in person.
We've had so many conversations, but thanks for sitting down and chatting and good to see you here.
Thank you, man.
