Orchestrate all the Things - The Quality Imperative: Why Leading Organizations Proactively Evaluate Software and AI Systems. Featuring Yiannis Kanellopoulos, code4thought CEO / Founder
Episode Date: April 9, 2025
As enterprise systems shift from deterministic software to probabilistic AI, forward-thinking organizations leverage proactive quality assessment to maximize value, minimize risk and ensure regulatory compliance.
In today's rapidly evolving technological landscape, ensuring the quality of both traditional software and AI systems has become more critical than ever. Organizations are increasingly relying on complex digital systems to drive innovation and maintain competitive advantage, yet many struggle to effectively evaluate these systems before, during, and after deployment. Yiannis Kanellopoulos is on the forefront of software and AI system quality assessment. He is the founder and CEO of code4thought, a startup specializing in assessing large-scale software systems and AI applications. We connected to explore the challenges and opportunities of quality assessment and share insights both for developers and for people responsible for technology decisions within their organizations.
Read the article published on Orchestrate all the Things here: https://linkeddataorchestration.com/2025/04/09/the-quality-imperative-why-leading-organizations-proactively-evaluate-software-and-ai-systems-and-how-you-can-too/
 Transcript
    
                                         Welcome to Orchestrate All The Things. I'm George Anadiotis and we'll be connecting the dots together.
                                         
Stories about technology, data, AI and media and how they flow into each other, shaping our lives.
                                         
                                         As enterprise systems shift from deterministic software to probabilistic AI,
                                         
                                         forward-thinking organizations leverage proactive quality assessment to maximize value,
                                         
                                         minimize risk and ensure regulatory compliance. In today's rapidly evolving technological landscape, ensuring the quality of both
                                         
                                         traditional software and AI systems has become more critical than ever.
                                         
                                         Organizations are increasingly relying on complex digital systems to drive innovation
                                         
                                         and maintain competitive advantage. Yet, many struggle to effectively evaluate
                                         
    
                                         these systems
                                         
                                         before, during, and after deployment.
                                         
Yiannis Kanellopoulos is at the forefront of software and AI system quality assessment. He's the founder and CEO of code4thought, a startup specializing in assessing large-scale software systems and AI applications.
                                         
We connected to explore the challenges and opportunities of quality assessment and share insights for anyone responsible for technology decisions within their organizations.
                                         
I hope you will enjoy this. If you like my work on Orchestrate all the Things, you can subscribe to my podcast, available on all major platforms, and to my self-published newsletter, also syndicated on Substack, HackerNoon, Medium and DZone, or follow Orchestrate all the Things on your social media of choice.
                                         
I'm Yiannis Kanellopoulos, I am from Greece and I'm the founder and CEO of code4thought. We are a technology startup which specializes in assessing and testing large-scale software systems.
                                         
    
                                         We have, let's say, two lines of work.
                                         
One is testing traditional software systems, also using a platform from a Dutch partner, and essentially assessing the quality of the code, the architecture, the security of the code. And we have another line of business, which is related to testing and auditing of AI systems. We started with the traditional ones, like machine learning, deep learning systems, and so on. And now we are expanding to GenAI.
                                         
    
                                         The difference between those two lines of work is that in the first line, we deal with
                                         
                                         systems that have millions of lines of code behind them and deterministic behavior.
                                         
And in the other line, we deal with systems with far fewer lines of code, but they tend to make probabilistic decisions. So in most cases, the result is non-deterministic and it depends very much on the data. Also, the criticality of these kinds of systems is way higher compared to the traditional ones.
                                         
                                         An example I'd like to give always is from the banking sector where we are able to assess
                                         
    
                                         the quality of a core banking system or a mobile application but also we are able to
                                         
                                         assess the credit scoring risks of the organization.
                                         
And in the first case, if I have a system like a core banking one, in which the code is not good or the system fails for a day, of course it will create problems for the clients, it will create disruption, but within a day or so, everything will be back to normal.
                                         
    
                                         Whereas if my credit risk algorithms exhibit biases
                                         
                                         against people of color or women or younger people
                                         
                                         or older people,
                                         
then this means that there is a systemic risk, or the probability of a risk, that my algorithms are not working as they should and are not giving everyone a fair opportunity for lending.
                                         
                                         That's the example I'd like to give.
                                         
    
                                         Great. Well, thanks for the intro.
                                         
                                         It was quite enlightening as to what you do, and it touched on both areas.
                                         
                                         I think it's kind of a natural progression to move from software to machine learning systems
                                         
                                         because of the reason you also mentioned in your introduction.
                                         
So traditional software is deterministic, and also it has been around for longer. It's better understood, and it's something that the industry has more experience with.
                                         
    
                                         So the first question I would like to ask you
                                         
                                         when it comes to evaluating software is what exactly are the use cases?
                                         
                                         So when do people come to you and ask you to evaluate the quality of their software?
                                         
Okay, the majority of our projects started when clients were facing issues with, I wouldn't say with their systems per se, but with their investments in those systems. So usually if you spend a significant amount of money, and by significant I mean some millions of euros, on a project, and then the system is not delivering the value it was expected to, or the project is delayed or never goes to production, then these are the first cases where clients were calling us and saying,
                                         
    
                                         can you help? Can you tell us what's going on here?
                                         
                                         What do we need to fix?
                                         
                                         How much time will it take? How much effort?
                                         
                                         After this first wave of projects,
                                         
                                         management and engineers also started to understand that it is better
                                         
                                         to cater for quality as early
                                         
                                         as possible when you start a project and not after the system is in production or anything like that
                                         
because there you can prevent things from happening instead of reacting to things when they happen. This has been our experience all these years. It's like a journey. We started with some cases where clients were facing problems, but nowadays people are calling us beforehand, like: this is what we want to do. Can you help us already from the design phase? Can you review our architecture? Can you review our plans for this system? And then can you make sure to put the right measures in place so we know that the code is going to be of sufficient quality and it's not going to have security holes or anything that may be exploited in the future? So I would say that the industry is maturing from that perspective.

As you said, we've been here for many years. The software industry has existed for more than 30 to 40 years, and this means that a lot of money went down the drain in this time.
                                         
    
                                         Yeah, that's encouraging actually, because when you started by saying that in most cases people
                                         
                                         call us when things go wrong, basically, it didn't sound that encouraging, but maybe this is
                                         
                                         just one of the use cases and there's a little bit of hope if the realization is settling
                                         
                                         in that well, prevention is better than firefighting basically.
                                         
                                         Yeah, and also because you've been a software engineer yourself, you know that the earlier
                                         
you identify a problem in your code, the cheaper and the easier it is to fix it. If you find a bug in the design phase, then you can fix it much faster compared to identifying something when your system is already deployed in production.
                                         
                                         Exactly.
                                         
                                         So ideally it should start
                                         
                                         even before the project starts itself,
                                         
                                         even before the architectural design. You should start
                                         
                                         with the overall goal of having a quality product. But then, of course, this is a very
                                         
                                         good segue for me to ask you the next question, which is, okay, fine. So let's assume that
                                         
    
                                         in theory we agree that, well, software quality is a good thing to have, but how does that
                                         
translate into practice? So are there specific metrics that are used to assess that? Are there specific standards that organizations need to comply with? What do you use in your own evaluation, in your own work?
                                         
If you look in the literature or at the industry, there are several metrics or standards that one can follow in order to ensure the quality of the software. There are guidelines from IEEE, there is the CMMI maturity model, there are several standards one can use. We are using ISO 25010, which is the ISO standard for software product quality. It defines certain characteristics that a software system needs to exhibit, like maintainability, security, portability, performance efficiency, and so on. And then for each characteristic, there are specific sub-characteristics.

So for maintainability, which is mostly related to the source code, it says that the source code should be easily analysable, which is analysability; that you should easily understand where you need to make a change, which is changeability; and then how easy it is to test those changes, which is testability.
                                         
And below those properties, we map them to metrics that are directly derived from source code analysis, like the complexity of the code, the duplication, the fan-in and fan-out metrics, which are related to the coupling of a certain module. By using those metrics and certain thresholds for them, we map them to sub-characteristics, which are in turn mapped to the characteristics.

And we have a taxonomy that helps us, first of all, measure the source code in a repeatable way. Second, it lets us communicate results to engineers, which are the results for the metrics, but the mapping of the metrics to high-level concepts also helps us talk to management, because they don't need the details, but they want to gain a good idea of the big picture.
                                         
So by using this standard, which is industry accepted, and by using this kind of taxonomy, we are able to work with both the engineering team around the system and the management.

Okay, so in a way it sounds like a typical KPI with metrics approach. You have the various metrics that are directly related to the code itself, and then you have aggregations of those that are mapped to your KPIs that you can
                                         
                                         use to communicate with management or people who are not necessarily technical.
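The metrics-to-KPI roll-up described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the threshold values, the metric-to-sub-characteristic mapping and the pass/fail scoring are all hypothetical, and they are not the actual rating model used by code4thought or its tooling.

```python
from statistics import mean

# Hypothetical thresholds: a measured value at or below the limit passes.
THRESHOLDS = {
    "cyclomatic_complexity": 10,
    "duplication_pct": 5.0,
    "fan_in": 20,
    "fan_out": 15,
}

# Hypothetical mapping of maintainability sub-characteristics to code metrics.
SUBCHARACTERISTICS = {
    "analysability": ["cyclomatic_complexity", "duplication_pct"],
    "changeability": ["duplication_pct", "fan_in"],
    "testability": ["cyclomatic_complexity", "fan_out"],
}


def metric_score(name, value):
    """Score one metric: 1.0 if within its threshold, 0.0 otherwise."""
    return 1.0 if value <= THRESHOLDS[name] else 0.0


def rate(measurements):
    """Aggregate metric scores into sub-characteristics, then into one KPI."""
    sub_scores = {
        sub: mean(metric_score(m, measurements[m]) for m in metrics)
        for sub, metrics in SUBCHARACTERISTICS.items()
    }
    return {
        "sub_characteristics": sub_scores,          # detail for engineers
        "maintainability": mean(sub_scores.values()),  # big picture for management
    }


report = rate({
    "cyclomatic_complexity": 14,  # above the assumed threshold of 10
    "duplication_pct": 3.2,
    "fan_in": 8,
    "fan_out": 12,
})
print(report)
```

Engineers would look at the per-metric and per-sub-characteristic detail, while the single aggregated number is the kind of figure that can be reported to non-technical stakeholders.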
                                         
Okay. Yeah, this is true. This is the case.

All right. And I presume, actually, I don't just presume, I have also seen that you are using specific software to do that, right?

Yes.

Which has been developed by a third party. Would you like to share a few words about the software and the process that you use? How exactly does it work? Do you connect it to the repository where the implementation lives?
                                         
                                         Oh, I like your questions.
                                         
The tool that we use is called Sigrid. It is provided by our Dutch partner, Software Improvement Group. I used to work for SIG. I used to live in Amsterdam working for them when I finished my PhD in the area of software quality. And I went back to Greece to develop a similar business with their support. When I joined SIG, I was also part of the team that designed this model, also doing these kinds of projects. So for me, it's like the natural evolution of my work there. And what I like also is that this is a direct result of R&D and of work that was done together with academic institutions.
                                         
    
So SIG themselves are a spin-off of the University of Amsterdam. Myself, I was finishing my PhD at the University of Manchester when I met this company, these guys, at a conference. We met at an academic conference actually. I tend to keep this connection with academia and research at code4thought as well.

Okay, so it's not just, because, you know, on the surface it seemed like a kind of typical case. Okay, so there are software vendors developing this software, and then there are companies like yours in this example that are somehow licensing and using it in their projects.
                                         
    
                                         But it seems there's a deeper connection and I wonder if you even contribute to the software
                                         
                                         in some way even today?
                                         
                                         Yeah, I would say that we're contributing mainly via our clients.
                                         
And I'm proud to say that together with our team and Sigrid, we have one of the highest rates of client engagement and adoption of the tooling. So the feedback from our team and from our clients is taken seriously by the team in Amsterdam to enrich the tooling.

And also to your previous question, yeah, the tooling can be connected to a CI/CD pipeline. Whenever there is a commit, the code is analyzed on the fly. It's a SaaS solution, but no code is analyzed on our premises. It never leaves the premises of the clients. So the whole analysis happens on the fly, and we get only the results.
                                         
                                         Okay.
                                         
    
                                         All right.
                                         
                                         So that's even more interesting than I thought actually, because in my mind,
                                         
                                         I was somehow imagining like, okay, if people call you to evaluate their software, then probably
                                         
                                         it's like a one-off. So you go there, you do your evaluation, you give them the results, and
                                         
                                         then you're out basically. But the way you describe it, I would even imagine that this could also work
                                         
                                         on a granular level basically. So if there is a commit like a new feature or even more
                                         
granular than this, that somehow violates the quality directive, then you get like maybe
                                         
                                         a warning or something.
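The per-commit warning imagined here could look roughly like the Python sketch below. It is a generic illustration, not the API or behavior of the tool discussed in this episode: the function name `check_commit`, the metric names and the rule that "higher is worse" are all assumptions made for the example.

```python
def check_commit(baseline, current, tolerance=0.0):
    """Return warnings for metrics that got worse in this commit.

    Both arguments map a metric name to its measured value; in this
    sketch a higher value is assumed to be worse for every metric.
    """
    warnings = []
    for name, new_value in current.items():
        old_value = baseline.get(name)
        # Only compare metrics that were also measured before the commit.
        if old_value is not None and new_value > old_value + tolerance:
            warnings.append(f"{name} worsened: {old_value} -> {new_value}")
    return warnings


# Hypothetical measurements before and after a commit.
before = {"complexity": 8, "duplication_pct": 3.0}
after = {"complexity": 11, "duplication_pct": 3.0}

for warning in check_commit(before, after):
    print("WARNING:", warning)
```

In a CI/CD pipeline, a step like this would run after the analysis of each commit and could either print warnings, as here, or fail the build when the list is not empty.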
                                         
    
Yeah, I mean, that's a good observation. In Greece from the beginning, I mean, we had a couple of cases where there were problems and clients were calling us, but they were not interested in a one-off project. They were not interested in just the assessment. Whereas in other countries, the assessment is way more popular as a service. Here in Greece, everyone was like, if you come, we want you to help us, not just by assessing and then leaving the company. We want you to kind of take ownership, take accountability for your recommendations that will be provided to our teams to fix things, but you need to monitor those fixes and you need to make sure that these are going to be implemented. So that helped build a stronger relationship, a longer relationship with the client, but also showed more impact. And also the tool is designed to facilitate these kinds of needs, so it was a good match.
                                         
Interesting. And it also makes me wonder, because obviously when you have a variety of clients, you even get into the territory of what I would call maybe personal preferences or style. So, because these things, well, obviously there must be some kind of objective metrics, but as you said, I've also worked as a software engineer myself, and so I can attest to the fact that, well, some things, even in software engineering, can be subjective.
                                         
    
                                         So there's not a single way to implement things.
                                         
                                         And even though the implementations may differ,
                                         
                                         the end result or the end quality, let's say, can be comparable. Is there the option to somehow tailor the results
                                         
                                         that you get out of your evaluation
                                         
                                         to such individual preferences?
                                         
                                         I would say that the results cannot be changed.
                                         
                                         There is a whole model behind it.
                                         
                                         So the rating regarding the code or anything else cannot be altered.
                                         
    
                                         But what changes and makes things more personal is the context of your client,
                                         
                                         the priorities they have, the challenges they face.
                                         
                                         So this also helps you to make sure that the message will come across.
                                         
                                         And that is also one of the responsibilities of a good consultant, if I may say. They need to
                                         
                                         understand their clients' context, their clients' challenges and problems. Because if they don't,
                                         
                                         I mean, just by saying a rating that is like three stars or two stars or four stars,
                                         
                                         doesn't mean anything by itself. So I would say that the results do not change. What changes
                                         
                                         and gets personalized is the message.
                                         
    
                                         All right.
                                         
                                         Let's switch gears a little bit now,
                                         
                                         having covered, I think, a good amount of ground
                                         
                                         in terms of software quality and how you go about it.
                                         
                                         So in that scenario, or in that use case, I should say,
                                         
                                         things were kind of more solid in a way,
                                         
                                         precisely because of the reasons we both mentioned.
                                         
                                         Long tradition, existing software, pretty much well understood metrics and KPIs,
                                         
    
                                         so you had a good starting point.
                                         
                                         And that explains also the backstory that you shared on the fact that you are using
                                         
                                         off-the-shelf software, so to say.
                                         
                                         In the case of evaluating machine learning and AI systems, however, things are a little bit different.
                                         
                                         I know that in that part of your work, you are actually using your own solution that you have developed from scratch, I presume? Yes.
                                         
                                         And I guess that reflects the fact that it's a relatively new area.
                                         
                                         I mean, both in terms of adoption of machine learning per se, but also in terms of evaluating
                                         
                                         those systems.
                                         
    
                                         So how did you get started there? And again, to make the equivalence
                                         
                                         with the other area, what use cases do you see?
                                         
                                         So when do people come to you to evaluate their machine
                                         
                                         learning systems?
                                         
                                         OK.
                                         
                                         So yes, we have developed our own platform,
                                         
                                         which is called IQ4AI.
                                         
                                         It currently works with structured datasets, mainly for binary and
                                         
     
                                         multi-class classification problems. But we are expanding it now. We're going towards the
                                         
                                         GenAI area, RAG-based usually. We have cases where clients need to abide by a specific
                                         
                                         legislation like the New York City
                                         
                                         Bias Law, which is the City of New York requiring any HR software to undergo an independent
                                         
                                         bias audit and publish the results on their website.
                                         
                                         So we have these cases.
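
To make the idea of a bias audit concrete, here is a minimal sketch of one quantity such audits commonly report: the selection rate per group and each group's impact ratio relative to the most-selected group. This is an illustration only, not a description of code4thought's platform or method. The function name `impact_ratios`, the group labels, and the data are all hypothetical.

```python
from collections import Counter


def impact_ratios(records):
    """Compute selection rate and impact ratio per group.

    records: iterable of (group, was_selected) pairs, where was_selected is a bool.
    Returns {group: (selection_rate, impact_ratio)}, with the impact ratio taken
    relative to the group that has the highest selection rate.
    """
    totals = Counter()
    selected = Counter()
    for group, was_selected in records:
        totals[group] += 1
        if was_selected:
            selected[group] += 1

    rates = {group: selected[group] / totals[group] for group in totals}
    best = max(rates.values())
    if best == 0:
        raise ValueError("no group has any selections; impact ratio is undefined")
    return {group: (rate, rate / best) for group, rate in rates.items()}


# Hypothetical screening outcomes for two groups of 100 candidates each.
records = (
    [("group_a", True)] * 40 + [("group_a", False)] * 60
    + [("group_b", True)] * 25 + [("group_b", False)] * 75
)

result = impact_ratios(records)
for group, (rate, ratio) in sorted(result.items()):
    print(f"{group}: selection rate {rate:.2f}, impact ratio {ratio:.3f}")
```

A real audit would also need to handle intersectional categories, small sample sizes, and scored (rather than yes/no) outputs, none of which this sketch attempts.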
                                         
                                         We're also doing a due diligence project.
                                         
                                         So a client wants to acquire an AI startup and they want us to evaluate their AI system.
                                         
    
                                         And there are also cases where clients that are proactive enough kept hearing
                                         
                                         in recent years about upcoming legislation like the EU AI Act.
                                         
                                         So they wanted to test the waters.
                                         
                                         So we're doing some AI assessments on selected systems they have, so they can understand what
                                         
                                         an AI audit or an AI test involves and be better prepared for the future.
                                         
                                         Okay, I find the last part especially interesting because even before you got to the part when
                                         
                                         you actually called what you do an audit, it already sounded to me like you are maybe doing auditing work in fact. So
                                         
                                         and also because you mentioned the EU AI Act legislation, it's in the process of being enacted
                                         
    
                                         actually. Yeah. And I know that you are reasonably familiar with it. So I wonder
                                         
                                         if you see yourselves as doing audit work and if you actually know whether you need
                                         
                                         to somehow officially register to act as an auditor or how does that work?
                                         
                                         Yeah, I will be open here.
                                         
                                         I don't like the word auditor, to be honest, because usually it comes with something which is a very dry, formula-based
                                         
                                         kind of thing and doesn't leave room for creativity or for going out of the perimeter and do some
                                         
                                         more work because you think it will help your client. I like the word assessment more because assessment is more general and gives you more grounds
                                         
                                         to do things.
                                         
    
                                         But I think the word auditing fits what we do when we check a system against certain
                                         
                                         legislation.
                                         
                                         There are several initiatives to organise AI auditing and we are following these
                                         
                                         developments.
                                         
                                         I would say that if there is a need for us to register,
                                         
                                         we will do so, but for now, I think there is no such need.
                                         
                                         And if you ask me also,
                                         
                                         we would like to take the road of consultancy and advisory
                                         
    
                                         because we feel that we can help more.
                                         
                                         Also, the other thing that we see is that
                                         
                                         if clients perceive you as a
                                         
                                         compliance exercise, they don't see the value for them; they just see the cost of doing business.
                                         
                                         So it becomes: tell us the fastest way to check the box and then move on.
                                         
                                         I can understand that as well as I can understand why you don't like the term itself.
                                         
                                         But you know, if it's something that people are obliged by law to do, it's a de facto line of business.
                                         
                                         Yeah, it is. I agree with you.
                                         
    
                                         Legislation like the EU AI Act is right now the compelling reason for somebody to start looking at their AI systems,
                                         
                                         which is, I mean, if you do it seriously, then you can get lots of benefit from this particular legislation.
                                         
                                         I agree. I totally agree. So let's go back to the core of the topic in a way, because
                                         
                                         to my mind, doing this type of assessment on a machine learning system
                                         
                                         is much harder than doing it on a traditional piece of software,
                                         
                                         for a number of reasons.
                                         
                                         First, the thing that you already mentioned,
                                         
                                         the non-deterministic aspect of machine learning and machine learning algorithms.
                                         
                                         And then also the fact that seen purely from the engineering point of view,
                                         
    
                                         these systems have a different level of complexity basically.
                                         
                                         There are many more moving parts.
                                         
                                         So in traditional software, you can have systems of high complexity with many modules.
                                         
                                         You can have very elaborate chains of deployment and all of those things.
                                         
                                         But at the end of the day,
                                         
                                         it can be a big and elaborate puzzle, but you can see how all of the pieces fit together. In the case of machine learning-based systems, you have that plus also a number of components
                                         
                                         that don't exist in traditional software like data and different data sets and even different versions of all these data sets.
                                         
                                         You have machine learning models and again,
                                         
    
                                         different versions of these machine learning models.
                                         
                                         So it's a much more elaborate system to handle.
                                         
                                         So which part of this whole ecosystem
                                         
                                         are you able to monitor in the current state of affairs and which ones do
                                         
                                         you think should be actually monitored for a holistic assessment?
                                         
                                         Okay, when we work with a client, we ask for the testing data they have themselves to test
                                         
                                         their models and we try also to map the version of the model with a version of the data set
                                         
                                         and then the audit that we do, because everything gets timestamped. So in an audit we see
                                         
    
                                         a specific version of the model, a specific version of the testing data set.
                                         
                                         So we try to make it as contained as possible.
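
As a rough illustration of what pinning an audit to one model version and one testing data set could look like, here is a minimal sketch. It is not a description of the platform discussed here. The `AuditRecord` fields, the `fingerprint` and `open_audit` helpers, and the version strings are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    """Immutable record tying one audit to one model and one test set."""
    model_version: str
    dataset_version: str
    dataset_sha256: str
    audited_at: str  # ISO 8601 timestamp in UTC


def fingerprint(rows):
    """Hash the test rows so later changes to the data set can be detected."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def open_audit(model_version, dataset_version, rows):
    """Create a timestamped record for the exact artifacts under audit."""
    return AuditRecord(
        model_version=model_version,
        dataset_version=dataset_version,
        dataset_sha256=fingerprint(rows),
        audited_at=datetime.now(timezone.utc).isoformat(),
    )


# Hypothetical test rows and version labels.
test_rows = [(0.2, 1.5, "yes"), (0.9, 0.3, "no")]
record = open_audit("model-1.4.0", "testset-2025-03", test_rows)
print(record)

# The fingerprint matches the same data and changes when a row is added.
same_data = fingerprint(test_rows) == record.dataset_sha256
changed = fingerprint(test_rows + [(0.5, 0.5, "yes")]) == record.dataset_sha256
```

The point of the fingerprint is that an audit result can later be shown to refer to exactly the data it was run on, even if the data set keeps evolving.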
                                         
                                         And then based on that, we can also monitor the behavior of the model, all by checking
                                         
                                         the testing data set.
                                         
                                         And also we have the ability to monitor certain KPIs in production, for cases where we want to
                                         
                                         identify drift in the data or in the model and so on.
                                         
                                         But I think that the challenge here is
                                         
                                         that the AI system changes, evolves continuously.
                                         
    
                                         It's not like a typical software system
                                         
                                         that you do the enhancements,
                                         
                                         but the behavior of the system is not ultimately changing.
                                         
                                         In machine learning and also in GenAI, these things tend to change more drastically and
                                         
                                         more rapidly and more often.
                                         
                                         So the challenge is in monitoring this behavior. Just doing one audit per year
                                         
                                         gives you a sense of control
                                         
                                         or a sense of assurance, but it's not enough.
                                         
    
                                         So you need to make sure that you monitor
                                         
                                         the behavior of the system.
                                         
                                         We also have the tendency to base our analysis
                                         
                                         on facts or specific artifacts, if you like.
                                         
                                         So that's why I'd like to say that, okay,
                                         
                                         in traditional software, we like to base the analysis on the source code. In traditional AI we would like to base the analysis
                                         
                                         on a testing data set, let's say. In GenAI we would like to base the analysis on the embeddings,
                                         
                                         again to find the proper artifacts where we can do an analysis that will be as fact-based as
                                         
     
                                         possible and not an assessment made based on opinion or expertise or things that are not related directly to the artifacts.
                                         
                                         Okay, okay. I think it makes sense to make this distinction between, well, traditional machine learning systems and GenAI-based systems.
                                         
                                         Let's start with the traditional machine learning. I think you mentioned previously
                                         
                                         you cover mostly classification algorithms. Yes.
                                         
                                         Okay. Even for those, you mentioned drift, for example, so model drift.
                                         
                                         When in production, the model behavior may remain constant,
                                         
                                         and actually it will remain if it hasn't been updated,
                                         
                                         but what may change is the actual distribution,
                                         
    
                                         the actual input of the incoming data.
                                         
                                         So if you have a model that stays the same,
                                         
                                         while the data that's coming in changes in some way,
                                         
                                         you get this mismatch. So is that the kind of thing that you monitor?
                                         
                                         Yeah, we tend to monitor the behavior of the model by checking a series of statistics and
                                         
                                         metrics related to the performance of the model. That's how we try to identify the so-called
                                         
                                         drift. So we are looking at metrics from the model and also metrics from the data itself, like
                                         
                                         the distribution and so on to make sure that we can identify types of drift alongside the monitoring of the model.
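
One common way to compare the distribution of incoming data against a reference is the Population Stability Index (PSI). The sketch below is a generic illustration under stated assumptions, not the metric set used by the platform discussed here. The function name `psi`, the bin count, and the synthetic data are hypothetical, and the thresholds mentioned afterwards are a widely used rule of thumb rather than anything from this conversation.

```python
import math
import random


def psi(reference, production, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature.

    Bin edges are derived from the reference sample. Production values that
    fall outside the reference range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data

    def proportions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            index = min(max(index, 0), bins - 1)
            counts[index] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(count / len(values), eps) for count in counts]

    ref_p = proportions(reference)
    prod_p = proportions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref_p, prod_p))


# Synthetic example: one production sample like the reference, one shifted.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]
similar = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

psi_similar = psi(reference, similar)
psi_shifted = psi(reference, shifted)
print(f"PSI, similar data: {psi_similar:.3f}")
print(f"PSI, shifted data: {psi_shifted:.3f}")
```

A frequently quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift. In practice a check like this would run per feature on a schedule, alongside the model performance metrics.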
                                         
    
                                         OK, and you do that on an ongoing basis, I presume, because if you only do it as a one-off,
                                         
                                         you can verify the performance of the system at that point in time, but that doesn't necessarily
                                         
                                         tell you anything about what happens in a month or two months. We are now working with our clients on monitoring.
                                         
                                         It is interesting to see that most of them right now,
                                         
                                         they are happy with just an audit or two,
                                         
                                         and not the monitoring.
                                         
                                         But this is also a requirement from the EU AI Act.
                                         
                                         So I think that as time passes, they will be
                                         
    
                                         also more inclined to use tools like ours to monitor the behaviour of their model on
                                         
                                         a constant basis and not just by doing a series of audits. The point is that in general, the
                                         
                                         legislation like the EU AI Act started putting AI audits or AI assessments on the table.
                                         
                                         Before that, I would say it was more exotic.
                                         
                                         And also there are preconceptions, right?
                                         
                                         A typical one is that, okay, if my model has high accuracy, why do I need to test it further?
                                         
                                         Why do I care about biases?
                                         
                                         Why do I care about explainability?
                                         
    
                                         And then you have to elaborate on that
                                         
                                         and convince people who are, let's say,
                                         
                                         budget owners, problem owners,
                                         
                                         but they don't understand technical things.
                                         
                                         Right, so again, one of the interesting side effects
                                         
                                         of legislation and regulation.
                                         
                                         And what about the use cases where you are asked to monitor GenAI-based systems?
                                         
                                         I have the feeling that these must be more complicated for a number of reasons.
                                         
    
                                         I would imagine that many of those systems are actually based on calling external GenAI models through APIs.
                                         
                                         So they're kind of wrappers around API calls.
                                         
                                         Is that what you're seeing? And if yes, how do you actually evaluate the system?
                                         
                                         The clients we deal with are using RAG on top of an LLM, usually OpenAI's.
                                         
                                         And then the RAG does something and there is a user interface that presents the information
                                         
                                         to the user.
                                         
                                         That's most of the cases: RAG-based systems, not just an API around the foundation
                                         
                                         model.
                                         
    
                                         That's one of the things that we would like to understand before we start the
                                         
                                         project, before we even submit a proposal, because if there is an API just calling ChatGPT,
                                         
                                         there are not many things we can actually do, right? I mean, it's just using a foundational
                                         
                                         model. And if there is something going on with that result, then of course we can look at it. But if it just asks and provides a user interface, then I don't think it's something very sophisticated for us.
                                         
                                         Of course, we can, depending on the context, because if you are a startup claiming you have a super tool,
                                         
                                         and that super tool is just a wrapper on top of a foundation model, and you try to sell your company for some millions,
                                         
                                         of course, by doing a due diligence we would realize that this is the kind of system
                                         
                                         that won't fly. In most of the cases we are called in for, they are RAG-based systems.
                                         
    
                                         And we try to start with the basics, with the fundamentals. We start from the embeddings, we start from the NLP part
                                         
                                         of the equation, and then we move layer by layer.
                                         
                                         OK, but well, even if you start with embeddings,
                                         
                                         I know that lots of people actually
                                         
                                         use third-party models like GPT or whatever, Anthropic's,
                                         
                                         whatever, they actually use them for embeddings as well. So in a scenario like this, how do
                                         
                                         you evaluate the embeddings that are coming off of this black box basically?
                                         
                                         We try to utilize standards and guidelines from papers, from academia, from the industry. Also, there are
                                         
    
                                         some thresholds we would like to use, which are kind of proprietary, I may say. If we
                                         
                                         say that this is an NLP problem, then we utilize the thresholds and the metrics related mostly
                                         
                                         to that field.
                                         
                                         Okay.
                                         
                                         Okay.
                                         
                                         I was just curious because I find it a really intellectually challenging problem.
                                         
                                         That's why we have developed a methodology and we published it on our website.
                                         
                                         You can find a detailed description of the methodology.
                                         
    
                                         So we're not afraid to say some things about us and what we do.
                                         
                                         And we do that because it's a really complex problem. Testing Gene AI is complex, is a current
                                         
                                         problem, is a huge problem. So it's not like we try to hide what we do. We like to actually to
                                         
                                         exchange views with the community and see what can we do better and how we can actually help.
                                         
                                         and see what can we do better and how we can actually help because the adoption of Gen.AI is pretty fast. It made people realize that AI is here. Before that people didn't even realize,
                                         
                                         although yeah that is happening for some years now. It cannot be under control the whole situation with GNI LLMs. So we need more help and we want
                                         
                                         more exchange with the community to identify more solutions because the problem is getting
                                         
                                         bigger and bigger. All right, well. And we don't claim we have the silver bullet for that, right?
                                         
    
                                         Well, if you did, maybe somebody else would be evaluating your own solution.
                                         
                                         Yeah, yeah, I'd like to comment on something you just said, which made me smile.
                                         
                                         You said something along the lines of, well, it was this whole ChatGPT and AI thing that made people realize that AI is here. And I find that to be very much true,
                                         
                                         not just in my experience,
                                         
                                         but if you look at things like adoption rates or mind share
                                         
                                         or how much people are talking about AI,
                                         
                                         it's very clear that it has been a watershed moment.
                                         
                                         And for people who have been around before that,
                                         
    
                                         it's a little bit funny because now everyone
                                         
                                         is talking about it.
                                         
                                         But not that many people actually
                                         
                                         know what they are talking about.
                                         
                                         To share my own experience here, in the last couple of years,
                                         
                                         I was asked to do something that nobody ever
                                         
                                         asked me to do before.
                                         
                                         So I was asked to deliver
                                         
    
                                         training seminars to organizations on AI. And it was clear in pretty much all of the
                                         
                                         cases that their motivation was precisely this buzz around generative AI. And in their
                                         
                                         minds, I would say maybe 99% of people would equate machine learning and AI to GenAI and ChatGPT, basically.
                                         
                                         To put it very simplistically.
                                         
                                         So because of the fact that you also work with a wide range of clients and partners,
                                         
                                         I'm wondering what you are seeing, so what your experience has been.
                                         
                                         Do you think that the level of, well, the awareness is there now,
                                         
                                         obviously, but the level of understanding and education is adequate?
                                         
    
                                         No, it's not. We recently had a meeting with a client and it was obvious to me that all the
                                         
                                         people, when they're talking about AI, mean GenAI, and everything else is simple
                                         
                                         software. I don't think that people are trained adequately for these kinds of technologies.
                                         
                                         For me, there is a very simple reason. All the people who are my age and above, we never saw,
                                         
                                         while we were studying or in our first years at work,
                                         
                                         live AI applications.
                                         
                                         It was a very constrained thing at that time.
                                         
                                         I would say that machine learning applications started becoming more dominant after 2010,
                                         
    
                                         more or less; that's the year of the milestone I have in mind. So that
                                         
                                         might explain a bit why we see this lack of literacy in people, even engineers, software
                                         
                                         engineers. On the other hand, I think it's never too late to start training your people
                                         
                                         in the proper way. And if you ask me, I think that GenAI is not the solution for everything, right?
                                         
                                         It's not a silver bullet.
                                         
                                         And I think that most of the problems that organizations face, business problems,
                                         
                                         I think they can be solved with more traditional AI.
                                         
                                         The whole thing is if you have the proper people on board, if you have a proper data
                                         
    
                                         strategy, there are other things that may affect the way you're going to solve a problem.
                                         
                                         But just saying, I'm going to use GenAI for the sake of GenAI,
                                         
                                         that definitely is not going to fly.
                                         
                                         You are going to spend lots of money, but without getting your return on investment.
                                         
                                         Yeah.
                                         
                                         Yeah.
                                         
                                         I mean, I can see why there is such an overshoot
                                         
                                         on how people use GenAI.
                                         
    
                                         Well, first, it's very easy to use.
                                         
                                         It's very user-friendly.
                                         
                                         You have this textual interface
                                         
                                         and anybody can just use it without much effort,
                                         
                                         without having to set pretty much anything up.
                                         
                                         There's that, and there's also the fact that it's very much hyped.
                                         
                                         There's a lot of money being spent on developing these models
                                         
                                         and so there's also a lot of budget for marketing
                                         
    
                                         by the people who develop those models.
                                         
                                         So you get lots of noise, let's say, about these models and what they can do. But I totally agree with you that some of the things that these models can do
                                         
                                         can be very useful, but they're definitely not the answer to everything.
                                         
                                         No, no. We shall not make that mistake, I think.
                                         
                                         So I wanted to ask you a somewhat more exploratory question.
                                         
                                         In my mind, it sort of ties all of these things that we talked about together.
                                         
                                         Some of the things that I've been reading about lately have to do precisely with the interplay of these two domains that you also cover.
                                         
                                         So code and traditional software engineering on the one hand, and AI, and specifically GenAI
                                         
    
                                         on the other hand, and how these two interact.
                                         
                                         On the one hand, we have what has become
                                         
                                         like a premium application domain for these GenAI models,
                                         
                                         so code generation, basically.
                                         
                                         There are different ways that people use GenAI for code generation.
                                         
                                         Some people, developers, use it to generate code on their behalf.
                                         
                                         There's also the so-called vibe coding going on.
                                         
                                         Solutions that are addressed to non-developers.
                                         
    
                                         The idea there is something like being able to explain what it is that you want to do,
                                         
                                         explain what kind of solution you want to build,
                                         
                                         and then have some GenAI model build it for you.
                                         
                                         So that's one way that they mix with each other.
                                         
                                         And then there is also the fact that a big part of the training of these GenAI models
                                         
                                         is actually code itself.
                                         
                                         And through techniques like reinforcement learning,
                                         
                                         some people have been experimenting
                                         
    
                                         with fine-tuning a GenAI model with code
                                         
                                         and noticing the overall change in behavior that this brings.
                                         
                                         So I've seen research, for example,
                                         
                                         in which people have shown that, well,
                                         
                                         if you try and tune a GenAI model with specific examples of code
                                         
                                         and make it more capable, let's say, in this area,
                                         
                                         this somehow reflects in the model's capacity in other areas as well.
                                         
                                         So that's a very... I find it fascinating how this interaction works.
                                         
    
                                         I just wanted to bring it to your attention
                                         
                                         and just get a comment from you, basically,
                                         
                                         because it's two of the areas that you specialize in.
                                         
                                         Okay. I think I can say more about the code generation.
                                         
                                         I will start from that, right?
                                         
                                         GenAI is a great tool to help developers document code, understand code.
                                         
                                         When I was doing my PhD, which was in the area of program comprehension, there were
                                         
                                         statistics showing that 90% of the time of a developer is spent on reading and understanding
                                         
    
                                         code.
                                         
                                         So for me, GenAI might be the perfect tool for helping somebody understand the
                                         
                                         code that they read, giving it the proper context and everything.
                                         
                                         That's one.
                                         
                                         The second is that I have also read, not so many research papers, but mostly surveys saying that using
                                         
                                         this tool, a copilot, let's say, you can be 10 times faster, 15 times faster. The point is that writing the code is just a fraction
                                         
                                         of the time of a developer. The rest is communication with other people, is analysis, design, testing.
                                         
                                         And just optimizing the time you write code doesn't mean that you optimize the whole
                                         
    
                                         software development life cycle, which means that you can gain productivity benefits only in a specific
                                         
                                         part of your work. That's one. I wouldn't see Copilot as a productivity tool per se.
                                         
                                         I would see it as a very valuable companion for the other tasks,
                                         
                                         for understanding the code, for getting ideas on how you can improve your code or how you can
                                         
                                         become better. Still though, the code generated by a tool is not to be trusted. Not only about the quality of it, but also about security and other things.
                                         
                                         One thing that I haven't got a very concrete answer to is what happens to this part
                                         
                                         of code generated by the copilot when we need to change it. We need to, let's say, support a new feature, or we found a bug
                                         
                                         and it needs fixing. So again, here is another problem in software engineering: 80% of
                                         
    
                                         the system's life cycle is support, not the initial development. Who's going to do it? Are you going
                                         
                                         to give another prompt to the copilot or are you going to do it yourself?
                                         
                                         How will this look and how can this be integrated in the master branch later on?
                                         
                                         Yeah, I think it's relatively early for that type of analysis to have surfaced, but some early results that I've seen
                                         
                                         seem to suggest that even though you may get more output, to put it that way, so more code
                                         
                                         generated per developer, that doesn't necessarily mean that the quality of this code is up to the
                                         
                                         usual standards of what that particular developer would generate. So you may have a very good point
                                         
                                         as to the overall utility of that GenAI-generated code.
                                         
    
                                         But still, I find it fascinating for cases of communication, migrating code.
                                         
                                         I mean, I have a piece of code in Java and I want to migrate it to Python.
                                         
                                         It can be a good starting point, I'd say.
                                         
                                         Another point that I've seen people make is that over-reliance on these tools may end up hindering development,
                                         
                                         especially of junior software engineers.
                                         
                                         Because if you blindly follow what something like Copilot gives you,
                                         
                                         without actually... you also mentioned that the biggest part of a developer's time is spent in understanding
                                         
                                         what needs to be done or
                                         
    
                                         understanding what a particular piece of code does. So if you outsource that part
                                         
                                         of the process then well yes you may generate some code but you miss
                                         
                                         something basically. You miss the insights that this gives you. You miss your
                                         
                                         own personal development as a professional.
                                         
                                         Yep.
                                         
                                         There is a learning curve for these tools, and I would avoid idealizing them or thinking that they will replace everyone and software development is dying or anything.
                                         
                                         No, it's a matter of finding the right balance, finding the right use cases for those,
                                         
                                         and even for senior developers, to your other point
                                         
    
                                         about the long-term effects.
                                         
                                         I have also seen reports from very senior people who say that,
                                         
                                         well, I was very enthusiastic about these tools.
                                         
                                         I adopted them.
                                         
                                         I trusted them and used them in my work
                                         
                                         up until the point where I realized that they
                                         
                                         introduced subtle bugs. And, you know, there can be bugs in anyone's code, but at least
                                         
                                         if it's your code, you know where to look. If it's code that you haven't produced and
                                         
    
                                         you have no good oversight over, then it becomes probably harder. Yep.
                                         
                                         It's one of those things.
                                         
                                         It looks fascinating, but you should probably treat it with caution.
                                         
                                         Yeah, it needs not only caution, it needs careful design,
                                         
                                         but the profession of the software developer is going to change
                                         
                                         drastically, that's for sure. The point is how we adapt, how we reap the benefits of tools like
                                         
                                         Copilot and how we can keep trust in the resulting code.
                                         
                                         By the way, have you ever been called in a situation where you had to evaluate the quality of GenAI-generated code?
                                         
    
                                         We kind of are doing it now.
                                         
                                         I mean, there are clients where half of the code is written using these kinds of tools,
                                         
                                         half of the code is written by developers, and we tend to look at it and give our opinion.
                                         
                                         But I cannot yet draw any very definitive conclusions.
                                         
                                         Thanks for sticking around. For more stories like this, check the link in bio and follow Linked Data Orchestration.
                                         
