Orchestrate all the Things - The Quality Imperative: Why Leading Organizations Proactively Evaluate Software and AI Systems. Featuring Yiannis Kanellopoulos, code4thought CEO / Founder

Episode Date: April 9, 2025

As enterprise systems shift from deterministic software to probabilistic AI, forward-thinking organizations leverage proactive quality assessment to maximize value, minimize risk and ensure regulatory compliance. In today's rapidly evolving technological landscape, ensuring the quality of both traditional software and AI systems has become more critical than ever. Organizations are increasingly relying on complex digital systems to drive innovation and maintain competitive advantage, yet many struggle to effectively evaluate these systems before, during, and after deployment. Yiannis Kanellopoulos is on the forefront of software and AI system quality assessment. He is the founder and CEO of code4thought, a startup specializing in assessing large-scale software systems and AI applications. We connected to explore the challenges and opportunities of quality assessment and share insights both for developers and for people responsible for technology decisions within their organizations. Read the article published on Orchestrate all the Things here: https://linkeddataorchestration.com/2025/04/09/the-quality-imperative-why-leading-organizations-proactively-evaluate-software-and-ai-systems-and-how-you-can-too/

Transcript
Starting point is 00:00:00 Welcome to Orchestrate All The Things. I'm George Anadiotis and we'll be connecting the dots together. Stories about technology, data, AI and media and how they flow into each other, shaping our lives. As enterprise systems shift from deterministic software to probabilistic AI, forward-thinking organizations leverage proactive quality assessment to maximize value, minimize risk and ensure regulatory compliance. In today's rapidly evolving technological landscape, ensuring the quality of both traditional software and AI systems has become more critical than ever. Organizations are increasingly relying on complex digital systems to drive innovation and maintain competitive advantage. Yet, many struggle to effectively evaluate
Starting point is 00:00:44 these systems before, during, and after deployment. Yiannis Kanellopoulos is on the forefront of software and AI system quality assessment. He's the founder and CEO of code4thought, a startup specializing in assessing large-scale software systems and AI applications. We connected to explore the challenges and opportunities
Starting point is 00:01:04 of quality assessment and share insights for anyone responsible for technology decisions within their organizations. I hope you will enjoy this. If you like my work on Orchestrate all the Things, you can subscribe to my podcast, available on all major platforms, my self-published newsletter, also syndicated on Substack, Hackernoon, Medium and Dzone, or follow Orchestrate all the Things on your social media of choice. I'm Yiannis Kanellopoulos, I am from Greece and I'm the founder and CEO of code4thought. We are a technology startup which specializes in assessing and testing large scale software systems.
Starting point is 00:01:51 We have, let's say, two lines of work. One is testing traditional software systems, using also a platform from a Dutch partner, and essentially assessing the quality of the code, the architecture, the security of the code. And we have another line of business, which is related to testing and auditing of AI systems. We started with the traditional ones, like machine learning and deep learning systems, and so on. And now we are expanding to GenAI.
Starting point is 00:02:19 The difference between those two lines of work is that in the first line, we deal with systems that have millions of lines of code behind them and deterministic behavior. And in the other line, we deal with systems with way less lines of code, but they tend to make probabilistic decisions. So in most of the cases, the result is non-deterministic and it depends very much on the data and also the criticality of these kind of systems is way more important compared to the traditional ones. An example I'd like to give always is from the banking sector where we are able to assess
Starting point is 00:02:57 the quality of a core banking system or a mobile application, but also we are able to assess the credit scoring risks of the organization. And in the first case, if I have a system like the core banking one, where the code is not good or the system fails for a day, of course it will create problems for the clients, it will create disruption, but within a day or so, everything will be back to normal.
Starting point is 00:03:22 Whereas if my credit risk algorithms exhibit biases against people of color or women or younger people or older people, then this means that there is a systemic risk, or the probability of a risk, that my algorithms are not working as they should and are not giving everyone a fair opportunity for lending. That's the example I'd like to give.
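To make the kind of bias check described above concrete, here is a minimal sketch (not code4thought's actual methodology; the column names and the four-fifths threshold are illustrative assumptions) that compares approval rates across demographic groups in a credit-scoring model's decisions:

```python
# Minimal group-fairness sketch for credit-scoring decisions.
# Hypothetical data layout: one row per applicant, a protected-attribute
# column ("group") and the model's decision ("approved", 0 or 1).
import pandas as pd

def disparate_impact(df: pd.DataFrame, group_col: str, decision_col: str,
                     reference_group: str) -> dict:
    """Approval rate of each group divided by the reference group's rate.
    Ratios well below 1.0 (e.g. under 0.8, the common 'four-fifths' rule
    of thumb) suggest the model may not be giving everyone a fair
    opportunity for lending."""
    rates = df.groupby(group_col)[decision_col].mean()
    ref_rate = rates[reference_group]
    return {group: rate / ref_rate for group, rate in rates.items()}

# Example with made-up decisions:
decisions = pd.DataFrame({
    "group":    ["A", "A", "A", "B", "B", "B", "B"],
    "approved": [1,   1,   0,   1,   0,   0,   0],
})
print(disparate_impact(decisions, "group", "approved", reference_group="A"))
```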
Starting point is 00:03:47 Great. Well, thanks for the intro. It was quite enlightening as to what you do, and it touched on both areas. I think it's kind of a natural progression to move from software to machine learning systems because of the reason you also mentioned in your introduction. So traditional software is deterministic, and also it has been around for longer. It's understood better, and it's something that the industry has more experience with.
Starting point is 00:04:20 So the first question I would like to ask you when it comes to evaluating software is: what exactly are the use cases? So when do people come to you and ask you to evaluate the quality of their software? Okay, the majority of our projects started when clients were facing issues with, I wouldn't say with their systems per se, but with their investments in those systems. So usually, if you spend a significant amount of money, and by significant I mean some millions of euros, on a project, and the system is not delivering the value it was expected to, or the project is delayed or never goes to production, then these are the cases where clients were calling us and saying,
Starting point is 00:05:07 can you help? Can you tell us what's going on here? What do we need to fix? How much time will it take? How much effort? After this first wave of projects, management and engineers also started to understand that it is better to cater for quality as early as possible when you start a project and not after the system is in production or anything like that because there you can prevent things from happening instead of reacting to things
Starting point is 00:05:40 when they happen. This is our experience over all these years. It's like a journey. We started with some cases where clients were facing problems, but nowadays people are calling us beforehand, like: this is what we want to do. Can you help us already from the design phase? Can you review our architecture? Can you review our plans for this system? And then can you make sure to put the right measures in place so we know that the code is going to be of sufficient quality and it's not going to have security holes or anything that may be exploited in the future. So I would say that the industry is maturing in that perspective. As you said, we've been here for many years, the software industry has existed for more than 30 to 40 years, and this means that a lot of money went down the drain in this time.
Starting point is 00:06:34 Yeah, that's encouraging actually, because when you started by saying that in most cases people call us when things go wrong, basically, it didn't sound that encouraging, but maybe this is just one of the use cases and there's a little bit of hope if the realization is settling in that well, prevention is better than firefighting basically. Yeah, and also because you've been a software engineer yourself, you know that the earlier you identify a problem in your code, the cheaper it is to, and the easier it is to fix it. If you see, if you find a bug in the design phase, then you can fix it much faster,
Starting point is 00:07:12 where compared to identify something when your system is already deployed in production. Exactly. So ideally it should start even before the project starts itself, even before the architectural design. You should start with the overall goal of having a quality product. But then, of course, this is a very good segue for me to ask you the next question, which is, okay, fine. So let's assume that
Starting point is 00:07:36 in theory we agree that, well, software quality is a good thing to have, but how does that translate into practice? So are there specific metrics that are used to assess that? Are there specific standards that organizations need to comply with? What do you use in your own evaluation, in your own work? If you look at the literature or at the industry, there are several methods or standards that one can follow in order to ensure the quality of the software. There are guidelines from IEEE, there is the CMMI maturity model, there are several standards one can use. We are using the ISO 25010, which is the ISO standard for
Starting point is 00:08:30 software product quality, which defines certain characteristics that a software system needs to exhibit, like maintainability, security, portability, performance efficiency, and so on. And then for each characteristic, there are specific sub-characteristics. So for maintainability, which is mostly related to the source code, it says that the source code should be easily analysable, so that's analysability; easy to understand where you need to make a change, which is changeability; and then how easy it is to test those changes, which is testability. And below those properties, we
Starting point is 00:09:13 map them to metrics that are directly derived from the source code analysis, like the complexity of the code, the duplication, the fan-in, fan-out metrics, which are related to the coupling of a certain module. By using those metrics and certain thresholds for those, we map them to the sub-characteristics, which are mapped to the characteristics. And we have a taxonomy that helps us, first of all,
Starting point is 00:09:40 measuring the source code in a repeatable way. Second, being able to communicate results to engineers, which are the results for the metrics; but also the mapping of the metrics to high-level concepts helps us talk to management, because they don't need the details, but they want to gain a good idea of the big picture. So by using this standard, which is industry accepted,
Starting point is 00:10:04 and by using this kind of taxonomy, we are able to work with both the engineering team around the system and the management. Okay, so in a way it sounds like a typical KPI with metrics approach. You have the various metrics that are directly related to the code itself, and then you have aggregations of those that are mapped to your KPIs that you can use to communicate with management or people who are not necessarily technical. Okay. Yeah, this is true. This is the case. All right. And I presume, actually, I don't just presume, I have also seen that you are using a specific software to do that, right?
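As a rough illustration of the KPI-style aggregation just described, the sketch below rates raw code metrics against thresholds and rolls them up into a single maintainability-style score that can be reported to management. The thresholds, weights and metric names are made up for illustration; they are not the calibrated ISO 25010 ratings or the model used by the tooling discussed next.

```python
# Hypothetical sketch of metric-to-rating aggregation (illustrative
# thresholds, not the calibrated ones a real quality model would use).

def rate(value: float, thresholds: list[float]) -> int:
    """Map a 'lower is better' metric to a 1-5 star rating.
    thresholds are the upper bounds for 5, 4, 3 and 2 stars."""
    for stars, bound in zip((5, 4, 3, 2), thresholds):
        if value <= bound:
            return stars
    return 1

# Metrics derived from source code analysis (values are made up).
metrics = {
    "duplication_pct":  (4.8,  [3, 5, 10, 20]),
    "unit_complexity":  (8.0,  [5, 10, 15, 25]),   # e.g. average cyclomatic complexity
    "module_coupling":  (12.0, [10, 20, 30, 50]),  # e.g. based on fan-in / fan-out
}

ratings = {name: rate(value, th) for name, (value, th) in metrics.items()}

# Sub-characteristic score reported to management: here simply the average
# of the underlying metric ratings (real models weight and benchmark these).
maintainability = sum(ratings.values()) / len(ratings)
print(ratings, round(maintainability, 1))
```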
Starting point is 00:10:46 Yes. Which has been developed by a third party. Would you like to share a few words about the software and the process that you use? How exactly does it work? Do you connect it to the repository where the implementation lives? Oh, I like your questions. The tool that we use is called Sigrid. It is provided by our Dutch partner, the Software Improvement
Starting point is 00:11:09 Group, SIG. I used to work for SIG; I lived in Amsterdam, working for them, when I finished my PhD in the area of software quality. And I went back to Greece to develop a similar business with their support. When I joined SIG, I was also part of the team that designed this model, also doing these kinds of projects. So for me, it's like the natural evolution of my work there. And what I like also is that this is a direct result of R&D and of work that was done together with academic institutions.
Starting point is 00:11:50 So SIG themselves are a spin-off of the University of Amsterdam. Myself, I was finishing my PhD at the University of Manchester when I met this company, these guys, at a conference. We met at an academic conference actually. I try to keep this connection with academia and research at code4thought as well. Okay, so it's not just... because, you know, on the surface it seemed like a kind of typical case: there is the software vendor that is developing this software, and then there are companies like yours, in this example, that are somehow licensing and using it in their projects.
Starting point is 00:12:26 But it seems there's a deeper connection and I wonder if you even contribute to the software in some way even today? Yeah, I would say that we're contributing mainly via our clients. And I'm proud to say that, together with our team and SIG, we have one of the highest rates of client engagement and adoption of the tooling. So the feedback from our team and from our clients is taken seriously by the team in Amsterdam to enrich the tooling.
Starting point is 00:12:58 And also to your previous question: yeah, the tooling can be connected to a CI/CD pipeline. Whenever there is a commit, the code is analyzed on the fly. It's a SaaS solution, but no code is analyzed on our premises; it never leaves the premises of the clients. So the whole analysis happens on the fly, and we get only the results. Okay.
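As a generic illustration of the pattern described, where analysis runs where the code lives and only aggregate results are shipped, a commit-triggered step could look roughly like the sketch below. Every name here (the toy analyzer, the endpoint) is hypothetical; this is not Sigrid's actual interface.

```python
# Hypothetical post-commit step: analyze the checked-out code locally and
# ship only aggregate metrics, never the source itself. All names are
# illustrative placeholders, not a real product API.
import json
import subprocess
import urllib.request

QUALITY_ENDPOINT = "https://quality.example.com/api/results"  # placeholder

def analyze_repo(path: str) -> dict:
    """Stand-in for a real analyzer: count lines per tracked file as a toy metric."""
    files = subprocess.run(["git", "-C", path, "ls-files"],
                           capture_output=True, text=True).stdout.splitlines()
    sizes = [sum(1 for _ in open(f"{path}/{f}", errors="ignore")) for f in files]
    return {"files": len(sizes), "total_loc": sum(sizes),
            "largest_unit_loc": max(sizes, default=0)}

def upload_results(metrics: dict) -> None:
    """Ship only the aggregated numbers; the code never leaves the premises."""
    req = urllib.request.Request(
        QUALITY_ENDPOINT, data=json.dumps(metrics).encode(),
        headers={"Content-Type": "application/json"}, method="POST")
    urllib.request.urlopen(req)

if __name__ == "__main__":
    results = analyze_repo(".")
    print(json.dumps(results))  # or upload_results(results) against a real endpoint
```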
Starting point is 00:13:21 All right. So that's even more interesting than I thought actually, because in my mind, I was somehow imagining like, okay, if people call you to evaluate their software, then probably it's like a one-off. So you go there, you do your evaluation, you give them the results, and then you're out basically. But the way you describe it, I would even imagine that this could also work on a granular level basically. So if there is a commit like a new feature or even more granular than this, then somehow violates the quality directive, then you get like maybe a warning or something.
Starting point is 00:14:00 Yeah, I mean, that's a good observation. In Greece from the beginning, I mean, we had a couple of cases where there were problems and clients were calling us, but they were not interested in a one-off project. They were not interested in the assessment. Whereas in other countries, the assessment is way more popular as a service. Here in Greece, everyone was like: if you come, we want you to help us, not just by assessing and then leaving the company. We want you to kind of take ownership, take the accountability for the recommendations that will be provided to our teams to fix things, but you need to monitor those fixes and you need to make sure that these
Starting point is 00:14:41 are going to be implemented. So that helped build a stronger relationship, a longer relationship with the client, but also showing more impact. And also the tool is designed to facilitate these kinds of needs, so it was a good match. Interesting. And it also makes me wonder, because obviously when you have a variety of clients, you even get into the territory of what I would call maybe personal preferences or style. Because these things, well, obviously there must be some kind of objective metrics, but as you said, I've also worked as a software engineer myself, and so I can attest to the fact that, well, some things, even in software engineering, can be subjective.
Starting point is 00:15:31 So there's not a single way to implement things. And even though the implementations may differ, the end result or the end quality, let's say, can be comparable. Is there the option to somehow tailor the results that you get out of your evaluation to such individual preferences? I would say that the results cannot be changed. There is a whole model behind it. So the rating regarding the code or anything that cannot be altered.
Starting point is 00:16:05 But what changes and makes things more personal is the context of your client, the priorities they have, the challenges they face. So this also helps you to make sure that the message will come across. And that is also one of the responsibilities of a good consultant, if I may say. They need to understand their clients' context, their clients' challenges and problems. Because if they don't, I mean, just by saying a rating that is like three stars or two stars or four stars, doesn't mean anything by itself. So I would say that the results do not change. What changes and gets personalized is the message.
Starting point is 00:16:49 All right. Let's switch gears a little bit now, having covered, I think, a good amount of ground in terms of software quality and how do you go about it. So in that scenario, or in that use case, I should say, things were kind of more solid in a way, precisely because of the reasons we both mentioned. Long tradition, existing software, pretty much well understood metrics and KPIs,
Starting point is 00:17:15 so you had a good starting point. And that explains also the backstory that you shared on the fact that you are using off-the-shelf software, so to say. In the case of evaluating machine learning and AI systems, however, things are a little bit different. I know that in that part of your work, you are actually using your own solution that you have developed from scratch, I presume? Yes. And I guess that reflects the fact that it's a relatively new area. I mean, both in terms of adoption of machine learning per se, but also in terms of evaluating those systems.
Starting point is 00:17:59 So how did you get started there? And again, to make the equivalence with the other area, what use cases do you see? So when do people come to you to evaluate their machine learning systems? OK. So yes, we have developed our own platform, which is called IQ4AI. It currently supports workflows for structured datasets, mainly for binary and
Starting point is 00:18:27 multi-class classification problems. But we are expanding it now. We're going towards the GenAI area, RAG-based usually. We have cases where clients need to abide by specific legislation, like the New York City bias law, which is the city of New York dictating that any HR software undergo an independent bias audit and publish the results on their website. So we have these cases. We're also doing due diligence projects. So a client wants to acquire an AI startup and they want us to evaluate their AI system.
Starting point is 00:19:06 And also there are cases of clients that are proactive enough; they kept hearing over the last years about the upcoming legislation like the EU AI Act. So they wanted to test the waters. So we're doing some AI assessments on selected systems they have, to understand what an AI audit or an AI test means, and for them to be more prepared for the future. Okay, I find the last part especially interesting because even before you got to the part where you actually called what you do an audit, it already sounded to me like you are maybe doing auditing work in fact. And also because you mentioned the EU AI Act legislation, it's in the process of being enacted
Starting point is 00:19:54 actually. Yeah. And I know that you are reasonably familiar with it. So I wonder if you see yourselves as doing audit work and if you actually know whether you need to somehow officially register to act as an auditor or how does that work? Yeah, I will be open here. I don't like the word auditor, to be honest, because usually it comes with something which is very dry formula based kind of thing and doesn't leave room for creativity or for going out of the perimeter and do some more work because you think it will help your client. I like the word assessment more because assessment is more general and gives you more grounds to do things.
Starting point is 00:20:50 But I think the word auditing fits what we do when we check a system against certain legislation. There are several initiatives to organise AI auditing and we are following these developments. I would say that if there is a need for us to register, we will do so, but for now, I think there is no such need. And if you ask me also, we would like to take the road of consultancy and advisory
Starting point is 00:21:19 because we feel that we can help more. Also, the other thing that we see is that if clients perceive you as a compliance exercise, they don't see the value for them; they see it just as a cost of doing business. So: tell us what is the fastest way to check the box and then move further. I can understand that, as well as I can understand why you don't like the term itself. But you know, if it's something that people are obliged by law to do, it's a de facto line of business. Yeah, it is. I agree with you.
Starting point is 00:21:57 Legislation like the EU AI Act right now is the compelling reason for somebody to start looking at their AI systems, and, I mean, if you do it seriously, then you can get lots of benefit from this particular legislation. I agree. I totally agree. So let's go back to the core of the topic in a way, because to my mind, doing this type of assessment on a machine learning system is much harder than doing it on a traditional piece of software, for a number of reasons. First, the thing that you already mentioned, the non-deterministic aspect of machine learning and machine learning algorithms. And then also the fact that, seen purely from the engineering point of view,
Starting point is 00:22:46 these systems have a different level of complexity basically. There are many more moving parts. So in traditional software, you can have systems of high complexity with many modules. You can have very elaborate chains of deployment and all of those things. But at the end of the day, it can be a big and elaborate puzzle, but you can see how all of the pieces fit together. In the case of machine learning-based systems, you have that plus also a number of components that don't exist in traditional software, like data and different data sets and even different versions of all these data sets. You have machine learning models and again,
Starting point is 00:23:30 different versions of these machine learning models. So it's a much more elaborate system to handle. So which part of this whole ecosystem are you able to monitor in the current state of affairs, and which ones do you think should actually be monitored for a holistic assessment? Okay, when we work with a client, we ask for the testing data they themselves use to test their models, and we also try to map the version of the model to the version of the data set in the audit that we do, because everything gets timestamped. So in an audit we see
Starting point is 00:24:14 a specific version of the model, a specific version of the testing data set. So we try to make it as contained as possible. And then, based on that, we can also monitor the behavior of the model by checking the testing data set. And we also have the ability to monitor certain KPIs in production, for cases where we want to identify drift in the data or in the model and so on. But I think that the challenge here is that the AI system changes, evolves continuously.
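A minimal sketch of what such a contained, timestamped audit record might look like (the field names are hypothetical, not IQ4AI's actual schema): each audit pins a specific model version to a specific test set, identified by content hash, together with the metrics computed from them.

```python
# Hypothetical audit record: pin the model version, the exact test set
# (by content hash), the computed metrics and the timestamp, so that the
# audit stays as contained and reproducible as possible.
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class AuditRecord:
    model_version: str
    test_set_sha256: str
    metrics: dict
    timestamp: str

def fingerprint(path: str) -> str:
    """Content hash of the test-set file, so the exact data version is pinned."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def make_audit(model_version: str, test_set_path: str, metrics: dict) -> AuditRecord:
    return AuditRecord(
        model_version=model_version,
        test_set_sha256=fingerprint(test_set_path),
        metrics=metrics,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )

# Example usage (file name and metric values are made up):
# record = make_audit("credit-model-1.4.2", "test_set_2025Q1.csv",
#                     {"accuracy": 0.91, "disparate_impact_min": 0.83})
```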
Starting point is 00:24:59 It's not like a typical software system, where you make enhancements but the behavior of the system is not fundamentally changing. In machine learning and also in GenAI, these things tend to change more drastically and more rapidly and more often. So the challenge is in monitoring this behavior: just doing one audit per year gives you a sense of control or a sense of assurance, but it's not enough.
Starting point is 00:25:28 So you need to make sure that you monitor the behavior of the system. We also have the tendency to base our analysis on facts, or specific artifacts, if you like. So that's why I'd like to say that, okay, in traditional software, we like to base the analysis on the source code. In traditional AI we would like to base the analysis on a testing data set, let's say. In GenAI we would like to base the analysis on the embeddings; again, to find the proper artifacts
Starting point is 00:26:08 where we can do an analysis that will be as fact-based as possible and not an assessment made based on opinion or expertise or things that are not related directly to the artifacts. Okay, okay. I think it makes sense to make this distinction between traditional machine learning systems and GenAI-based systems. Let's start with the traditional machine learning. I think you mentioned previously you cover mostly classification algorithms. Yes. Okay. Even for those, you mentioned drift, for example, so model drift. When in production, the model behavior may remain constant, and actually it will remain constant if it hasn't been updated, but what may change is the actual distribution,
Starting point is 00:26:57 the actual input of the incoming data. So if you have a model that stays the same, while the data that's coming in changes in some way, you get this mismatch. So is that the kind of thing that you monitor? Yeah, we tend to monitor the behavior of the model by checking a series of statistics and metrics related to the performance of the model. That's how we try to identify the so-called drift. So we are looking on the metrics from the model and also metrics from the data itself like the distribution and so on to make sure that we can identify types of drift alongside the monitoring of the model.
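As one concrete example of the kind of data-drift statistic being described (a common choice among several, not necessarily the exact one used here), the Population Stability Index compares a feature's training-time distribution with what the model sees in production:

```python
# Minimal data-drift sketch using the Population Stability Index (PSI):
# bucket a feature by its training-time quantiles and compare the share of
# production traffic falling in each bucket. Values around 0.1 (moderate)
# and 0.25 (significant) are common rules of thumb for flagging drift.
import numpy as np

def psi(train: np.ndarray, prod: np.ndarray, buckets: int = 10) -> float:
    """Population Stability Index between training-time and production data."""
    # Inner bucket edges from training-time quantiles; values outside the
    # training range fall into the first or last bucket.
    edges = np.quantile(train, np.linspace(0, 1, buckets + 1))[1:-1]
    expected = np.bincount(np.digitize(train, edges), minlength=buckets) / len(train)
    actual = np.bincount(np.digitize(prod, edges), minlength=buckets) / len(prod)
    expected = np.clip(expected, 1e-6, None)  # avoid log(0)
    actual = np.clip(actual, 1e-6, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

rng = np.random.default_rng(0)
train_income = rng.normal(50_000, 10_000, size=10_000)
prod_income = rng.normal(55_000, 12_000, size=10_000)  # incoming data has shifted
print(round(psi(train_income, prod_income), 3))        # a high value flags drift
```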
Starting point is 00:27:46 OK, and you do that on an ongoing basis, I presume, because if you only do it as a one-off, you can verify the performance of the system at that point in time, but that doesn't necessarily tell you anything about what happens in a month or two months. We are now working with our clients on monitoring. It is interesting to see that most of them right now, they are happy with just an audit or two, and not the monitoring. But this is also a requirement from the EU AI Act. So I think that as time passes, they will be
Starting point is 00:28:27 also more prone to having tools like ours to monitor the behaviour of their model on a constant basis and not just by doing a series of audits. The point is that in general, the legislation like the EU AI Act started putting AI audits or AI assessments on the table. Before that, I would say it was more exotic. And also there are preconceptions, right? A typical one is that, okay, if my model has high accuracy, why do I need to test it further? Why do I care about biases? Why do I care about explainability?
Starting point is 00:29:08 And then you have to elaborate on that and convince people who are, let's say, budget owners, problem owners, but they don't understand technical things. Right, so again, one of the interesting side effects of legislation and regulation. And what about the use cases that you are asked to monitor GenAI based systems? I have the feeling that these must be more complicated for a number of reasons.
Starting point is 00:29:38 I would imagine that many of those systems are actually based on calling external GenAI models through APIs. So they're kind of wrappers around API calls. Is that what you're seeing? And if yes, how do you actually evaluate the system? The clients we deal with are using RAGs on top of an LLM, usually OpenAI's. And then the RAG does something and there is a user interface that presents the information to the user. That's most of the cases: RAG-based systems, not just an API wrapper around the foundation model.
Starting point is 00:30:21 That's one of the things that we would like to understand before we start the project, before we even submit a proposal, because if there is an API just calling something like GPT, there are not many things we can actually do, right? I mean, it's just using a foundation model. And if there is something going on with the results, then of course we can look at it. But if it just asks and provides a user interface, then I don't think it's something very sophisticated for us. Of course, we can, depending on the context, because if you are a startup claiming you have a super tool, and that super tool is just a wrapper on top of a foundation model, and you want to sell your company for some millions, of course, by doing the due diligence we realize that this is the kind of system that won't fly. In most of the cases we are being called for, they are RAG-based systems.
Starting point is 00:31:18 And we try to start with the basics, with the fundamentals. We start from the embeddings, we start from the NLP part of the equation, and then we move layer by layer. OK, but well, even if you start with embeddings, I know that lots of people actually use third-party models, like GPT or whatever, Anthropic's, whatever; they actually use them for embeddings as well. So in a scenario like this, how do you evaluate the embeddings that are coming out of this black box, basically? We try to utilize standards and guidelines from papers, from academia, from the industry. Also, there are
Starting point is 00:32:08 some thresholds we would like to use, which are kind of proprietary, I may say. If we say that this is an NLP problem, then we utilize the thresholds and the metrics related mostly with that field. Okay. Okay. I was just curious because I find it like really intellectually challenging problem. That's why we have developed a methodology and we published it on our website. You can find a detailed description of the methodology.
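To give a flavour of an embedding-level retrieval check for a RAG system (a generic sketch, not the published code4thought methodology; the embedding function, the threshold and all names are illustrative assumptions), one starting point is to score how close each retrieved chunk is to the query in embedding space, treating the embedding model as a black box:

```python
# Generic sketch of an embedding-level retrieval check for a RAG system.
# embed() stands in for whatever black-box embedding model is used
# (e.g. a third-party API); here it is a dummy for illustration only.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: a real system would call its embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=384)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieval_report(query: str, retrieved_chunks: list[str],
                     min_similarity: float = 0.75) -> dict:
    """Flag retrieved chunks whose similarity to the query falls below a
    threshold. The 0.75 default is an arbitrary illustrative value; real
    thresholds are calibrated per embedding model and per domain."""
    q = embed(query)
    scores = [cosine(q, embed(chunk)) for chunk in retrieved_chunks]
    return {
        "scores": scores,
        "below_threshold": [c for c, s in zip(retrieved_chunks, scores)
                            if s < min_similarity],
    }

# Example usage with made-up inputs:
# report = retrieval_report("What is the refund policy?",
#                           ["Refunds are issued within 14 days...",
#                            "Our office hours are 9 to 5..."])
```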
Starting point is 00:32:42 So we're not afraid to say some things about us and what we do. And we do that because it's a really complex problem. Testing GenAI is complex; it is a current problem, a huge problem. So it's not like we try to hide what we do. We actually like to exchange views with the community and see what we can do better and how we can actually help, because the adoption of GenAI is pretty fast. It made people realize that AI is here. Before that, people didn't even realize, although, yeah, that has been happening for some years now. The whole situation with GenAI and LLMs cannot be fully under control. So we need more help and we want more exchange with the community to identify more solutions, because the problem is getting bigger and bigger. All right, well. And we don't claim we have the silver bullet for that, right?
Starting point is 00:33:41 Well, if you did, maybe somebody else would be evaluating your own solution. Yeah, yeah, I'd like to comment on something you just said, which made me smile. You said something along the lines of, well, it was this whole ChatGPT and AI thing that made people realize that AI is here. And I find that to be very much true, not just in my experience, but if you look at things like adoption rates or mind share or how much people are talking about AI, it's very clear that it has been a watershed moment. And for people who have been around before that,
Starting point is 00:34:26 it's a little bit funny because now everyone is talking about it. But not that many people actually know what they are talking about. To share my own experience here, in the last couple of years, I was asked to do something that nobody ever asked me to do before. So I was asked to deliver
Starting point is 00:34:46 training seminars to organizations on AI. And it was clear in pretty much all of the cases that their motivation was precisely this buzz around generative AI. And in their minds, I would say maybe 99% of people would equate machine learning and AI to GenAI and ChatGPT, basically. To put it very simplistically. So because of the fact that you also work with a wide range of clients and partners, I'm wondering what you are seeing, so what your experience has been. Do you think that the level of, well, the awareness is there now, obviously, but the level of understanding and education is adequate?
Starting point is 00:35:31 No, it's not. We recently had a meeting with a client and it was obvious to me that for all the people, when we're talking about AI, they mean GenAI, and everything else is simply software. I don't think that people are trained adequately enough for these kinds of technologies. For me there's a very simple reason. All the people who are like my age and above, we never saw, while we were studying or in our first years at work, live AI applications. It was a very constrained thing at that time. I would say that machine learning applications started becoming more dominant after 2010,
Starting point is 00:36:21 more or less; that's the milestone year to have in mind. So that might explain a bit why we see this lack of literacy in people, even engineers, software engineers. I think it's never too late to start training your people in the proper way. And if you ask me, I think that GenAI is not the solution for everything, right? It's not a silver bullet. And I think that most of the problems that organizations face, business problems, can be solved with more traditional AI. The whole thing is if you have the proper people on board, if you have a proper data
Starting point is 00:37:02 strategy, there are other things that may affect the way you're going to solve a problem. But just saying, I'm going to use GEN.AI for the sake of GEN.AI, they definitely is not going to fly. You are going to spend lots of money, but without getting your return on investment. Yeah. Yeah. I mean, I can see why there is such an overshoot on how people use Gen.ai.
Starting point is 00:37:29 Well, first, it's very easy to use. It's very user-friendly. You have this textual interface and anybody can just use it without much effort, without having to set pretty much anything up. There's that, and there's also the fact that it's very much hyped. There's a lot of money being spent on developing these models and so there's also a lot of budget for marketing
Starting point is 00:37:57 by the people who develop those models. So you get lots of noise, let's say, about these models and what they can do. But I totally agree with you that some of the things that these models can do can be very useful, but they're definitely not the answer to everything. No, no. We shall not make that mistake, I think. So I wanted to ask you a bit more exploratory question. In my mind, it sort of ties all of these things that we talked about together. Some of the things that I've been reading about lately have to do precisely with the interplay of these two domains that you also cover. So code and traditional software engineering on the one hand, and AI, and specifically Gen.AI
Starting point is 00:38:48 on the other hand, and how these two interact. On the one hand, we have what has become like a premium application domain for this Gen.AI model, so code generation, basically. There's different ways that people use GEN.AI for code generation. Some people, developers, use it to generate code on their behalf. There's also the so-called vibe coding going on. Solutions that are addressed to non-developers.
Starting point is 00:39:19 The idea there is something like being able to explain what it is that you want to do, explain what kind of solution you want to build, and then have some Gen.AI model build it for you. So that's one way that they mix with each other. And then there is also the fact that a big part of the training of these Gen.AI models is actually code itself. And through techniques like reinforcement learning, some people have been experimenting
Starting point is 00:39:48 with fine-tuning a gen AI model with code and noticing the overall change in behavior that this brings. So I've seen research, for example, in which people have shown that, well, if you try and tune a GenAI model with specific examples of code and make it more capable, let's say, in this area, this somehow reflects in the model's capacity in other areas as well. So that's a very... I find it fascinating how this interaction works.
Starting point is 00:40:24 I just wanted to bring it to your attention and just get the comment from you basically, because it's two of the areas that you specialize in. Okay. I think I can say more about the code generation. I will start from that, right? Gen.ai is a great tool to help developers document code, understand code. When I was doing my PhD, which was in the area of program comprehension, there were statistics showing that 90% of the time of a developer is spent on reading and understanding
Starting point is 00:40:56 code. So for me, GenAI might be the perfect tool for helping somebody understand the code that they read, giving it the proper context and everything. That's one. The second is I also read not so many research papers, but mostly surveys, saying that using a tool like Copilot, let's say, you can be 10 times faster, 15 times faster. The point is that writing the code is just a fraction of the time of a developer. The rest is communication with other people, analysis, design, testing. And just optimizing the
Starting point is 00:41:44 time you write code doesn't mean that you optimize the whole software development life cycle, which means that you can gain benefits when it comes to productivity only for a specific part of your work. That's one. I wouldn't see Copilot as a productivity tool per se. I would see it as a very valuable companion for the other tasks, for understanding the code, for getting ideas on how you can improve your code or how you can become better. Still, though, the code generated by a tool is not to be trusted, not only regarding its quality, but also regarding security and other things. One thing that I haven't got a very concrete answer to is: what happens to this part of code generated by the Copilot when we now need to change it? We need to, let's say, support a new feature, or we found a bug and we're fixing it. So again, here, there is another problem in software engineering, that 80% of
Starting point is 00:42:55 the system's life cycle is support, not the initial development. Who's going to do it? Are you going to give another prompt to the Copilot, or are you going to do it yourself? How will this look, and how can this be integrated into the master branch later on? Yeah, I think it's relatively early for that type of analysis to have surfaced, but some early results that I've seen seem to suggest that even though you may get more output, to put it that way, so more code generated per developer, that doesn't necessarily mean that the quality of this code is up to the usual standards of what that particular developer would generate.
Starting point is 00:43:47 So you may have a very good point as to the overall utility of that GenAI-generated code. But still, I find it fascinating for cases like documentation or migrating code. I mean, I have a piece of code in Java and I want to migrate it to Python. It can be a good starting point, I'd say. Another point that I've seen people make is that over-reliance on these tools may end up hindering the development, especially of junior software engineers. Because if you blindly follow what something like Copilot gives you, without actually... you also mentioned that the biggest part of a developer's time is spent in understanding what needs to be done or
Starting point is 00:44:25 understanding what a particular piece of code does. So if you outsource that part of the process, then, well, yes, you may generate some code, but you miss something basically. You miss the insights that this gives you. You miss your own personal development as a professional. Yep. There is a learning curve for these tools, and I would avoid idealizing them or thinking that they will replace everyone and software development is dying or anything. No, it's a matter of finding the right balance, finding the right use cases for those, and even for senior developers, to your other point
Starting point is 00:45:10 about the long-term effects. I have also seen reports from very senior people who say that, well, I was very enthusiastic about these tools. I adopted them. I trusted them and used them in my work up until the point where I realized that they introduced subtle bugs. And, you know, there can be bugs in anyone's code, but at least if it's your code, you know where to look. If it's code that you haven't produced and
Starting point is 00:45:37 you have no good oversight over, then it becomes probably harder. Yep. It's one of those things. It looks fascinating, but you should probably treat it with caution. Yeah, it needs not just caution but careful design. The profession of the software developer is going to change drastically, that's for sure. The point is how we adapt, how we reap the benefits of tools like Copilot, and how we can keep the trust in the resulting code.
Starting point is 00:46:32 By the way, have you ever been called in a situation where you had to evaluate the quality of GenAI-generated code? We kind of are doing it now. I mean, there are clients where half of the code is written using these kinds of tools and half of the code is written by developers, and we tend to look and give our opinion. But I cannot say yet, I mean, any very definitive conclusions. Thanks for sticking around. For more stories like this, check the link in bio and follow Linked Data Orchestration.
