Debbie Reynolds “The Data Diva” talks to Zacharias Voulgaris, Data Science Consultant, Author, and Mentor in Data Science and Data Analytics. We discuss his podcast “Analytics and Privacy” and his book “The Data Path Less Traveled” (Amazon), his fascination with math and programming from an early age and steady progress towards a PhD, the need for recognition of the effect of one’s work in privacy on individuals, 7-11 Australian biometric privacy case and the dangers of innovation, Big Data and the dangers it brings, increased data quality resulting from improved privacy, his current concerns in privacy and data, UK AI accountability proposals, where do data science and Data Privacy intersect and interact, privacy policies and flexibility, clarity minimizes risk and his hope for Data Privacy in the future.
Debbie Reynolds, Zacharias Voulgaris
Debbie Reynolds 00:00
Personal views and opinions expressed by our podcast guests are their own and are not legal advice or official statements by their organizations. Hello, my name is Debbie Reynolds. They call me "The Data Diva". This is "The Data Diva" Talks Privacy podcast, where we discuss Data Privacy issues with industry leaders around the world with information that businesses need to know now. I have a special guest on the show from Italy, Zacharias Voulgaris. He is a data science consultant and author, and a mentor in data science and data analytics, and I'm happy to have him on the show. Hello, Zacharias.
Zacharias Voulgaris 00:45
Hello, Debbie, and thanks for having me.
Debbie Reynolds 00:48
This is great. So actually, you and I have not been connected on LinkedIn for a very long time. I really love your content and the things that you put out; I'm actually going to be on your podcast coming up pretty soon. And I thought it would be great to have you on my podcast. So I'm happy that we're able to do this. You have such a breadth of knowledge in technology and data. And that's why I love to see your content and the things that you're working on. Your podcast is called "Analytics and Privacy". I highly recommend people check it out. It's really cool. Also, you have a book that's out right now. It's called "The Data Path Less Traveled" by Technics Publications, and also, that book is available on Amazon, correct?
Zacharias Voulgaris 01:39
Debbie Reynolds 01:41
Excellent. Well, you have a fascinating background, and I would love to have you tell the listeners how you got into the data space and what interests you about analytics, and obviously your Ph.D. in Computer Science; you've had a very interesting trajectory on the data path. So tell us, what got you interested in data and data analytics? And how does that impact you in terms of privacy?
Zacharias Voulgaris 02:17
Debbie Reynolds 05:28
That's fascinating. I feel like there is a disconnect around data and the human side of it. There's a school of thought that says, okay, we program things, we use data to make decisions, and then we're sort of divorced from responsibility for the impact, especially the negative impact, that this work may have on individuals. What are your thoughts about that?
Zacharias Voulgaris 06:03
That's a very good point. And I'm glad you mentioned it because many people think that, Okay, all those data people they understand everything else too. They are like everyone else, but they just know data. Unfortunately, many people in this field, I'm ashamed to say, are not really that good at understanding the impact of their work, especially the negative impact. And they don't really understand that there are certain protocols that should be in place to respect the people behind the data and the processes and the ethics of it all. And it is this disparity for sure between the data work and the real world. That's why I believe there's an array of different roles that have been introduced in the past few years, such as a data storyteller, who tries to bridge the gap between the two and communicate what the data scientist finds and what's there in the business world that has something to do with these findings and how these insights can be applied. In other areas, like data analytics, like the simpler analytics, this gap is not so big. So the facilitator is not necessary. But whenever there is advanced modeling, like AI-based systems and things like that, it's often the case that you need an extra role for that. Other times, there may be a strategist involved, who may be able to prioritize the different things that a data scientist or a team of analysts can do and put everything in a way that makes sense. From a business perspective, aligning the data work with a data strategy is like the link between the data world and the business world. Because at the end of the day, every company or every organization has some kind of strategy in place if they are to do something useful with what they have, and that strategy has to really incorporate data strategy and data work.
Debbie Reynolds 08:07
Yeah, I agree with that. Also, I think one thing that we're seeing, I saw recently, well, I think it's happening again in a different place. But there was a case in Australia where the company 7-11 had tablets that people used to sign up for coupons or different loyalty programs, but as people used them, the tablets captured their biometric information, which was used for other purposes. So I feel like another gap I see is between people who just want to take all these innovations and implement these cool new technologies, but then they don't understand what their responsibility is in the use, and they don't clearly define for individuals what these tools are doing with their data.
Zacharias Voulgaris 09:14
That's a very good point. Yeah, it's very easy to get carried away with technology. And this has happened before, but now it is happening on a larger scale, and that's the problem with it. Because if it's a few isolated cases around the world, you can handle them; you can find them easily, first of all, and deal with them individually. But if this happens at scale, it's really hard to get a grip on the problem and deal with it in a meaningful way. So yeah, just because you can collect data doesn't mean that you should collect it in whatever way seems easiest and most obvious to you as someone who's looking to get the data, without understanding that there are also privacy implications to it. The data acquisition part may seem trivial for us because we focus more on the analytics side: okay, what do we do with the data, what insights can we get, what kind of models can we build? But data acquisition is very important, and you can't really do anything in analytics if you don't have the data first. So oftentimes, we don't really pay as much attention as is required to this part of the pipeline. And other times, it is done by other people altogether; many data engineers, for example, are responsible for acquisition too, and maybe even developers are involved in the whole process. So we don't really think about it that much. But whoever is involved in this part of the pipeline, I believe, has a responsibility towards the people behind the data and the kind of ethical respect they need to apply when dealing with it. Because even after you get the data, you have to make sure that it's governed properly, that it doesn't leak, and that it doesn't get accessed by people who shouldn't have access. So then, it becomes a data governance problem. But it's still someone's responsibility. You can't really say, oh, okay, well, the technology is to blame, because technology cannot be taken to court.
But somebody in that company will be taken to court, for sure, if that escalates into a scandal and the company is exposed, because especially nowadays, people are very sensitive about their data. And with many leaks having taken place and still taking place around the world, even in larger organizations, people are aware of that, and they don't tolerate it, nor should they tolerate it, I believe.
Debbie Reynolds 11:43
Yeah, I agree with that. I want your thoughts about what I call the Big Data age. I remember when people started talking about big data, and they were trying to implement systems that gather all this data; they were going to take all this data and put it in some system, and it was going to spit out these insights. And a lot of people at that point were saying, we just need as much data as possible; we can't get these insights without tons of data. So that created, in my view, an age of indiscriminate data collection and indiscriminate data processing. The idea was, let's just get as much data as we can, and then we'll do whatever we want with it. Now, we're seeing regulations that are asking companies to be a little bit more circumspect in terms of what they collect, why they collect it, and what they intend to do with, or what insights they intend to gain from, the data that they collect. What are your thoughts?
Zacharias Voulgaris 12:59
I think this was inevitable. And it is something essential, because when people realized that all models, regardless of where they're coming from, tend to perform better when they have more data, they started collecting more data. Even people who didn't have analytics in mind, because data collection was the first step to a data-driven kind of operation. So they said, okay, well, let's start collecting the data now, and maybe in the next five years we'll do something with it. But it's good to have, because when we start using it, we'll need historical data, so why not start collecting it? And there were so many good success stories with big data, especially with larger companies that had literally terabytes of data and beyond, at a time when a terabyte was inconceivable. Imagine: most of us were working with small datasets in Excel, and some people had terabytes. And now people have a million times more, and they can do amazing things with the data. But more data doesn't necessarily mean better results, at least for you. Maybe for a large organization that has a clearer idea of what to do with the data, has the right people, and also the right tools to analyze this data properly, maybe they can do wonders with it, and I have no doubt: the more data you have, the better your models are going to be. And if you have good people handling those models, building and maintaining those models, they're going to gain a lot of value out of them. But for the average organization, I don't think you need big data. It sounds like an oxymoron, but sometimes less data may be better, if the data you have is targeted towards what you want to do. Bernard Marr wrote a book on data strategy, and there's actually a nice course about this topic too. He also highlighted this point: it's good to have more data, and if you have big data, that's amazing.
But first of all, have a good idea of what you want to do, have a good objective that you try to fulfill, and then look at the data. Don't start with the data and say, let's collect as much as possible. Because that data, apart from the fact that it's a liability, as you mentioned, also carries a hidden cost: not just collecting it, but also storing it and maintaining it. If you want this data to exist in a meaningful way for a five-year period, for example, you have to make sure you continue to collect all of these fields. Or, if you're getting it from an external vendor, you have to pay some fees, because they're not going to give it to you for free. And of course, there's the legal liability as well, because what if regulations change and you are the owner of the data? You have to really be sure that you don't abuse it and break any laws. Many people nowadays break privacy laws without even realizing it. And that's something we need to be more aware of: more data may be great for the models, but it's not always good for the organization. So that's why we need a more holistic view of the whole data landscape. It's not just data for the models; it's also data for the organization, for specific objectives that it tries to achieve and within given timeframes.
Debbie Reynolds 16:31
Right. I like to tell companies that if they have data, especially data they're retaining for a long period of time, and it has a low business value, it has a high privacy risk or a cyber risk. Because a lot of times, companies don't protect that older data in the same way that they protect the data that they're actually utilizing on a day-to-day basis. I have a question about data quality, and I want your thoughts on this. One benefit that I think companies will have as a result of this more prescriptive way that jurisdictions want people to go about data collection and data retention is that, if they do this right, I think they can greatly improve their data quality, because they're getting better data, right? More accurate data, data that people want to give to them because they see a benefit. So I want to get your thoughts on that, first of all, and then on how you think that would help people in data analytics.
Zacharias Voulgaris 17:55
That's a very good point, and data quality is something people don't really think about too much. Because you say, okay, well, if I have a million variables, at least some of them must be good, right? But there is one of the V's of big data that characterizes good big data especially, and that's veracity: is the data reliable? Because you may have lots of data collected from an app, from a website, from a survey, or from a bunch of other sources. But if the data doesn't really mirror reality, what good is it? It's just taking up space on the servers or in the cloud; it's not really adding any value. Some analysts may be able to do something with it, there's no doubt about that, but will the result be reliable? Because there are insights that can drive decisions, and there are insights that are just interesting. If the person who analyzes the data takes the data seriously and really trusts it, they invest that into the creative product. And that product may be a model that predicts things based on that data. If the data is not good, if it's not good quality, that product is not going to be of any use; it will predict nonsense, it will give you gibberish, and it's not going to help drive any good decisions. So it may seem on the surface that it's adding value, but in the long run, it's not going to be sustainable, and it's probably going to cost the organization in one way or another. For example, if a model that we have developed on poor quality data is supposed to help us predict sales, it may predict completely different sales than what we experience. And then we won't be able to prepare for the actual sales, and we may end up at a loss, or we may end up not being able to handle the demand. So it is very, very important for any organization to have good quality data, and that good quality is not going to come from having lots of data unnecessarily.
Although that might help in some cases, it will come from being able to collect data in the right way, with the right processes. Ideally, you'd want to collect the data yourself. If that's not possible, because you don't want to bother with sensors or with building the acquisition pipelines, then you can get the data from some other source. And in that case, it will be fairly easy, because the risk is not necessarily in your hands, but at the same time, it will be like a running cost. So there are pros and cons to the different ways of acquiring data, but it's better to have fewer variables of good quality than to have many of them that are not really helping anyone.
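[Editor's note: the veracity screening Zacharias describes can be sketched in a few lines of Python. This is an illustrative sketch only; the field names, the sample records, and the 50% missing-rate cutoff are made-up assumptions, not anything from the episode.]

```python
# Before modeling, score each variable for completeness and flag the
# ones unlikely to add value. A variable that is mostly empty takes up
# space without mirroring reality, exactly the veracity problem above.

records = [
    {"age": 34, "income": 52000, "fax_number": None},
    {"age": 29, "income": None, "fax_number": None},
    {"age": None, "income": 48000, "fax_number": None},
    {"age": 41, "income": 61000, "fax_number": "555-0100"},
]

def missing_rate(rows, field):
    """Fraction of rows where the field is absent or empty."""
    return sum(1 for r in rows if r.get(field) in (None, "")) / len(rows)

MAX_MISSING = 0.5  # illustrative cutoff, tune per project
for field in records[0]:
    rate = missing_rate(records, field)
    verdict = "keep" if rate <= MAX_MISSING else "drop or re-collect"
    print(f"{field}: {rate:.0%} missing -> {verdict}")
```

In practice a screen like this would also check ranges, duplicates, and agreement with a trusted source, but even a completeness pass makes the "fewer, better variables" trade-off concrete.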
Debbie Reynolds 20:56
Excellent. What is happening in privacy right now? Or what's happening in data right now that concerns you most?
Zacharias Voulgaris 21:03
There are so many things. But if I were to pick one, I would say there is this misunderstanding that data analytics, especially advanced analytics, can do wonders. And that's the biggest misconception, I think, in the field. Because potentially, it can do wonders, but that doesn't mean that it will, because it greatly depends on the data. If you don't have good quality data, as we talked about just now, what good does a good model do? A good model may be able to find a good mapping of the data you have to what you're trying to predict, for example, or it may be a good exploratory model, so you can understand the data better. But all of the insights and predictions that come from the model might be useless. And the analyst may not be able to see that, because for us, it's all about some metrics, some heuristics that we use to understand what's happening with the data. But there needs to be an outside view of the whole matter to better understand what's happening and whether it really helps anyone. Because what good is a data model if it cannot be applied anywhere? You may have the best model in the world, and you may be able to communicate the results perfectly, but if the data is not good enough, or if it's not relevant enough to what the organization is trying to accomplish, then it's useless. That's why I think it's very important, probably paramount, to have the strategy in place first, and then get the data and the analytics going afterward. Because collecting data blindly and trying to make sense of it may or may not be of value. But if you have a clearer idea of where you're going, you're better equipped to find the right way to get there, be it through a specific set of models, a specific methodology, or a specific dataset.
Debbie Reynolds 23:19
Excellent. I want your thoughts about this. The UK has a proposal to create more accountability in AI. And one of the things that they want people to do when they're creating AI models is to be able to say in plain terms what artificial intelligence they are using, what data they are using for it, and what the expected results are of running data through the AI model. And even though that sounds very simple, it's not as easy as it sounds, right? And the thing that concerns me most is that we have people using AI models and receiving unexpected results. So they may say, oh, I don't know what this model is going to do; oh, I didn't know it was having this result. But then they may be taking that information and still making decisions, possibly harmful decisions, about people based on what these AI models with their datasets are saying. What are your thoughts?
Zacharias Voulgaris 24:36
That's a very big topic, and perhaps we can analyze it more in another episode. But the topic of AI is a big thing. And if we're just figuring it out now, that doesn't mean that it's a new problem. I remember reading about this a couple of years ago, at least, from a friend of mine who worked for the EU then. There is a series of documents there about the ethics of AI, about responsible AI, and things like that. So this is not a new problem, and it's not the first time a solution has been attempted. It's a very deep problem. And it's not really the AI that is the problem itself; it's the fact that the AI is not clear about how it's doing what it's doing, and about what it does when the data is not what it expects. There is this thing called data drift, for example, in machine learning, when you have data that is different from what you used before to train your model. This happens, and it causes the model to perform worse, so there's a drift in the performance. And that's one of the biggest problems in model maintenance. With AI, it's more obvious, because AI models tend to go for very high performance, so even a slight drop in performance will be more noticeable. And the fact that you can't really explain what's happening under the hood of such a model, even if you have built the model yourself, makes things more complicated. It's bad enough that there is a problem of data quality falling as time goes by, especially if there are many new users using that particular model; but the fact that, if something goes wrong, you can't debug it easily makes it a liability more than anything else. Nevertheless, that doesn't mean that all AI systems have this issue.
It's just that whenever you really want to make decisions, and you want to be able to understand the why behind the decisions, they may not be able to help you there much. Fortunately, there is a solution for this, but it's still in the works; there is no production-ready model for it yet, but the methodology is being investigated. And that's got to do with Transparent AI, or Explainable AI, as it's sometimes called. And that is promising to solve a lot of these problems, because you'll be able to pinpoint a problem early on, and also, if there's a problem, you'll be able to understand where exactly it stems from.
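[Editor's note: the data drift Zacharias describes is often monitored with a simple distribution comparison; one common heuristic is the Population Stability Index (PSI). The sketch below is illustrative, not from the episode; the bin count and the 0.25 alert threshold are widely used conventions, not fixed standards.]

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index: a simple drift score comparing the
    distribution of a training sample against new, incoming data."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        # Small floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [x / 100 for x in range(1000)]             # training distribution
live_ok = [x / 100 for x in range(1000)]           # same distribution
live_drifted = [5 + x / 100 for x in range(1000)]  # shifted distribution

print(round(psi(train, live_ok), 3))       # near zero: no drift
print(round(psi(train, live_drifted), 3))  # large: investigate or retrain
```

A check like this runs alongside the model in production; a PSI above roughly 0.25 is a common (if informal) signal that the incoming data no longer resembles what the model was trained on.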
Debbie Reynolds 27:32
I like to work with people who are in data fields. So regardless of what they're doing in those fields, I think it's really cool. So I like the way that privacy folks can work with data science people, like yourself. Where do those two disciplines intersect? And how do you see your role in helping Data Privacy professionals?
Zacharias Voulgaris 28:10
Well, for the first part of your question, I would say wherever there is some explainability involved, where we try to understand why things are the way they are and how decisions are made based on the data. That's something that involves both fields, because it's all about the variables in that case. In many cases, this is not possible because the variables are obscured, and that's one layer of privacy that is evident, and it's often required. Now, some people do that because they have to, and other people just do it without really thinking about it. Ideally, both kinds of professionals would talk to each other and coordinate, so nobody would be doing something that may expose someone's privacy. And at the same time, when somebody is masking variables, for example, or combining variables in a way that cannot be traced back to the original variables, they would be doing it for the right reasons, not just because they can, because there's a cost involved in everything. So why not make the most strategic decision about it first and then start doing stuff, instead of just doing it and hoping for the best? For the second question, I would say that there is no clear gap between the two professions after a certain level. I believe that after being in a field like data science for a while, it's impossible to disregard the privacy aspect; you can't do that, otherwise you're not really doing a service to the field. And I imagine someone who delves into privacy understands data to some extent too. So there is a bridge of this form at the higher levels; the challenge is having this connection at the lower levels as well. And I think this can be done easily through good leadership.
So, starting from the CTO perhaps, or the chief science officer, or whoever is responsible for all these pipelines in an organization, make sure that it's communicated everywhere, across all levels of the analytics departments, that privacy is to be respected and taken into account in everything, so that everyone, even the junior-level analysts, can have an idea of why this is important and what they can do at their level to preserve it. And the same can be done from the privacy side of things: if privacy professionals at all levels have a better understanding of what the data professionals do and why, there will be a better match and better collaboration and communication.
Debbie Reynolds 31:21
I agree with that. I'm hoping the trend is changing now. Because early on, in my view, the way people thought about attacking the issue of privacy was: let's create all these policies and procedures, and then let's try to figure out how they connect to what we're actually doing. Whereas I think the best way to do it is to follow the data, look at what you're doing with the data, and then try to make sure that your policies and procedures align with what you're actually doing. What are your thoughts?
Zacharias Voulgaris 31:58
I agree with that. That makes good sense. If you have internalized the why of something, you don't really need the policies that much, although in the beginning they may be useful for understanding the why. But I think relying too much on external policies and regulations may cause the whole thing to become a bit rigid, while if we understand the value of preserving privacy, and the value of analytics as well, then this will happen organically. I'll give you an example from the privacy world. There are regulations about what passwords should be like: they should have uppercase letters, lowercase letters, numbers, and symbols, and they have to be at least eight characters long. This is a good guideline, or set of guidelines, rather. But is it really essential? If somebody understands the concept of entropy, for example, and the time it takes to guess a password nowadays, they would naturally try to make passwords that are complicated and difficult to guess. So they wouldn't need to follow strict guidelines like this; they would just make a very long string of words that they can remember easily, which probably has a much higher entropy than some eight-letter password that ticks all the boxes of the guidelines. And the plus side of this is that you'll be able to remember this password without having to write it down anywhere, because it's just a string of words, which may be meaningful to you but not to somebody who tries to guess the password. So that's why it's important to understand the why of something before you start doing stuff in this area.
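[Editor's note: the passphrase point can be made concrete with a back-of-the-envelope entropy calculation. This is an illustrative sketch; 94 approximates the printable-ASCII symbol pool, and 7776 is the size of the standard Diceware word list, both assumptions chosen for the comparison rather than figures from the episode.]

```python
import math

def entropy_bits(pool_size, length):
    """Guessing entropy, in bits, of a secret built from `length`
    symbols drawn uniformly at random from `pool_size` options."""
    return length * math.log2(pool_size)

# An 8-character password mixing upper, lower, digits, and symbols
# (~94 printable ASCII characters) vs. a 5-word Diceware passphrase
# (7776-word list):
complex_pw = entropy_bits(94, 8)    # about 52.4 bits
passphrase = entropy_bits(7776, 5)  # about 64.6 bits, and easier to remember

print(round(complex_pw, 1))
print(round(passphrase, 1))
```

The passphrase wins despite using only lowercase words, because entropy grows with length times the log of the pool, which is exactly the "long string of memorable words" intuition above. Note the math assumes the words are chosen randomly, not a favorite quote.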
Debbie Reynolds 33:46
I think you're right. And I think it's very important because, traditionally, explainability hasn't been expected, and companies haven't been as open to explainability on these issues. Because a lot of times, people thought, well, it's just data, so the data is going to tell us the whole story, right? And we know that that's not the way it goes. And I think companies having more clarity on what they're collecting and why they're collecting it will help narrow their risk in a lot of ways. Because instead of just having this indiscriminate kind of data grab and keeping everything forever, we're saying collect less data, collect more relevant data, and really think through why you need it. And then helping people with what they're doing with data on the retention side is important.
Zacharias Voulgaris 34:51
Exactly, yeah. Because this way, it's more targeted, and it also makes things easier on the analytics side, because you don't have to deal with too many variables, and you don't have to deal with too many datasets. And it's easier for everyone involved, because the data will be cleaner and more meaningful in many ways. I have had students who were getting into analytics and data science, and sometimes they were puzzled by these big datasets. And it makes sense if you think about it, because for me, they weren't really that big, but for someone starting off, they seemed very, very complicated, because they had a bunch of different variables, and the students didn't know what to do with all this. Now, if those datasets were smaller, they would be able to do their analyses much faster. And this would happen at every level as well, not just for the junior analysts and data scientists but also for someone at the senior level. So instead of having, say, a million different variables, if we just had a thousand, for example, that would make things so much easier, because even though you wouldn't go through them one by one, but in an automated way, which would be the case, it would still be much easier on the computational side. It wouldn't take too long to analyze all the variables and figure out what kind of liabilities they may conceal. Because if these variables come from PII, then they need to be treated differently, for example through anonymization, pseudonymization, and many other methodologies.
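[Editor's note: one standard way to treat PII-derived variables, as mentioned above, is pseudonymization via a keyed hash. The sketch below is illustrative; the key, field names, and sample records are made up, and in practice the key would live in a secrets manager, never in source code.]

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me-and-store-me-in-a-vault"  # hypothetical key

def pseudonymize(value, key=SECRET_KEY):
    """Replace a direct identifier with a keyed hash token. The same
    input always maps to the same token, so joins and aggregations
    still work, but without the key the original value cannot be
    recovered (a plain unkeyed hash of an email could be brute-forced
    from a list of known addresses, which is why HMAC is used here)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

records = [
    {"email": "ada@example.com", "purchases": 3},
    {"email": "alan@example.com", "purchases": 7},
]
safe = [{**r, "email": pseudonymize(r["email"])} for r in records]
print(safe)  # analytics-ready rows with no raw identifiers
```

Pseudonymized data is still personal data under regulations like the GDPR, since the keyholder can re-link it, but it sharply reduces what a leaked analytics dataset exposes.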
Debbie Reynolds 36:32
So if it were the world, according to Zacharias, and we did everything that you said, what would be your wish for privacy or data science anywhere in the world right now?
Zacharias Voulgaris 36:46
That's a good question. I would say that I would wish that everybody understood the whole thing from different angles. Now, we tend to go into a lot of depth about different technologies, different methods, and algorithms, and all of that is fascinating and really worth pursuing, but it's easy to lose sight of the bigger picture. If we were all to have a good idea of the bigger picture, we would, first of all, be able to communicate better, we would be able to collaborate better, and there would be lower risks to take, because we would know better what we're doing. And at the end of the day, we're doing all this for the end user, so the end user would be bound to be more content, and that would have an effect on everyone, especially those people who take the risks, those who are held responsible for all that stuff. So it's not good just for the people who deal with the data, or the people who get the fruits of this work, but also for the people who lead these operations, which can be a very stressful and difficult position to be in. But if everybody had a good idea of the bigger picture, and there was more transparency in everything, I think it would be better for everyone and for the field overall. Because whenever somebody tried to advance the field through some new method or methodology, they would understand things from different perspectives and be able to contribute in different ways.
Debbie Reynolds 38:26
I agree with that. I think we need to have more cross-disciplinary conversations and discussions, and understanding. I think that definitely will help. Yeah, well, it was great to have you on the show. I really appreciate it. I am looking forward to seeing your book as well. And your podcast is amazing. So I know a lot of people should definitely check it out.
Zacharias Voulgaris 38:56
Thank you. Thank you very much. It's great talking with you about anything data and privacy related. And yeah. I'm looking forward to hearing what you have to say about the book once you take a look at it.
Debbie Reynolds 39:10
Definitely, definitely. I'm looking forward to it. Well, yeah, we'll talk soon. Thank you so much.
Zacharias Voulgaris 39:16
Thank you. Bye bye.