"The Data Diva" Talks Privacy Podcast

The Data Diva E50 - Kurt Cagle and Debbie Reynolds

October 19, 2021 Debbie Reynolds Season 1 Episode 50
"The Data Diva" Talks Privacy Podcast
The Data Diva E50 - Kurt Cagle and Debbie Reynolds
Show Notes Transcript

Debbie Reynolds “The Data Diva” talks to Kurt Cagle, CEO of Semantical, LLC, Editor in Chief of The Cagle Report, and Community Editor of Data Science Central. We discuss his background in technology, what the Semantic Web is, how the Semantic Web will change computing, Data Privacy and the rights of the individual, problems with search and image recognition, his biggest concerns about Data Privacy, the dangers of inference, and his hopes for Data Privacy in the future.




 


SUMMARY KEYWORDS

data, information, semantic web, people, technology, classification system, inference, identify, problem, privacy, classification, area, created, graph, image recognition, bias, talk, relationships, building, terms

SPEAKERS

Debbie Reynolds, Kurt Cagle

 

Debbie Reynolds  00:00

Personal views and opinions expressed by our podcast guests are their own and are not legal advice or official statements by their organizations. Hello, my name is Debbie Reynolds. They call me "The Data Diva." This is "The Data Diva" Talks Privacy podcast, where we discuss Data Privacy issues with industry leaders around the world with information that businesses need to know now. Today I have a special guest on the show. Kurt Cagle is the founder and CEO of Semantical LLC. He's also the Editor in Chief of the popular Cagle Report, which is a newsletter on LinkedIn. Definitely check that out. And he's also the Community Editor of Data Science Central. Welcome today, Kurt.

 

Kurt Cagle  00:49

Thank you very much, Debbie.

 

Debbie Reynolds  00:52

Yeah. I guess I first got to know you on LinkedIn through your Cagle Report, so I subscribed to your newsletter. And there are a couple of fascinating things that you do in your newsletter. I like them because they're kind of long-read pieces, and you really traverse a lot of different topics, not just technology, not just data. So I like the fact that you're multifaceted; you can talk about different things. I think it helps that you're, you know, a writer and editor of novels on top of this. Those things are really cool to me. So why don't you tell me a little bit about yourself and your journey to where you are now in your career?

 

Kurt Cagle  01:44

Oh, I started out many years ago wanting to be a physicist. When I was a kid, I thought the whole idea of describing things as both waves and particles was just really fascinating, which probably tells you more than you need to know about me as a 13-year-old. And so I went off to school. I got a degree in physics but spent probably a good 80% of the time actually in the computer lab, because this was back in the 80s, and all of a sudden you had these great, wonderful personal computers, which was a really, really novel thing at the time. And so despite the fact that I spent most of my time on the computer, I did get a physics degree, and I discovered very quickly that in the 1980s a bachelor's degree in physics was absolutely worthless for anything other than going on and getting a master's degree. So I took advantage of the fact that I'd spent most of my formative years in college working on computers, did computer graphics with PostScript, and worked for a while with a couple of ad agencies doing their computer graphics work. From that I made my way into computer gaming back in the early 1990s, and I spent a little while doing gaming work, probably about five years. And that was about the time the web hit.

While I was doing all of that, one of the things I really concentrated on, increasingly, was: how do you represent data? How do you represent information within your applications? How do you store it? How do you work with it? How do you structure it? Because databases were these big, complex things that you saw in commercial organizations, but for the most part what I was interested in was basically small data: data that essentially came from external sources but that you could still give a structure. About the same time, Tim Berners-Lee had moved on from the XML Working Group and the HTML side of things and was beginning to explore the Semantic Web. And his notion was basically: if I were to represent information as concepts that each had a specific name, where that name was essentially unique, not just a name in terms of the words we think of but a unique identifier, what we today call a Uniform Resource Identifier, then we can think about information as being made up of these names connected by links to other names that represent things. Well, when you start to deal with names and the properties that those names describe, you can say, okay, here is a book, and it has a title, it has these chapters, and so forth. The entity that we're talking about has a specific name that is unique, or can be made unique, not just in terms of a unique key in a database, but really everywhere. And so all of these pieces started coming together, where you could then say, okay, if I have nodes connected by labeled properties to other nodes, what I've created is what's called a graph. And so the Semantic Web ultimately was about building graph technologies. You could think about URLs as being locations, but they could also be thought of as names into this graph, and navigating across that was essentially navigating through a conceptual map of information. And that was essentially the beginning of the Semantic Web.
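
To make the picture of nodes connected by labeled properties a little more concrete, here is a minimal sketch, not taken from the conversation, that stores a few statements as subject-property-object triples and then walks the links. The example URIs, the book, and the helper function are all made up for illustration.

```python
# Minimal illustrative sketch (not from the episode): information as named
# nodes connected by labeled properties, i.e. subject-property-object triples.
# The URIs below are made-up examples standing in for unique identifiers.

triples = {
    ("http://example.org/book/moby-dick", "title", "Moby-Dick"),
    ("http://example.org/book/moby-dick", "author", "http://example.org/person/melville"),
    ("http://example.org/person/melville", "name", "Herman Melville"),
    ("http://example.org/person/melville", "bornIn", "http://example.org/place/new-york"),
}

def properties_of(node):
    """Return every (property, value) pair attached to a node in the graph."""
    return [(p, o) for (s, p, o) in triples if s == node]

# Navigating the graph: start at the book, then hop across the 'author' link.
book = "http://example.org/book/moby-dick"
for prop, value in properties_of(book):
    print(book, "--", prop, "-->", value)

author = next(o for (s, p, o) in triples if s == book and p == "author")
print(author, "->", properties_of(author))
```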

 

Debbie Reynolds  07:18

Yeah. Let's back up a bit and make sure people understand. You've sort of answered a bit of a question I was going to ask, but I want to make sure people understand what the Semantic Web is and how it's different from maybe what they think it is.

 

Kurt Cagle  07:33

Okay. So, first of all, the Semantic Web sounds scary. In fact, there's this whole space that talks about taxonomies and ontologies and semantics and other big scary Greek and Latin words that most people look at and go, I'm not so sure about this; it sounds more like philosophy than programming. And to a certain extent, that's largely where it came from. The idea of connecting things in a graph has actually been around for a while. If you look at any kind of structured or unstructured data, that data can be thought of as a graph, and it depends upon whether that graph is tree-like, the way most documents are, or whether it's something that's more intricately connected. There are terms that get thrown around, things like directed acyclic graphs, or DAGs, and directed cyclic graphs, or DCGs, and things like this that really only matter to an academic. But the idea behind it is actually pretty simple, and that is that you have information, the things that you're describing, and each of those things has a property, or a set of properties, metadata that describes the thing's characteristics.

And so you can start using what I would call logical formalisms. For a typical reader, you can think of it as logic: ways of being able to say, here's a set of assertions, and if this assertion is true, and this assertion is true, and the object of one assertion is the subject of another assertion, then you can start reasoning on that information. A good way of thinking about it is: I have a brother, and that brother and I happen to share a father. So there is a relationship that exists, and I can define that relationship in terms of other things. I can say, okay, if I have the concept of a father, and I have the concept of a sibling, and that sibling has the concept of gender, then I can say: I know that this person is my father, and this other person also has that same father, so I have a relationship, a brother. Now, the same thing applies on the other side, and genealogy gets a little more complicated from there.
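
As an illustration of that kind of chaining, here is a toy sketch, with hypothetical names, that derives sibling relationships from shared-father assertions. The predicate names and the rule itself are invented for the example rather than quoted from any particular Semantic Web toolkit.

```python
# Toy sketch (illustrative, not from the episode): deriving new facts by
# chaining assertions. If two different people share the same father,
# infer that they are siblings.

facts = [
    ("Kurt", "hasFather", "Roger"),   # hypothetical names for illustration
    ("Brian", "hasFather", "Roger"),
    ("Alice", "hasFather", "Tom"),
]

def infer_siblings(assertions):
    inferred = set()
    for (a, rel_a, father_a) in assertions:
        for (b, rel_b, father_b) in assertions:
            if rel_a == rel_b == "hasFather" and father_a == father_b and a != b:
                inferred.add((a, "hasSibling", b))
    return inferred

print(infer_siblings(facts))
# {('Kurt', 'hasSibling', 'Brian'), ('Brian', 'hasSibling', 'Kurt')}
```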

 

Kurt Cagle  11:35

But the idea is that these relationships you're defining can then be tied into categories. So you can talk about this in terms of, okay, I have the class of people, and some of those people share these relationships. And because they share these relationships, I can identify the patterns of those relationships, and with those patterns I can start querying and building inferences and definitions. So if I then go and say, I have a relationship with my father, and he also has a father, and that father has another son, then I can define in terms of that a relationship that says this person is my uncle. So a lot of the Semantic Web, when you get right down to it, is basically talking about classes of things, people, animals, pets, buildings, locations, books, ideas, whatever, and the relationships that they have. And those together once again make up a graph and make up a way of storing information.

Now, we've been doing this for a while. Way back in the 1970s you had Ted Codd, who was one of the instrumental people in creating relational databases in the first place. He wrote fairly extensively about it and introduced the notion of what would become SQL. And he said, you know, these are still graphs, these are still things that we can talk about. However, at the time, the capabilities of what he was working with were just not sufficiently strong; it just wasn't fast enough to handle more complex structures beyond, you know, here are tables, here are rows, and here are columns. But you can think of the graph this way: if you take those tables and rows, each row essentially becomes an identifier for a thing, and each column becomes a property, then a relationship gets formed, and those relationships become a graph, which essentially becomes a way of describing entities at a basic schematic level. Right, so I'm not sure if that necessarily helps, but the idea really is that when you talk about semantics, what you're talking about are these relationships and how they build.
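
A small sketch of that last point, using a made-up two-row table: each row becomes an identifier and each column a labeled property, which is all a graph needs. The table, the column names, and the helper are assumptions for illustration, not anything from the episode.

```python
# Illustrative sketch (assumed example): turning a relational-style table
# into graph triples. Each row becomes an identifier for a thing; each
# column becomes a labeled property pointing at a value.

rows = [
    {"id": "person/1", "name": "Ada", "city": "London"},
    {"id": "person/2", "name": "Kurt", "city": "Seattle"},
]

def rows_to_triples(table):
    triples = []
    for row in table:
        subject = row["id"]
        for column, value in row.items():
            if column != "id":
                triples.append((subject, column, value))
    return triples

for t in rows_to_triples(rows):
    print(t)
# ('person/1', 'name', 'Ada'), ('person/1', 'city', 'London'), ...
```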

 

Debbie Reynolds  15:05

So is the purpose of the Semantic Web to enable organizations, or our technology, to make decisions faster?

 

Kurt Cagle  15:22

Yes, among other things. You can think about it in terms of how you're storing information. And this actually gets into one of the distinctions between two different approaches to AI. One of those approaches is essentially classification, where you say, I have a bunch of pictures of cats and dogs or other pets, and each one of these we're going to run through a process, we're going to do these image recognition pieces to identify certain characteristics, and work with a labeled data set that identifies, okay, this is a picture of a dog, this is a picture of a cat. And we've got 10,000 of these pictures. That happens to work pretty well when you're talking about information that is categorical in nature and is largely dependent upon identifying certain patterns. But it's a very brute-force method of getting information. The problem you run into with that is that as you're working with this, you can say this is a cat, this is a dog, this is a rabbit, but unless you have specifically added in some kind of relational factor that says, okay, a cat is a pet, a pet is an animal, a dog is a pet, that kind of information gets lost. And when you start talking about semantics, you can start building up relationships, things like taxonomies, the Linnaean taxonomy, where you can say, okay, if I can identify this thing as a cat, then I also happen to know that it's an animal, I know that it needs to be fed, because animals need to be fed, I know that it needs to be registered, because pets need to be registered, and because pets need to be registered, I need to have a way of registering somebody's pet. And you start building on these relationships that exist. It is that kind of relationship information, that categorization aspect, that's actually very difficult for machine learning to facilitate.

So one area where the Semantic Web really helps is that it provides a convenient index, or what's called a rubric, for identifying certain concepts. And those rubrics are a lot like the kinds of rubrics you already know; the Dewey Decimal System is a pretty good example. The Dewey Decimal System was a classification system developed in the late part of the 19th century. It essentially said, okay, we're going to take all of knowledge, and we're going to assign all of that knowledge to different numbers and different collections of numbers. So one range of numbers is a description of basic philosophy, another gets into areas like mathematics and stuff like that, another covers the sciences generally: physics, chemistry, biology, whatever. Each of these numbers essentially forms a classification system. Now, I could take a machine learning system and say, here is a book, and that book includes a Dewey Decimal System number, a DDS number, and from the structure that we're talking about I can actually infer, to a certain extent, after I have a lot of data, what that structure looks like. But it takes a lot of data to get that inference right. Or I could say, okay, there's a classification scheme that I'm using called the Dewey Decimal System, and it has these numbers, and as you're classifying this information, you can use those numbers to identify, given a certain set of topics, where this book goes. And so that first, bottom-up approach is really how machine learning works. The top-down approach, where you're saying, I'm creating a structure for information, is essentially an example of a top-down classification system, a taxonomy, or, in a little broader sense, an ontology.
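
Here is a tiny sketch of that top-down idea: a rubric declared up front, with items classified by lookup rather than learned from lots of examples. The number ranges below are simplified stand-ins rather than an authoritative copy of the Dewey tables.

```python
# Sketch of a top-down classification rubric (simplified ranges, not the
# official Dewey Decimal tables): the structure is declared up front and
# items are classified by lookup, rather than inferred from training data.

rubric = [
    (0, 99, "General works"),        # illustrative ranges only
    (100, 199, "Philosophy"),
    (500, 599, "Natural sciences & mathematics"),
]

def classify(call_number):
    for low, high, topic in rubric:
        if low <= call_number <= high:
            return topic
    return "Unclassified"            # the 'other' bucket

print(classify(530))   # -> 'Natural sciences & mathematics'
print(classify(150))   # -> 'Philosophy'
print(classify(700))   # -> 'Unclassified'
```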

 

Debbie Reynolds  21:20

Yeah, that's fascinating. So let's talk a bit about the individual, because I am very interested in privacy, right, both the legal and the technology issues. The way that you describe the Semantic Web, and the bottom-up versus top-down ways to look at data that you just talked about, is interesting because I guess you can have problems with both. But the thing that concerns me a lot related to privacy and individuals is that if a person doesn't fit neatly into a category, or if there's outlying information that can't really be matched to a particular categorization, I think that's a way that bias can reveal itself in data. What are your thoughts?

 

Kurt Cagle  22:22

Absolutely. And that's true for both kinds of technologies. When you get into machine learning, which is really where I'm spending a lot of my time now, what you're doing is basically saying, I'm going to take a data set, and that data set gets run through a neural network. And that neural network is then going to take that information and utilize it in such a way that it says, okay, I've done my a priori classification that essentially identifies the kinds of things I'm expecting to find; this is how I read this information. And when you talk about labeling, what you're actually talking about is classification; it's a different notation because the two came up from different directions. But the idea in both cases is that you have various classification systems, and then you can effectively say, if I take this and shard the space up into various places where you have concentrations of information, then each of those basically represents a thing. And depending upon the kind of machine learning you're talking about, you have some types which are essentially labeled and then trained, and you have some which essentially do the training first and the labeling afterward, so you identify clusters of information and then say, okay, I'm going to label this in this way. But both of them work from much the same idea of saying, okay, we're going to identify these clusters, and based upon the clusters we will then apply the rubrics that we're talking about: here is the category. The problem with that is that it really comes down to the data being utilized. One thing that ontologists do, and I'll jump into that in a minute, is they actually try to make their models as unbiased as possible. So they look at edge cases, they look at places where there are certain constructs, and they say, okay, what about this particular case? Does it fit neatly into this set of attributes that can effectively identify something? If it does, that's great: you have a classification, you have a mechanism to say, here is a concept that we're going to call, you know, this is a cat, this is a dog. But sometimes you get a fox. And foxes are really problematic, because if you don't define foxes but go in and say, here is the classification system that I've created, then they tend to get misclassified. That creates bias, because a cat is not a fox and a dog is not a fox; they have similar characteristics, but they're different animals, different species. From a machine learning perspective, on the other hand, if you have these foxes getting misclassified, miscategorized, that also creates bias, and the danger there is that the bias is sometimes a lot more hidden, because you don't necessarily have someone acting as a governor of the concepts you're dealing with. And it goes both ways; people bring bias when you start putting things into categories. A prime example of this is when we talk about gender. The FBI, according to the National Information Exchange Model, currently has seventeen different classifications for gender; they tried very hard to look at all the possible edge cases that are still large enough to cover more than a single individual.

You know, you're always going to have people that end up in the "other" bucket, but you want to make sure in your ontological design that the other bucket is as small as conceivably possible. And if there's something where you look at that other bucket and say, okay, this happens to be a platypus, then you can go in and say, you know, maybe this thing isn't really a mammal at all. And in fact, if you talk to biologists, they will say, yeah, the platypus really isn't a typical mammal; it's kind of somewhere between a bird and a mammal, but it's really its own little thing. But because there are so few of them, it tends to get lumped in with larger groups. And so that's one place where bias becomes fairly significant: in that particular classification step.
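
To show how the "fox problem" surfaces in practice, here is a toy sketch with fabricated two-dimensional features and a nearest-centroid rule standing in for a real model: trained only on cats and dogs, it has to force a fox into one of those buckets unless it is given an explicit "other" bucket.

```python
# Illustrative sketch (toy numbers, not a real model): a classifier trained
# only on "cat" and "dog" forces a fox into one of those buckets. Adding a
# distance threshold gives the fox an explicit "other" bucket instead.
import math

# Hypothetical 2-D feature centroids, e.g. (ear_pointiness, snout_length)
centroids = {"cat": (0.9, 0.2), "dog": (0.3, 0.8)}

def classify(features, threshold=None):
    distances = {
        label: math.dist(features, centre) for label, centre in centroids.items()
    }
    label, distance = min(distances.items(), key=lambda kv: kv[1])
    if threshold is not None and distance > threshold:
        return "other"          # surface the edge case instead of hiding it
    return label

fox = (0.85, 0.7)               # pointy ears *and* a fairly long snout
print(classify(fox))                 # forced choice -> 'cat' here (biased)
print(classify(fox, threshold=0.4))  # -> 'other', flagged for human review
```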

 

Debbie Reynolds  29:17

You know, a dear friend of mine, David Walker, did a reverse image search on the Internet. He has a famous profile picture of himself in a yellow shirt and a baseball cap, I think it's like a Marines cap or something that has a little yellow in it. So he did a reverse image search, and it came up with pictures of men in baseball hats. Not all of them had on yellow shirts, but there were some yellow elements in them, and it was guys of all ages and all races. So, you know, as a novelty search, right, that's not a big deal. But if the same technology and thinking and logic are used in something like a facial recognition database, you know, that's a problem. Right?

 

Kurt Cagle  30:04

Well, what makes that even more problematic is that you have competing interests, each of which has a particular desire to use that technology. And image recognition is certainly a big one, particularly since image recognition is, first of all, relatively easy to spoof. There were some very interesting studies done, which kind of became a meme at one point, where people would put bars of paint on their faces in strategic places, and it would just completely confound the image recognition systems: I don't know what this is. And even with that, you get into the question of why, or for what purpose, are you using this? Image recognition by itself is a pretty cool technology. Hi, Dave, you know, you may open the pod bay door now. But this gets into an area where you start talking about the implications of Artificial Intelligence, in particular within the ethical constraints of society. And the problem you run into with that is that, again, good data is relatively hard to come by. This is actually one of the great myths of the Big Data era. Yes, there's a lot of big data out there, but most of it is garbage, because most of the data that you're dealing with does not actually come about for the purpose of doing the kinds of classifications that you're looking at. Most data, the vast majority of it, was created as transaction data: the records that you keep from transactions. What you generally don't have, because it gets into the arena of marketing, and it gets into the arena of what can be publicly traded, is the metadata that describes what's on each end of that transaction: who these people are, what their interests are, why they're buying. That's something that marketers love; that's the information they live for, and they want to be able to get that information. But it's also something that, for exactly the same reason, many of us don't want to have happen. And that's because it essentially becomes a question of who owns that metadata, how do you access that data, who has access to that data, and under what circumstances?

 

Debbie Reynolds  33:48

So what is your biggest concern right now around data privacy and data in general? I don't know about you, but I feel like we're going in two different directions in some ways. The laws are trying to give people more transparency and more agency over their data, or attempting to, but we keep finding new ways to collect more and more data. So I think we're chasing the technology, chasing what's happening with data right now. What are your thoughts?

 

Kurt Cagle  34:24

Oh, yes, absolutely. I think there are a couple of conundrums that we're facing right now. One of them is in the arena of inference. Inferencing is actually one of those areas that, from the semantic standpoint, is a fairly major part of developing any kind of knowledge graph. The problem with inferencing is that it is specifically designed to surface information that was not known previously, or at least was not known as a whole previously, until you get enough information together to be able to identify it. An example of this: with just a few pieces of generic information, you know, your zip code, your age, what kind of car you drive, by having those three pieces of information I can probably narrow things down from potentially millions of people to maybe a few dozen. And it doesn't take much for that inferencing process to get to the point where, even if you have clean data, even if you have data that has been scrubbed of PII and PHI, personal health information that's protected for the privacy of patients, even if you have that information in a place where you've created a way to cleanse the data set, the problem is that it doesn't take much to re-identify an individual with a fairly high degree of probability. And that is going to be the bane of privacy experts for pretty much the foreseeable future, because the ability to stay anonymous is so completely compromised at this point. This was a discussion that I had with a number of health experts during the early days of the pandemic, specifically talking with people in the European Union. And the problem they ran into was that they had the data, they could do inference on the data, but they couldn't legally utilize the data, because the laws that were in place had not really taken into account the speed at which this technology has evolved. So you've got a lot of what's going on with GDPR, you have a lot of what's going on with the California data initiatives, and so forth. Both of them are well-intentioned; I'm just not sure that in practice they can be that effective. And I think that's a real challenge moving forward.
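
A quick sketch of the narrowing-down effect he describes, using a fabricated toy population and three quasi-identifiers. The field names and values are invented purely for illustration.

```python
# Toy sketch (fabricated records, for illustration only): how a few
# quasi-identifiers shrink an "anonymous" population to a handful of people.

population = [
    {"id": 1, "zip": "98101", "age": 44, "car": "Subaru Outback"},
    {"id": 2, "zip": "98101", "age": 44, "car": "Honda Civic"},
    {"id": 3, "zip": "98101", "age": 31, "car": "Subaru Outback"},
    {"id": 4, "zip": "60601", "age": 44, "car": "Subaru Outback"},
    # ...imagine millions more rows, none of them carrying a name
]

def narrow(records, **quasi_identifiers):
    """Keep only the records matching every supplied quasi-identifier."""
    return [r for r in records if all(r[k] == v for k, v in quasi_identifiers.items())]

step1 = narrow(population, zip="98101")          # one zip code
step2 = narrow(step1, age=44)                    # plus an age
step3 = narrow(step2, car="Subaru Outback")      # plus a car: often one person
print(len(step1), len(step2), len(step3))        # each attribute shrinks the set
```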

 

Debbie Reynolds  38:48

I'm glad you brought up inference, because I feel like inference can be extraordinarily dangerous depending on how people are using it, right? Especially since what I'm seeing is people using technologies that were developed for one purpose and trying to shoehorn them into another purpose. That may create inferences, and as a result of those inferences a company or organization may take actions against people's life or liberty that could be harmful.

 

Kurt Cagle  39:20

Well, I think a good case in point there was the actions of Cambridge Analytica. I think Cambridge Analytica should be a case study for anyone who is involved in the ethics of privacy and information. Because what you had in that particular case was Robert Mercer basically setting up a company that utilized the mechanisms of inferencing, you know, pushing and pulling information, to essentially create profiles. And this is what any campaign data organization does, but they were then utilizing it to generate misinformation that would be targeted at candidates, or at the potential voters of candidates. And unfortunately, it was very effective, even though it wasn't really all that terribly ethical. Because it gets down to a lot of the ways that we tend to think. You know, human beings don't like to admit this, but they can be programmed. And they can be programmed in certain ways: by repetition of information, by creating certain conditions that tend to lead toward conclusions that are erroneous, by effectively controlling the space, the information space, around people, and by eventually subsuming them in it so that they're captured. And this is likely to continue to be a problem, because we have tools, both the bottom-up tools that you see with machine learning and the top-down tools that you see with semantics, where you can not only identify but also socially engineer a response back into society, and that is very worrying. And, you know, I think that, to me, was one of the big problems that I saw with this last administration, the Trump administration. I'm not going to get too political here, but the Trump administration basically used a lot of the same kinds of mental control mechanisms, largely predicated upon understanding and control of a fairly tightly constrained information audience, to be able to significantly change the way that people thought. And there's a term for that, the Overton window: what gets discussed and how it gets discussed. And this is actually a fairly critical aspect from the perspective of being a journalist. When we're talking about this kind of information, it really does come down to, okay, what are the ethical boundaries? Where do you draw the line between telling what happened and influencing the thought process? And it's one of the reasons that I see the influence of journalists and influencers as being both inescapable and potentially fairly pernicious in terms of the overall effects that it has upon society.

 

Debbie Reynolds  44:34

I would love to know: if it were the world according to Kurt and we did everything that you said, what would be your wishes for privacy anywhere in the world, whether it's law, technology, anything?

 

Kurt Cagle  44:49

Nice simple questions here. I think one of the things that we need to do is to develop a mechanism to effectively identify the provenance of information, and from that, to start making sure that the mechanisms we have for the transmission of that information and that knowledge, because they are increasingly becoming digital, have some way to say: this information is biased, this information is not, this information is biased in this way, this information is relatively neutral. And there are pieces in place that could make that possible. A lot of the work being done right now with blockchain, with distributed ledger technology, is one example of that, where you can say, this is where the information has been, this is what establishes the trail of authenticity. It's also the same kind of mechanism for creating certificates of authority, or for building authority. Now, one of the things that you discover about the way that we build, and have built, our society is that we don't have a good, solid concept of how to deal with trust in a digital world, and I'm not really sure that we can. You know, I've actually been something of a skeptic about self-sovereign authentication and self-sovereign systems, primarily because I believe the biggest problem they have is that they don't necessarily provide the ability for someone to sue someone else based upon a failure of authentication. And that's actually very important. It may sound awful, you know, that what we need is even more lawyers in the mix, but ultimately what it does come down to, and I think this is what trust comes down to, is that trust implies a certain degree of liability. If I assert something, and you do something based upon that assertion, and it turns out that damage gets caused, where's the responsibility for that? Is it with you? Or is it with the person who provides the assurance, the surety, or the person who accepts the surety? And I think that you need to have that. It's a social construct, not a programmatic one, and I don't think you can build a programmatic construct, at least not yet, that is capable of managing that surety.
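
As a rough illustration of a provenance trail, here is a minimal sketch of a hash-chained, append-only log. It is a toy stand-in for the distributed ledger idea, not any particular blockchain product, and the records and sources are made up.

```python
# Minimal sketch (toy append-only log, not a real blockchain or DLT product):
# each record carries the hash of the previous one, so any tampering with
# the provenance trail becomes detectable.
import hashlib, json

def add_record(chain, statement, source):
    previous_hash = chain[-1]["hash"] if chain else "genesis"
    body = {"statement": statement, "source": source, "prev": previous_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    chain.append({**body, "hash": digest})
    return chain

def verify(chain):
    """Recompute every hash; return False if any link has been altered."""
    for i, record in enumerate(chain):
        body = {k: record[k] for k in ("statement", "source", "prev")}
        expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if record["hash"] != expected:
            return False
        if i > 0 and record["prev"] != chain[i - 1]["hash"]:
            return False
    return True

ledger = []
add_record(ledger, "Claim X was published on 2021-10-19", "example.org/newsroom")
add_record(ledger, "Claim X was quoted by outlet Y", "example.org/outlet-y")
print(verify(ledger))            # True: the trail checks out
ledger[0]["statement"] = "Claim X was never published"
print(verify(ledger))            # False: the provenance trail has been altered
```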

 

Debbie Reynolds  49:06

That's a heck of an answer to that question. That's wonderful.

 

Kurt Cagle  49:12

I mean, I've been thinking a lot about it. You know, I think it's something that we need to be very careful about, because there's so much of a temptation to want to use technology to solve sociological problems. And unfortunately, technology doesn't solve such problems. Technology only facilitates the means by which people can accomplish things more efficiently. If those things are wrong, then they're wrong more efficiently. That's what it comes down to.

 

Debbie Reynolds  49:53

Right. Well, this is wonderful. Wow, I could listen to you talk for hours, and I highly recommend that people go on LinkedIn and subscribe to your Cagle Report newsletter, because it's always fascinating. And I love the fact that you mix it up, so it's not always about just one thing; it's different stuff. You've done some fascinating things about names and naming that you all have to check out; it's really good. So go on his LinkedIn under Articles, I believe that's where the newsletter shows up, and definitely subscribe to it. So thank you so much. I really appreciate you being on the show. This is wonderful.