Welcome back. In this video, we are going to talk about semantic similarity. Right at the start, I'm going to ask you a question: which pair of words is the most similar among the following: deer and elk, deer and giraffe, deer and horse, or deer and mouse? What do you think? Hopefully, you answered deer and elk. But how can you quantify this? Why do deer and elk seem more similar than the other pairs here? To help us with that, we can use semantic similarity resources, and we are going to talk about those in more detail now.

First, let's look at the applications of semantic similarity. Semantic similarity is useful when you are grouping similar words into semantic concepts, that is, into groups of words that appear to share the same meaning. It is also a very useful building block in natural language understanding tasks, such as textual entailment and paraphrasing. Paraphrasing is the task of rephrasing or rewriting a sentence into another sentence that has the same meaning. Textual entailment is a little more complex: given a text passage and a sentence, you need to decide whether the sentence can be inferred from, that is, entailed by, the information in the passage. These are typical tasks where semantic similarity helps.

One of the resources useful for semantic similarity is WordNet. WordNet is a semantic dictionary of words interlinked by semantic relationships. It is most extensively developed for English, but WordNets are now available for quite a few languages. WordNet includes rich linguistic information: the part of speech (whether something is a noun, an adjective, or a verb), word senses (the different meanings of the same word), synonyms (other words that mean the same thing), hypernyms and hyponyms (the "is-a" relationship, for example, a deer is a mammal), meronyms (the part-whole relationship), and derivationally related forms. WordNet is also machine readable and freely available, so it is used extensively in natural language processing and, more generally, in text mining tasks.

How do you use WordNet for semantic similarity? WordNet organizes information in a hierarchy, a tree. There is a dummy root on top of all words of the same part of speech: nouns have a dummy root, verbs have a dummy root, and so on, so you have a separate hierarchy for each part of speech. Many semantic similarity measures use this hierarchy in some way. Let's take the example of the deer we started with. Deer, elk, giraffe, horse, and so on are grouped together, in some form, in this hierarchy. For example, elk, wapiti, and caribou are all types of deer, so they sit below deer. Deer and giraffe are siblings in the tree because they are both ruminants. Horse is related, but it sits in a different subtree: horse and deer are both ungulates, but they are not siblings.
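As a quick illustration, here is a minimal sketch of how you could explore this part of the WordNet hierarchy through NLTK; the exact synsets returned depend on the WordNet version installed on your machine.

```python
import nltk
from nltk.corpus import wordnet as wn

# one-time download of the WordNet data, if you don't already have it
# nltk.download('wordnet')

# all senses (synsets) of the word "deer"
print(wn.synsets('deer'))

# pick the first noun sense, 'deer.n.01'
deer = wn.synset('deer.n.01')

print(deer.definition())   # the gloss for this sense
print(deer.hypernyms())    # more general concepts, e.g. ruminant
print(deer.hyponyms())     # more specific concepts, e.g. elk, wapiti, caribou
```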
One such measure that uses this hierarchy to define semantic similarity is path similarity. You could imagine starting at one of these concepts and counting how many steps you need to take to reach the other. In other words, you are finding the shortest path between the two concepts in the hierarchy, and similarity can then be measured as inversely related to that distance. For example, deer and elk are in a parent-child relationship, so the distance is one, while deer and giraffe have a distance of two, because you go up to ruminant and back down to giraffe. One common way to turn the distance into a similarity is to compute one over the distance plus one. So for deer and elk that is one over two, which is 0.5. For deer and giraffe it is one over three, about 0.33. And if you count the same way from deer to horse, the path has six steps (one, two, three, four, five, six), so the similarity is one over seven, about 0.14.

Another way to relate two concepts is through what is called the lowest common subsumer: the ancestor, lowest in the hierarchy, that is shared by both concepts. For example, the lowest common subsumer of deer and giraffe is ruminant. Even though ungulate and even-toed ungulate are also ancestors of both, ruminant is the lowest one in the hierarchy. For deer and elk, the lowest common subsumer is deer itself, because deer is the parent of elk. For deer and horse, it goes all the way up to ungulate.

You can use this lowest common subsumer notion to define similarity; this was proposed by Lin and is called Lin similarity. It is a similarity measure based on the information content of the lowest common subsumer of the two concepts. For two concepts u and v, the formulation is LinSim(u, v) = 2 · log P(LCS(u, v)) / (log P(u) + log P(v)), where LCS(u, v) is the lowest common subsumer, and the probabilities come from information content learnt over a large corpus.

How do you do all of this in Python? In Python, and especially in NLTK, a lot of these semantic similarity measures are already available for direct use. You can say import nltk and from nltk.corpus import wordnet, and then find the appropriate sense of the word you want to compare. For deer you ask for the synset deer.n.01; that means: give me deer in its noun sense, and the first meaning of that. In the same way you find the synset that corresponds to elk.n.01, and so on. Once you have the proper sense of each word, you can use it to compute similarity: deer.path_similarity(elk) or deer.path_similarity(horse). Recall that deer and elk are in a parent-child relationship, so the path similarity is 0.5, while deer and horse are in different subtrees at a distance of six, so the similarity is one over seven, about 0.1428.

If you use Lin similarity instead, you need an information content measure in some form; let's say we use the information content computed from the Brown corpus. You say from nltk.corpus import wordnet_ic, define brown_ic from the Brown information content data, and then call deer.lin_similarity(elk, brown_ic), and the same for horse with brown_ic. You will see that the similarities come out differently: Lin similarity is about 0.77 for deer and elk, and about 0.86 for deer and horse. Notice that this measure does not use the distance between the two concepts explicitly. Deer and horse, which were quite far apart in the WordNet hierarchy, still get the higher similarity. That is because, in terms of the information content learnt from typical corpora, deer and horse are fairly close: they are both, basically, common mammals. Elk, on the other hand, is a very specific kind of deer, so under Lin similarity it does not come out as close.
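Putting these calls together, a minimal sketch might look like the following; the printed values are the ones quoted above and may differ slightly depending on your WordNet and information-content data.

```python
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# nltk.download('wordnet'); nltk.download('wordnet_ic')  # one-time, if needed

deer = wn.synset('deer.n.01')
elk = wn.synset('elk.n.01')
horse = wn.synset('horse.n.01')
giraffe = wn.synset('giraffe.n.01')

# path similarity = 1 / (shortest path length + 1)
print(deer.path_similarity(elk))     # 0.5
print(deer.path_similarity(horse))   # ~0.1428

# lowest common subsumer (NLTK calls it the lowest common hypernym)
print(deer.lowest_common_hypernyms(giraffe))   # [Synset('ruminant.n.01')]

# Lin similarity, using information content estimated from the Brown corpus
brown_ic = wordnet_ic.ic('ic-brown.dat')
print(deer.lin_similarity(elk, brown_ic))      # ~0.77
print(deer.lin_similarity(horse, brown_ic))    # ~0.86
```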
A different way of measuring similarity uses distributional similarity and collocations. The idea behind collocations is captured by the saying, "You know a word by the company it keeps." That means two words that frequently appear in similar contexts are more likely to be similar, or semantically related. If two words keep appearing in very similar contexts, or one could replace the other in a context while the meaning stays the same, then they are more likely to be semantically related. Here is an example: in these four sentences you have something about meeting at a place, so friends meet at a cafe, or Shyam met Ray at a pizzeria, or let's meet up near a coffee shop, and so on. The words cafe, pizzeria, coffee shop, and restaurant are semantically related because they typically occur around some form of meet, and around words like at, near, and the. There is a determiner right in front of them and some notion of location around them, and those are the things that form the context of the word.

In general, you define context based on the words before, after, or within a small window of a target word. What comes before? For all of these examples, cafe, restaurant, and so on, it was 'a' or 'the', because the target is a noun and there is a determiner right before it. What comes after, or within a small window of, say, size three? You will remember that all of those examples had some form of meet: met, meet, meeting, and so on, within a small window of three to five words. You could also use parts of speech as context: the part of speech of the words before, after, or within a small window, for example noting that the target word occurs right after a determiner. You could use some specific semantic relation to the target word, or you could use the words that come from the same sentence or the same document, where you can define that document to be any length you want, say a passage or a paragraph; that would constitute your context.

Once you have defined this context, you can compute the strength of association between words based on how frequently they co-occur, or how frequently they collocate; that is why these are called collocations. If two words keep occurring next to each other, you would want to say that they are very highly related to each other. On the other hand, if they rarely occur together, then they are not necessarily very similar.

It is also important to account for how frequent the individual words are. For example, the word 'the' is so frequent that it occurs near every other word fairly often. The association score with 'the' would be very high just because 'the' itself happens to be very frequent. There is a way to normalize so that such very frequent words do not swamp all the other association scores, and one way to do it is Pointwise Mutual Information. Pointwise Mutual Information is defined as the log of the ratio of the probability of seeing the word and the context word together to the probability of seeing them independently: PMI(word, context) = log [ P(word, context) / (P(word) P(context)) ]. In other words: what is the chance of seeing the word in the overall corpus, what is the chance of seeing the context word, and what is the chance of actually seeing them together?

Pointwise Mutual Information is also something you can compute directly in NLTK, using its collocations and association measures. For example, you import nltk and its collocations module, create the bigram association measures, and build a bigram collocation finder over a corpus of tokens, given here as text. Then, using the PMI measure from bigram_measures, you can ask the finder for the top 10 pairs. The finder is useful for other tasks as well, such as frequency filtering: suppose you want to keep only bigrams whose words occur together at least 10 times; then you can call finder.apply_freq_filter(10), which discards any pair that does not occur at least 10 times in your corpus.
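As a minimal sketch of that workflow, something like the following should work; here the Brown corpus stands in for the token list text, purely as an example, and the rest follows the calls described above.

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# any list of tokens works as the corpus; the Brown corpus is used here
# just as an example (nltk.download('brown') once, if needed)
text = nltk.corpus.brown.words()

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(text)

# keep only word pairs that occur at least 10 times in the corpus
finder.apply_freq_filter(10)

# the top 10 bigrams ranked by pointwise mutual information
print(finder.nbest(bigram_measures.pmi, 10))
```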
So, the big take-home messages from this discussion are these. Finding similarity between words and texts is non-trivial, but there are many resources available, such as WordNet, that are very useful for capturing semantic relationships and semantic similarity between words. Many similarity functions are defined over WordNet, and NLTK provides a convenient way to access them for tasks such as finding the similarity between words or texts. In fact, you could start from word similarity and then compute the similarity between two sentences. In general, these similarity functions are very useful for natural language understanding tasks. In the next video, we are going to go into more detail about topic modeling. See you there.