Measuring Semantic Similarity with Google
Over the last year I have been actively conducting research in computational linguistics and natural language processing. My PhD thesis concerns the identification of cognates and false friends between Bulgarian and Russian.
In short, false friends are pairs of words in two languages that sound similar but differ in their meaning. Identification of false friends is related to orthographic and semantic similarity, so let’s focus on these problems first.
Orthographic Similarity, Cognates and False Friends
Words in two languages that are orthographically similar (written in a very similar way) can be any of the following:
- False friends (different meanings)
- Cognates (the same or almost the same meaning)
- Partial cognates (some meanings are shared, others differ)
Here are some examples of false friends between Bulgarian and Russian:
- Bulgarian майка (mother) and Russian майка (sleeveless undershirt)
- Bulgarian баба (grandmother) and Russian баба (woman)
- Bulgarian добитък (livestock) and Russian добыток (income)
Here are some examples of true cognates between Bulgarian and Russian:
- Bulgarian сребро (silver) and Russian серебро (silver)
- Bulgarian наука (science) and Russian наука (science)
Here are some examples of partial cognates between Bulgarian and Russian:
- Bulgarian син (blue, son) and Russian синий (blue)
- Bulgarian ягода (strawberry) and Russian ягода (juicy fruit, strawberry)
Measuring Semantic Similarity
Measuring semantic similarity is a well-known problem in linguistics. The goal is to design an algorithm that, given two words, determines how similar they are.
For example, the words nice, beautiful and pretty are synonyms and have high similarity. In contrast, the words computer and vacuum-cleaner are almost unrelated and have low similarity.
To formalize this, we want to differentiate between absolute or almost absolute synonyms (70%-100% similarity), partial synonyms (30%-70% similarity) and unrelated or almost unrelated words (0%-30% similarity). More generally, we want to measure similarity with a real number between 0 and 1.
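As a small illustration, the three bands can be written as a function over a score between 0 and 1. This is only a sketch: the function name and the band labels are made up here, and the text does not say which band the boundary values 0.3 and 0.7 belong to, so assigning them to the upper band is an assumption.

```python
def similarity_band(score):
    """Map a similarity score in [0, 1] to one of the three bands."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("similarity must be between 0 and 1")
    if score >= 0.7:   # assumption: 0.7 itself counts as a synonym
        return "synonyms"
    if score >= 0.3:   # assumption: 0.3 itself counts as a partial synonym
        return "partial synonyms"
    return "unrelated"
```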
Cross-Lingual Semantic Similarity
Semantic similarity can also be measured between words in different languages. In this case it is called cross-lingual semantic similarity. For example the Bulgarian word рокля (dress) and the Russian word платье (dress) have high cross-lingual similarity, but the similarity between the Bulgarian чушка (pepper) and the Russian машина (car) is not high.
Measuring Semantic Similarity by Comparing the Words’ Local Contexts Extracted from Google
In our last few publications at the RANLP conference, my co-authors and I proposed a novel algorithm for measuring semantic similarity that uses Google searches and analyses the local contexts of the target words. I want to share the basic ideas behind it with the community, so I will explain some of the techniques.
Google Web Similarity Algorithm
We measure the semantic similarity between two words using the Web as a corpus through Google searches. The basic idea is that if two words are similar, then the words in their respective local contexts should be similar as well. The idea is formalised using the Web as a corpus and the vector space model for measuring distance. Let’s go into more detail.
Extracting Word’s Local Context
The words that often appear immediately before or after a given word are considered its local context. We believe these words are semantically associated with it. For example, the word science is semantically associated with words like education, art, nature and technology.
The question is how to extract these typical words for a given target word w. Let’s run a Google query for a given word, e.g. science. Below are the first few results:
|Science/AAAS | Scientific research, news and career information | International weekly science journal, published by the American Association for the Advancement of Science (AAAS).|
|Science/AAAS | Table of Contents: 31 August 2007; 317 (5842) | AUSTRALIAN SCIENCE: New Misconduct Rules Aim to Minister to an Ailing System … HISTORY OF SCIENCE: The U.S. in the Rebuilding of European Science …|
|Science News – New York Times | Find breaking news, science news & multimedia on biology, space, the environment, health, NASA, weather, drugs, heart disease, cancer, AIDS, mental health …|
|Science – Wikipedia, the free encyclopedia | Science (from the Latin scientia, ‘knowledge’) is a system of acquiring knowledge based on the scientific method, as well as the organized body of knowledge …|
|Science in the Yahoo! Directory | Explore the fields of astronomy, biology, geology, mathematics, and physics and all of their related disciplines with resources designed for professionals, …|
|The top science news articles from Yahoo! News | Use Yahoo! News to find science news headlines and science articles on space, animals, fossils, biotechnology and more.|
|Science Daily: News & Articles in Science, Health, Environment … | Breaking science news and articles on global warming, extrasolar planets, stem cells, bird flu, autism, nanotechnology, dinosaurs, evolution — the latest …|
|Science.gov : USA.gov for Science – Government Science Portal | Science.gov is a gateway to government science information provided by US Government science agencies, including research and development results.|
Google returns excerpts from Web pages containing the target word science. Let’s examine the words surrounding this target word, say 2-3 words before and after it. We call this set of words the local context of the target word. Some words occur in the target word’s local context several times, so we can also measure their occurrence frequencies. We believe that words occurring more often are semantically more strongly associated with the target word.
We have a big problem: some of the words in the local context are semantically associated with the target word, but some are not. Let’s look more closely: words like journal, knowledge and articles are semantically strongly related to the target word, but some parasite words like news, portal and Yahoo are not related to science and only happen to be in its local context.
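The window-based extraction described in this section could be sketched roughly as follows. This is a minimal illustration, not our actual implementation: the function name local_context, the regular-expression tokenizer and the two shortened excerpts are assumptions made for the example.

```python
import re
from collections import Counter

def local_context(snippets, target, window=3):
    """Count the words found within `window` positions of `target`."""
    counts = Counter()
    for snippet in snippets:
        # Crude tokenizer: lowercase and keep runs of word characters.
        words = re.findall(r"\w+", snippet.lower())
        for i, word in enumerate(words):
            if word != target:
                continue
            before = words[max(0, i - window):i]
            after = words[i + 1:i + 1 + window]
            counts.update(w for w in before + after if w != target)
    return counts

# Two excerpts shortened from the search results above.
snippets = [
    "International weekly science journal, published by the AAAS.",
    "Find breaking news, science news & multimedia on biology.",
]
context = local_context(snippets, "science", window=2)
```

With a window of 2, the word news is counted twice for the second excerpt, which already shows how a parasite word can end up with a high frequency.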
Filtering the Stop Words from the Local Context
Let’s try to filter out the words that are unlikely to be semantically associated. We can remove the so-called stop words: prepositions, pronouns, conjunctions, interjections and some adverbs. These words carry no semantics of their own but are likely to appear around any word. Examples of stop words from the excerpts above are: and, the, a, is, by, etc.
When we remove them, only meaningful words remain, but still not all of them are semantically related to the target word. Lots of parasite words still remain.
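Stop-word filtering over a table of context counts might look like the sketch below. The stop-word list here is a tiny illustrative sample (a real list would be much longer and language-specific), and the counts are made up.

```python
from collections import Counter

# Illustrative sample only; a real stop-word list is far larger.
STOP_WORDS = {"and", "the", "a", "is", "by", "of", "on", "in", "to", "for"}

def remove_stop_words(context, stop_words=STOP_WORDS):
    """Return a copy of the context counts without the stop words."""
    return Counter({w: n for w, n in context.items() if w not in stop_words})

raw = Counter({"the": 12, "journal": 3, "by": 5, "news": 4})  # made-up counts
filtered = remove_stop_words(raw)
```

Note that news survives the filter: it is not a stop word, only a parasite word for this particular target.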
How can we keep only the semantically related words? We could filter them by creating a global list of parasite words, but we can do something better.
Reverse Context Lookup Technique
We believe that if two words W1 and W2 are semantically related, then W1 will be in the local context of W2 and, at the same time, W2 will be in the local context of W1. Thus, if the association between the target word and a context word is not bidirectional, the context word can be removed from the local context.
More generally, we consider two words W1 and W2 to be semantically associated with each other with a weight of p if and only if W1 is found p1 times in the local context of W2 and W2 is found p2 times in the local context of W1, where p = min(p1, p2). We call this value p the level of association of these two words.
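The level of association can be sketched in a few lines, assuming the local contexts have already been collected into a dictionary that maps each word to its context counts. The function names and the sample counts are hypothetical.

```python
def association_level(w1, w2, contexts):
    """p = min(p1, p2), where p1 counts w1 in the context of w2 and vice versa."""
    p1 = contexts.get(w2, {}).get(w1, 0)  # w1 found in the local context of w2
    p2 = contexts.get(w1, {}).get(w2, 0)  # w2 found in the local context of w1
    return min(p1, p2)

def filter_bidirectional(target, contexts):
    """Keep only context words whose association with the target is bidirectional."""
    related = {}
    for word in contexts.get(target, {}):
        p = association_level(target, word, contexts)
        if p > 0:
            related[word] = p
    return related

# Made-up counts: "yahoo" appears near "science", but not the other way round.
contexts = {
    "science": {"education": 5, "yahoo": 3},
    "education": {"science": 4},
    "yahoo": {"news": 7},
}
related = filter_bidirectional("science", contexts)
```

Here education is kept with a level of association of 4, while yahoo is dropped because science never occurs in its local context.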
Algorithms for Extracting Local Contexts from Google
Now we have an algorithm that finds the words in the local context of a given target word. Google returns only the top 1000 results, but this still yields a reasonable number of words. By intersecting the forward context lookup with the reverse context lookup, we can construct a set of words, along with their frequencies, that are semantically associated with a given target word. The steps are:
1) Perform Google lookup for the target word and collect the first 1000 excerpts of texts containing the target word.
2) Remove all stop words using a list of the most common stop words.
3) Extract all words in a sliding window of size 2-4 words around the target word. We believe that a context size of 3 works best. The result is a set of context words and their number of occurrences.
4) Optional step: if you have a lemmata dictionary, replace each word with its basic form. For example, the words scientist and scientists are the same word in different forms and should be processed as the same word scientist.
5) Retrieve from Google the local contexts of all the retrieved context words. Intersect the context words with their local contexts and obtain the co-occurrences between the target word and each context word. Finally, obtain a set of semantically related words and their occurrences.
For example, for the target word science the obtained local context can be as follows:
- education – 780 times
- art – 467 times
- scientist – 260 times
- news – 133 times
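Put together, the five steps above could be sketched as follows. This is an illustration under stated assumptions rather than our production code: fetch_snippets stands for any function that returns search excerpts for a word (it is a placeholder, not a real Google API call), the names context_of and related_words are made up, and the excerpts are toy data. For simplicity the optional lemma replacement is applied before the window is taken.

```python
import re
from collections import Counter

def context_of(word, fetch_snippets, stop_words, window=3, lemmas=None):
    """Steps 1-4: count the words found near `word` in its search excerpts."""
    lemmas = lemmas or {}
    counts = Counter()
    for snippet in fetch_snippets(word):                     # step 1
        tokens = re.findall(r"\w+", snippet.lower())
        tokens = [t for t in tokens if t not in stop_words]  # step 2
        tokens = [lemmas.get(t, t) for t in tokens]          # step 4 (optional)
        for i, token in enumerate(tokens):                   # step 3
            if token == word:
                near = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                counts.update(t for t in near if t != word)
    return counts

def related_words(target, fetch_snippets, stop_words, window=3, lemmas=None):
    """Step 5: keep context words that also have the target in their own context."""
    forward = context_of(target, fetch_snippets, stop_words, window, lemmas)
    related = {}
    for word, p2 in forward.items():
        reverse = context_of(word, fetch_snippets, stop_words, window, lemmas)
        p = min(reverse[target], p2)
        if p > 0:
            related[word] = p
    return related

# Toy stand-in for the search engine (hypothetical excerpts).
fake_results = {
    "science": ["science and education news", "education in science"],
    "education": ["science education for all"],
    "news": ["breaking news today"],
}
related = related_words(
    "science",
    fetch_snippets=lambda w: fake_results.get(w, []),
    stop_words={"and", "in", "for", "the"},
    window=2,
)
```

One reverse lookup is issued per context word, so a real run needs many queries; caching the retrieved contexts would be a natural refinement.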
Measuring Semantic Similarity by Comparing the Contexts
Now we have an algorithm for retrieving the local context of a given word: a set of words that often occur on the Web along with the target word. The question is: how do we measure semantic similarity between words?
Suppose we have a list of n words (W1, W2, …, Wn). We can create frequency vectors F1 and F2 holding the frequencies of occurrence of these words in the local contexts of our two target words. Once we have these vectors, we can calculate the cosine between them, and this is our similarity measure.
This measure is quite intuitive: if more words co-occur with high frequency in the contexts of both target words, the distance between the vectors will be lower and the cosine will be higher. If fewer context words co-occur, the distance will be higher and the cosine will be smaller.
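A minimal sketch of the cosine computation over two context dictionaries is shown below. Words missing from one context are treated as having frequency 0, which is the usual vector space convention; returning 0.0 for an empty context is an assumption made for this example.

```python
import math

def cosine_similarity(context1, context2):
    """Cosine between two word-frequency dictionaries."""
    shared = set(context1) & set(context2)
    dot = sum(context1[w] * context2[w] for w in shared)
    norm1 = math.sqrt(sum(v * v for v in context1.values()))
    norm2 = math.sqrt(sum(v * v for v in context2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an empty context is similar to nothing
    return dot / (norm1 * norm2)
```

Identical contexts give a cosine of 1, and contexts with no words in common give 0, so the result fits the 0 to 1 scale introduced earlier.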
Measuring Cross-Lingual Semantic Similarity
Finally, we need to generalize our semantic similarity algorithm to words of different languages. Suppose we have two words: W1 of language L1 and W2 of language L2. We can find the local context C1 of W1 in language L1 with our algorithm by specifying L1 as the target search language for the Google queries. In the same way, we can find the local context C2 of the word W2 for the target language L2.
Now we have two local contexts in different languages. We believe that if two words are similar, their local contexts will also be similar. Hence, if we translate the local context C1 from language L1 to language L2, the result will be similar to C2.
To apply the above idea we need a bilingual glossary of translation pairs between the target languages. If this glossary is big enough, most of the words in C1 will be translated to words in C2 and the context vectors can be compared with cosine to measure distance between them. That’s it!
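The translation step could be sketched as follows, assuming the glossary is a plain dictionary from L1 words to L2 words with one translation per word (real glossaries often list several, which this sketch ignores). The function names and all counts are made up for illustration.

```python
import math

def translate_context(context, glossary):
    """Map context words through the glossary; untranslatable words are dropped."""
    translated = {}
    for word, count in context.items():
        target = glossary.get(word)
        if target is not None:
            translated[target] = translated.get(target, 0) + count
    return translated

def cosine_similarity(c1, c2):
    dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def cross_lingual_similarity(c1, c2, glossary):
    """Translate C1 into the language of C2, then compare with cosine."""
    return cosine_similarity(translate_context(c1, glossary), c2)

# Toy Bulgarian-to-Russian glossary and made-up context counts.
glossary = {"образование": "образование", "изкуство": "искусство"}
bg_context = {"образование": 3, "изкуство": 4, "дума": 9}
ru_context = {"образование": 3, "искусство": 4}
score = cross_lingual_similarity(bg_context, ru_context, glossary)
```

The word дума has no glossary entry here, so it is dropped before the comparison, which mirrors the "most of the words will be translated" caveat above.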
Measuring Cross-Lingual Semantic Similarity – Example
Let’s take the Bulgarian word наука (science) and the Russian word наука (science). We can find the local context from Google for the Bulgarian word наука:
BG context word
Likewise, we can find the local context from Google for the Russian word наука:
RU context word
We can translate the Bulgarian context into Russian through our bilingual glossary. Most words will be translated and the rest will be removed:
|BG context word||RU context word||occurrences|
|образование (education)||образование (education)||360|
|изкуство (art)||искусство (art)||106|
|технология (technology)||технология (technology)||86|
|категория (category)||категория (category)||73|
|български (Bulgarian)||болгарский (Bulgarian)||69|
Finally we can measure the distance between the obtained context vectors by calculating the cosine between them.
Precision, Recall, 11pt Average Precision
In information retrieval, the precision of a given list of search results is calculated as relevant_documents_returned / number_of_documents_returned.
In information retrieval, the recall of a given list of search results is calculated as relevant_documents_returned / total_number_of_relevant_documents.
The 11pt average precision is a standard information retrieval measure that averages the precision at 11 points of the returned search results: at recall of 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% and 100%.
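These three measures can be sketched for a ranked list in which each item is marked as relevant or not. The sketch uses the common textbook convention of interpolated precision (at each recall level, take the highest precision achieved at that recall or beyond); the text above does not specify the interpolation, so that choice is an assumption, as are the function names.

```python
def precision_recall_points(ranked_relevance, total_relevant):
    """Precision and recall after each position of a ranked result list."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        hits += bool(is_relevant)
        points.append((hits / rank, hits / total_relevant))
    return points

def eleven_point_average_precision(ranked_relevance):
    """Average interpolated precision at recall 0%, 10%, ..., 100%."""
    total_relevant = sum(bool(r) for r in ranked_relevance)
    if total_relevant == 0:
        return 0.0
    points = precision_recall_points(ranked_relevance, total_relevant)
    interpolated = []
    for step in range(11):
        level = step / 10
        # Highest precision at any rank whose recall reaches this level.
        reachable = [p for p, r in points if r >= level - 1e-12]
        interpolated.append(max(reachable, default=0.0))
    return sum(interpolated) / 11
```

For an evaluation like the one in the next section, ranked_relevance would hold True for each false-friend pair and False for each cognate pair, in the order produced by the similarity measure.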
Evaluation of the Similarity Measure Algorithm
We created a list of 200 word pairs between Bulgarian and Russian: 100 true cognates and 100 false friends. We ran the cross-lingual semantic similarity algorithm described above and ordered the word pairs by increasing similarity. We expected the first 100 pairs to be false friends and the last 100 to be true cognates. Of course, the ordering was not perfect, but we measured the 11-pt average precision for the obtained ordering and found that our method is about 97% accurate on our test data set.