PhD Thesis: “Automatic Extraction of False Friends from Parallel Bilingual Corpus” (March 2007 – April 2010)
Svetlin Nakov’s PhD thesis “Automatic Extraction of False Friends from Parallel Bilingual Corpus” is a research work in computational linguistics. It studies cognates and false friends between Bulgarian and Russian and aims to design new algorithms for their automatic extraction. It proposes new methods for measuring orthographic and semantic similarity (monolingual and cross-lingual) and demonstrates their application to several computational linguistics tasks, in particular synonym extraction, distinguishing between cognates and false friends, and improving word alignment. It also proposes a two-step method for the automatic extraction of false friends from bi-texts. In the first step, pairs of words with similar orthography are collected from the text. In the second step, these pairs are classified as cognates or false friends by measuring their cross-lingual semantic similarity, using the Web as a corpus, and by applying statistical techniques that account for their occurrences and co-occurrences in the corresponding sentences of the bi-text.
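The two-step idea can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: `difflib.SequenceMatcher` stands in for the orthographic measure, a Dice coefficient over aligned sentences stands in for the statistical evidence, the Web-based semantic evidence is left out entirely, and the function names, thresholds and toy bi-text are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import product

# Toy Bulgarian-Russian bi-text (illustrative only): aligned sentence pairs.
TOY_BITEXT = [
    ("зелена гора", "зеленый лес"),      # "green forest"
    ("висока планина", "высокая гора"),  # "high mountain"
    ("студена вода", "холодная вода"),   # "cold water"
    ("чиста вода", "чистая вода"),       # "clean water"
]


def orthographic_candidates(bitext, threshold=0.7):
    """Step 1: collect cross-lingual word pairs with similar spelling."""
    src_vocab = {w for src, _ in bitext for w in src.split()}
    tgt_vocab = {w for _, tgt in bitext for w in tgt.split()}
    return sorted(
        (s, t)
        for s, t in product(src_vocab, tgt_vocab)
        if SequenceMatcher(None, s, t).ratio() >= threshold
    )


def cooccurrence_score(bitext, s, t):
    """Dice coefficient: how often s and t appear in the same aligned pair."""
    n_s = sum(1 for src, _ in bitext if s in src.split())
    n_t = sum(1 for _, tgt in bitext if t in tgt.split())
    n_both = sum(
        1 for src, tgt in bitext if s in src.split() and t in tgt.split()
    )
    return 2 * n_both / (n_s + n_t) if n_s + n_t else 0.0


def classify(bitext, sim_threshold=0.7, cooc_threshold=0.5):
    """Step 2: label each candidate pair using the co-occurrence evidence."""
    return {
        (s, t): "cognate"
        if cooccurrence_score(bitext, s, t) >= cooc_threshold
        else "false friend"
        for s, t in orthographic_candidates(bitext, sim_threshold)
    }
```

On this toy bi-text, `classify(TOY_BITEXT)` labels the pair (гора, гора) as a false friend, because the Bulgarian word ("forest") and the Russian word ("mountain") never occur in corresponding sentences, and labels (вода, вода) as a cognate.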
The Open Source Toolkit for Extraction of Cognates and False Friends (TECFF) implements the most significant algorithms designed as part of my PhD research:
- MMEDR – algorithm for measuring weighted orthographic similarity between Bulgarian and Russian words, taking into account some linguistically motivated Bulgarian-Russian correspondences (currently supports Bulgarian and Russian only)
- SemSim – algorithm for measuring semantic similarity between words by searching in Google and analyzing the returned text snippets (currently supports Bulgarian, Russian and English)
- CrossSim – algorithm for measuring cross-lingual semantic similarity by searching in Google and analyzing the returned text snippets (currently supports Bulgarian and Russian only)
- FFExtract – algorithm for extracting false friends from a parallel corpus by identifying candidates with the MMEDR algorithm and combining statistical and semantic evidence to distinguish between cognates and false friends (currently supports Bulgarian and Russian only)
The toolkit is implemented in C# and is available as open source software under the MIT license.
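As a rough illustration of what a weighted orthographic similarity can look like, here is a short Python sketch of an edit distance in which some letter substitutions cost less than others. It is not the toolkit's C# code and not the MMEDR algorithm itself: the substitution table, its weights and the function names are hypothetical, and the normalization by the longer word's length is an assumption.

```python
# Hypothetical reduced costs for (Bulgarian letter, Russian letter) pairs.
# Both the pairs and the weights are illustrative, not taken from the thesis.
CHEAP_SUBSTITUTIONS = {("и", "ы"): 0.5, ("е", "э"): 0.5}


def substitution_cost(bg_letter, ru_letter):
    """0 for identical letters, a reduced cost for listed pairs, else 1."""
    if bg_letter == ru_letter:
        return 0.0
    return CHEAP_SUBSTITUTIONS.get((bg_letter, ru_letter), 1.0)


def weighted_edit_distance(bg_word, ru_word):
    """Levenshtein-style dynamic programming with weighted substitutions."""
    rows, cols = len(bg_word) + 1, len(ru_word) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = float(i)  # delete all letters of the Bulgarian prefix
    for j in range(1, cols):
        dist[0][j] = float(j)  # insert all letters of the Russian prefix
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,  # deletion
                dist[i][j - 1] + 1.0,  # insertion
                dist[i - 1][j - 1]
                + substitution_cost(bg_word[i - 1], ru_word[j - 1]),
            )
    return dist[-1][-1]


def orthographic_similarity(bg_word, ru_word):
    """Map the distance to [0, 1] by dividing by the longer word's length."""
    longest = max(len(bg_word), len(ru_word))
    if longest == 0:
        return 1.0
    return 1.0 - weighted_edit_distance(bg_word, ru_word) / longest
```

With these illustrative weights, the Bulgarian "риба" and the Russian "рыба" (both "fish") are at distance 0.5, so `orthographic_similarity("риба", "рыба")` is 0.875, whereas an unweighted edit distance would give 0.75.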
Microsoft .NET Framework Course and Teaching Materials (March 2004 – December 2006)
The project aims to create a set of materials for teaching a course on Microsoft .NET Framework programming in Bulgarian. These materials consist of presentations, lecture materials, exercises and a textbook, and are available for free download. The whole course is also available in the form of e-learning lessons. The project has earned the support of Microsoft Research and Sofia University “St. Kliment Ohridski”.
NakovDocumentSigner (September 2003 – February 2006)
NakovDocumentSigner is a digital document signing framework for Java-based Web applications. It is a free, open-source project intended to provide Web applications with digital signature functionality. NakovDocumentSigner allows users to digitally sign and upload files directly from their Web browsers. It consists of a Java applet for digital signing and a reference Web application for verifying digital signatures and certificates.
ArtsSemNet (November 2003)
ArtsSemNet is an electronic lexical reference system, similar to WordNet, for fine arts terminology. The terms (over 2,600 for each language) are annotated with complete dictionary definitions and organized into a semantic network with two parallel versions: Bulgarian and Russian. Five important lexical relations are defined: polysemy, synonymy, homonymy, antonymy and hyponymy, the latter serving as the basis of the hierarchical organization of the ontology. In addition, a specialized browser provides an intuitive interface for querying and navigating the network.
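A semantic network of this kind can be pictured with a small Python sketch: terms carry definitions, named relations link them, and hyponymy links form the hierarchy. The class, its methods and the English sample terms are invented for illustration and do not reflect ArtsSemNet's actual data model or content.

```python
class SemanticNetwork:
    """Terms with definitions, connected by named lexical relations."""

    def __init__(self):
        self.definitions = {}  # term -> dictionary definition
        self.relations = {}    # relation name -> {source: set of targets}

    def add_term(self, term, definition):
        self.definitions[term] = definition

    def relate(self, relation, source, target, symmetric=False):
        """Link two terms; for hyponymy, source is the broader term."""
        links = self.relations.setdefault(relation, {})
        links.setdefault(source, set()).add(target)
        if symmetric:
            links.setdefault(target, set()).add(source)

    def related(self, relation, term):
        """Terms directly linked to `term` by the given relation."""
        return set(self.relations.get(relation, {}).get(term, ()))

    def hyponyms(self, term):
        """All narrower terms below `term`, following hyponymy transitively."""
        found, stack = set(), [term]
        while stack:
            for child in self.related("hyponymy", stack.pop()):
                if child not in found:
                    found.add(child)
                    stack.append(child)
        return found


# Invented sample content, used only to exercise the structure.
net = SemanticNetwork()
net.add_term("painting", "the art of applying pigment to a surface")
net.relate("hyponymy", "art", "painting")
net.relate("hyponymy", "art", "sculpture")
net.relate("hyponymy", "painting", "fresco")
net.relate("hyponymy", "painting", "watercolor")
net.relate("synonymy", "fresco", "mural painting", symmetric=True)
```

Here `net.hyponyms("art")` returns every term below "art" in the hierarchy, which is the kind of query a browser over such a network would need when navigating from a general term to its specializations.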