Author: Svetlin Nakov
September 30, 2009
Today I released to the community (under the MIT license) the source code of the most interesting algorithms designed for my PhD thesis (implemented in C#):
- MMEDR – algorithm for measuring weighted orthographic similarity between Bulgarian and Russian words, taking into account some linguistically motivated Bulgarian-Russian correspondences (currently supports Bulgarian and Russian only)
- SemSim – algorithm for measuring semantic similarity between words by searching in Google and analyzing the returned text snippets (currently supports Bulgarian, Russian and English)
- CrossSim – algorithm for measuring cross-lingual semantic similarity by searching in Google and analyzing the returned text snippets (currently supports Bulgarian and Russian only)
- FFExtract – algorithm for extracting false friends from a parallel corpus by determining candidates through the MMEDR algorithm and combining statistical and semantic evidence to distinguish between cognates and false friends (currently supports Bulgarian and Russian only)
The project is titled TECFF (Toolkit for Extraction of Cognates and False Friends) and is available for public download from http://code.google.com/p/cognates-and-false-friends-tools/.
Tags: Cognates, community, Extraction, false friends, MMEDR, open source toolkit, parallel corpus, semantic similarity, text, text snippets
Author: Svetlin Nakov
September 22, 2009
All Java and Java EE developers are invited to Java2Days, a conference on Java technologies that is unique for the Balkans and Eastern Europe. At the conference, held in Sofia, distinguished speakers will talk about Java, Java EE 6, JBoss, EJB 3.1, Spring Framework, JPA, OSGi, GWT, JSF, jBPM, Wicket, JRockit, cloud computing and other hot technologies. Some of the speakers:
- Reza Rahman – independent Java EE consultant, co-author of the book “EJB 3 in Action”
- Mircea Markus – core developer and trainer at JBoss
- Bruno Bossola – agile coach at Vodafone Global and Java champion
- John Willis – CEO at Zabovo Corp., cloud computing expert
- Josh Long – enterprise architect, speaker, consultant, and author
For more information, visit the official conference web site: http://java2days.com/.
Tags: champion john, consultant, europe conference, Framework, GWT, hot technologies, independent java, jBPM, john willis, October
Author: Svetlin Nakov
All .NET and Microsoft-oriented developers are invited to the DevReach 2009 conference – the premier conference on Microsoft technologies for the Balkans and Eastern Europe region. This year the conference attracts distinguished speakers who will deliver talks about Silverlight, WPF, ASP.NET 4.0, ASP.NET MVC, AJAX, IIS 7, Visual Studio 2010, SharePoint, SQL Server 2008, business intelligence, data access and ORM, LINQ, RESTful applications, WCF, WWF, .NET service bus, Scrum and many others. Some of the speakers:
- Chris Sells – Program Manager in the Business Platform Division at Microsoft
- Luka Debeljak – CEE DPE Regional Technical Lead at Microsoft
- Kent Alstad – Microsoft ASP.NET MVP and principal at Strangeloop
- Stephen Forte – Microsoft Regional Director for NY
- Christian Weyer – Microsoft Regional Director and co-founder of thinktecture
- Tiberiu Covaci – INETA Country Leader for Sweden
- Hadi Hariri – Technical Lead at iMeta Technologies
- Todd Anglin – Telerik Chief Technical Evangelist
- Richard Campbell – .NET Rocks and RunAs Radio
- Shawn Wildermuth – Microsoft MVP
For more information, visit the official DevReach conference web site: http://www.devreach.com/.
Tags: ASP, DevReach, eastern europe region, microsoft mvp, microsoft regional director, NET, October, oriented developers, region, technical evangelist
Author: Svetlin Nakov
September 17, 2009
Today I presented a scientific publication about measuring modified orthographic similarity between Bulgarian and Russian words at the Workshop “Multilingual Resources, Technologies and Evaluation for Central and Eastern European Languages”, held in conjunction with the scientific conference RANLP’2009. The paper is titled “A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words” and is a small part of my PhD thesis.
Abstract
We propose a novel knowledge-rich approach to measuring the similarity between a pair of words. The algorithm is tailored to Bulgarian and Russian and takes into account the orthographic and the phonetic correspondences between the two Slavic languages: it combines lemmatization, hand-crafted transformation rules, and weighted Levenshtein distance. The experimental results show an 11-pt interpolated average precision of 90.58%, which represents a significant improvement over two classic rivaling approaches.
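To make the "weighted Levenshtein distance" ingredient more concrete, here is a minimal C# sketch of an edit distance with a pluggable substitution cost. It is only an illustration under stated assumptions, not the MMEDR implementation from the paper: the class name, the `SampleCost` function and the weight 0.5 for the ъ/о pair are invented for this example, and lemmatization and the transformation rules are left out entirely.

```csharp
using System;

// Hypothetical illustration; not the MMEDR code itself.
public static class WeightedLevenshtein
{
    // Edit distance between a and b where substituting x with y costs
    // substitutionCost(x, y) and every insertion or deletion costs 1.
    public static double Distance(string a, string b, Func<char, char, double> substitutionCost)
    {
        double[,] d = new double[a.Length + 1, b.Length + 1];
        for (int i = 1; i <= a.Length; i++) d[i, 0] = i; // delete all of a's prefix
        for (int j = 1; j <= b.Length; j++) d[0, j] = j; // insert all of b's prefix

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                double substitute = d[i - 1, j - 1] + substitutionCost(a[i - 1], b[j - 1]);
                double remove = d[i - 1, j] + 1;
                double insert = d[i, j - 1] + 1;
                d[i, j] = Math.Min(substitute, Math.Min(remove, insert));
            }
        }
        return d[a.Length, b.Length];
    }

    // Assumed cost function: identical letters are free, the pair ъ/о is cheap
    // (0.5 is an invented weight), and any other substitution costs 1.
    public static double SampleCost(char x, char y)
    {
        if (x == y) return 0.0;
        if ((x == 'ъ' && y == 'о') || (x == 'о' && y == 'ъ')) return 0.5;
        return 1.0;
    }
}
```

With these assumed weights, `WeightedLevenshtein.Distance("сън", "сон", WeightedLevenshtein.SampleCost)` yields 0.5, whereas the classic unweighted distance for the same pair would be 1.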
Download
Download the article: RANLP2009-Workshop-Nakov-Paskaleva-Nakov-MMEDR-Similarity-Bulgarian-Russian-Words.pdf
Download the presentation: RANLP-2009-Workshop-Nakov-Paskaleva-Nakov-MMEDR-Similarity-Bulgarian-Russian.ppt.
Tags: Central, conjunction, eastern european languages, Measuring, novel knowledge, paper, russian words, Similarity, slavic languages, transformation rules
Author: Svetlin Nakov
September 14, 2009
Today I presented at the prestigious scientific conference RANLP’2009 a research paper about new methods for extracting false friends from parallel corpora, which is a major part of my PhD thesis. The article is named “Unsupervised Extraction of False Friends from Parallel Bi-Texts Using the Web as a Corpus” and was accepted after passing a thorough anonymous review by two distinguished scientists in the area of Natural Language Processing (NLP) and Information Retrieval (IR).
Abstract
False friends are pairs of words in two languages that are perceived as similar, but have different meanings, e.g., Gift in German means poison in English. In this paper, we present several unsupervised algorithms for acquiring such pairs from a sentence-aligned bi-text. First, we try different ways of exploiting simple statistics about monolingual word occurrences and cross-lingual word co-occurrences in the bi-text. Second, using methods from statistical machine translation, we induce word alignments in an unsupervised way, from which we estimate lexical translation probabilities, which we use to measure cross-lingual semantic similarity. Third, we experiment with a semantic similarity measure that uses the Web as a corpus to extract local contexts from text snippets returned by a search engine, and a bilingual glossary of known word translation pairs, used as “bridges”. Finally, all measures are combined and applied to the task of identifying likely false friends. The evaluation for Russian and Bulgarian shows a significant improvement over previously-known algorithms.
Download
Download the article: RANLP2009-Nakov-Nakov-Paskaleva-Unsupervised-Extraction-of-False-Friends.pdf.
Download the presentation: Nakov-Unsupervised-Extraction-of-False-Friends.ppt.
Tags: Extraction, natural language processing, parallel corpora, RANLP, statistical machine translation, Texts, translation, translation pairs, Unsupervised, word occurrences
Author: Svetlin Nakov
September 9, 2009
Most people using ASP.NET Forms Authentication use the built-in <asp:Login> control, which works fine. With a custom login form, however, we have the following problem: the cookie expiration timeout in ASP.NET Forms Authentication uses the same value for persistent and non-persistent sessions. It is defined in Web.config in the timeout attribute of the <forms> tag and has a default value of 30 minutes. Thus, by default, if you log in without the “remember me” option, your maximal inactivity period will be 30 minutes. At the same time, if you log in with the “remember me” option, your cookie’s life will also be 30 minutes, which is obviously incorrect. If you put a very big timeout in Web.config, e.g. 50 years, the persistent login will work well, but the non-persistent login will no longer be limited to 30 minutes or so.
The problem described above is a well-known and documented design flaw in the Microsoft ASP.NET Forms Authentication framework. The values for the persistent and non-persistent timeouts obviously should have been separately configurable, but Microsoft failed to do this even after numerous discussions in community groups, forums, blogs, etc.
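For reference, the single shared timeout is configured roughly as in the Web.config fragment below. This is a minimal sketch: the loginUrl value is a placeholder, and the timeout is given in minutes.

```xml
<configuration>
  <system.web>
    <authentication mode="Forms">
      <!-- One timeout (in minutes) for both persistent and non-persistent logins -->
      <forms loginUrl="~/Login.aspx" timeout="30" />
    </authentication>
  </system.web>
</configuration>
```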
Note that if you use the <asp:Login> control and check “remember me”, the control itself will set the cookie timeout to 50 years, but if you use a custom (self-made) login form or a different Web application framework (not ASP.NET Web Forms), you will need to work around this well-documented bug. Typically I use the following code to work around this problem:
// Requires: using System; using System.Web; using System.Web.Security;
private void PerformLogin(string username, string password, bool rememberMe, string returnUrl)
{
    if (Membership.ValidateUser(username, password))
    {
        HttpCookie authCookie = FormsAuthentication.GetAuthCookie(username, rememberMe);
        if (rememberMe)
        {
            // For a persistent cookie ("remember me" option checked), we need to set the cookie
            // expiration manually to 1 year after the current date. The default expiration timeout
            // is taken from Web.config and is only 30 minutes (for both persistent and non-persistent
            // cookies). This is a well-documented design flaw in the ASP.NET Forms Authentication
            // framework and has to be worked around manually.
            authCookie.Expires = DateTime.Now.AddYears(1);
        }
        // Send the authentication cookie to the browser
        Response.Cookies.Add(authCookie);
        Response.Redirect(returnUrl);
    }
    else
    {
        // Handle invalid login ...
    }
}
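A possible variation, sketched under assumptions: the expiration is also stored inside the encrypted authentication ticket, so it may be safer to build the ticket explicitly with the desired lifetime instead of changing only the cookie's Expires value. The helper name `AuthCookieFactory` and the two lifetimes below are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Web;
using System.Web.Security;

// Hypothetical helper; the name and both lifetimes are assumptions of this sketch.
public static class AuthCookieFactory
{
    private static readonly TimeSpan PersistentLifetime = TimeSpan.FromDays(365);
    private static readonly TimeSpan SessionLifetime = TimeSpan.FromMinutes(30);

    public static HttpCookie Create(string username, bool rememberMe)
    {
        DateTime now = DateTime.Now;
        DateTime expiration = now.Add(rememberMe ? PersistentLifetime : SessionLifetime);

        // The chosen expiration is written into the ticket itself.
        FormsAuthenticationTicket ticket = new FormsAuthenticationTicket(
            1,                                    // ticket version
            username,
            now,                                  // issue date
            expiration,
            rememberMe,                           // persistent flag
            string.Empty,                         // no custom user data
            FormsAuthentication.FormsCookiePath);

        HttpCookie cookie = new HttpCookie(
            FormsAuthentication.FormsCookieName,
            FormsAuthentication.Encrypt(ticket));
        cookie.Path = FormsAuthentication.FormsCookiePath;
        cookie.HttpOnly = true;
        cookie.Secure = FormsAuthentication.RequireSSL;
        if (rememberMe)
        {
            // Only a persistent cookie gets an explicit expiration date;
            // otherwise it remains a browser-session cookie.
            cookie.Expires = expiration;
        }
        return cookie;
    }
}
```

In a login handler this could be used as `Response.Cookies.Add(AuthCookieFactory.Create(username, rememberMe));` followed by the redirect.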
Tags: asp login, authentication framework, control, custom login, expiration, formsauthentication, Login, NET, obvisously, Timeout