Kamusi: The Semantic Search Engine of African Languages
{ July 2nd, 2008 }
What African language is spoken by one out of every 60 people on earth? Swahili. The Kamusi Project’s mission is to help that number grow with its collaborative database/dictionary, Kamusi.
Late last year, KamusiProject.org put out a call for volunteer programmers to help make some changes to its website. The project wanted to move from Perl to PHP, debug its MySQL database, reduce the number of requests from search engines, and expand the database, with a plan to go from two languages to two dozen in two years. Coders championed the cause, and while not all of the updates have been made to date, Kamusi recently announced that it will be going multilingual (adding many other widely spoken languages, African and global) and that it will soon deploy an embeddable widget called Wijiti!
If you aren’t familiar with the Kamusi Project, the owners refer to it as a “living dictionary” that lets users look up English-to-Swahili and Swahili-to-English translations and definitions. What makes Kamusi stand out are the collaborative features that have been integrated. Users can upload pictures for their corresponding words. Users can also request to become editors (think moderators on Wikipedia) to help improve translations and add words. There’s a sidebar that shows what other users are searching for in the database in near real time (it updates with each page refresh).
The biggest problem Kamusi faces is affording enough bandwidth to handle all the calls to their server. One reader asked if they would open the new languages to a Wiki-style system allowing for the contextual database to be populated more quickly. Here is what project director Martin Benjamin had to say in response:
Once we are able to find true funding for each component language, we will be able to open it up to a give-and-take between the official editors and the remote participants. But the fully open Wikipedia model won’t work for these purposes, since there is too much data that requires highly specialized knowledge and too many opportunities for mistakes, vandalism, etc. If we find that there is a particular community that is large enough to open up the process more widely, we could consider modifying the submission model for that language. But even then, the model would need to be partially restricted - people would apply to be part of the group that approves entries, and would only be accepted to that group if they complete a training process that puts them in sync with the rest of the editorial board.
I question the benefit, though. While I think this is a great idea, if the Kamusi board is having problems with bandwidth now, its database will be inundated with queries once the widget is deployed. If a site that runs the widget gets Dugg or featured in Times magazine or something (not to mention the Kamusi site itself), they’ll long for the days when they only had search engine problems! Regardless, take a look at the plans for Wijiti below.
Slide show of the Wijiti specs:
The best thing about Kamusi is that, beyond direct translations, queries also return the modifiers for a word and explain what it means to add them, how grammatical context can change that word’s meaning, and what class (part of speech) the word belongs to. There’s also a learning center that features multimedia tutorials, a forum and all sorts of other useful tools for those who want to learn Swahili. Martin Benjamin obviously has some big plans in mind. In his piece called “The End of English” he references the economics of language and describes, in broad strokes, what makers of semantic web applications all over the world are envisioning for the future of the internet:
It is much easier for a few people to teach a computer how to speak a language than for the millions of speakers of that language to learn how to speak to a computer in another tongue. It is also easier for a few people to translate a movie into 2000 languages than for the billion people who speak those languages to learn the languages coming out of the actors’ mouths. Let’s add a third IT feature - voice to text, and text to voice. The day is not too far off when you can talk into your computer and the machine will recognize what you say and reliably convert it into text. We also have good working examples of computers that can read written text and speak it out coherently. Combine those technologies with an intermediary translation system, and we can easily picture the following chain:
1) I speak English, and my computer converts what I say into English text.
2) My English text is translated by machine into another language
3) My translated text is read aloud by the computer in that other language.
Voila, I can now communicate in real time with someone who does not know a single word of my language! Sure, the translations won’t be perfect, but they will be good enough for me to order a product, or negotiate a contract. And suddenly, the person on the other end does not need English.
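For the coders in the audience, the three-step chain Benjamin describes can be sketched as a simple pipeline of functions. This is only an illustration under loud assumptions: the stage names, the `make_pipeline` helper and the tiny word list are all hypothetical, the "audio" is faked as plain bytes, and none of it reflects Kamusi's actual code or data. A real system would plug genuine speech recognition, machine translation and speech synthesis engines into the same three slots.

```python
from typing import Callable

# Hypothetical stage signatures; these are not from Kamusi or any real library.
SpeechToText = Callable[[bytes], str]   # step 1: audio -> source-language text
Translate = Callable[[str], str]        # step 2: source text -> target text
TextToSpeech = Callable[[str], bytes]   # step 3: target text -> audio


def make_pipeline(stt: SpeechToText, translate: Translate,
                  tts: TextToSpeech) -> Callable[[bytes], bytes]:
    """Chain the three stages in the order described in the quote."""
    def run(audio: bytes) -> bytes:
        source_text = stt(audio)
        target_text = translate(source_text)
        return tts(target_text)
    return run


# Toy stand-ins so the sketch runs end to end.
# Illustrative English -> Swahili word list, NOT Kamusi data.
TOY_LEXICON = {"water": "maji", "friend": "rafiki"}


def toy_stt(audio: bytes) -> str:
    # Pretend the "audio" is just UTF-8 encoded text.
    return audio.decode("utf-8")


def toy_translate(text: str) -> str:
    # Word-by-word lookup; unknown words pass through unchanged.
    return " ".join(TOY_LEXICON.get(word, word) for word in text.lower().split())


def toy_tts(text: str) -> bytes:
    # Pretend "speaking" is just encoding the text back to bytes.
    return text.encode("utf-8")


speak_swahili = make_pipeline(toy_stt, toy_translate, toy_tts)
```

Calling `speak_swahili(b"Water friend")` runs all three stages in order. The word-by-word lookup in the middle is exactly the kind of crude translation that a dictionary with grammatical context could improve on.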
What Martin may not realize is that Kamusi could play a huge role in this, with his “living dictionary” adding a semantic layer of context to future widgets and web applications. Wijiti truly looks interesting, and it’ll be exciting to see how fans, bloggers and other website owners adopt it.
Follow the Kamusi Project on Twitter