Work in Progress
Please bear with us as we update SemanticHacker.com for the transition out of SemanticHacker Innovators’ Challenge mode. The API is still fully functioning and will continue to be available. Thank you for your patience!

The SemanticHacker Innovators’ Challenge concluded last night. Thank you to those of you who have entered and good luck! We are eager to start digging into the entries.
A reminder that per the terms of the Official Rules:
Notification of a selection will be made on an individual basis to the Entrant, solely by email to the Entrant’s email address Entrant submitted to Sponsor on the entry form, within 12 weeks after the close of the Challenge, which is Wednesday, September 10, 2008.
A reminder that the SemanticHacker Innovators’ Challenge ends Wednesday @ 11:59pm EST. Or…
10:59pm CST
9:59pm MST
8:59pm PST
Best of luck to those of you who have entered or will be entering by the deadline!
TextWise LLC is actively undertaking research, development, and productization of semantic technologies. As part of this effort, we have an immediate need for a research scientist to perform the following duties:
Required qualifications for this position are as follows:
TextWise is a small company with a start-up atmosphere. The team style is collaborative and high energy. You will work daily with a team of scientists and will also interact with Software Engineering, Quality Assurance, Project Management, and Sales.
This position reports to the Chief Science Officer. Salary is commensurate with experience.
Please send resumes and cover letters to job_ref_500 [@] textwise [DOT] com.
Perhaps there’s some confusion as to whether or not the end of the Challenge brings an end to our open API as well. In fact, it does not. We are working hard on bringing more features to the API even after the Challenge has ended.
Feel free to pop over to the Forum and suggest other useful tools you might like to see made available.
The SemanticHacker Innovators’ Challenge is scheduled to end on June 18, 2008 at 11:59pm EST. That’s just 14 days from today.
We want to know, though: has it been enough time? A Forum topic has been set up for responses here: http://talk.semantichacker.com/javabb/viewtopic.jbb?t=32 or feel free to comment on this blog post.
If we get enough responses, we may consider a time extension!
Good luck to those of you who have entered already.
The new Semantic Hacker match server exploits the fact that Semantic Signatures® are mathematically related. The original example tools we provided have a “similarity” function, which produces a “score” of how closely related two signatures are. Often, when writing an application, you have a large set of documents in hand and want to find the ones most closely related to some other document. This is exactly how the Wikipedia extension (and the front page demonstration) works. We used the API to generate a signature for every single Wikipedia page. All 2 million of them. Then we took those signatures and added them to a match server. Once that’s done, we can get the Wikipedia articles most closely related to any document.
Conceptually, the match server does the same thing as running the similarity tool against every stored signature and then sorting the results by score. Writing that yourself is tedious and error-prone. We’re providing the match server to speed up development of applications that need it.
The match server is also very fast. It can sift through all 2 million of those Wikipedia pages and grab the top matches in less than 10 milliseconds.
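To make the "score everything, then sort" idea concrete, here is a minimal brute-force sketch in Python. It is not the match server's implementation and does not call the SemanticHacker API: the dict-of-weights signature format, the `dot_similarity` stand-in for the similarity tool, and the sample data are all assumptions made for illustration.

```python
def dot_similarity(sig_a, sig_b):
    """Hypothetical stand-in for the API's similarity function.

    Signatures are assumed to be dicts mapping category -> weight.
    """
    return sum(w * sig_b.get(cat, 0.0) for cat, w in sig_a.items())


def top_matches(query, corpus, k=3, similarity=dot_similarity):
    """Score every stored signature against the query, then sort by score."""
    scored = [(doc_id, similarity(query, sig)) for doc_id, sig in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up signatures for three documents (illustrative only).
corpus = {
    "Guitar": {"music": 0.9, "instruments": 0.4},
    "Baseball": {"sports": 0.8, "teams": 0.6},
    "Piano": {"music": 0.7, "instruments": 0.7},
}
query = {"music": 1.0, "instruments": 0.2}

results = top_matches(query, corpus, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

This version does work proportional to the number of stored documents on every query, which is exactly the cost a dedicated match server is meant to spare you.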
Semantic Signatures® represent content as points in an abstract m-dimensional conceptual space. By itself, a signature can give us an idea of what a particular document is about, but signatures become even more useful when we can define a distance between two of them. Those familiar with vector space theory and the IR work of Gerard Salton will know what to do next: treat a pair of signatures as vectors and compute a cosine measure between them.
You need not understand the math underlying cosine measures. It is enough to know that, for Semantic Signatures®, they will range from 0 to 1, where 0 means no match at all and 1 means a perfect match. The problem is in determining what range of values will indicate a good match for a given application. The SemanticHacker example tools web page recommends a minimum of 0.4, with 0.8 and above being a good match; but the choice really depends on your application.
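For readers who would like to see the math anyway, the cosine measure is the dot product of two vectors divided by the product of their lengths. The sketch below assumes a signature can be held as a Python dict of category weights, which is an illustrative format and not necessarily what the API returns; the 0-to-1 range it produces depends on the weights being non-negative.

```python
import math


def cosine(sig_a, sig_b):
    """Cosine measure between two sparse signatures (dict: category -> weight)."""
    dot = sum(w * sig_b.get(cat, 0.0) for cat, w in sig_a.items())
    norm_a = math.sqrt(sum(w * w for w in sig_a.values()))
    norm_b = math.sqrt(sum(w * w for w in sig_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty signature matches nothing
    return dot / (norm_a * norm_b)


# Made-up signatures: identical, partially overlapping, and unrelated.
recipes = {"cooking": 0.6, "recipes": 0.8}
same = cosine(recipes, dict(recipes))
partial = cosine(recipes, {"cooking": 1.0})
disjoint = cosine(recipes, {"video games": 1.0})
print(round(same, 3), round(partial, 3), round(disjoint, 3))
```

Identical signatures score at the top of the range, signatures with no categories in common score 0, and everything else falls in between.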
For example, if you are checking for plagiarism, then a similarity ≥ 0.95 might be a helpful result. In most information contexts, however, you are rarely interested in getting exact or close duplicates of what you have already; and so you might want to set an upper threshold of 0.9 or even 0.85 so that your matches will find more diverse information.
Similarly, when missing something entails a high cost (e.g. 9/11), you may want to lower your match threshold to 0.2. Most of the matches will then be noise, but if someone is willing to sift through them all, there is a significant chance of finding something. One has to make the proper trade-off here.
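The three scenarios above amount to choosing a lower bound and, optionally, an upper bound on the score. Here is a small Python sketch of that filtering step; the function name, the `(doc_id, score)` pair format, and the sample scores are made up for illustration, and the thresholds are simply the ones discussed in this post.

```python
def filter_matches(scored, lower=0.4, upper=None):
    """Keep (doc_id, score) pairs inside the band, best score first.

    lower: minimum score to accept.
    upper: optional maximum score, used to drop near-duplicates.
    """
    kept = [
        (doc_id, score)
        for doc_id, score in scored
        if score >= lower and (upper is None or score <= upper)
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up similarity scores for five candidate documents.
scored = [("a", 0.97), ("b", 0.88), ("c", 0.55), ("d", 0.25), ("e", 0.10)]

plagiarism = filter_matches(scored, lower=0.95)          # near-duplicates only
diverse = filter_matches(scored, lower=0.4, upper=0.9)   # related but not duplicate
high_recall = filter_matches(scored, lower=0.2)          # accept noise to miss less

print(plagiarism)
print(diverse)
print(high_recall)
```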
In theory, half of any Semantic Signature® conceptual space will be within 0.7 of any given point in the space. In practice, signatures are so sparse that there will usually be only a few within 0.7 of a given reference point. This sparseness is actually a good situation to have, when your application allows you to take advantage of it.
A Semantic Signature® can be seen as the result of a kind of election to choose semantic categories to describe the content of a document. A semantic dictionary serves to define how each term in the document will vote for different categories; and so this will be critical to the usefulness of signatures. The suitability of a dictionary for an application will depend on its range of categories and on the breadth of its vocabulary for those categories.
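As a toy illustration of the election metaphor, the Python sketch below lets each term cast weighted votes for categories and then tallies them. It is only a mental model: the dictionary entries, category names, and the normalisation step are assumptions made for this example and do not describe how the actual signature generator or the ODP-trained dictionary works.

```python
from collections import defaultdict


def vote(text, dictionary):
    """Tally per-category votes cast by each known term in the text.

    dictionary maps term -> {category: vote weight}. Terms not in the
    dictionary cast no votes. Weights are normalised to sum to 1, which
    is a choice made for this sketch only.
    """
    tally = defaultdict(float)
    for term in text.lower().split():
        for category, weight in dictionary.get(term, {}).items():
            tally[category] += weight
    total = sum(tally.values())
    if total == 0:
        return {}
    return {category: weight / total for category, weight in tally.items()}


# A made-up three-term dictionary; "pitcher" is deliberately ambiguous.
toy_dictionary = {
    "pitcher": {"Sports/Baseball": 0.75, "Home/Kitchen": 0.25},
    "inning": {"Sports/Baseball": 1.0},
    "lemonade": {"Home/Kitchen": 0.5, "Recreation/Food": 0.5},
}

signature = vote("The pitcher struck out three in the first inning", toy_dictionary)
print(signature)
```

The ambiguous term splits its vote, and the second term corroborates one reading, so one category wins clearly. A text made up mostly of words the dictionary does not know would produce few votes and a much less decisive result.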
The current API dictionary was trained on the listings of the DMOZ Open Directory Project. It is particularly strong in covering the content most commonly found on the World Wide Web; for example, digital electronics, video games, professional sports, movies, and cooking recipes. Since the Web seems to be biased toward the interests of young males, however, an ODP dictionary may provide less detailed coverage of subjects like designer shoes for women, Roth IRAs, or Tanzanian rural development.
When generating Semantic Signatures® for a particular application, check their weights to see how well they are capturing your own target content. In the top 30 weights now shown, you should see a good contrast between the highest and lowest weights. We want to avoid something like the 2008 Democratic U.S. presidential primaries and caucuses, where one candidate is ahead, but there seems to be no clear winner. One can approach such a degree of contrast statistically, but simple eyeballing should be good enough most of the time.
In areas where the ODP offers fine-grain coverage, you may get many relevant categories, which is OK. The problem is when you see signature weights about the same for many categories that don’t seem to be closely related. In that case, you may want to try increasing the amount of text you generate signatures from in order to get more corroboration on voting. If you insist on doing women’s haute couture or calls and puts in the options market, however, you probably want a specialized semantic dictionary; this is not difficult to build, but requires proper training data.
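If eyeballing gets tedious, one crude way to put a number on contrast is the ratio of the highest to the lowest weight among the top categories. The Python sketch below is one possible check, not a method prescribed by the API; the function name, the sample signatures, and the idea of using a simple ratio are all assumptions.

```python
def weight_contrast(signature, top_n=30):
    """Ratio of the highest to the lowest weight among the top_n categories.

    signature is assumed to be a dict mapping category -> weight.
    A ratio near 1 means the weights are nearly flat (no clear winner).
    """
    top = sorted((w for w in signature.values() if w > 0), reverse=True)[:top_n]
    if not top:
        raise ValueError("signature has no positive weights")
    return top[0] / top[-1]


# Made-up signatures: one with a clear winner, one nearly flat.
clear = {
    "Sports/Baseball": 0.60,
    "Sports/Softball": 0.25,
    "Recreation/Collecting": 0.15,
}
flat = {
    "Arts/Design": 0.26,
    "Business/Investing": 0.25,
    "Shopping/Clothing": 0.25,
    "Society/Issues": 0.24,
}

print(round(weight_contrast(clear), 2))
print(round(weight_contrast(flat), 2))
```

What counts as "enough" contrast is application-specific, so treat any cutoff you pick as something to tune against your own content.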
NOTE: For the purpose of the SemanticHacker Innovators’ Challenge we will evaluate all application prototypes using the general purpose dictionary provided with the API. We understand that certain dictionary customization may be required after a winner is selected to improve the “matching” capability for a vertical. That work will be included in the product build.
TextWise will be speaking and exhibiting at the 2008 Semantic Technology Conference in San Jose, CA. We would like to invite you to attend with a $200 discount off the full registration fee. The discount expires May 9th. The tutorials, sessions, etc. run from May 18th to 23rd.
Join our conference session on Wednesday, May 21st from 9:45 - 10:45am.
Also, look for us at booth #302. We’ll have some of our beta applications up (internet access permitting) for you to play with, and plenty of representatives to chat with. Hope to see you there!