TextWise WordPress Plugin 1.0.4 Released!

March 10th, 2010 by Jay Baker

TextWise is pleased to release an updated version of our WordPress plugin which now supports WordPress version 2.9.2. After a few rigorous rounds of development and testing, then more development and testing, the plugin is now available for you to enjoy on your blogs. We’ve worked hard to maintain compatibility with all WordPress versions from 2.9.2 back to 2.6.1. We have our eye on WordPress and know that they’re releasing version 3.0 soon, as well. So we’ll be working to ensure our plugin works with that, too.

If you’re currently a user of our plugin, thanks for using it and giving us great feedback to make it even better. If you aren’t using the plugin…why not? Head on over to http://wordpress.org/extend/plugins/textwise/ and download it to enhance your WordPress blog with relevant media, tags, links, and more! Enjoy!

The Topology of Semantic Space

March 8th, 2010 by Clinton Mah

People in the information sciences are fond of high-dimensional vector spaces as models of document content. These are in fact only approximations of reality, however; and in the specific case of semantics, they are probably an oversimplification. We already know something about how the neural circuitry in our brains works when we process the meaning of language; we can find no clean finite-dimensional linear space in the tangle of our synapses.

Neural imaging like PET does support the theory that linguistic concepts correspond to particular clusters of neurons connected in fairly complex feedback loops. Our understanding here is still quite limited, though. We do not know how many such clusters exist or how widely they are distributed. Visual concepts are in a different part of the brain than auditory concepts, for example; and overall, we have not yet found any obvious switchboard, say in the hippocampus, that could somehow tie everything together neatly.

In our computational semantic model, we assume that all concepts are independent and equal. That seems to work in semantic dictionary applications when we have thousands of concepts as dimensions, but an epistemologist here would have the lurking suspicion that our actual semantic space has to be some kind of complex manifold with all kinds of holes and twisting surfaces like a deranged n-th-order Moebius strip. Meaning is messy.

Our linear Euclidean model may therefore be valid only in a small local region of our actual semantic space, but in practice, that is really where all our apps have to live. One cannot presume to comprehend all possible content in text. We can only slice off a small piece of the pie of meaning, and until world peace and perfect enlightenment break out, that is a good start.

How Many Dimensions Again?

February 25th, 2010 by Clinton Mah

We have been thinking lately about how many dimensions a semantic dictionary should have. Some researchers at Carnegie Mellon have been approaching the same question from the perspective of neuroscience, using real-time imaging of activity in the human brain as it processes language (http://bit.ly/buIZEx).

According to CMU, there are really only THREE basic semantic dimensions: (1) Can I eat it? (2) Can I pick it up? (3) Can I hide in it? Admittedly, this primitive partitioning of the world probably goes back to our primate origins, but does have a certain resonance. Let’s remember it the next time we try to categorize journal articles in nanotechnology or search postings on someone’s Facebook wall.

How informative is Twitter? (part 2)

January 26th, 2010 by Cliff Crawford

In my last post, I presented some research on the different content types we found in our corpus of 8.9 million Twitter messages. One surprising result we found is that Portuguese is apparently the second most common language on Twitter, beating out both Japanese and Spanish. Given the unreliability of TextCat on short pieces of text, I decided to verify our language statistics by looking at the location field in the user info for the unique set of users in our corpus. This was not a straightforward thing to do, however, because the location is a text field which people can write absolutely anything they want into. For example, the following all occurred more than once in our corpus:

  • “New York”
  • “NYC”
  • “everywhere!!!!”
  • “In ur computers, eating ur RAM”
  • “Earth”
  • “Mars”
  • “Utah :)”
  • “utah :(”

To get around this problem, I normalized the text by converting it to lowercase, removing punctuation, and changing things that looked like addresses to have just the city (so that “123 Fake St., Springfield, USA” becomes just “springfield”). I then looked at the top 500 locations in terms of number of twitterers. These are the most common countries represented in users’ locations:
[Figure: Twitter User Locations (by Country)]
And the top 10 cities are:

  1. New York
  2. São Paulo
  3. Los Angeles
  4. London
  5. Chicago
  6. San Francisco
  7. Rio de Janeiro
  8. Tokyo
  9. Atlanta
  10. Toronto

While the locations are dominated by English-speaking countries, Brazil does come in second in terms of number of users, and two Brazilian cities show up in the top 10, which suggests that our language stats aren’t too far off the mark.
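The normalization step described above could be sketched roughly as follows. This is a minimal illustration, not the code used for the study: the function names and the address heuristic (a leading street number followed by comma-separated parts) are assumptions made for the example.

```python
import re
import string
from collections import Counter

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_location(raw):
    """Lowercase, reduce address-like strings to the city, and strip punctuation."""
    text = raw.strip().lower()
    parts = [p.strip() for p in text.split(",")]
    # Assumed heuristic: "<number> <street>, <city>, ..." looks like an address,
    # so keep only the second comma-separated part (the city).
    if len(parts) >= 2 and re.match(r"\d+\s", parts[0]):
        text = parts[1]
    text = text.translate(_PUNCT_TABLE)
    return " ".join(text.split())  # collapse leftover whitespace


def top_locations(raw_locations, n=500):
    """Count normalized locations and return the n most common."""
    counts = Counter(normalize_location(loc) for loc in raw_locations)
    counts.pop("", None)  # ignore entries that were empty or pure punctuation
    return counts.most_common(n)
```

With this sketch, “Utah :)” and “utah :(” collapse into the same entry, but “New York” and “NYC” would still be counted separately; merging aliases like these would need an extra lookup table.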

Another question we considered in our study is whether there is any way to distinguish twitterers who post broadly informative messages from those who post mainly personal messages or spam. Our first thought was that the number of followers a twitterer has would be a good indication of how informative their messages are to a wider audience. But we were quite surprised when we looked at the distribution of the number of followers in our sample:
[Figure: Histogram of Log Number of Followers on Twitter]
The x-axis here is the logarithm base 10 of the number of followers. While most twitterers in our corpus have between 15 and 60 followers (log=1.2 to 1.8), there is a long tail where we can find accounts with more than a thousand, 100,000, or even a million followers. We didn’t realize at first the number of celebrities currently using Twitter, as you can see in this list of the top 100 most-followed Twitter accounts. Of course, it’s a matter of opinion whether the latest funny video that Ashton Kutcher found on YouTube is more important than what Barack Obama has to say about health care, but for our purposes, we’d rather filter out celebrity ramblings from the more serious messages, and that is not easy to do based on the number of followers alone.
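For readers who want to check the log scale, here is a small sketch of the conversion and a coarse histogram by order of magnitude. The function names are made up for this example, and the binning by whole powers of ten is an assumption, much coarser than the histogram in the figure.

```python
import math
from collections import Counter


def log10_followers(n):
    """Base-10 logarithm of a follower count, as plotted on the x-axis."""
    return math.log10(n)


def decade_histogram(follower_counts):
    """Count accounts per order of magnitude (10s, 100s, 1000s, ...)."""
    bins = Counter()
    for n in follower_counts:
        if n < 1:
            continue  # log10 is undefined for zero followers
        bins[math.floor(math.log10(n))] += 1
    return dict(bins)
```

Rounded to one decimal place, `log10_followers(15)` and `log10_followers(60)` give the 1.2 and 1.8 quoted above.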

A more surprising fact we discovered is that spammer accounts can have relatively high numbers of followers as well, as you can see in the following boxplot:
[Figure: Number of Followers on Twitter, by Message Type]
This data is from the 1,000-tweet sample, classified by message type, that I discussed in my previous post. In this plot, messages about the user’s current status and private conversations are grouped together as “personal” messages, while all other messages (excluding spam) are “info” messages. The boxes show the middle 50% of the distribution for each type, while the whiskers extending from the boxes show where 99% of the data points lie. (There are a few outliers above 5,000 followers which are not shown here, to make the distributions easier to see.) While spam messages made up only a small fraction (4%) of our sample, the plot shows that within the set of spammer accounts there are quite a few which have more than 500 to 1,000 followers, a number which would be pretty high for the other two message types. There is even one spam account in our sample which had over 10,000 followers at the time it posted.
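The numbers behind a boxplot like this one can be computed with Python’s standard library. The sketch below is only an illustration: the function name is hypothetical, and treating the whiskers as the 0.5th and 99.5th percentiles is an assumption about how “99% of the data points” was defined.

```python
from statistics import quantiles


def box_summary(values):
    """Quartiles for the box, plus assumed 0.5%/99.5% cut points for the whiskers."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    cuts = quantiles(values, n=200, method="inclusive")  # 199 cut points
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": cuts[0],    # 0.5th percentile
        "whisker_high": cuts[-1],  # 99.5th percentile
    }
```

Calling `box_summary` once per message type (personal, info, spam) on the follower counts would give one box per group.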

But how could a spammer get so many followers, given that all they post is spam? Given that for nearly all of these accounts, the number of friends (accounts they are following) is greater than the number of followers, I suspect that what’s going on is that spammers go around following other twitterers at random, and at least some of these people are following them back out of courtesy, without realizing that they are actually a spam account. The only way a Twitter spammer could get someone to see their tweets is if they are followed by them, after all. There’s probably a high rate of turnover in a spammer’s followers list, but that wouldn’t matter much, as long as they can find more people to follow who will follow them in return without checking them out first.
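The friends-versus-followers observation is easy to check on any set of accounts. Here is a minimal sketch, assuming each account is a dictionary with hypothetical `friends` and `followers` keys; it is a diagnostic for the pattern described above, not a spam classifier.

```python
def follows_more_than_followed(account):
    """True if the account follows more people than follow it back."""
    return account["friends"] > account["followers"]


def share_following_more(accounts):
    """Fraction of accounts whose friend count exceeds their follower count."""
    if not accounts:
        return 0.0
    flagged = sum(1 for a in accounts if follows_more_than_followed(a))
    return flagged / len(accounts)
```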

All of this means that distinguishing spam from informative tweets will not be easy, even if there isn’t that much of it currently. But some good news for us is that twitterers who post lots of informative content do tend to have more followers than those who post mainly personal messages. This fact, combined with some semantic analysis of Twitter messages, should help us a great deal in mining the Twitter stream for useful content.

QA Team New Year’s Resolutions

January 14th, 2010 by Maurice Forrester

The end of 2009 provided the TextWise Quality Assurance team with an opportunity to create some New Year’s resolutions. The end of the year coincided with the conclusion of a testing cycle, giving us the chance to review what worked and what needs improvement as we plan for the future. This is not necessarily a year-end activity, and it is certainly not a once-a-year activity. Reviewing what worked and what didn’t work should be done at the end of each test cycle. But the timing enables us to call these our New Year’s resolutions, at least this one time.

Our first resolution is to review and update test cases. No matter how much we want to have our formal test cases completely defined before testing begins, there will inevitably be improvements that can be made to the documentation. Exploratory and ad hoc tests need to be written up as formal test cases. New test cases need to be updated or fleshed out based on what was learned during the test cycle. Bug reports and test reports are a useful resource for identifying gaps in the formal test documentation.

Our second resolution is to review and update our existing processes. These processes include both our internal QA team processes and larger organizational processes that intersect with QA work. This is an opportunity to address process issues that were encountered during the test cycle and to add new processes that may be needed. Some of the areas we look at are the processes dealing with requirements, design, release, and bug triage. Because many of these processes cross groups, all stakeholders must be involved in making changes.

Our third resolution is to review the data we collect and report. We review the specific data we are collecting, how we collect that data, and the format in which we present the data. Process improvements may be needed to improve our collection of data. Other groups, and particularly the management team, need to be involved to ensure that we are providing the information they need to continue to grow our organization.

Our final resolution is to look to the future. All of the previous resolutions are aimed at making improvements for the future but they are based on what has happened in the past. Now is the time to also look forward at the projects that are coming our way and ask what information we need and what planning we need to do. It’s also an opportunity to take the things that we’ve learned from the previous resolutions and apply them to our pending tasks. In this way, we seek to continually improve the QA team and contribute to the growth of the organization.

In With the New, Out With the Old

January 12th, 2010 by Clinton Mah

Happy 2010! It’s time to think about building the next generation of semantic dictionaries, much like rolling out next year’s models of automobiles from Detroit. We are not talking tail fins and rich Corinthian leather here, however. In the Age of the Web, any kind of dictionary will have a strictly limited life span.

Think of Susan Boyle, the Droid, and the Zhu Zhu hamster. You are not going to find any of these in a dictionary compiled just a year ago. And a year-old semantic dictionary would probably have few associations between Barack Obama and White House.

We at TextWise have been busy learning how to build new semantic dictionaries in a hurry and to make them better at the same time. With our statistical technology, we can now turn one out in about two weeks, including data collection time. Everyone will want them fresh and hot.

How informative is Twitter?

January 8th, 2010 by Cliff Crawford

Recently we’ve been looking at how well our Semantic Signatures technology works with messages posted to Twitter. These kinds of messages pose significant challenges for the semantic web in general, because their extremely short length (140 characters or less) means that there will be very little context available for understanding the content of the message. In addition, many of these messages feature “creative” spellings and grammar, and are of a personal nature (e.g. “Having sushi for lunch today”) that would not be of general interest. Extracting any meaningful information from these snippets of random conversation will be quite a difficult task indeed.

To see what exactly we’re up against, we undertook a small study to characterize the different types of messages that can be found on Twitter.

Learning

December 21st, 2009 by Clinton Mah

Consider how we humans learn language. Even with formal education, it takes a child about 15 years starting from infancy to be able to read and understand general news articles in the New York Times. Over this period, one would probably hear or read at least on the order of 10 billion words. Even so, most high schoolers will need many additional years of schooling to become able to comprehend technical material.

So, how can anyone expect a computer to understand something like medical text after training on only about 100 million words of data? A computer of course runs on nanosecond cycles while the human brain operates on millisecond cycles; but we have had about 50,000 generations to evolve our language software, while the electronic computer has had only about 10 generations.

The bottom line here is that language learning is difficult, and it requires sifting through immense amounts of data. There probably is no magic technological shortcut here, but we have now reached the stage where our systems can routinely handle the volumes of data that would support semantic capabilities equivalent to an 8th-grade education. Decent commercial language processing tools are also now available.

Consequently, we are making major progress on semantic dictionaries, but have to be realistic about the work still ahead of us. Expect no overnight miracles from us or anyone else, especially when these are based on measly samples of data. There is still no royal road to semantics.

Testing WordPress Plugins

December 17th, 2009 by Maurice Forrester

Testing our SemanticHacker WordPress plugin has some similarities to testing foof, our Firefox extension, in that we are testing within another application. As with testing Firefox extensions, WordPress plugin testing must include testing on multiple operating systems and multiple versions of Firefox, and it adds the need to test on additional browsers. Because WordPress has been releasing frequent updates we’ve had to focus attention on how to quickly verify our plugin on each WordPress upgrade. As a result, we have two major types of testing for our WordPress plugin: testing a new release of the plugin and verifying our plugin in a new WordPress release.

Regardless of which type of test sequence we’re on, there are some things that we always have to test. We need to validate all supported browser and OS combinations and we need to test all functionality of the SemanticHacker plugin. This functionality includes the ability to use text in a blog post to find relevant content links, tags, webpage links, and products.

When testing a new release of our WordPress plugin, we have two user paths we need to test: an update of an older plugin release and a fresh install of the new version of the plugin. We run our tests on all versions of WordPress that we support, following both paths. Of course, if there is new functionality or there are bug fixes, we need to add test cases to cover them.

When there is a new WordPress release, we also consider two paths by which our plugin can appear in that version of WordPress. One is that an existing instance of WordPress with the SemanticHacker plugin is upgraded to the new version. The other is that our plugin is installed fresh on the version being tested. All tests are run on the new version of WordPress following both possible paths. Assuming the new WordPress release passes our tests, we add that version to our list of supported WordPress releases. At the same time, we determine whether there are older versions on the list for which it is no longer worthwhile to continue testing because they are too little used.
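The combinations described in this post multiply quickly, so it can help to enumerate them explicitly. The following is a rough sketch of such a test matrix, not our actual test plan: the function name, the version strings, and the browser names are placeholders chosen for illustration.

```python
from itertools import product


def build_test_matrix(wp_versions, install_paths, browsers):
    """Return one test configuration per version/path/browser combination."""
    return [
        {"wordpress": version, "path": path, "browser": browser}
        for version, path, browser in product(wp_versions, install_paths, browsers)
    ]


# Placeholder values for illustration only.
matrix = build_test_matrix(
    wp_versions=["2.8.x", "2.9.x"],
    install_paths=["upgrade", "fresh_install"],
    browsers=["firefox", "other_browser"],
)
```

Two versions, two install paths, and two browsers already give eight configurations, before operating systems are added as a fourth axis.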

ABC’s of Semantic Dictionaries

December 14th, 2009 by Clinton Mah

A semantic dictionary in essence consists of triples [ t , d , w ] , where t is a term, d is a semantic dimension, and w is a weight. Each triple says that the occurrence of a term t in a document constitutes a raw vote of w for dimension d being relevant to the document. For example, [ BRAD , Arts/People/Jolie,_Angelina , 0.12100 ] indicates that the occurrence of BRAD in a news story provides evidence that it might be about the movie celebrity Angelina Jolie. If it were conclusive evidence, the weight would be 1.00000, but we never expect any single term to be that definitive.
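To make the idea of raw votes concrete, here is a minimal sketch of a dictionary stored as triples and applied to a piece of text. It is an illustration under stated assumptions, not our production method: simply summing the weights per dimension, the uppercase tokenization, and every entry except the BRAD/Jolie triple are assumptions made for the example.

```python
import re
from collections import defaultdict

# Each entry is a triple (term, dimension, weight).
TRIPLES = [
    ("BRAD", "Arts/People/Jolie,_Angelina", 0.12100),  # example from the text
    ("BRAD", "Arts/People/Pitt,_Brad", 0.40000),       # hypothetical entry
    ("MOTOR", "Electrical_Machinery", 0.25000),        # hypothetical entry
]


def build_index(triples):
    """Group (dimension, weight) pairs by term for fast lookup."""
    index = defaultdict(list)
    for term, dimension, weight in triples:
        index[term].append((dimension, weight))
    return index


def score_document(text, index):
    """Add up the raw votes that each term occurrence casts for its dimensions."""
    votes = defaultdict(float)
    for token in re.findall(r"[A-Za-z]+", text.upper()):
        for dimension, weight in index.get(token, ()):
            votes[dimension] += weight
    return dict(votes)
```

In this sketch, a story that mentions BRAD twice accumulates two votes of 0.121 for the Angelina Jolie dimension, and no votes at all for dimensions whose terms never occur.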

In building a dictionary for an application, we have to start with the dimensions. Do we have the kinds of dimensions to cover the target content, and are there enough dimensions to make the distinctions in content required by the application? For a patent information system, an Angelina dimension may not have much relevance, and even something more appropriate like Electrical Machinery may have to be divided up into multiple dimensions to support a reasonable level of granularity in indexing.

Given the dimensions, we next have to define the terms to go along with them. The target content we want to process will have a certain vocabulary, and our dictionary terms should try to encompass most of it. This can be tricky in a statistical approach because we need reasonably large samples of training data to make a particular term become associated with a particular dimension.

Weights are determined in large part by training data, but the distribution of those numbers is important. To begin with, not all weights should be the same, and generally, we want to see them spread out over the entire dynamic range available to us. Weights that are quite small or quite large have to be supported by more data than those in the middle range. Weights have to be balanced between dimensions, and there should be enough of them so that most terms are related to more than one dimension.
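A few of these properties can be checked mechanically once a dictionary is built. The sketch below assumes the dictionary is a list of (term, dimension, weight) triples; the function name and the particular statistics are illustrative choices, not an established quality measure.

```python
from collections import defaultdict


def weight_diagnostics(triples):
    """Summarize the weight range and how many terms span several dimensions."""
    dims_per_term = defaultdict(set)
    weights = []
    for term, dimension, weight in triples:
        dims_per_term[term].add(dimension)
        weights.append(weight)
    multi = sum(1 for dims in dims_per_term.values() if len(dims) > 1)
    return {
        "min_weight": min(weights),
        "max_weight": max(weights),
        "distinct_weights": len(set(weights)),
        "multi_dimension_share": multi / len(dims_per_term),
    }
```

A low `multi_dimension_share` or a narrow gap between `min_weight` and `max_weight` would suggest a dictionary that falls short of the goals described above.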

In theory, we could build a dictionary with just one weight in just one dimension for each term, but that would be in denial about the inherent ambiguity of language. So, we typically want a dictionary to be as big as possible, based on an appropriate amount of training data. To build the best possible dictionary requires much inspiration and much perspiration.