
Friday, June 29, 2007

The Powerset Demo Day

I just had the luck of being invited, together with forty other people, to the first public demo of what the guys at Powerset are building. The demo was at their headquarters in San Francisco. It was fairly impressive, to put it mildly.



They showed demos of everything they've been blogging about, like this or this, plus some other new applications.

I also had a chance to talk to some of the core Ruby developers, the search ranking engineers and the linguists working there and it's really an incredible team. One can smell the excitement floating in the air about what they do.



Steve Newcomb and others commented on different sides of their technology and the company itself. Mark Johnson introduced the fresh-out-of-the-oven Powerlabs, to give a taste of what's coming in September. It'll be possible to generate mash-ups using their natural language processing and understanding technology which, in my humble opinion, is going to truly open the doors to a new generation of clever semantic tools. They aim to be really open about their system and to let people interact with it. They also want to be open in contributing back to the open source community, on whose work they build a lot of their infrastructure.



From seeing their demos, it really looks like they are taking search, among other things, to a new level by not indexing keywords but actually indexing "concepts" that they extract by semantically parsing the searchable data. If they find "dog" in a sentence they will also associate "mammal", "animal" and "pet" with it, allowing for real abstraction when performing searches.
They combine that with normal search ranking techniques to obtain impressive results.
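
To make the idea a bit more concrete, here is a toy sketch of concept indexing. It is not Powerset's implementation: the `HYPERNYMS` table and the `build_index` and `search` helpers are made up for illustration, and a real system would derive the concepts from a parser and a lexical database rather than from a hand-written dictionary.

```python
from collections import defaultdict

# Hypothetical hand-written concept table; a real system would derive
# this from a lexical resource such as WordNet.
HYPERNYMS = {
    "dog": ["mammal", "animal", "pet"],
    "cat": ["mammal", "animal", "pet"],
    "trout": ["fish", "animal"],
}


def build_index(documents):
    """Map every word, and every concept associated with it, to document ids."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in text.lower().split():
            index[word].add(doc_id)
            for concept in HYPERNYMS.get(word, []):
                index[concept].add(doc_id)
    return index


def search(index, term):
    """Return the sorted ids of the documents indexed under a term."""
    return sorted(index.get(term.lower(), set()))


documents = [
    "the dog barked all night",
    "a trout swam upstream",
    "stock prices fell again",
]
index = build_index(documents)
print(search(index, "pet"))     # documents that mention some kind of pet
print(search(index, "animal"))  # documents that mention any animal
```

A search for "pet" finds the first document even though the word "pet" never appears in it, which is the kind of abstraction described above.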

It really creates new ways of interacting with the data to be searched. Queries like "What politician died from a disease?" or "What disease killed a politician?" work flawlessly even when there are no references to die or kill in the text. Their natural language engine understands that "died from" is equivalent to "killed by" and relates to "deceased" or "pass away". It really knows about those concepts abstractly and about the semantic relations in the search query.
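
One way to picture this kind of equivalence is to reduce differently worded statements to the same canonical relation. The sketch below is only a guess at the general idea, not how Powerset does it; the `RELATIONS` table, the `cause_of_death` label and the `normalize` helper are all hypothetical.

```python
# Hypothetical table mapping surface phrases to a canonical relation.
# The boolean says whether the phrase lists its arguments in reverse
# order (cause first, victim second).
RELATIONS = {
    "died from": ("cause_of_death", False),      # victim died from cause
    "was killed by": ("cause_of_death", False),  # victim was killed by cause
    "killed": ("cause_of_death", True),          # cause killed victim
}


def normalize(subject, phrase, obj):
    """Turn (subject, phrase, object) into a canonical (relation, victim, cause)."""
    relation, reversed_args = RELATIONS[phrase]
    if reversed_args:
        subject, obj = obj, subject
    return (relation, subject, obj)


a = normalize("politician", "died from", "disease")
b = normalize("disease", "killed", "politician")
print(a == b)  # both phrasings reduce to the same canonical fact
```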

Powerlabs, the social site aimed at letting people play with their technology, will be launched in September and already has more than 10,000 members signed up, so there's definitely a growing community interested in their developments.

During the Q&A sessions some very interesting topics popped up, like support for multiple languages, and detection of and resistance to spam (or text created by different models in order to appear human-generated). Also, the "understanding" that they obtain from parsing a sentence could allow for better spam filtering, not just by spotting words that are more or less likely to be spam, but by actually detecting incoherent meanings or just uninteresting topics... just imagine having the messages in your inbox automatically clustered by their real meanings, without having to specify a single rule (emails dealing with this, emails dealing with that...). The applications are endless...

I'm dying to play with it...

Friday, June 22, 2007

Powerset and the garden path

I've recently bumped into Powerset again. I had previously heard about them when they got some people from PARC (if I remember correctly) and set out to build something that I had always dreamed about. The guys at Powerset are tackling one of the hardest and most interesting (in my opinion) problems currently known, that is, helping computers process and "understand" natural language and using those results to make information more accessible. From my humble amateur-linguistic-aficionado point of view, they are doing great work there. Soon I will have a chance to see it live, first hand, and I can't wait.

In one of their latest posts they discuss some ambiguities that arise from using words with several meanings in contexts where the least common meaning is the intended one, leading to misunderstandings.

To put it in other terms, the problems arise when the less known meanings of words are used in such a way that the brain is misled when it starts reading a sentence, which leads it to misunderstand the subsequent words (which can also have several meanings, depending on how one understood the start of the sentence).
Normally, once the sentence has been read several times, the brain finally "switches" into the right interpretation of the different meanings of those words in a way that the whole construct becomes coherent.
I personally see it as resembling the visual phenomena where the brain interprets specially crafted images in different ways, switching back and forth between their different interpretations, like in the Young Girl-Old Woman Illusion or the Rabbit-Duck one.

In the case of these garden path sentences, as they are commonly called, the brain gets confused because of the dependencies between the words and their meanings.

As the brain starts reading a sentence, it will attempt to predict what follows, and it's amazingly good at that. The trick is to throw it off track by using words with multiple meanings.

In the example that they have as their post title, "Search Engines Leaking Oil for Holes", the brain is tricked into taking the most common meaning of the first two words (a compound noun or collocation) and attempting an interpretation that later becomes rather confusing when reaching "leaking oil".



Re-reading the sentence can lead to a second interpretation



In their post they ask how hard it would be to find an automated way of generating such garden path sentences, and they describe a pseudo-algorithm like the following:


You can make your own garden path sentences by following a few simple heuristics (...). The trick is to choose words that can act as both nouns and verbs, or as both adjectives and nouns, words like store, search, and post. Then follow the ambiguous word by another word that can take on more than one form. The hard part is to then add on another noun phrase that makes sense with the less common interpretation of the second word.


Trying to follow their heuristics, the first thing to do would be to find sets of words that can be both a noun and a verb, or an adjective and a noun. Thanks to WordNet, PyWordNet and the mash-up of those and more provided by the guys from NodeBox, that's not nearly as hard a task as it would have been without such a toolset.

Sets of words fulfilling those requirements can be built in a few lines of Python.


# Assumes PyWordNet's wordnet module, which exposes the N, V and ADJ dictionaries
import wordnet

# Collect nouns, verbs and adjectives
verbs = set(wordnet.V.keys())
nouns = set(wordnet.N.keys())
adjectives = set(wordnet.ADJ.keys())

# Pick the ones that can work both as nouns and verbs or as nouns and adjectives
noun_verbs = verbs.intersection(nouns)
noun_adjectives = adjectives.intersection(nouns)

print('Found %d words that are both verbs and nouns' % len(noun_verbs))
print('Found %d words that are both adjectives and nouns' % len(noun_adjectives))

Found 4096 words that are both verbs and nouns
Found 3138 words that are both adjectives and nouns



I will also need some means of knowing which words are more likely to follow a given one. For that I will reach into some datasets I collected years ago for some computational linguistics experiments. Using a small corpus of 2,071,007 sentences built out of books from Project Gutenberg and running it through some Python code, I obtained 16,057,624 word pairs, 2,365,383 of them unique. That will provide me with some numbers on which words are likely to follow others.
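
The code that produced those counts is not shown here, but a minimal sketch of how such counts could be gathered looks like this. The names `occurrences`, `word_sparse` and `word_pairs_occurrences` match the ones used in the snippets that follow, and their shapes are inferred from how those snippets use them; the `count_pairs` helper, the naive tokenization and the three sample sentences are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict


def count_pairs(sentences):
    """Count single words and adjacent word pairs in a list of sentences."""
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        # Naive tokenization: lowercase runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", sentence.lower())
        word_counts.update(words)
        pair_counts.update(zip(words, words[1:]))
    return word_counts, pair_counts


# Tiny made-up sample standing in for the Gutenberg sentences.
sentences = [
    "Look out for the dog.",
    "They look like twins.",
    "You look like him.",
]
word_counts, pair_counts = count_pairs(sentences)

# (word, count) tuples, most frequent first.
occurrences = word_counts.most_common()
# ((first, second), count) tuples, most frequent first.
word_pairs_occurrences = pair_counts.most_common()
# word -> {following word: count}
word_sparse = defaultdict(dict)
for (first, second), count in pair_counts.items():
    word_sparse[first][second] = count

print(word_sparse["look"])
```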

I can now look for frequently used words that can be both nouns and verbs. In the following line "occurrences" is a list containing all the words and the number of times they appear. They are filtered to only show the ones that are both nouns and verbs.


print([word for word in occurrences[:300] if word[0] in noun_verbs])

{{"be", 10070}, {"have", 7827}, {"like", 6577}, {"will", 6201}, {"out", 5422}, {"still", 4136}, {"even", 4049}, {"man", 3957}, {"can", 3866}, {"down", 3376}, {"see", 3104}, {"do", 3097}, {"time", 2729}, {"people", 2663}, {"well", 2602}, {"last", 2581}, {"back", 2337}, {"white", 2250}, {"make", 2088}, {"till", 2083}, {"come", 2048}, {"black", 2021}, {"general", 2004}, {"found", 1935}, {"light", 1918}, {"round", 1910}, {"go", 1880}, {"better", 1815}, {"face", 1755}, {"saw", 1742}, {"lay", 1740}, {"work", 1682}, {"form", 1678}, {"let", 1673}, {"right", 1654}, {"set", 1647}, {"lord", 1621}, {"look", 1579}, {"take", 1577}, {"hand", 1574}, {"head", 1546}, {"full", 1544}, {"best", 1538}, {"put", 1534}, {"state", 1531}, {"party", 1522}, {"love", 1517}, {"place", 1493}, {"house", 1491}, {"say", 1440}, {"get", 1401}, {"part", 1386}, {"water", 1385}, {"name", 1384}, {"second", 1370}, {"give", 1344}, {"felt", 1342}, {"present", 1327}, {"fell", 1320}, {"land", 1319}, {"use", 1311}}



Now, given a word, it's possible to find other words that often follow it and can also have several functions. For instance, let's see what comes out for "look":


# Pick words following 'look' that can be both nouns and verbs
succeeding_words = [p for p in word_sparse['look'].items() if p[0] in noun_verbs]
# Sort them from the most frequently used to the least
succeeding_words.sort(key=lambda p: p[1], reverse=True)
print(succeeding_words[:100])

[('like', 255), ('out', 185), ('down', 124), ('back', 115), ('forward', 82), ('round', 49), ('well', 42), ('pale', 26), ('better', 16), ('black', 9), ('right', 7), ('full', 6), ('white', 5), ('blue', 5), ('grave', 5), ('even', 4), ('still', 4), ('double', 4), ('cross', 4), ('close', 3)]



And the results for "form":


[('name', 185), ('part', 18), ('can', 8), ('saint', 8), ('like', 7), ('see', 5), ('will', 5), ('state', 4), ('ice', 3), ('till', 3), ('lay', 3), ('french', 3), ('people', 3), ('found', 2), ('out', 2), ('put', 2), ('well', 2), ('note', 2), ('black', 2), ('starch', 2)]



Although not being a native English speaker makes this a tiny bit more challenging, I can see how one could play with combinations like "look, like", "look, still", "look, well", "form, name", "form, like", etc. to build slightly confusing sentences.

Collocations are also great for misleading the brain whenever one of the words has more than one meaning ("visitor center", "search engines", "meeting point").
A quick hack to try to spot some automatically could be to look for pairs of words that often appear together and where each word can fulfil more than one function.
But given the low quality of the results shown next, one could, for instance, also take into account the relative frequency of a noun-noun compound as compared to other pairings of the nouns, to see how much more often those two words appear together than with others. There's extensive literature on how to improve this, and this was meant as a short-ish blog post after all.



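# `en` is the NodeBox English Linguistics library (assumed to be already imported)
def is_noun_and_verb(word):
    return en.is_noun(word) and en.is_verb(word)

# Keep the frequent pairs in which both words can be a noun as well as a verb
print([p for p in word_pairs_occurrences[:10000]
       if is_noun_and_verb(p[0][0]) and is_noun_and_verb(p[0][1])])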

{{will, be}, {can, be}, {labor, force}, {will, have}, {be, found}, {can, do}, {come, back}, {exchange, rate}, {will, do}, {will, make}, {can, read}, {prime, minister}, {will, go}, {will, give}, {have, come}, {come, out}, {go, back}, {right, hand}, {set, out}, {go, out}, {find, out}, {can, see}, {will, come}, {come, down}, {will, take}, {have, found}, {short, form}, {will, tell}, {birth, total}, {get, out}, {go, down}, {land, use}, {be, put}, {can, tell}, {father, brown}, {will, find}, {white, man}, {put, out}, {take, care}, {can, get}, {dare, say}, {will, see}, {can, make}, {be, well}, {short, time}, {can, have}, {found, out}, {lay, down}, {second, time}, {be, better}, {be, read}, {can, think}, {go, home}, {lord, will}, {birth, rate}, {hoist, side}, {meter, gauge}, {ftp, program}, {be, true}, {be, like}, {last, time}, {look, like}, {will, say}, {man, can}, {set, down}, {license, fee}, {come, home}, {can, find}, {make, out}, {put, down}, {give, notice}, {can, say}, {be, cut}, {take, place}, {low, voice}, {will, try}, {cast, out}, {get, index}, {have, put}, {lie, down}, {can, go}, {radio, relay}, {still, be}, {will, get}, {be, ready}, {well, be}, {wait, till}, {get, back}, {tax, return}, {free, copyright}, {fell, down}, {can, copy}, {set, bin}, {have, felt}, {look, out}, {be, out}, {form, name}, {satellite, earth}, {burst, out}, {will, keep}, {be, free}, {can, give}, {double, track}, {people, have}, {cut, down}, {will, show}, {fish, catch}, {turn, out}, {carry, out}, {well, have}, {work, force}, {be, set}, {have, set}, {miss, garland}, {will, put}, {can, take}, {do, well}, {let, go}, {mine, hand}, {earth, station}, {fell, back}, {take, heed}, {short, distance}, {air, force}, {can, help}, {will, help}, {cry, out}, {will, let}, {free, state}, {feel, like}, {will, cause}, {present, time}, {will, think}, {be, present}, {will, return}, {cast, down}, {black, man}, {narrow, gauge}, {bulletin, board}, {man, be}, {be, right}, {dry, tree}, {will, set}, {be, back}, {point, out}, {right, side}, {can, come}, {look, down}, {will, call}, {run, down}, {file, size}, {major, transport}, {labor, party}, {be, content}, {will, leave}, {man, will}, {will, look}, {can, use}, {need, be}}
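
The relative-frequency idea mentioned before these results can be sketched with pointwise mutual information (PMI), a standard association measure that compares how often two words appear together with how often they would be expected to by chance. PMI is just one possible choice swapped in here for illustration, and the `pmi` helper and all the counts below are made up rather than taken from the Gutenberg data.

```python
import math


def pmi(pair_count, first_count, second_count, total_pairs, total_words):
    """Pointwise mutual information (in bits) of an adjacent word pair."""
    p_pair = float(pair_count) / total_pairs
    p_first = float(first_count) / total_words
    p_second = float(second_count) / total_words
    return math.log(p_pair / (p_first * p_second), 2)


# Made-up counts, for illustration only.
TOTAL_WORDS = 1000000
TOTAL_PAIRS = 1000000
word_counts = {"will": 20000, "be": 30000, "prime": 60, "minister": 80}
pair_counts = {("will", "be"): 2000, ("prime", "minister"): 50}

scores = {}
for (first, second), count in pair_counts.items():
    scores[(first, second)] = pmi(
        count, word_counts[first], word_counts[second], TOTAL_PAIRS, TOTAL_WORDS
    )

# Highest score first: pairs that co-occur far more often than chance.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

With counts like these, a frequent but uninformative pair such as "will be" scores much lower than a rarer pair whose words almost always show up together, which is the behaviour one would want when hunting for collocations.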


The problem is definitely very challenging with current tools, but it's always fun to give it a spin. With a few hours and limited tools I only got as far as thinking of some ways to find good candidate words for garden path sentences, nowhere close to actually generating full sentences.

It would be great to expand on this toy research and make it actually useful and interesting. Using larger data sets (like this Google data set) from which to extract word relationships would be a good way to start. Having statistics for trigrams, four-grams, etc. would also help: with more contextual information it would be possible to get more meaningful constructs by ensuring that the chosen words actually occur close together within a small context.
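
Collecting counts for longer word sequences is a small generalization of counting pairs. Here is a minimal sketch, assuming the text has already been split into words; the `ngrams` helper and the sample sentence (a well-known garden path sentence repeated with a variation) are only there for illustration.

```python
from collections import Counter


def ngrams(words, n):
    """Return all runs of n consecutive words as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


words = "the old man the boats and the old man the oars".split()
trigram_counts = Counter(ngrams(words, 3))
print(trigram_counts.most_common(2))
```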

I can think of more ways of improving it, most of them involving large datasets and lots of computational power... gosh, I'm getting carried away thinking about this...

I'm looking forward to Powerset letting people play with their tools. I'm sure that implementing ideas like the one discussed in this rant will become much easier.