SIGIR Highlights July 22, 2009

posted by Mary McKenna on 07/23/09

SIGIR Day Three  July 22, 2009

Great job by Daniel Tunkelang of Endeca putting this track together

Morning: Industry Track Speakers

“Webspam and Adversarial IR: the Road Ahead” (Google) Matt Cutts

Requirements for spammers: content, reputation, opportunity for monetization. Examples of on-page and off-page spam provided. Spoke of defensive tools such as nofollow. Clear escalation from devising spamming routines to outright hacking. 1) Concentrate on finding hackers - they are joining with spammers – malware detection is key - they hack sites and sell links. 2) Prevent common spam – human tests, etc. – which techniques prevent it that any site publisher can use (spam classification for WordPress blogs – good tool). 3) Looking for trust, identity, authentication. Warning – Facebook, Twitter, etc. are a new ecosystem with new forms of spam; fake profiles abound.

“The Searchable Nature of Acts in Networked Publics” Danah Boyd (Microsoft)

Danah’s research area is social media; she’s looked at differences between MySpace and Facebook, etc. Her focus is communication – she is an ethnographer: “How young people use the internet.” Everything is VISIBLE. Distinction between social network sites and social networking sites.

Social network sites: A/ Engaging with preexisting friends (different from social networking sites – meet new people). Profile is the digital body – misinformation is intended and everywhere: 1) meant to be funny (alter egos), 2) young people have been told to lie about who/what – keep away the predators, 3) don’t want to be searchable/found. Don’t assume there is accurate information in social network sites. Average age stats are wrong! B/ Public articulation of “friends” – assumes links are equal but relationships are not equal. Three key concepts of networks: sociological, articulated (public), behavioral (exchange content/interact).

Networked Publics: Issues – Persistence; Replicability (context frequently gone); Searchability (not who you want for the most part); Scalability (who is seeing your content); Invisible Audiences – who are you talking to? Leads to imaginary audiences; Collapsed Contexts (social context is constantly changing, frequently misleading); New Public/Private Boundaries (getting reworked).

Twitter: just this spring a big player. Twitter is not a chat. Who is using it, and for what, is constantly changing: celebrity cachet and a mouthpiece to get back at powerful bloggers; soapbox; you choose who you follow. 5-15% of accounts are protected – most accounts are public. 5% of tweets contain a hashtag (almost half of these contain a URL). 22% include a URL. 36% mention another Twitter user (put it at the beginning – tweet is really directed at an individual). 50 accounts have over 1M followers, 350 have several hundred thousand, millions of accounts are dead. 140 characters – very difficult constraint for searchability; retweets – some attribute, some drop it.
Info Retrieval Thoughts: Social media is about conversation and contexts – tough to make sense of the social context.  Danah@danah.org

“Ad Retrieval – A New Frontier of Information Retrieval”  (Vanja Josifovski – Yahoo! Research)

Disclaimer – can’t expose any Yahoo! trade secrets. 40% of ads are textual – competing with other content on the page - sponsored search and content match placement. ~30% of web users interact with ads (thinks this is because ads are not relevant). Text ads have visible/non-visible parts – landing URL too big, too much info. Bid phrase (keywords) used to target the ad (ads are creatives + bid phrase). Ad Retrieval: Sponsored search – keyword bid – database technology. Content Match – look for bid phrases, place ads – still single-feature matching. New way to look at it: treat the ad as a document in IR. Cost of serving the ad needs to be less than the revenue returned; also need to keep performance in mind.
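To make the "treat the ad as a document" idea concrete, here is a minimal sketch of my own, not anything shown in the talk: each ad (creative text plus bid phrase) becomes a bag of words, and ads are ranked against the page text by TF-IDF cosine similarity. The function name `rank_ads`, the whitespace tokenizer, the smoothed IDF formula, and the sample ads are all assumptions for illustration; a production system would also fold in bids, serving cost, and latency limits.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; an assumption, real systems do far more.
    return text.lower().split()


def rank_ads(page_text, ads):
    """Rank ads against a page, treating each ad as a small document.

    ads: dict mapping a hypothetical ad id to its text (creative + bid phrase).
    Returns a list of (ad_id, cosine_score), best match first.
    """
    docs = {ad_id: Counter(tokenize(text)) for ad_id, text in ads.items()}
    n = len(docs)

    # Document frequency of each term across the ad collection.
    df = Counter()
    for tf in docs.values():
        df.update(tf.keys())
    # Smoothed IDF so every weight stays positive.
    idf = {t: math.log((n + 1) / (df[t] + 1)) + 1.0 for t in df}

    def vec(tf):
        # Terms never seen in any ad are dropped from the vector.
        return {t: c * idf[t] for t, c in tf.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    page_vec = vec(Counter(tokenize(page_text)))
    scored = [(ad_id, cosine(page_vec, vec(tf))) for ad_id, tf in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    ads = {
        "shoes": "running shoes discount marathon",
        "loans": "home loans mortgage rates",
    }
    print(rank_ads("training for a marathon in new running shoes", ads))
```

Unlike single bid-phrase matching, every word in the ad contributes to the score, which is the shift the talk described.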

“Corpus Linguistics and Semantic Technology at the New York Times”  (Evan Sandhaus – NYT)   (Semantic Technologist, NYT R&D) 

NYT annotated corpus – LDC – 20 years of data (1987-2007), launched 10/08; obtain through LDC or NYT. 1.8M articles, 900K+ tags, 665K abstracts – NITF format, an XML standard. corpus.nytimes.com. Reuters corpus came first, smaller collection, annotated. Potential uses of data, some ideas: location of article is implicit ranking, # of words, etc. (mm: too temporal?); automated document summarization corpus gold. 80 users after nine months.

“Query Modeling at Bing” Nick Craswell, Bing (Microsoft Research Cambridge)  (Filling in for sick speaker – OCLC)

Can’t tell how everything works (trade secrets, etc.). Ambiguous queries: ‘house’. Mine the logs – click logs; session data – what query follows the query “house”?; what other queries have clicks on that URL (co-click data). Intent clustering – use session data (strings of queries), keep the nodes but replace the edges with co-click data, then cluster on top of this. Provides measurement possibilities, improves IR modeling of queries, feeds into UI development. Temporal dynamics in logs: spiking/seasonal queries – feed ranking; periodic queries (DST 2007/DST 2008); stale anchors / trailing signals (BO – campaign page vs. White House); temporal query expansion – watch spikes. Table of Contents: summary of aspects of an entity – summarizes mainline (results); sticky control panel. “Bing Gets It”: provide info along popularity and consistent content, summary of mainline results. V1 just out, development continues.
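The intent clustering step can be sketched roughly as follows. This is my own illustration under stated assumptions, not Bing's method (which was not disclosed): queries seen in sessions are the nodes, two queries get an edge when users clicked the same URL for both (co-click), and the "cluster on top" step is approximated here by plain connected components via union-find. The function name `coclick_clusters`, the sample queries, and the example.com URLs are hypothetical.

```python
from collections import defaultdict


def coclick_clusters(session_queries, click_log):
    """Group queries into intent clusters using co-click edges.

    session_queries: iterable of query strings taken from session data (nodes).
    click_log: iterable of (query, clicked_url) pairs.
    Returns a list of sets of queries, in a deterministic order.
    """
    nodes = set(session_queries)

    # Queries that led to a click on the same URL are co-clicked.
    queries_by_url = defaultdict(set)
    for query, url in click_log:
        if query in nodes:
            queries_by_url[url].add(query)

    # Union-find over the query nodes.
    parent = {q: q for q in nodes}

    def find(q):
        while parent[q] != q:
            parent[q] = parent[parent[q]]  # path halving
            q = parent[q]
        return q

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Stand-in for the real clustering: merge every co-clicked group.
    for queries in queries_by_url.values():
        first, *rest = sorted(queries)
        for other in rest:
            union(first, other)

    clusters = defaultdict(set)
    for q in nodes:
        clusters[find(q)].add(q)
    return sorted(clusters.values(), key=lambda c: sorted(c))


if __name__ == "__main__":
    queries = ["house", "house md", "house tv show",
               "houses for sale", "real estate"]
    clicks = [
        ("house", "example.com/tv"),
        ("house md", "example.com/tv"),
        ("house tv show", "example.com/tv"),
        ("houses for sale", "example.com/homes"),
        ("real estate", "example.com/homes"),
    ]
    for cluster in coclick_clusters(queries, clicks):
        print(sorted(cluster))
```

With real logs an ambiguous query like "house" would have clicks on both kinds of URL and would merge everything into one component, so a weighted clustering method would be needed in place of connected components.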

Industry Track: Afternoon Panels

Search Industry Analysts:    Whit Andrews Gartner  Sue Feldman IDC  Theresa Regli (CMS Watch)    Marti Hearst (Responder)     Daniel Tunkelang (Moderator)

TR: ( Implementation consultant for ten years) 3 yrs as an analyst, evaluate products, what fits for your needs? CR of search products – clients have very specific needs. Sounds like she has lots of eDiscovery clients; audio/video clients;

SF: (linguistics background) Web search is not ahead of enterprise search – v interesting stuff in the enterprise search systems. Real time information – Enterprises have immediate info needs, automated online info key;  Mobile work force – access to everything in company (security and access); Money – now a necessity to fund access, not a nice to have.

Trends – search-based apps to solve a biz problem – borrowed from search architecture; Convergence of platforms – IM, search, etc.; Unified access to info – BI tools on all data – flexible; Hybrid architectures – database with inverted indexes / but database features supporting ad hoc querying and search. Search is not a goal in itself, needs to be integrated into the workflow process. UI will sell new tools to new buyers (mktg/mgrs, not IT staff). Task/tools, not single search tech.

Open source embedded everywhere – collab, CRM, sales, etc.  Lucene, Solr, etc.  (not free really)

WA: (ex-journalist) 4 trends: Federation (access without paying for it) doesn’t nec always work, but seeing improvement and will see value. Conversation – disambiguation of query – ask the user – participatory search; Transparency – what is driving results ranking. Video – growing like crazy in 2009. Real time is more important than ever.  Value. “Relevance is about money”

Lively discussion throughout the session about the relevance=money statement. Big Disagreement.

QA Session – Could not hear some floor speakers, selected questions only here

?Employees want a search box – no one wants sophistication? No defense against the search box and search button

A: People are not married to Google results if you show effective use of different system.

? How do you evaluate an enterprise search system?

A: depends on the need: recall sometimes, precision sometimes A: precision.  Business goals are what matter -any evaluation needs to be within business goal and most companies don’t have them coming to the table.
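For readers less familiar with the two measures the panel was trading off, here is a small sketch of how precision and recall are computed for a single query. The function name `precision_recall` and the document ids are my own illustration, not something from the panel.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query.

    retrieved: ids of the documents the system returned.
    relevant: ids of the documents judged relevant for the query.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: what share of the returned documents were relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: what share of the relevant documents were returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
    print(f"precision={p:.2f} recall={r:.2f}")
```

An eDiscovery search would typically lean on recall (miss nothing), while a quick intranet lookup leans on precision (few wrong hits), which is the "depends on the need" point above.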

Marti: Spent a lot of time in this panel talking about integrity, unusual for this conference as we strive to be honest and direct in our work. How do you control when your competitor says “this is the cutting edge”; how do you not jump on the bandwagon? SF: Don’t read competitor reports. If buyers, first bring your requirements. WA: We have a hype cycle: tech trigger to hype, to inflated expectations, trough of no delivery, then the reality. Where is it on the hype cycle? Where is it on the adoption curve? What hype are you willing to tolerate – where does your business fit on this curve?

Theresa: Tamping down the hype – be skeptical.

(Several other questions here but responses were wide ranging so I have left out)

Industry Track: Vendor Panel

Jeff Fried Sr Prod Mgr MSFT (M); Raul Valdes-Perez Co-Founder Vivisimo (V); Adam Ferrari CTO Endeca (E)

Liz (Moderator)    Bruce Croft (Responder)

Liz: What areas need work/advances?

Endeca: Evaluation for interactive IR? What features to include? Efficiency of architectures

Search to find vs search to learn -> interactive search

Microsoft: 3 views – Biz Analyst, End User, Systems Guy. Where researchers and practitioners align:

Take the user’s intent, match it to content. Most people are unhappy with enterprise search. Context matters, different systems for different companies; Positive feedback: what’s working; User Experience Measure: search UI matters.

Vivisimo: The big opportunity in enterprise search – web search history.  Enterprise – lots of new companies and UIs

Single search box access to everything is the big opportunity in enterprise search.   

Liz: What is the first problem we should work on? E: Test the efficacy of interactive IR.  M: Holistic eval is needed. Better theory about interaction. V: Search to Problem solving systems.

Liz: What’s unique about determining relevance in your system? M: Provide controls, slide bar for precision/recall control, exploration. Use user logs to improve relevance. V: Tunability, and something else, missed it. E: Need relevance, facets, may care about them differently – customizability.

Liz: What from enterprise search would be most appreciated by users of web search? V: Definitely UI. Don’t need to worry about ad real estate on the enterprise screen. E: People will enjoy better faceting, etc., but slow. Advanced features, visualization, etc. really are appreciated. M: UI, faceting, exploration. People will use these things if they work.

L: What evaluation measures does Enterprise search use/need? E: business metrics will dominate. Look at logs. What’s working, what isn’t.  M: Task completion and happy users.  V: we see people use # searches before/after adoption. What is the average position of the doc people click on?
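The two log-based measures Vivisimo mentioned are simple to compute once the logs are in hand. A minimal sketch, assuming the click log has already been reduced to a list of 1-based result positions and the usage logs to per-user search counts; the function names and sample numbers are hypothetical.

```python
def average_click_position(click_positions):
    """Mean 1-based rank of the results people clicked; lower is better.

    Returns None when there are no clicks to average.
    """
    if not click_positions:
        return None
    return sum(click_positions) / len(click_positions)


def searches_change(before, after):
    """Change in mean searches per user after adoption (after minus before).

    before, after: non-empty lists of per-user search counts.
    """
    mean_before = sum(before) / len(before)
    mean_after = sum(after) / len(after)
    return mean_after - mean_before


if __name__ == "__main__":
    print(average_click_position([1, 3, 2]))
    print(searches_change([4, 6], [2, 4]))
```

Neither number means much on its own: fewer searches could mean people found things faster or gave up, which is presumably why the other panelists stressed business metrics and task completion.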

L: What is the fundamental technical problem you are currently focused on?  M: System scale and footprint needs. Grow as much as you need it to. V: Getting search to work across a vast array of content types.  E: Greater effectiveness with greater simplicity.  M: Search connectors are always tough, would love to have research on it.

Where could academics and vendors work together? V: Companies only want people, not their research. People transfer would be the best project.  Fund university – get grad students. E: Events like today. Cross pollinate. Formalize ways of opening dialogue. More openness on part of vendors for transparency. M: LDC, SIGIR Industry Track, University sponsorship needs to be easier, cheaper.

Panel had an opportunity to ask each other questions:

E? How do you reconcile single search box with enterprise search needing deep interaction with repositories of data?

High value deep problems won’t work with this… V: Do believe this is an opportunity. Not a research problem. It’s an opportunity. Facets are needed for every diff kind of data. Provide both.

? Where does Endeca see the role of federation? Coexist but infrastructure needs to be there to accommodate deep search needs.

V?: SharePoint is wonderful. I’ve seen the videos. Can a platform serve the entire market? Underserved/overserved markets. High-end medical records mgmt might be underserved.   (Missed audience QA and Responder)

