HighTechCville

Just another WordPress.com weblog

LinkedIn terrified of OpenCalais?

Based on some basic testing, it seems that LinkedIn blocks any traffic from the SemanticProxy.com site. I wanted to see what my page http://www.linkedin.com/in/epugh would render when fed through the OpenCalais entity extraction engine, and instead I got back connection errors.

This actually makes a lot of sense. LinkedIn’s public profile pages are there to be indexed by search engines and to drive more traffic to the site, but ultimately to convince users to join their walled network. So they put some information out there. But to make sense of the data, you need to join the network so you can run queries and see the underlying meaning behind the text.

But with tools like OpenCalais proliferating, other folks can add meaning to these profile pages, which reduces the need to join the walled garden.

For my part, a couple lines of Ruby code and this is what I extracted (type, value, relevancy):

Organization: Apache Software Foundation, 0.268
City: Charlottesville, 0.286
Technology: Information Technology, 0.344
Position: Services Consultant, 0.302
Technology: Apache, 0.268
Person: Eric Pugh, 0.845
Company: LinkedIn Corporation, 0.724
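
As a rough illustration (not the code used for this post), here is one way lines in the "type: value, relevancy" shape above could be parsed and ranked in Ruby. The `Entity` struct and `parse_entities` helper are hypothetical names, and the line format is only what this post prints, not an official OpenCalais output format.

```ruby
# Hypothetical container for one extracted entity.
Entity = Struct.new(:type, :value, :relevance)

# Parse "Type: value, relevance" lines; lines that do not match are skipped.
def parse_entities(text)
  text.each_line.filter_map do |line|
    match = line.strip.match(/\A(?<type>[^:]+):\s*(?<value>.+?)\s*,\s*(?<relevance>\d*\.\d+)\z/)
    next unless match

    Entity.new(match[:type], match[:value], match[:relevance].to_f)
  end
end

RESULTS = <<~TEXT
  Organization: Apache Software Foundation, 0.268
  City: Charlottesville, 0.286
  Technology: Information Technology, 0.344
  Position: Services Consultant , 0.302
  Technology: Apache, 0.268
  Person: Eric Pugh, 0.845
  Company: LinkedIn Corporation, 0.724
TEXT

entities = parse_entities(RESULTS)

# Show the two most relevant entities first.
top = entities.sort_by { |entity| -entity.relevance }.first(2)
top.each { |entity| puts "#{entity.type}: #{entity.value} (#{entity.relevance})" }
```

Sorting by relevance makes it easy to keep only the strongest matches and drop low-scoring noise.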

What OpenCalais missed that I would have liked was the My Interests section, which might have returned some industry terms such as agile practices, ruby on rails, open source, unit testing, scrum, selenium, and the websites listed.

Written by Eric

February 25, 2009 at 10:11 am

Entity Extraction of URLs made easy… Partly.

Thanks to Ed Summers at the Library of Congress for his post on SemanticProxy. SemanticProxy offers a dead simple API for feeding URLs to the OpenCalais entity extraction engine.

For those of you not familiar with OpenCalais, it is a “rules”-based entity extraction engine that knows how to find certain bits of information in free-form text, like the name of a person. OpenCalais is sponsored by Thomson Reuters, so most of the rules are based around the text you would find in a newspaper. For example, “GM is in talks with Chrysler for a merger” would give you the companies GM and Chrysler, as well as a relationship called “merger_talks” between the two.

I was hoping to use OpenCalais to extract place, time, and subject information for all those free-form event announcements. Unfortunately, OpenCalais doesn’t have the rules to pull that type of entity out. It does find the people listed in an event, but that’s it.

However, I did find it very useful for gathering more people information for HTC. A new group in Charlottesville called FirstWednesdays has started, and the comments on their “Find Me” page hold a wealth of data. I am just splitting up the DOM on each comment, feeding each one to OpenCalais, and getting back a person and a couple of relevant links. It’s working great.
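
The comment-by-comment approach could be sketched roughly like this. The markup below is invented (the real “Find Me” page structure isn't shown here), and the block passed to the hypothetical `each_comment_text` helper stands in for whatever actually sends the text to the extraction engine. REXML only handles well-formed markup, so a real page would likely need a proper HTML parser instead.

```ruby
require "rexml/document"

# Made-up, well-formed stand-in for a page of comments.
PAGE = <<~HTML
  <ol id="comments">
    <li class="comment"><p>Jane Doe, developer. <a href="http://example.com/jane">site</a></p></li>
    <li class="comment"><p>John Roe does Rails work.</p></li>
  </ol>
HTML

# Yield the plain text of each comment element to the caller's block
# and collect whatever the block returns.
def each_comment_text(html)
  doc = REXML::Document.new(html)
  REXML::XPath.match(doc, "//li[@class='comment']").map do |comment|
    text = REXML::XPath.match(comment, ".//text()").map(&:value).join(" ")
    yield text.gsub(/\s+/, " ").strip
  end
end

# The block here only measures the text; a real one would call the extractor.
results = each_comment_text(PAGE) { |text| { chars: text.length, text: text } }
results.each { |result| puts "#{result[:chars]} chars: #{result[:text]}" }
```

Feeding one comment at a time keeps each extraction result tied to a single person, rather than mixing every name on the page together.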

So between SemanticProxy for pages of raw content and OpenCalais for specific chunks of text, I expect to simplify the process of adding new data to HighTechCville.

Written by Eric

February 24, 2009 at 5:48 pm

Posted in Uncategorized

Add HTC to your Firefox list of search engines

Find yourself searching HighTechCville often? Now you can add HTC to your list of search engines in Firefox through the magic of the OpenSearch.org API. Just browse to http://www.hightechcville.com using Firefox, click the small blue arrow by your search bar, and choose “Add HTCFind”:

[Image: open_search]

Add a shortcut tag like “htc” and then type into your location bar “htc: Eric Pugh” to find me!
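
For the curious, browsers discover a search engine like this through a small OpenSearch description document. Here is a minimal sketch of generating one in Ruby. The `opensearch_description` helper and the search URL template are assumptions for illustration; only the “HTCFind” name comes from this post.

```ruby
# Characters that must be escaped inside XML text and attribute values.
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

def escape_xml(text)
  text.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Build a minimal OpenSearch 1.1 description document.
# {searchTerms} in the template is replaced by the browser with the query.
def opensearch_description(short_name:, description:, search_template:)
  <<~XML
    <?xml version="1.0" encoding="UTF-8"?>
    <OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
      <ShortName>#{escape_xml(short_name)}</ShortName>
      <Description>#{escape_xml(description)}</Description>
      <Url type="text/html" template="#{escape_xml(search_template)}"/>
    </OpenSearchDescription>
  XML
end

xml = opensearch_description(
  short_name: "HTCFind",
  description: "Search HighTechCville",
  # Hypothetical search URL; the site's real query path is not given here.
  search_template: "http://www.hightechcville.com/search?q={searchTerms}"
)
puts xml
```

A page advertises the document with a `<link rel="search" type="application/opensearchdescription+xml">` tag in its head, which is what makes the “Add” option show up in the browser's search bar.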

Written by Eric

February 4, 2009 at 8:07 pm

Posted in Uncategorized

People and Org Listing pages should be faster..

I’ve gone down the slippery path of caching the listing pages for people and organizations, so they should render much faster (well, except for the first person to hit them 😉 ).

I’ve also tuned a bit when we include the SIMILE Timeline JavaScript, as that always seemed to load slowly. It now only comes up on pages that need it, like the browse recent events page.

If you see out-of-date information or any other oddness, please drop me a line!

Written by Eric

November 5, 2008 at 6:19 pm

Posted in Uncategorized

Now you can download contact information!

I’m rolling out the ability to download contact information for people and organizations into your address book, whether that is AddressBook.app on the Mac or Outlook on the PC. Through the magic of a really neat Technorati service, the hCard information that is microformatted for a person or an organization is converted to the standard vCard format used by most address book applications.
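
To give a feel for the target format, here is a hand-rolled sketch that renders a vCard 3.0 string from a hash of contact fields. This is not what the Technorati service does internally; the `to_vcard` helper, its field names, and the sample contact are all made up for illustration.

```ruby
# Escape backslashes, commas, semicolons, and newlines in a vCard value.
def escape_vcard(value)
  value.to_s.gsub(/([\\,;])/) { "\\#{$1}" }.gsub("\n") { "\\n" }
end

# Render a minimal vCard 3.0; optional fields are skipped when absent.
def to_vcard(contact)
  given = escape_vcard(contact.fetch(:given))
  family = escape_vcard(contact.fetch(:family))

  lines = ["BEGIN:VCARD", "VERSION:3.0"]
  lines << "N:#{family};#{given};;;"
  lines << "FN:#{given} #{family}"
  lines << "ORG:#{escape_vcard(contact[:org])}" if contact[:org]
  lines << "EMAIL;TYPE=INTERNET:#{escape_vcard(contact[:email])}" if contact[:email]
  lines << "URL:#{escape_vcard(contact[:url])}" if contact[:url]
  lines << "END:VCARD"

  # vCard lines are terminated with CRLF.
  lines.join("\r\n") + "\r\n"
end

# Made-up sample contact.
card = to_vcard(
  given: "Jane",
  family: "Doe",
  org: "Example, Inc.",
  email: "jane@example.com"
)
puts card
```

Saving the output with a `.vcf` extension is usually enough for an address book application to offer to import it.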

Also, I’ve taken advantage of the Yahoo Geocoder service so that when we look up a lat and lng for mapping an organization based on a free form address, we also now parse out the street, zip, state, and country and store them! Hopefully this will lead to cleaner information!

Written by Eric

November 3, 2008 at 6:19 pm

Posted in Uncategorized

HTC, now in a faster version!

One of the bits of feedback I got at the Neon Guild presentation a couple of weeks ago was that the site was kinda slow and the uptime rate was pretty bad! I’ll blame this on it being mostly a research project, but now that I’ve shared it with the Guild, I realized I’d better look into this.

A couple of changes, from big to little, have been made:

  1. Reducing the number of SQL queries needed to generate a page. The common pages used to require up to a couple hundred SQL queries to get all the data; now it’s a handful.
  2. Caching the Blog section on the homepage. The RSS feed for this blog was pulled into the homepage every time someone visited. This was obviously inefficient and added to how slow the site was. I am now caching the content and using the ETag header from the RSS feed to see if I need to update it. By the way, a lot of credit for making this visible goes to New Relic’s Rails Performance Monitor tool.
  3. Background jobs are now more “backgroundy” and shouldn’t take up so many resources. They are also running more reliably, and I am able to monitor them through the Job Log interface. You too can monitor them if you join the site!
  4. And a little change: on an organization page we would query Yahoo for the GPS coordinates of Charlottesville every time. I realized that since Charlottesville isn’t likely to be moving, barring a major Ike- or Katrina-sized hurricane, I could probably hard code the coordinates to 38.032125, -78.477519.
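
The ETag idea from the list above can be sketched as follows. This is an illustration under assumptions, not the site's actual code: `FeedCache` and its `fetcher:` parameter are hypothetical names, and the fetcher is injectable so the logic can be tried without a network connection.

```ruby
require "net/http"
require "uri"

# Keeps the last feed body and revalidates it using the ETag header.
class FeedCache
  # fetcher: any callable taking (url, etag) and returning [status, etag, body].
  def initialize(url, fetcher: FeedCache.method(:http_fetch))
    @url = url
    @fetcher = fetcher
    @etag = nil
    @body = nil
  end

  # Returns the feed body, downloading it again only when it has changed.
  def body
    status, etag, fresh_body = @fetcher.call(@url, @etag)
    if status == 200
      @etag = etag
      @body = fresh_body
    end
    # On 304 Not Modified (or an error status) the cached copy is kept.
    @body
  end

  # Real HTTP fetcher: sends If-None-Match when an ETag is already known.
  def self.http_fetch(url, etag)
    uri = URI(url)
    request = Net::HTTP::Get.new(uri)
    request["If-None-Match"] = etag if etag
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
      http.request(request)
    end
    [response.code.to_i, response["ETag"], response.body]
  end
end

# Fake fetcher standing in for the blog's RSS feed.
calls = []
fake = lambda do |_url, etag|
  calls << etag
  etag == "v1" ? [304, "v1", nil] : [200, "v1", "<rss>...</rss>"]
end

cache = FeedCache.new("http://example.com/feed", fetcher: fake)
cache.body # first call downloads the full feed
cache.body # second call sends the ETag and gets 304 back
```

This version still makes one small conditional request per call; a real setup might also limit how often it revalidates.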

Written by Eric

September 25, 2008 at 8:30 am

Posted in Uncategorized

How many ways can you spell Charlottesville? At least 5.

I recently went to the search tags page to see how many companies are based in Charlottesville. I typed in “Char” and was surprised to see my AJAXy search box offer 5 suggestions for “Char”. At first I thought it was a bug, but then I realized that there were indeed multiple misspellings of Charlottesville: Charlotesville, Charlotsville, Charlottesvile, and Charlotteville.

What is ironic is that this data was loaded from the data compiled as part of CBIC’s survey of high tech business in the Charlottesville area! Clearly whoever was supplying that data was hand entering it into the spreadsheet.

To help deal with bad tags like this, I have moved up to the main navigation a “Browse Tags” link that shows you all the tags, and reformatted the tag pages. By the time you read this, the various tags that should be part of Charlottesville should be!
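
Near-duplicate tags like these could also be flagged automatically. Here is a minimal sketch using Levenshtein edit distance; this is a swapped-in technique for illustration, not how the site cleans tags, and the `near_matches` helper, the distance threshold of 2, and the extra “Richmond” tag are assumptions.

```ruby
# Classic dynamic-programming Levenshtein distance between two strings.
def edit_distance(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost].min
    end
    previous = current
  end
  previous.last
end

# Tags that are close to, but not identical to, the canonical spelling.
def near_matches(canonical, tags, max_distance: 2)
  tags.select do |tag|
    distance = edit_distance(canonical.downcase, tag.downcase)
    distance.positive? && distance <= max_distance
  end
end

tags = %w[Charlottesville Charlotesville Charlotsville Charlottesvile Charlotteville Richmond]
suspects = near_matches("Charlottesville", tags)
puts suspects
```

A small threshold keeps unrelated tags out of the list, though a human should still confirm each merge.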

Please feel free to log in to the site with your OpenID and start cleaning some tags up!

Written by Eric

September 23, 2008 at 12:19 pm

Posted in Uncategorized