I’m back!

Earlier, I described the OpenCalais web service.

The ecosystem of web services

The NASA Earth Observatory Glossary defines an ecosystem as “any natural unit or entity including living and non-living parts that interact to produce a stable system through cyclic exchange of materials” [NASA]. The concept can be applied to Internet-based applications that function as information-consuming or information-producing “organisms” and that depend on one another through the exchange of information.

The IBM web site, in turn, defines “web services” as “self-contained, modular, distributed, dynamic applications that can be described, published, located, or invoked over the network to create products, processes, and supply chains.”

As discrete, possibly autonomous “organisms” in an Internet-based information ecosystem, web services-enabled applications expose data and service end points in several ways, including Really Simple Syndication (RSS) feeds and web service Application Programming Interfaces (APIs) built on Simple Object Access Protocol (SOAP), XML Remote Procedure Call (XML-RPC) or REpresentational State Transfer (REST). Besides embedding data in XML when responding to data or process requests, an increasing number of web service applications also provide responses in JavaScript Object Notation (JSON). OpenCalais and Alchemy respond to API requests using the Resource Description Framework (RDF), a semantic web data model, commonly serialized as XML, that structures data as triples (subject, predicate, object). Both also perform named entity disambiguation by linking to external knowledge bases (e.g., CIA Factbook, Wikipedia, Freebase). These web service applications may even provide machine learning-based services such as natural language processing (specifically named entity extraction and concept annotation), language detection and translation, and text classification. Tools that enable semantic processing of content (not just classification) can potentially expose richer knowledge-based content embedded in unstructured data such as news about outbreaks and disasters.
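To make the triple idea concrete, here is a minimal sketch in Python of how (subject, predicate, object) statements can be stored and queried. It is not the actual output format of OpenCalais or Alchemy: the `Triple` type, the `match` helper and every identifier in `TRIPLES` are hypothetical names made up for illustration.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical identifiers, invented for this example only.
TRIPLES = [
    Triple("report:1", "mentions", "place:Orissa"),
    Triple("report:1", "mentions", "condition:undiagnosed-deaths"),
    Triple("place:Orissa", "locatedIn", "place:India"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that agree with every field that is not None."""
    return [
        t
        for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Everything that report:1 is said to mention.
mentioned = [t.obj for t in match(TRIPLES, subject="report:1", predicate="mentions")]
```

Because every statement has the same three-part shape, one pattern-matching function is enough to ask "what does this report mention?" or "where is this place located?", which is part of what makes triples convenient for linking entities to external knowledge bases.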


EpiSPIDER now uses the OpenCalais natural language processing (NLP) web service from Thomson Reuters to annotate specific entities (medical condition and location) found in news reports. Named entity recognition and coreference resolution are classical NLP challenges, and OpenCalais exposes an NLP application programming interface (API) whose algorithms perform both on free text from news sources.

Location and medical condition entities are the two areas of interest for EpiSPIDER, and “outsourcing” the extraction of critical data from unstructured information takes advantage of the emerging, bottom-up, service-oriented architecture on the web.

Although OpenCalais performs on almost any text thrown at it, the following pitfalls were observed for named entity recognition:

  1. Using OpenCalais (OC), EpiSPIDER extracted the location entity Buffalo, Indiana, US from the ProMED Mail report with the title “PRO/AH/EDR> Undiagnosed deaths, buffalo – India (Orissa): RFI”, which is about buffalo (the animal) in India.
  2. OC did not “disambiguate” between Atlanta, New York, United States and Atlanta, GA, United States.
  3. OC georeferencing assigned Norway lat/long values that are located in the Russian heartland.
  4. For some unknown reason, OC identified West Virginia as part of South Korea.
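Glitches like the Norway one above can be caught with a simple sanity check on the consumer side. The following is a minimal sketch, not something EpiSPIDER is stated to do: it compares a returned lat/long pair against a rough bounding box for the named country. The `BBox` type, the `plausible` function, the box values for Norway (approximate, for illustration only) and the sample coordinates are all assumptions.

```python
from typing import Dict, NamedTuple, Optional


class BBox(NamedTuple):
    """Latitude/longitude bounding box in decimal degrees."""
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float


# Rough, illustrative box for mainland Norway; not authoritative values.
COUNTRY_BOXES: Dict[str, BBox] = {
    "Norway": BBox(57.0, 72.0, 4.0, 32.0),
}


def plausible(
    country: str,
    lat: float,
    lon: float,
    boxes: Dict[str, BBox] = COUNTRY_BOXES,
) -> Optional[bool]:
    """True if (lat, lon) falls inside the country's box, False if it
    falls outside, None if no box is known for that country."""
    box = boxes.get(country)
    if box is None:
        return None
    return (
        box.min_lat <= lat <= box.max_lat
        and box.min_lon <= lon <= box.max_lon
    )


# Hypothetical annotations: (country, lat, lon).
annotations = [
    ("Norway", 59.9, 10.75),  # near Oslo
    ("Norway", 60.0, 100.0),  # a point in central Russia
]

# Keep only the annotations whose coordinates contradict the country name.
suspect = [a for a in annotations if plausible(*a) is False]
```

A bounding box is deliberately crude: it cannot tell neighbouring countries apart near a border, and it needs extra care for countries that straddle the 180° meridian. It is still enough to flag coordinates that land thousands of kilometres away from the country they are attached to.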

In spite of these rare glitches, the OpenCalais web service heralds exciting times, as computing and network power bring down the barriers to an emerging service-oriented architecture on the web.
