Language understanding remains one of AI’s grand challenges

[A version of this post appears on the O’Reilly Radar.] The O’Reilly Data Show Podcast: David Ferrucci on the evolution of AI systems for language understanding. Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS. In […]


Online Text Mining Tools & WebServices

Currently, many projects at the GeoVista Center (at Pennsylvania State University) deal with unstructured textual data and require parsing space, time, and entity information from textual documents. While trying to compare one of our text-mining tools with other projects, I stumbled across many freely available text-mining tools. However, comparing them directly wouldn’t have made sense, as they offer different levels of functionality. Hence, I […]


Efficient Textual Similarity Across Millions of Web Queries

Computing textual similarity (such as the Jaccard similarity coefficient) between millions of search queries can be an arduous task. The main challenge is the number of pairs that one needs to consider; a relatively small dataset containing ten thousand queries leads to more than 49 million possible query pairs (10,000 × 9,999 / 2 = 49,995,000). Based on the paper by Vernica et al., I show […]
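To make the pair-count arithmetic concrete, here is a minimal Python sketch of the naive all-pairs approach. It is an illustration only, not the method from the Vernica et al. paper: the whitespace tokenization, the function names `jaccard` and `similar_pairs`, and the sample queries are all assumptions made for this example.

```python
from itertools import combinations
from math import comb


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two queries.

    Assumption: queries are lowercased and split on whitespace.
    """
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # convention chosen here: two empty queries are identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def similar_pairs(queries, threshold):
    """Naive all-pairs comparison: examines n * (n - 1) / 2 pairs."""
    results = []
    for q1, q2 in combinations(queries, 2):
        score = jaccard(q1, q2)
        if score >= threshold:
            results.append((q1, q2, score))
    return results


# Number of unordered pairs among ten thousand queries.
pairs_for_10k = comb(10_000, 2)

# Hypothetical sample queries, for illustration only.
sample = ["cheap flights to paris", "cheap flights paris", "weather in boston"]
matches = similar_pairs(sample, threshold=0.5)
```

Because the naive loop grows quadratically with the number of queries, it quickly becomes impractical at the scale of millions of queries, which is the motivation for more efficient approaches.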


Google Web History: A Gold Mine of Personal Information (Part I)

In 1945, Vannevar Bush proposed the idea of the Memex (Memory Extender), a machine that can capture all the nuggets of information that we ever come across. Even by today’s standards the idea of the Memex sounds profound, but, thanks to Google and Facebook, it is now very close to becoming a reality. In 2005, Google released “Web History”, a […]


From search to distributed computing to large-scale information extraction

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science. February 2016 marks the 10th anniversary of Hadoop — at a point in time when many IT organizations actively use Hadoop, and/or one of the open source, big data projects that originated after, and in some […]
