Using Random Forests and Wordclouds to Visualize Feature Importance in Document Classification

What characteristics do the works of famous authors have that make them unique? This post uses ensemble methods and wordclouds to explore just that. Project Gutenberg offers a large number of freely available works from many famous authors. The dataset for this post consists of books, taken from Project Gutenberg, written by each of the […]



30 Years of Devotion to Natural Language Processing Based on Concepts

Recently jiqizhixin.com interviewed Mr. Qiang Dong, chief scientist of Beijing YuZhi Language Understanding Technology Co. Ltd. Dong gave a detailed presentation of the company's NLP technology and demoed its YuZhi NLU platform. With HowNet, a well-known common-sense knowledge base, as its basic resource, the YuZhi NLU Platform conducts its unique semantic analysis based on concepts rather than […]



Applying Multinomial Naive Bayes to NLP Problems: A Practical Explanation

1. Introduction Naive Bayes is a family of algorithms that apply Bayes' theorem under a strong (naive) assumption, namely that every feature is independent of the others, in order to predict the category of a given sample. They are probabilistic classifiers, so they calculate the probability of each category using Bayes' theorem, and the category with the […]

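The idea in the excerpt above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the post: the function names `train_nb` and `predict_nb`, the toy sentences, the whitespace tokenizer, and the add-one (Laplace) smoothing parameter `alpha` are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Count documents per class and words per class (hypothetical helper)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()  # assumption: simple whitespace tokenizer
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(text, class_counts, word_counts, vocab, alpha=1.0):
    """Return the most probable class and the log-score of every class."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        # Log prior: fraction of training documents in this class.
        log_prob = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Naive assumption: each word contributes independently.
            # alpha is the smoothing constant (assumed add-one smoothing).
            count = word_counts[label][token]
            log_prob += math.log(
                (count + alpha) / (total_words + alpha * len(vocab))
            )
        scores[label] = log_prob
    return max(scores, key=scores.get), scores


# Toy training data, invented for illustration only.
docs = [
    "a great game",
    "the election was over",
    "very clean match",
    "a clean but forgettable game",
    "it was a close election",
]
labels = ["sports", "politics", "sports", "sports", "politics"]

model = train_nb(docs, labels)
label, scores = predict_nb("a very close game", *model)
print(label)
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together; the class with the highest log-score is the prediction.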


Downloading more than 20 years of The New York Times

Articles for the period from 1987 to present are available without subscription. Their copyright notice is web scraping friendly: “… you may download material from The New York Times on the Web (one machine readable copy and one print copy per page) for your personal, noncommercial use only.” Why waste the opportunity to download these […]



A Sneak Peek of Our Upcoming “AI Tech Report”

Synced has recently begun putting together a research report on the technologies powered by Artificial Intelligence (AI) to identify their development paths. You may sign up to receive the report as soon as the rollout starts. The Background In the report, the concept of Artificial Intelligence (AI) will be presented together with a brief history […]



RNN, LSTM in TensorFlow for NLP in Python

We covered RNN for MNIST data, and it is actually even more suitable for NLP projects. You can find more details in the book Python Deep Learning by Valentino Zocca, Gianmario Spacagna, and Daniel Slater. from __future__ import print_function, division # -*- coding: utf-8 -*- ###“War and Peace” contains more than 500,000 words, making it the perfect ###candidate […]



Deep Learning Paper Sparks Online Feud!

Feature image created by Jannoon028 – Freepik.com. Researchers Yoav Goldberg and Yann LeCun face off on Natural Language Processing. Social media is humanity’s new intellectual battlefield. Sports fans, social justice warriors, and even the President of the United States tweet, take to discussion boards or make memes to mock, preach, thrust and parry against […]



Word2vec: Google’s New Leap Forward on the Vectorized Representation of Words

Introduction Word2vec is an open source tool developed by a group of Google researchers led by Tomas Mikolov in 2013. It describes several efficient ways to represent words as M-dimensional real vectors, also known as word embeddings, which are of great importance in many natural language processing applications. Word2vec also expresses the quality of the […]



Preprocess in Python-Scale

We use data from the sklearn library, and the IDE is Sublime Text 3. Most of the code comes from the book: https://www.goodreads.com/book/show/32439431-introduction-to-machine-learning-with-python?from_search=true ###Preprocess methods ###The StandardScaler in scikit-learn ensures that for each feature, the mean is zero, ###and the variance is one, bringing all features to the same magnitude. However, ###this scaling does not ensure […]

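The scaling described in the excerpt above (zero mean and unit variance per feature) can be sketched without scikit-learn. This is an illustrative stand-in rather than the post's code: the helper `standardize`, the toy data, and the choice to map a constant column to zeros are assumptions, and the sketch uses the population standard deviation.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Rescale one feature column to mean 0 and variance 1 (hypothetical helper)."""
    mean = fmean(column)
    std = pstdev(column)  # population standard deviation
    if std == 0:
        # Assumption: a constant feature carries no information, so map it to 0.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


# Toy data, invented for illustration: rows are samples, columns are features
# on very different scales.
data = [
    [1.0, 200.0],
    [2.0, 400.0],
    [3.0, 600.0],
]

columns = list(zip(*data))                          # one tuple per feature
scaled_columns = [standardize(col) for col in columns]
scaled = [list(row) for row in zip(*scaled_columns)]  # back to row layout

for row in scaled:
    print(row)
```

After scaling, both features lie on the same scale even though the raw second feature is two hundred times larger than the first.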


Language understanding remains one of AI’s grand challenges

[A version of this post appears on the O’Reilly Radar.] The O’Reilly Data Show Podcast: David Ferrucci on the evolution of AI systems for language understanding. In […]
