In this post I skim over some basic NLP processes, including text tokenisation, part-of-speech tagging and named entity recognition, and in particular how these can be achieved in the Python programming language through the Natural Language Toolkit (NLTK) package developed by Steven Bird and his associates.
NLP is a broad field spanning Linguistics, Computer Science and Engineering, to mention a few. NLP, used here to mean Natural Language Processing, shouldn’t be confused with Neuro-linguistic Programming. NLP as we mean it here is the field concerned with the interaction between machines and human languages; it seeks to make language accessible to machines and to make human-machine interaction much easier. The field has come a long way, but there remain many NLP problems that are yet to be solved. The ultimate goal of the field may be considered to be true A.I. (i.e. a truly intelligent machine possessing every human cognitive ability, in this case the L.A.D., the Language Acquisition Device, and thus an ability to acquire, use, comprehend and learn language).
NLP is at work every day of our modern lives. Everything from search engines and online translation services such as Bing and Google Translate to chatbots and text auto-completion on our phones is NLP driven.
Basic Text Analysis with NLTK
What follows is a presentation of some very basic NLP procedures with NLTK, an NLP package for the Python programming language. Although it is not the only package available for NLP (other popular packages include Stanford’s CoreNLP and Apache OpenNLP, both written in Java), it has the advantage of simplicity afforded by the nature of Python. Amongst the numerous high-level programming languages in existence, Python proves to be one of the least complicated. It is very true to the essence of such languages: they are more human-readable and portable than assembly and machine languages, at the cost of some speed.
- Word Tokenisation
- Sentence Tokenisation
- Part of Speech Tagging
- Named Entity Recognition
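To give a feel for what the first two items in this list involve, here is a rough sketch of word and sentence tokenisation using only Python's standard `re` module. This is not how NLTK does it: the function names `word_tokenise` and `sentence_tokenise` and the regular expressions are my own illustrative assumptions, and NLTK's real entry points for these tasks are `nltk.word_tokenize` and `nltk.sent_tokenize`.

```python
import re


def word_tokenise(text):
    """Split text into word and punctuation tokens (naive, hypothetical helper)."""
    # \w+(?:'\w+)? matches a word, optionally with an internal apostrophe ("isn't");
    # [^\w\s] matches any single punctuation character.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


def sentence_tokenise(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace (naive)."""
    # The lookbehind keeps the closing punctuation attached to its sentence.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


sample = "NLTK is a Python package. It was built for teaching NLP!"
print(sentence_tokenise(sample))
print(word_tokenise("It isn't hard."))
```

A splitter this simple breaks on abbreviations such as "Dr. Smith", which it would wrongly cut into two sentences. That kind of edge case is a good reason to lean on a trained tokeniser such as the Punkt model behind NLTK's `sent_tokenize` rather than on hand-written patterns.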