
Winter Internship Report

Part of Speech Tagging on Bhojpuri Dataset

By

Rashika Pandey

NIT Mizoram

Under the Guidance of

Dr. A. K. Singh

Department of Computer Science & Engineering

INDIAN INSTITUTE OF TECHNOLOGY (BANARAS HINDU UNIVERSITY)

VARANASI – 221005

Tagging

Descriptors attached to tokens are called tags, and the automatic assignment of
these descriptors to the given tokens is called tagging.

POS Tagging

The process of assigning one of the parts of speech to a given word is called
Parts Of Speech tagging, commonly referred to as POS tagging. Parts of speech
include nouns, verbs, adverbs, adjectives, pronouns, conjunctions, and their
sub-categories.

POS Tagger

A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text and
assigns a part of speech, such as noun, verb, or adjective, to each word (and
other tokens). It draws on different kinds of information, such as
dictionaries, lexicons, and rules. A dictionary lists the category or
categories of each word, and a word may belong to more than one category. For
example, "run" is both a noun and a verb. To resolve this ambiguity, taggers
use probabilistic information.
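
As a minimal sketch of how probabilistic information can resolve this kind of
ambiguity, the Python snippet below picks the tag seen most often for each word
in a small tagged sample. The sample sentences, the tag names, and the helper
names train_most_frequent_tag and tag_words are assumptions made for
illustration; they are not taken from the Bhojpuri dataset.

```python
from collections import Counter, defaultdict

# Toy tagged sample (assumed for illustration): "run" occurs as VERB and as NOUN.
TAGGED_SAMPLE = [
    [("they", "PRON"), ("run", "VERB"), ("fast", "ADV")],
    [("we", "PRON"), ("run", "VERB"), ("daily", "ADV")],
    [("a", "DET"), ("long", "ADJ"), ("run", "NOUN")],
]


def train_most_frequent_tag(tagged_sentences):
    """Count how often each word carries each tag and keep the most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}


def tag_words(words, lexicon, default_tag="NOUN"):
    """Tag each word with its most frequent tag; unknown words get default_tag."""
    return [(word, lexicon.get(word, default_tag)) for word in words]


lexicon = train_most_frequent_tag(TAGGED_SAMPLE)
print(tag_words(["they", "run"], lexicon))
```

This baseline ignores context, so it would tag "run" as a verb even in "a long
run". Taggers that also look at neighbouring tags can avoid that mistake.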

There are mainly two types of taggers:

Rule-based taggers use hand-written rules to resolve tag ambiguity.

Stochastic taggers are either HMM-based, choosing the tag sequence that
maximizes the product of word likelihood and tag sequence probability, or
cue-based, using decision trees or maximum entropy models to combine
probabilistic features.
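
The HMM-based criterion can be written as a formula. In the usual bigram
formulation (the notation is an assumption, not taken from this report),
w_1 ... w_n are the words of the sentence, t_1 ... t_n are candidate tags, and
t_0 is a start-of-sentence marker:

```latex
\hat{t}_{1:n} = \arg\max_{t_{1:n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```

Here P(w_i | t_i) is the word likelihood and P(t_i | t_{i-1}) is the tag
sequence (transition) probability.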

HMM

A Hidden Markov Model (HMM) is a statistical Markov model in which the system
being modeled is assumed to be a Markov process with unobserved (i.e. hidden)
states.

In simpler Markov models, the state is directly visible to the observer, and
therefore the state transition probabilities are the only parameters. In a
hidden Markov model, the state is not directly visible, but the output, which
depends on the state, is visible. Each state has a probability distribution
over the possible output tokens. Therefore, the sequence of tokens generated by
an HMM gives some information about the sequence of states. In POS tagging, the
hidden states are the tags and the visible outputs are the words.

The adjective "hidden" refers to the state sequence through which the model
passes, not to the parameters of the model; the model is still referred to as a
hidden Markov model even if these parameters are known exactly.

HMMs involve counting cases (such as from the Brown Corpus) and making a table
of the probabilities of certain sequences. For example, once you have seen an
article such as "the", perhaps the next word is a noun 40% of the time, an
adjective 40%, and a number 20%. More advanced ("higher-order") HMMs learn the
probabilities not only of pairs but of triples or even larger sequences. When
several ambiguous words occur together, the possibilities multiply. However, it
is easy to enumerate every combination and to assign a relative probability to
each one by multiplying together the probabilities of each choice in turn. The
combination with the highest probability is then chosen.
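
The counting and enumeration steps can be sketched in Python as follows. This
is a minimal illustration under stated assumptions, not the tagger used in this
work: the toy corpus, the start marker "<s>", and the function names are all
hypothetical, and no smoothing is applied, so unseen words or tag pairs get
probability zero.

```python
from collections import Counter, defaultdict
from itertools import product

# Toy tagged corpus (assumed for illustration only).
CORPUS = [
    [("the", "DET"), ("dogs", "NOUN"), ("run", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("ends", "VERB")],
    [("dogs", "NOUN"), ("run", "VERB")],
]
START = "<s>"  # hypothetical start-of-sentence marker


def count_tables(corpus):
    """Build tag-to-tag (transition) and tag-to-word (emission) count tables."""
    transitions = defaultdict(Counter)
    emissions = defaultdict(Counter)
    for sentence in corpus:
        previous = START
        for word, tag in sentence:
            transitions[previous][tag] += 1
            emissions[tag][word] += 1
            previous = tag
    return transitions, emissions


def relative_frequency(table, given, outcome):
    """Relative-frequency estimate of P(outcome | given); 0.0 if given is unseen."""
    total = sum(table[given].values())
    return table[given][outcome] / total if total else 0.0


def best_tag_sequence(words, transitions, emissions):
    """Enumerate every tag combination and keep the most probable one."""
    tagset = sorted(emissions)
    best_tags, best_score = None, 0.0
    for tags in product(tagset, repeat=len(words)):
        score, previous = 1.0, START
        for word, tag in zip(words, tags):
            # Multiply the probability of each choice in turn.
            score *= relative_frequency(transitions, previous, tag)
            score *= relative_frequency(emissions, tag, word)
            previous = tag
        if score > best_score:
            best_tags, best_score = tags, score
    return best_tags, best_score


transitions, emissions = count_tables(CORPUS)
tags, score = best_tag_sequence(["the", "run"], transitions, emissions)
print(list(zip(["the", "run"], tags)), score)
```

In this toy corpus "run" is more often a verb, yet after "the" the sequence
DET NOUN wins because a verb never follows a determiner in the counts.
Enumerating every combination grows exponentially with sentence length, so
practical HMM taggers usually find the same best sequence with the Viterbi
algorithm (dynamic programming).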

Accuracy achieved

A European group developed CLAWS, a tagging program that used this HMM
approach, and achieved accuracy in the 93–95% range.

Many machine learning methods have also been applied to the problem of POS
tagging. Methods such as SVM, maximum entropy classifier, perceptron, and
nearest-neighbor have all been tried, and most can achieve accuracy above 95%.

A more recent development is the use of the structure regularization method for
part-of-speech tagging, achieving 97.36% on the standard benchmark dataset.
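
Accuracy figures for taggers are typically token-level: the share of tokens
whose predicted tag matches the reference (gold) tag. A minimal sketch in
Python is given below; the function name tagging_accuracy and the sample tag
sequences are assumptions for illustration and are not results from this
report.

```python
def tagging_accuracy(gold_sentences, predicted_sentences):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        if len(gold) != len(predicted):
            raise ValueError("gold and predicted sentences must have the same length")
        for gold_tag, predicted_tag in zip(gold, predicted):
            correct += gold_tag == predicted_tag  # True counts as 1
            total += 1
    if total == 0:
        raise ValueError("no tokens to evaluate")
    return correct / total


# Illustrative tag sequences (made up for this example).
gold = [["DET", "NOUN", "VERB"], ["PRON", "VERB"]]
predicted = [["DET", "NOUN", "NOUN"], ["PRON", "VERB"]]
print(f"{tagging_accuracy(gold, predicted):.2%}")
```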


Natural Language Processing (NLP) with Python

NLTK is a leading platform for building Python programs to work with human
language data. It provides easy-to-use interfaces to over 50 corpora and lexical
resources such as WordNet, along with a suite of text processing libraries for
classification, tokenization, stemming, tagging, parsing, and semantic
reasoning, wrappers for industrial-strength NLP libraries, and an active
discussion forum. Using NLTK, we can tokenize and tag text, identify named
entities, and display a parse tree.

Tagset

A set of tags from which the tagger chooses the relevant tag for each word.

Data set

A merged Bhojpuri dataset consisting of Bhojpuri sentences and the
corresponding labels (tags) for their words.

ACKNOWLEDGEMENT

I express my profound and sincere gratitude to my mentor, Dr. Anil Kumar
Singh, for providing me with all the facilities and support during my winter
internship period.

I would like to thank my guide, Mr. Rajesh Mundotiya, for his valuable
guidance, constructive criticism, and encouragement, and for providing the
requisite guidelines that enabled me to complete my work with utmost dedication
and efficiency.

Lastly, I would like to acknowledge my family and friends for their
motivation, inspiration, and support in boosting my morale, without which my
efforts would have been in vain.

References

1.     Daniel Jurafsky and James H. Martin. Speech and Language Processing (3rd Edition).

2.     A Brief Introduction of POS Tagging. Markov Models.

3.     Stanford Log-linear Part-Of-Speech Tagger. https://nlp.stanford.edu/software/tagger.shtml

4.     NLTK Documentation. http://www.nltk.org/