arsenal.nlp package

Submodules

arsenal.nlp.annotation module

exception arsenal.nlp.annotation.ParseError

Bases: Exception

Custom exception class used by this module.

class arsenal.nlp.annotation.Span(label, begins, ends)

Bases: object

begins
ends
label
arsenal.nlp.annotation.bio2span(seq, tagger=None, include_O=True)
arsenal.nlp.annotation.bracket2bio(x)

Generate BIO-token pairs from bracket-style annotation. Note: the text is split on spaces, so word splitting should already be done.

>>> x = bracket2bio("[TITLE Cat in the Hat][AUTHOR Dr. Seuss]")
>>> list(x)                                  #doctest:+NORMALIZE_WHITESPACE
[('B-TITLE', 'Cat'), ('I-TITLE', 'in'), ('I-TITLE', 'the'),
 ('I-TITLE', 'Hat'), ('B-AUTHOR', 'Dr.'), ('I-AUTHOR', 'Seuss')]
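
The conversion illustrated by the doctest above can be sketched in a few lines. This is a minimal illustration, not the code in arsenal: the regular expression and the name bracket_to_bio_sketch are assumptions, and text outside brackets is simply ignored here, which may differ from what bracket2bio does.

```python
import re

# Hypothetical pattern for "[LABEL tok tok ...]"; not taken from arsenal's source.
BRACKET = re.compile(r"\[(\S+)\s+([^\]]*)\]")

def bracket_to_bio_sketch(text):
    """Yield (tag, token) pairs for each bracketed segment in ``text``."""
    for match in BRACKET.finditer(text):
        label, body = match.group(1), match.group(2)
        # Tokens are separated by whitespace only, so the input must
        # already be word-split.
        for i, token in enumerate(body.split()):
            prefix = "B" if i == 0 else "I"
            yield (prefix + "-" + label, token)

pairs = list(bracket_to_bio_sketch("[TITLE Cat in the Hat][AUTHOR Dr. Seuss]"))
```

The first token of each segment receives the B- prefix and every later token the I- prefix, which is the convention visible in the doctest output.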
arsenal.nlp.annotation.extract_contiguous(s, labeler=None)
>>> list(extract_contiguous(""))
[]
>>> list(extract_contiguous("AAAA"))
[Span(label='A', begins=0, ends=4)]
>>> list(extract_contiguous("AABBC"))
[Span(label='A', begins=0, ends=2), Span(label='B', begins=2, ends=4), Span(label='C', begins=4, ends=5)]
>>> list(extract_contiguous("AABBB"))
[Span(label='A', begins=0, ends=2), Span(label='B', begins=2, ends=5)]
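
The doctests above show runs of equal labels being collapsed into half-open spans. One way to get the same results is itertools.groupby, as in the sketch below. The names Span (a namedtuple stand-in for the class documented above) and contiguous_spans_sketch are assumptions for illustration, and the labeler argument is not modelled.

```python
from collections import namedtuple
from itertools import groupby

# Stand-in for arsenal.nlp.annotation.Span; assumed to behave like a namedtuple.
Span = namedtuple("Span", ["label", "begins", "ends"])

def contiguous_spans_sketch(seq):
    """Yield one Span per maximal run of equal items; ``ends`` is exclusive."""
    start = 0
    for label, run in groupby(seq):
        length = sum(1 for _ in run)  # consume the run to count its items
        yield Span(label, start, start + length)
        start += length

spans = list(contiguous_spans_sketch("AABBC"))
```

Because ends is exclusive, span.ends - span.begins gives the run length and adjacent spans share a boundary index.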
arsenal.nlp.annotation.fromSGML(f, linegrouper='\n', bioencoding=False)
arsenal.nlp.annotation.line_groups(text, pattern)

Very simple function for breaking up text into groups based on a single pattern.

>>> list(line_groups("a BB c d BB", "BB"))
['a', 'c d']
arsenal.nlp.annotation.sgml2bio(x)
>>> sgml2bio('<title>Cat in the Hat</title><author>Dr. Seuss</author>')
[('B-title', 'Cat'), ('I-title', 'in'), ('I-title', 'the'), ('I-title', 'Hat'), ('B-author', 'Dr.'), ('I-author', 'Seuss')]
arsenal.nlp.annotation.sgml2segmentation(x, lexer=re.compile('\\S+'))
>>> sgml2segmentation('<title>Cat in the Hat</title><author>Dr. Seuss</author>')
[('title', ['Cat', 'in', 'the', 'Hat']), ('author', ['Dr.', 'Seuss'])]
arsenal.nlp.annotation.sgml2seq(x)
>>> sgml2seq('<title>Cat in the Hat</title><author>Dr. Seuss</author>')
[('title', 'Cat'), ('title', 'in'), ('title', 'the'), ('title', 'Hat'), ('author', 'Dr.'), ('author', 'Seuss')]
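
The three SGML doctests above return the same annotation in three shapes: a segmentation, a flat label sequence, and a BIO sequence. The sketch below shows how the latter two could be derived from the first. It works on the already-parsed segmentation and does no SGML parsing; the function names are hypothetical, and whether arsenal derives its outputs this way is an assumption.

```python
def segmentation_to_seq(segments):
    """Flatten [(label, tokens), ...] into [(label, token), ...]."""
    return [(label, token) for label, tokens in segments for token in tokens]

def segmentation_to_bio(segments):
    """Flatten a segmentation, marking segment starts with B- and the rest with I-."""
    pairs = []
    for label, tokens in segments:
        for i, token in enumerate(tokens):
            prefix = "B-" if i == 0 else "I-"
            pairs.append((prefix + label, token))
    return pairs

# Segmentation copied from the sgml2segmentation doctest.
segments = [('title', ['Cat', 'in', 'the', 'Hat']), ('author', ['Dr.', 'Seuss'])]
seq = segmentation_to_seq(segments)
bio = segmentation_to_bio(segments)
```

The flat sequence loses segment boundaries when two adjacent segments share a label; the B- prefix in the BIO form is what preserves them.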

arsenal.nlp.evaluation module

Evaluation methods common in NLP and information extraction.

TODO: Have a look at https://github.com/nschneid/pyutil/blob/master/chunkeval.py, which appears to offer richer evaluation methods.
class arsenal.nlp.evaluation.F1(confusion_matrix=False)

Bases: object

add_relevant(label, instance)
add_retrieved(label, instance)
confusion()
latex()
report(instance, prediction, target)
scores(verbose=True)
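
The method names add_relevant and add_retrieved suggest the usual set-based definition of precision, recall, and F1 per label. The sketch below illustrates that definition only. F1Sketch is a hypothetical class, its scores method returns a dict rather than printing, and none of it is taken from arsenal's implementation.

```python
from collections import defaultdict

class F1Sketch:
    """Hypothetical per-label precision/recall/F1 tracker (not arsenal's F1)."""

    def __init__(self):
        self.relevant = defaultdict(set)   # gold instances per label
        self.retrieved = defaultdict(set)  # predicted instances per label

    def add_relevant(self, label, instance):
        self.relevant[label].add(instance)

    def add_retrieved(self, label, instance):
        self.retrieved[label].add(instance)

    def scores(self):
        """Return {label: (precision, recall, f1)}; empty denominators give 0.0."""
        results = {}
        for label in sorted(set(self.relevant) | set(self.retrieved)):
            rel = self.relevant.get(label, set())
            ret = self.retrieved.get(label, set())
            hits = len(rel & ret)
            precision = hits / len(ret) if ret else 0.0
            recall = hits / len(rel) if rel else 0.0
            total = precision + recall
            f1 = 2 * precision * recall / total if total else 0.0
            results[label] = (precision, recall, f1)
        return results

f = F1Sketch()
for instance in (1, 2, 3):
    f.add_relevant("title", instance)
for instance in (2, 3, 4, 5):
    f.add_retrieved("title", instance)
precision, recall, f1 = f.scores()["title"]
```

In this example two of the four retrieved instances are relevant, so precision is 1/2, recall is 2/3, and F1 is their harmonic mean, 4/7.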
arsenal.nlp.evaluation.plot_confusion(y_true, y_pred, alphabet, normalized=False)

Draw a confusion matrix.

Options:

  • normalized: Normalize the confusion matrix by row (i.e., by the number of samples in each class)
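
Row normalization, as described for the normalized option, can be illustrated without any plotting. In the sketch below the helper names are hypothetical, and the layout (rows for true labels, columns for predicted labels, both ordered by alphabet) is an assumption about how plot_confusion arranges the matrix.

```python
def confusion_counts(y_true, y_pred, alphabet):
    """Count (true, predicted) pairs; rows are true labels, columns predictions."""
    index = {label: i for i, label in enumerate(alphabet)}
    matrix = [[0] * len(alphabet) for _ in alphabet]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix

def normalize_rows(matrix):
    """Divide each row by its sum; rows with no samples are left as zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized

counts = confusion_counts(["A", "A", "A", "B"], ["A", "B", "A", "B"], ["A", "B"])
rates = normalize_rows(counts)
```

After normalization each non-empty row sums to 1, so a cell reads as the fraction of that true class assigned to each predicted class, which keeps frequent classes from dominating the plot.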

Module contents