arsenal.nlp package
Subpackages
Submodules
arsenal.nlp.annotation module
-
exception arsenal.nlp.annotation.ParseError
Bases: Exception
Custom exception class used by this module.
-
arsenal.nlp.annotation.bracket2bio(x)
Generate BIO-token pairs from bracket-style annotation. Note: the text is split on spaces, so word splitting should already be done.
>>> x = bracket2bio("[TITLE Cat in the Hat][AUTHOR Dr. Seuss]")
>>> list(x)  #doctest:+NORMALIZE_WHITESPACE
[('B-TITLE', 'Cat'), ('I-TITLE', 'in'), ('I-TITLE', 'the'), ('I-TITLE', 'Hat'), ('B-AUTHOR', 'Dr.'), ('I-AUTHOR', 'Seuss')]
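To make the bracket-to-BIO conversion concrete, here is a minimal sketch that reproduces the doctest above. It is not the library's implementation: the name bracket2bio_sketch is hypothetical, the regular expression is an assumption about the bracket format, and any text outside brackets is simply ignored.

```python
import re


def bracket2bio_sketch(text):
    """Yield (tag, token) pairs from "[LABEL tok tok ...]" annotation.

    Hypothetical helper for illustration; not part of arsenal.
    Assumes labels contain no whitespace and brackets are not nested.
    """
    for match in re.finditer(r"\[(\S+)\s+([^\]]*)\]", text):
        label, body = match.group(1), match.group(2)
        # Tokens are obtained by splitting on whitespace, so the input
        # must already be word-split.
        for i, token in enumerate(body.split()):
            prefix = "B" if i == 0 else "I"  # first token begins the chunk
            yield (f"{prefix}-{label}", token)


pairs = list(bracket2bio_sketch("[TITLE Cat in the Hat][AUTHOR Dr. Seuss]"))
```

The first token of each bracketed group gets the B- prefix and every following token gets I-, which is the usual BIO convention.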
-
arsenal.nlp.annotation.extract_contiguous(s, labeler=None)
>>> list(extract_contiguous(""))
[]
>>> list(extract_contiguous("AAAA"))
[Span(label='A', begins=0, ends=4)]
>>> list(extract_contiguous("AABBC"))
[Span(label='A', begins=0, ends=2), Span(label='B', begins=2, ends=4), Span(label='C', begins=4, ends=5)]
>>> list(extract_contiguous("AABBB"))
[Span(label='A', begins=0, ends=2), Span(label='B', begins=2, ends=5)]
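The doctests suggest that runs of identical labels are collapsed into half-open spans. A minimal sketch of that behavior, assuming a Span namedtuple with the fields shown in the output, could look like the following. The name extract_contiguous_sketch is hypothetical, and the labeler parameter is omitted because its semantics are not documented here.

```python
from collections import namedtuple
from itertools import groupby

# Field names taken from the doctest output; the real class may differ.
Span = namedtuple("Span", ["label", "begins", "ends"])


def extract_contiguous_sketch(sequence):
    """Yield one Span per run of equal adjacent items.

    Hypothetical helper for illustration; not part of arsenal.
    Spans are half-open: `begins` is inclusive, `ends` is exclusive.
    """
    position = 0
    for label, run in groupby(sequence):
        length = sum(1 for _ in run)  # consume the run to count its items
        yield Span(label, position, position + length)
        position += length


spans = list(extract_contiguous_sketch("AABBC"))
```

Because the spans are half-open, the end of one span equals the start of the next, and the final `ends` value equals the length of the input.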
-
arsenal.nlp.annotation.line_groups(text, pattern)
Very simple function for breaking up text into groups based on a single pattern.
>>> list(line_groups("a BB c d BB", "BB"))
['a', 'c d']
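One way to get the grouping shown in the doctest is to split on the pattern and discard empty pieces. The sketch below does that under the assumption that the pattern is a regular expression without capture groups. The name line_groups_sketch is hypothetical, and the library may handle line boundaries differently.

```python
import re


def line_groups_sketch(text, pattern):
    """Yield the non-empty chunks of `text` separated by `pattern`.

    Hypothetical helper for illustration; not part of arsenal.
    Assumes `pattern` has no capture groups (re.split would otherwise
    include the captured separators in its result).
    """
    for chunk in re.split(pattern, text):
        chunk = chunk.strip()  # drop whitespace left around the separator
        if chunk:              # skip the empty piece after a trailing separator
            yield chunk


groups = list(line_groups_sketch("a BB c d BB", "BB"))
```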
-
arsenal.nlp.annotation.sgml2bio(x)
>>> sgml2bio('<title>Cat in the Hat</title><author>Dr. Seuss</author>')
[('B-title', 'Cat'), ('I-title', 'in'), ('I-title', 'the'), ('I-title', 'Hat'), ('B-author', 'Dr.'), ('I-author', 'Seuss')]
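The SGML variant can be sketched the same way as the bracket variant, with the tag name used as the label. This is an illustration only, assuming simple non-nested tags without attributes. The name sgml2bio_sketch is hypothetical, and text outside tags is ignored.

```python
import re


def sgml2bio_sketch(text):
    """Return (tag, token) pairs from "<label>tok tok</label>" annotation.

    Hypothetical helper for illustration; not part of arsenal.
    Assumes tags are not nested and carry no attributes.
    """
    pairs = []
    # \1 requires the closing tag to match the opening tag's name.
    for match in re.finditer(r"<(\w+)>(.*?)</\1>", text, flags=re.DOTALL):
        label, body = match.group(1), match.group(2)
        for i, token in enumerate(body.split()):
            prefix = "B" if i == 0 else "I"  # first token begins the chunk
            pairs.append((f"{prefix}-{label}", token))
    return pairs


result = sgml2bio_sketch("<title>Cat in the Hat</title><author>Dr. Seuss</author>")
```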
arsenal.nlp.evaluation module
Evaluation methods common in NLP and information extraction.
- TODO: Have a look at https://github.com/nschneid/pyutil/blob/master/chunkeval.py; there appear to be richer evaluation methods.