Inside–outside–beginning (tagging)

The IOB format (short for inside, outside, beginning), also commonly referred to as the BIO format, is a common format for tagging tokens in a chunking task in computational linguistics (e.g. named-entity recognition).[1] It was presented by Ramshaw and Marcus in their 1995 paper "Text Chunking using Transformation-Based Learning".[2] The I- prefix before a tag indicates that the token is inside a chunk. An O tag indicates that a token belongs to no chunk. The B- prefix before a tag indicates that the token begins a chunk that immediately follows another chunk, with no O tags between them. B- is used only in that case: when a chunk comes after an O tag, its first token takes the I- prefix.

Another widely used format is IOB2, which is the same as the IOB format except that the B- tag marks the beginning of every chunk (i.e. all chunks start with the B- tag).

A readable introduction to entity tagging is given in Bob Carpenter's blog post, "Coding Chunkers as Taggers".[3]

An example with IOB format:

Alex I-PER
is O
going O
to O
Los I-LOC
Angeles I-LOC
in O
California I-LOC

Notice how "Alex", "Los" and "California", although they are the first tokens of their chunks, have the "I-" prefix.
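The tagging rules above can be read back into chunks mechanically. The following is a minimal sketch, not a reference implementation; the function name `iob1_chunks` is hypothetical, and it assumes each tag is either "O" or a prefix and type joined by a hyphen.

```python
def iob1_chunks(tokens, tags):
    """Return (type, text) pairs for the chunks encoded by original-IOB tags."""
    chunks = []
    start, current = None, None  # start index and type of the open chunk
    for i, tag in enumerate(tags):
        prefix, _, kind = tag.partition("-")
        # A chunk continues only on an I- tag of the same type as the open chunk.
        continues = prefix == "I" and kind == current
        if current is not None and not continues:
            chunks.append((current, " ".join(tokens[start:i])))
            start, current = None, None
        # An I- or B- tag with no open chunk starts a new one.
        if prefix in ("I", "B") and current is None:
            start, current = i, kind
    if current is not None:
        chunks.append((current, " ".join(tokens[start:])))
    return chunks


tokens = ["Alex", "is", "going", "to", "Los", "Angeles", "in", "California"]
tags = ["I-PER", "O", "O", "O", "I-LOC", "I-LOC", "O", "I-LOC"]
print(iob1_chunks(tokens, tags))
# [('PER', 'Alex'), ('LOC', 'Los Angeles'), ('LOC', 'California')]
```

Because a B- tag always closes the open chunk and starts a new one, the same function also handles adjacent chunks.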

The same example after filtering out stop words:

Alex I-PER
going O
Los I-LOC
Angeles I-LOC
California B-LOC

Notice how "California" now has the "B-" prefix, because it immediately follows another LOC chunk.
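The effect of removing the intervening O tokens can be reproduced with a small encoder. This is a sketch under stated assumptions: the name `encode_iob1` and the (type, start, end) span representation are hypothetical, spans are assumed not to overlap, and B- is emitted only when a chunk directly follows a chunk of the same type (the convention commonly used in shared-task data; the rule as worded above does not mention the type).

```python
def encode_iob1(n_tokens, spans):
    """Tag n_tokens positions from (type, start, end) spans, end exclusive."""
    tags = ["O"] * n_tokens
    prev_end, prev_kind = None, None
    for kind, start, end in sorted(spans, key=lambda s: s[1]):
        # B- is needed only when this chunk touches the previous chunk
        # of the same type; otherwise the boundary is already unambiguous.
        adjacent = prev_end == start and prev_kind == kind
        tags[start] = ("B-" if adjacent else "I-") + kind
        for i in range(start + 1, end):
            tags[i] = "I-" + kind
        prev_end, prev_kind = end, kind
    return tags


# Full sentence: "in" separates the two LOC chunks, so no B- is needed.
print(encode_iob1(8, [("PER", 0, 1), ("LOC", 4, 6), ("LOC", 7, 8)]))
# ['I-PER', 'O', 'O', 'O', 'I-LOC', 'I-LOC', 'O', 'I-LOC']

# Stop words removed: the LOC chunks are now adjacent.
print(encode_iob1(5, [("PER", 0, 1), ("LOC", 2, 4), ("LOC", 4, 5)]))
# ['I-PER', 'O', 'I-LOC', 'I-LOC', 'B-LOC']
```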

The same example with IOB2 format (with tagging unaffected by stop word filtering):

Alex B-PER
is O
going O
to O
Los B-LOC
Angeles I-LOC
in O
California B-LOC
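Converting from IOB to IOB2 only requires finding the first token of each chunk. A possible sketch follows; the function name `iob1_to_iob2` is hypothetical, and an I- tag whose type differs from the previous token's type is assumed to start a new chunk.

```python
def iob1_to_iob2(tags):
    """Rewrite original-IOB tags so that every chunk starts with B-."""
    out = []
    prev_kind = None  # type of the chunk the previous token was in, if any
    for tag in tags:
        prefix, _, kind = tag.partition("-")
        if prefix == "O":
            out.append("O")
            prev_kind = None
            continue
        # A chunk starts at any B- tag, or at an I- tag that does not
        # continue a chunk of the same type.
        starts_chunk = prefix == "B" or kind != prev_kind
        out.append(("B-" if starts_chunk else "I-") + kind)
        prev_kind = kind
    return out


iob = ["I-PER", "O", "O", "O", "I-LOC", "I-LOC", "O", "I-LOC"]
print(iob1_to_iob2(iob))
# ['B-PER', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'B-LOC']
```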

Related tagging schemes sometimes include "START/END: This consists of the tags B, E, I, S or O where S is used to represent a chunk containing a single token. Chunks of length greater than or equal to two always start with the B tag and end with the E tag."[4]

Other tagging schemes include BIOES and BILOU, where 'E' (Ending) and 'L' (Last) mark the final token of a multi-token chunk, and 'S' (Single) and 'U' (Unit) mark a chunk consisting of a single token.

An example with BIOES format:

Alex S-PER
is O
going O
with O
Marty B-PER
A. I-PER
Rick E-PER
to O
Los B-LOC
Angeles E-LOC
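BIOES tags can be derived from IOB2 tags by looking one token ahead to see whether the chunk continues. The sketch below is illustrative; the function name `iob2_to_bioes` is hypothetical, and the input is assumed to be well-formed IOB2.

```python
def iob2_to_bioes(tags):
    """Convert well-formed IOB2 tags to BIOES tags."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, _, kind = tag.partition("-")
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The chunk continues only if the next tag is I- of the same type.
        continues = next_tag == "I-" + kind
        if prefix == "B":
            out.append(("B-" if continues else "S-") + kind)
        else:
            out.append(("I-" if continues else "E-") + kind)
    return out


iob2 = ["B-PER", "O", "O", "O", "B-PER", "I-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(iob2_to_bioes(iob2))
# ['S-PER', 'O', 'O', 'O', 'B-PER', 'I-PER', 'E-PER', 'O', 'B-LOC', 'E-LOC']
```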

Drawbacks

IOB syntax does not permit nesting, so (unless extended) it cannot represent even very simple phenomena such as sentence boundaries (which are not trivial to locate reliably), the scope of parenthetical expressions in sentences, grammatical structures, or nested named entities such as "University of Wisconsin Dept. of Computer Science". It also leaves no place for metadata such as an identifier for the particular sample or the confidence level of the NER assignment, which are commonplace in NLP systems.

Because of these limitations, data must often be converted out of IOB format, or projects must create custom extensions, which has led to a large number of not-quite-interoperable "IOB-like" formats.

The space and "O" (meaning "not in any chunk") convey no information and could simply be omitted. The same is true for putting the "type" suffix on "I-" or "E-" markers as in some variants of "BIOES"; and for marking both "I" and "E" (if you have begun and not ended you are "in", and if you are "in", you have begun and not ended). Some other formats deploy verbosity to improve readability and/or error-checking, but no such benefits appear to come to IOB in exchange for its verbosity.

IOB's "one token per line" layout depends on the tokenization used, even though tokenization is not standardized in NLP and the details of tokenization need not be entangled with the representation of named entities. "11/31/2019" could be anywhere from one to five tokens in different systems, but the named entity is the same. Some systems even permit whitespace within tokens, which collides with the use of space as a delimiter, narrowing the applicability of IOB and motivating more extensions. "Space" might or might not include tab, multiple spaces, hard spaces, and so on: differences which are difficult to detect when proofreading.
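One way a reader of such files might cope with the delimiter ambiguity is to treat only the last whitespace-separated field as the tag. This is a sketch of that single design choice, not a standard; the function name `read_tagged_lines` is hypothetical, and it assumes tags themselves never contain whitespace.

```python
def read_tagged_lines(text):
    """Parse 'token tag' lines, allowing whitespace inside the token."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split once from the right on any run of whitespace (spaces,
        # tabs, hard spaces), so only the final field is read as the tag.
        fields = line.rsplit(maxsplit=1)
        if len(fields) != 2:
            raise ValueError(f"expected a token and a tag: {line!r}")
        pairs.append((fields[0], fields[1]))
    return pairs


sample = "Alex\tI-PER\nis  O\nLos Angeles I-LOC\n"
print(read_tagged_lines(sample))
# [('Alex', 'I-PER'), ('is', 'O'), ('Los Angeles', 'I-LOC')]
```

The trade-off is that a line whose token was meant to be "Los" followed by extra columns would be misread, which is one reason the variants are not fully interoperable.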

More powerful formats (most obviously XML and JSON) can handle far more diverse annotations, have less variation between implementations, and are often shorter and more readable as well. For example:

<PER>Alex</PER> is going with <PER>Marty A. Rick</PER> to <LOC>Los Angeles</LOC>

XML takes 80 bytes to express the same information as the 91-byte BIOES version shown above, or the 79-byte IOB version. However, it can also easily support sentence boundaries, part-of-speech annotations, and other features commonly needed in NLP systems. Breaking all tokens in particular places is not strictly part of the NER task; but even if every token were tagged (like "<T>is</T>") the total would grow only to 141 bytes:

<PER><T>Alex</T></PER><T>is</T><T>going</T><T>with</T><PER><T>Marty</T><T>A.</T><T>Rick</T></PER><T>to</T><LOC><T>Los</T><T>Angeles</T></LOC>
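The inline markup can be generated from BIOES tags directly, since B/S open a chunk and E/S close it. The following is a minimal sketch assuming well-formed BIOES input and single spaces between tokens; the function name `bioes_to_xml` is hypothetical, and `escape` from the Python standard library is used so that characters such as "&" in a token do not break the markup.

```python
from xml.sax.saxutils import escape


def bioes_to_xml(tokens, tags):
    """Render BIOES-tagged tokens as inline XML, one element per chunk."""
    pieces = []
    for token, tag in zip(tokens, tags):
        prefix, _, kind = tag.partition("-")
        text = escape(token)
        if prefix in ("B", "S"):   # chunk opens on this token
            text = f"<{kind}>" + text
        if prefix in ("E", "S"):   # chunk closes on this token
            text = text + f"</{kind}>"
        pieces.append(text)
    return " ".join(pieces)


tokens = ["Alex", "is", "going", "with", "Marty", "A.", "Rick",
          "to", "Los", "Angeles"]
tags = ["S-PER", "O", "O", "O", "B-PER", "I-PER", "E-PER",
        "O", "B-LOC", "E-LOC"]
xml = bioes_to_xml(tokens, tags)
print(xml)
print(len(xml.encode("utf-8")))  # 80
```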

References

  1. "Entity Recognition". Archived from the original on 30 September 2013. Retrieved 22 August 2013.
  2. Ramshaw and Marcus (1995). "Text Chunking using Transformation-Based Learning". arXiv:cmp-lg/9505040.
  3. Bob Carpenter (2009). "Coding Chunkers as Taggers: IO, BIO, BMEWO, and BMEWO+". Archived from the original on 5 August 2017.
  4. http://cs229.stanford.edu/proj2005/KrishnanGanapathy-NamedEntityRecognition.pdf Archived 11 July 2019 at the Wayback Machine.