{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
\n", "lifestyle governs mobile choice faster better or funkier hardware alone is not going to help phone firms sell more handsets\n", "research suggests. instead phone firms keen to get more out of their customers should not just be pushing the technology \n", "for its own sake. consumers are far more interested in how handsets fit in with their lifestyle than they are in screen size \n", "onboard memory or the chip inside shows an in-depth study by handset maker ericsson. \n", "historically in the industry there has been too much focus on using technology \n", "said dr michael bjorn senior advisor on mobile media at ericsson s consumer and enterprise lab.\n", "\n", "\n", "Download the data bbcnews.zip and place it in ../data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", "Wolff, currently a journalist in Argentina, played with Del Bosque in the final years of the seventies in Real Madrid.\n", "\n", "[PER Wolff ] , currently a journalist in [LOC Argentina ] , played with [PER Del Bosque ] in the final years of the seventies in \n", "[ORG Real Madrid ] .\n", "\n", "\n", "NER involves two sub-tasks: identifying the boundaries of such expressions (the open and close brackets) and labelling the expressions\n", "(with tags such as PER, LOC or ORG). This sequence labelling task is reduced to a per-token classification task using the BIO encoding of the data,\n", "where B- marks the first token of an entity, I- a token inside an entity, and O a token outside any entity:\n", "
\n", " Wolff B-PER\n", " , O\n", " currently O\n", " a O\n", " journalist O\n", " in O\n", " Argentina B-LOC\n", " , O\n", " played O\n", " with O\n", " Del B-PER\n", " Bosque I-PER\n", " in O\n", " the O\n", " final O\n", " years O\n", " of O\n", " the O\n", " seventies O\n", " in O\n", " Real B-ORG\n", " Madrid I-ORG\n", " . O\n", "\n", "\n", "\n", "
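The BIO encoding shown above can be produced mechanically from entity spans. The following is a minimal sketch, not the assignment's reference code: the helper name spans_to_bio and the (start, end, label) span format with an exclusive end index are assumptions made for illustration.

```python
def spans_to_bio(tokens, spans):
    '''Convert entity spans to one BIO tag per token.

    spans is a list of (start, end, label) tuples, end exclusive
    (a hypothetical format chosen for this sketch).
    '''
    tags = ['O'] * len(tokens)
    for start, end, label in spans:
        tags[start] = 'B-' + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = 'I-' + label          # remaining tokens of the entity
    return tags


tokens = ('Wolff , currently a journalist in Argentina , played with '
          'Del Bosque in the final years of the seventies in Real Madrid .').split()
spans = [(0, 1, 'PER'), (6, 7, 'LOC'), (10, 12, 'PER'), (20, 22, 'ORG')]
bio_tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, bio_tags):
    print(token, tag)
```

Each token now carries exactly one class label, so a standard classifier can be trained on a per-token basis.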
\n", " Wolff NP B-PER\n", " , , O\n", " currently RB O\n", " a AT O\n", " journalist NN O\n", " in IN O\n", " Argentina NP B-LOC\n", " , , O\n", " played VBD O\n", " with IN O\n", " Del NP B-PER\n", " Bosque NP I-PER\n", " in IN O\n", " the AT O\n", " final JJ O\n", " years NNS O\n", " of IN O\n", " the AT O\n", " seventies NNS O\n", " in IN O\n", " Real NP B-ORG\n", " Madrid NP I-ORG\n", " . . O\n", "\n", "Classes\n", "1 B-PER\n", "2 I-PER\n", "3 B-LOC\n", "4 I-LOC\n", "5 B-ORG\n", "6 I-ORG\n", "7 O\n", "\n", "Feature WORD-FORM\n", "1 Wolff\n", "2 ,\n", "3 currently\n", "4 a\n", "5 journalist\n", "6 in\n", "7 Argentina\n", "8 played\n", "9 with\n", "10 Del\n", "11 Bosque\n", "12 the\n", "13 final\n", "14 years\n", "15 of\n", "16 seventies\n", "17 Real\n", "18 Madrid\n", "19 .\n", "\n", "Feature POS\n", "20 NP\n", "21 ,\n", "22 RB\n", "23 AT\n", "24 NN\n", "25 IN\n", "26 VBD\n", "27 JJ\n", "28 NNS\n", "29 .\n", "\n", "Feature ORT\n", "30 number\n", "31 contains-digit\n", "32 contains-hyphen\n", "33 capitalized\n", "34 all-capitals\n", "35 URL\n", "36 punctuation\n", "37 regular\n", "\n", "Feature Prefix1\n", "38 W\n", "39 ,\n", "40 c\n", "41 a\n", "42 j\n", "43 i\n", "44 A\n", "45 p\n", "46 w\n", "47 D\n", "48 B\n", "49 t\n", "50 f\n", "51 y\n", "52 o\n", "53 s\n", "54 R\n", "55 M\n", "56 .\n", "\n", "\n", "Given this encoding, we can compute the vector representing the first word \"Wolff NP B-PER\" as:\n", "
\n", "# Class: B-PER=1\n", "# Word-form: Wolff=1\n", "# POS: NP=20\n", "# ORT: capitalized=33\n", "# Prefix1: W=38\n", "1 1:1 20:1 33:1 38:1\n", "\n", "\n", "When you encode the test dataset, some word-forms will be unknown (not seen in the training dataset).\n", "You should therefore reserve a special \"unknown\" value for each feature where unseen values are expected.\n", "\n", "\n", "Instead of writing the encoding code by hand as explained above, use the Scikit-learn vectorizer and pipeline library.\n", "General information on feature extraction for text data in Scikit-Learn is \n", "in the Scikit-Learn documentation.\n", "Refer to the DictVectorizer for this specific task.\n", "Hashing vs. DictVectorizer\n", "also provides useful background." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
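The one-hot encoding described in the previous cell can be sketched with DictVectorizer, which assigns one column to each (feature, value) pair seen during fitting. This is only an illustration, assuming scikit-learn is installed: the helper token_features, its feature names and its simplified ORT rules are hypothetical and cover only part of the feature set listed above.

```python
from sklearn.feature_extraction import DictVectorizer


def token_features(word, pos):
    '''Describe one token as a dict of categorical features (hypothetical helper).'''
    if word.isdigit():
        ort = 'number'
    elif word[0].isupper():
        ort = 'capitalized'
    elif not any(ch.isalnum() for ch in word):
        ort = 'punctuation'
    else:
        ort = 'regular'
    return {'word': word, 'pos': pos, 'ort': ort, 'prefix1': word[0]}


# A tiny training sample taken from the example sentence.
train = [('Wolff', 'NP'), (',', ','), ('currently', 'RB'), ('a', 'AT'), ('journalist', 'NN')]
vectorizer = DictVectorizer(sparse=True)
X_train = vectorizer.fit_transform([token_features(w, p) for w, p in train])

# An unseen word-form: its 'word' and 'prefix1' values have no column,
# so only the features already known from training are switched on.
X_test = vectorizer.transform([token_features('Madrid', 'NP')])
print(X_train.shape, X_test.nnz)
```

DictVectorizer silently ignores feature values it did not see during fitting, so an unknown word-form simply activates no word column. If you prefer an explicit unknown value, map rare or unseen words to a reserved token before vectorizing.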