A New Approach for Named Entity Recognition
Adlı Varlık Tanımada Yeni Bir Yaklaşım

Burak Ertopçu1, Ali Buğra Kanburoğlu1, Ozan Topsakal1, Onur Açıkgöz1, Ali Tunca Gürkan1, Berke Özenç1, İlker Çam1, Begüm Avar2, Gökhan Ercan1, Olcay Taner Yıldız1

burakertopcu@isik.edu.tr, bugrakanburoglu@isik.edu.tr, ozantopsakal@isik.edu.tr, onuracikgoz@isik.edu.tr, aligurkan@isik.edu.tr, berkeozenc@isik.edu.tr, ilkercam@isik.edu.tr, begumavar@boun.edu.tr, gokhanercan@isik.edu.tr, olcaytaner@isikun.edu.tr

1Department of Computer Engineering, Işık University, İstanbul, Turkey
2Department of Linguistics, Boğaziçi University, İstanbul, Turkey
Özetçe—Birçok cümlenin ilk bakışta insanlara verdiği izlenimler vardır. Bu izlenimler, bir takım varlıklar sayesinde insanlara okudukları şeylerin ne hakkında olduğunu kavramada yardımcı olur. Bunun Doğal Dil Geliştirme'deki karşılığı Adlı Varlık Tanımadır (AVT). AVT algoritmaları temel olarak cümlede kişi, yer, zaman, tarih, saat veya para gibi birçok varlığı tarayabilir. Bu işlemlerdeki en büyük problem, bir ismin kişiye mi yoksa bir yere mi veya organizasyona mı, veya bir sayının tarihe mi yoksa paraya mı ait olduğu gibi sorulardır. Bu çalışmada Adlı Varlık Tanıma algoritmalarına yeni bir model tasarladık. Bunu oluşturduğumuz veri setinde çalıştırıp elde edilen sonuçları diğer modellerinkilerle karşılaştırdık. Sonuç olarak, ürettiğimiz 1400 cümlelik veri setinde kayda değer bulgular elde ettik.

Anahtar Sözcükler—Doğal Dil Geliştirme, Veri Çekme, Adlı Varlık Tanıma
Abstract—Many sentences create certain impressions on people. These impressions help the reader gain an insight into what the sentence is about via certain entities. In NLP, this process corresponds to Named Entity Recognition (NER). NER algorithms can trace many entities in a sentence, such as person, location, date, time or money. One of the major problems in these operations is the confusion about whether a word denotes the name of a person, a location or an organisation, or whether a number stands for a date, a time or money. In this study, we design a new model for NER algorithms. We train this model on our predefined dataset and compare the results with those of other models. In the end, we obtain considerable results on a dataset containing 1400 sentences.
Keywords—Natural Language Processing, Information Extraction, Named Entity Recognition

I. INTRODUCTION
Natural language processing (NLP) is a field in computer science which investigates how a computer can understand and manipulate language in its written and spoken forms. Language ability is one of the most important characteristics of human beings, one that sheds light on the functioning of the human brain. If human language can be modelled in a computer environment, it can be used for advanced and effective communication tasks.

The foundations of NLP lie in a number of disciplines, namely computer and information sciences, linguistics, mathematics, electrical and electronic engineering, artificial intelligence and robotics, and psychology. Some NLP applications include studies such as machine translation, natural language text processing and summarisation, user interfaces, multilingual and cross-language information retrieval (CLIR), speech recognition, artificial intelligence and expert systems.

NLP studies in general focus on figuring out how to resolve the rules of natural languages for machines. Through such resolution, many processes (such as translation, information extraction from structurally irregular texts, question answering and text summarisation) are aimed to be performed automatically by machines.

Named entity recognition (NER) is one of the subtasks in NLP. Using previously existing or published information, NER aims to recognise words such as person, institution or establishment names, place names, time expressions and currencies in a written text. In this study, we create a new model for NER. We train this model on our predefined dataset, which contains 1400 sentences, and compare the results with those of other models.

This paper is organised as follows: We define the NER problem in Section II and give the previous work in Section III. In continuous models, we represent words with continuous vectors, namely word embeddings; a brief introduction to word embeddings is given in Section IV. The details of our dataset and how it is constructed are given in Section V. We give our experiment methodology in Section VI and results in Section VII. Lastly, we conclude in Section VIII.
II. NAMED ENTITY RECOGNITION

Anything that is denoted by a proper name (for instance, a person, a location or an organization) is considered to be a named entity. In addition, named entities also include things like dates, times or money. Here is a sample text with named entities marked:
[ORG Türk Hava Yolları] bu [TIME Pazartesi'den] itibaren [LOC İstanbul] [LOC Ankara] güzergahı için indirimli satışlarını [MONEY 90 TL'den] başlatacağını açıkladı.
This sentence contains several named entities: three words labeled as ORGANIZATION, two words labeled as LOCATION, one word labeled as TIME and two words labeled as MONEY. Table I shows typical generic named entity types.
978-1-5386-0930-9/17/$31.00 ©2017 IEEE
8%0.¶QG,QWHUQDWLRQDO&RQIHUHQFHRQ&RPSXWHU6FLHQFHDQG(QJLQHHULQJ 
Table I. LIST OF NAMED ENTITY TYPES WITH THE KINDS OF ENTITIES THEY BELONG TO

Tag    Sample Categories         Example
PER    people, characters        Atatürk yurdu düşmanlardan kurtardı.
ORG    companies, teams          IMKB günü 60 puan yükselerek kapattı.
LOC    regions, mountains, seas  Ülkemizin başkenti Ankara'dır.
TIME   time expressions          Cuma günü tatil yapacağım.
MONEY  monetary expressions      Geçen gün 3000 TL kazandık.
,Q QDPHG HQWLW\ UHFRJQLWLRQ RQH WULHV WR ¿QG WKH VWULQJV
ZLWKLQDWH[WWKDWFRUUHVSRQGWRSURSHUQDPHVH[FOXGLQJ7,0(
DQG021(<DQG FODVVLI\WKHW\SHRIHQWLW\ GHQRWHGE\WKHVH
VWULQJV7KHSUREOHP LVGLI¿FXOWSDUWO\GXH WRWKHDPELJXLW\LQ
VHQWHQFH VHJPHQWDWLRQ RQH QHHGV WR H[WUDFW ZKLFK ZRUGV
EHORQJ WR D QDPHG HQWLW\ DQG ZKLFK QRW $QRWKHU GLI¿FXOW\
RFFXUV ZKHQ VRPH ZRUG PD\ EH XVHG DV D QDPH RI HLWKHU D
SHUVRQDQ RUJDQL]DWLRQRUDORFDWLRQ)RUH[DPSOH'HQL]PD\
EHXVHGDVWKH QDPH RI D SHUVRQ RU ZLWKLQDFRPSRXQG LW
FDQUHIHUWR D ORFDWLRQ0DUPDUD'HQL]L³0DUPDUD6HD´RU DQ
RUJDQL]DWLRQ'HQL]7DÜVÕPDFÕOÕN³'HQL]7UDQVSRUWDWLRQ´
7KH VWDQGDUG DSSURDFK IRU 1(5 LV D ZRUGE\ZRUG FODV
VL¿FDWLRQZKHUH WKH FODVVL¿HULV WUDLQHG WR ODEHOWKH ZRUGV LQ
WKHWH[W ZLWKWDJVWKDWLQGLFDWHWKHSUHVHQFHRISDUWLFXODUNLQGV
RI QDPHG HQWLWLHV $IWHU JLYLQJ WKH FODVV ODEHOV QDPHG HQWLW\
WDJVWRRXUWUDLQLQJGDWDWKHQH[WVWHSLVWRVHOHFWDJURXSRI
IHDWXUHVWRGLVFULPLQDWHGLIIHUHQWQDPHGHQWLWLHV IRUHDFKLQSXW
ZRUG 7DEOH ,, VKRZV WKH VDPSOH WH[W UHSUHVHQWHG ZLWK WDJ
ODEHOVDQGWKUHH SRVVLEOHIHDWXUHVQDPHO\WKH URRWIRUPRIWKH
ZRUGWKHSDUWRIVSHHFK326WDJRIWKHZRUGDQGDERROHDQ
IHDWXUHIRUFKHFNLQJWKHFDSLWDOFDVH
Table II. NAMED ENTITY TAGGING AS A CLASSIFICATION PROBLEM

Word           Root       POS          Capital  ...  Label
Türk           Türk       Noun         True     ...  ORGANIZATION
Hava           Hava       Noun         True     ...  ORGANIZATION
Yolları        Yol        Noun         True     ...  ORGANIZATION
bu             bu         Pronoun      False    ...  NONE
Pazartesi'den  Pazartesi  Noun         True     ...  TIME
itibaren       itibaren   Adverb       False    ...  NONE
İstanbul       İstanbul   Noun         True     ...  LOCATION
Ankara         Ankara     Noun         True     ...  LOCATION
güzergahı      güzergah   Noun         False    ...  NONE
için           için       Adverb       False    ...  NONE
indirimli      indirimli  Adjective    False    ...  NONE
satışlarını    sat        Noun         False    ...  NONE
90             90         Number       False    ...  MONEY
TL'den         TL         Noun         True     ...  MONEY
başlatacağını  başlat     Noun         False    ...  NONE
açıkladı       açıkla     Verb         False    ...  NONE
.              .          Punctuation  False    ...  NONE
*LYHQVXFKWUDLQLQJGDWDDFODVVL¿HUOLNHQHXUDOQHWZRUNRU
GHFLVLRQ WUHH FDQ EH WUDLQHG WR ODEHO QHZ VHQWHQFHV )LJXUH 
VKRZVWKHRSHUDWLRQ RI VXFKDFODVVL¿HUDWWKHSRLQW ZKHUH WKH
ZRUG $QNDUD LV QH[W WR EH ODEHOHG )RU WKLV FODVVL¿HU WKH
ZLQGRZ VL]H LV  WKDW LV ZH DVVXPH D FRQWH[W ZLQGRZ WKDW
LQFOXGHVWZRSUHFHGLQJZRUGVDQGWZRVXFFHHGLQJZRUGV
[Figure 1. Classifier-based approach to named entity recognition. The tagger slides a context window over the sentence, classifying words as it proceeds. At this point the classifier is attempting to label Ankara. Features derived from the context typically include the words, part-of-speech tags, etc.]

III. PREVIOUS WORK

A. Linguistic Background

While NER is a rather unproblematic task among NLP studies in well-studied languages like English, it faces certain challenges when dealing with a language like Turkish. One of the central foci of the literature on Turkish linguistics has been the complexity of Turkish morphology. Turkish is a textbook example of an agglutinative language, i.e., words in their surface form may contain various morphemes, especially suffixes. The main problem posed by such languages is lexical sparsity [?], which in general can be considered a challenge for earlier steps, such as the Morphological Analysis and Disambiguation tasks, which feed NER.
1RWRQO\WKHPRUSKRORJ\EXWDOVRWKHV\QWDFWLFVWUXFWXUHRI
7XUNLVK LV D IHUWLOH VRXUFH IRU FKDOOHQJLQJ LVVXHV 7XUNLVK LV
W\SLFDOO\WUHDWHGDVDVXEMHFWREMHFWYHUE629ODQJXDJH\HWLW
KDVDUDWKHUIUHHZRUGRUGHU,QRWKHUZRUGVWKHFRQVWLWXHQWVRI
DVHQWHQFHFDQRFFXULQDQ\RUGHUZLWK VOLJKW PRGL¿FDWLRQV LQ
WKHVHQWHQWLDO PHDQLQJ$SDUWLFXODURUGHULVFKRVHQPDLQO\ RQ
SUDJPDWLFJURXQGV>@7KHUHIRUHWKHSRVLWLRQRIDZRUGZLWKLQ
D VHQWHQFH GRHV QRW SURYLGH DQ\ FOXHV DERXW ZKHWKHU LW LV D
1DPHG(QWLW\RUQRW
$VDQDGGLWLRQDOSUREOHPVSHFL¿FWR1(5WKHUHH[LVWPDQ\
SURSHU QRXQV LQ 7XUNLVK ZKLFK DUH GHULYHG IURP FRPPRQ
QDPHV WKURXJK VXI¿[DWLRQ VXFK DV WKH IROORZLQJ QDPHV RI
FLWLHV LQ 7XUNH\ 'HQL]OL GHULYHG IURP GHQL] ³VHD´ FRP
SRXQGLQJVXFKDV%DOÕNHVLUIURPEDOÕNHVLU³¿VKSULVRQHU´RU
]HURGHULYDWLRQ VXFK DV 2UGX IURP RUGX ³DUP\´ ,Q VSHHFK
SURVRGLFFXHVDUHKHOSIXO LQGLVWLQJXLVKLQJSURSHUQRXQVHVSH
FLDOO\SODFHQDPHVIURPFRPPRQ QRXQV GXH WR WKHLU LGLRV\Q
FUDWLFVWUHVVSDWWHUQ>@<HWWKHUHLVQRRUWKRJUDSKLFFRUUHODWHV
WR VWUHVV ,QVWHDG LQ IRUPDO WH[WV DW OHDVW RUWKRJUDSKLF FOXHV
FDQ EH KHOSIXO LQ GLVWLQJXLVKLQJ D SURSHU QRXQ ZKLFK VWDUWV
ZLWKDFDSLWDO OHWWHU IURP D FRPPRQ QRXQ ,QVHQWHQFHLQLWLDO
SRVLWLRQ KRZHYHU DOO ZRUGV EHJLQ ZLWK D FDSLWDO OHWWHU DQG
KHQFHWKLVFOXHLVQRWDYDLODEOH
,Q VKRUW VHYHUDO OLQJXLVWLF IHDWXUHV RI 7XUNLVK VXFK DV
LWV ULFK PRUSKRORJ\ IUHH ZRUGRUGHU DV ZHOO DV GHULYDWLRQ
DV D ZRUGIRUPDWLRQ SURFHVV IUHTXHQWO\ HPSOR\HG LQ IRUPLQJ
SURSHUQDPHV\LHOGSUREOHPVIRU1(5WDVNV
8%0.¶QG,QWHUQDWLRQDO&RQIHUHQFHRQ&RPSXWHU6FLHQFHDQG(QJLQHHULQJ 
B. Computational Background

Research on NER has attracted considerable attention in the literature. However, most of the models are specific to the language of focus. A language-independent method is proposed in [6], a bootstrapping algorithm based on iterative learning; the method relies on word-internal and contextual clues. This work is tested on Turkish, Romanian, English, Greek and Hindi, and it can be considered the first work on Turkish NER.

Turkish-specific studies are relatively scarce when compared to those on languages with a wider global distribution. The first work on Turkish NER [18] presents a study based on information extraction. [5] focuses on conditional random fields using morphological and lexical features; their baseline is influenced by a leading work, but the dataset used is gathered from real natural language data. Taking Turkish NER studies into account, models based on Hidden Markov Models (HMMs) [18], Conditional Random Fields (CRFs) [15] and rule-based approaches [17] are presented on data gathered from news reports. Formal texts like news are written according to certain rules of language: authors of such texts follow the grammatical and orthographical rules of the language in question and thus generate statistically less volatile data. Twitter is widely used in NLP studies as well, yet in tweets there is no need to follow spelling rules, and words (and even letters, in the case of emoticons and some informal abbreviations) can be used in senses different from the usual ones. An experiment-based study on tweets [9] shows the difference between processing formal syntax and processing social media text.
IV. WORD EMBEDDINGS

Traditional representations of words, i.e., one-hot vectors, are based on word-word (W × W) co-occurrence sparse matrices, where W is the number of distinct words in the corpus. Distributed word representations (DRs), i.e., word embeddings, on the other hand, are word-context (W × C) dense matrices, where C ≪ W and C is the number of context dimensions, which are determined by the underlying model assumptions. Dense representations are arguably better at capturing generalized information and more resistant to overfitting, since context vectors represent shared properties of words. DRs are real-valued vectors in which each context can be considered a continuous feature of a word. Due to their ability to represent abstract features of a word, DRs are considered reusable across higher-level tasks with ease, even if they are trained on totally different datasets.

Prediction-based DR models gained much attention after Mikolov et al.'s neural-network-based SkipGram model [11]. The secret behind the prediction-based models is simple: never build a sparse matrix at all. Prediction-based models construct dense matrix representations directly, instead of reducing sparse ones to dense ones. These models are trained like any other supervised learning task, by giving lots of positive and negative samples, without adding any human-supervision costs. The aim of these models is to maximize the probability of each context c, with the same distributional assumptions on word-context co-occurrences as count-based models [2], [10].
V. DATA

Our dataset is collected from the Penn Treebank corpus, and each sentence of this dataset is translated into Turkish [19]. The dataset includes 1400 sentences and [?] words, including punctuation marks. Our data format is shown in Figure 2.

[Figure 2. Data content as train file]
A. Morphological Disambiguation

Turkish is an agglutinative language in which words are formed by attaching derivational and inflectional suffixes to roots. Morphemes added to a word can change its part of speech (for instance, convert a noun to a verb or vice versa, or create adverbs from adjectives). Moreover, during word formation some letters can be changed or undergo deletion. Hence, without determining the lemma of a word from its surface form, based on its intended meaning, it is not possible to identify the word correctly and extract candidate senses from a dictionary.

[Figure 3. Morphological disambiguation tool]

Following the translation, the corpus has been morphologically disambiguated. In that work, human annotators selected the correct morphological parse from the multiple possible analyses returned by the automatic parser (see Figure 3 for the morphological disambiguation tool used). The tag set and morphological representation were adopted from the study in [12]. Each output of the parser comprises the root of the word, its part-of-speech tag and a set of morphemes, each separated with a '+' sign.
B. NER Tagging

In the second stage, we randomly selected 1400 sentences and annotated them manually by using the "NER Annotation Tool" of Işık University. The tool is shown in Figure 4.

[Figure 4. Annotation tool for NER]

The NER tags that are used as class labels in this study are PERSON for persons' names; LOCATION for place names; ORGANIZATION for governmental or civil organizations, firms, associations, etc.; TIME for specific points in time such as dates, years, hours, etc.; MONEY for financial amounts or currencies; and NONE for everything else. The distribution of the data over these categories of class labels is shown in Table III. The annotated dataset and source codes are freely available.¹
7DEOH,,, ',675,%87,212) 7+( '$7$
Label Count
PERSON 606
LOCATION 235
ORGANIZATION 685
MONEY 387
TIME 299
NONE 10982
9, 0(7+2'2/2*<
7KHVHYHQNQRZQFODVVL¿FDWLRQDOJRULWKPVZHXVHDUH
'XPP\ 'HFLGHV EDVHG RQ WKH SULRUFODVV SUREDELOLW\
ZLWKRXW ORRNLQJ DW WKH LQSXW $OO WHVW LQVWDQFHV DUH
DVVLJQHGWRWKH FODVV ZLWK WKHPD[LPXPSULRU
&7KHDUFKHW\SDOGHFLVLRQ WUHH PHWKRG >@
.QQ.1HDUHVW1HLJKERUFODVVL¿FDWLRQDOJRULWKPWKDW
XVHVWKH(XFOLGHDQGLVWDQFH
/S/LQHDUSHUFHSWURQZLWKVRIWPD[RXWSXWVWUDLQHGE\
JUDGLHQWGHVFHQWWR PLQLPL]HFURVVHQWURS\>@
0OS:HOONQRZQPXOWLOD\HUSHUFHSWURQFODVVL¿FDWLRQ
DOJRULWKP>@
1E&ODVVLF1DLYH %D\HVFODVVL¿HUZKHUHHDFKIHDWXUH
LV DVVXPHG WR EH *DXVVLDQ GLVWULEXWHG >@ DQG HDFK
IHDWXUHLVLQGHSHQGHQWIURPRWKHUIHDWXUHV
5I 5DQGRP )RUHVW PHWKRG LPSURYHV EDJJLQJ LGHD
ZLWK UDQGRPL]LQJ IHDWXUHV DW HDFK GHFLVLRQ QRGH>@
DQGFDOOHGWKHVH UDQGRP GHFLVLRQWUHHVDVZHDNOHDUQ
HUV ,Q WKH SUHGLFWLRQ WLPH WKHVH ZHDN OHDUQHUV DUH
FRPELQHGXVLQJFRPPLWWHHEDVHGSURFHGXUHV
A. Discrete Model

1) Features: We used the following features in our discrete models:

- CaseAttribute (C): a discrete attribute for a given word. If the last inflectional group of the word contains case information, the attribute has that case value; otherwise, the attribute has the value null.
- IsCapitalAttribute (IC): a binary attribute. It checks each word in a sentence: if the first character is uppercase, it returns true; otherwise it returns false.
- IsDateAttribute (ID): a binary attribute that checks whether a word is written in date format. If the word is in date format, it returns true; otherwise it returns false.
¹http://haydut.isikun.edu.tr/nlptoolkit.html
- IsFractionAttribute (IF): a binary attribute that checks whether a word is a fractional number. If the word is a fractional number, it returns true; otherwise it returns false.
- IsHonorificAttribute (IH): a binary attribute. It checks each word in a sentence: if the word equals one of the titles "bay", "bayan", "sayın", "dr.", "prof." or "doç.", it returns true; otherwise it returns false.
- IsMoneyAttribute (IM): a binary attribute. It checks whether a word in a sentence denotes a currency: if the word is "doları", "lirası" or "avro", it returns true; otherwise it returns false.
- IsNumAttribute (IN): checks whether the word has the num (number) tag.
- IsOrganizationAttribute (IO): a binary attribute. It checks a given word and, if it equals "corp.", "inc." or "co.", it returns true; otherwise it returns false.
- IsPropAttribute (IP): checks whether the word has the prop (proper name) tag.
- IsRealAttribute (IR): a binary attribute that checks whether a word is a real number. If the word is a real number, it returns true; otherwise it returns false.
- IsTimeAttribute (IT): a binary attribute that checks whether a word is written in time format. If the word is in time format, it returns true; otherwise it returns false.
- MainPosAttribute (MP): a discrete attribute. It returns the main part of speech of the word.
- LastIGContainsPossessiveAttribute (P): a binary attribute. If the last inflectional group of the word contains possessive information, it returns true; otherwise it returns false.
- RootFormAttribute (RF): a discrete attribute. It returns the root form of a given word.
- RootPosAttribute (RP): a discrete attribute. It returns the part of speech of the root form of a given word.
- SurfaceFormAttribute (SF): a discrete attribute. It returns the surface form of the word.
2) Methods: First of all, we annotated 1400 sentences to generate our dataset. The annotation was done manually, and it was a very significant point for the experiment, because we needed an effective dataset to get reliable results. We give the details of the methods (the attributes, the classifier and its parameters) in Table IV.
B. Continuous Model

We used Mikolov et al.'s SkipGram and CBOW models for learning continuous representations of words. For the unsupervised training we used a news corpus consisting of 1 million (1M) unannotated sentences, collected from news articles from newspapers.
8%0.¶QG,QWHUQDWLRQDO&RQIHUHQFHRQ&RPSXWHU6FLHQFHDQG(QJLQHHULQJ 
Table IV. DETAILS OF METHODS USED IN DISCRETE MODEL

Method   Attributes                            Window  Classifier  Parameters                                          Author
METHOD1  IC, ID, IR, IT, MP                    4       Rf          attribute subset size = 4, ensemble size = 20       B.Ö.
METHOD2  IC, ID, IF, IT, MP, RF, SF            1       Mlp         learning rate = 0.1, number of hidden nodes = 30    B.E.
METHOD3  IH, IM, MP, RF, SF                    0       Mlp         learning rate = 0.1, number of hidden nodes = 50    A.T.G.
METHOD4  C, IC, IH, IO, IR, MP, P, RF, RP, SF  2       Lp          learning rate = 0.1                                 A.B.K.
METHOD5  IC, IN, IP, MP, RF, SF                1       Mlp         learning rate = 0.1, number of hidden nodes = 5, 8  O.T.
METHOD6  IC, IP, MP, RF, SF                    1       Mlp         learning rate = 0.1, number of hidden nodes = 10    O.A.
Table V. ERROR RATES OF METHODS USING STRATIFIED 10-FOLD CROSS-VALIDATION

Classifier  DUMMY  METHOD1  METHOD2  METHOD3  METHOD4  METHOD5  METHOD6
Error Rate  14.89  14.71    7.64     12.19    7.65     8.45     9.28
We trained models with the same settings at three separate corpus sizes (100K, 500K, 1M). DR models generally have tens of hyperparameters; below we list only the most important ones, along with the default values we used in our experiments:
- Context window (win): the size of the context window from which the prediction model gathers co-occurrence information. Note that, even though the concepts are similar, this parameter should not be confused with the context window size we used when executing the classifiers. We used [?] as the default setting.
- Dimension (d): the size C of the context dimensions of the vectors in the word-context dense matrices (W × C). We used [?] as the default setting.
- Deleting infrequent words (del): most DR models ignore (delete) infrequent words in the corpus with some thresholding mechanism, assuming that infrequent words are uninformative. We did not ignore infrequent words, due to the nature of the NER task, which requires more samples of infrequent instances such as "Ohio" and "IBM".
8QOLNHWKH '5WH[WSUHSURFHVVLQJFRQYHQWLRQVZHGLGQRW
ORZHUFDVHWRNHQVDQGGLGQRW¿OWHURXWDQ\SXQFWXDWLRQVHJ
FRPPDVGRWVK\SKHQVGXHWR1(5WDVNUHTXLULQJLQFOXVLRQRI
SURSHU ZRUGV DQG SXQFWXDWLRQV 7KHUHIRUH GLVWULEXWLRQ RI
SXQFWXDWLRQVDUHDOVROHDUQWDORQJZLWKZRUGV)RUH[DPSOHDV
H[SHFWHG RXU 0 VHQWHQFHG PRGHO UHWXUQV ³´ DV VSDWLDOO\
FORVHVW WRNHQ WR ³´ $QRWKHU LQWHUHVWLQJ H[DPSOH LV WKDW RXU
PRGHO UHWXUQV WKH WRNHQV ³RUDQÕQGD´ ³JHULOHPLÜV´ ³\]GH´
³´ ³RUDQ´ ZRUGV UHODWHG WR SHUFHQWDJH LQ 7XUNLVK DV WKH
PRVWVSDWLDOO\FORVHVWRQHVWRWKHWRNHQ³´
:H WUDLQHG RXU '5 PRGHOVLQ WZR NLQGV RI ZRUG IRUPV
6XUIDFH )RUP 68 1DWXUDO IRUP RI D ZRUG ZKLFK
DSSHDUHG LQ D WH[W DV LW LV ([ *]HOJ|]O
WXUQDODUJ|oWOHU
5RRW)RUP52 5RRWRIDZRUGXVHG LQ'5WUDLQLQJ
EDVHGRQPRUSKRORJLFDOGLVDPELJXDWLRQ RI HYHU\ VHQ
WHQFHSURYLGHGE\WKH³0RUSKRORJLFDO'LVDPELJXDWLRQ
/LEUDU\´RI,ÜVÕN8QLYHUVLW\ >@ ([ *]HOJ|]WXUQD
J|o
:H XVHG RXU WUDLQHG '5 RXWSXWV : î & WR IHHG RXU
FODVVL¿HUV DV FRQWLQXRXV IHDWXUHV RI ZRUGV ZLWKRXW GRLQJ DQ\
DGGLWLRQDO WDVNVSHFL¿F IHDWXUH KDQGFUDIWLQJ 7KHUHIRUH ZH
IHHGHYHU\ZRUGZLWK&GFRQWLQXRXVIHDWXUHVE\XVLQJ&
dimensional word vectors of our generated representations. If
generated DRs don’t have a word vector for a word instance
requested by the classifier model, we provided zero-vector
in C dimensions to represent a null feature value. DRs are
real valued vectors normalized between -1 and 1. We also
reported missing word requests with each experiment as out-
of-vocabulary (OOV) rate due to the fact that classifier perfor-
mances are highly dependent on OOV rates in experiments.
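The zero-vector fallback for out-of-vocabulary words can be sketched as follows (a minimal sketch with a made-up three-dimensional embedding table):

```python
# Sketch of the continuous feature lookup described above: each word is fed to
# the classifier as its C-dimensional embedding; words missing from the trained
# DRs get a C-dimensional zero vector, and the OOV rate is reported alongside.
C = 3
embeddings = {
    "Ankara": [0.2, -0.5, 0.7],   # made-up vectors for illustration
    "için":   [0.1, 0.3, -0.2],
}

def word_features(word, oov_words):
    vec = embeddings.get(word)
    if vec is None:               # out-of-vocabulary word
        oov_words.append(word)
        return [0.0] * C          # null feature value
    return vec

oov = []
tokens = ["Ankara", "güzergahı", "için"]
features = [word_features(w, oov) for w in tokens]
oov_rate = len(oov) / len(tokens)
print(features[1])        # [0.0, 0.0, 0.0]
print(round(oov_rate, 2)) # 0.33
```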
9,, (;3(5,0(176
$,QWHUDQQRWDWRU$JUHHPHQW
)RU HYDOXDWLRQ RI WKH DQQRWDWHG GDWDVHW ZH XVHG LQWHU
DQQRWDWRUDJUHHPHQW PHDVXUH 7ZR GLIIHUHQW JURXS RI DQQR
WDWRUV DQQRWDWHG VDPH VHQWHQFHV 'XH WR D ODFN RI WLPH ZH
FRXOGRQO\UHDQQRWDWHRIWKH WRWDO RI  VHQWHQFHV DQG
JRWRILQWHUDQQRWDWRUDJUHHPHQW$V DFRPSDULVRQWKH
H[SHFWHG LQWHUDQQRWDWRU DJUHHPHQW DVVXPLQJ WKH DQQRWDWRUV
DQQRWDWHGFRPSOHWHO\UDQGRPO\LV 
B. Discrete Model

We evaluated the proposed approach using stratified 10-fold cross-validation on the data. The error rates obtained from each classifier are shown in Table V.
C. Continuous Model

We fixed the SkipGram (SG) model configuration as the default, with hyperparameters win = [?], d = [?], del = 0, to use as a baseline for all other configurations. We chose this default configuration based on our observation that it performs better on average than all the other configuration options we pre-experimented with. We trained 6 models in total, for the 3 different corpus sizes (100K, 500K, 1M) and the 2 different word forms (surface, root), using the default configuration. We ran our models with the Dummy, Lp, Mlp, Knn, C4.5 (with pruning) and Nb classifiers, using 10-fold cross-validation on the data, with the same window size for all classifiers. Table VI shows the classifier error rates and OOV percentages of our experiments.
2XUFRQWLQXRXVPRGHOH[SHULPHQWVVKRZWKDW
8QVXSHUYLVHG'5 PRGHOV FDQSURYLGH YDOXDEOH LQIRU
PDWLRQ DV IHDWXUHV RI ZRUGV IRU 1(5 FODVVL¿FDWLRQ
WDVNV DQG FDQ SHUIRUP DV JRRG DV VXSHUYLVHG PDQ
XDOO\ KDQGFUDIWHG GLVFUHWH IHDWXUHV 2XU '5 PRGHOV
RXWSHUIRUPHG GLVFUHWH WDVNV RQ 0OS /S .QQ ZLWK D
VPDOOPDUJLQHUURUUDWHV
8%0.¶QG,QWHUQDWLRQDO&RQIHUHQFHRQ&RPSXWHU6FLHQFHDQG(QJLQHHULQJ 
Table VI. ERROR RATES OF DIFFERENT METHODS USING 10-FOLD STRATIFIED CROSS-VALIDATION OF CONTINUOUS MODEL EXPERIMENTS

Config        OOV%   Dummy  Lp    Mlp   Nb     Knn    C45
SURFACE (SU)
100K          32.4   9.1    8.13  8.25  35.79  9.4    9.13
500K          23.05  10.75  8.06  7.81  35.18  10.24  10.61
1M            20.31  11.18  8.05  7.94  31.57  10.23  9.96
ROOT (RO)
100K          20.77  9.79   6.87  6.74  15.22  9.3    7.85
500K          17.49  11.1   6.96  6.62  20.43  14.8   9.55
1M            16.33  11.56  6.7   6.56  9.52   16.82  9.87
OTHER CONF.
1M-RO-d300    16.33  11.56  6.96  6.65  10.04  21.67  9.54
1M-RO-d20     16.33  11.56  7.47  6.76  11.49  7.92   9.33
1M-RO-d10     16.33  11.56  8.3   7.24  13.13  7.4    10.55
MIN           16.33  9.1    6.7   6.56  9.52   7.4    7.85
- Word embeddings with basic morphological root-form enrichment can provide better results than simple surface-form word embeddings (a 1.2-point reduction, from 8 to 6.8 on average, for Lp and Mlp). This improvement is linearly related to the reduction in OOV rates provided by the root-form representation ability of the models.
- Nb works significantly better with root forms than with surface forms (the error rate drops from 31.57 to 9.52).
- Increasing the corpus size did not improve NER classification performance, even though it was observed to decrease OOV rates.
- SkipGram performed better than CBOW in almost all experiments.
- Increasing the vector dimensions (d = 300) did not increase NER performance in the classification tasks. On the contrary, vectors of low dimension (10, 20) performed significantly better with Knn.
VIII. CONCLUSION

Named entity recognition is the problem of detecting the type of named entities in sentences, such as person, organisation, time or money. The difficulty in NER lies in deciding whether a noun denotes a person, an organisation or a location, and whether a number denotes time or money. As a new approach to NER, we produced six different discrete models with various features and tested these models on the tagged NER data created by us. The data consist of 1400 Turkish sentences, manually tagged by 7 different annotators.
We also used Mikolov et al.'s SkipGram and CBOW models for learning continuous representations of words. We then fed these trained continuous representations to our classifiers as continuous word features, without doing any feature enhancement. Our results show that these continuous models can perform as well as supervised, manually handcrafted discrete features on NER classification tasks.
As in many natural language processing subfields, words have been used as the atomic unit of language in the distributional semantics (DS) modeling field for the sake of model simplicity. Even though word-based models yield good results for languages with limited vocabulary, such as English, these models are rather ineffective for morphologically rich languages with unlimited vocabularies, such as Turkish.
ACKNOWLEDGEMENTS

This work was supported by Işık University BAP projects 14B206 and 15B201. All authors contributed equally to this work. O. A., İ. Ç., B. E., A. T. G., A. B. K., B. Ö. and O. T. designed and implemented the discrete model experiments; they also labeled the data and wrote the manuscript. G. E. designed and implemented the continuous model experiments. B. A. wrote the manuscript. O. T. Y. supervised the project, gave conceptual advice and wrote the manuscript.
REFERENCES

[1] E. Alpaydın, Introduction to Machine Learning, 3rd ed. The MIT Press, 2010.
[2] M. Baroni, G. Dinu, and G. Kruszewski, "Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors," in ACL, 2014, pp. 238–247.
[3] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[4] L. Breiman, "Random forests," Machine Learning, vol. 45, pp. 5–32, 2001.
[5] G. Celikkaya, D. Torunoglu, and G. Eryigit, "Named entity recognition on real data: A preliminary investigation for Turkish," in 7th International Conference on Application of Information and Communication Technologies, 2013.
[6] S. Cucerzan and D. Yarowsky, "Language independent named entity recognition combining morphological and contextual evidence," in Proceedings of the Joint SIGDAT Conference on EMNLP and VLC, 1999.
[7] E. T. Erguvanlı, The Function of Word Order in Turkish Grammar. Oxford: Oxford University Press, 1984.
[8] O. Gorgun and O. T. Yildiz, "A novel approach to morphological disambiguation for Turkish," in International Conference on Computer and Information Science, 2011, pp. 77–83.
[9] D. Küçük and R. Steinberger, "Experiments to improve named entity recognition on Turkish tweets," arXiv:1410.8668, 2014.
[10] O. Levy, Y. Goldberg, and I. Dagan, "Improving distributional similarity with lessons learned from word embeddings," Transactions of the Association for Computational Linguistics, vol. 3, pp. 211–225, 2015.
[11] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[12] K. Oflazer, B. Say, D. Hakkani-Tür, and G. Tür, "Building a Turkish treebank," in Building and Exploiting Syntactically-Annotated Corpora. Dordrecht: Kluwer, 2003.
[13] E. Okur, Named Entity Recognition for Turkish Microblog Texts Using Semi-Supervised Learning with Word Embeddings, 2011.
[14] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[15] G. A. Seker and G. Eryigit, "Initial explorations on using CRFs for Turkish named entity recognition," in COLING, 2012.
[16] E. Sezer, "On non-final stress in Turkish," Journal of Turkish Studies, vol. 5, pp. 61–69, 1981.
[17] S. Tatar and I. Cicekli, "Automatic rule learning exploiting morphological features for named entity recognition in Turkish," Journal of Information Science, vol. 37, pp. 137–151, 2011.
[18] G. Tür, D. Hakkani-Tür, and K. Oflazer, "A statistical information extraction system for Turkish," Natural Language Engineering, vol. 9, pp. 181–210, 2003.
[19] O. T. Yildiz, E. Solak, O. Gorgun, and R. Ehsani, "Constructing a Turkish-English parallel treebank," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, Maryland: Association for Computational Linguistics, June 2014, pp. 112–117.
UBMK'17 2nd International Conference on Computer Science and Engineering