A New Approach for Named Entity Recognition
Adlı Varlık Tanımada Yeni Bir Yaklaşım
Burak Ertopçu1, Ali Buğra Kanburoğlu1, Ozan Topsakal1, Onur Açıkgöz1, Ali Tunca Gürkan1, Berke Özenç1, İlker Çam1, Begüm Avar2, Gökhan Ercan1, Olcay Taner Yıldız1
burakertopcu@isik.edu.tr, bugrakanburoglu@isik.edu.tr, ozantopsakal@isik.edu.tr, onuracikgoz@isik.edu.tr, aligurkan@isik.edu.tr, berkeozenc@isik.edu.tr, ilkercam@isik.edu.tr, begumavar@boun.edu.tr, gokhanercan@isik.edu.tr, olcaytaner@isikun.edu.tr
1Department of Computer Engineering, Işık University, İstanbul, Turkey
2Department of Linguistics, Boğaziçi University, İstanbul, Turkey
Özetçe—Birçok cümlenin ilk bakışta insanlara verdiği izlenimler vardır. Bu izlenimler bir takım varlıklar sayesinde insanlara okudukları şeylerin ne hakkında olduğunu kavramada yardımcı olur. Bunun Doğal Dil Geliştirme’deki karşılığı Adlı Varlık Tanımadır (AVT). AVT algoritmaları temel olarak cümledeki kişi, yer, zaman, tarih, saat veya para gibi birçok varlığı tarayabilir. Bu işlemlerdeki en büyük problem bir ismin kişiye mi yoksa bir yere mi veya organizasyona mı veya bir sayının tarihe mi yoksa paraya mı ait olduğu gibi sorulardır. Bu çalışmada Adlı Varlık Tanıma algoritmalarına yeni bir model tasarladık. Bunu oluşturduğumuz veri setinde çalıştırıp elde edilen sonuçları diğer modellerinkilerle karşılaştırdık. Sonuç olarak ürettiğimiz 1400 cümlelik veri setinde kayda değer bulgular elde ettik.
Anahtar Sözcükler—Doğal Dil Geliştirme, Veri Çekme, Adlı Varlık Tanıma
Abstract—Many sentences create certain impressions on people. These impressions help the reader gain insight into the sentence via some entities. In NLP, this process corresponds to Named Entity Recognition (NER). NER algorithms can trace many kinds of entities in a sentence, such as person, location, date, time or money. One of the major problems in these operations is confusion about whether a word denotes the name of a person, a location or an organisation, or whether an integer stands for a date, time or money. In this study, we design a new model for NER algorithms. We train this model on our predefined dataset and compare the results with other models. In the end, we obtain considerable outcomes on a dataset containing 1400 sentences.
Keywords—Natural Language Processing, Information Extraction, Named Entity Recognition
I. INTRODUCTION
Natural language processing (NLP) is a field in computer science which investigates how a computer can understand and manipulate language in its written and spoken form. Language ability is one of the most important characteristics of human beings that sheds light on the functioning of the human brain. If human language can be modelled in a computer environment, it can be used for advanced and effective communication tasks.
The foundations of NLP lie in a number of disciplines, namely computer and information sciences, linguistics, mathematics, electrical and electronic engineering, artificial intelligence and robotics, and psychology. Some NLP applications include studies such as machine translation, natural language text processing and summarisation, user interfaces, multilingual and cross-language information retrieval (CLIR), speech recognition, artificial intelligence and expert systems.
NLP studies in general focus on figuring out how to resolve the rules of natural languages for machines. Through such resolution, many processes such as translation, information extraction from structurally irregular texts, question answering and text summarisation are intended to be performed automatically by machines.
Named entity recognition (NER) is one of the subtasks in NLP. Using previously existing or published information, NER aims to recognise words such as person, institution / establishment and place names, time expressions and currencies in a written text. In this study, we create a new model for NER. We train this model on our predefined dataset, which contains 1400 sentences, and compare the results with other models.
This paper is organised as follows: We define the NER problem in Section II and give the previous work in Section III. In continuous models, we represent words with continuous vectors, namely word embeddings. A brief introduction to word embeddings is given in Section IV. The details of our dataset and how it is constructed are given in Section V. We give our experiment methodology in Section VI and results in Section VII. Lastly, we conclude in Section VIII.
II. NAMED ENTITY RECOGNITION
Anything that is denoted by a proper name, for instance a person, a location, or an organization, is considered to be a named entity. In addition, named entities also include things like dates, times, or money. Here is a sample text with named entities marked:
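[ORG Türk Hava Yolları] bu [TIME Pazartesi’den] itibaren [LOC İstanbul] [LOC Ankara] güzergahı için indirimli satışlarını [MONEY 90 TL’den] başlatacağını açıkladı.
("Turkish Airlines announced that, starting this Monday, it will begin discounted sales from 90 TL for the İstanbul-Ankara route.")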
This sentence contains 5 named entities, including 3 words labeled as ORGANIZATION, 2 words labeled as LOCATION, 1 word labeled as TIME, and 2 words labeled as MONEY. Table I shows typical generic named entity types.
978-1-5386-0930-9/17/$31.00 ©2017 IEEE
UBMK’17 2nd International Conference on Computer Science and Engineering
Table I. List of named entity types with the kinds of entities they belong to
Tag    Sample Categories         Example
PER    people, characters        Atatürk yurdu düşmanlardan kurtardı.
ORG    companies, teams          IMKB günü 60 puan yükselerek kapattı.
LOC    regions, mountains, seas  Ülkemizin başkenti Ankara’dır.
TIME   time expressions          Cuma günü tatil yapacağım.
MONEY  monetary expressions      Geçen gün 3000 TL kazandık.
In named entity recognition, one tries to find the strings within a text that correspond to proper names (excluding TIME and MONEY) and classify the type of entity denoted by these strings. The problem is difficult partly due to the ambiguity in sentence segmentation; one needs to extract which words belong to a named entity and which do not. Another difficulty occurs when a word may be used as the name of either a person, an organization or a location. For example, Deniz may be used as the name of a person, or within a compound it can refer to a location (Marmara Denizi "Marmara Sea") or an organization (Deniz Taşımacılık "Deniz Transportation").
The standard approach for NER is word-by-word classification, where the classifier is trained to label the words in the text with tags that indicate the presence of particular kinds of named entities. After giving the class labels (named entity tags) to our training data, the next step is to select a group of features to discriminate different named entities for each input word. Table II shows the sample text represented with tag labels and three possible features, namely the root form of the word, the part-of-speech (POS) tag of the word, and a boolean feature for checking the capital case.
Table II. Named entity tagging as classification problem
                Features
Word            Root       Pos          Capital  ...  Label
Türk            Türk       Noun         True     ...  ORGANIZATION
Hava            Hava       Noun         True     ...  ORGANIZATION
Yolları         Yol        Noun         True     ...  ORGANIZATION
bu              bu         Pronoun      False    ...  NONE
Pazartesi’den   Pazartesi  Noun         True     ...  TIME
itibaren        itibaren   Adverb       False    ...  NONE
İstanbul        İstanbul   Noun         True     ...  LOCATION
Ankara          Ankara     Noun         True     ...  LOCATION
güzergahı       güzergah   Noun         False    ...  NONE
için            için       Adverb       False    ...  NONE
indirimli       indirimli  Adjective    False    ...  NONE
satışlarını     sat        Noun         False    ...  NONE
90              90         Number       False    ...  MONEY
TL’den          TL         Noun         True     ...  MONEY
başlatacağını   başlat     Noun         False    ...  NONE
açıkladı        açıkla     Verb         False    ...  NONE
.               .          Punctuation  False    ...  NONE
Given such training data, a classifier like a neural network or a decision tree can be trained to label new sentences. Figure 1 shows the operation of such a classifier at the point where the word Ankara is next to be labeled. For this classifier, we assume a context window that includes two preceding words and two succeeding words.
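
The word-by-word setup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `window_features`, the `<PAD>` placeholder for positions outside the sentence, and the feature-key format are all assumptions made for the example. The token features are taken from Table II.

```python
from typing import Dict, List

# Hypothetical padding token used when the window runs past the sentence edge.
PAD = {"root": "<PAD>", "pos": "<PAD>", "capital": False}


def window_features(tokens: List[Dict], index: int, window: int = 2) -> Dict[str, object]:
    """Collect root, POS and capitalization features for the word at `index`
    and for `window` words on each side of it."""
    features: Dict[str, object] = {}
    for offset in range(-window, window + 1):
        position = index + offset
        token = tokens[position] if 0 <= position < len(tokens) else PAD
        for name in ("root", "pos", "capital"):
            # Keys look like "root[-2]", "pos[+0]", "capital[+1]".
            features[f"{name}[{offset:+d}]"] = token[name]
    return features


# The five words around "Ankara" from the sample sentence (values from Table II).
sentence = [
    {"word": "itibaren", "root": "itibaren", "pos": "Adverb", "capital": False},
    {"word": "İstanbul", "root": "İstanbul", "pos": "Noun", "capital": True},
    {"word": "Ankara", "root": "Ankara", "pos": "Noun", "capital": True},
    {"word": "güzergahı", "root": "güzergah", "pos": "Noun", "capital": False},
    {"word": "için", "root": "için", "pos": "Adverb", "capital": False},
]

features = window_features(sentence, index=2, window=2)
```

Each word thus becomes one classification instance whose feature vector also describes its neighbours; any of the classifiers mentioned in the text could consume such a dictionary after encoding the discrete values.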
III. PREVIOUS WORK
A. Linguistic Background
While NER is a rather unproblematic task among NLP studies in well-studied languages like English, it faces certain challenges when dealing with a language like Turkish. One of the central foci of literature on Turkish linguistics has been the complexity of Turkish morphology. Turkish is a textbook example of an agglutinative language, i.e. words in their surface form may contain various morphemes, especially suffixes. The main problem posed by such languages is lexical sparsity [], which in general can be considered as a challenge for earlier steps such as the Morphological Analysis and Disambiguation tasks, which feed NER.
[Figure 1: the words "itibaren", "İstanbul", "Ankara", "güzergahı" and "için", each with its root, part-of-speech and capitalization features, are fed to a CLASSIFIER; the labels emitted so far read "... NONE LOC ? ...".]
Figure 1. Classifier based approach to named entity recognition. The tagger slides a context window over the sentence, classifying words as it proceeds. At this point, the classifier is attempting to label Ankara. Features derived from the context typically include the words, part-of-speech tags, etc.
Not only the morphology, but also the syntactic structure of Turkish is a fertile source of challenging issues. Turkish is typically treated as a subject-object-verb (SOV) language, yet it has a rather free word order. In other words, the constituents of a sentence can occur in any order with slight modifications in the sentential meaning. A particular order is chosen mainly on pragmatic grounds []. Therefore, the position of a word within a sentence does not provide any clues about whether it is a Named Entity or not.
As an additional problem specific to NER, there exist many proper nouns in Turkish which are derived from common nouns through suffixation (such as the following name of a city in Turkey: Denizli, derived from deniz "sea"), compounding (such as Balıkesir from balık+esir "fish+prisoner") or zero-derivation (such as Ordu from ordu "army"). In speech, prosodic cues are helpful in distinguishing proper nouns, especially place names, from common nouns due to their idiosyncratic stress pattern []. Yet, there are no orthographic correlates to stress. Instead, in formal texts at least, orthographic clues can be helpful in distinguishing a proper noun, which starts with a capital letter, from a common noun. In sentence-initial position, however, all words begin with a capital letter and hence this clue is not available.
In short, several linguistic features of Turkish, such as its rich morphology, free word order, as well as derivation as a word-formation process frequently employed in forming proper names, yield problems for NER tasks.
B. Computational Background
NER has received considerable attention in the literature. However, most of the models are specific to the language of focus. A language-independent method is proposed in [], which is a bootstrapping algorithm based on iterative learning. The method relies on word-internal and contextual clues. This work is tested on Turkish, Romanian, English, Greek and Hindi and can be considered as the first work on Turkish NER.
Turkish-specific studies are relatively scarce when compared to languages that have a wider global distribution. The first Turkish-specific work on NER [] presents a study based on information extraction. [] focuses on conditional random fields using morphological and lexical features. Their baseline is influenced by a leading work; however, the dataset used is gathered from real natural language data. Taking Turkish NER studies into account, models based on Hidden Markov Models (HMMs) [], Conditional Random Fields (CRFs) [] and rule-based [] studies are presented on data gathered from news reports. Formal texts like news are written according to certain rules of language. Authors of such texts follow the grammatical and orthographical rules of the language in question, and thus generate statistically less volatile data. Twitter is widely used in NLP studies as well, yet in tweets there is no need to follow spelling rules, and words (and even letters in the case of emoticons and some informal abbreviations) can be used in different senses than usual. An experiment-based study on tweets [] shows the difference between processing formal syntax and processing social media text.
IV. WORD EMBEDDINGS
Traditional representations of words (i.e., one-hot vectors) are based on word-word (W × W) co-occurrence sparse matrices, where W is the number of distinct words in the corpus. On the other hand, distributed word representations (DRs) (i.e., word embeddings) are word-context (W × C) dense matrices, where C ≪ W and C is the number of context dimensions, which are determined by underlying model assumptions. Dense representations are arguably better at capturing generalized information and more resistant to overfitting, due to context vectors representing shared properties of words. DRs are real-valued vectors where each context can be considered as a continuous feature of a word. Due to their ability to represent abstract features of a word, DRs are considered easily reusable across higher-level tasks, even if they are trained with totally different datasets.
Prediction-based DR models gained much attention after Mikolov et al.’s neural network based SkipGram model in []. The secret behind the prediction-based models is simple: never build a sparse matrix at all. Prediction-based models construct dense matrix representations directly, instead of reducing sparse ones to dense ones. These models are trained like any other supervised learning task by giving lots of positive and negative samples, without adding any human supervision costs. The aim of these models is to maximize the probability of each context c with the same distributional assumptions on word-context co-occurrences, similar to count-based models [] [].
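
To make the "positive samples" concrete, the sketch below enumerates the (word, context) pairs a SkipGram-style model would observe inside a symmetric context window. It is only an illustration of how training pairs arise from raw text, assuming a simple whitespace-tokenized sentence; the function name `skipgram_pairs` is hypothetical, and negative sampling and the actual neural training step are deliberately left out.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], win: int = 2) -> List[Tuple[str, str]]:
    """Return every (target, context) pair where the context token lies
    within `win` positions of the target token."""
    pairs: List[Tuple[str, str]] = []
    for i, target in enumerate(tokens):
        start = max(0, i - win)
        end = min(len(tokens), i + win + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Example sentence from Table I; punctuation is kept as a token.
tokens = ["Ülkemizin", "başkenti", "Ankara'dır", "."]
pairs = skipgram_pairs(tokens, win=1)
```

Because the pairs come directly from unannotated text, no human labeling is needed, which is the sense in which the training is described as supervised yet free of supervision cost.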
V. DATA
Our dataset is collected from the Penn-Treebank corpus, and each sentence of this dataset is translated into Turkish []. This dataset includes [...] sentences and [...] words, including punctuation marks. Our data format is shown in Figure 2.
Figure 2. Data content as train file
A. Morphological Disambiguation
Turkish is an agglutinative language in which words are formed by attaching derivational and inflectional suffixes to the roots. Morphemes added to a word can change its part of speech, for instance convert a noun to a verb or vice versa, or can create adverbs from adjectives. Moreover, during word formation, some letters can be changed or undergo deletion. Hence, without determining the lemma of a word from its surface form based on its intended meaning, it is not possible to identify the word correctly and extract candidate senses from a dictionary.
Following the translation, the corpus has been morphologically disambiguated. In that work, human annotators selected the correct morphological parsing from multiple possible analyses returned from the automatic parser. See Figure 3 for the morphological disambiguation tool used. The tag set and morphological representation were adopted from the study []. Each output of the parser comprises the root of the word, its part-of-speech tag and a set of morphemes, each separated with a ’+’ sign.
Figure 3. Morphological disambiguation tool
B. NER Tagging
In the second stage, we randomly selected 1400 sentences and annotated them manually by using the "NER Annotation Tool" of Işık University. The tool is shown in Figure 4.
Figure 4. Annotation tool for NER
The NER tags that are used as class labels in this study are: PERSON for persons’ names, LOCATION for place names, ORGANIZATION for governmental or civil organizations, firms, associations, etc., TIME for specific points in time such as dates, years, hours, etc., MONEY for financial amounts or currencies, and NONE for everything else. The distribution of the data over these class labels is shown below in Table III. The annotated dataset and source code are freely available.
Table III. Distribution of the data
Label         Count
PERSON        606
LOCATION      235
ORGANIZATION  685
MONEY         387
TIME          299
NONE          10982
VI. METHODOLOGY
The seven known classification algorithms we use are:
• Dummy: Decides based on the prior class probability without looking at the input. All test instances are assigned to the class with the maximum prior.
• C4.5: The archetypal decision tree method [].
• Knn: K-Nearest Neighbor classification algorithm that uses the Euclidean distance.
• Lp: Linear perceptron with softmax outputs, trained by gradient descent to minimize cross-entropy [].
• Mlp: Well-known multilayer perceptron classification algorithm [].
• Nb: Classic Naive Bayes classifier, where each feature is assumed to be Gaussian distributed [] and independent from the other features.
• Rf: The Random Forest method improves the bagging idea by randomizing features at each decision node [], and treats these random decision trees as weak learners. At prediction time, these weak learners are combined using committee-based procedures.
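
As a point of reference for the list above, the Dummy baseline is simple enough to sketch in a few lines. The class below is an illustrative stand-in written for this example (it is not the toolkit used in the paper), and the training and test labels are toy data.

```python
from collections import Counter
from typing import List


class DummyClassifier:
    """Assigns every test instance to the class with the maximum prior,
    i.e. the most frequent label seen during training."""

    def fit(self, labels: List[str]) -> "DummyClassifier":
        self.majority_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n_instances: int) -> List[str]:
        # The input features are never inspected; only the count is needed.
        return [self.majority_] * n_instances


# Toy labels, only to show the mechanics.
train_labels = ["NONE", "NONE", "PERSON", "NONE", "LOCATION"]
test_labels = ["NONE", "ORGANIZATION", "NONE", "MONEY"]

model = DummyClassifier().fit(train_labels)
predictions = model.predict(len(test_labels))
error_rate = sum(p != t for p, t in zip(predictions, test_labels)) / len(test_labels)
```

Since NONE dominates the label distribution in NER data, such a baseline already achieves a low error rate, which is why the other classifiers are compared against it.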
A. Discrete Model
1) Features: We used the following features in our Discrete Models:
• CaseAttribute (C): This is a discrete attribute for a given word. If the last inflectional group of the word contains case information, the attribute will have that case value. Otherwise, the attribute will have the value null.
• IsCapitalAttribute (IC): This is a binary attribute. It checks each word in a sentence. If the first character is uppercase, the attribute returns true; otherwise, it returns false.
• IsDateAttribute (ID): This is a binary attribute that checks whether a word is written in date format. If the word is in date format, it returns true; otherwise, it returns false.
• IsFractionAttribute (IF): This is a binary attribute that checks whether a word is a fractional number. If the word is a fractional number, it returns true; otherwise, it returns false.
• IsHonorificAttribute (IH): This is a binary attribute. It checks each word in a sentence. If the word equals one of the titles "bay", "bayan", "sayın", "dr", "prof" or "doç", it returns true; otherwise, it returns false.
• IsMoneyAttribute (IM): This is a binary attribute. It checks whether a word in a sentence denotes a currency. If the word is "doları", "lirası" or "avro", the attribute returns true; otherwise, it returns false.
• IsNumAttribute (IN): This attribute checks whether the word has the "num" (number) tag.
• IsOrganizationAttribute (IO): This is a binary attribute. It checks a given word, and if it equals "corp", "inc" or "co", it returns true; otherwise, it returns false.
• IsPropAttribute (IP): This attribute checks whether the word has the "prop" (proper name) tag.
• IsRealAttribute (IR): This is a binary attribute that checks whether a word is a real number or not. If the word is a real number, it returns true; otherwise, it returns false.
• IsTimeAttribute (IT): This is a binary attribute that checks whether a word is written in time format. If the word is in time format, it returns true; otherwise, it returns false.
• MainPosAttribute (MP): This is a discrete attribute. It returns the main part of speech of the word.
• LastIGContainsPossessiveAttribute (P): This is a binary attribute. If the last inflectional group of the word contains possessive information, the attribute will return true; otherwise, it will return false.
• RootFormAttribute (RF): This is a discrete attribute. It returns the root form of a given word.
• RootPosAttribute (RP): This is a discrete attribute. It returns the part of speech of the root form of a given word.
• SurfaceFormAttribute (SF): This is a discrete attribute. It returns the surface form of the word.
Dataset and source code: http://haydut.isikun.edu.tr/nlptoolkit.html
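
Several of the binary attributes above depend only on the surface string and can be sketched directly. The word lists below are the ones given in the feature descriptions; everything else is an assumption for illustration: the regular expressions for date, time, fraction and real-number formats are hypothetical (the paper does not specify them), as is the choice to ignore case and a trailing period when matching titles and company markers. Attributes that need morphological analysis (C, IN, IP, MP, P, RF, RP) are not covered.

```python
import re
from typing import Dict

# Word lists taken from the attribute descriptions.
HONORIFICS = {"bay", "bayan", "sayın", "dr", "prof", "doç"}
CURRENCIES = {"doları", "lirası", "avro"}
ORGANIZATION_MARKERS = {"corp", "inc", "co"}

# Hypothetical surface patterns; the actual formats used are not stated.
DATE_PATTERN = re.compile(r"\d{1,2}[./]\d{1,2}[./]\d{2,4}")
TIME_PATTERN = re.compile(r"\d{1,2}:\d{2}")
FRACTION_PATTERN = re.compile(r"\d+/\d+")
REAL_PATTERN = re.compile(r"[+-]?\d+[.,]\d+")


def _key(word: str) -> str:
    """Lowercase the word and drop a trailing period (assumed normalization)."""
    return word.lower().rstrip(".")


def discrete_features(word: str) -> Dict[str, bool]:
    """Compute the string-only binary attributes for a single word."""
    return {
        "IC": word[:1].isupper(),
        "ID": DATE_PATTERN.fullmatch(word) is not None,
        "IF": FRACTION_PATTERN.fullmatch(word) is not None,
        "IH": _key(word) in HONORIFICS,
        "IM": _key(word) in CURRENCIES,
        "IO": _key(word) in ORGANIZATION_MARKERS,
        "IR": REAL_PATTERN.fullmatch(word) is not None,
        "IT": TIME_PATTERN.fullmatch(word) is not None,
    }


example = discrete_features("Ankara")
```

Each method in the discrete model would select a subset of such attributes and compute them for every word inside its context window.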
2) Methods: First of all, we annotated 1400 sentences to generate our dataset. The annotation was done manually, and it was a very significant point for the experiment, because we needed an effective dataset to get reliable results.
We give the details of the methods, such as the attributes, the classifier, and its parameters, in Table IV.
Table IV. Details of methods used in discrete model
Method   Attributes                             Window  Classifier  Parameters                                           Author
METHOD1  IC, ID, IR, IT, MP                     4       Rf          attribute subset size = 4, ensemble size = 20        B.Ö.
METHOD2  IC, ID, IF, IT, MP, RF, SF             1       Mlp         learning rate = 0.1, number of hidden nodes = 30     B.E.
METHOD3  IH, IM, MP, RF, SF                     0       Mlp         learning rate = 0.1, number of hidden nodes = 50     A.T.G.
METHOD4  C, IC, IH, IO, IR, MP, P, RF, RP, SF   2       Lp          learning rate = 0.1                                  A.B.K.
METHOD5  IC, IN, IP, MP, RF, SF                 1       Mlp         learning rate = 0.1, number of hidden nodes = 5, 8   O.T.
METHOD6  IC, IP, MP, RF, SF                     1       Mlp         learning rate = 0.1, number of hidden nodes = 10     O.A.
Table V. Error rates of methods using stratified 10-fold cross-validation
Classifier      DUMMY  METHOD1  METHOD2  METHOD3  METHOD4  METHOD5  METHOD6
Error Rate (%)  14.89  14.71    7.64     12.19    7.65     8.45     9.28
B. Continuous Model
We used Mikolov et al.’s SkipGram and CBOW models for learning continuous representations of words. For unsupervised training, we used a news corpus consisting of 1 million (1M) unannotated sentences collected from news and articles from newspapers. We trained models with the same settings in three separate sizes: 100K, 500K, 1M. DR models generally have tens of hyperparameters. Below, we list only the most important hyperparameters, along with the default values we used in our experiments:
• Context window (win): Size of the context window from which the prediction model gathers co-occurrence information. The reader should note that, even though the concepts are similar, this parameter should not be confused with the context window size we used when executing classifiers. We used [...] as the default setting.
• Dimension (d): Size of the C (context dimensions) of vectors for word-context dense matrices (W × C). We used [...] as the default setting.
• Deleting infrequent words (del): Most DR models ignore (delete) infrequent words in the corpus with some thresholding mechanism, assuming that infrequent words are uninformative. We did not ignore infrequent words, due to the nature of the NER task requiring more samples of infrequent instances such as Ohio, IBM.
Unlike the usual DR text preprocessing conventions, we did not lowercase tokens and did not filter out any punctuation (e.g. commas, dots, hyphens), because the NER task requires the inclusion of proper words and punctuation. Therefore, the distribution of punctuation is also learnt along with words. For example, as expected, our 1M-sentence model returns "[...]" as the spatially closest token to "[...]". Another interesting example is that our model returns the tokens "oranında", "gerilemiş", "yüzde", "[...]", "oran" (words related to percentage in Turkish) as the spatially closest ones to the token "[...]".
We trained our DR models in two kinds of word forms:
• Surface Form (SU): Natural form of a word as it appears in a text. (Ex: Güzel gözlü turnalar göçtüler)
• Root Form (RO): Root of a word, used in DR training, based on morphological disambiguation of every sentence provided by the "Morphological Disambiguation Library" of Işık University []. (Ex: Güzel göz turna göç)
We used our trained DR outputs (W × C) to feed our classifiers as continuous features of words, without doing any additional task-specific feature handcrafting. Therefore, we feed every word with C = d continuous features by using C-dimensional word vectors of our generated representations. If the generated DRs do not have a word vector for a word instance requested by the classifier model, we provide a zero vector in C dimensions to represent a null feature value. DRs are real-valued vectors normalized between -1 and 1. We also report missing word requests with each experiment as the out-of-vocabulary (OOV) rate, because classifier performance is highly dependent on OOV rates in experiments.
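
The lookup with zero-vector fallback and the accompanying OOV rate can be sketched as below. The function name `embed`, the three-dimensional toy vectors and the word list are assumptions chosen for the example; real vectors would come from the trained SkipGram or CBOW model.

```python
from typing import Dict, List, Tuple


def embed(words: List[str], vectors: Dict[str, List[float]],
          dim: int) -> Tuple[List[List[float]], float]:
    """Map each word to its vector, using a zero vector of length `dim`
    for words missing from the vocabulary. Also return the OOV rate (%)."""
    rows: List[List[float]] = []
    missing = 0
    for word in words:
        vector = vectors.get(word)
        if vector is None:
            missing += 1
            vector = [0.0] * dim  # null feature value for unknown words
        rows.append(list(vector))
    oov_rate = 100.0 * missing / len(words) if words else 0.0
    return rows, oov_rate


# Toy 3-dimensional vectors with components in [-1, 1].
toy_vectors = {
    "Ankara": [0.2, -0.7, 0.1],
    "için": [-0.3, 0.4, 0.9],
}
words = ["Ankara", "güzergahı", "için", "indirimli"]
rows, oov_rate = embed(words, toy_vectors, dim=3)
```

With root-form embeddings, "güzergahı" would first be reduced to "güzergah" before the lookup, which is one way the root form can lower the OOV rate.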
VII. EXPERIMENTS
A. Inter-annotator Agreement
For evaluation of the annotated dataset, we used the inter-annotator agreement measure. Two different groups of annotators annotated the same sentences. Due to a lack of time, we could only re-annotate [...] of the total of 1400 sentences and got [...] inter-annotator agreement. As a comparison, the expected inter-annotator agreement, assuming the annotators annotated completely randomly, is [...].
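
One common way to compute such figures is shown below: observed agreement is the fraction of tokens that received the same label from both annotators, and chance agreement is estimated from each annotator's label frequencies (as in Cohen's kappa). Whether the paper used this particular definition of chance agreement is not stated, so this is an assumption, and the two label sequences are toy data.

```python
from collections import Counter
from typing import List


def observed_agreement(a: List[str], b: List[str]) -> float:
    """Fraction of positions where both annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def chance_agreement(a: List[str], b: List[str]) -> float:
    """Expected agreement if each annotator labeled at random according
    to their own label frequencies."""
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(counts_a) | set(counts_b)
    return sum((counts_a[label] / n) * (counts_b[label] / n) for label in labels)


# Toy annotations of the same eight tokens by two annotators.
annotator_1 = ["NONE", "NONE", "PERSON", "LOCATION", "NONE", "MONEY", "NONE", "NONE"]
annotator_2 = ["NONE", "NONE", "PERSON", "ORGANIZATION", "NONE", "MONEY", "NONE", "TIME"]

p_observed = observed_agreement(annotator_1, annotator_2)
p_chance = chance_agreement(annotator_1, annotator_2)
```

Comparing the two numbers shows how much of the raw agreement could be explained by the dominance of the NONE label alone.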
B. Discrete Model
We evaluated the proposed approach using stratified 10-fold cross-validation on the data. The error rates obtained from each classifier are shown in Table V.
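
Stratification means that every fold preserves the label proportions of the whole dataset. A minimal sketch of how such folds can be built is given below; the function name `stratified_folds`, the fixed random seed and the toy label distribution are assumptions, and the paper does not state whether folds were formed at the word or the sentence level.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_folds(labels: List[str], k: int = 10, seed: int = 0) -> List[List[int]]:
    """Split instance indices into k folds so that each label is spread
    as evenly as possible across the folds."""
    by_label: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this label round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Toy data: 80 NONE instances and 20 PERSON instances.
labels = ["NONE"] * 80 + ["PERSON"] * 20
folds = stratified_folds(labels, k=10)
```

In each of the ten rounds, one fold is held out for testing and the classifier is trained on the remaining nine; the reported error rate is then the average over the rounds.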
C. Continuous Model
We fixed the SkipGram (SG) model configuration as default with hyperparameters (win=[...], d=[...], del=[...]) to use as a baseline for all other configurations (SG-win[...]-d[...]-del[...]). We chose the default configuration based on our observation that it performs better on average than all the other configuration options we tried in preliminary experiments. Using the default configuration, we trained one model for each combination of the three corpus sizes (100K, 500K, 1M) and the two word forms (Surface, Root). We ran our models with Dummy, Lp (with learning rate [...]), Mlp (with learning rate [...], hidden nodes [...]), Knn (with k=[...]), C4.5 (with pruning) and Nb classifiers with 10-fold cross-validation on the data. We ran all classifiers with the window size of [...]. Table VI shows classifier error rates and OOV percentages of our experiments.
Table VI. Error rates of different methods using 10-fold stratified cross-validation of continuous model experiments
Model         OOV%   Dummy  Lp    Mlp   Nb     Knn    C45
SURFACE (SU)
100K          32.4   9.1    8.13  8.25  35.79  9.4    9.13
500K          23.05  10.75  8.06  7.81  35.18  10.24  10.61
1M            20.31  11.18  8.05  7.94  31.57  10.23  9.96
ROOT (RO)
100K          20.77  9.79   6.87  6.74  15.22  9.3    7.85
500K          17.49  11.1   6.96  6.62  20.43  14.8   9.55
1M            16.33  11.56  6.7   6.56  9.52   16.82  9.87
OTHER CONF.
1M-RO-d300    16.33  11.56  6.96  6.65  10.04  21.67  9.54
1M-RO-d20     16.33  11.56  7.47  6.76  11.49  7.92   9.33
1M-RO-d10     16.33  11.56  8.3   7.24  13.13  7.4    10.55
MIN           16.33  9.1    6.7   6.56  9.52   7.4    7.85
Our continuous model experiments show that:
• Unsupervised DR models can provide valuable information as features of words for NER classification tasks and can perform as well as supervised, manually handcrafted discrete features. Our DR models outperformed the discrete models on Mlp, Lp and Knn with a small margin in error rates.
• Word embeddings with basic morphological root form enrichment can provide better results than simple surface form word embeddings (1.2pt reduction from 8 to 6.8 on average in Lp and Mlp). This improvement is linearly related to the reduction of OOV rates provided by the root form representation ability of the models.
• Nb works significantly better with root forms compared to surface forms (error rate reduced from 31.57 to 9.52 on the 1M corpus).
• Increasing corpus size did not improve NER classification performance, even though it was observed to decrease OOV rates.
• SkipGram performed better than CBOW in almost all experiments.
• Increasing vector dimensions (d = 300) for classification tasks did not increase NER performance. On the contrary, vectors in low dimensions (10, 20) performed significantly better with Knn.
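
The average Lp and Mlp error rates for surface and root forms can be recomputed from Table VI as a quick check. The sketch below assumes a plain mean over the 100K, 500K and 1M rows of both classifiers; how the authors averaged is not stated. Computed this way, the means come out near 8.0 and 6.7, about 1.3 points apart, which is in line with the rounded figures quoted in the list above.

```python
from typing import Dict, List

# Lp and Mlp error rates from the 100K, 500K and 1M rows of Table VI.
surface: Dict[str, List[float]] = {
    "Lp": [8.13, 8.06, 8.05],
    "Mlp": [8.25, 7.81, 7.94],
}
root: Dict[str, List[float]] = {
    "Lp": [6.87, 6.96, 6.70],
    "Mlp": [6.74, 6.62, 6.56],
}


def mean_error(table: Dict[str, List[float]]) -> float:
    """Plain average over every classifier and corpus size in the table."""
    values = [value for column in table.values() for value in column]
    return sum(values) / len(values)


surface_mean = mean_error(surface)
root_mean = mean_error(root)
reduction = surface_mean - root_mean
```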
VIII. CONCLUSION
Named entity recognition is the problem of detecting the type of named entities in sentences, such as person, organisation, time or money. The difficulty in NER lies in deciding whether a noun denotes a person, an organisation or a location, and whether a number denotes time or money. As a new approach to NER, we produced six different discrete models with various features and tested these models on the tagged NER data we created. The data consist of 1400 Turkish sentences, manually tagged by 7 different annotators.
We also used Mikolov et al.’s SkipGram and CBOW models for learning continuous representations of words. We then fed these trained representations to our classifiers as continuous features of words, without any feature enhancement. Our results show that these continuous models can perform as well as supervised, manually handcrafted discrete features on NER classification tasks.
As in many natural language processing subfields, words have been used as the atomic unit of language in distributional semantics (DS) modeling for the sake of model simplicity. Even though word-based models yield good results for languages with limited vocabulary such as English, these models are rather ineffective for morphologically rich languages with unlimited vocabularies such as Turkish.
ACKNOWLEDGEMENTS
This work was supported by Işık University BAP projects 14B206 and 15B201. All authors contributed equally to this work. O. A., İ. Ç., B. E., A. T. G., A. B. K., B. Ö., O. T. designed and implemented discrete model experiments. They also labeled the data and wrote the manuscript. G. E. designed and implemented continuous model experiments. B. A. wrote the manuscript. O. T. Y. supervised the project, gave conceptual advice and wrote the manuscript.
REFERENCES
[1] E. Alpaydın, Introduction to Machine Learning, 3rd ed. The MIT Press, 2010.
[2] M. Baroni, G. Dinu, and G. Kruszewski, “Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors,” in ACL, 2014, pp. 238–247.
[3] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[4] L. Breiman, “Random forests,” Machine Learning, vol. 45, pp. 5–32, 2001.
[5] G. Celikkaya, D. Torunoglu, and G. Eryigit, “Named entity recognition on real data: a preliminary investigation for Turkish,” in 7th International Conference on Application of Information and Communication Technologies, 2013.
[6] S. Cucerzan and D. Yarowsky, “Language independent named entity recognition combining morphological and contextual evidence,” in Proceedings of the Joint SIGDAT Conference on EMNLP and VLC, 1999.
[7] E. T. Erguvanlı, The Function of Word Order in Turkish Grammar. Oxford: Oxford University Press, 1984.
[8] O. Gorgun and O. T. Yildiz, “A novel approach to morphological disambiguation for Turkish,” in International Conference on Computer and Information Science, 2011, pp. 77–83.
[9] D. Küçük and R. Steinberger, “Experiments to improve named entity recognition on Turkish tweets,” arXiv:1410.8668, 2014.
[10] O. Levy, Y. Goldberg, and I. Dagan, “Improving distributional similarity with lessons learned from word embeddings,” Transactions of the Association for Computational Linguistics, vol. 3, pp. 211–225, 2015.
[11] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[12] K. Oflazer, B. Say, D. Hakkani-Tür, and G. Tür, “Building a Turkish treebank,” in Building and Exploiting Syntactically-Annotated Corpora. Dordrecht: Kluwer, 2003.
[13] E. Okur, Named Entity Recognition for Turkish Microblog Texts Using Semi-Supervised Learning with Word Embeddings, 2011.
[14] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[15] G. A. Seker and G. Eryigit, “Initial explorations on using CRFs for Turkish named entity recognition,” in COLING, 2012.
[16] E. Sezer, “On non-final stress in Turkish,” Journal of Turkish Studies, vol. 5, pp. 61–69, 1981.
[17] S. Tatar and I. Cicekli, “Automatic rule learning exploiting morphological features for named entity recognition in Turkish,” Journal of Information Science, vol. 37, pp. 137–151, 2011.
[18] G. Tür, D. Hakkani-Tür, and K. Oflazer, “A statistical information extraction system for Turkish,” Natural Language Engineering, vol. 9, pp. 181–210, 2003.
[19] O. T. Yildiz, E. Solak, O. Gorgun, and R. Ehsani, “Constructing a Turkish-English parallel treebank,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, Maryland: Association for Computational Linguistics, June 2014, pp. 112–117.