Erdinç Uzun
Namık Kemal Üniversitesi · Computer Engineering

PhD

About

32 Publications
18,318 Reads
211 Citations
Introduction
Erdinç Uzun currently works at the Department of Computer Engineering, Tekirdağ Namık Kemal University. His research focuses on more efficient and effective algorithms and approaches for web data extraction (web scraping).

Publications (32)
Chapter
With the evolution of technology, a new data world has emerged: one that never degrades, can be reached from anywhere, and continuously streams and multiplies. The data created by business firms, scientific research centers, and automation systems in particular has reached enormous volumes. It has become the main goal of many data analysts to reach meaningful, u...
Article
Full-text available
Cascading style sheets (CSS) selectors are patterns used to select HTML elements. They are often preferred in web data extraction because they are easy to prepare and have short expressions. In order to extract data from web pages using these patterns, a document object model (DOM) tree is constructed by an HTML parser for a web page....
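As a minimal illustration of the technique this abstract describes, the sketch below extracts text with a CSS selector in Python using BeautifulSoup. The HTML snippet and the selector are illustrative assumptions, not taken from the paper.

    # Minimal sketch: CSS-selector-based extraction over a parsed DOM tree.
    # The HTML and the selector "div.post > p" are illustrative assumptions.
    from bs4 import BeautifulSoup

    html = "<div class='post'><h2>Title</h2><p>Body text</p></div>"
    soup = BeautifulSoup(html, "html.parser")  # the parser builds the DOM tree

    for node in soup.select("div.post > p"):   # short CSS pattern over the tree
        print(node.get_text())                 # -> Body text

The appeal of CSS selectors noted in the abstract is visible here: the whole extraction rule fits in one short expression.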
Article
Full-text available
Web pages contain irrelevant images along with relevant images. Classifying these images is an error-prone process due to the wide design variation across web pages. Using multiple web pages provides additional features that improve the performance of relevant image extraction. Traditional studies use the features extracted from a singl...
Conference Paper
Full-text available
Artificial Intelligence technologies can today provide effective solutions for understanding increasingly large, multi-dimensional, and complex data and for extracting meaningful information from it. In particular, as the number of features in the data and the complexity and size of the problem grow, discovering meaningful relationships and understanding and interpreting the data become increasingly difficult. This difficulty...
Article
Full-text available
Web scraping is the process of extracting valuable and interesting text information from web pages. Most current studies targeting this task focus on automated web data extraction. In the extraction process, these studies first create a DOM tree and then access the necessary data through this tree. The construction process of this tree...
Conference Paper
Full-text available
Abstract: In this study, seven large datasets were created for the Turkish language, to be used in modeling work that may solve genre recognition, one of the text classification problems. These datasets consist of all past columns, written up to 24.04.2019, by the columnists of a newspaper. The authors' columns...
Article
Full-text available
One of the most important factors affecting software performance is the improvements that can be made in database design. Normalization, a relational database theory, is frequently used in database design. However, as the amount of data grows, performance problems caused by normalization begin to appear. To eliminate these performance problems...
Article
Full-text available
An entity relationship diagram (ERD) is a visual aid for database design. An ERD gives information about the relations of entity sets and the logical structure of databases. In this paper, we introduce an open source JavaScript library named EntRel.JS for designing sophisticated ERDs by writing simple code. This library, which we have develop...
Article
Full-text available
There are several libraries for extracting useful data from web pages in Python. In this study, we compare three different well-known extraction libraries including BeautifulSoup, lxml and regex. The experimental results indicate that regex achieves the best results with an average of 0.071 ms. However, it is difficult to generate correct extractio...
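A hedged sketch of this kind of comparison appears below: the same heading is extracted with regex, lxml and BeautifulSoup, and the average time per call is measured. The test page, the pattern and the run count are illustrative assumptions; the 0.071 ms figure above is the paper's own result, not something this sketch reproduces.

    # Sketch of the comparison setup: time one extraction done three ways.
    # The page, pattern and run count are illustrative assumptions.
    import re
    import timeit
    from bs4 import BeautifulSoup
    from lxml import html as lxml_html

    page = "<html><body><h1>Headline</h1></body></html>"

    def with_regex():
        return re.search(r"<h1>(.*?)</h1>", page).group(1)

    def with_lxml():
        return lxml_html.fromstring(page).findtext(".//h1")

    def with_bs4():
        return BeautifulSoup(page, "html.parser").h1.get_text()

    runs = 1000
    for fn in (with_regex, with_lxml, with_bs4):
        total = timeit.timeit(fn, number=runs)   # seconds for all runs
        print(f"{fn.__name__}: {total / runs * 1000:.3f} ms per call")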
Conference Paper
Full-text available
With the DOM, programming languages can access and change all the HTML elements of a web page. There are several libraries for instantiating the DOM. In this study, we compare three different well-known .NET libraries, including HAP (Html Agility Pack), AngleSharp and MS_HtmlDocument, for extracting content from web pages. The experimental results indic...
Conference Paper
Full-text available
String matching algorithms try to find the position(s) where one or more patterns (also called strings) occur in a text. In this study, we compare 31 different pattern matching algorithms on web documents. In web documents, searching is a crucial step of the content extraction process. Therefore, the lengths of HTML tags are examined to determine whic...
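As a toy illustration of the task being benchmarked, the Python scan below reports every position at which a pattern occurs in a document; it stands in for the 31 dedicated algorithms compared in the paper, which are not reproduced here.

    # Toy baseline for exact pattern matching: report every match position.
    # Dedicated algorithms (e.g. Boyer-Moore variants) do the same job with
    # fewer character comparisons; this linear scan only illustrates the task.
    def find_all(text: str, pattern: str) -> list[int]:
        positions, i = [], text.find(pattern)
        while i != -1:
            positions.append(i)
            i = text.find(pattern, i + 1)
        return positions

    doc = "<div><p>a</p><p>b</p></div>"
    print(find_all(doc, "<p>"))  # -> [5, 13]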
Article
Full-text available
Assignments are one of the most important parts of the education process for students. In the classical evaluation process, an assignment is only assessed as correct or not. However, for assignments to contribute more to education, plagiarism committed by students should be considered. Detection of plagiarism and its...
Article
Extracting user reviews from websites such as forums, blogs, newspapers, commerce and trip sites is crucial for text processing applications (e.g. sentiment analysis, trend detection/monitoring and recommendation systems) that need structured data to work with. Traditional algorithms have three processes consisting of Document Object Model (...
Article
Full-text available
Vehicles powered by electricity and solar energy obtain their motive power through electric motors, and these motors are in turn powered by batteries. Data such as an electric vehicle's battery state, temperature, and instantaneous speed are very important for computing information such as the remaining range. In this study, the measurement of these data...
Article
Classical Web crawlers make use of only hyperlink information in the crawling process. Focused crawlers, in contrast, are intended to download only Web pages that are relevant to a given topic by utilizing word information before downloading the Web page. However, Web pages contain additional information that can be useful for the crawling process....
Conference Paper
In this study, the impact of term weighting on author detection, treated as a type of text classification, is investigated. The feature vector used to represent texts consists of stem words as features and their weight values, which are obtained by applying 14 different term weighting schemes. The performances of these feature vectors for 3 different...
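To make "term weighting scheme" concrete, here is a small sketch using TF-IDF, one widely used scheme; which 14 schemes the paper actually compares is not visible in this excerpt. The toy documents and the scikit-learn dependency are assumptions for illustration.

    # TF-IDF: one common way to turn word frequencies into feature weights.
    # The documents are toy data; scikit-learn is assumed to be installed.
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat", "the dog sat", "the cat ran"]
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)          # rows = documents, columns = terms
    print(vec.get_feature_names_out())   # term vocabulary
    print(X.toarray().round(2))          # weighted feature vectors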
Conference Paper
Via information extraction techniques, web pages can be used to generate datasets for various studies such as natural language processing and data mining. However, nowadays uninformative sections like advertisements, menus, and links are on the increase. Cleaning web pages of uninformative sections and extracting the informative content has b...
Conference Paper
The dimensions of the feature vectors used by the classification methods in the literature directly affect time performance. In this study, we explain how to reduce the dimension of the feature vector by using Turkish grammar rules without compromising success rates. The feature vector is weighted on the basis of the word frequency as...
Article
This study proposes a fuzzy ranking approach, designed for Turkish as an agglutinative language, that focuses on improving stemming techniques by using the distances of characters in its search algorithm. Various studies on search engines rely on stemming techniques in the indexing process because of the higher percentage of relevancy t...
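The paper's own ranking formula is not shown in this excerpt. As a generic stand-in, the sketch below computes character-level edit distance, the kind of distance signal a fuzzy ranking can use to tolerate suffix variation in an agglutinative language; the example words are illustrative.

    # Generic character-distance sketch (not the paper's fuzzy ranking):
    # Levenshtein distance between a surface form and a candidate stem.
    def levenshtein(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    print(levenshtein("kitaplar", "kitap"))  # -> 3 (the "-lar" suffix)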
Conference Paper
Full-text available
In many web content extraction applications, parsing is a crucial issue for obtaining the necessary information from web pages. A DOM parser is preferably used for this task. However, the major problems with using a DOM parser are time and memory consumption. This parser is an inefficient solution for applications which need web content extraction betwe...
Conference Paper
Full-text available
Extracting the relevant contents of web pages is an important issue for research on information retrieval, data mining and natural language processing. For this task, the contents of tags in web pages from the same domain can be used to discover unnecessary contents. However, small changes in the tag contents of web pages can cause problems in extraction. Theref...
Conference Paper
Full-text available
To improve the algorithms used in search engines, crawlers and indexers, the evolution of web pages should be examined. For this purpose, we developed a domain-based crawler, namely SET Crawler, which collects the 1998-2008 web archives of three popular Turkish daily newspapers (Hurriyet, Milliyet and Sabah). After completion of th...
Conference Paper
Full-text available
Web pages contain a great deal of unnecessary information besides the main content the user wants to see. This study describes an intelligent crawler that can remove unnecessary content by using a rule-based system. This crawler was trained 5 times on 6 websites. After the first training, 79.28% of the unnecessary content was cleaned, while after the fifth training...
Article
Full-text available
Abstract: With the emergence of the WWW (World Wide Web) concept, the HTML (HyperText Markup Language) markup language, in which information and visual content are stored, formed the foundation of the Internet. Because of HTML's shortcomings in representing data, the XML (Extensible Markup Language) markup language began to take its place in the Internet world. Together with XML...
Thesis
This thesis presents a web-based system designed to carry out the task of automatically acquiring subcategorization lists for Turkish. With its pro-drop, sparse-representation and free word order properties, Turkish provides an interesting and challenging application area for natural language processing tasks. The thesis covers information retrieval, natural language...
Conference Paper
Full-text available
The World Wide Web can be used as a source of machine-readable text for corpora. Search engines, programs that search documents for specified keywords and return a list of matching documents, are the main tools by which such texts can be collected. However, the usefulness of results returned by search engines is limited, at least by the sheer amount of n...
Conference Paper
Full-text available
This paper presents the results of two experiments conducted by applying the purely statistical methods log likelihood and t-score to Turkish in order to acquire subcategorization frames for this language. The results achieved are compared with some results reported for other languages. The comparison is evaluated in terms of language typology.
