Conference Paper

SHADOW: A framework for Systematic Heuristic Analysis and Detection of Observations on the Web


Abstract and Figures

Cyberspace contains vast amounts of information that cybersecurity professionals rely on to gather threat intelligence, prevent cyberattacks, and secure organizational networks. Unlike earlier, less targeted attacks, modern cyber-attacks are organized and sophisticated, often aimed at specific groups, leaving many users unaware of the vulnerable resources within cyberspace. With increasingly open access to information on the deep and dark web, many organizations have discovered their data exposed in these spaces. Creating methods to crawl and extract valuable information from the deep web is therefore a critical concern. Some deep web content can be reached from the surface web by submitting query forms to retrieve the needed information, but this is not straightforward in all cases. This paper proposes a framework to identify such leaks and notify the relevant parties in a timely manner.
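The form-submission route to deep web content mentioned above can be sketched with the Python standard library. The endpoint and form-field names below are hypothetical, and no network request is actually made; the parser runs on sample markup.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical search form: the endpoint and field names are placeholders,
# not taken from the paper.
def build_query(endpoint, term):
    """Encode a surface-web query form submission as a GET URL."""
    return endpoint + "?" + urlencode({"q": term, "type": "leak"})

class ResultLinkParser(HTMLParser):
    """Collect the hrefs of <a> tags in the returned results markup."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

url = build_query("https://example.org/search", "acme corp dump")
parser = ResultLinkParser()
parser.feed('<div><a href="/doc/1">hit</a><a href="/doc/2">hit</a></div>')
print(url)            # → https://example.org/search?q=acme+corp+dump&type=leak
print(parser.links)   # → ['/doc/1', '/doc/2']
```

A real system would then fetch each result link and scan the content for leaked organizational data.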

References
Article
The time window between the disclosure of a new cyber vulnerability and its use by cybercriminals has been getting smaller and smaller over time. Recent episodes, such as the Log4j vulnerability, exemplify this well. Within hours after the exploit was released, attackers started scanning the internet looking for vulnerable hosts on which to deploy threats like cryptocurrency miners and ransomware. Thus, it becomes imperative for a cybersecurity defense strategy to detect threats and their capabilities as early as possible to maximize the success of prevention actions. Although crucial, discovering new threats is a challenging activity for security analysts due to the immense volume of data and information sources that must be analyzed for signs that a threat is emerging. In this sense, we present a framework for automatic identification and profiling of emerging threats using Twitter messages as a source of events and MITRE ATT&CK as a source of knowledge for threat characterization. The framework comprises three main parts: identification of cyber threats and their names; profiling the identified threat in terms of its intentions or goals, by employing two machine learning layers to filter and classify tweets; and alarm generation based on the threat's risk. The main contribution of our work is the approach to characterizing or profiling identified threats in terms of their intentions or goals, providing additional context on the threat and avenues for mitigation. In our experiments the profiling stage reached an F1 score of 77% in correctly profiling discovered threats.
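The filter-then-classify idea can be illustrated with a deliberately simplified, keyword-based stand-in for the paper's two machine-learning layers. The keyword lists and intention labels below are invented for illustration, not drawn from MITRE ATT&CK.

```python
# Layer 1 filters messages that look cyber-related; layer 2 maps them to a
# coarse intention label. Both keyword sets are illustrative placeholders
# for the paper's trained classifiers.
CYBER_TERMS = {"exploit", "ransomware", "cve", "malware", "phishing"}
INTENT_TERMS = {
    "impact": {"ransomware", "wiper", "encrypt"},
    "initial-access": {"phishing", "exploit", "cve"},
}

def is_cyber(tweet):
    """Layer 1: keep only tweets mentioning a cyber term."""
    return bool(set(tweet.lower().split()) & CYBER_TERMS)

def profile(tweet):
    """Layer 2: label a cyber tweet with its most likely intention."""
    if not is_cyber(tweet):
        return None
    words = set(tweet.lower().split())
    # pick the intention with the largest keyword overlap
    best = max(INTENT_TERMS, key=lambda k: len(words & INTENT_TERMS[k]))
    return best if words & INTENT_TERMS[best] else "unknown"

print(profile("new ransomware spreading fast"))   # → impact
print(profile("lovely weather today"))            # → None
```

The real pipeline replaces both keyword tests with trained models, but the two-stage shape — cheap filter, then finer-grained profiler — is the same.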
Article
The goal of the Business Intelligence Data Extractor (BID-Extractor) tool is to offer high-quality, usable data that is freely available to the public. To assist companies across all industries in achieving their objectives, we prefer cutting-edge, business-focused web scraping solutions. The World Wide Web contains all kinds of information of different origins, including social, financial, security, and academic. Most people access information through the internet for educational purposes. Information on the web is available in different formats and through different access interfaces, so indexing or semantic processing of website data can be cumbersome. Web scraping (data extraction) is the technique that aims to address this issue: it transforms unstructured data on the web into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping techniques include traditional copy-and-paste, text capturing and regular-expression matching, HTTP programming, HTML parsing, DOM parsing, vertical aggregation platforms, semantic annotation recognition, and computer-vision webpage analyzers. Traditional copy-and-paste is the most basic and tiresome technique when large datasets must be scraped. Dedicated web scraping software is the easiest route, since all the other techniques require some form of technical expertise. Although many web scraping tools are available today, most are designed to serve one specific purpose, and businesses cannot easily make decisions using the resulting data. This research focused on building web scraping software using Python and NLP, converting unstructured data to structured data with NLP and training an NLP NER model. The study's findings provide a way to effectively gauge business impact.
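The unstructured-to-structured step can be sketched without the paper's trained NER model, using plain regular expressions. The field names and patterns below are illustrative only.

```python
import re

# The paper trains an NLP NER model; this rule-based sketch only
# illustrates turning free text into a structured record.
def extract_record(text):
    """Pull price, email, and URL fields out of free text into a dict."""
    patterns = {
        "price": r"\$\d+(?:\.\d{2})?",
        "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
        "url": r"https?://\S+",
    }
    return {field: (m.group(0) if (m := re.search(rx, text)) else None)
            for field, rx in patterns.items()}

row = extract_record("Contact sales@example.com, item costs $19.99, "
                     "see https://example.com/item")
print(row)
```

Each extracted dict can then be appended to a central database or spreadsheet, which is the structured-output goal the abstract describes.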
Article
Information extraction from e-commerce platforms is a challenging task. Due to the significant increase in the number of e-commerce marketplaces, it is difficult to achieve good accuracy using existing data mining techniques to systematically extract key information. The first step toward recognizing e-commerce entities is to design an application that detects entities in unstructured text, known as a Named Entity Recognition (NER) application. Previous NER solutions are designed to recognize entities such as people, locations, and organizations in raw text, but they are limited in the e-commerce domain. We propose a Bi-directional LSTM with CNN model for detecting e-commerce entities. The proposed model represents rich and complex knowledge about entities and groups of entities for products sold on the dark web. Different experiments were conducted to compare against state-of-the-art baselines. Our approach achieves the best accuracy on the Dark Web dataset and CoNLL-2003, reaching 96.20% and 92.90% respectively, which compares well with other cutting-edge approaches.
Conference Paper
Cyber Threat Intelligence (CTI) is information describing threat vectors, vulnerabilities, and attacks, and is often used as training data for AI-based cyber defense systems such as Cybersecurity Knowledge Graphs (CKG). There is a strong need to develop community-accessible datasets to train existing AI-based cybersecurity pipelines to efficiently and accurately extract meaningful insights from CTI. We have created an initial unstructured CTI corpus from a variety of open sources, which we are using to train and test cybersecurity entity models with the spaCy framework while exploring self-learning methods to automatically recognize cybersecurity entities. We also describe methods to link cybersecurity domain entities to existing world knowledge from Wikidata. Our future work will survey and test spaCy NLP tools and create methods for continuous integration of new information extracted from text.
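A gazetteer lookup is a crude stand-in for trained spaCy entity models, but it illustrates the recognize-then-link flow the abstract describes. The linked IDs below are placeholders, not real Wikidata QIDs.

```python
# Placeholder gazetteer: surface form -> (entity label, linked knowledge-base
# id). The IDs are invented stand-ins for Wikidata QIDs.
GAZETTEER = {
    "log4j": ("VULNERABILITY", "Q-log4shell-placeholder"),
    "emotet": ("MALWARE", "Q-emotet-placeholder"),
}

def tag_entities(text):
    """Return (surface form, label, linked id) for each known term."""
    found = []
    for token in text.lower().replace(",", " ").split():
        if token in GAZETTEER:
            label, qid = GAZETTEER[token]
            found.append((token, label, qid))
    return found

print(tag_entities("Emotet abused the Log4j flaw"))
```

A trained model generalizes beyond a fixed dictionary, but the output shape — recognized span, entity type, knowledge-base link — is the same.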
Article
The dark web is a section of the Internet that is not accessible to search engines and requires an anonymizing browser such as Tor. Its hidden network and anonymity pave the way for illegal activities and help cybercriminals execute well-planned, coordinated, and malicious cyberattacks. Cybersecurity experts agree that online criminal activities are increasing exponentially and becoming more rampant and intensified. These illegal cyber activities include various destructive crimes that may target a single person or a whole nation, for example data breaches, ransomware attacks, black markets, mafias, and terrorist attacks. Maintaining data privacy and secrecy is thus the new dilemma of the era. This paper extensively reviews various attacks and attack patterns commonly applied on the dark web and classifies them in our unique trilogies classification system. Furthermore, a detailed overview of existing threat detection techniques and their limitations is given for anonymity-providing services such as Tor, I2P, and Freenet. Finally, the paper identifies significant weaknesses that make the dark web vulnerable to different attacks.
Article
Network texts have become important carriers of cybersecurity information on the Internet. These texts cover the latest security events, such as vulnerability exploitations, attack discoveries, and advanced persistent threats. Extracting cybersecurity entities from these unstructured texts is a critical and fundamental task in many cybersecurity applications. However, most Named Entity Recognition (NER) models are suited only to general fields, and there has been little research focusing on cybersecurity entity extraction in the security domain. To this end, in this paper we propose a novel cybersecurity entity identification model based on Bidirectional Long Short-Term Memory with Conditional Random Fields (Bi-LSTM with CRF) to extract security-related concepts and entities from unstructured text. This model, which we have named XBiLSTM-CRF, consists of a word-embedding layer, a bidirectional LSTM layer, and a CRF layer, and concatenates the X input with the bidirectional LSTM output. Via extensive experiments on an open-source dataset containing office security bulletins, security blogs, and the Common Vulnerabilities and Exposures list, we demonstrate that XBiLSTM-CRF achieves better cybersecurity entity extraction than state-of-the-art models.
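The CRF layer's job, choosing the globally best tag sequence rather than the best tag per token, comes down to Viterbi decoding. A minimal sketch over toy scores (all tags, emission, and transition values below are invented):

```python
def viterbi(emissions, transitions, tags):
    """Decode the highest-scoring tag sequence.

    emissions: one {tag: score} dict per token (the Bi-LSTM's per-token
    scores in the real model); transitions: {(prev, cur): score}.
    """
    best = {t: emissions[0][t] for t in tags}   # best score ending in each tag
    back = []                                   # backpointers per step
    for em in emissions[1:]:
        nxt, ptr = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[p] + transitions[(p, cur)])
            nxt[cur] = best[prev] + transitions[(prev, cur)] + em[cur]
            ptr[cur] = prev
        best, back = nxt, back + [ptr]
    last = max(tags, key=best.get)
    path = [last]                               # follow backpointers home
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

tags = ["O", "B-VULN"]
trans = {("O", "O"): 0.5, ("O", "B-VULN"): 0.2,
         ("B-VULN", "O"): 0.4, ("B-VULN", "B-VULN"): 0.1}
ems = [{"O": 0.1, "B-VULN": 1.0}, {"O": 0.9, "B-VULN": 0.2}]
print(viterbi(ems, trans, tags))   # → ['B-VULN', 'O']
```

In XBiLSTM-CRF the emission scores come from the LSTM layer and the transition scores are learned, but the decoding step is this same dynamic program.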
Article
Web scraping, or web data extraction, is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis. In this paper, among other kinds of scraping, we focus on techniques that extract the content of a Web page. In particular, we apply scraping techniques in the Web e-commerce field. To this end, we propose a solution for data extraction that exploits Web scraping using Python and the Scrapy framework.
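The page-content extraction step can be shown without Scrapy using the standard-library HTML parser; the sample markup and class names below are hypothetical.

```python
from html.parser import HTMLParser

# Dependency-free sketch of the extraction step a Scrapy spider would
# perform in its parse() callback; the "product-name" class is invented
# sample data, not a real site's markup.
class ProductParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.products = []
        self._grab = False
    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "product-name") in attrs:
            self._grab = True          # next text node is a product name
    def handle_data(self, data):
        if self._grab:
            self.products.append(data.strip())
            self._grab = False

html = ('<div><span class="product-name">USB drive</span>'
        '<span class="price">$8</span></div>')
p = ProductParser()
p.feed(html)
print(p.products)   # → ['USB drive']
```

Scrapy adds scheduling, retries, and pipelines on top, but the core of any spider is this kind of selector logic over fetched markup.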
Article
Information Retrieval deals with searching and retrieving information within documents, and it also searches online databases and the internet. A web crawler is a program or piece of software that traverses the Web and downloads web documents in a methodical, automated manner. Based on the type of knowledge used, web crawling is usually divided into three techniques: general-purpose crawling, focused crawling, and distributed crawling. In this paper, the applicability of web crawlers in the field of web search is discussed, along with a review of web crawlers applied to different problem domains in web search.
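The difference between general-purpose and focused crawling can be sketched over a toy in-memory link graph; a real crawler would fetch pages over HTTP and judge relevance with a trained classifier rather than a fixed set.

```python
from collections import deque

# Toy link graph standing in for the Web.
GRAPH = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": [], "d": [],
}
TOPIC = {"seed", "a", "c"}   # pages judged on-topic (invented)

def crawl(start, focused=False):
    """Breadth-first traversal; a focused crawl only expands on-topic pages."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        page = queue.popleft()
        if focused and page not in TOPIC:
            continue                 # prune off-topic branches
        order.append(page)
        for link in GRAPH[page]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(crawl("seed"))                # → ['seed', 'a', 'b', 'c', 'd']
print(crawl("seed", focused=True))  # → ['seed', 'a', 'c']
```

Distributed crawling, the third technique, partitions this same frontier across machines rather than changing the traversal logic.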
Article
This paper describes algorithms that rerank the top N hypotheses from a maximum-entropy tagger, the application being the recovery of named-entity boundaries in a corpus of web data. The first approach uses a boosting algorithm for ranking problems; the second uses the voted perceptron algorithm.
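A reranker of this kind rescores the tagger's top-N hypotheses with a learned linear model. This sketch uses invented features and weights in place of the paper's boosting or voted-perceptron training.

```python
# Invented feature weights standing in for a trained reranking model.
WEIGHTS = {"base_score": 1.0, "capitalised": 0.5, "in_gazetteer": 1.5}

def rerank(hypotheses):
    """hypotheses: list of (text, feature dict); return them best-first."""
    def score(hyp):
        _, feats = hyp
        return sum(WEIGHTS[f] * v for f, v in feats.items())
    return sorted(hypotheses, key=score, reverse=True)

# Two candidate entity boundaries for the same span; the base tagger
# prefers "New", but global features favor "New York".
hyps = [
    ("New York", {"base_score": 0.6, "capitalised": 1, "in_gazetteer": 1}),
    ("New",      {"base_score": 0.9, "capitalised": 1}),
]
print(rerank(hyps)[0][0])   # → New York
```

The point of reranking is exactly this: features too global for the base tagger can still overturn its first choice.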
Analyzing The Bittorrent Ecosystem Of Central Asia
  • D. Azhigulov
  • S. Kairatova
  • I. A. Ukaegbu
  • M. Myrzalieva
Dias Azhigulov, S. Kairatova, Ikechi Augustine Ukaegbu, and Madina Myrzalieva, "Analyzing The Bittorrent Ecosystem Of Central Asia," Aug. 2018, doi: https://doi.org/10.1109/coconet.2018.8476894.
Feasibility of Proof of Authority as a Consensus Protocol Model
  • S Joshi
S. Joshi, "Feasibility of Proof of Authority as a Consensus Protocol Model," arXiv.org, Aug. 30, 2021. https://arxiv.org/abs/2109.02480 (accessed Nov. 10, 2023).
Recognizing and Extracting Cybersecurity-relevant Entities from Text
  • C Hanks
  • M Maiden
  • P Ranade
  • T Finin
  • A Joshi
C. Hanks, M. Maiden, P. Ranade, T. Finin, and A. Joshi, "Recognizing and Extracting Cybersecurity-relevant Entities from Text," Aug. 2022, doi: https://doi.org/10.48550/arxiv.2208.01693.