Conference Paper

Predicting Impending Exposure to Malicious Content from User Behavior


Abstract

Many computer-security defenses are reactive: they operate only when security incidents take place, or immediately thereafter. Recent efforts have attempted to predict security incidents before they occur, to enable defenders to proactively protect their devices and networks. These efforts have primarily focused on long-term predictions. We propose a system that enables proactive defenses at the level of a single browsing session. By observing a user's behavior, it can predict whether the user will be exposed to malicious content on the web seconds before the moment of exposure, thus opening a window of opportunity for proactive defenses. We evaluate our system using three months' worth of HTTP traffic generated by 20,645 users of a large cellular provider in 2017 and show that it can be helpful, even when only very low false-positive rates are acceptable, and despite the difficulty of making "on-the-fly" predictions. We also engage directly with the users through surveys asking demographic and security-related questions, to evaluate the utility of self-reported data for predicting exposure to malicious content. We find that self-reported data can help forecast exposure risk over long periods of time. However, even over the long term, self-reported data is not as crucial as behavioral measurements for accurately predicting exposure.


... Among such factors, it is known that specific user profiles are more prone to the risk of encountering malware [5], [6], [3], [7], [8], [9]. Although recent works underlined that risk is profile-dependent and that a "one-size-fits-all" solution is inappropriate [10], [8], profiling has only been applied to the potential victims, ignoring that the same 'one size' may not fit threats either. ...
... Other research focuses on risk prediction on mobile devices [8], [9]. Dambra et al. [8] focus on the risk of encountering malware and PUAs across different user profiles, considering volume, diversity, and geographic features of Android applications. ...
... Dambra et al. [8] focus on the risk of encountering malware and PUAs across different user profiles, considering volume, diversity, and geographic features of Android applications. Sharif et al. [9] developed a system to predict, ahead of time and based only on HTTP traffic, whether user behavior may lead to exposure to malicious content. ...
Article
Full-text available
The behavior of enterprise users (e.g. browsing at night or visiting gambling sites) is a potential factor that might increase the chances of malware encounters (e.g. coinminers vs ransomware) in the field. We report a case-control study on telemetry data collected by Trend Micro, a global cybersecurity vendor, to identify users’ behavioral characteristics that can be used to differentiate cybersecurity risk profiles. Our results show that different types of ‘patients zero’ are vulnerable to different types of epidemics. The odds ratio of encountering malware such as PUAs, trojans, and hacktools is higher for a variety of network and system behaviors (e.g. the number, types, and diversity of visited web sites, visits to gambling sites, etc.) but is not significant for other factors such as browsing at night. Other types of malware, such as coinminers, show an increased odds ratio only for a few factors (e.g. gambling web sites). We also present a specific methodology tailored for investigating self-propagating malware such as ransomware, in which one is infected by one’s neighbor. With this approach, we observed a more accurate characterization of the odds of encountering ransomware based on system-based behaviors than with a standard case-control study setup. Experiments with different vendors may be needed to generalize the results and offset potential bias due to differences in market share.
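The case-control comparisons above boil down to an odds-ratio computation over a 2x2 contingency table. A minimal sketch, using invented counts purely for illustration (not the study's data):

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio for a 2x2 case-control table.

    cases    = users who encountered malware
    controls = users who did not
    'exposed' = users exhibiting the behavior under study
    (e.g. visiting gambling sites). An odds ratio > 1 means the
    behavior is more common among cases than controls.
    """
    return (exposed_cases / unexposed_cases) / (exposed_controls / unexposed_controls)

# Hypothetical counts: 30 of 100 malware-encountering users visited
# gambling sites, vs 10 of 100 users with no malware encounters.
or_gambling = odds_ratio(30, 70, 10, 90)
print(round(or_gambling, 2))  # 3.86
```

An odds ratio near 1 (as the paper reports for night-time browsing and most malware families) indicates the behavior does not usefully separate the two groups.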
... VT aggregates scanning results (i.e., a detection result and the category of the malicious entity, such as phishing or malware-hosting) from up to 95 detection scanners and provides the security research community (academia and industry) with the aggregated results. These results are heavily utilized to label malicious entities and predict new security threats [10,11,24,33,44,46,49]. ...
... VT as Ground Truth. VT has been used to build ground truth in various domains, including the detection of malware files [21,23,55] and IPs/URLs [24,28,33,44,49]. In doing so, the most common approach is an unweighted threshold-based method employing a heuristically chosen number of scanners by which the entity is marked as malicious. ...
... In doing so, the most common approach is an unweighted threshold-based method employing a heuristically chosen number of scanners by which the entity is marked as malicious. While there is no consensus on such a number [54,56], surprisingly, small thresholds such as 1 or 2 have been widely used in the literature [28,44,49,54,56]. Only a few papers set aggressive thresholds [21,55]. ...
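The unweighted threshold rule discussed above is simple to state in code. A sketch, with invented scanner names, assuming each scanner's verdict is a boolean:

```python
def label_by_threshold(scan_results, threshold=2):
    """Mark an entity malicious if at least `threshold` scanners flag it.

    scan_results: dict mapping scanner name -> bool (True = flagged malicious).
    This is the unweighted rule: every scanner counts equally, regardless
    of its accuracy or its correlation with other scanners.
    """
    positives = sum(scan_results.values())
    return positives >= threshold

report = {"ScannerA": True, "ScannerB": False, "ScannerC": True}  # hypothetical
assert label_by_threshold(report, threshold=1) is True
assert label_by_threshold(report, threshold=3) is False
```

The choice of `threshold` is exactly the heuristic the literature disagrees on: a value of 1 maximizes recall at the cost of false positives, while higher values trade the other way.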
Preprint
VirusTotal (VT) is a widely used scanning service for researchers and practitioners to label malicious entities and predict new security threats. Unfortunately, little is known to end-users about how VT URL scanners decide on the maliciousness of entities and the attack types they are involved in (e.g., phishing or malware-hosting websites). In this paper, we conduct a systematic comparative study of VT URL scanners' behavior for different attack types of malicious URLs, in terms of 1) detection specialties, 2) stability, 3) correlations between scanners, and 4) lead/lag behaviors. Our findings highlight that VT scanners commonly disagree with each other on their detection and attack type classification, leading to challenges in ascertaining the maliciousness of a URL and taking prompt mitigation actions according to different attack types. This motivates us to present a new, highly accurate classifier that helps correctly identify the attack types of malicious URLs at an early stage. This in turn assists practitioners in performing better threat aggregation and choosing proper mitigation actions for different attack types.
... Indeed, some studies have found a link between victimization and online activities. Online conduct (such as targeted information searching on the Internet and emailing) and exposure (i.e., being online longer, e.g., untargeted surfing, watching videos, and using social media) appear to be positively related to victimization (Bergmann et al. 2018; Guerra and Ingram 2020; Holt et al. 2020; Reyns, Randa, and Henson 2016; Sharif et al. 2018; van Wilsem 2013b; Williams 2016). For example, the fact that young people are more likely to become victims may simply be partly explained by the fact that they are online more often than older people (Büchi, Just, and Latzer 2016; Ngo et al. 2020). ...
... All these variables were significantly related to the number of malware (attempts) on the devices. Sharif et al. (2018) analyzed the logged online behavior of over 20,000 participants for three months and compared participants who were exposed to malware or phishing URLs in that timeframe with participants who had not been. Exposed participants were online more and browsed the internet more frequently at night. ...
... Similarly, most routine activities do not seem to contribute to cybercrime victimization either. While this is contrary to some previous findings (Bergmann et al. 2018; Guerra and Ingram 2020; Holt et al. 2020; Reyns, Randa, and Henson 2016; Sharif et al. 2018; van Wilsem 2013b; Williams 2016), it is in line with other previous studies (Bossler and Holt 2009; Holt and Bossler 2013; Jansen et al. 2013; Leukfeldt 2014; Mesch and Dodel 2018; Paulissen and van Wilsem 2015). Moreover, in line with previous studies (Leukfeldt and Yar 2016; van Wilsem 2013b), routine activities that do seem significant appear only to be related to a certain type of cybercrime victimization. ...
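Behavioral signals of the kind compared above (time online, night-time browsing) reduce to simple aggregates over timestamped HTTP records. A toy sketch; the feature names are illustrative, not the actual feature set of any cited study:

```python
from datetime import datetime

def browsing_features(timestamps):
    """Toy aggregates over a user's HTTP request timestamps.

    Returns the total request count and the fraction of requests issued
    at night (00:00-05:59), two examples of the behavioral measurements
    such studies compare between exposed and unexposed users.
    """
    total = len(timestamps)
    night = sum(1 for ts in timestamps if ts.hour < 6)
    return {"total_requests": total,
            "night_fraction": night / total if total else 0.0}

# Hypothetical request log for one user
logs = [datetime(2017, 3, 1, 2, 15), datetime(2017, 3, 1, 14, 0),
        datetime(2017, 3, 2, 3, 30), datetime(2017, 3, 2, 20, 45)]
feats = browsing_features(logs)
print(feats["night_fraction"])  # 0.5
```

In a study setup, such per-user aggregates would be computed for both exposed and unexposed groups and then compared statistically or fed to a classifier.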
Article
With the increasing prevalence of cybercrime victimization there is a growing need for prevention. Previous studies have attempted to uncover risk factors associated with cybercrime victimization in the areas of personal characteristics and online routine activities. This article aims to take the field a step further by including actual self-protective online behavior, obtained through a population-based survey experiment (N = 1886), as a risk factor for cybercrime victimization. In wave 1 of our longitudinal design, personal characteristics, online routine activities, and actual self-protective online behavior concerning password strength, clicking behavior, sharing personal information, and handling phishing emails were measured. In wave 2, cybercrime victimization of several types of cyber-enabled and cyber-dependent cybercrimes was measured one year later. Results indicate that few personal characteristics, online routine activities, and self-protective online behaviors are related to the odds of becoming a cybercrime victim. This furthermore illustrates the heterogeneity of cybercrime victimization, since most significant factors only seem to be related to the risk of one particular type of cybercrime. These results indicate that to explain cybercrime victimization, the research field needs to shift its focus and adapt to new online developments.
... Few studies have focused on the characteristics of browsing patterns from mobile users. In [5], the HTTP logs of mobile devices were analyzed and used to provide a statistical analysis of the malicious websites visited by users, and a predictive model was developed to determine the probability of a user being exposed to malicious content within a given period. However, web-browsing activities from mobile devices are interwoven with other activities, e.g., receiving an SMS message or installing an application, and such activities generate logs that can be analyzed to improve the security of mobile users. ...
... User behaviors, including their knowledge about mobile security, have also been investigated [5], [8], [9], [25]- [29]. Some of these studies included participants who consented to install a sensor for monitoring purposes. ...
... In addition, surveys can suffer from insufficient sample sizes or low response rates [5], [26]. A seminal work on user behavior prediction was proposed by Canali et al. [9], where the web histories of 100,000 users obtained from data provided by a major antivirus software were analyzed. ...
Article
Full-text available
Using mobile devices to browse the Internet has become increasingly popular over the years. However, the risk of being exposed to malicious content, such as online scams or malware installations, has also increased significantly. In this study, we collected smartphone data from volunteer users by monitoring their use of the Web and the applications they installed on their devices. However, the collected data is sometimes incomplete due to the technical limitations of mobile devices. Thus, we propose a data repair scheme to restore incomplete data by inferring missing attributes. Here, the restored data represent the browsing history of a mobile user, which can be used to determine if and how the user has been the victim of web- or mobile-specific attacks to compromise their sensitive data. The accuracy of the proposed data repair scheme was evaluated using a machine learning algorithm, and the results demonstrate that the proposed scheme properly reconstructed a user’s browsing history data with an accuracy of 95%. The usability of the repaired data is demonstrated by a practical use case. The user’s browsing history was correlated with other types of data, such as received SMSs and the applications installed by the user. The results demonstrate that a user can fall victim to SMS-based phishing (SMShing) attacks, where the attacker sends an SMS message to trick the user into installing a malicious application. We also present a case of a social engineering attack, where the victim was manipulated into providing their Amazon credentials and credit card details.
... However, there is no consensus among the community on the appropriate threshold. For example, prior work using VirusTotal has chosen thresholds of 1 [2], [3], [4], 2 [5], [6], and 5 [7]. Setting a too-high threshold would miss a lot of malicious cases, while a too-low threshold would introduce many false positives. ...
... Aggregating Security Intelligence. The most popular approach for aggregating scan reports is the unweighted threshold strategy, which marks the entity as malicious if the number of positive labels exceeds a heuristically chosen threshold [6], [2], [4], [41], [42], [43]. The thresholds are chosen arbitrarily and vary from 1 [2], [3], [4] and 2 [5], [6] to 5 [7]. ...
... The most popular approach for aggregating scan reports is the unweighted threshold strategy, which marks the entity as malicious if the number of positive labels exceeds a heuristically chosen threshold [6], [2], [4], [41], [42], [43]. The thresholds are chosen arbitrarily and vary from 1 [2], [3], [4] and 2 [5], [6] to 5 [7]. However, this approach is limited, as it ignores the differing quality of sources, including coverage and accuracy [19], [44]. ...
Conference Paper
High-quality intelligence on Internet threats (e.g., malware files, malicious domains, phishing URLs, and malicious IPs) is important for both security practitioners and the research community. Given the agility of attackers, the scale of the Internet, and the fast-evolving landscape of threats, one cannot rely solely on a single source (such as an anti-malware engine or an IP blacklist) for obtaining accurate, up-to-date, and comprehensive threat analysis. Instead, we need to aggregate the analysis from multiple sources. However, it is non-trivial to do such aggregation effectively. A common practice is to label an indicator (malware, domains, URLs, etc.) as malicious if it is marked by a number of sources above an ad-hoc threshold. Often, this results in sub-optimal performance, as it assumes that all sources are of similar quality/expertise, independent, and temporally stable, which unfortunately is often not true in practice. A natural alternative is to train a supervised machine learning model. However, this approach needs a sufficiently large amount of manually labeled ground truth, which is time-consuming to collect and has to be updated frequently, resulting in substantial recurring costs. In this paper, we propose SIRAJ, a novel framework for aggregating the detection output of various intelligence sources such as anti-malware engines. SIRAJ is based on the pre-train and fine-tune paradigm. Specifically, we use self-supervised learning-based approaches to learn a pre-trained embedding model that converts multi-source inputs into a high-dimensional embedding. The embeddings are learned through three carefully designed pretext tasks that imbue them with knowledge about dependencies between scanners and their temporal dynamics. The learned embeddings can be used for diverse downstream machine learning tasks. SIRAJ is designed to be general and can be used for diverse domains such as URLs, malware, and IPs.
Further, SIRAJ works well even when there is limited to no labeled data available. Through extensive experiments, we show that our learned representations can produce results comparable to supervised methods while only requiring as little as 100 labeled samples. Importantly, the results show that SIRAJ accurately detects threat indicators much earlier than the baseline algorithms, a feat that is critical against short-lived indicators like Phishing URLs.
Preprint
VirusTotal (VT) provides aggregated threat intelligence on various entities including URLs, IP addresses, and binaries. It is widely used by researchers and practitioners to collect ground truth and evaluate the maliciousness of entities. In this work, we provide a comprehensive analysis of VT URL scanning reports containing the results of 95 scanners for 1.577 billion URLs over two years. Individual VT scanners are known to be noisy in terms of their detection and attack type classification. To obtain high-quality ground truth for URLs and actively take proper actions to mitigate different types of attacks, there are two challenges: (1) how to decide whether a given URL is malicious given noisy reports and (2) how to determine the attack types (e.g., phishing or malware hosting) that the URL is involved in, given conflicting attack labels from different scanners. In this work, we provide a systematic comparative study of the behavior of VT scanners for different attack types of URLs. A common practice to decide maliciousness is to use a cut-off threshold on the number of scanners that report the URL as malicious. However, in this work, we show that using a fixed threshold is suboptimal for several reasons: (1) correlations between scanners; (2) lead/lag behavior; (3) the specialty of scanners; (4) the quality and reliability of scanners. A common practice to determine an attack type is to use majority voting. However, we show that majority voting cannot accurately classify the attack type of a URL due to the bias from correlated scanners. Instead, we propose a machine learning-based approach to assign an attack type to URLs given the VT reports.
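The majority-vote baseline criticized above fits in a few lines. A sketch with hypothetical labels; note how correlated scanners each still get a full vote, which is exactly the bias the abstract points out:

```python
from collections import Counter

def majority_attack_type(labels):
    """Assign an attack type by simple majority over scanner labels.

    labels: list of attack-type strings, one per scanner that flagged
    the URL. Ties are broken by first occurrence (Counter ordering).
    """
    counts = Counter(labels)
    attack_type, _ = counts.most_common(1)[0]
    return attack_type

# Three correlated scanners echo "malware"; two independent ones say "phishing".
votes = ["malware", "malware", "malware", "phishing", "phishing"]
assert majority_attack_type(votes) == "malware"
```

If the three "malware" votes come from scanners that share a backend, the majority reflects one opinion counted three times rather than three independent judgments, which is why the paper argues for a learned aggregator instead.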
... During 2015, the FBI received over 280,000 internet crime complaints, which led to a total loss of over one billion dollars [1]. The statistical office of the European Union further suggested that over 30% of internet users across Europe were infected by malware during 2010 [2]. Similar trends were identified by other recent studies [3,4]. ...
... First, some researchers examined the differences between users regarding their browsing habits and the corresponding needs for personalized protection [1,[4][5][6][7][8][9][10][11]. Second, several systems addressed the benefit of having a proactive or predictive approach rather than a reactive one; taking action only once the attack has occurred can lower the effectiveness of blocking the malware [3,4,8,[12][13][14][15]. Third, some projects considered the need to maintain a low level of privacy invasion while supplying means of protection to the users [8]. ...
... For a security system to be proactive, it needs to predict the level of risk or forecast an upcoming malware attack. The subject of predicting the risk of infection has been addressed in several studies [3,4,8,[12][13][14][15]. Bilge et al. [13] achieved excellent results in predicting infections by analyzing the binaries installed on users' computers and identifying ones that are more likely to be attacked. ...
Article
Full-text available
The internet is flooded with malicious content that can come in various forms and lead to information theft and monetary losses. From the ISP to the browser itself, many security systems act to defend the user from such content. However, most systems have at least one of three major limitations: 1) they are not personalized and do not account for the differences between users, 2) their defense mechanism is reactive and unable to predict upcoming attacks, and 3) they extensively track and use the user’s activity, thereby invading her privacy in the process. We developed a methodological framework to predict future exposure to malicious content. Our framework accounts for three factors: the user’s previous exposure history, her co-similarity to other users based on their previous exposures in a conceptual network, and how the network evolves. Utilizing over 20,000 users’ browsing data, our approach achieves accurate results on the infection-prone portion of the population, surpassing common methods, and does so with as little as 1/1000 of the personal information they require.
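The co-similarity component described above can be illustrated with a cosine similarity over per-user exposure vectors. A minimal sketch; the category breakdown and the vectors are invented for illustration and are not the paper's actual representation:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two users' exposure-count vectors
    (one count per hypothetical website category)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical per-category exposure counts: [gambling, news, streaming]
alice = [3, 0, 1]
bob   = [6, 0, 2]   # same browsing profile as alice, scaled up
carol = [0, 5, 0]   # entirely different profile

assert abs(cosine_similarity(alice, bob) - 1.0) < 1e-9
assert cosine_similarity(alice, carol) == 0.0
```

Under a similarity-based scheme, a user highly similar to previously exposed users would be assigned a higher predicted exposure risk, even before any exposure of their own.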
... This enables more effective detection and prevention of repetitive misbehavior that increases security vulnerabilities (Espinha Gasiba et al., 2021; Tan et al., 2020). More specifically, AI-based approaches can help identify at-risk groups based on misbehavior for targeted interventions (Ansari, 2022; Nguyen et al., 2024) or predict phishing susceptibility (Al-Mashhour & Alhogail, 2023; Sharif et al., 2018; Zhang et al., 2022). ...
... Prediction of Susceptibility to Deception: AI predicts users' susceptibility to cyber threats based on behavioral factors (Al-Mashhour & Alhogail, 2023;Sharif et al., 2018;Zhang et al., 2022). ...
Conference Paper
Full-text available
The use of cybersecurity tools powered by artificial intelligence (AI) continues to gain traction in the financial services industry. On the one hand, they can strengthen an organization's technical cybersecurity posture. On the other hand, even if cybercriminals also leverage AI to exploit human weaknesses, there are early indications that AI can help equip the workforce against evolving threats. Based on a structured literature review (SLR) and a Delphi study, this article identifies the most promising end-user-focused use cases in which AI can assist financial institutions in combating cybersecurity threats and gearing their workforce up to thwart cyberattacks. For information security executives and researchers alike, this study provides a first set of general directions on which AI-powered and user-centric tools and solutions to focus on in the near future.
... This is partly because these studies are usually based on self-reported online behavior. What people claim to do online, however, does not always correspond to what they actually do online (Parry et al. 2021; Wilcockson et al. 2018; Sharif et al. 2018; Van 't Hoff-de Goede et al., 2019). Studies based on self-reported behavior often find that groups who, on average, report behaving safely online also less often report having become victims of cybercrime (Bergmann et al. 2018; Chen et al. 2017). ...
... These examined, among other things, online activities recorded (logged) by ICT systems, such as the actual time a person spends online, the number of websites visited, and the types of files downloaded. Results suggest that frequent downloads and visiting a variety of websites and networks increase the risk of falling victim to malware and phishing (Lévesque et al. 2018; Sharif et al. 2018; Ovelgönne et al. 2017). However, a longitudinal study that measured the actual use of strong passwords, downloading of unsafe software, and sharing of personal information online found no relationship with the risk of cybercrime victimization in the following year (Van 't Hoff-de Goede, 2023b). ...
... Essentially, it allows convenient querying of suspicious URLs across multiple vendors. While this offers simplicity, prior works have noted that vendor labels do not always agree, and that VT-based defensive mechanisms are not sufficient [46][47][48][49]. As such, we carefully distill the labels within our dataset and reinforce them with results from Google Safe Browsing (GSB) [50]. ...
... We cannot rely on original vendor labels, as there is contention as to the correct threshold to apply. Prior works [26, 36, 38, 51-54] have set their threshold for maliciousness to 1, with 2 being used in [11, 46, 55], and even 5 in [56]. Further, there is an opportunity to compare existing labels against present-day data. ...
Preprint
Full-text available
The daily deluge of alerts is a sombre reality for Security Operations Centre (SOC) personnel worldwide. They are at the forefront of an organisation's cybersecurity infrastructure, and face the unenviable task of prioritising threats amongst a flood of abstruse alerts triggered by their Security Information and Event Management (SIEM) systems. URLs found within malicious communications form the bulk of such alerts, and pinpointing pertinent patterns within them allows teams to rapidly deescalate potential or extant threats. This need for vigilance has been traditionally filled with machine-learning based log analysis tools and anomaly detection concepts. To sidestep machine learning approaches, we instead propose to analyse suspicious URLs from SIEM alerts via the perspective of malicious URL campaigns. By first grouping URLs within 311M records gathered from VirusTotal into 2.6M suspicious clusters, we thereafter discovered 77.8K malicious campaigns. Corroborating our suspicions, we found 9.9M unique URLs attributable to 18.3K multi-URL campaigns, and that worryingly, only 2.97% of campaigns were found by security vendors. We also confer insights on evasive tactics such as ever lengthier URLs and more diverse domain names, with selected case studies exposing other adversarial techniques. By characterising the concerted campaigns driving these URL alerts, we hope to inform SOC teams of current threat trends, and thus arm them with better threat intelligence.
... Previous work has measured how often people update their computer systems [19,34,90], what security settings they use on their computers [19,34], whether they are infected with malware [19,34], and the presence of third-party applications [90]. Prior work has also measured how often people click on unsafe links [19,82], their private-browsing [40] and password-reuse habits [32,71,90], and whether they install security-enhancing extensions [90]. We focus on security behaviors related to web usage, as we are specifically studying the use of web-based media in spreading information. ...
... Previous work that extracted security behaviors from real data has collected data in multiple ways. One set of researchers partnered with an internet service provider that recorded all HTTP traffic of consenting participants [82]. Others asked study participants to install a tool that collected their system logs and information about the passwords they entered on web pages [90]. ...
... Nonetheless, the platform has been extensively used by researchers to either label existing data or collect datasets for training and evaluating algorithms. While past work has focused on both files [27-36] and suspicious IP addresses and URLs [37-48], the work herein aims to cluster and understand the dynamics of submitted URLs within the VirusTotal platform. That is, we are primarily concerned with characterising the URLs themselves via the metadata available for each submission, as opposed to the raw content that they point to. ...
... That is, we are primarily concerned with characterising the URLs themselves via the metadata available for each submission, as opposed to the raw content that they point to. Some works [37-39, 42, 43, 45, 48] have cast wider nets by setting the threshold to 1, while others have been more conservative, setting it to 2 or 3 [40, 44, 46], or even 5 [16]. In contrast, the work herein is not directly concerned with labelling particular URLs as benign or malicious, and does not set arbitrary thresholds to filter submissions. ...
Preprint
Full-text available
URLs are central to a myriad of cyber-security threats, from phishing to the distribution of malware. Their inherent ease of use and familiarity is continuously abused by attackers to evade defences and deceive end-users. Seemingly dissimilar URLs are being used in an organized way to perform phishing attacks and distribute malware. We refer to such behaviours as campaigns, with the hypothesis being that attacks are often coordinated to maximize success rates and develop evasion tactics. The aim is to gain better insights into campaigns, bolster our grasp of their characteristics, and thus aid the community devise more robust solutions. To this end, we performed extensive research and analysis into 311M records containing 77M unique real-world URLs that were submitted to VirusTotal from Dec 2019 to Jan 2020. From this dataset, 2.6M suspicious campaigns were identified based on their attached metadata, of which 77,810 were doubly verified as malicious. Using the 38.1M records and 9.9M URLs within these malicious campaigns, we provide varied insights such as their targeted victim brands as well as URL sizes and heterogeneity. Some surprising findings were observed, such as detection rates falling to just 13.27% for campaigns that employ more than 100 unique URLs. The paper concludes with several case-studies that illustrate the common malicious techniques employed by attackers to imperil users and circumvent defences.
... Other studies have focused on user behavior [33][34][35][36][37]. A study analyzes traffic on a mobile cellular network to predict whether a user will visit a malicious URL within a month based on past browsing activities and a questionnaire [34]. ...
... Other studies have focused on user behavior [33][34][35][36][37]. A study analyzes traffic on a mobile cellular network to predict whether a user will visit a malicious URL within a month based on past browsing activities and a questionnaire [34]. The study also predicts whether a user would access a malicious URL within a session based on information from past records in the same session. ...
Conference Paper
Full-text available
Web access exposes users to various attacks, such as malware infections and social engineering attacks. Despite ongoing efforts by security and browser vendors to protect users, some users continue to access malicious URLs. To provide better protection, we need to know how users reach such URLs. In this work, we collect users' web access records via our browser extension. Unlike data collection on the network, user-side data collection enables us to discern users and web browser tabs, facilitating efficient data analysis. We then propose a scheme to extract an entire web access path to a malicious URL, called a hazardous path, from the access records. With all the hazardous paths extracted from the access records, we analyze the web access activities of users, considering initial accesses on the hazardous paths, risk levels of bookmarked URLs, time required to reach malicious URLs, and the number of concurrently active browser tabs when reaching such URLs. In addition, we propose a preemptive domain filtering scheme, which identifies domains leading to malicious URLs, called hazardous domains. We demonstrate the effectiveness of the scheme by identifying hazardous domains that are not included in blacklists.
... Sharif et al. [28] proposed a system for proactive identification of malicious content on the web. The model employed machine learning classifiers to analyze web browsing behavior in HTTP data generated over three months by more than twenty thousand users. ...
Preprint
Full-text available
With the vigorous development of cloud computing, most organizations have shifted their data and applications to the cloud environment for storage, computation, and sharing purposes. During storage and data sharing across the participating entities, a malicious agent may gain access to outsourced data from the cloud environment. A malicious agent is an entity that deliberately breaches the data. This information accessed might be misused or revealed to unauthorized parties. Therefore, data protection and prediction of malicious agents have become a demanding task that needs to be addressed appropriately. To deal with this crucial and challenging issue, this paper presents a Malicious Agent Identification-based Data Security (MAIDS) Model which utilizes XGBoost machine learning classification algorithm for securing data allocation and communication among different participating entities in the cloud system. The proposed model explores and computes intended multiple security parameters associated with online data communication or transactions. Correspondingly, a security-focused knowledge database is produced for developing the XGBoost Classifier-based Malicious Agent Prediction (XC-MAP) unit. Unlike the existing approaches, which only identify malicious agents after data leaks, MAIDS proactively identifies malicious agents by examining their eligibility for respective data access. In this way, the model provides a comprehensive solution to safeguard crucial data from both intentional and non-intentional breaches, by granting data to authorized agents only by evaluating the agents behavior and predicting the malicious agent before granting data.
... Sharif et al. [28] proposed a system for proactively identifying malicious content on the web. The model employed machine learning classifiers to analyze three months of HTTP browsing data generated by over twenty thousand users. ...
Article
Full-text available
With the vigorous development of cloud computing, most organizations have shifted their data and applications to the cloud environment for storage, computation, and sharing purposes. During storage and data sharing across the participating entities, a malicious agent may gain access to outsourced data from the cloud environment. A malicious agent is an entity that deliberately breaches the data. The accessed information might be misused or revealed to unauthorized parties. Therefore, data protection and prediction of malicious agents have become a demanding task that needs to be addressed appropriately. To deal with this crucial and challenging issue, this paper presents a Malicious Agent Identification-based Data Security (MAIDS) Model, which utilizes the XGBoost machine learning classification algorithm for securing data allocation and communication among different participating entities in the cloud system. The proposed model explores and computes multiple security parameters associated with online data communication or transactions. Correspondingly, a security-focused knowledge database is produced for developing the XGBoost Classifier-based Malicious Agent Prediction (XC-MAP) unit. Unlike existing approaches, which only identify malicious agents after data leaks, MAIDS proactively identifies malicious agents by examining their eligibility for the respective data access. In this way, the model provides a comprehensive solution to safeguard crucial data from both intentional and non-intentional breaches, granting data to authorized agents only, by evaluating the agent's behavior and predicting malicious agents before granting data. The performance of the proposed model is thoroughly evaluated through extensive experiments, and the results show that the MAIDS model predicts malicious agents with accuracy, precision, recall, and F1-scores of up to 95.55%, 95.30%, 95.50%, and 95.20%, respectively. This enhances the system's security, improving authorized data access accuracy by up to 55.49%, precision by up to 43.15%, recall by up to 55.49%, and F1-score by up to 39.96% compared to state-of-the-art work.
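The proactive gating idea above (check an agent before granting data, rather than reacting after a leak) can be sketched as below. The security parameters and the stand-in linear scoring rule are hypothetical; MAIDS itself uses a trained XGBoost classifier in this role:

```python
# Sketch of access gating: grant data only when a classifier predicts the
# requesting agent is benign. Parameter names and the stand-in scoring
# rule are hypothetical; the paper uses a trained XGBoost model here.

def predict_malicious(agent, threshold=0.5):
    """Stand-in for the XC-MAP classifier: score illustrative risk parameters."""
    score = (0.4 * agent["failed_logins_rate"]
             + 0.4 * agent["offhours_access_rate"]
             + 0.2 * agent["prior_violations_rate"])
    return score >= threshold

def grant_access(agent):
    """Proactive eligibility check performed before data allocation."""
    return not predict_malicious(agent)
```

The design point is the ordering: the prediction gates the grant, so a misbehaving agent is refused before any data leaves the cloud store.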
... Sharif et al. [29] devised an approach for proactively identifying malicious payloads. Hypertext Transfer Protocol (HTTP) data generated on the web by 20k users over a three-month period was analyzed using machine learning classification algorithms. ...
Preprint
Full-text available
Cloud computing is flourishing at a rapid pace. Significant data-security consequences arise when a malicious user gains unauthorized access to sensitive data, which may then be misused. This makes data security and proactive malicious-user prediction a crucial issue to tackle. This article proposes a Federated learning driven Malicious User Prediction Model for Secure Data Distribution in Cloud Environments (FedMUP). The approach first analyses user behavior to acquire multiple security risk parameters. It then employs a federated-learning-driven malicious user prediction approach to proactively reveal suspicious users. FedMUP trains local models on each user's local dataset and transfers computed values rather than actual raw data, obtaining an updated global model by averaging the various local versions. This updated model is shared repeatedly, at regular intervals, with the users for retraining, yielding a better and more efficient model capable of predicting malicious users more precisely. Extensive experimental work and comparison of the proposed model with state-of-the-art approaches demonstrate its efficiency. Significant improvement is observed in key performance indicators such as malicious user prediction accuracy, precision, recall, and F1-score, by up to 14.32%, 17.88%, 14.32%, and 18.35%, respectively.
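The server-side averaging step at the heart of this kind of federated scheme can be sketched as follows. This is a generic federated-averaging toy (clients send weight vectors, never raw data; the server averages them), not FedMUP's actual update rule:

```python
# Minimal federated-averaging sketch: clients contribute locally computed
# weight vectors; the server averages them per coordinate to form the
# updated global model that is sent back for retraining.

def federated_average(client_weights):
    """Average per-coordinate weights from several local models."""
    n = len(client_weights)
    dim = len(client_weights[0])
    return [sum(w[i] for w in client_weights) / n for i in range(dim)]
```

Privacy comes from the data flow: only the derived weights cross the network, while each raw dataset stays on its owner's machine.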
... However, malicious websites might often not have visible credential fields (e.g., in multi-level phishing attacks [40], [41] or websites distributing malware [42], [43]), and thus we considered URLs with two or more detections on VirusTotal, which we then manually verified. This threshold of two or more detections is proposed in prior literature [44], [45]. ...
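The labeling rule described in the snippet (keep URLs with at least two VirusTotal detections, then verify manually) reduces to a simple threshold filter. The input format below is an assumption for illustration:

```python
# Sketch of the VirusTotal-based labeling rule: keep URLs with at least
# `threshold` positive detections (2, per the snippet above), leaving
# manual verification as a second step.

def flag_urls(vt_results, threshold=2):
    """vt_results maps URL -> number of scanners that flagged it."""
    return [url for url, hits in vt_results.items() if hits >= threshold]
```

Raising the threshold trades recall for precision, which is why prior work has used values between 1 and 5 depending on how costly false positives are.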
Preprint
Full-text available
While phishing attacks have evolved to utilize several obfuscation tactics to evade prevalent detection measures, implementing said measures often requires significant technical competence and logistical overhead from the attacker's perspective. In this work, we identify a family of phishing attacks hosted over Free web-Hosting Domains (FHDs), which can be created and maintained at scale with very little effort while also effectively evading prevalent anti-phishing detection and resisting website takedown. We observed over 8.8k such phishing URLs shared on Twitter and Facebook from February to August 2022 using 24 unique FHDs. Our large-scale analysis of these attacks shows that phishing websites hosted on FHDs remain active on Twitter and Facebook for at least 1.5 times longer than regular phishing URLs. In addition, on average, they have 1.7 times lower coverage from anti-phishing blocklists than regular phishing attacks, with a coverage time also being 3.8 times slower while only having half the number of detections from anti-phishing tools. Moreover, only 23.6% of FHD URLs were removed by the hosting domain a week after their first appearance, with a median removal time of 12.2 hours. We also identified several gaps in the prevalent anti-phishing ecosystem in detecting these threats. Based on our findings, we developed FreePhish, an ML-aided framework that acts as an effective countermeasure to detect and mitigate these URLs automatically and more effectively. By regularly reporting phishing URLs found by FreePhish to FHDs and hosting registrars over a period of two weeks, we note a significant decrease in the time taken to remove these websites. Finally, we also provide FreePhish as a free Chromium web extension that can be utilized to prevent end-users from accessing potential FHD-based phishing attacks.
... The main intuition is to learn how to do so by training a model using historical data (i.e., metadata profiling previous security postures or historical security events collected between t_0 and t_n). At timestamp t_{n+1}, the model produces a binary prediction outcome (i.e., whether a breach or an attack is likely to happen) using present data (i.e., data collected between t_n and t_{n+1}) [6,34,47,50,58]. ...
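The rolling setup described above (features from events observed up to a timestamp, paired with a binary label for whether an incident occurs in the next window) can be sketched as a dataset-construction helper. The event format and the single count feature are illustrative:

```python
# Sketch of rolling-window training data: each example pairs a feature
# computed from events seen so far (here, just their count) with a label
# saying whether an incident occurs in the next `window` time units.
# The (time, is_incident) tuple format is an assumption.

def build_examples(events, window):
    """events: time-sorted (time, is_incident) pairs. Returns (feature, label) pairs."""
    examples = []
    for i, (t, _) in enumerate(events):
        label = any(t < ti <= t + window and inc
                    for ti, inc in events[i + 1:])
        examples.append((i + 1, label))
    return examples
```

A real pipeline would replace the count with the richer posture metadata the papers describe, but the train-on-past, label-on-future structure is the same.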
Conference Paper
Modern defenses against cyberattacks increasingly rely on proactive approaches, e.g., to predict the adversary's next actions based on past events. Building accurate prediction models requires knowledge from many organizations; alas, this entails disclosing sensitive information, such as network structures, security postures, and policies, which might often be undesirable or outright impossible. In this paper, we explore the feasibility of using Federated Learning (FL) to predict future security events. To this end, we introduce Cerberus, a system enabling collaborative training of Recurrent Neural Network (RNN) models for participating organizations. The intuition is that FL could potentially offer a middle-ground between the non-private approach where the training data is pooled at a central server and the low-utility alternative of only training local models. We instantiate Cerberus on a dataset obtained from a major security company's intrusion prevention product and evaluate it vis-a-vis utility, robustness, and privacy, as well as how participants contribute to and benefit from the system. Overall, our work sheds light on both the positive aspects and the challenges of using FL for this task and paves the way for deploying federated approaches to predictive security.
... The higher this value is for a given URL, the more likely the URL is malicious. Prior work that relies on VT for malicious URL detection has used a threshold value ranging from 1 to 5 [23,26,27,31]. As there is no robust way to decide which scanner is more accurate [32], Dizzy flags a URL as malicious if it is flagged by at least one scanner. ...
Preprint
With nearly 2.5m users, onion services have become a prominent part of the darkweb. Over the last five years alone, the number of onion domains has increased 20x, reaching more than 700k unique domains in January 2022. As onion services host various types of illicit content, they have become a valuable resource for darkweb research and an integral part of e-crime investigation and threat intelligence. However, this content is largely un-indexed by today's search engines, and researchers have to rely on outdated or manually collected datasets that are limited in scale, scope, or both. To tackle this problem, we built Dizzy: an open-source crawling and analysis system for onion services. Dizzy implements novel techniques to explore, update, check, and classify hidden services at scale, without overwhelming the Tor network. We deployed Dizzy in April 2021 and used it to analyze more than 63.3m crawled onion webpages, focusing on domain operations, web content, cryptocurrency usage, and web graph. Our main findings show that onion services are unreliable due to their high churn rate, have a relatively small number of reachable domains that are often similar and illicit, enjoy a growing underground cryptocurrency economy, and have a topologically different graph structure than the regular web.
... Moreover, Cronbach's α = 0.81 is high, indicating very good internal reliability. The self-reported data from SeBIS have helped forecast long-term exposure risks and yielded moderately accurate predictions (Sharif et al. 2018). It is a comparatively short instrument with the potential to give reliable results, thus providing a truer picture of the human component in cyber-security behavior research than larger scales (Velki and Šolić 2019). ...
Article
Full-text available
Cyber-security behavior research is scant, with even scarcer studies carried out in developing countries. We examine the cyber-security and risky Internet behaviors of undergraduate students from Pakistan, taking into account the diversity of these students in terms of demographics, socioeconomic status, and the digital divide. Data were collected using a survey questionnaire. A total of 294 students belonging to six different cities of Pakistan were surveyed employing multistage stratified sampling in face-to-face interaction. The results indicated significant differences in cyber-security posture in terms of gender, age, and digital-divide variables. The profiles of students based on cyber-security and risky Internet behaviors indicate three groups, with the majority falling into a group that exhibits more risk-averse yet low cyber-security behavior. Moreover, proactive cyber-security awareness behavior has a positive impact on high risk-averse behavior. The implications of the findings are discussed in terms of providing customized training and awareness. Future directions are laid out for further exploration of cultural differences in within- and cross-country contexts.
... Recent efforts have attempted to predict security and privacy threats before they materialize, in order to protect users' mobile devices and networks. Sharif et al. propose a system that observes user behavior and predicts whether users will be exposed to malicious content on the web seconds before the moment of exposure, thus opening a window of opportunity for proactive defenses [76]. Abbasi et al. predicted user susceptibility to phishing for real-time protection, based on user characteristics, the phishing threat, and anti-phishing tool-related factors [1]. ...
Article
Full-text available
Peer support is a powerful tool in improving the digital literacy of older adults. However, while the existing literature has investigated reactive support, this paper examines proactive support for mobile safety. To predict moments when users need support, we conducted a user study to measure the severity of mobile scenarios (n=300) and users' attitudes toward receiving support in a specific safety-related interaction on a mobile device (n=150). We compared classification methods and showed that the random forest method produces better performance than regression models. We show that user anxiety, openness to social support, self-efficacy, and security awareness are important factors in predicting willingness to receive support. We also explore how the age composition of the training sample affects prediction of the moments users need support. We find that training on the youngest population produces inferior results for older adults, and training on the aging population produces poor outcomes for young adults. We illustrate that the age composition of the sample can affect model performance. We conclude the paper by discussing how our findings can be used to design feasible proactive support applications that provide support at the right moment.
... Although these studies analyzed the infrastructure and traditional distribution techniques for fake AV software, such as drive-by downloads and fake infection alerts, new distribution tactics using FRAD sites have not been revealed. There is also related work describing case studies of fake AV software distribution from a social-engineering perspective [4], [11], [12], [15], [16], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32]. Most of these studies analyzed fake infection alerts delivered via advertisements that threaten or lure users into installing fake AV software. ...
Article
Full-text available
Fake antivirus (AV) software is a type of malware that disguises as legitimate antivirus software and causes harm to users and their devices. Fake removal information advertisement (FRAD) sites, which introduce fake removal information for cyber threats, have emerged as platforms for distributing fake AV software. Although FRAD sites seriously threaten users who have been suffering from cyber threats and need information for removing them, little attention has been given to revealing these sites. In this paper, we propose a system to automatically crawl the web and identify FRAD sites. To shed light on the pervasiveness of this type of attack, we performed a comprehensive analysis of both passively and actively collected data. Our system collected 2,913 FRAD sites in 31 languages, which have 73.5 million visits per month in total. We show that FRAD sites occupy search results when users search for cyber threats, thus preventing the users from obtaining the correct information.
... The core idea of this line of research is to extract a set of pre-defined features from historical data and train a machine learning model to predict, in binary form, whether a vulnerability is likely to be exploited [11], whether a PoC for a vulnerability will be devised in the real world [64], whether an enterprise will be breached, using publicly available security incident datasets [43], the volume of actual malware infections in a country [33], the likelihood of endpoints being at risk of infection in the future [8], etc. In recent years, researchers have also leveraged deep learning techniques such as RNNs [66] and CNNs [65] to predict exact upcoming security events. Our approach focuses on learning low-dimensional representations of the global malware installation graph to predict emerging malware installations. ...
Preprint
Full-text available
Android's security model severely limits the capabilities of anti-malware software. Unlike commodity anti-malware solutions on desktop systems, their Android counterparts run as sandboxed applications without root privileges and are limited by Android's permission system. As such, PHAs on Android are usually willingly installed by victims, as they come disguised as useful applications with hidden malicious functionality, and are encountered on mobile app stores as suggestions based on the apps that a user previously installed. Users with similar interests and app installation histories are likely to be exposed to, and to decide to install, the same PHA. This observation gives us the opportunity to develop predictive approaches that can warn the user about which PHAs they will encounter and potentially be tempted to install in the near future. These approaches could then be used to complement commodity anti-malware solutions, which are focused on post-fact detection, closing the window of opportunity that existing solutions suffer from. In this paper we develop Andruspex, a system based on graph representation learning, allowing us to learn latent relationships between user devices and PHAs and leverage them for prediction. We test Andruspex on a real-world dataset of PHA installations collected by a security company, and show that our approach achieves very high prediction results (up to 0.994 TPR at 0.0001 FPR), while at the same time outperforming alternative baseline methods. We also demonstrate that Andruspex is robust and that its runtime performance is acceptable for a real-world deployment.
... By forecasting web events based on user behavior, we can provide better recommendations, caching, pre-fetching, and load balancing for enterprise web applications [8,37]. Web event forecasting has been investigated for many years [2,32,35]. However, two critical challenges exist in characterizing sequences of web events: 1) although much work has been done on forecasting web events, very few studies have investigated enterprise web applications. ...
Chapter
Full-text available
Recently web applications have been widely used in enterprises to assist employees in providing effective and efficient business processes. Forecasting upcoming web events in enterprise web applications can be beneficial in many ways, such as efficient caching and recommendation. In this paper, we present a web event forecasting approach, DeepEvent, in enterprise web applications for better anomaly detection. DeepEvent includes three key features: web-specific neural networks to take into account the characteristics of sequential web events, self-supervised learning techniques to overcome the scarcity of labeled data, and sequence embedding techniques to integrate contextual events and capture dependencies among web events. We evaluate DeepEvent on web events collected from six real-world enterprise web applications. Our experimental results demonstrate that DeepEvent is effective in forecasting sequential web events and detecting web based anomalies. DeepEvent provides a context-based system for researchers and practitioners to better forecast web events with situational awareness.
Article
VirusTotal (VT) is a widely used scanning service for researchers and practitioners to label malicious entities and predict new security threats. Unfortunately, it is little known to end-users how VT URL scanners decide on the maliciousness of entities and the attack types they are involved in (e.g., phishing or malware-hosting websites). In this paper, we conduct a systematic comparative study of VT URL scanners' behavior for different attack types of malicious URLs, in terms of 1) detection specialties, 2) stability, 3) correlations between scanners, and 4) lead/lag behaviors. Our findings highlight that the VT scanners commonly disagree with each other on their detection and attack type classification, leading to challenges in ascertaining the maliciousness of a URL and taking prompt mitigation actions according to different attack types. This motivates us to present a new highly accurate classifier that helps correctly identify the attack types of malicious URLs at an early stage. This in turn assists practitioners in performing better threat aggregation and choosing proper mitigation actions for different attack types.
Article
Malware is still a widespread problem and is used by malicious actors to routinely compromise the security of computer systems. Consumers typically rely on a single AV product to detect and block possible malware infections, while corporations often install multiple security products, activate several layers of defenses, and establish security policies among employees. However, while a better security posture should lower the risk of malware infections, the actual extent to which this happens is still debated by risk analysis experts. Moreover, the difference in risks encountered by consumers and enterprises has never been empirically studied using real-world data. In fact, the mere use of third-party software and network services and the interconnected nature of our society necessarily expose both classes of users to undiversifiable risks: independently of how careful users are and how well they manage their cyber hygiene, a portion of that risk would exist simply because they use a computer, share the same networks, and run the same software. In this work, we shed light on both systemic (i.e., diversifiable and dependent on the security posture) and systematic (i.e., undiversifiable and independent of cyber hygiene) risk classes. Leveraging the telemetry data of a popular security company, we compare, in the first part of our study, the effects that different security measures have on malware encounter risks in consumer and enterprise environments. In the second part, we conduct exploratory research on systematic risk, investigate the quality of nine different indicators we were able to extract from our telemetry, and provide, for the first time, quantitative indicators of their predictive power. Our results show that even if consumers have a slightly lower encounter rate than enterprises (9.8% vs 12.0%), the latter do considerably better when selecting machines with an increasingly higher uptime (89% vs 53%).
The two segments also diverge when we separately consider the presence of Adware and Potentially Unwanted Applications (PUA) and the generic samples detected through behavioral signatures: while consumers have an encounter rate for Adware and PUA that is 6 times higher than that of enterprise machines, enterprise machines on average match behavioral signatures twice as frequently as consumers. We find, instead, similar trends when analyzing the age of encountered signatures and the prevalence of different classes of traditional malware (such as Ransomware and Cryptominers). Finally, our findings show that the amount of time a host is active, the volume of files generated on the machine, the number and reputation of the vendors of the installed applications, the host's geographical location, and its recurrent infected state carry useful information as indicators of systematic risk of malware encounters. Activity days and hours have a greater influence on consumers' risk, increasing the odds of encountering malware by 4.51 and 2.65 times, respectively. In addition, we measure that the volume of files generated on the host is a reliable indicator, especially for Adware. We further report that the likelihood of encountering Worms and Adware is much higher (on average 8 times, for both consumers and enterprises) for machines that have already reported this kind of signature in the past.
Article
Users are encouraged to adopt a wide array of technologies and behaviors to reduce their security risk. However, the adoption of these "best practices," ranging from the use of antivirus products to keeping software updated, is not well understood, nor is their practical impact on security risk well established. To explore these issues, we conducted a large-scale measurement of 15,000 computers over six months. We use passive monitoring to infer and characterize the prevalence of various security practices as well as a range of other potentially security-relevant behaviors. We then explore the extent to which differences in key security behaviors impact real-world outcomes (i.e., whether a device shows clear evidence of having been compromised).
Article
We propose to identify compromised mobile devices from a network administrator's point of view. Intuitively, inadvertent users (and thus their devices) who download apps through untrustworthy markets are often lured into installing malicious apps through in-app advertisements or phishing. We thus hypothesize that devices sharing similar apps would have a similar likelihood of being compromised, resulting in an association between a compromised device and its apps. We propose to leverage such associations to identify unknown compromised devices using the guilt-by-association principle. Admittedly, such associations could be relatively weak, as it is hard, if not impossible, for an app to automatically download and install other apps without explicit user initiation. We describe how we can magnify such associations by carefully choosing parameters when applying graph-based inference. We empirically evaluate the effectiveness of our approach on real datasets provided by a major mobile service provider. Specifically, we show that our approach achieves nearly 98% AUC (area under the ROC curve) and further detects as many as 6-7 times more new compromised devices, not covered by the ground truth, by expanding the limited knowledge of known devices. We show that the newly detected devices indeed present undesirable behavior in terms of leaking private information and accessing risky IPs and domains. We further conduct an in-depth analysis of the effectiveness of graph inference to understand the unique structure of the associations between mobile devices and their apps, and its impact on graph inference, based on which we propose how to choose key parameters.
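The guilt-by-association principle can be illustrated with a toy score propagation over the device-app sharing graph: a device's suspicion score is pulled toward the scores of devices it shares apps with, seeded from known compromised devices. The real system runs tuned graph inference; the fixed-point iteration below is only the principle:

```python
# Toy guilt-by-association propagation: seed known-bad devices at 1.0
# and repeatedly set every other device's score to the mean score of
# devices it shares at least one app with. Edge weights, damping, and
# convergence handling are deliberately omitted.

def propagate(device_apps, known_bad, rounds=3):
    """device_apps: device -> set of apps. Returns device -> score in [0, 1]."""
    scores = {d: (1.0 if d in known_bad else 0.0) for d in device_apps}
    for _ in range(rounds):
        new = {}
        for d, apps in device_apps.items():
            if d in known_bad:
                new[d] = 1.0  # seed labels stay fixed
                continue
            neigh = [scores[o] for o, a in device_apps.items()
                     if o != d and apps & a]
            new[d] = sum(neigh) / len(neigh) if neigh else 0.0
        scores = new
    return scores
```

Devices with no app overlap with any scored neighbor keep a zero score, which mirrors why the paper has to "magnify" weak associations through parameter choices.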
Article
Assessing the information security awareness (ISA) of users is crucial for protecting systems and organizations from social engineering attacks. Current methods do not consider the context of use when assessing users’ ISA, and therefore they cannot accurately reflect users’ actual behavior, which often depends on that context. In this study, we propose a novel context-based, data-driven, approach for assessing the ISA of users. In this approach, different behavioral and contextual factors, such as spatio-temporal information and browsing habits, are used to assess users’ ISA. Since defining each context explicitly is impractical for a large context space, we utilize a deep neural network to represent users’ contexts implicitly from contextual factors. We evaluate our approach empirically using a real-world dataset of users’ activities collected from 120 smartphone users. The results show that the proposed method and context information improve ISA assessment accuracy significantly.
Article
Recently, sophisticated attacks on cyberspace have occurred frequently, causing severe damage to the Internet. Predicting potential threats can assist security engineers in deploying corresponding defenses in advance to reduce the damage. Thus, threat prediction has recently drawn attention in the community. Previous works utilized only historical security event sequences to predict the subsequent event through a recurrent neural network (RNN), yielding inaccurate results when the input sequence is corrupted by false reports from the underlying detection logs. In this paper, we develop a joint predictor for security events and time intervals through an attention-based LSTM (Long Short-Term Memory) network. To enhance event prediction performance for corrupted input sequences, time intervals between events are incorporated into the input tuple, providing more distinguishing features. Moreover, a time discretization method is proposed to transform the skewed long-tail dwell time distribution into a predictable distribution of time intervals. In addition, the joint optimization function enables the model to simultaneously predict the occurrence time of the next event, which helps security managers select appropriate defenses. Our model is shown to be effective on four real-world datasets, outperforming previous methods on both event and time prediction. Moreover, the empirical results validate the model's stability.
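The time-discretization idea (mapping a long-tailed dwell-time distribution onto a small, predictable set of categories) might look like log-scale bucketing. The base and bucket count below are illustrative choices, not the paper's actual method:

```python
# Sketch of log-scale time discretization: short intervals get fine
# buckets, the long tail is compressed into the top bucket, turning a
# skewed continuous target into a small categorical one.
import math

def discretize_interval(seconds, n_buckets=8, base=4.0):
    """Map a positive interval to a bucket index in [0, n_buckets - 1]."""
    if seconds < 1.0:
        return 0
    return min(n_buckets - 1, int(math.log(seconds, base)) + 1)
```

A model then predicts the bucket index alongside the next event type, rather than regressing on raw seconds dominated by the tail.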
Article
With the ubiquity of web tracking, information on how people navigate the internet is abundantly collected yet, due to its proprietary nature, rarely distributed. As a result, our understanding of user browsing primarily derives from small-scale studies conducted more than a decade ago. To provide a broader, updated perspective, we analyze data from 257 participants who consented to have their home computer and browsing behavior monitored through the Security Behavior Observatory. Compared to previous work, we find a substantial increase in tabbed browsing and demonstrate the need to include tab information for accurate web measurements. Our results confirm that user browsing is highly centralized, with 50% of internet use spent on 1% of visited websites. However, we also find that users spend a disproportionate amount of time on low-visited websites, areas with a greater likelihood of containing risky content. We then identify the primary gateways to these sites and discuss implications for future research.
Chapter
We present MORTON, a method that identifies compromised devices in enterprise networks based on the existence of routine DNS communication between devices and disreputable host names. With its compact representation of the input data and use of efficient signal processing and a neural network for classification, MORTON is designed to be accurate, robust, and scalable. We evaluate MORTON using a large dataset of corporate DNS logs and compare it with two recently proposed beaconing detection methods aimed at detecting malware communication. The results demonstrate that while MORTON’s accuracy in a synthetic experiment is comparable to that of the other methods, it outperforms those methods in terms of its ability to detect sophisticated bot communication techniques, such as multistage channels. Additionally, MORTON was the most efficient method, running at least 13 times faster than the other methods on large-scale datasets, thus reducing the time to detection. In a real-world evaluation, which includes previously unreported threats, MORTON and the two compared methods were deployed to monitor the (unlabeled) DNS traffic of two global enterprises for a week-long period; this evaluation demonstrates the effectiveness of MORTON in real-world scenarios where it achieved the highest F1-score.
Chapter
A software update is a critical but complicated part of software security. Delaying it poses risks due to vulnerabilities and defects in the software. Despite the high demand to shorten the update lag and keep software up-to-date, software updates involve factors such as human behavior, program configurations, and system policies, adding variability to how software gets updated. Investigating these factors in a real environment poses significant challenges, such as needing knowledge of vendors’ software release schedules and of the deployment times of programs on each user’s machine. Obtaining software release plans requires information from vendors that is not typically available to the public. On the users’ side, tracking each software’s exact update installation is required to determine the accurate update delay. A scalable, systematic approach to analyzing both sides’ views across a comprehensive set of software is currently missing. We performed a long-term, system-wide study of update behavior for all software running in an enterprise by translating the operating system logs from enterprise machines into graphs of binary executable updates, revealing their complex, individualized updates in the environment. Our comparative analysis locates risky machines and software with belated or dormant updates falling behind others within an enterprise, without relying on any third party or domain knowledge, providing new observations and opportunities for improvement of software updates. Our evaluation analyzes real data from 113,675 unique programs used by 774 computers over 3 years.
Article
Full-text available
The Security Behavior Observatory (SBO) is a longitudinal field-study of computer security habits that provides a novel dataset for validating computer security metrics. This paper demonstrates a new strategy for validating phishing detection ability metrics by comparing performance on a phishing signal detection task with data logs found in the SBO. We report: (1) a test of the robustness of performance on the signal detection task by replicating Canfield, Fischhoff, and Davis (2016), (2) an assessment of the task's construct validity, and (3) evaluation of its predictive validity using data logs. We find that members of the SBO sample had similar signal detection ability compared to members of the previous mTurk sample and that performance on the task correlated with the Security Behavior Intentions Scale (SeBIS). However, there was no evidence of predictive validity, as the signal detection task performance was unrelated to computer security outcomes in the SBO, including the presence of malicious software, URLs, and files. We discuss the implications of these findings and the challenges of comparing behavior on structured experimental tasks to behavior in complex real-world settings.
Conference Paper
Full-text available
The current evolution of the cyber-threat ecosystem shows that no system can be considered invulnerable. It is therefore important to quantify the risk level within a system and devise risk prediction methods such that proactive measures can be taken to reduce the damage of cyber attacks. We present RiskTeller, a system that analyzes binary file appearance logs of machines to predict which machines are at risk of infection months in advance. Risk prediction models are built by creating, for each machine, a comprehensive profile capturing its usage patterns, and then associating each profile to a risk level through both fully and semi-supervised learning methods. We evaluate RiskTeller on a year-long dataset containing information about all the binaries appearing on machines of 18 enterprises. We show that RiskTeller can use the machine profile computed for a given machine to predict subsequent infections with the highest prediction precision achieved to date.
Conference Paper
Full-text available
Several studies have shown that the network traffic that is generated by a visit to a website over Tor reveals information specific to the website through the timing and sizes of network packets. By capturing traffic traces between users and their Tor entry guard, a network eavesdropper can leverage this meta-data to reveal which website Tor users are visiting. The success of such attacks heavily depends on the particular set of traffic features that are used to construct the fingerprint. Typically, these features are manually engineered and, as such, any change introduced to the Tor network can render these carefully constructed features ineffective. In this paper, we show that an adversary can automate the feature engineering process, and thus automatically deanonymize Tor traffic by applying our novel method based on deep learning. We collect a dataset comprised of more than three million network traces, which is the largest dataset of web traffic ever used for website fingerprinting, and find that the performance achieved by our deep learning approaches is comparable to known methods which include various research efforts spanning over multiple years. The obtained success rate exceeds 96% for a closed world of 100 websites and 94% for our biggest closed world of 900 classes. In our open world evaluation, the most performant deep learning model is 2% more accurate than the state-of-the-art attack. Furthermore, we show that the implicit features automatically learned by our approach are far more resilient to dynamic changes of web content over time. We conclude that the ability to automatically construct the most relevant traffic features and perform accurate traffic recognition makes our deep learning based approach an efficient, flexible and robust technique for website fingerprinting.
Article
Full-text available
Protecting vast quantities of data poses a daunting challenge for the growing number of organizations that collect, stockpile, and monetize it. The ability to distinguish data that is actually needed from data collected "just in case" would help these organizations to limit the latter's exposure to attack. A natural approach might be to monitor data use and retain only the working-set of in-use data in accessible storage; unused data can be evicted to a highly protected store. However, many of today's big data applications rely on machine learning (ML) workloads that are periodically retrained by accessing, and thus exposing to attack, the entire data store. Training set minimization methods, such as count featurization, are often used to limit the data needed to train ML workloads to improve performance or scalability. We present Pyramid, a limited-exposure data management system that builds upon count featurization to enhance data protection. As such, Pyramid uniquely introduces both the idea and proof-of-concept for leveraging training set minimization methods to instill rigor and selectivity into big data management. We integrated Pyramid into Spark Velox, a framework for ML-based targeting and personalization. We evaluate it on three applications and show that Pyramid approaches state-of-the-art models while training on less than 1% of the raw data.
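Count featurization, which the abstract above builds on, can be sketched in a few lines: each categorical value is replaced by its per-label occurrence counts, so models train on compact count statistics rather than the raw data. This is a generic illustration of the technique, not Pyramid's implementation (which adds noise and windowing for protection).

```python
from collections import defaultdict

def count_featurize(values, labels):
    """Build a count table for a single categorical feature.

    values: list of categorical feature values (e.g., domains, app IDs).
    labels: parallel list of binary labels (0 or 1).
    Returns a dict mapping each value -> [count with label 0, count with label 1],
    which can stand in for the raw value at training time.
    """
    counts = defaultdict(lambda: [0, 0])
    for value, label in zip(values, labels):
        counts[value][label] += 1
    return dict(counts)
```

A model then sees, say, `[1, 1]` instead of the literal value `"a"`, so the raw records can be evicted to protected storage once the counts are built.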
Conference Paper
Full-text available
A major inhibitor of the effectiveness of security warnings is habituation: decreased response to a repeated warning. Although habituation develops over time, previous studies have examined habituation and possible solutions to its effects only within a single experimental session, providing an incomplete view of the problem. To address this gap, we conducted a longitudinal experiment that examines how habituation develops over the course of a five-day workweek and how polymorphic warnings decrease habituation. We measured habituation using two complementary methods simultaneously: functional magnetic resonance imaging (fMRI) and eye tracking. Our results show a dramatic drop in attention throughout the workweek despite partial recovery between workdays. We also found that the polymorphic warning design was substantially more resistant to habituation compared to conventional warnings, and it sustained this advantage throughout the five-day experiment. Our findings add credibility to prior studies by showing that the pattern of habituation holds across a workweek, and indicate that cross-sectional habituation studies are valid proxies for longitudinal studies. Our findings also show that eye tracking is a valid measure of the mental process of habituation to warnings.
Conference Paper
Full-text available
The behavior of the least-secure user can influence security and privacy outcomes for everyone else. Thus, it is important to understand the factors that influence the security and privacy of a broad variety of people. Prior work has suggested that users with differing socioeconomic status (SES) may behave differently; however, no research has examined how SES, advice sources, and resources relate to the security and privacy incidents users report. To address this question, we analyze a 3,000 respondent, census-representative telephone survey. We find that, contrary to prior assumptions, people with lower educational attainment report as many or fewer incidents than more educated people, and that users' experiences are significantly correlated with their advice sources, regardless of SES or resources.
Conference Paper
Full-text available
In this paper we study the implications of end-user behavior in applying software updates and patches on information-security vulnerabilities. To this end we tap into a large data set of measurements conducted on more than 400,000 Windows machines over four client-side applications, and separate out the impact of user and vendor behavior on the vulnerability states of hosts. Our modeling of users and the empirical evaluation of this model over vulnerability states of hosts reveal a peculiar relationship between vendors and end-users: the users’ promptness in applying software patches, and vendors’ policies in facilitating the installation of updates, while both contributing to the hosts’ security posture, are overshadowed by other characteristics such as the frequency of vulnerability disclosures and the vendors’ swiftness in deploying patches.
Article
Full-text available
This article aims to understand if, and to what extent, business details about an organization can help to assess a company’s risk of experiencing data breach incidents, as well as its distribution of risk over multiple incident types, in order to provide guidelines to effectively protect, detect, and recover from different forms of security incidents. Existing work on prediction of data breach mainly focuses on network incidents, and studies that analyze the distribution of risk across different incident categories, most notably Verizon’s latest Data Breach Investigations Report, provide recommendations based solely on business sector information. In this article, we leverage a broader set of publicly available business details to provide a more fine-grained analysis on incidents involving any form of data breach and data loss. Specifically, we use reports collected in the VERIS Community Database (VCDB), as well as data from Alexa Web Information Service (AWIS), the Open Directory Project (ODP), and Neustar Inc., to train and test a sequence of classifiers/predictors. Our results show that our feature set can distinguish between victims of data breaches and nonvictims with a 90% true positive rate and 11% false positive rate, making them an effective tool in evaluating an entity’s cyber-risk. Furthermore, we show that compared to using business sector information alone, our method can derive a more accurate risk distribution for specific incident types, and allow organizations to focus on a sparser set of incidents, thus achieving the same level of protection by spending fewer resources on security through more judicious prioritization. Keywords: data breach; resource allocation; risk assessment.
Article
Full-text available
The 2013 National Security Agency revelations of pervasive monitoring have led to an "encryption rush" across the computer and Internet industry. To push back against massive surveillance and protect users' privacy, vendors, hosting and cloud providers have widely deployed encryption on their hardware, communication links, and applications. As a consequence, most web traffic nowadays is encrypted. However, there is still a significant part of Internet traffic that is not encrypted. It has been argued that both costs and complexity associated with obtaining and deploying X.509 certificates are major barriers to widespread encryption, since these certificates are required to establish encrypted connections. To address these issues, the Electronic Frontier Foundation, Mozilla Foundation, and the University of Michigan have set up Let's Encrypt (LE), a certificate authority that provides both free X.509 certificates and software that automates the deployment of these certificates. In this paper, we investigate if LE has been successful in democratizing encryption: we analyze certificate issuance in the first year of LE and show from various perspectives that LE adoption has an upward trend and it is in fact being successful in covering the lower-cost end of the hosting market.
Article
Full-text available
TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with particularly strong support for training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model in contrast to existing systems, and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.
Conference Paper
Full-text available
While individual differences in decision-making have been examined within the social sciences for several decades, they have only recently begun to be applied by computer scientists to examine privacy and security attitudes (and ultimately behaviors). Specifically, several researchers have shown how different online privacy decisions are correlated with the "Big Five" personality traits. In this paper, we show that the five factor model is actually a weak predictor of privacy attitudes, and that other well-studied individual differences in the psychology literature are much stronger predictors. Based on this result, we introduce the new paradigm of psychographic targeting of privacy and security mitigations: we believe that the next frontier in privacy and security research will be to tailor mitigations to users' individual differences. We explore the extensive work on choice architecture and "nudges," and discuss the possible ways it could be leveraged to improve security outcomes by personalizing privacy and security mitigations to specific user traits.
Conference Paper
Full-text available
In this paper, we present AMICO, a novel system for measuring and detecting malware downloads in live web traffic. AMICO learns to distinguish between malware and benign file downloads from the download behavior of the network users themselves. Given a labeled dataset of past benign and malware file downloads, AMICO learns a provenance classifier that can accurately detect future malware downloads based on information about where the downloads originated from. The main intuition is that to avoid current countermeasures, malware campaigns need to use an “agile” distribution infrastructure, e.g., frequently changing the domains and/or IPs of the malware download servers. We engineer a number of statistical features that aim to capture these fundamental characteristics of malware distribution campaigns. We have deployed AMICO at the edge of a large academic network for almost nine months, where we continuously witness hundreds of new malware downloads per week, including many zero-days. We show that AMICO is able to accurately detect malware downloads with up to 90% true positives at a false positives rate of 0.1% and can detect zero-day malware downloads, thus providing an effective way to complement current malware detection tools.
Conference Paper
Full-text available
Recent years have seen extensive growth of services enabling free broadcasts of live streams on the Web. Free live streaming (FLIS) services attract millions of viewers and make heavy use of deceptive advertisements. Despite the immense popularity of these services, little is known about the parties that facilitate it and maintain webpages to index links for free viewership. This paper presents a comprehensive analysis of the FLIS ecosystem by mapping all parties involved in the anonymous broadcast of live streams, discovering their modus operandi, and quantifying the consequences for common Internet users who utilize these services. We develop an infrastructure that enables us to perform more than 850,000 visits by identifying 5,685 free live streaming domains, and analyze more than 1 Terabyte of traffic to map the parties that constitute the FLIS ecosystem. On the one hand, our analysis reveals that users of FLIS websites are generally exposed to deceptive advertisements, malware, malicious browser extensions, and fraudulent scams. On the other hand, we find that FLIS parties are often reported for copyright violations and host their infrastructure predominantly in Europe and Belize. At the same time, we encounter substandard advertisement setups by the FLIS parties, along with potential trademark infringements through the abuse of domain names and logos of popular TV channels. Given the magnitude of the discovered abuse, we engineer features that characterize FLIS pages and build a classifier to identify FLIS pages with high accuracy and low false positives, in an effort to help human analysts identify malicious services and, whenever appropriate, initiate content-takedown requests.
Conference Paper
Full-text available
Despite the plethora of security advice and online education materials offered to end-users, there exists no standard measurement tool for end-user security behaviors. We present the creation of such a tool. We surveyed the most common computer security advice that experts offer to end-users in order to construct a set of Likert scale questions to probe the extent to which respondents claim to follow this advice. Using these questions, we iteratively surveyed a pool of 3,619 computer users to refine our question set such that each question was applicable to a large percentage of the population, exhibited adequate variance between respondents, and had high reliability (i.e., desirable psychometric properties). After performing both exploratory and confirmatory factor analysis, we identified a 16-item scale consisting of four sub-scales that measures attitudes towards choosing passwords, device securement, staying up-to-date, and proactive awareness.
Article
Full-text available
We present an epidemiological study of malware encounters in a large, multi-national enterprise. Our data sets allow us to observe or infer not only malware presence on enterprise computers, but also malware entry points, network locations of the computers (i.e., inside the enterprise network or outside) when the malware were encountered, and for some web-based malware encounters, web activities that gave rise to them. By coupling this data with demographic information for each host's primary user, such as his or her job title and level in the management hierarchy, we are able to paint a reasonably comprehensive picture of malware encounters for this enterprise. We use this analysis to build a logistic regression model for inferring the risk of hosts encountering malware; those ranked highly by our model have a > 3× higher rate of encountering malware than the base rate. We also discuss where our study confirms or refutes other studies and guidance that our results suggest.
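The logistic regression risk model mentioned above reduces to scoring each host with a sigmoid over weighted features. The weights and features below are illustrative placeholders, not the paper's fitted coefficients or demographic variables.

```python
import math

def malware_risk(features, weights, bias=0.0):
    """Score a host's risk of encountering malware with a logistic model.

    features: numeric feature values for one host (e.g., derived from
    the user's job title, management level, web activity).
    weights:  a parallel list of model coefficients (illustrative here).
    Returns a probability-like score in (0, 1).
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

Hosts ranked by this score can then be compared against the base rate, as the study does with its >3x lift for the top-ranked hosts.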
Article
Full-text available
One question that arises when discussing the usefulness of web-based surveys is whether they gain the same response rates compared to other modes of collecting survey data. A common perception exists that, in general, web survey response rates are considerably lower. However, such unsystematic anecdotal evidence could be misleading and does not provide any useful quantitative estimate. Meta-analytic procedures synthesising controlled experimental mode comparisons could give accurate answers but, to the best of the authors' knowledge, such research syntheses have so far not been conducted. To overcome this gap, the authors have conducted a meta-analysis of 45 published and unpublished experimental comparisons between web and other survey modes. On average, web surveys yield an 11% lower response rate compared to other modes (the 95% confidence interval is confined by 15% and 6% to the disadvantage of the web mode). This response rate difference to the disadvantage of the web mode is systematically influenced by the sample recruitment base (a smaller difference for panel members as compared to one-time respondents), the solicitation mode chosen for web surveys (a greater difference for postal mail solicitation compared to email) and the number of contacts (the more contacts, the larger the difference in response rates between modes). No significant influence on response rate differences can be revealed for the type of mode web surveys are compared to, the type of target population, the type of sponsorship, whether or not incentives were offered, and the year the studies were conducted. Practical implications are discussed.
Conference Paper
Full-text available
This article offers an empirical exploration on the use of character-level convolutional networks (ConvNets) for text classification. We constructed several large-scale datasets to show that character-level convolutional networks could achieve state-of-the-art or competitive results. Comparisons are offered against traditional models such as bag of words, n-grams and their TFIDF variants, and deep learning models such as word-based ConvNets and recurrent neural networks.
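Character-level ConvNets of the kind described above consume text as a fixed-size matrix of one-hot character rows. A minimal sketch of that input quantization step, with an illustrative alphabet and length (the original paper uses a 70-character alphabet and length 1014):

```python
def quantize(text, alphabet="abcdefghijklmnopqrstuvwxyz0123456789", max_len=16):
    """One-hot encode characters into a fixed-size matrix (list of rows).

    Characters outside the alphabet become all-zero rows; the text is
    truncated or zero-padded to max_len, matching the fixed-width input
    a character-level ConvNet expects.
    """
    index = {c: i for i, c in enumerate(alphabet)}
    matrix = []
    for ch in text[:max_len].lower():
        row = [0] * len(alphabet)
        if ch in index:
            row[index[ch]] = 1
        matrix.append(row)
    while len(matrix) < max_len:
        matrix.append([0] * len(alphabet))
    return matrix
```

The resulting max_len x alphabet-size matrix is what the first convolutional layer slides over, exactly as an image model slides over pixels.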
Conference Paper
Full-text available
Most modern malware download attacks occur via the browser, typically due to social engineering and drive-by downloads. In this paper, we study the "origin" of malware download attacks experienced by real network users, with the objective of improving malware download defenses. Specifically, we study the web paths followed by users who eventually fall victim to different types of malware downloads. To this end, we propose a novel incident investigation system, named WebWitness. Our system targets two main goals: 1) automatically trace back and label the sequence of events (e.g., visited web pages) preceding malware downloads, to highlight how users reach attack pages on the web; and 2) leverage these automatically labeled in-the-wild malware download paths to better understand current attack trends, and to develop more effective defenses. We deployed WebWitness on a large academic network for a period of ten months, where we collected and categorized thousands of live malicious download paths. An analysis of this labeled data allowed us to design a new defense against drive-by downloads that rely on injecting malicious content into (hacked) legitimate web pages. For example, we show that by leveraging the incident investigation information output by WebWitness we can decrease the infection rate for this type of drive-by downloads by almost six times, on average, compared to existing URL blacklisting approaches.
Conference Paper
Full-text available
Understanding what types of users and usage are more conducive to malware infections is crucial if we want to establish adequate strategies for dealing and mitigating the effects of computer crime in its various forms. Real-usage data is therefore essential to make better evidence-based decisions that will improve users' security. To this end, we performed a 4-month field study with 50 subjects and collected real-usage data by monitoring possible infections and gathering data on user behavior. In this paper, we present a first attempt at predicting risk of malware victimization based on user behavior. Using neural networks we developed a predictive model that has an accuracy of up to 80% at predicting user's likelihood of being infected.
Article
Full-text available
While individual differences in decision-making have been examined within the social sciences for several decades, this research has only recently begun to be applied by computer scientists to examine privacy and security attitudes (and ultimately behaviors). Specifically, several researchers have shown how different online privacy decisions are correlated with the "Big Five" personality traits. However, in our own research, we show that the five factor model is actually a weak predictor of privacy preferences and behaviors, and that other well-studied individual differences in the psychology literature are much stronger predictors. We describe the results of several experiments that showed how decision-making style and risk-taking attitudes are strong predictors of privacy attitudes, as well as a new scale that we developed to measure security behavior intentions. Finally, we show that privacy and security attitudes are correlated, but orthogonal.
Conference Paper
Full-text available
Smartphone users are often unaware of the data collected by apps running on their devices. We report on a study that evaluates the benefits of giving users an app permission manager and sending them nudges intended to raise their awareness of the data collected by their apps. Our study provides both qualitative and quantitative evidence that these approaches are complementary and can each play a significant role in empowering users to more effectively control their privacy. For instance, even after a week with access to the permission manager, participants benefited from nudges showing them how often some of their sensitive data was being accessed by apps, with 95% of participants reassessing their permissions, and 58% of them further restricting some of their permissions. We discuss how participants interacted both with the permission manager and the privacy nudges, analyze the effectiveness of both solutions, and derive some recommendations.
Article
Full-text available
In this paper, we present a systematic study for the detection of malicious applications (or apps) on popular Android Markets. To this end, we first propose a permission-based behavioral footprinting scheme to detect new samples of known Android malware families. Then we apply a heuristics-based filtering scheme to identify certain inherent behaviors of unknown malicious families. We implemented both schemes in a system called DroidRanger. The experiments with 204,040 apps collected from five different Android Markets in May-June 2011 reveal 211 malicious ones: 32 from the official Android Market (0.02% infection rate) and 179 from alternative marketplaces (infection rates ranging from 0.20% to 0.47%). Among those malicious apps, our system also uncovered two zero-day malware (in 40 apps): one from the official Android Market and the other from alternative marketplaces. The results show that current marketplaces are functional and relatively healthy. However, there is also a clear need for a rigorous policing process, especially for non-regulated alternative marketplaces.
Article
Full-text available
Users are typically the final target of web attacks: criminals are interested in stealing their money, their personal information, or in infecting their machines with malicious code. However, while many aspects of web attacks have been carefully studied by researchers and security companies, the reasons that make certain users more "at risk" than others are still unknown. Why do certain users never encounter malicious pages while others seem to end up on them on a daily basis? To answer this question, in this paper we present a comprehensive study on the effectiveness of risk prediction based only on the web browsing behavior of users. Our analysis is based on a telemetry dataset collected by a major AntiVirus vendor, comprising millions of URLs visited by more than 100,000 users during a period of three months. For each user, we extract detailed usage statistics, and distill this information in 74 unique features that model different aspects of the user's behavior. After the features are extracted, we perform a correlation analysis to see if any of them is correlated with the probability of visiting malicious web pages. Afterwards, we leverage machine learning techniques to provide a prediction model that can be used to estimate the risk class of a given user. The results of our experiments show that it is possible to predict with a reasonable accuracy (up to 87%) the users that are more likely to be the victims of web attacks, only by analyzing their browsing history.
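The feature-extraction step described above, distilling browsing telemetry into per-user statistics, can be sketched minimally. The four features below are illustrative stand-ins for the 74 features the study engineers, and the `(hour, domain)` log format is an assumption made for the example.

```python
def browsing_features(visits):
    """Distill a user's visit log into a few behavioral features.

    visits: list of (hour_of_day, domain) tuples, one per page visit.
    Returns a small feature dict of the kind a risk classifier
    could be trained on.
    """
    unique = {domain for _, domain in visits}
    night = sum(1 for hour, _ in visits if hour < 6 or hour >= 23)
    n = len(visits)
    return {
        "num_visits": n,
        "num_unique_domains": len(unique),
        "domain_diversity": len(unique) / n if n else 0.0,
        "night_fraction": night / n if n else 0.0,
    }
```

Each user's dict becomes one row of the training matrix; the label is whether that user later visited a known-malicious page.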
Article
Full-text available
Neural machine translation is a recently proposed approach to machine translation. Unlike traditional statistical machine translation, neural machine translation aims at building a single neural network that can be jointly tuned to maximize translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consist of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.
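The (soft-)search idea can be illustrated in a few lines: score each encoder state against the current decoder query, normalize the scores with a softmax, and take the weighted sum as the context. Note this sketch uses a plain dot-product score for simplicity, whereas the paper learns the alignment score with a small feed-forward network:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, source_states):
    """Soft alignment: score every encoder state, weight by softmax, and
    return the weighted context vector plus the alignment weights."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in source_states]
    weights = softmax(scores)
    dim = len(source_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, source_states))
               for i in range(dim)]
    return context, weights

states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = attend([1.0, 0.0], states)
```

Because the weights are a distribution over source positions, they can be inspected directly, which is what makes the qualitative alignment analysis in the paper possible.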
Conference Paper
Computer security problems often occur when there are disconnects between users' understanding of their role in computer security and what is expected of them. To help users make good security decisions more easily, we need insights into the challenges they face in their daily computer usage. We built and deployed the Security Behavior Observatory (SBO) to collect data on user behavior and machine configurations from participants' home computers. Combining SBO data with user interviews, this paper presents a qualitative study comparing users' attitudes, behaviors, and understanding of computer security to the actual states of their computers. Qualitative inductive thematic analysis of the interviews produced “engagement” as the overarching theme, whereby participants with greater engagement in computer security and maintenance did not necessarily have more secure computer states. Thus, user engagement alone may not be predictive of computer security. We identify several other themes that inform future directions for better design and research into security interventions. Our findings emphasize the need for better understanding of how users' computers get infected, so that we can more effectively design user-centered mitigations.
Conference Paper
A great deal of research has focused on algorithms for learning features from unlabeled data. Indeed, much progress has been made on benchmark datasets like NORB and CIFAR by employing increasingly complex unsupervised learning algorithms and deep models. In this paper, however, we show that several very simple factors, such as the number of hidden nodes in the model, may be as important to achieving high performance as the choice of learning algorithm or the depth of the model. Specifically, we apply several off-the-shelf feature learning algorithms (sparse auto-encoders, sparse RBMs, K-means clustering, and Gaussian mixtures) to NORB and CIFAR using only single-layer networks. We then present a detailed analysis of the effect of changes in the model setup: the receptive field size, number of hidden nodes (features), the step-size (stride) between extracted features, and the effect of whitening. Our results show that large numbers of hidden nodes and dense feature extraction are as critical to achieving high performance as the choice of algorithm itself: so critical, in fact, that when these parameters are pushed to their limits, we are able to achieve state-of-the-art performance on both CIFAR and NORB using only a single layer of features. More surprisingly, our best performance is based on K-means clustering, which is extremely fast, has no hyper-parameters to tune beyond the model structure itself, and is very easy to implement. Despite the simplicity of our system, we achieve performance beyond all previously published results on the CIFAR-10 and NORB datasets (79.6% and 97.0% accuracy, respectively).
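A minimal version of the K-means feature learner, plain Lloyd's algorithm plus a centroid-based encoding, might look as follows. This sketch uses a hard one-hot assignment for the encoding step; the paper's best results actually use a soft ("triangle") encoding, so treat this only as the skeleton of the approach:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: assign points to nearest center, then
    move each center to the mean of its cluster."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = [sum(xs) / len(cl) for xs in zip(*cl)]
    return centers

def encode(point, centers):
    """Hard-assignment feature vector: one-hot on the nearest centroid."""
    i = min(range(len(centers)),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centers[c])))
    return [1.0 if j == i else 0.0 for j in range(len(centers))]

pts = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]]
centers = kmeans(pts, 2)
```

In the paper's pipeline, such centroids are learned on whitened image patches, and the encodings are pooled over the image before a linear classifier.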
Article
While logistic sigmoid neurons are more biologically plausible than hyperbolic tangent neurons, the latter work better for training multi-layer neural networks. This paper shows that rectifying neurons are an even better model of biological neurons and yield equal or better performance than hyperbolic tangent networks, in spite of the hard non-linearity and non-differentiability at zero, creating sparse representations with true zeros, which seem remarkably suitable for naturally sparse data. Even though they can take advantage of semi-supervised setups with extra unlabeled data, deep rectifier networks can reach their best performance without requiring any unsupervised pre-training on purely supervised tasks with large labeled data sets. Hence, these results can be seen as a new milestone in the attempts at understanding the difficulty in training deep but purely supervised neural networks, and at closing the performance gap between neural networks learned with and without unsupervised pre-training.
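The "sparse representations with true zeros" follow directly from the rectifier's definition: any negative pre-activation is clipped to exactly zero. A minimal sketch (weights and inputs chosen arbitrarily for illustration):

```python
def relu(x):
    """The rectifier: identity for positive inputs, exact zero otherwise."""
    return x if x > 0.0 else 0.0

def layer(inputs, weights, biases):
    """One rectifier layer: affine transform followed by the hard
    non-linearity. Negative pre-activations become true zeros, which is
    what makes the resulting representation sparse."""
    return [relu(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

out = layer([1.0, -2.0],
            [[0.5, 0.5], [1.0, 0.0]],
            [0.0, 0.0])
# first unit: 0.5*1 + 0.5*(-2) = -0.5 -> clipped to 0.0; second unit: 1.0
```

Unlike tanh or sigmoid units, whose outputs are merely small for inhibited inputs, entire units here are switched off, so only a subset of the network is active for any given input.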
Conference Paper
The 2013 National Security Agency revelations of pervasive monitoring have led to an "encryption rush" across the computer and Internet industry. To push back against massive surveillance and protect users' privacy, vendors, hosting and cloud providers have widely deployed encryption on their hardware, communication links, and applications. As a consequence, most web connections nowadays are encrypted. However, there is still a significant part of Internet traffic that is not encrypted. It has been argued that both the costs and the complexity associated with obtaining and deploying X.509 certificates are major barriers to widespread encryption, since these certificates are required to establish encrypted connections. To address these issues, the Electronic Frontier Foundation, the Mozilla Foundation, the University of Michigan, and a number of partners have set up Let's Encrypt (LE), a certificate authority that provides both free X.509 certificates and software that automates the deployment of these certificates. In this paper, we investigate whether LE has been successful in democratizing encryption: we analyze certificate issuance in the first year of LE and show from various perspectives that LE adoption is trending upward and that it is in fact succeeding in covering the lower-cost end of the hosting market.
Conference Paper
Computer security tools usually provide universal solutions without taking user characteristics (origin, income level, ...) into account. In this paper, we test the validity of using such universal security defenses, with a particular focus on culture. We apply the previously proposed Security Behavior Intentions Scale (SeBIS) to 3,500 participants from seven countries. We first translate the scale into seven languages while preserving its reliability and structure validity. We then build a regression model to study which factors affect participants' security behavior. We find that participants from different countries exhibit different behavior. For instance, participants from Asian countries, and especially Japan, tend to exhibit less secure behavior. Surprisingly to us, we also find that actual knowledge influences user behavior much less than user self-confidence in their computer security knowledge. Stated differently, what people think they know affects their security behavior more than what they do know.
Conference Paper
It is common for researchers to use self-report measures (e.g. surveys) to measure people's security behaviors. In the computer security community, we don't know what behaviors people understand well enough to self-report accurately, or how well those self-reports correlate with what people actually do. In a six week field study, we collected both behavior data and survey responses from 122 subjects. We found that a relatively small number of behaviors -- mostly related to tasks that require users to take a specific, regular action -- have non-zero correlations. Since security is almost never a user's primary task for everyday computer users, several important security behaviors that we directly measured were not self-reported accurately. These results suggest that security research based on self-report is only reliable for certain behaviors. Additionally, a number of important security behaviors are not sufficiently salient to users that they can self-report accurately.
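The core measurement, how well self-reports track directly observed behavior, comes down to computing a correlation per behavior. A minimal sketch with Pearson's r on hypothetical toy data (the numbers below are invented, not from the study):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical toy data: Likert-style self-reports for one behavior vs. a
# directly measured count (e.g. how often updates were actually applied).
self_report = [1, 2, 2, 3, 4, 4, 5, 5]
measured    = [0, 1, 3, 2, 3, 5, 4, 6]
r = pearson(self_report, measured)
```

Under the study's finding, a table of such per-behavior correlations would show values near zero for behaviors that are not salient to users, and meaningfully positive values only for regular, deliberate actions.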
Article
Despite growing speculation about the role of human behavior in the cyber-security of machines, concrete data-driven analysis and evidence have been lacking. Using Symantec's WINE platform, we conduct a detailed study of 1.6 million machines over an 8-month period in order to learn the relationship between user behavior and cyber attacks against their personal computers. We classify users into 4 categories (gamers, professionals, software developers, and others, plus a fifth category comprising everyone) and identify a total of 7 features that act as proxies for human behavior. For each of the 35 possible combinations (5 categories times 7 features), we studied the relationship between each of these seven features and one dependent variable, namely the number of attempted malware attacks detected by Symantec on the machine. Our results show that there is a strong relationship between several features and the number of attempted malware attacks. Had these hosts not been protected by Symantec's anti-virus product or a similar product, they would likely have been infected. Surprisingly, our results show that software developers are more at risk of engaging in risky cyber-behavior than other categories.
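One way to quantify the relationship between a behavioral proxy and attack counts, robust to skewed count data, is a rank correlation. The sketch below implements Spearman's rho with average ranks for ties; the per-machine numbers are invented for illustration, and the paper's actual statistical methodology may differ:

```python
def rank(values):
    """Average ranks, with ties sharing the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx = my = (n + 1) / 2.0  # mean rank is (n+1)/2 even with ties
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Hypothetical per-machine data: a behavioral proxy (e.g. binaries
# downloaded) against attempted-attack counts detected on the machine.
downloads = [3, 10, 1, 7, 15, 2, 8, 12]
attacks   = [1,  4, 0, 3,  6, 1, 2,  5]
rho = spearman(downloads, attacks)
```

Repeating such a computation for each of the 5 categories and 7 features would yield the 35 relationships the study examines.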