Image forgery/manipulation is one of the most alarming topics in digital media and has become a major concern on social media platforms with regard to users' privacy and safety. The detection of manipulated images has therefore been of immense interest to researchers in recent years. Despite the availability of numerous image forgery detection (IFD) datasets, very few address the actual challenge of collecting manipulated images from real-world scenarios, e.g., from social media. Consequently, the contextual knowledge behind the use of manipulated images remains out of reach. To address these issues, we propose an indigenous social media image forgery detection database, named SMIFD-1000. The dataset provides rich annotations from several aspects: (a) image level: image regions that help classify pixel-level information; (b) forgery type: rich information about the manipulation; and (c) target and motive of the manipulation: contextually rich knowledge about the manipulation, which is significantly important from the perspective of social science. Finally, we examine and benchmark several publicly available algorithms on this dataset to demonstrate its usefulness. Results show that the dataset is highly challenging and will serve as an important benchmark for existing and future IFD algorithms. Keywords: Image Manipulation, Digital Forensics, Image Dataset.
Conducting digital forensic investigations in a big data distributed file system environment presents significant challenges to an investigator, given the high volume of physical data storage space. We present an approach that maps the Hadoop Distributed File System (HDFS) logical file space to physical data locations. The approach uses metadata collection and analysis to reconstruct events in a finite time series.
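As a rough illustration of the logical-to-physical mapping idea, the sketch below resolves an HDFS path to block files on datanodes. The two metadata structures (a namenode path-to-block map and a datanode block index) are hypothetical simplifications of what an investigator might recover from fsimage/edit-log analysis and datanode disk scans; they are not the paper's actual data model.

```python
def map_logical_to_physical(namenode_meta, datanode_index, path):
    """Resolve an HDFS logical path to physical block files.

    namenode_meta:  {path: [block_id, ...]}   (hypothetical, from namenode metadata)
    datanode_index: {block_id: [local paths]} (hypothetical, from datanode scans)
    Returns a mapping from each block of the file to its on-disk locations.
    """
    return {
        blk: datanode_index.get(blk, [])
        for blk in namenode_meta.get(path, [])
    }

meta = {"/evidence/a.txt": ["blk_1", "blk_2"]}
index = {"blk_1": ["/data1/blk_1"], "blk_2": ["/data2/blk_2"]}
print(map_logical_to_physical(meta, index, "/evidence/a.txt"))
```

A real investigation would also carry the metadata timestamps through this mapping to support the event reconstruction described above.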
ReFS is a modern file system developed by Microsoft whose internal structures and behavior are not officially documented. Although some efforts have been made to decipher its data structures, several of these findings have since become deprecated and no longer apply to current ReFS versions. In this work, general concepts and internal structures found in ReFS are examined and documented. Based on these structures and the processes by which they are modified, approaches to recovering (deleted) files from ReFS-formatted file systems are shown. We also evaluate our implementation and the allocation strategy of ReFS with respect to accuracy, runtime, and the ability to recover older file states.
Passwords have been and still remain the most common method of authentication in computer systems. These systems are therefore privileged targets of attackers, and the number of data breaches in the last few years attests to that. A detailed analysis of such data can provide insight into password trends and the patterns users follow when they create a password. To this end, this paper presents the largest and most comprehensive analysis of real-world passwords to date, associated with over 3.9 billion accounts from Have I Been Pwned. The analysis includes statistics on password use and the most common patterns found in passwords, and innovates with a breakdown of the constituent fragments that make up each password. Furthermore, a classification of these fragments according to their semantic meaning provides insight into the role of context in password selection. Finally, we provide an in-depth analysis of the guessability of these real-world passwords.
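The fragment breakdown can be illustrated with a minimal sketch. Splitting a password into runs of letters, digits, and symbols is an assumption made here for illustration; the paper's actual fragmentation and semantic classification are more elaborate.

```python
import re

def fragment(password):
    """Split a password into runs of letters, digits, and symbols.

    Simplified stand-in for the paper's fragment extraction; the real
    analysis additionally maps fragments to semantic categories
    (names, dates, keyboard walks, etc.).
    """
    return re.findall(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9]+", password)

print(fragment("summer2019!"))  # ['summer', '2019', '!']
```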
In contrast to the common habit of taking full bitwise copies of storage devices before analysis, selective imaging promises to alleviate the problems created by the increasing capacity of storage devices. Imaging is selective if only explicitly chosen data objects are included in the copied data. While selective imaging has been defined for post-mortem data acquisition, performing this process live, i.e., using the system that contains the evidence to also execute the imaging software, is less well defined and understood. We present the design and implementation of a new live Selective Imaging Tool for Windows, called SIT, which is based on the DFIR ORC framework and uses AFF4 as its container format.
Victims of child sexual abuse suffer from physical, psychological, and emotional trauma. The detection and deletion of illicit online child sexual abuse material (CSAM) helps in reducing and even stopping the continuous re-victimization of children. Furthermore, automatic detection may also support legal authorities in searching for and reviewing the masses of suspected CSAM. Due to tech-savvy offenders and technological advances, continuous efforts to keep up with current developments are crucial and need to be considered in the implementation of detection algorithms.
The present research provides a comprehensive synthesis and interpretation of current research accomplishments and challenges in the CSAM detection domain, explicitly considering the dimensions of policy and legal frameworks, distribution channels, and detection applications and implementations. Among other aspects, it reveals and aggregates knowledge related to image hash databases, keyword lists, web crawlers, detection based on filenames and metadata, and visual detection. The findings suggest that CSAM detection applications yield the best results when multiple approaches are used in combination, such as deep-learning algorithms merged with multi-modal image or video descriptors. Deep-learning techniques were shown to outperform other detection methods for unknown CSAM.
This paper presents a new method for Forensic Speaker Recognition (FSR) based on extracting accent and language information from short utterances. Accent Classification (AC) and Language Identification (LI) play an important role in identifying people of different groups, communities, and origins owing to their different speaking styles and native languages. In a multilingual society, forensic experts use AC and LI to reduce the search space for suspect recognition to regional and ethnic groups. In this paper, we use different baseline and deep learning methods to automate this process. The baseline methods are Gaussian Mixture Model-Universal Background Model (GMM-UBM), i-vector, and Gaussian Mixture Model-Support Vector Machine (GMM-SVM), all using Mel-Frequency Cepstral Coefficients (MFCC) as speech features. The deep learning methods are based on Convolutional Neural Networks (CNN) and Deep Neural Networks (DNN): the recently proposed CNN-based VGGVox and GMM-CNN methods, which operate on speech spectrograms, and the x-vector method, which is based on DNN embeddings. The experimental results show that GMM-SVM outperforms the GMM-UBM and i-vector baselines, while the x-vector method outperforms GMM-CNN, VGGVox, and GMM-SVM. The x-vector method achieves 80.4% FSR accuracy on its own, 85.4% with AC, 90.2% with LI, and 95.1% when AC and LI are combined. This shows that the proposed method based on AC and LI gives promising results.
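One way AC and LI scores might be combined with speaker scores is simple weighted score-level fusion, sketched below. The `fuse_scores` helper and its weights are illustrative assumptions, not the paper's fusion method.

```python
def fuse_scores(speaker_scores, ac_scores, li_scores, w=(0.5, 0.25, 0.25)):
    """Weighted score-level fusion of speaker, accent, and language scores.

    All three inputs map candidate IDs to scores in [0, 1]. The weights
    are illustrative; a real system would tune them on held-out data.
    """
    return {
        spk: w[0] * speaker_scores[spk]
        + w[1] * ac_scores[spk]
        + w[2] * li_scores[spk]
        for spk in speaker_scores
    }

spk = {"A": 0.9, "B": 0.4}
ac = {"A": 0.2, "B": 0.8}
li = {"A": 0.6, "B": 0.6}
fused = fuse_scores(spk, ac, li)
print(max(fused, key=fused.get))  # the top-ranked candidate
```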
Malware analysis is a forensic process. Only after an infection has occurred and the damage has manifested at full scale can the attack, the structure of the executable, and the aim of the malware be uncovered. These discoveries are converted into analysis reports and malware signatures and shared among antivirus databases and threat intelligence exchange platforms. This highly valuable information is then utilized in detection mechanisms to prevent further dissemination and infection. The analysis of a malware sample in this process falls into two categories: static analysis and dynamic analysis. In static analysis, the executable file is reverted toward source code through disassemblers and reverse engineering software and then analyzed, whereas dynamic analysis involves running the sample in an isolated environment and analyzing its behavior. Both static and dynamic analysis face limitations such as packing, obfuscation, dead code insertion, sandbox detection, and anti-debugging techniques. Memory operations, on the other hand, cannot be hidden by these techniques and are unavoidable for any software since the invention of our computational models. Therefore, in this research, memory operations and access patterns of malicious acts are examined, and a novel approach for extracting memory access images is presented. In addition to extraction, methods for using these images for detection and comparison are introduced through an image comparison technique.
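The idea of rendering a memory-access trace as a comparable image can be sketched roughly as follows. The address-folding scheme and the per-cell distance metric are illustrative assumptions, not the paper's extraction method, which operates on full instrumented traces.

```python
def access_image(addresses, size=16):
    """Render a memory-access trace as a size x size intensity grid.

    Each accessed address is folded onto a grid cell; cell values count
    accesses, giving a crude 'memory access image'. Hypothetical sketch.
    """
    img = [[0] * size for _ in range(size)]
    for a in addresses:
        cell = a % (size * size)
        img[cell // size][cell % size] += 1
    return img

def image_distance(a, b):
    """Sum of absolute per-cell differences: a simple comparison metric."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

Two samples with similar access behavior would yield a small `image_distance`, which is the intuition behind image-based detection and comparison.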
Ransomware, the malicious software that encrypts user files to demand a ransom payment, is one of the most common and persistent threats. Cyber-criminals create new ransomware variants to evade protections shortly after anti-virus vendors update their signature database (e.g., static features obtained from binaries). Therefore, many ransomware detection systems today employ behavioral, or dynamic, features in addition to static features. However, even though ransomware detection using dynamic features can deal with ransomware variants, it has the following limitations: (1) it requires the ransomware to be executed; (2) ransomware may behave differently in a real environment that differs from the controlled environment; and (3) a ransomware sample can become deactivated when its command and control (C&C) servers are taken down. These limitations make it impossible to compare multiple detection systems proposed by researchers under identical conditions.
To address these limitations, we present ransap, our new open dataset of ransomware storage access patterns. The dataset is currently available in a public repository. To the best of our knowledge, it is one of the few open datasets consisting of dynamic features of ransomware.
Our new open dataset includes storage access patterns of 7 significant ransomware samples and 5 popular benign software samples on various types and conditions of storage devices. Moreover, the dataset provides access patterns of ransomware variants, patterns recorded on a different version of an operating system, and patterns recorded on storage devices with full drive encryption enabled. We first present a hypervisor-based monitoring system for storage access patterns, followed by the design and implementation of a feature extractor and machine learning models for ransomware detection. Next, we present a detailed analysis and evaluation of our dataset. Finally, we discuss the limitations of our new dataset, compare it with other dynamic analysis methods and state-of-the-art ransomware detection, and outline future research directions.
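A minimal sketch of what a feature extractor over a storage access trace might compute, assuming a simplified trace format of (operation, LBA, payload entropy) tuples; this feature set is hypothetical and differs from ransap's actual extractor.

```python
def extract_features(trace):
    """Derive simple features from a storage access trace.

    trace: list of (op, lba, entropy) tuples, where op is 'R' or 'W' and
    entropy is the Shannon entropy of the sector payload in bits/byte.
    Hypothetical feature set for illustration only.
    """
    writes = [t for t in trace if t[0] == "W"]
    write_ratio = len(writes) / len(trace)
    # Ransomware tends to write high-entropy (encrypted) payloads.
    mean_write_entropy = (
        sum(t[2] for t in writes) / len(writes) if writes else 0.0
    )
    # Sequential vs. scattered access shows up in the mean LBA jump.
    lbas = [t[1] for t in trace]
    mean_jump = sum(abs(b - a) for a, b in zip(lbas, lbas[1:])) / max(
        len(lbas) - 1, 1
    )
    return write_ratio, mean_write_entropy, mean_jump
```

Feature vectors like these would then be fed to the machine learning models mentioned above.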
It is becoming increasingly difficult to acquire meaningful information in the field of digital forensics through the traditional approach, owing to advances in information security and anti-forensic techniques. To counter anonymous services and obstacles such as remotely stored data without authentication information, data encryption, device locks, and cryptocurrencies, it is important to acquire key information through live forensics at search and seizure sites. Thus, it is necessary to establish a response system that explores and processes credential information on site and extracts meaningful information based on the processed information. To this end, this study proposes a new digital forensics framework for application at search and seizure sites. The proposed framework is designed for extension through additional functions, owing to its modular development, even when new services and digital devices appear in the future. We then demonstrate its applicability through case studies of actual digital investigations.
Age is a soft biometric trait that can aid law enforcement in the identification of victims of Child Sexual Exploitation Material (CSEM) creation/distribution. Accurate age estimation of subjects can classify explicit content possession as illegal during an investigation. Automating this age classification has the potential to expedite content discovery and focus the investigation of digital evidence through the prioritisation of evidence containing CSEM. In recent years, artificial intelligence based approaches for automated age estimation have been created, and many public cloud service providers offer this service on their platforms. The accuracy of these algorithms has been improving in recent years. These existing approaches perform satisfactorily for adult subjects but wholly inadequately for underage subjects.
To this end, the largest underage facial age dataset, VisAGe, has been used in this work to train a ResNet50-based deep learning model, DeepUAge, which achieves performance surpassing the state of the art for age estimation of minors. This paper describes the design and implementation of the model, which is then evaluated, validated, and compared against existing facial age classifiers, achieving the best overall performance for underage subjects.
Estimating the acquisition time of digital photographs is a challenging task in temporal image forensics, but the application is in high demand for establishing temporal order among individual pieces of evidence and deducing the causal relationship of events in a court case. The forensic investigator needs to identify the timeline of events and look for patterns to gain a clear overview of activities associated with a crime. This paper explores the presence of defective pixels over time for estimating the acquisition date of digital pictures. We propose a technique to predict the acquisition timeslot of a digital picture using a set of candidate defective pixels in non-overlapping image blocks. First, potential candidate defective pixels are determined through their pixel neighbourhood and two proposed features, called local variation features, chosen to best fit a machine learning model. The machine learning approach models the temporal behaviour of camera sensor defects in each block, using the scores obtained from individually trained pixel defect locations, fused through majority voting. Interestingly, timeslot estimation using individual blocks has been shown to be more accurate when virtual sub-classes corresponding to halved timeslots are first considered prior to the reconstruction step. Finally, the last stage of the system combines block scores in a second majority voting operation to further enhance performance. Assessed on the NTIF image dataset, the proposed system reaches very promising results, with an estimated accuracy between 88% and 93% and clear superiority over a related state-of-the-art system.
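The two-stage majority-voting fusion described above can be sketched as follows. This is a simplified illustration: the real system votes over outputs of trained per-pixel-defect models, not raw labels.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common timeslot label among the predictions."""
    return Counter(predictions).most_common(1)[0][0]

def fuse_blocks(block_predictions):
    """Two-stage fusion: vote within each block (over per-defect
    predictions), then vote across blocks, mirroring the two
    majority-voting operations of the described system."""
    return majority_vote([majority_vote(p) for p in block_predictions])
```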
During digital forensic investigations, smartphone application data are an important target because they store personal user data such as memos, images, and videos. Some applications use data hiding or encryption to protect application data, including personal user information. While these methods are excellent for data protection, they act as anti-forensics in digital forensic investigations. LG smartphones provide Content Lock as a system application to protect the privacy of the user's memos and multimedia files. Content locked by Content Lock can only be accessed by entering the password specified by the user. In this paper, we identify the password verification process of Content Lock using reverse engineering and recover the password input by the user. The original data in locked files were acquired by analyzing two applications, QuickMemo+ and Gallery, that use Content Lock; no special data were required to obtain them. Our research shows that it is possible to obtain original data hidden or encrypted by system apps on LG smartphones.
In industrial control systems (ICS), programmable logic controllers (PLCs) are the embedded devices that directly control and monitor critical industrial infrastructure processes such as nuclear plants and power grid stations. Cyberattacks often target PLCs to sabotage a physical process. A memory forensic analysis of a suspect PLC can answer questions about an attack, including compromised firmware and manipulation of PLC control logic code and I/O devices. Given physical access to a PLC, collecting forensic information from PLC memory at the hardware level is risky and challenging: it may cause the PLC to crash or hang, since PLCs have proprietary, legacy hardware with heterogeneous architectures. This paper addresses this research problem and proposes a novel JTAG (Joint Test Action Group)-based framework, Kyros, for reliable PLC memory acquisition. Kyros systematically creates a JTAG profile of a PLC through hardware assessment, JTAG pin identification, memory map creation, and optimization of acquisition parameters. It also enables the community of interest (such as ICS owners, operators, and vendors) to develop JTAG profiles of further PLCs. We present a case study of Kyros implemented on an Allen-Bradley 1756-A10/B to help understand the framework's application to a real-world PLC used in industry settings. The sample PLC memory dumps are shared with the research community to facilitate further research.
Smartphones, which offer various features such as SMS/MMS, scheduling, messaging, and SNS, have become an integral part of modern life. Smartphones manage information intimately related to users in a self-contained manner, allowing them to provide such convenience efficiently. Such data, which can serve as key digital forensic evidence, are prime targets for investigators. However, extracting relevant data from smartphones with complicated structures requires considerable expertise. The analysis of smartphone backups is one approach to solving this problem. Smartphone manufacturers provide users with programs that include a backup protocol for backing up smartphone data, and these programs allow investigators to easily extract smartphone data. Efficient smartphone data extraction is possible by integrating backup programs that use different backup protocols into one framework; to achieve this integration, it is necessary to analyze each manufacturer's backup protocol. In this paper, we describe the results of analyzing HiSuite, the Huawei smartphone backup program, which uses its own backup protocol to produce smartphone backups. We uncovered the entire backup protocol through reverse engineering and experimentally verified that backup data can be obtained from Huawei smartphones using a tool we developed, based on our analysis, to replace HiSuite. We believe this paper will help digital forensics investigators develop better approaches to collecting data from smartphones.
An ever-growing number of malware variants target end-users and organizations. To reduce the amount of individual malware handling, security analysts apply similarity-finding techniques to cluster samples. A popular clustering method relies on similarity hashing functions, which create short representations of files and compare them to produce a score reflecting the similarity between the files. Despite the popularity of these functions, the limits of their application to malware samples have not been extensively studied so far. To help bridge this gap, we performed a set of experiments to characterize the application of these functions in long-term, realistic malware analysis scenarios. To do so, we introduce SHAVE, an idealized model of a similarity hashing-based antivirus engine. The evaluation of SHAVE consisted of applying two distinct hash functions (ssdeep and sdhash) to a dataset of 21 thousand actual malware samples collected over four years. We characterized this dataset based on the resulting clustering and discovered that: (i) smaller groups are more prevalent than large ones; (ii) the chosen threshold value may significantly change the conclusions about the prevalence of similar samples in a given dataset; (iii) establishing a ground truth for comparing similarity hashing functions has its issues, since clusters originating from traditional AV labeling routines may result from a completely distinct approach; (iv) the application of similarity hashing functions improves traditional AVs' detection rates by up to 40%; and finally (v) taking specific binary regions into account (e.g., instructions) leads to better classification results than hashing the entire binary file.
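A toy sketch of similarity-score clustering in the spirit of the SHAVE model: `difflib.SequenceMatcher` stands in for the ssdeep/sdhash comparison (which also returns a 0 to 100 score), and the greedy single-linkage grouping and threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """0-100 similarity score; stand-in for an ssdeep/sdhash comparison."""
    return int(100 * SequenceMatcher(None, a, b).ratio())

def cluster(samples, threshold=60):
    """Greedy single-linkage clustering: a sample joins the first cluster
    containing any member scoring at or above the threshold, as a
    similarity-hash-based AV engine might group incoming samples."""
    clusters = []
    for s in samples:
        for c in clusters:
            if any(similarity(s, m) >= threshold for m in c):
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters
```

Finding (ii) above falls directly out of this sketch: raising or lowering `threshold` merges or splits clusters and thus changes conclusions about sample similarity.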
The use of IP addresses by courts in their decisions is an issue of growing importance, especially at a time of increased use of the internet as a means to violate legal provisions of both civil and criminal law. This paper focuses predominantly on two issues: (1) the use of IP addresses as digital evidence in criminal and civil proceedings and possible mistakes in courts' approach to this specific evidence, and (2) the anonymisation of IP addresses in cases where IP addresses are to be considered personal data. The paper analyses the relevant judicial decisions of the Slovak Republic from 2008 to 2019 in which the courts used IP addresses as evidence. On this basis, the authors formulate conclusions on the current state and developing trends in the use of digital evidence in judicial proceedings. The authors demonstrate the common errors in courts' decisions regarding the use of IP addresses as evidence with respect to IP address anonymisation, the application of the in dubio pro reo principle in criminal proceedings, and the relationship between IP addresses, devices, and persons.
The data found on mobile phones, SIM cards, micro-SD cards, or Internet of Things devices are often decisive for judicial investigations because they provide a wealth of information to guide investigations, if not solve them outright. However, investigators have to deal with two problems that greatly complicate the extraction of data from digital equipment: encryption of data and damage to the evidence (explosion, immersion, deliberate destruction, air crash, accidents). In these cases, investigators often have to be creative in order to successfully extract data from electronic devices in a judicial setting. Using medical equipment for data extraction is a new approach that perfectly illustrates this creativity, without which the investigator would be blocked by new protection and encryption technologies.
In this paper we make use of four medical materials and pieces of equipment used routinely in the forensic autopsy field: the mobile 2D X-ray radiograph (used by dentists), the whole-body 3D X-ray scanner, the dental control unit (the burr and drill of dentists, used in legal odontology), and the dental paste used to model teeth when identifying disaster victims. This work introduces medical materials and equipment that investigators can use for data extraction and presents cheap alternatives to existing expensive solutions from the failure analysis industry. To demonstrate feasibility, we describe in detail experimental forensic cases in which medical devices could help data extraction: reverse engineering, diagnostic samples, and preparation of mobile phones for forensic transplantation.
In the final part, we look at the legal medicine of the future. We believe that the autopsy of tomorrow will have to be supplemented by analysis of the electronic components present in the body (pacemakers, bio-sensors). Medical examiners and experts in electronics must now work together to put in place the forensic procedures of tomorrow.
Publications in the digital forensics domain frequently come with tools: small pieces of functional software. These tools are often released to the public for others to reproduce results or use them for their own purposes. However, there has been no study of these tools to better understand what is available and what is missing. For this paper we analyzed almost 800 articles from pertinent venues from 2014 to 2019 to answer three questions: (1) what tools (i.e., in which domains of digital forensics) have been released; (2) are they still available, maintained, and documented; and (3) are there possibilities to enhance the status quo? We found 62 different tools, which we categorized according to digital forensics subfields. Only 33 of these tools were publicly available, and the majority were not maintained after development. To enhance the status quo, one recommendation is a centralized repository specifically for tested tools. This will require tool researchers (developers) to spend more time on code documentation and preferably to develop plugins instead of stand-alone tools.
PDF malware remains a major hacking technique, and distinguishing malicious PDFs among massive numbers of PDF files poses a challenge to forensic investigation. Machine learning has become a mainstream technology for malicious PDF document detection, either to help analysts in a forensic investigation or to prevent a system from being attacked. However, adversarial attacks against malicious document classifiers have emerged: carefully crafted adversarial examples based on precise manipulation may easily be misclassified. This poses a major threat to many detectors based on machine learning techniques. Various analysis and detection techniques are available for specific attacks, but the challenge from adversarial attacks is still not completely resolved, largely because most detection methods are tailor-made for existing adversarial examples only. In this paper, based on the observation that most of these adversarial examples were designed against specific models, we propose a novel approach that generates a group of mutated cross-model classifiers such that adversarial examples cannot easily pass all classifiers. Based on a Prediction Inversion Rate (PIR), we can effectively identify adversarial examples among benign documents. Our mutated group of classifiers enhances the power of prediction inconsistency across multiple models and, because of the mutation, eliminates the effect of transferability (a technique to make the same adversarial example work against multiple models). Our experiments show that our approach outperforms all existing state-of-the-art detection methods.
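The PIR idea can be sketched as follows, under the simplifying assumption that PIR is the fraction of mutated classifiers whose prediction flips relative to the base model; the threshold and this exact formulation are illustrative, not the paper's definition.

```python
def prediction_inversion_rate(base_pred, mutated_preds):
    """Fraction of mutated classifiers whose prediction differs from the
    base model's prediction. Hypothetical formulation of the PIR idea."""
    flips = sum(1 for p in mutated_preds if p != base_pred)
    return flips / len(mutated_preds)

def is_adversarial(base_pred, mutated_preds, threshold=0.5):
    # Adversarial examples crafted against the base model tend to lose
    # their effect under mutation, so their predictions flip often;
    # benign documents are classified consistently across the group.
    return prediction_inversion_rate(base_pred, mutated_preds) >= threshold
```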
The last decade witnessed an exponential growth of smartphones and their users, which has drawn massive attention from malware designers. Current malware detection engines are unable to cope with the volume, velocity, and variety of incoming malware, so the anti-malware community is investigating the use of machine learning and deep learning to develop malware detection models. However, research in other domains suggests that machine learning/deep learning models are vulnerable to adversarial attacks. Therefore, in this work, we propose a framework to construct malware detection models that are robust against adversarial attacks. We first constructed twelve different malware detection models using a variety of classification algorithms. Then, acting as an adversary, we proposed a Gradient-based Adversarial Attack Network to attack the above detection models. The attack is designed to convert the maximum number of malware samples into adversarial samples with minimal modifications to each sample. The proposed attack achieves an average fooling rate of 98.68% against twelve permission-based malware detection models and 90.71% against twelve intent-based malware detection models. We also identified the vulnerable permissions/intents that an adversary can use to force misclassifications in detection models. We then proposed three adversarial defense strategies to counter these attacks. The proposed Hybrid Distillation based defense strategy improved average accuracy by 54.21% for the twelve permission-based detection models and 59.14% for the intent-based detection models. We conclude that such adversarial study improves the performance and robustness of malware detection models and is essential before any real-world deployment.
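The minimal-modification attack idea can be sketched against a linear permission-based detector. The linear model, the greedy flip selection, and the choice to allow permission removal while ignoring app-functionality constraints are illustrative simplifications, not the paper's attack network.

```python
def minimal_evasion(features, weights, bias, max_flips=3):
    """Greedy gradient-style evasion of a linear detector.

    features: 0/1 permission vector; score > 0 means 'malware'.
    At each step, flip the bit whose weight moves the score most toward
    benign (removing a positively weighted permission or adding a
    negatively weighted one). Illustrative sketch; real attacks usually
    only add permissions, to preserve malicious functionality.
    """
    def score(v):
        return bias + sum(w * f for w, f in zip(weights, v))

    x = list(features)
    flips = 0
    while score(x) > 0 and flips < max_flips:
        # Gain from flipping bit i: w_i if set (remove), -w_i if unset (add).
        best = max(range(len(x)), key=lambda i: weights[i] * (1 if x[i] else -1))
        if weights[best] * (1 if x[best] else -1) <= 0:
            break  # no single flip reduces the score further
        x[best] = 1 - x[best]
        flips += 1
    return x, score(x) <= 0  # (modified vector, evaded?)
```

The fooling rate reported above would correspond to the fraction of malware samples for which such a search succeeds within the modification budget.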
This work forms the second part of a two-part series providing the necessary scaffolding for the digital forensic discipline to conduct effective peer review in its laboratories and units. The first part articulated the need for a structured approach to peer review in digital forensic investigations (Horsman and Sunde, 2020). Here in part two, the Phase-oriented Advice and Review Structure (PARS) for digital forensic investigations is offered. PARS is the first documented peer review methodology for the digital forensics field, a six-stage approach designed to formally support organisations and their staff in facilitating effective peer review of DF work, from investigative tasks to forensic activities and forensic analysis processes (Pollitt et al., 2018). This article discusses how the PARS methodology can be implemented and the options and mechanisms available to ease the incorporation of this model into existing practices. Both the early ‘Advisor’ and later ‘Reviewer’ roles in PARS are discussed, and their requirements and expectations are defined. Three template documents are provided and explained: the PARS Advisors template, the PARS Advisor Brief template, and the PARS Peer Review Hierarchy template, for direct use by organisations seeking to adopt the PARS methodology.
Recent developments in drone technology have produced a surge in commercial sales of drone devices, which have found use in many industries. However, the technology has been misused to commit crimes such as drug trafficking, robberies, and terror attacks. The digital forensics industry must match this speed of development with forensic tools and techniques. However, there is a lack of an agreed framework for the extraction and analysis of drone devices and a lack of support in available commercial digital forensics tools. In this research, an investigation into the extraction tools and analysis techniques available for drone devices has been performed to identify best practices for handling drone devices in a forensically sound manner. A new framework for a full forensic analysis of small to medium-sized commercial drone devices and their controllers is proposed, giving investigators a plan of action for performing forensic analysis on these devices. The proposed framework overcomes some limitations of other drone forensics investigation frameworks presented in the literature.
Recent evidence shows digital forensics experts are at risk of burnout and job-related stress. This may be related to the increase in digital evidence and/or repetitive exposure to challenging material, either face to face or via digital imagery, in real time or post-event. This exposure includes footage and/or sound recordings of extreme violence, child exploitation, suicide, and death scenes. The increased risk of stress also aligns with the changing nature of policing: rates of serious crime, especially robbery and homicide, are decreasing, while digital crime in many countries increases. This shift changes workload demands and requires new skillsets in addition to traditional investigation methods. Workplace stress has high financial and personal costs, impacting organisations, teams, family, friends, and the individual. For organisations and teams, occupational stress is associated with increases in workplace accidents, absenteeism, early retirement, higher intention to quit, lower motivation, and disillusionment with work, all of which affect the cohesion of forensic teams. The aim of this paper is to present a set of key evidence-based, targeted strategies that forensic science and policing agencies can roll out to manage workplace stress, thereby managing the risk of higher turnover, absenteeism, and lower workplace innovation.
Digital forensic investigators aim to identify, collect, and present reliable, accurate, and admissible evidence in court. However, anti-forensics manipulates, obfuscates, hides, and removes the remaining pieces of evidence in a compromised system. Anti-forensics interrupts investigation procedures; thus, investigators require specific defensive strategies (counter-anti-forensics) against it. This paper presents a survey exploring existing anti-forensic research and constructs one taxonomy of anti-forensic behaviour and another of open research tasks in anti-forensics.
Knowledge of the interactions between forensic agents (an investigator and an attacker) in a forensic environment helps the investigator to evaluate existing counter-anti-forensics and enables them to design and develop more advanced counter-anti-forensics. Therefore, in this paper, we first formulate a set of characteristics to model interactions between the attacker and the investigator (the players) in a realistic forensic environment. Next, we propose a game-theoretic approach to model the players' interactions. The attacker uses anti-forensics (i.e., rootkits) and the investigator employs counter-anti-forensics (i.e., anti-rootkits). We select and evaluate a set of game-theoretic models and algorithms to simulate the players' interactions. The evaluation shows that, among the selected game-theoretic models and algorithms, a gradient play algorithm performs satisfactorily in simulating the interactions in the forensic environment. The gradient play algorithm identifies the investigator's most stable and desired strategies after spending 10.0E-4 s and consuming 5.8 KB.
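The gradient-play dynamic mentioned in the abstract can be illustrated with a minimal sketch for a two-strategy attacker/investigator game; the payoff matrices, step size, and iteration count below are invented for illustration and are not the paper's model or parameters.

```python
# Toy gradient play in a 2x2 attacker/investigator game (illustrative only).
# p = probability the investigator plays strategy 0 (e.g., deploy anti-rootkit),
# q = probability the attacker plays strategy 0 (e.g., deploy rootkit).

def gradient_play(A, B, steps=2000, eta=0.05):
    """A: investigator payoff matrix, B: attacker payoff matrix (both 2x2,
    indexed [investigator_strategy][attacker_strategy])."""
    p, q = 0.5, 0.5
    for _ in range(steps):
        # Partial derivative of each player's expected payoff w.r.t.
        # their own mixing probability.
        dp = q * (A[0][0] - A[1][0]) + (1 - q) * (A[0][1] - A[1][1])
        dq = p * (B[0][0] - B[0][1]) + (1 - p) * (B[1][0] - B[1][1])
        # Ascend own payoff gradient, projecting back onto [0, 1].
        p = min(1.0, max(0.0, p + eta * dp))
        q = min(1.0, max(0.0, q + eta * dq))
    return p, q
```

With an invented game where defending is dominant for the investigator, the dynamic drives p toward 1 and, in response, q toward 0, which is the kind of stable strategy profile the evaluation above refers to.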
Digital forensics incident response (DFIR) specialists are expected to possess multidisciplinary skills, including expert knowledge of computer-related principles and technology. On the other hand, recent studies suggest that existing training and study programs may not fully address the needs of future DFIR professionals. To reveal possible gaps in practitioners' education and identify the most needed skills, we built a skillmap for DFIR following a threefold approach: (1) an online survey among DFIR experts; (2) a review of training programs; and (3) an analysis of job listings on LinkedIn. Each source was first analyzed on its own, and the findings were merged into a DFIR skillmap, which is the main contribution of this article. The results show that network forensics and incident handling are the most demanded skill domains. While these are covered by existing courses, the newly desired skills, in particular cloud forensics and encrypted data, need to get more space in training and education. We hope that this article provides educators with information on ways to improve in the years ahead.
AI Speakers are typical cloud-based internet of things (IoT) devices that store a variety of information regarding users on the cloud. Although analyzing encrypted traffic between these devices and the cloud, as well as the artifacts stored there, is an important research topic from the perspective of cloud-based IoT forensics, studies on directly analyzing encrypted traffic between AI Speakers and the cloud remain insufficient. In this study, we propose a forensic model that can collect and analyze encrypted traffic between an AI Speaker and the cloud based on certificate injection. The proposed model consists of five forensic methods for injecting the certificate into AI Speakers: porting the AI Speaker image to an Android device, porting the AI Speaker image using QEMU (Quick EMUlator), running an exploit through an AI Speaker app vulnerability, rewriting Flash memory using a hardware interface, and reworking and updating Flash memory. Using the proposed model, we show that encrypted traffic can be analyzed for various AI Speakers, such as the Amazon Echo Dot, Naver Clova, SKT NUGU Candle, SKT NUGU, and KT GiGA Genie, and that artifacts stored on the cloud can be obtained. In addition, we develop a verification tool that collects artifacts stored on the KT GiGA Genie cloud.
As the Internet of Things (IoT) era arrives, many Internet-connected devices are being released, and their use is increasing. One of these, the AI speaker, is designed to augment user convenience through voice recognition. The best-known products are the Amazon Echo family, including Echo and Echo Dot and, more recently, Echo Show with a display. An AI speaker with a display provides diverse functions such as surfing the Internet, taking pictures, making voice or video calls, and controlling smart home devices. To do this, Alexa cloud servers store a variety of configuration values and historical logs, and users can manage their own cloud-native data through interfaces (e.g., Web sites or mobile apps). For this reason, AI speakers with a smart display are similar to PCs or smartphones, which makes them very valuable from a digital forensic perspective. This paper focuses on detailed research on the second generation of Echo Show. The first step was to collect forensic artifacts stored inside the product by teardown, identifying eMMC flash memory chips and performing chip-off on Echo Show. Alexa app-related artifacts used on smartphones and how to automatically acquire data from the Alexa cloud were also investigated. From three sources, including Echo Show, a companion client (smartphone), and the Alexa cloud, it was possible to acquire user credentials, traces of photos, records of watched videos, log files, and Internet histories with timestamps. The second step was to identify the possibility of inferring new information by correlating artifacts collected from different sources. Integrative analysis enables investigators to track suspect activity across digital devices. Third, this paper introduces an updated version of the Cloud-based IoT Forensic Toolkit (CIFT) to support digital investigation of Echo Show.
Based on the technical findings, this study proposes a digital forensic framework for a smart speaker with a display that can play an important role as a digital witness at a crime scene. Until now, there has been no multilevel approach to acquisition and analysis of Echo Show data in the field of digital forensics. Therefore, this study makes a contribution to the digital forensic community.
Courts, legal practitioners, and the general public currently express a notable level of skepticism toward Artificial Intelligence (AI)-based digital evidence extraction techniques, and understandably so. Concerns have been raised about the transparency of closed-box AI models and their suitability for use in digital evidence mining. While AI models are firmly rooted in mathematical, statistical, and computational theories, the argument has centered on their explainability and understandability, particularly in terms of how they arrive at certain conclusions. This paper examines the issues with closed-box models and the goals and methods of explainability/interpretability. Most importantly, recommendations for interpretable AI-based digital forensics (DF) investigations are proposed.
This study presents the design of an Android-based software tool for use in AK rifle-related shooting investigations. The designed software tool, “Bullet Trajectory Plotter”, can estimate the potential trajectories of AK bullets that have perforated 1 mm sheet metal surfaces. The tool was developed as an Android application for use on a mobile phone or tablet computer. It is based on the results of the authors' two previous studies, which identified the correlation between the angles of incidence of standard steel-core AK bullets (7.62 × 39 mm) and the length of the bullet hole in 1 mm sheet metal gauges. The software tool proved to be a viable, quickly employable, user-friendly, and fully mobile field investigation tool that can be installed on investigators' mobile phones to identify the approximate angles of incidence of steel-core AK bullets perforating 1 mm sheet metal surfaces. Additionally, the software tool can be used as a new method of reconfirming findings from existing trajectory estimation methods.
Fuzzy hashing or similarity hashing (a.k.a. bytewise approximate matching) converts digital artifacts into an intermediate representation to allow an efficient (fast) identification of similar objects, e.g., for blacklisting. They gained a lot of popularity over the past decade with new algorithms being developed and released to the digital forensics community. When releasing algorithms (e.g., as part of a scientific article), they are frequently compared with other algorithms to outline the benefits and sometimes also the weaknesses of the proposed approach. However, given the wide variety of algorithms and approaches, it is impossible to provide direct comparisons with all existing algorithms. In this paper, we present the first classification of approximate matching algorithms which allows an easier description and comparisons. Therefore, we first reviewed existing literature to understand the techniques various algorithms use and to familiarize ourselves with the common terminology. Our findings allowed us to develop a categorization relying heavily on the terminology proposed by NIST SP 800-168. In addition to the categorization, this article presents an abstract set of attacks against algorithms and why they are feasible. Lastly, we detail the characteristics needed to build robust algorithms to prevent attacks. We believe that this article helps newcomers, practitioners, and experts alike to better compare algorithms, understand their potential, as well as characteristics and implications they may have on forensic investigations.
The amount of data to be handled in digital forensic investigations is continuously increasing, while the tools and processes used are not developing accordingly. This especially affects the digital forensic sub-field of file carving. Using the structure that the allocation algorithm imposes on stored data to increase the efficiency of the forensic process has been independently suggested by Casey and by us. Building on that idea, we set up an experiment to study the allocation algorithm of NTFS and its behavior over time from different points of view. This includes whether the allocation algorithm behaves the same regardless of Windows version or size of the hard drive, its adherence to the best-fit allocation strategy, and the distribution of the allocation activity over the available (logical) storage space. Our results show that drive size is not a factor, but there are differences in allocation behavior between Windows 7 and Windows 10. The results also show that the allocation strategy favors filling in holes in the already written area instead of claiming the unused space at the end of a partition, and that the area with the highest allocation activity slowly progresses from approximately 10 GiB into a partition towards the end as the disk fills up.
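The best-fit strategy examined above can be sketched as a minimal allocator model; this is a deliberately simplified abstraction, not NTFS's actual allocator, whose behavior (as the study shows) also depends on Windows version and fill level.

```python
def best_fit(holes, size):
    """Allocate `size` clusters from a list of free extents (offset, length),
    choosing the smallest hole that fits -- the 'fill holes first' behavior
    observed in the study -- rather than appending at the end of the partition."""
    candidates = [h for h in holes if h[1] >= size]
    if not candidates:
        return None                                # no hole large enough
    offset, length = min(candidates, key=lambda h: h[1])
    holes.remove((offset, length))
    if length > size:
        holes.append((offset + size, length - size))  # keep the shrunken hole
    return offset
```

A file-carving tool that assumes this hole-filling order can prioritize candidate cluster runs accordingly.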
Mainstream social platforms boast billions of users worldwide. In recent years, popular social platforms have seen a decline as users choose to migrate to alternative-tech social applications, driven by frustrations with mainstream platforms over alleged censorship of free speech and the banning of prominent public figures such as the former president of the United States (U.S.). The grouping effect of like-minded users on alternative-tech social platforms may foster events such as the U.S. Capitol attack on January 6th, 2021, where false information and extremist ideologies were spread through alt-tech applications such as Parler and MeWe. These cases demonstrate the immense forensic need to understand how alternative-tech social applications operate and what they store about their users' personal information and activities. We present the first account of a digital forensic study of (n = 9) alternative-tech social applications used on Android and iOS devices. Our analysis includes Parler, MeWe, CloutHub, Wimkin, Minds (Minds Mobile and Minds Chat), SafeChat, 2nd1st, and GETTR. Results revealed that some applications store unencrypted user information on the devices, such as usernames, phone numbers, email addresses, posts and comments, and private chat messages. Furthermore, some security vulnerabilities were discovered that allow users to download data that should have been private (such as sent private images) without authentication and authorization by other users. Finally, to aid in the analysis and automatic extraction of relevant evidence, we share the Alternative Social Networking Applications Analysis Tool (ASNAAT), which automatically aggregates forensically relevant data from alt-tech social networking applications when presented with a mobile device's forensic image.
We investigate the problem of creating ambiguous file system partitions, i.e., the possibility to have two fully functional file systems within a single file system partition. The problem is different from steganographic data hiding since there is no real distinction between content and cover data, and no translation process may be applied to the content data. Since typical file systems that occur in forensic analysis are usually unambiguous, ambiguous file system partitions may be useful corner cases in forensic tools and processes. We show that it is possible to create ambiguous file system partitions by integrating a guest file system into the structures of a host file system in two cases: We integrate a fully functional FAT32 into Ext3 and HFS+. In a third example we even integrate two guest file systems (HFS+ and FAT32) into a single Btrfs file system partition. We test common forensic tools on these examples and exhibit some deficiencies. Moreover, we develop a taxonomy of ambiguous file system partitions and argue that the existence of essential data at fixed positions still is a way to distinguish host from guest and so to heuristically reduce the ambiguity, without removing it completely.
The adaptive multi-rate (AMR) audio codec is an established standard of speech signal compression, used both for transmitting speech over mobile networks and for storing digital audio on handheld devices. The widespread use of the AMR codec and the high availability of tampering software have increased authentication cases in court. AMR double compression detection is a challenging engineering problem and a topic of multimedia forensics. As a general rule, a double compressed AMR file cannot be considered an original file. In this paper, a new method based on a support vector machine (SVM) is proposed to classify single and double compressed AMR digital audio. Instead of using the decoded speech waveform, the proposed method uses only compressed-domain speech features. Specific parameters are extracted from encoded AMR files and used to create a set of statistical features. After applying robust scaling to the features and selecting the SVM model, recursive feature elimination with correlation bias reduction (RFE-CBR) is used to determine the best number of features to maximize accuracy in SVM classification. The experiments reveal that the proposed algorithm can discriminate single and double compressed AMR speech, outperforming the published methods. The average accuracies using the TIMIT database and the CARIOCA1 database, the latter recorded from landline phone calls, are about 99%. Other experiments, including a frame offset attack and noise addition, found that the method is robust and reliable.
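The robust-scaling step in the pipeline above can be sketched as follows; the median/IQR formulation is the standard meaning of robust scaling, but the feature columns here are placeholders, not the paper's actual compressed-domain AMR parameters.

```python
import statistics

def robust_scale(feature_matrix):
    """Scale each feature column as (x - median) / IQR, a common
    preprocessing step before SVM training that is insensitive to
    outliers in the statistical features."""
    cols = list(zip(*feature_matrix))
    scaled_cols = []
    for col in cols:
        med = statistics.median(col)
        q = statistics.quantiles(col, n=4)      # [Q1, Q2, Q3]
        iqr = (q[2] - q[0]) or 1.0              # guard against zero spread
        scaled_cols.append([(x - med) / iqr for x in col])
    return [list(row) for row in zip(*scaled_cols)]
```

After scaling, every feature contributes on a comparable range, which matters for margin-based classifiers such as the SVM used here.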
The ability to discover cyberbullying “hotspots” on social media is vitally important for purposes of preventing victimization. This study attempts to develop a prediction model for identifying cyberbullying “hotspots” by analyzing the manifestation of charged language on Twitter. A total of 140,000 tweets were collected using a Twitter API during September 2019. The study reports that certain charged language in tweets can indicate a high potential for cyberbullying incidents. Cyberbullies tend to share negative emotion, demonstrate anger, and use abusive words to attack victims. The predictor variables related to “biology,” “sexual,” and “swear” can be further used to differentiate cyberbullies from non-cyberbullies. The study contributes to the detection of cyberbullying “hotspots” by providing an approach to identify a tendency for cyberbullying activity based on computational analysis of charged language. The contribution is significant for mediation agencies such as school counseling and law enforcement agencies.
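A lexicon-based charged-language scorer of the kind described above can be sketched as follows; the word lists, categories, and threshold are invented for illustration and are not the study's actual predictor variables or model.

```python
# Illustrative lexicon in the spirit of category-based charged-language
# predictors; real studies use much larger, validated word lists.
CHARGED = {
    "anger": {"hate", "stupid", "idiot"},
    "swear": {"damn", "hell"},
}

def charged_score(tweet: str) -> float:
    """Fraction of tokens in the tweet that fall into any charged category."""
    tokens = tweet.lower().split()
    hits = sum(1 for t in tokens if any(t in words for words in CHARGED.values()))
    return hits / len(tokens) if tokens else 0.0

def flag_hotspot(tweets, threshold=0.15):
    """Flag a tweet stream as a potential cyberbullying hotspot when its
    average charged-language score exceeds the (assumed) threshold."""
    avg = sum(charged_score(t) for t in tweets) / len(tweets)
    return avg > threshold
```

In practice such scores would feed a trained classifier rather than a fixed threshold, but the sketch captures how charged-language density signals hotspot potential.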
With the advancement of digital crime, the field of digital forensic science continues to grow, and with this growth, the search for faster and more accurate solutions to aid the investigation process becomes a necessity. In the context of the Brazilian judicial system, during a criminal investigation, forensic specialists extract, decode, and analyze the collected evidence to allow the prosecutor to make legal demands for a prosecution. These specialists have very little time to find criminal evidence, yet the analysis itself can take a long time. To address this problem, this paper proposes a micro-services-based application with artificial intelligence to process the large amounts of images contained in criminal evidence using open-source software. The image classification module contains several pre-trained classifiers built around the needs of forensic analysts at the Rio Grande do Norte District Attorney's Office (MPRN). The models were built to identify specific types of objects, for example, firearms, ammunition, Brazilian identity cards, text documents, cell phone screen captures, and nudity. The results obtained show that the system achieved good accuracy in most cases. This is extremely important in the context of this research, where false positives should be avoided in order to save analysts' working time. Moreover, the proposed architecture was able to speed up the image classification process using Apache Spark.
Since the advent of various IoT devices, the need for digital forensics on mobile devices, which people use most closely in their daily lives, has continued to grow. Moreover, as Bring Your Own Device (BYOD) becomes the trend, these devices store business-related information as well as private data. Thus, mobile devices are becoming the most critical evidence in digital forensics. For practical mobile forensics, it is necessary to accurately identify crime-related items among the many files inside a device. Also, the various pieces of user information needed for user behavior analysis should be effectively extracted from these files and managed as potential evidence to ensure integrity. This paper proposes an efficient forensic investigation method for mobile devices running Android OS, which holds the highest share among mobile devices worldwide. In this paper, we studied data pre-processing (classification and identification of data), data analysis, evidence management, and a taxonomy of Android data.
Backups on smartphones protect user data from the risk of data corruption and loss by storing personal information, media data, application data, and other settings. Although backups were originally designed to maintain and protect user data, these data can be important in criminal investigations requiring the verification of information related to a suspect's behavior at the time of an incident. However, backup data are often encrypted by each manufacturer using different schemes to protect user privacy. Since this encryption hinders the use of backup data in investigations, it is necessary to decrypt backup data by analyzing the encryption schemes of each manufacturer.
In this paper, we propose a widely applicable methodology that efficiently analyzes various encrypted backup schemes. Our methodology checks the backup features; identifies the backup data and the locations where they are encrypted; reverses the encryption schemes used in the backup; and finally decrypts the encrypted backup data. As a case study, we apply our methodology to the latest Samsung smartphone backup system, consisting of Samsung SmartSwitch Mobile and Samsung SmartSwitch PC. We acquired the backup data, including the encrypted data generated by the Samsung smartphone backup, in plain form, and revealed a technique to recover the Personal Identification Number (PIN) used for encryption through the authenticator included in the backup data. We also identified, through reverse engineering, a hidden feature that could be used to extract more data than was possible using the normal backup. Finally, we developed a decryption tool to verify that the encrypted backup data were correctly decrypted. Although, in this paper, we focused on the Samsung smartphone backup, our methodology can be applied to any smartphone backup system on the Android platform. We believe that our work will be very helpful to mobile investigators.
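The idea of recovering a PIN through a stored authenticator can be sketched as a small key-search loop; the PBKDF2 construction, salt, and iteration count below are illustrative assumptions, not Samsung's actual scheme as reversed in the paper.

```python
import hashlib

def recover_pin(authenticator: bytes, salt: bytes, iterations=100):
    """Brute-force a 4-digit PIN against a stored authenticator value.
    Each candidate PIN is run through the (assumed) key derivation and
    compared with the authenticator extracted from the backup."""
    for candidate in range(10000):
        pin = f"{candidate:04d}"
        derived = hashlib.pbkdf2_hmac("sha256", pin.encode(), salt, iterations)
        if derived == authenticator:
            return pin
    return None   # no 4-digit PIN matches this authenticator
```

Because the PIN space is tiny, any verifiable authenticator stored alongside the backup makes offline recovery feasible regardless of the derivation function's cost.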
With advancements in technology, people are taking advantage of mobile devices to access e-mail, search the web, and video chat. Therefore, extracting evidence from mobile phones is an important component of the investigation process. As Android app developers can leverage existing native libraries to implement parts of a program, evidentiary data may be generated and stored by these native libraries. However, current state-of-the-art Android static analysis tools, such as FlowDroid (Arzt et al., 2014), EviHunter (Cheng et al., 2018), DroidSafe (Gordon et al., 2015) and CHEX (Lu et al., 2012), adopt a conservative approach to data-flow analysis of native method invocations. None of these tools can capture the data flow within native libraries.
In this work, we propose a new approach to conduct native data-flow analysis for security vetting of Android native libraries and build an analysis framework, called LibDroid to compute data-flow and summarize taint propagation for Android native libraries. The common question app users and developers often face is whether certain native libraries contain hidden functions or utilize user private information. LibDroid aims to answer this question. Therefore, we build a precise and efficient data-flow analysis with the support of SummarizeNativeMethod algorithm, and pre-compute an Android Native Libraries Database (ANLD) for 13,138 native libraries collected from 2,627 real-world Android applications. The ANLD includes the taint propagation summary of each native method and potential evidentiary data generated or stored within the native library. We evaluate LibDroid on 52 open-source native libraries and 2,627 real-world apps. Our results show that LibDroid can precisely summarize the information flow within the native libraries.
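The use of pre-computed taint summaries can be sketched as a lookup that spares the analyzer from re-examining native code at every call site; the method names and summary entries below are invented for illustration and are not records from the real ANLD.

```python
# Hypothetical taint-summary database: native method -> set of argument
# indices whose taint propagates to the return value.
ANLD = {
    "Lcom/example/Native;->encode": {0},      # taint on arg0 reaches the result
    "Lcom/example/Native;->version": set(),   # no taint propagation at all
}

def propagate(method: str, arg_taints: list) -> bool:
    """Return True if the call's result is tainted, given per-argument
    taint flags. Unknown methods fall back to the conservative assumption
    that every argument may flow to the result (as prior tools do)."""
    summary = ANLD.get(method, set(range(len(arg_taints))))
    return any(arg_taints[i] for i in summary if i < len(arg_taints))
```

The contrast between the summary-based answer and the conservative fallback for unknown methods is exactly what makes pre-computing summaries for thousands of libraries worthwhile.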
Ransomware attacks are no longer limited to personal computers but are rapidly expanding to target smartphones as well. Attackers target smartphone devices to steal users' personal information for monetary purposes. Android is the most widely used mobile operating system, with the largest market share in the world, which makes it a primary target for cyber-criminals. Existing research on the detection of Android ransomware lacks significant features and relies on supervised machine learning techniques. However, supervised techniques have several limitations: they rely heavily on anti-virus vendors to provide explicit labels, and a given sample can be wrongly classified if the training set does not include related examples and/or if the labels are incorrect. Moreover, they may not detect unknown ransomware samples in real-time situations due to the absence of historical targets in the real world. In this work, an in-depth investigation of Android ransomware is conducted using reverse engineering and forensic analysis to extract static features. Furthermore, a novel RansomDroid framework based on clustering-based unsupervised machine learning techniques is proposed to address issues such as the mislabeling of historical targets and the detection of unforeseen Android ransomware. To the best of our knowledge, applying unsupervised machine learning techniques to the detection of Android ransomware is still an open area of research that has not yet been explored. The proposed RansomDroid framework employs a Gaussian Mixture Model, which offers a flexible and probabilistic approach to modeling the dataset. The framework utilizes feature selection and dimensionality reduction to further improve the performance of the model. The experimental results show that the proposed RansomDroid framework detects Android ransomware with an accuracy of 98.08% in 44 ms.
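The Gaussian Mixture Model at the core of such an unsupervised approach can be illustrated with a toy one-dimensional EM fit; this is a textbook sketch standing in for the clustering step, not the RansomDroid feature pipeline.

```python
import math

def gmm_em_1d(xs, iters=100):
    """Fit a two-component 1-D Gaussian mixture with EM. In a ransomware
    setting the components would model, e.g., benign vs. ransomware-like
    regions of a (reduced) static-feature space -- no labels required."""
    xs = sorted(xs)
    half = len(xs) // 2
    # Crude initialization: split the sorted sample at the median.
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [pi[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(1e-6,
                         sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk)
    return mu, var, pi
```

Unlike a supervised classifier, the mixture assigns soft cluster memberships, which is what lets the approach sidestep mislabeled or missing historical samples.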
The popularity of the Android OS has given rise to a significant number of malicious apps targeting it. Malware uses state-of-the-art obfuscation methods to hide its functionality and evade anti-malware engines. We present BLADE, a novel obfuscation-resilient detection system based on Opcode Segments. It makes three contributions: first, a novel Opcode Segment Document yields feature characterization resilient to obfuscation techniques; second, we perform semantics-based simplification of Dalvik opcodes to enhance this resilience; third, we evaluate the effectiveness of BLADE against different obfuscation techniques such as trivial obfuscation, string encryption, class encryption, reflection, and their combinations. Our approach is found to be effective, accurate, and resilient when tested against benchmark datasets for malware detection, familial classification, malware type detection, obfuscation type detection, and obfuscation-resilient familial classification.
Dataset available on: https://www.kaggle.com/vikassihag/blade-dataset
Mobile devices are increasingly involved in crimes. Therefore, digital evidence on mobile devices plays a more and more important role in crime investigations. Existing studies have designed tools to identify and/or extract digital evidence in the main memory or the file system of a mobile device. However, identifying and extracting digital evidence from the logging system of a mobile device is largely unexplored.
In this work, we aim to bridge this gap. Specifically, we design, prototype, and evaluate LogExtractor, the first tool to automatically identify and extract digital evidence from log messages on an Android device. Given a log message, LogExtractor first determines whether the log message contains a given type of evidentiary data (e.g., GPS coordinates) and then further extracts the value of the evidentiary data if the log message contains it.
Specifically, LogExtractor takes an offline-online approach. In the offline phase, LogExtractor builds an App Log Evidence Database (ALED) for a large number of apps via combining string and taint analysis to analyze the apps' code. Specifically, each record in the ALED contains 1) the string pattern of a log message that an app may write to the logging system, 2) the types of evidentiary data that the log message includes, and 3) the segment(s) of the string pattern that contains the value of a certain type of evidentiary data, where we represent a string pattern using a deterministic finite-state automaton. In the online phase, given a log message from a suspect's Android device, we match the log message against the string patterns in the ALED and extract evidentiary data from it if the matching succeeds. We evaluate LogExtractor on 65 benchmark apps from DroidBench and 12.1 K real-world apps. Our results show that a large number of apps write a diverse set of data to the logging system and LogExtractor can accurately extract them.
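The online matching phase described above can be sketched with regular expressions standing in for the deterministic finite-state automata; the patterns and evidence types below are invented examples, not entries from the real ALED.

```python
import re

# Hypothetical ALED records: each pairs a log-message pattern with the
# type of evidentiary data it captures; named groups mark the segments
# that hold the evidentiary values.
ALED = [
    (re.compile(r"^Location update: lat=(?P<lat>-?\d+\.\d+), "
                r"lon=(?P<lon>-?\d+\.\d+)$"), "gps_coordinates"),
    (re.compile(r"^Dialing number (?P<number>\+?\d{7,15})$"), "phone_number"),
]

def extract_evidence(log_message: str):
    """Match a log line against the pattern database and return
    (evidence_type, captured_fields), or None if nothing matches."""
    for pattern, evidence_type in ALED:
        m = pattern.match(log_message)
        if m:
            return evidence_type, m.groupdict()
    return None
```

Since the patterns are pre-computed offline from app code, the online phase reduces to cheap matching against logs pulled from the suspect device.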
With increasing quantity and sophistication, malicious code is becoming difficult to discover and analyze. Modern NLP (Natural Language Processing) techniques have improved significantly and are being used in practice to accomplish various tasks. Recently, many research works have applied NLP to finding malicious patterns in Android and Windows apps. In this paper, we exploit this fact and apply NLP techniques to an intermediate representation (MAIL – Malware Analysis Intermediate Language) of Android apps to build a similarity index model, named SIMP. We use SIMP to find malicious patterns in Android apps. MAIL provides control flow patterns to enhance malware analysis and makes the code accessible to NLP techniques for checking semantic similarities. For applying NLP, we treat a MAIL program as one document. The control flow patterns in this program, when divided into specific blocks (words), become sentences. We apply TFIDF and Bag-of-Words over these control flow patterns to build SIMP. Our proposed model, when tested with real malware and benign Android apps using different validation methods, achieved an MCC (Matthews Correlation Coefficient) ≥ 0.94 between the true and predicted values, indicating that it can predict whether a new sample is malware or benign with a high success rate.
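The TFIDF/Bag-of-Words indexing step can be sketched over whitespace tokens standing in for MAIL control-flow blocks; the token vocabulary and smoothing choice are generic illustrations, not SIMP's actual configuration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) over whitespace tokens, where
    each 'document' plays the role of one MAIL program and each token a
    control-flow block (word)."""
    tokenized = [doc.split() for doc in docs]
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    n = len(docs)
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log((1 + n) / (1 + df[t]))
                        for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0
```

Programs sharing rare control-flow blocks score high under this index, which is the basis for flagging a new sample as similar to known malware.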