SecureBERT: A Domain-Specific Language Model
for Cybersecurity
Ehsan Aghaei1, Xi Niu1, Waseem Shadid1, and Ehab Al-Shaer2
1University of North Carolina at Charlotte, USA
{eaghaei, xniu2, waseem}@uncc.edu
2Carnegie Mellon University, USA
ehab@cmu.edu
Abstract.
Natural Language Processing (NLP) has recently gained wide attention in cybersecurity, particularly in Cyber Threat Intelligence (CTI) and cyber automation. Increased connectivity and automation have revolutionized the world's economic and cultural infrastructures, while also introducing risks in terms of cyber attacks. CTI is information that helps cybersecurity analysts make intelligent security decisions. It is often delivered in the form of natural language text, which must be transformed into a machine-readable format through an automated procedure before it can be used for automated security measures.
This paper proposes SecureBERT, a cybersecurity language model capable of capturing text connotations in cybersecurity text (e.g., CTI) and therefore successful in automating many critical cybersecurity tasks that would otherwise rely on human expertise and time-consuming manual effort. SecureBERT has been trained on a large corpus of cybersecurity text. To make SecureBERT effective not just in retaining general English understanding, but also when applied to text with cybersecurity implications, we developed a customized tokenizer as well as a method to alter pre-trained weights. SecureBERT is evaluated using the standard Masked Language Model (MLM) test as well as two additional standard NLP tasks. Our evaluation studies show that SecureBERT³ outperforms existing similar models, confirming its capability for solving crucial NLP tasks in cybersecurity.
Keywords: cyber automation, cyber threat intelligence, language model
1 Introduction
The adoption of security automation technologies has grown year after year. The cybersecurity industry is saturated with solutions that protect users from malicious sources, safeguard mission-critical servers, and protect personal information, healthcare data, intellectual property, and sensitive financial data. Enterprises invest in technology to manage such security solutions, typically aggregating a large amount of data into a single system to facilitate organizing and retrieving
³ https://github.com/ehsanaghaei/SecureBERT
key information in order to better identify where they face risk or where specific traffic originates or terminates. Recently, as social networks and ubiquitous computing have grown in popularity, the overall volume of digital text content has increased. These textual contents span a range of domains, from a simple tweet or news blog article to more sensitive information such as medical records or financial transactions. In the cybersecurity context, security analysts analyze relevant data to detect cyber threat-related information, such as vulnerabilities, in order to monitor, prevent, and control potential risks. For example, cybersecurity agencies such as MITRE, NIST, CERT, and NVD invest millions of dollars in human expertise to analyze, categorize, prioritize, publish, and fix disclosed vulnerabilities annually. As the number of products grows, and therefore the number of vulnerabilities increases, it is critical to employ an automated system capable of identifying vulnerabilities and quickly delivering an effective defense measure.
By enabling machines to swiftly process or synthesize human language, natural language processing (NLP) has been widely employed to automate text analytic operations in a variety of domains, including cybersecurity. Language models, as the core component of modern text analytic technologies, play a critical role in NLP applications by enabling computers to interpret qualitative input and transform it into quantitative representations. There are several well-known and well-performing language models, such as ELMo [20], GPT [21], and BERT [12], trained on general English corpora and used for a variety of NLP tasks such as machine translation, named entity recognition, text classification, and semantic analysis. There is continuous discussion in the research community over whether it is beneficial to employ these off-the-shelf models as a baseline and then fine-tune them on domain-specific tasks. The assumption is that the fine-tuned models will retain their basic linguistic knowledge of general English while developing "advanced" knowledge of the domain during fine-tuning [7].
However, certain domains, such as cybersecurity, are highly sensitive: they involve processing critical data, and any error in this procedure may expose the entire infrastructure to cyber threats. Automated processing of cybersecurity text therefore requires a robust and reliable framework. Cybersecurity terms are either uncommon in general English (such as ransomware, API, OAuth, exfiltrate, and keylogger) or have multiple meanings (homographs) across domains (e.g., honeypot, patch, handshake, and virus). This gap in language structure and semantic context complicates text processing and suggests that a standard English language model may be incapable of accommodating the vocabulary of cybersecurity texts, leading to a restricted or limited comprehension of cybersecurity implications.
In this study, we address this critical cybersecurity problem by introducing a new language model called SecureBERT, built on the state-of-the-art NLP architecture BERT [12], which is capable of processing text with cybersecurity implications effectively. SecureBERT is generic enough to be applied to a variety of cybersecurity tasks, such as phishing detection [10], code and malware analysis [24], and intrusion detection [2]. SecureBERT is a pre-trained cybersecurity language model that has a fundamental understanding of both word-level and sentence-level semantics, an essential building block for processing any cybersecurity report. In this context, we collected and processed a large corpus of 1.1 billion words (1.6 million in vocabulary size) from a variety of cybersecurity text resources, including news, reports, textbooks, articles, research papers, and videos. On top of the pre-trained tokenizer, we developed a customized tokenization method that preserves the standard English vocabulary as much as possible while effectively accommodating new tokens with cybersecurity implications. Additionally, we utilized a practical way to optimize the retraining procedure by introducing random noise to the pre-trained weights. We rigorously evaluated the performance of our proposed model on three different tasks, namely the standard Masked Language Model (MLM) task, sentiment analysis, and Named Entity Recognition (NER), to demonstrate SecureBERT's performance in processing both cybersecurity and general English inputs.
2 Overview of BERT Language Model
BERT (Bidirectional Encoder Representations from Transformers) [12] is a transformer-based neural network technique for natural language processing pre-training. BERT trains language models based on the entire set of words in a sentence or query (bidirectional training), rather than the traditional way of training on an ordered sequence of words (left-to-right or combined left-to-right and right-to-left). BERT thus allows the language model to learn word context from all surrounding words rather than just the word that immediately precedes or follows it.
BERT leverages the Transformer, an attention mechanism that can learn contextual relations between words and subwords in a sequence. The Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that generates a prediction for the given task. Since BERT's goal is to generate a language model, only the encoder mechanism is necessary [27]. This transformer encoder reads the entire input at once instead of reading the text in order.
Building a BERT model requires two steps: pre-training and fine-tuning. In the pre-training stage, the model is trained on unlabeled data with two different pre-training tasks, namely Masked LM (MLM) and Next Sentence Prediction (NSP). MLM masks some percentage of the input tokens (15%) at random and then predicts them through a learning procedure. The final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary. NSP is designed to capture the relationship between two sentences, which is not directly captured by language modeling. To train a model that understands sentence relationships, BERT trains on a binarized next-sentence prediction task that can be trivially generated from any monolingual corpus: it takes a pair of sentences as input and, 50% of the time, replaces the second sentence with a random one from the corpus. To perform fine-tuning, the BERT model is initialized with the pre-trained parameters
and then all parameters are fine-tuned using labeled data from downstream tasks. The BERT model has a unified architecture across different tasks, with only minor differences between the pre-trained and final downstream architectures. The pre-trained BERT model used the BooksCorpus (800M words) and English Wikipedia (2,500M words) and improved the state of the art for eleven NLP tasks, for example achieving a GLUE [28] score of 80.4%, a 7.6% absolute improvement over the previous best results, and 93.2% accuracy on the Stanford Question Answering Dataset (SQuAD) [23].
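To make the MLM mechanics concrete, the following minimal sketch (not from the original paper) runs masked-word prediction with an off-the-shelf RoBERTa checkpoint using the Hugging Face transformers library; the example sentence is our own.

```python
# Minimal sketch of masked-word prediction (the MLM objective), using
# the Hugging Face `transformers` library with an off-the-shelf
# RoBERTa checkpoint. Illustrative only; not the authors' code.
from transformers import pipeline

# A fill-mask pipeline wraps tokenization, the encoder forward pass,
# and the softmax over the vocabulary described above.
fill_mask = pipeline("fill-mask", model="roberta-base")

# RoBERTa's mask token is "<mask>".
for pred in fill_mask("The attacker sent a phishing <mask> to the victim."):
    print(f"{pred['token_str'].strip():>12}  score={pred['score']:.3f}")
```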
RoBERTa [19] is a derivative of BERT, claimed to be a robustly optimized version of BERT with certain modifications to the tokenizer and the network architecture, which drops the NSP task during training. RoBERTa extends BERT's MLM, intentionally learning to predict hidden spans of text inside otherwise unannotated language samples. With considerably bigger mini-batches and learning rates, RoBERTa changes important hyperparameters of BERT training, enabling it to noticeably improve on the MLM objective and, accordingly, overall performance on all standard fine-tuning tasks. As a result of this enhanced performance and demonstrated efficacy, we develop SecureBERT on top of RoBERTa.
3 Data Collection
We collected a large number (98,411) of online cybersecurity-related text documents, including books, blogs, news, security reports, videos (subtitles), journals and conferences, white papers, tutorials, and survey papers, using our web crawler tool⁴. We created a corpus of 1.1 billion words, splitting it into 2.2 million documents, each with an average size of 512 words, using the SpaCy⁵ text analytic tool. Table 1 shows the resources and the distribution of our collected dataset for pre-training SecureBERT.
This corpus contains various forms of cybersecurity text, from basic information, news, Wikipedia, and tutorials, to more advanced texts such as CTI, research articles, and threat reports. When aggregated, this collection offers a wealth of domain-specific connotations and implications that are quite useful for training a cybersecurity language model. Table 2 lists the web resources from which we obtained our corpus.
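The paper does not spell out the chunking procedure beyond the use of SpaCy, so the following is a hypothetical sketch of how such ~512-word documents could be produced; the file path and function name are illustrative.

```python
# Hypothetical sketch of the corpus-chunking step: segment raw text
# into sentences with spaCy, then pack sentences into documents of
# roughly 512 words each. The paper does not detail this procedure.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

# Keep the parser (needed for sentence boundaries); drop unused pipes.
nlp = spacy.load("en_core_web_sm", disable=["ner", "lemmatizer"])

def chunk_text(text: str, max_words: int = 512) -> list[str]:
    """Pack spaCy sentences into documents of at most max_words words."""
    docs, current, n_words = [], [], 0
    for sent in nlp(text).sents:
        words = len(sent.text.split())
        if n_words + words > max_words and current:
            docs.append(" ".join(current))
            current, n_words = [], 0
        current.append(sent.text.strip())
        n_words += words
    if current:
        docs.append(" ".join(current))
    return docs

corpus_docs = chunk_text(open("crawled_page.txt").read())  # hypothetical file
```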
4 Methodology
We present two approaches in this section for refining and training our domain-specific language model. We begin by describing a strategy for developing a customized tokenizer on top of the pre-trained generic English tokenizer, followed by a practical approach for biasing the pre-trained weights to improve weight adjustment and therefore achieve a more efficient learning process.
⁴ Sample data: dropbox.com/sh/jg45zvfl7iek12i/AAB7bFghED9GmkO5YxpPLIuma?dl=0
⁵ https://spacy.io/usage
Type No. Documents
Articles 8,955
Books 180
Survey Papers 515
Blogs/News 85,953
Wikipedia (cybersecurity) 2,156
Security Reports 518
Videos 134
Total 98,411
Vocabulary size 1,674,434 words
Corpus size 1,072,798,637 words
Document size 2,174,621 documents (paragraphs)
Table 1: Details of the collected cybersecurity corpora for training SecureBERT.
Websites
Trendmicro, NakedSecurity, NIST, GovernmentCIO Media, CShub, Threatpost,
Techopedia, Portswigger, Security Magazine, Sophos, Reddit, FireEye, SANS,
Drizgroup, NETSCOUT, Imperva, DANIEL MIESSLER, Symantec, Kaspersky,
PacketStorm, Microsoft, RedHat, Tripwire, Krebs on Security, SecurityFocus,
CSO Online, InfoSec Institute, Enisa, MITRE
Security Reports and Whitepapers
APT Notes, VNote, CERT, Cisco Security Reports, Symantec Security Reports
Books, Articles, and Surveys
Tags: cybersecurity, vulnerability, cyber attack, hack
ACM CCS (2014-2020), IEEE NDSS (2016-2020), IEEE Oakland (1980-2020),
IEEE Security and Privacy (1980-2020), Arxiv, Cybersecurity and Hacking books
Videos (YouTube)
Cybersecurity courses, tutorial, and conference presentations
Table 2: The resources collected for cybersecurity textual data.
4.1 Customized Tokenizer
A word-based tokenizer primarily extracts each word as a unit of analysis, called a token. It assigns each token a unique index, then uses those indices to encode any given sequence of tokens. Pre-trained BERT models return the weight of each token according to these indices. Therefore, in order to fully utilize a pre-trained model to train a specialized model, the common token indices must match, whether using the indices of the original or of the new customized tokenizer.
To build the tokenizer, we employ the byte pair encoding (BPE) [25] method to construct a vocabulary of words and subwords from the cybersecurity corpora, as it has been shown to outperform word-based tokenizers. The character-based encoding used in BPE allows for learning a small subword vocabulary that can encode any input text without introducing "unknown" tokens [22].
Our objective is to create a vocabulary that retains the tokens already provided in RoBERTa's tokenizer while also incorporating additional unique cybersecurity-related tokens. In this context, we extract 50,265 tokens from the cybersecurity corpora to generate the initial token vocabulary $\Psi_{Sec}$. We intentionally make the size of $\Psi_{Sec}$ the same as that of RoBERTa's token vocabulary $\Psi_{RoBERTa}$, as we intend to imitate the original RoBERTa design.
If $\Psi_{Sec}$ represents the vocabulary set of SecureBERT and $\Psi_{RoBERTa}$ denotes the vocabulary set of the original RoBERTa, both of size 50,265, then $\Psi_{Sec}$ shares 32,592 mutual tokens with $\Psi_{RoBERTa}$, leaving 17,673 tokens that contribute uniquely to the cybersecurity corpus, such as firewall, breach, crack, ransomware, malware, phishing, mysql, kaspersky, obfuscated, and vulnerability, which RoBERTa's tokenizer instead analyzes as byte pairs:

$V_{mutual} = \Psi_{Sec} \cap \Psi_{RoBERTa}$ (32,592 tokens)
$V_{distinct} = \Psi_{Sec} \setminus \Psi_{RoBERTa}$ (17,673 tokens)
Studies [29] show that using complete words (rather than subwords) for terms that are common in a specific domain can enhance performance during training, since alignments may be more challenging to learn when target tokens require attention from multiple source tokens. Hence, to build our tokenizer we keep all mutual tokens and assign them their original indices, i.e., their indices in RoBERTa's tokenizer, while the remaining new tokens are assigned random, non-conflicting indices. Ultimately, we develop a customized tokenizer with a vocabulary size identical to that of the original model, which includes tokens commonly seen in cybersecurity corpora in addition to cross-domain tokens. Our tokenizer encodes the mutual tokens $V_{mutual}$ exactly as the original model does, ensuring that the model returns the appropriate pre-trained weights, while for the new terms $V_{distinct}$ the indices, and accordingly the weights, are random.
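A simplified sketch of this vocabulary-alignment idea is shown below, using the Hugging Face tokenizers library; the corpus path is hypothetical and the authors' exact implementation may differ.

```python
# Simplified sketch of the tokenizer-alignment idea in Section 4.1:
# train a 50,265-token byte-level BPE vocabulary on the cybersecurity
# corpus, keep mutual tokens at their original RoBERTa indices, and
# place the remaining domain-specific tokens at the freed-up indices.
# Illustrative only; the authors' exact implementation may differ.
from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaTokenizerFast

VOCAB_SIZE = 50_265

# 1. Train a byte-level BPE tokenizer on the cybersecurity corpus.
bpe = ByteLevelBPETokenizer()
bpe.train(files=["cybersecurity_corpus.txt"],          # hypothetical path
          vocab_size=VOCAB_SIZE,
          special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
psi_sec = set(bpe.get_vocab().keys())

# 2. Compare with RoBERTa's original vocabulary.
roberta = RobertaTokenizerFast.from_pretrained("roberta-base")
psi_roberta = roberta.get_vocab()                      # token -> index

v_mutual = psi_sec & set(psi_roberta)                  # keep original indices
v_distinct = psi_sec - set(psi_roberta)                # new domain tokens

# 3. Mutual tokens keep RoBERTa's indices; distinct tokens fill the
#    indices of RoBERTa tokens that were dropped, without conflicts.
new_vocab = {tok: psi_roberta[tok] for tok in v_mutual}
free_ids = sorted(set(range(VOCAB_SIZE)) - set(new_vocab.values()))
new_vocab.update(dict(zip(sorted(v_distinct), free_ids)))
print(len(v_mutual), "mutual tokens;", len(v_distinct), "distinct tokens")
```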
4.2 Weight Adjustments
The RoBERTa model already stores weights for all the tokens in its general English vocabulary. Many tokens, such as email, internet, computer, and phone, convey similar meanings in general English and in the cybersecurity domain. On the other hand, homographs such as adversary, virus, worm, exploit, and crack carry different meanings in different domains. Using the weights from RoBERTa as initial weights for all the tokens, and then re-training on the cybersecurity corpus, will in fact not update those weights much, leading to overfitting on such tokens, because the training data for RoBERTa (16 GB) is 25 times larger than that for SecureBERT. When a neural network is trained on a small dataset, it may memorize the training samples, resulting in overfitting and poor performance in evaluation. Due to unbalanced or sparse sampling of points in the high-dimensional input space, small datasets may also pose a more difficult mapping task for neural networks to tackle.
One strategy for smoothing the input space and making it simpler to learn is to add noise to the model during training, which increases the robustness of the training process and reduces generalization error. Following previous work on maintaining robust neural networks [18, 31, 33], adding noise to an unstable neural network model with a limited training set can act as a regularizer and help reduce overfitting during training. It is generally observed that introducing noise to a neural network during training can yield substantial gains in generalization performance in some cases. Previous research has demonstrated that such noise-based training is analogous to a form of regularization in which an additional term is added to the error function [8]. This noise can be imposed either on the input data or between hidden layers of the deep neural network. When a model is trained from scratch, noise is typically added to the hidden layers at each iteration, whereas in continual learning it can be introduced to the input data to generalize the model and reduce error [4, 16].
Since training SecureBERT is a continual learning process, rather than using the initial weights from RoBERTa directly, we introduce a small "noise" to the initial model's weights for the mutual tokens. The goal is to bias these tokens to be "a little away" from their original meanings, so as to capture their new connotations in a cybersecurity context, but not "too far away" from standard language, since any domain language is still written in English and still carries standard natural language implications. If a token conveys a similar meaning in general English and cybersecurity, its adjusted weight will conceptually tend to converge back to the original vector space during training. Otherwise, it will deviate further from the initial model to accommodate its new meaning in cybersecurity. For the new words introduced by the cybersecurity corpus, we use the Xavier weight initialization algorithm [14] to assign initial weights.
We instantiated SecureBERT using the architecture of the pre-trained RoBERTa-base model, which consists of twelve hidden transformer and attention layers and one input layer. We adopted the base version (RoBERTa-base) for its efficiency and usefulness. Smaller models are less expensive to train, and the cybersecurity domain has far less diversity of corpora than general language, implying that a compact model should suffice. A model's size is not the only factor to consider; usability is another critical factor in evaluating a model's quality. Since large models are difficult to use and expensive to maintain, it is more convenient and practical to use a smaller, portable architecture.
Each input token is represented by an embedding vector with a dimension of 768 in pre-trained RoBERTa. Our objective is to manipulate these embedding vector representations for each of the 50,265 tokens in the vocabulary by adding a small symmetric noise. Statistical symmetric noise with a probability density function equal to the normal distribution is known as Gaussian noise. We introduce this noise by applying a random Gaussian function to the weight vectors. Therefore, for any token $t$, let $W_t$ be the embedding vector of token $t$:

$$W_t = [w^t_1, w^t_2, \ldots, w^t_{768}] \tag{1}$$

where $w^t_k$ represents the $k$-th element of the embedding vector for token $t$. Let $\mathcal{N}(\mu, \sigma)$ denote the normal distribution with mean $\mu$ and standard deviation $\sigma$. For each weight vector $W_t$, the noisy vector $W'_t$ is defined as:

$$W'_t = W_t \oplus (W_t \otimes \epsilon), \quad \epsilon \sim \mathcal{N}(\mu, \sigma) \tag{2}$$

where $\epsilon$ represents the noise value, and $\oplus$ and $\otimes$ denote element-wise addition and multiplication, respectively.
The SecureBERT model is designed to emulate RoBERTa's architecture, as shown in Fig. 1. To train SecureBERT as a cybersecurity language model, we use our collected corpora and customized tokenizer. The SecureBERT model contains 12 hidden layers and 12 attention heads, where each hidden state has a dimension of 768 and the input embedding dimension is 512, the same as RoBERTa. In RoBERTa's embedding matrix (768 × 50,265 elements), the average and variance of the pretrained embedding weights are 0.0125 and 0.0173, respectively. We picked µ = 0 and σ = 0.01 to generate zero-mean noise, since we want the adjusted weights to remain in the same space as the original weights. We replace the original weights in the initial model with the noisy weights calculated using Eq. 2.
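The following sketch illustrates Eq. 2 and the Xavier initialization for new tokens on a RoBERTa embedding matrix; it is our own illustration, and the placeholder new_token_ids stands in for the $V_{distinct}$ indices.

```python
# Minimal sketch of the weight-adjustment step (Eq. 2): perturb the
# pre-trained embedding of every mutual token with multiplicative
# Gaussian noise, and Xavier-initialize embeddings of new tokens.
# Illustrative only; variable names are our own.
import torch
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM.from_pretrained("roberta-base")
emb = model.roberta.embeddings.word_embeddings.weight  # (50265, 768)

mu, sigma = 0.0, 0.01                 # zero-mean noise, as in the paper
with torch.no_grad():
    eps = torch.normal(mu, sigma, size=emb.shape)      # eps ~ N(mu, sigma)
    emb += emb * eps                  # W' = W + (W * eps), element-wise

    # Hypothetical: new_token_ids would hold the V_distinct indices.
    new_token_ids = torch.tensor([], dtype=torch.long)  # placeholder
    for idx in new_token_ids:
        # Xavier initialization for tokens with no pre-trained weights.
        torch.nn.init.xavier_uniform_(emb[idx].view(1, -1))
```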
Fig. 1: SecureBERT architecture for pre-training against masked words.
5 Evaluation
We trained the model against MLM using dynamic masking using RoBERTa’s
hyperparameters running for 250
,
000 training steps for 100 hours on 8 Tesla V100
GPUs with
Batch_size
= 18, the largest possible mini-batch size for V100 GPUs.
We evaluate the model on cybersecurity masked language modeling and other
general purpose underlying tasks including sentiment analysis and named entity
recognition (NER) to further show the performance and efficiency of SecureBERT
in processing the cybersecurity text as well as reasonable effectiveness in general
language.
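As an illustration of this setup, the sketch below wires dynamic masking and the stated batch size into a standard Hugging Face training loop; the dataset path, tokenizer directory, and learning rate are assumptions, not values from the paper.

```python
# Sketch of the MLM pre-training loop with dynamic masking, using the
# Hugging Face Trainer. Batch size and step count mirror the text;
# corpus and tokenizer paths are hypothetical.
from transformers import (RobertaTokenizerFast, RobertaForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = RobertaTokenizerFast.from_pretrained("./securebert_tokenizer")
model = RobertaForMaskedLM.from_pretrained("roberta-base")  # noisy weights set separately

dataset = load_dataset("text", data_files="cyber_corpus.txt")["train"]
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True,
                                          max_length=512), batched=True)

# Dynamic masking: 15% of tokens are re-masked at random each time a
# batch is drawn, rather than being fixed once in preprocessing.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True,
                                           mlm_probability=0.15)

args = TrainingArguments(output_dir="securebert", max_steps=250_000,
                         per_device_train_batch_size=18,
                         learning_rate=6e-4,   # illustrative value only
                         save_steps=10_000)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=dataset).train()
```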
5.1 Masked Language Model (MLM)
In this section, we evaluate the performance of SecureBERT in predicting the
masked word in an input sentence, known as the standard Masked Language
Model (MLM) task.
Owing to the unavailability of a testing dataset for the MLM task in the cybersecurity domain, we create one. We manually extracted sentences from a high-quality source of cybersecurity reports, the MITRE technique descriptions, which are not included in the pre-training dataset. Rather than masking an arbitrary word in a sentence, as in RoBERTa, we masked only a verb or noun in each sentence, because a verb denotes an action and a noun denotes an object, both of which are important for understanding the sentence's semantics in a cybersecurity context. Our testing dataset contains 17,341 records: 12,721 records containing a masked noun (2,213 unique nouns) and 4,620 records containing a masked verb (888 unique masked verbs in total). Figures 2a and 2b show the MLM performance for predicting the masked nouns and verbs, respectively. Both figures present the hit rate of the masked word within the top-N model predictions. SecureBERT consistently outperforms RoBERTa-base, RoBERTa-large, and SciBERT, even though RoBERTa-large is a considerably larger model trained on a massive corpus, with 355M parameters.
Our investigations show that RoBERTa-large (much larger than the RoBERTa-base we used as the initial model) is a fairly powerful language model for general cybersecurity language. However, on advanced cybersecurity context it consistently fails to deliver the desired output. For example, three cybersecurity sentences are depicted in Fig. 3, each with one word masked. The three masked terms, reconnaissance, hijacking, and DDoS, are commonly used in cybersecurity corpora. SecureBERT understands the context and properly predicts these masked words, while RoBERTa's predictions are remarkably different. For cybersecurity tasks including cyber threat intelligence, vulnerability analysis, and threat action extraction [1,3], such knowledge is crucial, and a model with SecureBERT's properties is highly beneficial. According to the prediction results, the models do marginally better at predicting verbs than nouns.
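The hit-rate metric itself is straightforward to reproduce; the following sketch (our own, assuming the released SecureBERT checkpoint on the Hugging Face Hub) counts a record as a hit when the gold word appears among the top-N fill-mask candidates.

```python
# Sketch of the top-N hit-rate metric used in Fig. 2: a prediction
# counts as a hit if the ground-truth masked word appears among the
# model's top-N fill-mask candidates. Dataset format is hypothetical,
# and the checkpoint id assumes the publicly released model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="ehsanaghaei/SecureBERT", top_k=10)

def topn_hit_rate(records, n=10):
    """records: list of (masked_sentence, gold_word) pairs."""
    hits = 0
    for sentence, gold in records:
        preds = fill_mask(sentence)[:n]
        if any(p["token_str"].strip().lower() == gold.lower() for p in preds):
            hits += 1
    return hits / len(records)

sample = [("Adversaries may perform network <mask> to gather information.",
           "reconnaissance")]
print(f"top-10 hit rate: {topn_hit_rate(sample):.2f}")
```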
(a) Performance in predicting objects. (b) Performance in predicting verbs.
Fig. 2: Cybersecurity masked word prediction evaluation on RoBERTa-base,
RoBERTa-large, SciBERT, and SecureBERT.
Fig. 3: A comparative example of predicting a masked token. SecureBERT shows a good understanding of the cybersecurity context, while the other models consistently fail on advanced texts.
5.2 Ablation Study
SecureBERT outperforms existing language models in predicting cybersecurity-related masked tokens, demonstrating its ability to digest and interpret in-domain texts. To achieve this performance while maintaining general language understanding, we used two specific strategies: the development of a customized tokenizer and weight adjustment.
SecureBERT employs an effective weight modification by introducing a small noise to the initial weights of the pre-trained model, since it is trained on a smaller corpus than off-the-shelf large models. This enables it to fit the cybersecurity context better and more efficiently, particularly when learning homographs and phrases carrying multiple meanings across domains. The noise places each token in a slightly deviated space, allowing the algorithm to adjust embedding weights more effectively.
In Table 3, given a few simple sentences containing common homographs in the cybersecurity context, we provide the masked-word predictions of four different models: SB (SecureBERT), SB* (SecureBERT trained without weight adjustment), RB (RoBERTa-base), and RL (RoBERTa-large). For example, the word Virus in the cybersecurity context refers to malicious code that spreads between devices to damage, disrupt, or steal data. On the other hand, a Virus is also a nanoscopic infectious agent that replicates solely within an organism's living cells. For a simple sentence such as "Virus causes <mask>.", the four models deliver different predictions, each corresponding to its associated context. RB and RL return cancer, infection, and diarrhea, which are plausible in a general (or medical) context but wrong in the cybersecurity domain. SB* returns words including problem, disaster, and crashes, which differ from the outcomes of the generic models yet remain far from any cybersecurity implication. In contrast, SB's predictions, DoS, crash, and reboot, clearly demonstrate how weight adjustment improves inference of the cybersecurity context by returning the most relevant words for the masked token.
Masked Sentence Model Predictions
Virus causes <mask>.
SB: DoS | crash | reboot
SB*: problems | disaster | crashes
RB: cancer | autism | paralysis
RL: cancer | infection | diarrhea
Honeypot is used in <mask>.
SB: Metasploit | Windows | Squid
SB*: images | software | cryptography
RB: cooking | recipes | baking
RL: cooking | recipes | baking
A worm can <mask> itself to spread.
SB: copy | propagate | program
SB*: use | alter | modify
RB: allow | free | help
RL: clone | use | manipulate
Firewall is used to <mask>.
SB: protect | prevent | detect
SB*: protect | hide | encrypt
RB: protect | communicate | defend
RL: protect | block | monitor
zombie is the other name for a <mask>.
SB: bot | process | trojan
SB*: worm | computer | program
RB: robot | clone | virus
RL: vampire | virus | person
Table 3: Masked-word prediction results returned by SecureBERT (SB), SecureBERT without weight adjustment (SB*), RoBERTa-base (RB), and RoBERTa-large (RL) for sentences containing homographs.
The customized tokenizer, on the other hand, also plays an important role in enhancing SecureBERT's MLM performance, by indexing more cybersecurity-related tokens (especially complete words, as mentioned in Section 4.1). To further show the impact of the SecureBERT tokenizer on masked-word prediction, we train SecureBERT with the original RoBERTa tokenizer without any customization (but with weight adjustment). As depicted in Fig. 4a and Fig. 4b, compared to the pre-trained tokenizer, SecureBERT's tokenizer clearly yields a higher hit rate, which highlights the significance of creating a domain-specific tokenizer for any domain-specific language model.
(a) Performance in predicting objects. (b) Performance in predicting verbs.
Fig. 4: Demonstrating the impact of the customized tokenizer in masked word
prediction performance.
5.3 Fine-tuning Tasks
To further demonstrate the performance of SecureBERT on general NLP tasks, we conduct two fine-tuning experiments: sentiment analysis and named entity recognition (NER).
Task 1: Sentiment Analysis
In the first task, we evaluate SecureBERT's comprehension of general English in the form of sentiment analysis. We use the publicly available Rotten Tomatoes dataset⁶, which contains a corpus of movie reviews used for sentiment analysis. Socher et al. [26] used Amazon's Mechanical Turk to create fine-grained labels for all parsed phrases in the corpus. The dataset is comprised of tab-separated files with phrases from the Rotten Tomatoes dataset. Each sentence has been parsed into many phrases by the Stanford parser. Each phrase has a "Phrase Id" and each sentence has a "Sentence Id"; no duplicated phrases are included in the dataset. Phrases are labeled with five sentiment impressions: negative, somewhat negative, neutral, somewhat positive, and positive. We build a single-layer MLP on top of each of the four models as a classification layer to classify the phrases into the corresponding labels. We trained two versions of SecureBERT, called raw SecureBERT and modified SecureBERT. The former is the original RoBERTa model trained as is on the collected cybersecurity corpora, while the latter utilizes our customized tokenizer and the weight adjustment method. We trained the models for 1,500 steps with learning rate = 1e-5 and batch size = 32, minimizing a cross-entropy loss using the Adam optimizer with softmax as the activation function in the classification layer. Fig. 5 shows SecureBERT's architecture for sentiment analysis.

⁶ https://www.kaggle.com/c/movie-review-sentiment-analysis-kernels-only
Fig. 5: SecureBERT architecture for sentiment analysis downstream task.
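A minimal sketch of this classification setup is given below; it reflects the single-layer head described above, though the checkpoint name and class structure are our own assumptions.

```python
# Sketch of the sentiment fine-tuning setup: a single-layer MLP
# classification head over the encoder, with five output classes.
# Checkpoint id and class names are hypothetical.
import torch.nn as nn
from transformers import RobertaModel

class SentimentClassifier(nn.Module):
    def __init__(self, checkpoint="ehsanaghaei/SecureBERT", n_classes=5):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # <s> token embedding
        return self.head(cls)               # logits; softmax is applied
                                            # inside the cross-entropy loss
```

In training, the logits would be passed to a cross-entropy loss with the Adam optimizer at the learning rate and batch size stated above.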
In Table 4, we show the performance of both models and compare them with the original RoBERTa-base and SciBERT, fine-tuned on the Rotten Tomatoes dataset. As illustrated, even though SciBERT is trained on a broader range of domains (biomedical and computer science), both SecureBERT versions perform quite similarly to SciBERT. In addition, the modest 2.23% accuracy and 2.02% F1-score gap relative to RoBERTa-base demonstrates that SecureBERT remains effective at analyzing general English as well. Furthermore, the modified model performs slightly better than the raw version, with a 0.34% accuracy and 0.71% F1-score improvement.
Model Name Error Accuracy F1-Score
RoBERTa-base 0.733 69.46 69.12
SciBERT 0.768 67.76 67.08
SecureBERT (raw) 0.788 66.89 66.39
SecureBERT (modified) 0.771 67.23 67.10
Table 4: Performance of different models on the general English sentiment analysis task.
In the second task, we fine-tune SecureBERT to conduct cybersecurity-related named entity recognition (NER). NER is a special task in information extraction that focuses on identifying and classifying named entities referenced in unstructured text into predefined categories such as person names, organizations, places, time expressions, etc.
Since general-purpose NER models may not always function well in cybersecurity, we must employ a domain-specific dataset to train an effective model for this particular field. Training a NER model in cybersecurity is challenging, since there is little publicly available domain-specific data and, even where data exists, it is unclear how to establish consensus on which classes should be extracted. Nevertheless, here we fine-tune SecureBERT on a relatively small cybersecurity-related dataset simply to show the overall performance and compare it with existing models. MalwareTextDB [17] is a dataset containing 39 annotated APT reports with a total of 6,819 sentences. In the NER version of this dataset, the sentences are annotated with four different tags:
Action: referring to an event, such as "registers", "provides", and "is written".
Subject: referring to the initiator of the Action, such as "The dropper" and "This module".
Object: referring to the recipient of the Action, such as "itself", "remote persistent access", and "The ransom note"; it also refers to word phrases that provide elaboration on the Action, such as "a service", "the attacker", and "disk".
Modifier: referring to tokens that link to other word phrases providing elaboration on the Action, such as "as" and "to".
In addition, within each sentence, all words not labeled with any of the above tags, as well as pad tokens, are assigned a dummy label ("O") and excluded when calculating the performance metrics.
For named entity recognition, we take the hidden states (the transformer output) of every input token from SecureBERT's last layer. These are fed to a fully connected dense layer with N units, where N equals the total number of defined entities. Since SecureBERT's tokenizer breaks some words into pieces (byte pairs), in such cases we predict only the first piece of the word.
Fig. 6: SecureBERT architecture for named entity recognition (NER).
We trained the model for 3 epochs with learning rate = 2e-5 and batch size = 8, minimizing a cross-entropy loss using the Adam optimizer with softmax as the activation function in the classification layer.
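The sketch below mirrors this token-classification setup; the tag list comes from this section, while the checkpoint name and the use of ignore_index = -100 for excluded positions are our own assumptions.

```python
# Sketch of the NER head described above: per-token logits from the
# last hidden layer, one unit per entity tag, with excluded positions
# (padding and non-first sub-word pieces) ignored in the loss.
# Checkpoint id and the -100 ignore-index convention are assumptions.
import torch.nn as nn
from transformers import RobertaModel

TAGS = ["Action", "Subject", "Object", "Modifier", "O"]

class NERTagger(nn.Module):
    def __init__(self, checkpoint="ehsanaghaei/SecureBERT"):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, len(TAGS))

    def forward(self, input_ids, attention_mask, labels=None):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        logits = self.head(h)                       # (batch, seq, tags)
        if labels is None:
            return logits
        # Pad tokens and sub-word pieces after the first carry -100,
        # so only each word's first piece contributes to the loss.
        loss = nn.CrossEntropyLoss(ignore_index=-100)(
            logits.view(-1, len(TAGS)), labels.view(-1))
        return loss, logits
```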
Model Name Precision Recall F1-Score
RoBERTa-base 84.92 87.53 86.20
SciBERT 83.19 85.84 84.49
SecureBERT (raw) 86.08 86.81 86.44
SecureBERT (modified) 85.24 88.10 86.65
Table 5: Performance of different models trained on the MalwareTextDB dataset for the NER task.
As in the previous task, Table 5 shows the performance of both SecureBERT versions as well as the two other models. As depicted, modified SecureBERT outperforms all other models, despite the fact that the MalwareTextDB dataset still contains many sentences with general English meaning and is not a purely cybersecurity-specific corpus.
6 Related Works
Beltagy et al. [7] unveiled SciBERT, which follows BERT's exact architecture and improves performance on downstream scientific NLP tasks by unsupervised pretraining from scratch on a multi-domain corpus of 1.14M scientific papers, of which 18% are from computer science and 82% from the biomedical domain.
In similar work on the biomedical domain, Lee et al. [15] introduced BioBERT, which focuses on the biomedical domain using the BERT architecture and publicly available biomedical datasets. This work also creates a benchmark for biomedical NLP featuring a diverse set of tasks such as named entity recognition, relation extraction, document classification, and question answering. ClinicalBERT [5] is another domain adaptation of BERT, trained on clinical text from the MIMIC-III database.
Thus far, the use of language models such as BERT for cybersecurity applications is quite limited. CyBERT [6] presents a classifier for cybersecurity feature claims, fine-tuning a pre-trained BERT language model to identify cybersecurity claims from a large pool of sequences in ICS device documents. There are also other studies on fine-tuning BERT in the cybersecurity domain. Das et al. [11] fine-tune BERT to hierarchically classify cybersecurity vulnerabilities to weaknesses. Additionally, there are several studies on fine-tuning BERT for NER tasks, such as [9], [32], and [13]. Yin et al. [30] fine-tuned pre-trained BERT on cybersecurity text and developed a classification layer on top of their model, ExBERT, to extract sentence-level semantic features and predict the exploitability of vulnerabilities. There is also another model, called SecBERT⁷ and published in a GitHub repository, which trains BERT on a cybersecurity corpus drawn from "APTnotes"⁸, "Stucco-Data: Cyber security data sources"⁹, "CASIE: Extracting Cybersecurity Event Information from Text"¹⁰, and "SemEval-2018 Task 8: Semantic Extraction from CybersecUrity REports using Natural Language Processing (SecureNLP)"¹¹. However, at the time of submitting this paper, we could not find any accompanying article describing its details or providing a proof of concept to discuss.
7 Conclusions and Future Works
This study introduces SecureBERT, a transformer-based language model for processing cybersecurity text, built on RoBERTa. We presented two practical techniques for developing a successful model that captures contextual relationships and semantic meanings in cybersecurity text: designing a customized tokenization tool on top of RoBERTa's tokenizer and altering the pre-trained weights. SecureBERT is trained on a corpus of 1.1 billion words collected from a range of online cybersecurity resources. SecureBERT has been evaluated on the standard Masked Language Model (MLM) task as well as the named entity recognition (NER) task. The evaluation outcomes demonstrate promising results in grasping cybersecurity language.
⁷ https://github.com/jackaduma/SecBERT
⁸ https://github.com/kbandla/APTnotes
⁹ https://stucco.github.io/data/
¹⁰ https://ebiquity.umbc.edu/_file_directory_/papers/943.pdf
¹¹ https://ebiquity.umbc.edu/_file_directory_/papers/943.pdf
References
1. Aghaei, E., Al-Shaer, E.: ThreatZoom: neural network for automated vulnerability mitigation. In: Proceedings of the 6th Annual Symposium on Hot Topics in the Science of Security, pp. 1-3 (2019)
2. Aghaei, E., Serpen, G.: Host-based anomaly detection using eigentraces feature extraction and one-class classification on system call trace data. Journal of Information Assurance and Security (JIAS) 14(4), 106-117 (2019)
3. Aghaei, E., Shadid, W., Al-Shaer, E.: ThreatZoom: hierarchical neural network for CVEs to CWEs classification. In: International Conference on Security and Privacy in Communication Systems, pp. 23-41. Springer (2020)
4. Ahn, H., Cha, S., Lee, D., Moon, T.: Uncertainty-based continual learning with adaptive regularization. Advances in Neural Information Processing Systems 32 (2019)
5. Alsentzer, E., Murphy, J.R., Boag, W., Weng, W.H., Jin, D., Naumann, T., McDermott, M.: Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323 (2019)
6. Ameri, K., Hempel, M., Sharif, H., Lopez Jr., J., Perumalla, K.: CyBERT: cybersecurity claim classification by fine-tuning the BERT language model. Journal of Cybersecurity and Privacy 1(4), 615-637 (2021)
7. Beltagy, I., Lo, K., Cohan, A.: SciBERT: a pretrained language model for scientific text. arXiv preprint arXiv:1903.10676 (2019)
8. Bishop, C.M.: Training with noise is equivalent to Tikhonov regularization. Neural Computation 7(1), 108-116 (1995). https://doi.org/10.1162/neco.1995.7.1.108
9. Chen, Y., Ding, J., Li, D., Chen, Z.: Joint BERT model based cybersecurity named entity recognition. In: 2021 The 4th International Conference on Software Engineering and Information Management, pp. 236-242 (2021)
10. Dalton, A., Aghaei, E., Al-Shaer, E., Bhatia, A., Castillo, E., Cheng, Z., Dhaduvai, S., Duan, Q., Hebenstreit, B., Islam, M.M., et al.: Active defense against social engineering: the case for human language technology. In: Proceedings of the First International Workshop on Social Threats in Online Conversations: Understanding and Management, pp. 1-8 (2020)
11. Das, S.S., Serra, E., Halappanavar, M., Pothen, A., Al-Shaer, E.: V2W-BERT: a framework for effective hierarchical multiclass classification of software vulnerabilities. In: 2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA), pp. 1-12. IEEE (2021)
12. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
13. Gao, C., Zhang, X., Liu, H.: Data and knowledge-driven named entity recognition for cyber security. Cybersecurity 4(1), 1-13 (2021)
14. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249-256. JMLR Workshop and Conference Proceedings (2010)
15. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234-1240 (2020)
16. Li, X., Yang, Z., Guo, P., Cheng, J.: An intelligent transient stability assessment framework with continual learning ability. IEEE Transactions on Industrial Informatics 17(12), 8131-8141 (2021)
17. Lim, S.K., Muis, A.O., Lu, W., Ong, C.H.: MalwareTextDB: a database for annotated malware articles. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1557-1567. Association for Computational Linguistics, Vancouver, Canada (2017). https://doi.org/10.18653/v1/P17-1143
18. Liu, X., Cheng, M., Zhang, H., Hsieh, C.J.: Towards robust neural networks via random self-ensemble. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 369-385 (2018)
19. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
20. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018)
21. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)
22. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)
23. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016)
24. Sajid, M.S.I., Wei, J., Alam, M.R., Aghaei, E., Al-Shaer, E.: DodgeTron: towards autonomous cyber deception using dynamic hybrid analysis of malware. In: 2020 IEEE Conference on Communications and Network Security (CNS), pp. 1-9. IEEE (2020)
25. Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., Arikawa, S.: Byte pair encoding: a text compression scheme that accelerates pattern matching (1999)
26. Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631-1642 (2013)
27. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998-6008 (2017)
28. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R.: GLUE: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461 (2018)
29. Wang, C., Cho, K., Gu, J.: Neural machine translation with byte-level subwords. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 9154-9160 (2020)
30. Yin, J., Tang, M., Cao, J., Wang, H.: Apply transfer learning to cybersecurity: predicting exploitability of vulnerabilities by description. Knowledge-Based Systems 210, 106529 (2020)
31. You, Z., Ye, J., Li, K., Xu, Z., Wang, P.: Adversarial noise layer: regularize neural network by adding noise. In: 2019 IEEE International Conference on Image Processing (ICIP), pp. 909-913. IEEE (2019)
32. Zhou, S., Liu, J., Zhong, X., Zhao, W.: Named entity recognition using BERT with whole world masking in cybersecurity domain. In: 2021 IEEE 6th International Conference on Big Data Analytics (ICBDA), pp. 316-320. IEEE (2021)
33. Zur, R.M., Jiang, Y., Pesce, L.L., Drukker, K.: Noise injection for training artificial neural networks: a comparison with weight decay and early stopping. Medical Physics 36(10), 4810-4818 (2009)
... Li et al. [30] develop AttacKG, a similar tool that extracts techniques from reports. The latest tool, TTPHunter [31], focuses on APT reports and uses SecureBERT [32]. ...
Article
Full-text available
Analysts in Security Operations Centers (SOCs) are often occupied with time-consuming investigations of alerts from Network Intrusion Detection Systems (NIDSs). Many NIDS rules lack clear explanations and associations with attack techniques, complicating the alert triage and the generation of attack hypotheses. Large Language Models (LLMs) may be a promising technology to reduce the alert explainability gap by associating rules with attack techniques. In this paper, we investigate the ability of three prominent LLMs (ChatGPT, Claude, and Gemini) to reason about NIDS rules while labeling them with MITRE ATT&CK tactics and techniques. We discuss prompt design and present experiments performed with 973 Snort rules. Our results indicate that while LLMs provide explainable, scalable, and efficient initial mappings, traditional machine learning (ML) models consistently outperform them in accuracy, achieving higher precision, recall, and F1-scores. These results highlight the potential for hybrid LLM-ML approaches to enhance SOC operations and better address the evolving threat landscape. By utilizing automation, the presented methods will enhance the analysis efficiency of SOC alerts, and decrease workloads for analysts.
... They are increasingly used for telecom-related tasks, such as domain knowledge generation (e.g., [44], [45]), code generation (e.g., [46], [47], [48], [49]), and network configuration generation (e.g., [50], [51], [52]). LLMs also excel in classification tasks, including network security (e.g., [53], [54], [55], [56], [57]), text (e.g., [58], [59]), image (e.g., [60], [61]), and network traffic classification (e.g., [62], [63]). In network optimization, LLM-enabled techniques like reinforcement learning (e.g., [64], [65], [66]), black-box optimization [67], convex optimization (e.g., [68], [69]), and heuristic algorithms (e.g., [70], [71]) enhance wireless network management. ...
Preprint
Full-text available
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. Exploiting the heterogeneous capabilities of edge LLMs is crucial for diverse emerging applications, as it enables greater cost-effectiveness and reduced latency. In this work, we introduce \textit{Mixture-of-Edge-Experts (MoE2^2)}, a novel collaborative inference framework for edge LLMs. We formulate the joint gating and expert selection problem to optimize inference performance under energy and latency constraints. Unlike conventional MoE problems, LLM expert selection is significantly more challenging due to the combinatorial nature and the heterogeneity of edge LLMs across various attributes. To this end, we propose a two-level expert selection mechanism through which we uncover an optimality-preserving property of gating parameters across expert selections. This property enables the decomposition of the training and selection processes, significantly reducing complexity. Furthermore, we leverage the objective's monotonicity and design a discrete monotonic optimization algorithm for optimal expert selection. We implement edge servers with NVIDIA Jetson AGX Orins and NVIDIA RTX 4090 GPUs, and perform extensive experiments. Our results validate that performance improvements of various LLM models and show that our MoE2^2 method can achieve optimal trade-offs among different delay and energy budgets, and outperforms baselines under various system resource constraints.
... This may include using code-inspection capabilities of LLMs to find software vulnerabilities, and code-writing capabil-ities of LLMs to create novel malware and exploits. However, cybersecurity teams may also leverage LLMs in a similar fashion to preemptively identify and remedy software vulnerabilities and strengthen cybersecurity in other ways (Aghaei et al., 2022;Ferrag et al., 2023). Consequently, the net impact of LLMs on cybersecurity is currently not clear and deserves further study (Hendrycks et al., 2021a). ...
Article
Full-text available
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose 200+ concrete research questions.
... With an impressive accuracy rate of up to 92.65%, VulDetect effectively identifies software vulnerabilities. Similarly, Aghaei et al. [12] introduced SecureBERT, a domain-specific LLM tailored for cybersecurity tasks. By utilizing customized tokenizers and altered pre-trained weights, SecureBERT outperforms existing models in capturing text connotations in cybersecurity text. ...
Preprint
Full-text available
The integration of Internet of Things (IoT) technology in various domains has led to operational advancements, but it has also introduced new vulnerabilities to cybersecurity threats, as evidenced by recent widespread cyberattacks on IoT devices. Intrusion detection systems are often reactive, triggered by specific patterns or anomalies observed within the network. To address this challenge, this work proposes a proactive approach to anticipate and preemptively mitigate malicious activities, aiming to prevent potential damage before it occurs. This paper proposes an innovative intrusion prediction framework empowered by Pre-trained Large Language Models (LLMs). The framework incorporates two LLMs: a fine-tuned Bidirectional and AutoRegressive Transformers (BART) model for predicting network traffic and a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model for evaluating the predicted traffic. By harnessing the bidirectional capabilities of BART the framework then identifies malicious packets among these predictions. Evaluated using the CICIoT2023 IoT attack dataset, our framework showcases a notable enhancement in predictive performance, attaining an impressive 98% overall accuracy, providing a powerful response to the cybersecurity challenges that confront IoT networks.
... The CAN-C-BERT model uses BERT's bidirectional text comprehension, which is pretrained on a large text corpus. CAN-SecureBERT relies on the SecureBERT architecture [167], which is a version of RoBERTa tailored for cybersecurity, and which has been pretrained on a large corpus of cybersecurity-related data. Lastly, CAN-LLAMA2 is based on Meta's LLAMA2 model that is pre-trained on an extensive dataset covering various domains. ...
Preprint
Full-text available
The rapid evolution of communication networks in recent decades has intensified the need for advanced Network and Service Management (NSM) strategies to address the growing demands for efficiency, scalability, enhanced performance, and reliability of these networks. Large Language Models (LLMs) have received tremendous attention due to their unparalleled capabilities in various Natural Language Processing (NLP) tasks and generating context-aware insights, offering transformative potential for automating diverse communication NSM tasks. Contrasting existing surveys that consider a single network domain, this survey investigates the integration of LLMs across different communication network domains, including mobile networks and related technologies, vehicular networks, cloud-based networks, and fog/edge-based networks. First, the survey provides foundational knowledge of LLMs, explicitly detailing the generic transformer architecture, general-purpose and domain-specific LLMs, LLM model pre-training and fine-tuning, and their relation to communication NSM. Under a novel taxonomy of network monitoring and reporting, AI-powered network planning, network deployment and distribution, and continuous network support, we extensively categorize LLM applications for NSM tasks in each of the different network domains, exploring existing literature and their contributions thus far. Then, we identify existing challenges and open issues, as well as future research directions for LLM-driven communication NSM, emphasizing the need for scalable, adaptable, and resource-efficient solutions that align with the dynamic landscape of communication networks. We envision that this survey serves as a holistic roadmap, providing critical insights for leveraging LLMs to enhance NSM.
... Similar to the approach taken by [2], [3] developed a domain-specific language model named SecureBERT, which was trained on cybersecurity threat intelligence reports. This training utilized automated labeling techniques facilitated by semantic role labeling. ...
Preprint
The vast majority of cybersecurity information is unstructured text, including critical data within databases such as CVE, NVD, CWE, CAPEC, and the MITRE ATT&CK Framework. These databases are invaluable for analyzing attack patterns and understanding attacker behaviors. Creating a knowledge graph by integrating this information could unlock significant insights. However, processing this large amount of data requires advanced deep-learning techniques. A crucial step towards building such a knowledge graph is developing a robust mechanism for automating the extraction of answers to specific questions from the unstructured text. Question Answering (QA) systems play a pivotal role in this process by pinpointing and extracting precise information, facilitating the mapping of relationships between various data points. In the cybersecurity context, QA systems encounter unique challenges due to the need to interpret and answer questions based on a wide array of domain-specific information. To tackle these challenges, it is necessary to develop a cybersecurity-specific dataset and train a machine learning model on it, aimed at enhancing the understanding and retrieval of domain-specific information. This paper presents a novel dataset and describes a machine learning model trained on this dataset for the QA task. It also discusses the model's performance and key findings in a manner that maintains a balance between formality and accessibility.
... With AttacKG, Li et al. [30] introduced a similar tool that extracts techniques from reports. The most recent work on extracting TTPs from reports is TTPHunter [31], which targets APT reports and leverages SecureBERT [32]. ...
Preprint
Analysts in Security Operations Centers (SOCs) are often occupied with time-consuming investigations of alerts from Network Intrusion Detection Systems (NIDS). Many NIDS rules lack clear explanations and associations with attack techniques, complicating the alert triage and the generation of attack hypotheses. Large Language Models (LLMs) may be a promising technology to reduce the alert explainability gap by associating rules with attack techniques. In this paper, we investigate the ability of three prominent LLMs (ChatGPT, Claude, and Gemini) to reason about NIDS rules while labeling them with MITRE ATT&CK tactics and techniques. We discuss prompt design and present experiments performed with 973 Snort rules. Our results indicate that while LLMs provide explainable, scalable, and efficient initial mappings, traditional Machine Learning (ML) models consistently outperform them in accuracy, achieving higher precision, recall, and F1-scores. These results highlight the potential for hybrid LLM-ML approaches to enhance SOC operations and better address the evolving threat landscape.
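A hedged sketch of the kind of traditional ML baseline the study found more accurate than LLM labeling: TF-IDF features over Snort rule text feeding a linear classifier that predicts a MITRE ATT&CK technique ID. The rules and labels below are illustrative placeholders, not the paper's 973-rule dataset.

```python
# Sketch: classic text-classification baseline for mapping NIDS rules to
# ATT&CK techniques. Training data here is toy/illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rules = [
    'alert tcp any any -> any 445 (msg:"SMB exploit attempt"; ...)',
    'alert tcp any any -> any 22 (msg:"SSH brute force"; ...)',
]
labels = ["T1210", "T1110"]  # Exploitation of Remote Services, Brute Force

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(rules, labels)
print(clf.predict(['alert tcp any any -> any 3389 (msg:"RDP brute force"; ...)']))
```

A hybrid design could use such a model for high-precision mappings and fall back to an LLM for explanations of unfamiliar rules.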
Article
We introduce CyBERT, a cybersecurity feature claims classifier based on bidirectional encoder representations from transformers and a key component in our semi-automated cybersecurity vetting for industrial control systems (ICS). To train CyBERT, we created a corpus of labeled sequences from ICS device documentation collected across a wide range of vendors and devices. This corpus provides the foundation for fine-tuning BERT’s language model, including a prediction-guided relabeling process. We propose an approach to obtain optimal hyperparameters, including the learning rate, the number of dense layers, and their configuration, to increase the accuracy of our classifier. Fine-tuning all hyperparameters of the resulting model led to an increase in classification accuracy from 76% obtained with BertForSequenceClassification’s original architecture to 94.4% obtained with CyBERT. Furthermore, we evaluated CyBERT for the impact of randomness in the initialization, training, and data-sampling phases. CyBERT demonstrated a standard deviation of ±0.6% during validation across 100 random seed values. Finally, we also compared the performance of CyBERT to other well-established language models including GPT2, ULMFiT, and ELMo, as well as neural network models such as CNN, LSTM, and BiLSTM. The results showed that CyBERT outperforms these models on the validation accuracy and the F1 score, validating CyBERT’s robustness and accuracy as a cybersecurity feature claims classifier.
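The classifier-head tuning CyBERT describes (number and configuration of dense layers, learning rate) can be sketched as follows. This is a minimal illustration, assuming a BERT encoder with a configurable dense stack over the [CLS] representation; the layer sizes are example hyperparameters, not the paper's reported optimum.

```python
# Sketch: BERT encoder with a tunable stack of dense layers, in the spirit of
# CyBERT's hyperparameter search over head depth and width.
import torch.nn as nn
from transformers import AutoModel

class ClaimsClassifier(nn.Module):
    def __init__(self, hidden_sizes=(256, 64), num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")
        layers, in_dim = [], self.encoder.config.hidden_size
        for h in hidden_sizes:  # configurable dense stack (a tuned hyperparameter)
            layers += [nn.Linear(in_dim, h), nn.ReLU(), nn.Dropout(0.1)]
            in_dim = h
        layers.append(nn.Linear(in_dim, num_labels))
        self.head = nn.Sequential(*layers)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.last_hidden_state[:, 0])  # [CLS] representation
```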
Article
Named Entity Recognition (NER) for cyber security aims to identify and classify cyber security terms from a large number of heterogeneous, multisource cyber security texts. In the field of machine learning, deep neural networks automatically learn text features from large datasets, but this data-driven method usually lacks the ability to deal with rare entities. Gasmi et al. proposed a deep learning method for named entity recognition in the field of cyber security and achieved good results, reaching an F1 value of 82.8%; however, it remains difficult to accurately identify rare entities and complex words in the text. To cope with this challenge, this paper proposes a new model that combines data-driven deep learning methods with knowledge-driven dictionary methods, building dictionary features to assist in rare entity recognition. In addition, on top of the data-driven deep learning model, an attention mechanism is adopted to enrich the local features of the text, better model the context, and improve the recognition of complex entities. Experimental results show that our method outperforms the baseline model and is more effective in identifying cyber security entities: Precision, Recall, and F1 reached 90.19%, 86.60%, and 88.36%, respectively.
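An illustrative sketch of the paper's core idea of combining data-driven and knowledge-driven signals: each token carries a binary gazetteer-membership flag concatenated to its embedding before a BiLSTM tagger. The gazetteer, dimensions, and omission of the CRF layer are simplifying assumptions.

```python
# Sketch: dictionary-augmented sequence tagger. A knowledge-driven gazetteer
# flag is concatenated to each token embedding before the BiLSTM.
import torch
import torch.nn as nn

SECURITY_DICT = {"ransomware", "mimikatz", "cve-2017-0144"}  # toy gazetteer

def dict_flags(tokens):
    # 1.0 if the lowercased token appears in the gazetteer, else 0.0
    return torch.tensor([[float(t.lower() in SECURITY_DICT)]
                         for t in tokens]).unsqueeze(0)  # (1, seq, 1)

class DictAugmentedTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim + 1, hidden,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, token_ids, flags):
        x = torch.cat([self.emb(token_ids), flags], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)  # per-token tag scores (CRF layer omitted for brevity)
```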
Conference Paper
We describe Panacea, a system that supports natural language processing (NLP) components for active defenses against social engineering attacks. We deploy a pipeline of human language technology, including Ask and Framing Detection, Named Entity Recognition, Dialogue Engineering, and Stylometry. Panacea processes modern message formats through a plug-in architecture to accommodate innovative approaches for message analysis, knowledge representation, and dialogue generation. The novelty of the Panacea system is that it uses NLP for cyber defense and engages the attacker with bots to elicit evidence for attribution while wasting the attacker's time and resources.
Conference Paper
With the advancement of technology, all our valuable and sensitive information has moved into digital formats. Adversaries use malware as a medium to steal this information for their own benefit. Active Cyber Deception (ACD) has emerged prominently as a way to defend a computer system by making attackers think it is not worth attacking, or by presenting falsified data that makes them believe they have achieved their purpose. As malware is the medium between our systems and adversaries, comprehensive malware analysis is required to find ways to present falsified data that misleads attackers. Nevertheless, developing active cyber deception guided by comprehensive malware analysis requires human intelligence, effort, and insight to characterize attack behaviors. In this paper, we present DodgeTron, an autonomous cyber deception approach that performs comprehensive malware behavioral analysis and creates deception schemes automatically by extracting the deception parameters attackers leverage to discover target systems and reach their goals. Our approach thus protects users by altering these deception parameters to feed false information to adversaries and corrupt their decision making automatically, without human effort. To make the approach efficient and scalable to the large number of malware samples created each day, we employ machine-learning-based malware classification to reduce the number of samples that require in-depth analysis. We conducted comprehensive evaluations of DodgeTron on recent malware and confirmed an average accuracy of 91.18%, with a 1.1x to 2.8x speedup in analysis time for achieving deception.
Article
Data-driven methods are an effective tool for online power system transient stability assessment (TSA). Nevertheless, there has been no proper way for a model to supplement its original knowledge base other than retraining from scratch. To address this issue, a novel TSA framework with continual learning ability is proposed. First, an improved convolutional neural network (CNN) trained with the orthogonal weight modification (OWM) algorithm is selected as the transient stability predictor. Then, each large disturbance is treated as a new recognition task that the predictor can learn continually. Through the proposed updating process, the original knowledge base can be updated and supplemented using only the data corresponding to the new scenario. Test results on two power systems show that the proposed scheme realizes a trade-off between the integrity of the knowledge base and update speed without losing much accuracy under a limited capacity.
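A hedged NumPy sketch of the orthogonal weight modification (OWM) idea behind the continual-learning predictor: each gradient update is projected onto the subspace orthogonal to inputs seen in earlier tasks, so a new disturbance scenario can be learned without overwriting old knowledge. Shapes, the learning rate, and alpha are illustrative.

```python
# Sketch: one OWM-style update. P is a projector that shrinks along directions
# spanned by past inputs; gradients are projected through P before applying.
import numpy as np

def owm_step(W, P, x, grad, lr=0.1, alpha=1e-3):
    """W: (out, in) weights, P: (in, in) projector, x: (in, 1) input."""
    k = P @ x
    P = P - (k @ k.T) / (alpha + (x.T @ k).item())  # shrink projector along x
    W = W - lr * grad @ P                           # project the gradient step
    return W, P

in_dim, out_dim = 8, 3
W = np.random.randn(out_dim, in_dim) * 0.1
P = np.eye(in_dim)                                  # initially unconstrained
x = np.random.randn(in_dim, 1)                      # input from the new task
grad = np.random.randn(out_dim, in_dim)             # loss gradient w.r.t. W
W, P = owm_step(W, P, x, grad)
```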
Article
Thousands of software vulnerabilities are archived and disclosed to the public each year, posing severe cybersecurity threats to society as a whole. Predicting the exploitability of vulnerabilities is crucial for decision-makers to prioritize their efforts and patch the most critical vulnerabilities first. Software vulnerability descriptions are accessible early and contain rich semantic information; therefore, descriptions are widely used for exploitability prediction in both industry and academia. However, compared with other corpora, the vulnerability description corpus is too small to train a comprehensive Natural Language Processing (NLP) model. To obtain better performance, this paper proposes a framework named ExBERT to accurately predict whether a vulnerability will be exploited. ExBERT is essentially an improved Bidirectional Encoder Representations from Transformers (BERT) model for exploitability prediction. First, we fine-tune a pre-trained BERT using a collected domain-specific corpus. Then, we design a Pooling Layer and a Classification Layer on top of the fine-tuned BERT model to extract sentence-level semantic features and predict the exploitability of vulnerabilities. Results on 46,176 real-world vulnerabilities demonstrate that the proposed ExBERT framework achieves 91.12% accuracy and 91.82% precision, outperforming the state-of-the-art approach with 89.0% accuracy and 81.8% precision.
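A minimal sketch of the ExBERT-style head described above: pooling over the fine-tuned encoder's token states followed by a binary exploitability classifier. The choice of masked mean pooling and the base checkpoint are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch: pooling + classification layers over a fine-tuned BERT encoder for
# binary exploitability prediction.
import torch.nn as nn
from transformers import AutoModel

class ExploitabilityPredictor(nn.Module):
    def __init__(self, base="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (h * mask).sum(1) / mask.sum(1)  # masked mean pooling
        return self.classifier(pooled)            # exploited vs. not exploited
```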
Conference Paper
Common Vulnerabilities and Exposures (CVE) entries are a standard means of sharing publicly known information security vulnerabilities. One or more CVEs are grouped into Common Weakness Enumeration (CWE) classes in order to understand the software or configuration flaws and potential impacts enabled by these vulnerabilities, and to identify means of detecting or preventing exploitation. Because CVE-to-CWE classification is mostly performed manually by domain experts, thousands of critical and new CVEs remain unclassified and therefore cannot be effectively mitigated. This significantly limits the utility of CVEs and slows proactive threat mitigation tremendously; in addition, manual classification is error-prone and highly expensive. This paper presents an automatic classification of CVEs into CWEs to enable understanding of CVE detection and mitigation and to facilitate proactive cybersecurity. Our approach, implemented in a tool called ThreatZoom, takes the description of a CVE and classifies it into the most relevant CWEs in a hierarchical fashion. It employs a novel learning algorithm: an adaptive hierarchical neural network that adjusts its weights during backpropagation based on text-analytic scores and classification errors, estimating the CWE classes corresponding to a CVE instance from both statistical and semantic features. The tool was rigorously tested on datasets provided by MITRE and the National Vulnerability Database (NVD). Despite the small corpus, the accuracy of classifying CVE instances to their correct CWE classes is 92% (fine-grain) and 94% (coarse-grain) for the NVD dataset, and 75% (fine-grain) and 90% (coarse-grain) for the MITRE dataset.
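A simplified sketch of coarse-to-fine hierarchical classification in the spirit of ThreatZoom: a coarse classifier picks a top-level CWE class, then a per-class fine classifier refines it. The TF-IDF features and flat two-level hierarchy are placeholder assumptions standing in for the adaptive hierarchical neural network.

```python
# Sketch: two-level coarse-to-fine CVE-to-CWE classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

class HierarchicalCWEClassifier:
    def __init__(self):
        self.vec = TfidfVectorizer()
        self.coarse = LogisticRegression(max_iter=1000)
        self.fine = {}  # one fine-grain classifier per coarse CWE class

    def fit(self, texts, coarse_labels, fine_labels):
        X = self.vec.fit_transform(texts)
        self.coarse.fit(X, coarse_labels)
        for c in set(coarse_labels):
            idx = [i for i, y in enumerate(coarse_labels) if y == c]
            if len({fine_labels[i] for i in idx}) > 1:  # needs >1 fine class
                clf = LogisticRegression(max_iter=1000)
                clf.fit(X[idx], [fine_labels[i] for i in idx])
                self.fine[c] = clf

    def predict(self, texts):
        X = self.vec.transform(texts)
        coarse = self.coarse.predict(X)
        return [(c, self.fine[c].predict(X[i]).item() if c in self.fine else c)
                for i, c in enumerate(coarse)]
```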