January 2008
·
145 Reads
·
7,474 Citations
January 2008
·
659 Reads
·
5,651 Citations
January 2008
·
204 Reads
·
214 Citations
... where $w_{ji}$ is the weight of the term $t_j$ on the document $d_i$, which quantifies the importance of the term on the document. To compute the weights $w_{ji}$, one of the most common approaches is the method known as Term Frequency-Inverse Document Frequency (TF-IDF) [26, 17]. This method evaluates two aspects: the frequency of a term $t$ on the document $d$, $\mathrm{tf}(t, d)$, and the inverse frequency of $t$ on the corpus of documents, $\mathrm{idf}(t, D)$, quantifying its global importance. ...
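The weighting described in this excerpt can be sketched in a few lines of Python. This is a minimal illustration, not the cited authors' implementation: the excerpt does not say which TF-IDF variant is used, so the sketch assumes a raw-count term frequency and idf(t, D) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per tokenized document.

    Assumption: tf is the raw count of the term in the document and
    idf is log(N / df); other variants (smoothing, normalization) exist.
    """
    n_docs = len(corpus)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in corpus:
        df.update(set(doc))

    weights = []
    for doc in corpus:
        tf = Counter(doc)  # raw term counts for this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy corpus, already tokenized.
docs = [
    ["cat", "sat", "mat"],
    ["cat", "cat", "dog"],
    ["dog", "barked"],
]
w = tf_idf(docs)
```

Under these assumptions, a term that appears in every document receives weight 0, and a term concentrated in a single document is weighted most heavily there.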
... Second, to create a cleaner corpus, the text of all the messages underwent pre-processing with the help of Python libraries, which included the usual procedures of tokenization, lemmatization, n-gram identification, and stopword removal (Heydt, 2018; Kirilenko et al., 2021; Laureate et al., 2023; Maier et al., 2018; Manning, Raghavan, & Schütze, 2008; Vázquez, Pereira-Delgado, Cid-Sueiro, & Arenas-García, 2022). Concretely, the following steps were followed: a) Tokenization with removal of short (fewer than 4 symbols) and long (more than 25 symbols) tokens. ...
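Step (a) of this excerpt, together with stopword removal, might look roughly like the sketch below. It is an assumption-laden illustration: the excerpt does not name the Python libraries used, so this version relies only on the standard library, and the `preprocess` function, its parameter names, and the tiny `STOPWORDS` set are hypothetical. Lemmatization and n-gram identification are left out because they depend on an external NLP library.

```python
import re

# Illustrative stopword set only; a real pipeline would use a full list.
STOPWORDS = {"this", "that", "with", "from", "have"}


def preprocess(text, min_len=4, max_len=25):
    """Tokenize, then drop stopwords and tokens outside [min_len, max_len].

    Assumption: a token is a maximal run of letters, lowercased.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [
        t for t in tokens
        if min_len <= len(t) <= max_len and t not in STOPWORDS
    ]


result = preprocess("This message was sent with the updated pricing information")
print(result)
# ['message', 'sent', 'updated', 'pricing', 'information']
```

The default thresholds mirror the excerpt: tokens with fewer than 4 symbols or more than 25 symbols are discarded.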
... Given the availability of multiple LLMs and the lack of prior investigation into their applicability to NVD data, we develop three variants of ChatNVD with three different widely adopted models: GPT-4o mini by OpenAI [30], Gemini 1.5 Pro by Google [31], and Llama 3 by Meta [32]. The models are trained using the term frequency-inverse document frequency (TF-IDF) embedding technique, chosen for its computational efficiency, lower cost, and faster processing time [33]. High-quality embeddings for the entire NVD dataset (720.7 MB) were deemed too costly and time-intensive, making TF-IDF a suitable alternative for tasks focused on CVE IDs and term significance. ...