Fig 2 - uploaded by Mandana Fasounaki
Conventional Convolution Approach (Left), Proposed Convolution Approach (Right)

After the first non-square convolution layer, subsequent 1-D convolutional filters transform the output of the first layer into a fixed-size vector. The fully connected (FC) part of the model then classifies this embedding into one of the speaker classes. We used skip connections similar to those in Residual Networks (ResNet) [24] to preserve information from previous layers. For 1-second identification, the model contains 1024 kernels of size (13 × 20) in the first layer. The following layers have convolutional filters of sizes (1 × 20), (1 × 16), and (1 × 10). If the input size increases, more MaxPooling layers may be ...
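The effect of the kernel sizes quoted above on the feature-map shape can be checked with the standard convolution output-size formula. The sketch below is only illustrative: the input size (13 MFCCs × 100 frames), the stride of 1, and the absence of padding and pooling are assumptions that the excerpt does not state, and `conv_out_shape` is a hypothetical helper, not part of the authors' code.

```python
def conv_out_shape(in_shape, kernel, stride=(1, 1), padding=(0, 0)):
    """Return (height, width) after a 2-D convolution (standard formula)."""
    return tuple(
        (n + 2 * p - k) // s + 1
        for n, k, s, p in zip(in_shape, kernel, stride, padding)
    )

# Assumed input: 13 MFCCs x 100 frames (hypothetical, not from the source).
shape = (13, 100)

# Kernel sizes quoted in the text: one non-square layer, then 1-D layers.
kernels = [(13, 20), (1, 20), (1, 16), (1, 10)]

shapes = []
for kernel in kernels:
    shape = conv_out_shape(shape, kernel)  # stride 1, no padding (assumed)
    shapes.append(shape)

print(shapes)  # [(1, 81), (1, 62), (1, 47), (1, 38)]
```

Under these assumptions the (13 × 20) kernel collapses the feature axis to height 1 in a single step, which is why the later layers can be purely 1-D along time.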


Contexts in source publication

Context 1
... network with non-square filters. The general architecture is illustrated in Fig. 1. Rectangular kernels in the first layer cover consecutive frames of the input sample, such that in each convolutional step all MFCCs of a small time period are covered. The comparison of our proposed convolutional approach with the conventional process is shown in Fig. 2. As depicted in this figure, with the traditional convolution approach, the filters are applied to local square areas in each step, not considering the temporal aspect of the input. However, the convolutional filters in the proposed approach are applied to all the features of consecutive frames. The main motivation for this ...
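The contrast between the two approaches can be illustrated on a toy input. The sketch below is a minimal pure-Python "valid" cross-correlation, not the authors' implementation; the 4 × 6 input, the all-ones kernels, and the name `cross_correlate_valid` are assumptions chosen only to make the shapes easy to follow.

```python
def cross_correlate_valid(x, k):
    """'Valid' 2-D cross-correlation (stride 1, no padding) on nested lists."""
    rows, cols = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    return [
        [
            sum(x[i + a][j + b] * k[a][b] for a in range(kh) for b in range(kw))
            for j in range(cols - kw + 1)
        ]
        for i in range(rows - kh + 1)
    ]

# Toy input: 4 features (rows) x 6 frames (columns); values are arbitrary.
x = [[r * 6 + c for c in range(6)] for r in range(4)]

square = [[1] * 2 for _ in range(2)]  # 2 x 2: sees a local patch only
tall = [[1] * 2 for _ in range(4)]    # 4 x 2: sees all features of 2 frames

sq_out = cross_correlate_valid(x, square)
tall_out = cross_correlate_valid(x, tall)

print(len(sq_out), len(sq_out[0]))      # 3 5
print(len(tall_out), len(tall_out[0]))  # 1 5
```

The square kernel yields a 2-D map in which each value depends on only two of the four features, whereas the full-height kernel yields one value per time window that depends on every feature of the covered frames.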

Citations

... To address the issue of short utterances, Fasounaki et al. (2021) proposed a modified CNN for closed-set text-independent ASI. The authors developed an optimum CNN architecture with rectangular-shaped kernels to extract the temporal features of the speech signals. ...
Article
Full-text available
Automatic speaker identification (ASI) is an exciting area of research with numerous applications such as surveillance, voice authentication, identity verification, and electronic voice eavesdropping. This study investigates ASI based on features derived from spectrogram images through a convolutional neural network (CNN) with rectangular-shaped kernels. Traditionally, CNNs employ square-shaped kernels and max-pooling operations at different layers, a design optimized to handle 2D data. Spectrograms, however, encode information slightly differently: frequency is displayed along the y-axis, time along the x-axis, and amplitude is denoted by the intensity at a given point of the spectrogram image. The main contributions of this study are: 1. To analyze audio signals effectively using spectrograms, this study proposes using spectrogram features with rectangular kernels of different sizes and shapes to derive distinctive features that improve the recognition accuracy of the speaker identification system. 2. The extracted spectrogram-based features and models are evaluated on the ELSDSR, TSP, and LibriSpeech datasets and achieve weighted accuracies of 96.0%, 99.2%, and 97.6%, respectively. 3. The proposed rectangular-shaped CNN approach effectively derives suitable features from spectrogram images and outperforms several baseline techniques when performance is assessed on the ELSDSR, TSP, and LibriSpeech datasets.
... The training process uses a CNN, a neural network architecture known to be effective at processing image data and identifying complex patterns [16]. A CNN can automatically extract important features from the images it is given, which makes it a suitable choice for visual pattern recognition tasks such as the one in this study [17]. During training, the CNN model is given information from the prepared dataset, and repeated iterations are performed to optimize the weights and other parameters in order to improve prediction accuracy [18]. ...
... Dataset preprocessing aims to prepare the data set before inputting it into the model training stage [21]. The process of preprocessing needle-leaf images involves several steps, including adjusting the size and increasing the data. ...
... The CNN architecture used in this study comprises the following layers: Conv2D (300 filters, 3×3), Activation (ReLU), MaxPooling2D (3×3), Conv2D (200 filters, 3×3), Activation (ReLU), MaxPooling2D (3×3), Flatten, Dropout (0.5), Dense (150 neurons, ReLU), Dense (4 neurons, Softmax) (Fasounaki et al., 2021). ...
Article
Cheating occurs frequently in the academic world, where it is commonly referred to as 'menyontek'. It happens regardless of educational level, from elementary school to senior high school, can be done individually or in groups, and is independent of age. In online exams, cheating occurs more often because human supervision is less than optimal. Other factors that affect the level of cheating in online exams are the types of online exam platforms commonly used today and the lack of supporting systems for virtual proctoring. This study aims to develop an online exam platform with camera-based cheating detection, which will make online exams easier to supervise. The system is also web-based, which makes it more flexible to access. The application uses deep learning with the Convolutional Neural Network algorithm, while the website is built with the Flask framework, integrated with the TensorFlow and Keras libraries. The result of this study is an online exam platform application based on camera detection of cheating movements, named Fraud Catcher. The study also produced a Convolutional Neural Network classifier with fairly good accuracy, reaching 98.5%. Testing of the model as implemented in the system showed accuracy above 80% with the aid of lighting, and black-box testing showed that functionality was as expected.
... Softmax is a mathematical algorithm that classifies the objects combined in the fully connected (FC) stage, for more accurate object recognition [11]. ...
Article
Aksara Sunda swara panglayar is one of Indonesia's regional scripts, specifically a Sundanese script consisting of vowel characters with an added consonant R. With today's technological development, regional languages are increasingly being degraded. The Sundanese script is also being forgotten and is rarely used by Sundanese people in daily life, owing to a lack of understanding of their regional language. Regional languages, which develop over time, therefore need to be preserved so that they remain known. One way to do so is to identify aksara Sunda swara panglayar using the Convolutional Neural Network (CNN) algorithm, a part of deep learning commonly used in image data processing. This study used ADAM optimization with 110, 150, and 160 epochs, respectively, for dataset ratios of 80:20, 50:50, and 20:80. The highest accuracy, 86.85%, was obtained with the 80:20 dataset ratio, with training loss and accuracy of 0.1589 and 0.9389.
... The CNN architecture allows many layers to store and process the features of objects in an image with high efficiency [8]. CNNs have become one of the most commonly used deep learning models for image classification, similarity detection, and object recognition [9]. A popular CNN architecture is Visual Geometry Group (VGG16). ...
Article
This article discusses the use of deep learning, specifically the VGG16 Convolutional Neural Network (CNN) architecture, to classify crown density and transparency levels in needle-leaf trees. The study collected images of four needle-leaf tree species: Araucaria heterophylla, Pinus merkusii, Cupressus retusa, and Shorea javanica, each with ten different density and transparency levels. Each species has 1000 labeled images. Preprocessing involves resizing and image augmentation. The data were split into training (70%), validation (10%), and testing (20%) sets. The deep learning model used is VGG16 with predetermined hyperparameters. The training results show that VGG16 classified needle-leaf trees with good accuracy. Accuracy reached 90.00% for Pinus merkusii, 92.00% for Araucaria heterophylla, 96.00% for Cupressus retusa, and 99.00% for Shorea javanica. The evaluation also covers precision, recall, and F1-score for each density and transparency class. Prediction errors occurred mainly in classes with high visual similarity between images. This study shows that deep learning can be used to classify crown density and transparency levels in needle-leaf trees. The results can be used in forest health monitoring, helping governments and related organizations manage forests sustainably.
... The growing use of voice-controlled applications worldwide has accelerated work in the field of speaker identification. Fasounaki et al. [64] implemented automatic speaker recognition based on a traditional neural network. This technique is independent of text data and provides accurate speaker identification. ...
Article
Full-text available
Speech is a unique characteristic of humans that expresses one's emotional viewpoint to others. Speech emotion recognition (SER) identifies the speaker's emotion from the speech signal. Nowadays, SER plays a vital role in real-time applications such as human–machine interfaces, lie detection, virtual reality, security, and audio mining. In SER, however, filtering out noise and extracting the emotional features is complex. Moreover, incorporating digital filters increases the cost and complexity of the system. Thus, a novel hybrid firefly-based recurrent neural speech recognition (FbRNSR) model was developed, with preprocessing and a feature analysis module, to classify human emotions based on the speech input. The features extracted by the feature extraction module are used to train the model to classify emotions as happy, sad, or average. Moreover, incorporating firefly fitness improves the classification rate. The presented model is implemented in Python, and its results are evaluated. The performance of the presented approach is analyzed using the confusion matrix. The designed model achieved a high true positive rate of 99.34%, true negative rate of 99.12%, false positive of 99.21%, and false negative rate of 99.07%. The designed model achieved 99.2% accuracy, 98.9% recall, and precision value for the speech signal dataset. Finally, the effectiveness and robustness of the proposed approach are shown by comparing it with existing techniques. Hence, this method is applicable in various sectors, such as medicine and security, to identify people's emotional states.
Preprint
Full-text available
Speaker identification is crucial in many application areas, such as automation, security, and user experience. This study examines the use of traditional classification algorithms and hybrid algorithms, as well as newly developed subspace classifiers, in the field of speaker identification. In the study, six different feature structures were tested for the various classifier algorithms. Stacked Features-Common Vector Approach (SF-CVA) and Hybrid CVA-FLDA (HCF) subspace classifiers are used for the first time in the literature for speaker identification. In addition, CVA is evaluated for the first time for speaker recognition using hybrid deep learning algorithms. This paper also aims to increase accuracy rates with different hybrid algorithms. The study includes Recurrent Neural Network-Long Short-Term Memory (RNN-LSTM), i-vector + PLDA, Time Delayed Neural Network (TDNN), AutoEncoder + Softmax (AE + Softmax), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Common Vector Approach (CVA), SF-CVA, HCF, and Alexnet classifiers for speaker identification. The six different feature extraction approaches consist of Mel Frequency Cepstral Coefficients (MFCC) + Pitch, Gammatone Cepstral Coefficients (GTCC) + Pitch, MFCC + GTCC + Pitch + eight spectral features, spectrograms, i-vectors, and Alexnet feature vectors. For SF-CVA, 100% accuracy was achieved in most tests by combining the training and test feature vectors of the speakers separately. RNN-LSTM, i-vector + KNN, AE + Softmax, TDNN, and i-vector + HCF classifiers gave the highest accuracy rates in the tests performed without combining training and test feature vectors.
Article
One of the authentication models in frequent use today is based on biometrics, such as the retina, fingerprints, and speech recognition. Text-independent speaker identification is one of the widely studied domains of speech recognition. Short speech duration is one of the challenges in the field of speaker recognition: accuracy becomes a major issue as speech duration gets shorter, and an identification system also has to be general enough to process various languages with different dialects, each of which has its own characteristics based on ethnic group and region. Therefore, this study introduces a multi-language speaker identification system that covers regional languages, Indonesian, and English with short utterances. The MFCC technique was used to extract voice features, and a CNN was used as the classification model. Two kinds of dataset were used: open datasets for the regional and English languages, and the authors' own dataset for Indonesian. The authors' own dataset consists of voice recordings of 18 persons of different genders, each reading several paragraphs of Indonesian text. The public regional-language dataset consists of 80 speakers, 41 Sundanese and 39 Javanese. For the English dataset, 126 male speakers and 125 female speakers were taken from LibriSpeech. Tests were carried out separately for each language and speech duration: about 3 seconds for English and the regional languages, and 1 and 3 seconds for Indonesian. The best accuracy obtained on each dataset is 95% (regional dataset), 94% (English dataset), and 98% (private dataset).
Article
Full-text available
This study develops a more effective tomato cultivation method through pattern analysis and image data classification using a Convolutional Neural Network (CNN). Its main goal is to classify tomato diseases and plant types from digital images. Using deep learning, accurate and in-depth information on the growth and quality of tomato plants can be obtained. The study achieved 99% accuracy in classifying images of healthy tomato leaves, septoria fungus, fulva fungus, and target spot fungus with the Inception V3 model. The desktop software developed can display the specific leaf-type classification results.