Fig 1. Examples of non-code frames.

Source publication
Article
Full-text available
Programming screencasts have become a pervasive resource on the Internet, helping developers learn new programming technologies or skills. The source code in programming screencasts is important and valuable information for developers. But the streaming nature of programming screencasts (i.e., a sequence of screen-captured images) limits the...

Contexts in source publication

Context 1
... frames refer to screen images of software applications other than software development tools or screen images of development tools containing no source code. Figure 1 shows some typical examples of non-code frames that we commonly see in YouTube programming videos, including a frame of a PowerPoint slide and a frame of a web page with API documentation. Many non-code frames, such as PowerPoint slides, do not contain source code. ...
Context 2
... the post-processing may retain the code elements in non-code frames that are never used in the programming screencast. For example, none of the API information in the Javadoc page in Figure 1(b) is relevant to the programming screencast discussing the usage of the Math APIs. Retaining such irrelevant code elements will subsequently result in inaccurate search and navigation of the source code in the programming screencast. To avoid these two issues, our approach distinguishes non-code frames from code frames before the OCR (see Section 3.2). ...
Context 3
... goal of psc2code is to extract code from frames in programming screencasts. As discussed in Section 2, an informative frame may not contain code (see Figure 1 for typical examples of non-code frames). Furthermore, the code region of an IDE window in an informative frame may contain noise (e.g., code completion popups that block the real code). ...
Context 4
... the author adds a function setHeight(int height) to set the value of height and revises getHeight() by removing the parameter int x and the assignment statement for the field height. Finally, the author makes setHeight(int) set the value of height with the input parameter only if the input parameter is less than 10; otherwise, it sets the value of height to 0. Figure 10 and Figure 11 present the initial and the revised source code of the functions getHeight and setHeight, respectively. For the third video, we designed four questions (V3-Q2/Q3/Q4/Q5) that require participants to properly understand the process (i.e., sequence of steps) in a programming video to answer the questions correctly. ...
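Since the passage above only describes the edit in prose, a rough Java sketch of what the revised getHeight/setHeight pair (Figure 11) might look like is given below; the enclosing class name and the field declaration are assumptions, as the actual code from the video is not reproduced here.

// Hypothetical reconstruction of the revised code described above (Figure 11);
// the class name and surrounding structure are assumptions.
class Shape {
    private int height;

    // Revised getHeight(): the parameter "int x" and the assignment to the field
    // height, present in the initial version (Figure 10), have been removed.
    public int getHeight() {
        return height;
    }

    // Newly added setter: only stores the input if it is less than 10,
    // otherwise the height is set to 0.
    public void setHeight(int height) {
        if (height < 10) {
            this.height = height;
        } else {
            this.height = 0;
        }
    }
}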
Context 5
... noisy-code frames will affect the effectiveness of the segmentation algorithm and introduce errors in its follow-up steps. For example, Figure 12 presents an example of segments of a frame from V1 generated by CodeMotion. As shown in this figure, due to the popup windows, CodeMotion detects two segments and infers that these two segments contain source code. ...

Similar publications

Research
Full-text available
Smartphones are a vital part of life these days due to the connectivity they provide. One of the things that makes smartphones so important in our daily lives is their efficiency. The speed with which you can do tasks on a smartphone is almost unmatchable. The major drawbacks of smartphones were their touch operations, which create difficulties in do...
Preprint
Full-text available
In this paper, we propose an approach named psc2code to denoise the process of extracting source code from programming screencasts. First, psc2code leverages Convolutional Neural Network (CNN) based image classification to remove non-code and noisy-code frames. Then, psc2code performs edge detection and clustering-based image segmentation to detect s...
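As a rough illustration of the second step (edge detection followed by clustering of the detected boundaries), the sketch below uses OpenCV's Java bindings; the Canny thresholds, the Hough parameters, and the bucket-by-rounding grouping are illustrative assumptions, not the actual parameters or clustering algorithm of psc2code.

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

import java.util.Map;
import java.util.TreeMap;

public class FrameSegmentationSketch {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);

        // Load one screencast frame in grayscale.
        Mat frame = Imgcodecs.imread("frame.png", Imgcodecs.IMREAD_GRAYSCALE);

        // Edge detection (thresholds are placeholders).
        Mat edges = new Mat();
        Imgproc.Canny(frame, edges, 50, 150);

        // Detect long straight lines, which tend to follow window and sub-window borders.
        Mat lines = new Mat();
        Imgproc.HoughLinesP(edges, lines, 1, Math.PI / 180, 80, 100, 10);

        // Naive grouping: bucket roughly vertical lines by x coordinate (rounded to 20 px)
        // to approximate candidate boundaries between screen regions such as the code editor.
        Map<Integer, Integer> verticalBuckets = new TreeMap<>();
        for (int i = 0; i < lines.rows(); i++) {
            double[] l = lines.get(i, 0); // x1, y1, x2, y2
            if (Math.abs(l[0] - l[2]) < 5) {
                int bucket = (int) Math.round(l[0] / 20.0) * 20;
                verticalBuckets.merge(bucket, 1, Integer::sum);
            }
        }
        verticalBuckets.forEach((x, count) ->
                System.out.println("candidate vertical boundary near x=" + x + " (" + count + " line segments)"));
    }
}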

Citations

... Ahmed et al. [19] examined the use of image processing to identify duplicate questions on Stack Overflow, noting that while images can complement text, the improvements in efficiency were modest. Research has further focused on the frequent sharing of code screenshots on social coding platforms, with a particular emphasis on using machine learning models for accurate text retrieval from images [20], [21], [22]. ...
Preprint
Full-text available
The integration of generative AI into developer forums like Stack Overflow presents an opportunity to enhance problem-solving by allowing users to post screenshots of code or Integrated Development Environments (IDEs) instead of traditional text-based queries. This study evaluates the effectiveness of various large language models (LLMs), specifically LLAMA, GEMINI, and GPT-4o, in interpreting such visual inputs. We employ prompt engineering techniques, including in-context learning, chain-of-thought prompting, and few-shot learning, to assess each model's responsiveness and accuracy. Our findings show that while GPT-4o shows promising capabilities, achieving over 60% similarity to baseline questions for 51.75% of the tested images, challenges remain in obtaining consistent and accurate interpretations for more complex images. This research advances our understanding of the feasibility of using generative AI for image-centric problem-solving in developer communities, highlighting both the potential benefits and current limitations of this approach while envisioning a future where visual-based debugging copilot tools become a reality.
... Furthermore, existing studies delved into specific aspects such as denoising code extraction from programming screencasts and code localization within programming screencasts, offering insights into how to parse information from programming activities [1,6]. Yet, the mechanism for providing real-time feedback or enhancing efficiency was missing. ...
Preprint
Full-text available
In this paper, we propose SmartMonitor, a system that utilizes an edge hardware device to record, analyze, and log user activity while not interfering with said activity. The analysis of the activity can help users identify the time they spend on different tasks and provide real-time feedback from a Large Language Model (LLM) to give better awareness of user activity. SmartMonitor passes through the HDMI signal from the video card and analyzes the user's activity at the edge using two artificial intelligence models, logs user activity, and sends the log for analysis to a Large Language Model for feedback. SmartMonitor enables non-intrusive self-quantifying technology that both records and analyzes user activity while protecting privacy and reducing the processing burden on the user's device, which can serve as an excellent research framework for behavioral analysis in the workplace and a way to enhance users' work activity.
... Furthermore, existing studies delved into specific aspects such as denoising code extraction from programming screencasts and code localization within programming screencasts, offering insights into how to parse information from programming activities [1,6]. Yet, the mechanism for providing real-time feedback or enhancing efficiency was missing. ...
Conference Paper
Full-text available
In this paper, we propose SmartMonitor, a system that utilizes an edge hardware device to record, analyze, and log user activity while not interfering with said activity. The analysis of the activity can help users identify the time they spend on different tasks and provide real-time feedback from a Large Language Model (LLM) to give better awareness of user activity. The SmartMonitor system is shown in Fig 1, where the hardware device passes through the HDMI signal from the video card and analyzes the user's activity at the edge using two artificial intelligence models, logs user activity, and sends the log for analysis to a Large Language Model for feedback. SmartMonitor enables non-intrusive self-quantifying technology that both records and analyzes user activity while protecting privacy and reducing the processing burden on the user's device, which can serve as an excellent research framework for behavioral analysis in the workplace and a way to enhance users' work activity.
... This method extracts the textual content from the images, facilitating the transformation of pixelated code into a format that developers can easily integrate into their projects. Over the last decade, many efforts have been dedicated to facilitating the process of extracting source code from video programming tutorials [7][8][9][10][11][12][13][14][15][16]. Several approaches have been proposed to clean the extracted code, such as by (i) using statistical language models [10,16] for the Java language, (ii) verifying the extracted tokens using a dictionary-based approach [14,15,17], and (iii) using the CodeT5 model [18] to correct Java code [13]. ...
... Over the last decade, many efforts have been dedicated to facilitating the process of extracting source code from video programming tutorials [7][8][9][10][11][12][13][14][15][16]. Several approaches have been proposed to clean the extracted code, such as by (i) using statistical language models [10,16] for the Java language, (ii) verifying the extracted tokens using a dictionary-based approach [14,15,17], and (iii) using the CodeT5 model [18] to correct Java code [13]. Other work focused on developing an ensemble technique that improves OCR performance on specific domains, such as extracting texts from (i) geology papers using Softmax regression and SVM [19], (ii) raster images with Arabic script using SVM [20], and (iii) the IAM handwritten dataset using CRNN, LSTM, and CTC [21]. ...
... The second crucial aspect overlooked by prior studies is the potential of image pre-processing to boost OCR performance for code extraction from programming tutorials, aiming to eliminate noise that could originate from images with noisy pixels. Yet, previous work has extracted the code from images, then created a custom model for a specific programming language (e.g., Java) to detect and correct errors [10,13,16], without initially addressing image quality or employing image denoising techniques. Finally, previous work has relied on off-the-shelf OCR engines such as Tesseract [7,22] and Google Vision [14,15,23] for extracting source code without investigating state-of-the-art vision-based large language models (LLMs), which could significantly enhance code extraction performance. ...
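As a toy illustration of the dictionary-based token verification mentioned in the excerpts above (and not the exact algorithm of any cited tool), the sketch below checks each OCRed token against a small dictionary of known keywords and identifiers and accepts the closest entry by edit distance.

import java.util.Arrays;
import java.util.List;

public class OcrTokenFixer {
    // A tiny dictionary; in practice this would be built from language keywords,
    // API identifiers, and tokens seen earlier in the same screencast.
    private static final List<String> DICTIONARY = Arrays.asList(
            "public", "private", "static", "void", "return", "height", "setHeight", "getHeight");

    public static String correctToken(String ocrToken) {
        if (DICTIONARY.contains(ocrToken)) {
            return ocrToken; // token verified as-is
        }
        String best = ocrToken;
        int bestDist = Integer.MAX_VALUE;
        for (String candidate : DICTIONARY) {
            int d = levenshtein(ocrToken, candidate);
            if (d < bestDist) {
                bestDist = d;
                best = candidate;
            }
        }
        // Only accept the correction if it is close enough to the OCR output.
        return bestDist <= 2 ? best : ocrToken;
    }

    // Standard dynamic-programming edit distance.
    private static int levenshtein(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1), dp[i - 1][j - 1] + cost);
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(correctToken("pub1ic"));    // -> public
        System.out.println(correctToken("getHe1ght")); // -> getHeight
    }
}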
Article
Full-text available
The rapid evolution of video programming tutorials as a key educational resource has highlighted the need for effective code extraction methods. These tutorials, varying widely in video quality, present a challenge for accurately transcribing the embedded source code, crucial for learning and software development. This study investigates the impact of video quality on the performance of optical character recognition (OCR) engines and the potential of large language models (LLMs) to enhance code extraction accuracy. Our comprehensive empirical analysis utilizes a rich dataset of programming screencasts, involving manual transcription of source code and the application of both traditional OCR engines, like Tesseract and Google Vision, and advanced LLMs, including GPT-4V and Gemini. We investigate the efficacy of image super-resolution (SR) techniques, namely, enhanced deep super-resolution (EDSR) and multi-scale deep super-resolution (MDSR), in improving the quality of low-resolution video frames. The findings reveal significant improvements in OCR accuracy with the use of SR, particularly at lower resolutions such as 360p. LLMs demonstrate superior performance across all video qualities, indicating their robustness and advanced capabilities in diverse scenarios. This research contributes to the field of software engineering by offering a benchmark for code extraction from video tutorials and demonstrating the substantial impact of SR techniques and LLMs in enhancing the readability and reusability of code from these educational resources.
... For example, Shi et al. [61] utilized early stopping when training a deep Siamese network to identify hidden feature requests posted in chat messages by developers. Other techniques like data augmentation [2], [18], [42] and data balancing [52], [74] are also employed to address overfitting. For instance, Bao et al. [2] developed a CNN-based image classification model to filter out non-code and noisy-code frames from programming screencasts. ...
... Other techniques like data augmentation [2], [18], [42] and data balancing [52], [74] are also employed to address overfitting. For instance, Bao et al. [2] developed a CNN-based image classification model to filter out non-code and noisy-code frames from programming screencasts. To enhance training data diversity, they employed data augmentation techniques such as rotation, scaling, translation, and shearing. ...
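To make the augmentation step concrete, a minimal sketch using the standard java.awt.geom.AffineTransform API is shown below; the specific angle, scale, translation, and shear values are placeholders, not the settings used by Bao et al.

import java.awt.geom.AffineTransform;
import java.awt.image.AffineTransformOp;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class FrameAugmenter {
    public static void main(String[] args) throws Exception {
        BufferedImage frame = ImageIO.read(new File("code_frame.png"));

        // Rotation, scaling, translation, and shearing, as listed above (values are illustrative).
        AffineTransform[] transforms = {
                AffineTransform.getRotateInstance(Math.toRadians(3),
                        frame.getWidth() / 2.0, frame.getHeight() / 2.0),
                AffineTransform.getScaleInstance(1.1, 1.1),
                AffineTransform.getTranslateInstance(15, 10),
                AffineTransform.getShearInstance(0.05, 0.0)
        };

        int i = 0;
        for (AffineTransform t : transforms) {
            AffineTransformOp op = new AffineTransformOp(t, AffineTransformOp.TYPE_BILINEAR);
            BufferedImage augmented = op.filter(frame, null);
            ImageIO.write(augmented, "png", new File("augmented_" + (i++) + ".png"));
        }
    }
}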
Article
Full-text available
In software engineering, deep learning models are increasingly deployed for critical tasks such as bug detection and code review. However, overfitting remains a challenge that affects the quality, reliability, and trustworthiness of software systems that utilize deep learning models. Overfitting can be (1) prevented (e.g., using dropout or early stopping) or (2) detected in a trained model (e.g., using correlation-based approaches). Both overfitting detection and prevention approaches that are currently used have constraints (e.g., requiring modification of the model structure, and high computing resources). In this paper, we propose a simple, yet powerful approach that can both detect and prevent overfitting based on the training history (i.e., validation losses). Our approach first trains a time series classifier on training histories of overfit models. This classifier is then used to detect if a trained model is overfit. In addition, our trained classifier can be used to prevent overfitting by identifying the optimal point to stop a model’s training. We evaluate our approach on its ability to identify and prevent overfitting in real-world samples. We compare our approach against correlation-based detection approaches and the most commonly used prevention approach (i.e., early stopping). Our approach achieves an F1 score of 0.91, which is at least 5% higher than the current best-performing non-intrusive overfitting detection approach. Furthermore, our approach can stop training to avoid overfitting at least 32% of the time earlier than early stopping and has the same or a better rate of returning the best model.
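For contrast with the classifier-based approach described above, a minimal sketch of the conventional patience-based early-stopping baseline (monitoring the validation-loss history) could look as follows; the patience value is an arbitrary placeholder.

import java.util.List;

public class EarlyStopping {
    // Returns the epoch index at which training should stop, i.e. the first epoch
    // after which the validation loss has not improved for `patience` consecutive
    // epochs. Returns the last epoch if no stop is triggered.
    public static int stopEpoch(List<Double> validationLosses, int patience) {
        double best = Double.MAX_VALUE;
        int epochsWithoutImprovement = 0;
        for (int epoch = 0; epoch < validationLosses.size(); epoch++) {
            if (validationLosses.get(epoch) < best) {
                best = validationLosses.get(epoch);
                epochsWithoutImprovement = 0;
            } else if (++epochsWithoutImprovement >= patience) {
                return epoch;
            }
        }
        return validationLosses.size() - 1;
    }

    public static void main(String[] args) {
        List<Double> losses = List.of(0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64);
        System.out.println("stop at epoch " + stopEpoch(losses, 3)); // -> 5
    }
}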
... fragment and FRAPT treats a fragment as irrelevant to such an API. In future work, we will consider the usage of the semantics of the image (e.g., Bao et al. 2020). ...
... Recently, API tutorials have attracted much research interest (e.g., Nguyen et al. 2018; Ponzanelli et al. 2019; Treude et al. 2015). Some studies have developed tools (e.g., CodeTube (Ponzanelli et al. 2019) and psc2code (Bao et al. 2020)) to recognize software development video tutorial fragments. Different from the above tools, our work focuses on textual API tutorial fragments. ...
Article
Full-text available
API tutorials are crucial resources as they often provide detailed explanations of how to utilize APIs. Typically, an API tutorial is segmented into a number of consecutive fragments. If a fragment explains API usage, we regard it as a relevant fragment of the API. Recognizing relevant fragments can aid developers in comprehending, learning, and using APIs. Recently, some studies have presented relevant fragments recognition approaches, which mainly focused on using API tutorials or Stack Overflow to train the recognition model. API references are also important API learning resources as they contain abundant API knowledge. Considering the similarity between API tutorials and API references (both provide API knowledge), we believe that using API knowledge from API references could help recognize relevant tutorial fragments of APIs effectively. However, it is non-trivial to leverage API references to build a supervised learning-based recognition model. Two major problems are the lack of labeled API references and the unavailability of engineered features of API references. We propose a supervised learning based approach named RRTR (which stands for Recognize Relevant Tutorial fragments using API References) to address the above problems. For the problem of lacking labeled API references, RRTR designs heuristic rules to automatically collect relevant and irrelevant API references for APIs. Regarding the unavailable engineered features issue, we adopt the pre-trained SBERT model (SBERT stands for Sentence-BERT) to automatically learn semantic features for API references. More specifically, we first automatically generate labeled ⟨API, ARE⟩ pairs (ARE stands for an API reference) via our heuristic rules of API references. We then use SBERT to automatically learn semantic features for the collected pairs and train a supervised learning based recognition model. Finally, we can recognize the relevant tutorial fragments of APIs based on the trained model. To evaluate the effectiveness of RRTR, we collected Java and Android API reference datasets containing a total of 20,680 labeled ⟨API, ARE⟩ pairs. Experimental results demonstrate that RRTR outperforms state-of-the-art approaches in terms of F-Measure on two datasets. In addition, we conducted a user study to investigate the practicality of RRTR and the results further illustrate the effectiveness of RRTR in practice. The proposed RRTR approach can effectively recognize relevant fragments of APIs with API references by solving the problems of lacking labeled API references and engineered features of API references.
... One approach is to use statistical language models such as uni-gram and bi-gram trained on a large corpus of error-free code. Research by Yadid and Yahav [43] and Bao et al. [8] has followed this approach, with the former utilizing both uni-gram and bi-gram models for token-, line-, and code-fragment-level correction, while the latter used a uni-gram model at the token and line level. Another approach involved using dictionary-based techniques for OCR postprocessing correction. ...
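As a rough illustration of the n-gram idea summarized above (not the exact models of Yadid and Yahav or Bao et al.), the sketch below scores candidate readings of an OCRed code line with a bi-gram count model built from error-free code and keeps the highest-scoring one.

import java.util.*;

public class BigramCodeCorrector {
    private final Map<String, Integer> bigramCounts = new HashMap<>();

    // "Train" the model by counting adjacent token pairs in error-free code lines.
    public void train(List<List<String>> corpusLines) {
        for (List<String> tokens : corpusLines) {
            for (int i = 0; i + 1 < tokens.size(); i++) {
                bigramCounts.merge(tokens.get(i) + " " + tokens.get(i + 1), 1, Integer::sum);
            }
        }
    }

    // Score a candidate line: sum of log-counts of its bigrams (add-one smoothing).
    public double score(List<String> tokens) {
        double s = 0;
        for (int i = 0; i + 1 < tokens.size(); i++) {
            s += Math.log(bigramCounts.getOrDefault(tokens.get(i) + " " + tokens.get(i + 1), 0) + 1);
        }
        return s;
    }

    // Pick the candidate reading of an OCRed line with the highest bigram score.
    public List<String> correct(List<List<String>> candidateLines) {
        return Collections.max(candidateLines, Comparator.comparingDouble(this::score));
    }

    public static void main(String[] args) {
        BigramCodeCorrector model = new BigramCodeCorrector();
        model.train(List.of(
                List.of("public", "int", "getHeight", "(", ")"),
                List.of("public", "void", "setHeight", "(", "int", "height", ")")));
        // Two candidate readings of the same OCRed line; the model prefers the first.
        System.out.println(model.correct(List.of(
                List.of("public", "int", "getHeight", "(", ")"),
                List.of("pub1ic", "lnt", "getHeight", "(", ")"))));
    }
}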
... OCR has also been recently adopted by researchers in software engineering for extracting source code from programming video tutorials [43,33,18,4,5,8]. It brings great promise in this area of research, allowing the text found in programming tutorials, including the source code, to be extracted, indexed, searched, and reused. ...
... It brings great promise in this area of research, allowing the text found in programming tutorials, including the source code, to be extracted, indexed, searched, and reused. Researchers in software engineering have primarily relied on the Tesseract OCR engine to extract code from programming tutorials [43,33,18,4,5,8], given it is free and open source, as opposed to other OCR engines, such as the Google Drive OCR, which can get costly when processing large amounts of data. ...
Preprint
Accurate automatic code extraction from tutorial videos is crucial for software developers seeking to reuse the code contained in these videos. Current methods using optical character recognition (OCR) often yield inaccurate results due to code complexity and variations in screencast formats. To address this issue, we introduce CodeT5-OCRfix, an approach that leverages the pre-trained code-aware large language model CodeT5 to enhance code extraction accuracy by post-processing OCRed code. We first collect a large and diverse dataset of source code screenshots captured from more than 10K Java projects from GitHub. We then apply the most widely used OCR engine for the task of code extraction from videos, Tesseract, on these screenshots and collect the OCRed code along with the ground truth code extracted from the Java files. We built a training dataset of more than 585K pairs of OCRed and ground truth code pairs, which we then used to fine-tune CodeT5, obtaining our model CodeT5-OCRfix. An empirical evaluation on both screenshots and screencast frames shows that CodeT5-OCRfix outperforms baseline code extraction models and is also more time-efficient. Our approach therefore improves the state-of-the-art in code extraction techniques from screencasts and images.
... When identifying coding steps, our approach is aware of primitive HCI actions that constitute the coding steps. This action awareness differentiates our approach from existing action-agnostic, content-based workflow extraction methods [9], [6], [10]. It allows our approach to filter out irrelevant content changes on the computer screen resulting from non-coding actions such as switching windows and triggering or leaving popups, which always confuse action-agnostic methods. ...
... Optical Character Recognition (OCR) tools (e.g., Tesseract [11]) are commonly used for extracting text from UI screenshots [9], [6], [10]. OCR tools are ideal for processing document images, but their performance degrades for UI screenshots [10] due to complex view layouts and text styles. ...
... Optical Character Recognition (OCR) tools (e.g., Tesseract [11]) are commonly used for extracting text from UI screenshots [9], [6], [10]. OCR tools are ideal for processing document images, but their performance degrades for UI screenshots [10] due to complex view layouts and text styles. Fig. 4(a) shows some inaccurate text lines on an IDE screenshot detected by Tesseract. ...
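For reference, extracting text from a screenshot with Tesseract from Java is typically done through a wrapper such as Tess4J; a minimal sketch is shown below, assuming Tess4J is on the classpath and the tessdata path is adjusted to the local installation.

import java.io.File;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class OcrFrame {
    public static void main(String[] args) {
        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath("/usr/share/tesseract-ocr/4.00/tessdata"); // path is an assumption
        tesseract.setLanguage("eng");
        try {
            // OCR an IDE screenshot; as noted above, accuracy degrades on complex UI layouts.
            String text = tesseract.doOCR(new File("ide_frame.png"));
            System.out.println(text);
        } catch (TesseractException e) {
            e.printStackTrace();
        }
    }
}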
Preprint
Full-text available
Programming screencasts (e.g., video tutorials on Youtube or live coding streams on Twitch) are an important knowledge source for developers to learn programming knowledge, especially the workflow of completing a programming task. Nonetheless, the image nature of programming screencasts limits the accessibility of screencast content and the workflow embedded in it, resulting in a gap in accessing and interacting with the content and workflow in programming screencasts. Existing non-intrusive methods are limited to extracting either primitive human-computer interaction (HCI) actions or coarse-grained video fragments. In this work, we leverage Computer Vision (CV) techniques to build a programming screencast analysis tool that can automatically extract code-line editing steps (enter text, delete text, edit text, and select text) from screencasts. Given a programming screencast, our approach outputs a sequence of coding steps and code snippets involved in each step, which we refer to as programming workflow. The proposed method is evaluated on 41 hours of tutorial videos and live coding screencasts with diverse programming environments. The results demonstrate that our tool can extract code-line editing steps accurately and that the extracted workflow steps can be intuitively understood by developers.
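Purely to illustrate the notion of code-line editing steps (this is not the CV pipeline described above), a naive sketch that classifies per-line changes between the OCRed code of two consecutive frames might look like this.

import java.util.List;

public class EditStepSketch {
    enum Step { ENTER_TEXT, DELETE_TEXT, EDIT_TEXT, NO_CHANGE }

    // Naive per-index comparison of two consecutive frames' OCRed code lines.
    static void classify(List<String> previous, List<String> current) {
        int max = Math.max(previous.size(), current.size());
        for (int i = 0; i < max; i++) {
            String before = i < previous.size() ? previous.get(i) : null;
            String after = i < current.size() ? current.get(i) : null;
            Step step;
            if (before == null) {
                step = Step.ENTER_TEXT;   // a new line appeared
            } else if (after == null) {
                step = Step.DELETE_TEXT;  // a line disappeared
            } else if (!before.equals(after)) {
                step = Step.EDIT_TEXT;    // the line content changed
            } else {
                step = Step.NO_CHANGE;
            }
            System.out.printf("line %d: %s%n", i + 1, step);
        }
    }

    public static void main(String[] args) {
        classify(
                List.of("public int getHeight(int x) {", "  height = x;", "  return height;"),
                List.of("public int getHeight() {", "  return height;"));
    }
}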
... Moran et al. [28]: assembling UI elements after detection; Dou et al. [84]: quantification of website aesthetics; Halter et al. [85]: annotation tool for films; Bao et al. [86]: programming code extraction; Han et al. [87]: 3D sketching system; Bell et al. [88]: product visual similarity; Nishida et al. [89]: sketching of urban models; Shao et al. [90]: semantic modeling of indoor scenes; Schlattner et al. [66]: prediction of an element's property value; Bylinskii et al. [91]: prediction of users' focus areas; Liu et al. [61]: semantic annotations for the Rico dataset; Yeo et al. [92]: pose recognition using a wearable device; Kong et al. [93]: smart glass UI for the selection of home appliances; Mairittha et al. [94]: mobile UI personalization detection and prediction; Stiehl et al. [95]: UI for sign writing (hand gesture) detection; Tensmeyer et al. [96]: font recognition and classification ...
Preprint
Full-text available
In this paper, we present a review of the recent work in deep learning methods for user interface design. The survey encompasses well known deep learning techniques (deep neural networks, convolutional neural networks, recurrent neural networks, autoencoders, and generative adversarial networks) and datasets widely used to design user interface applications. We highlight important problems and emerging research frontiers in this field. We believe that the use of deep learning for user interface design automation tasks could be one of the high potential fields for the advancement of the software development industry.