Copyright © 2023 by Author/s and Licensed by Modestum. This is an open access article distributed under the Creative Commons Attribution License which permits
unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Agricultural and Environmental Education
2023, 2(1), em002
e-ISSN: 2752-647X
https://www.agrenvedu.com/ Research Article OPEN ACCESS
Using artificial intelligence to create biology multiple choice
questions for higher education
Nanda Eska Anugrah Nasution 1*
1 UIN Kiai Haji Achmad Siddiq Jember, Jawa Timur, INDONESIA
*Corresponding Author: nsteska@gmail.com
Citation: Nasution, N. E. A. (2023). Using artificial intelligence to create biology multiple choice questions for higher education. Agricultural and
Environmental Education, 2(1), em002. https://doi.org/10.29333/agrenvedu/13071
ARTICLE INFO
ABSTRACT
Received: 11 Mar. 2023
Accepted: 11 Mar. 2023
This study aims to determine the validity, reliability, level of difficulty, and discrimination power of an artificial
intelligence (AI)-generated collection of biology questions for higher education. Students' responses to AI-generated questions are also presented in this study. A sample of 272 students was selected using a random sampling technique to answer a series of multiple-choice questions and complete a questionnaire. Based on the research findings, 20 of the 21 questions generated by ChatGPT AI are valid. Cronbach's alpha coefficient was determined to be 0.65 (fairly reliable) for the twenty valid questions. Based on student responses to questions generated by ChatGPT's AI, it was determined that 79% of students indicated that the AI-generated questions were relevant to the class subject. 72% of students reported that the clarity of AI-generated questions was acceptable.
relevant to the class subject. 72% of students reported that the clarity of AI-generated questions was acceptable.
73% of students reported that the accuracy of AI-generated questions was good.
Keywords: ChatGPT, multiple choice questions, artificial intelligence, validity, reliability
INTRODUCTION
Algorithms driven by machine-learning technologies are now gaining maturity. ChatGPT is one such innovation. ChatGPT is an
interactive chatbot created by OpenAI, a California-based artificial intelligence (AI) startup (Susnjak, 2022). OpenAI's ChatGPT is a
comprehensive language model. ChatGPT AI was trained on a massive corpus of text data using a deep learning algorithm to create
replies like those of a human to natural language questions (ChatGPT, 2023). The ChatGPT AI bot is now accessible at https://chat.openai.com/chat.
AI natural language processing (NLP) technologies, such as ChatGPT AI, provide a means through which computers may engage with human language. A crucial stage in NLP, known as tokenization, is transforming unstructured information into organized text appropriate for computing (Hosseini et al., 2023). ChatGPT AI is interactive, able to comprehend what is being requested, and able to deliver it if the request complies with application policies and data availability. For example, if you ask a search engine like Google for a list of questions connected to a particular topic, Google will return links to websites that contain information relevant to the query. When the same request is made to ChatGPT AI, the application generates the questions directly in the chat window.
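To illustrate the tokenization step mentioned above, the sketch below splits raw text into word and punctuation tokens with a regular expression. This is a deliberately simplified illustration: models such as ChatGPT use learned subword tokenizers rather than hand-written rules like this.

```python
import re

def tokenize(text: str) -> list[str]:
    # Lowercase the text and split it into word tokens and punctuation tokens.
    # Production NLP systems use learned subword vocabularies instead of
    # a simple rule-based split like this one.
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(tokenize("Write me a multiple choice question about cells."))
# ['write', 'me', 'a', 'multiple', 'choice', 'question', 'about', 'cells', '.']
```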
The emergence of ChatGPT AI is similar to the emergence of other new innovative technologies that, if used appropriately, have the potential to benefit education, despite the fact that ChatGPT AI can also be utilized for activities that are not acceptable in the academic sector. Students, for example, may utilize ChatGPT AI to generate assignments such as essays. However, teachers may be able to use AI to spot AI-created works.
Teachers can use ChatGPT AI in a variety of ways, including asking information-related questions, confirming the accuracy of
data, reviewing topics, etc. Teachers can request ChatGPT AI to generate multiple-choice questions for tests. Obviously, with its
current version, ChatGPT AI has not been able to create an assessment instrument that can accurately measure a learning
objective if it is not given explicit instructions by an expert or teacher. However, it is not impossible that in the future ChatGPT AI
may be able to generate complex questions if it has access to a huge amount of data and has received extensive training.
A question arises regarding the form of questions that the current version of ChatGPT AI is capable of compiling. How valid and reliable are the question sets generated by ChatGPT AI? What is the difficulty level of the questions created by ChatGPT AI? What do students think about the questions created by ChatGPT AI? Are they easy to read and understand? Are they relevant to the material being studied? Are they comparable to questions posed by humans?
Reliability and validity are, at a minimum, the two most important and essential aspects to consider when evaluating any
measurement instrument or tool used (Mohajan, 2017). A measurement instrument is valid when it measures what it is intended
to measure (Muijs, 2011). In other words, if an instrument measures a required variable accurately, it is termed a valid instrument
for that variable (Ghazali, 2016). In comparison, reliability is defined as the degree to which test scores are free of measurement
error (Muijs, 2011). It is a measurement of the stability or internal consistency of an instrument used to measure a particular variable (Jackson, 2003). Multiple-choice questions are regarded as having a high level of reliability since they are scored objectively (Considine et al., 2005; Haladyna, 1999). Validity and reliability are related. It is possible for an instrument to be reliable but not
valid; however, it cannot be valid if it is not reliable (Jackson, 2003). In other words, a valid instrument must also be reliable
(Ghazali, 2016).
The quality of a multiple-choice test instrument can be determined by its validity and reliability, as well as its level of difficulty and discrimination power (Considine et al., 2005; Friatma & Anhar, 2019; Setiawaty et al., 2017; Rao et al., 2016; Salwa, 2012). An item's difficulty corresponds to the proportion of correct responses (McCowan & McCowan, 1999). It is the frequency
with which test-takers select the appropriate response (Thorndike et al., 1991). Items with a higher difficulty index are less difficult.
A question that was answered correctly by 75% of test-takers has a difficulty level of 0.75. A question that was answered correctly
by 35% of test-takers has a difficulty level of 0.35 (McCowan & McCowan, 1999). Item discrimination contrasts the proportion of
high scorers and low scorers who correctly answer a given item. It refers to the degree to which items discriminate between
students in the high and low groups. The whole test and each individual item should assess the same concept. High performers
should be more likely to answer a good question properly, but poor performers should be more likely to do so wrong (McCowan &
McCowan, 1999).
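As a worked illustration with hypothetical counts (not data from this study): if 204 of 272 test-takers answer an item correctly, its difficulty index is P = 204/272 ≈ 0.75 (an easy item on the scale used later in this paper); and if 60 of the 73 students in the top 27% group answer it correctly but only 25 of the 73 students in the bottom 27% group do, its discrimination index is D = 60/73 − 25/73 ≈ 0.48 (good discrimination).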
This study aims to determine the validity, reliability, level of difficulty, and discrimination power of an AI-generated collection
of biology questions for higher education. Students' responses to AI-generated questions are also presented in this study.
METHODS
This research is a descriptive quantitative analysis to explain the validity and reliability of ChatGPT AI's questions. Before
conducting the research, questions obtained from ChatGPT AI were compiled and administered to students. The steps of research
are described in more detail below.
Accessing ChatGPT Artificial Intelligence
The researcher accessed the ChatGPT AI website in 2023, created an account, and logged into the application. The 30 January 2023 version of ChatGPT AI was used (Figure 1).
Figure 1. View of the publicly available ChatGPT AI bot landing page after login (ChatGPT, 2023)
Creating Questions
The researcher asked ChatGPT AI to create questions using the query "write me a multiple choice question with one correct answer option and four wrong answer options about <subject> for bachelor's degree, tag the correct answer". <subject> stands for each of seven basic biology studies discussed in high school and university biology subjects. The seven studies and the distribution of questions made by ChatGPT AI can be seen in Table 1.
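The questions in this study were generated through the public ChatGPT web interface. Purely as a hypothetical sketch, a similar prompt could be issued programmatically through OpenAI's Python client; the model name and this entire workflow are assumptions for illustration, not what was done in the study.

```python
# Hypothetical sketch only: the study used the public ChatGPT web interface,
# not the API. Requires: pip install openai, and the OPENAI_API_KEY
# environment variable set.
from openai import OpenAI

client = OpenAI()

def generate_mcq(subject: str) -> str:
    # The prompt text mirrors the query quoted in this paper.
    prompt = (
        "write me a multiple choice question with one correct answer option "
        f"and four wrong answer options about {subject} for bachelor's degree, "
        "tag the correct answer"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; not specified by the study
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(generate_mcq("cell biology"))
```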
In accordance with the request, ChatGPT AI successfully created 21 questions, each with five multiple-choice options, one of which was the correct answer and four of which were incorrect. ChatGPT AI also marked the correct answer for each question. The questions created by ChatGPT AI were written in English; they were then translated into Indonesian and evaluated by English and Indonesian lecturers with expertise in both languages. The researcher then compiled the 21 questions in a Google form and administered them to students in person. Students were presented with questions in both English and Indonesian. The test was administered under strict supervision and with closed books to ensure that students' responses were based solely on their own knowledge and not on the assistance of others or the internet/books. The exam was administered in 42 minutes, with two minutes allotted to each question.
Students' Responses to Artificial Intelligence-Generated Questions
We gathered student responses to the AI-generated questions using the criteria developed by Susnjak (2022) in his research to assess AI responses. After completing the AI-generated questions, students worked on this questionnaire. Students were told that the questions they had just worked on had been created by AI, and they were given 10 minutes to complete the questionnaire. Only students who were willing completed the questionnaire (it was not required of all students).
Table 2 displays the response questionnaire as well as the criteria.
Participants and Data Collection
This study was carried out at the department of science education at a state university in East Java, Indonesia. A sample of 272 students was selected using a random sampling technique from two study programs, namely biology education and natural science education. Not all students in a class answered the questions, only those who wished to, and likewise only students who were willing completed the response questionnaire. Of the students who worked on the questions, 68% (185 students) also completed the response questionnaire.
The majority (38.97%, n=106) of the participants were aged 20 years. This was followed by 21 years (30.88%, n=84), 19 years (21.32%, n=58), 22 years (6.98%, n=19), 23 years (1.47%, n=4), and 24 years (0.36%, n=1). Among the students, 231 (84.92%) were female and 41 (15%) were male. 133 participants (48.9%) were students of the biology education study program, and 139 participants (51.1%) were students of the natural science education study program. The participants were drawn from all levels of the undergraduate program. However, the majority (50.37%, n=137) of them were third-year students. Second-year students made up 47.42% (n=129), fourth-year students 1.1% (n=3), fifth-year students 0.7% (n=2), and first-year students only 0.36% (n=1).
Statistical Analysis
All statistical analyses were calculated using IBM SPSS Statistics 26 software. The validity of the questions was determined using the Pearson product-moment correlation (Ahrens et al., 2020; Cho et al., 2006; Harahap et al., 2019; Mutmainah & Isdiati, 2022; Salwa, 2012). The reliability of the questions was determined using the Cronbach's alpha value (Ahrens et al., 2020; Cho et al., 2006; Harahap et al., 2019; Mutmainah & Isdiati, 2022; Salwa, 2012). The level of difficulty of the questions was determined by the following formula from McCowan and McCowan (1999):
Difficulty index (P) = (number of test-takers who answered an item correctly) / (total number of test-takers tested).
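To make these computations concrete, the following minimal Python sketch (using pandas and NumPy on a hypothetical 0/1 score matrix, not the study's actual data) shows how item validity, Cronbach's alpha, and the difficulty index can be calculated from the same inputs SPSS works with:

```python
# A minimal sketch of the item statistics used in this study, computed on a
# hypothetical 0/1 score matrix (272 students x 21 items). The random data
# below are placeholders, not the study's actual responses.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
scores = pd.DataFrame(rng.integers(0, 2, size=(272, 21)),
                      columns=[f"q{i}" for i in range(1, 22)])
total = scores.sum(axis=1)  # each student's total score

# Validity: Pearson product-moment correlation of each item with total score.
validity = scores.apply(lambda item: item.corr(total))

# Reliability: Cronbach's alpha = k/(k-1) * (1 - sum of item variances / total variance).
k = scores.shape[1]
alpha = (k / (k - 1)) * (1 - scores.var(ddof=1).sum() / total.var(ddof=1))

# Difficulty index P: proportion of correct answers per item.
difficulty = scores.mean()

print(validity.round(3))
print(f"Cronbach's alpha = {alpha:.3f}")
print(difficulty.round(2))
```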
Table 1. The seven biology studies and the distribution of questions

Biology subject      Number of questions   Question numbers
Change and growth    3                     1, 2, & 3
Cell                 3                     4, 5, & 6
Biodiversity         3                     7, 8, & 9
Genetics             3                     10, 11, & 12
Evolution            3                     13, 14, & 15
Ecology              3                     16, 17, & 18
Biotechnology        3                     19, 20, & 21
Table 2. Questionnaire of student responses to AI-generated questions (criteria: relevance, clarity, accuracy, precision, and depth)
The difficulty level of the questions is described as follows: difficult if below 0.3, medium if between 0.3 and 0.7, and easy if above 0.7. Using the following formula from Salwa (2012), the discrimination power level of the questions is determined:
Discrimination index (D) = (number of top test-takers who answered an item correctly) / (total number of top test-takers tested, i.e., 27% of all students) − (number of bottom test-takers who answered an item correctly) / (total number of bottom test-takers tested, i.e., 27% of all students).
The discrimination power of the questions is defined as follows: poor if below 0.2, adequate if between 0.2 and 0.4, good if between 0.4 and 0.7, and excellent if over 0.7. If the result is negative, the discrimination power level of the item is inadequate, and the item must be eliminated. Students' responses to AI-generated questions were analyzed descriptively.
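Continuing the same hypothetical sketch from the previous code block (the `scores` matrix and `total` vector are reused), the discrimination index with 27% upper and lower groups, and its classification on the scale above, can be computed as follows:

```python
# Discrimination index D using top and bottom 27% groups, per Salwa (2012).
# Continues the hypothetical `scores` and `total` from the sketch above.
n_group = int(round(0.27 * len(scores)))            # 27% of 272 ≈ 73 students
ranked = total.sort_values(ascending=False).index   # students ranked by total score
top = scores.loc[ranked[:n_group]]                  # top 27% of test-takers
bottom = scores.loc[ranked[-n_group:]]              # bottom 27% of test-takers
discrimination = top.mean() - bottom.mean()         # per-item proportion difference

def describe(d: float) -> str:
    # Classification thresholds as defined in this paper.
    if d < 0:
        return "negative (eliminate)"
    if d < 0.2:
        return "poor"
    if d < 0.4:
        return "adequate"
    if d <= 0.7:
        return "good"
    return "excellent"

for item, d in discrimination.items():
    print(f"{item}: D = {d:.2f} ({describe(d)})")
```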
RESULTS
ChatGPT Artificial Intelligence-Generated Questions
ChatGPT AI successfully generated 21 questions. Appendix A shows a list of all questions.
Validity
The results of the validity test of all ChatGPT AI-generated questions can be seen in Table 3.
Reliability
The results of the reliability test for all ChatGPT AI-generated questions may be viewed in Table 4 if the invalid question is not
removed (question no. 17), and in Table 5 if the invalid question is removed.
Level of Difficulty and Discrimination Power
The results of the level of difficulty and discrimination power of all ChatGPT AI-generated questions can be seen in Table 6.
Student Responses to Artificial Intelligence-Generated Questions
The percentage of student responses to questions generated by ChatGPT AI can be seen in Figure 2.
Table 3. The results of the validity of all ChatGPT AI-generated questions

Question number   Pearson correlation   Sig. (2-tailed)   Description
1                 .321**                .000              Valid
2                 .211**                .000              Valid
3                 .258**                .000              Valid
4                 .313**                .000              Valid
5                 .539**                .000              Valid
6                 .401**                .000              Valid
7                 .15**                 .000              Valid
8                 .365**                .000              Valid
9                 .391**                .000              Valid
10                .502**                .000              Valid
11                .478**                .000              Valid
12                .421**                .000              Valid
13                .338**                .000              Valid
14                .318**                .000              Valid
15                .222**                .000              Valid
16                .467**                .000              Valid
17                .038                  .533              Invalid
18                .429**                .000              Valid
19                .345**                .000              Valid
20                .39**                 .000              Valid
21                .478**                .000              Valid
Table 4. The results of the reliability of all ChatGPT AI-generated questions (invalid question is not removed)

Cronbach's alpha   Number of items
.623               21

Table 5. The results of the reliability of all ChatGPT AI-generated questions (invalid question is removed)

Cronbach's alpha   Number of items
.655               20
DISCUSSION
According to Kimberlin and Winterstein (2008), validity is generally described as the degree to which an instrument measures
what it claims to measure. It is necessary for an instrument to be valid so that it may be used to measure its intended subject.
Using the Pearson product-moment correlation method to assess the validity of the questions, it was determined that 20 out of 21 items were valid, while one item was invalid. The invalid question is number 17, which is related to ecology. The results of the
validity test indicate that 20 of the 21 questions generated by AI are valid and may be used.
Question number 17, which is invalid, asks students to choose the term used to describe the way by which organisms obtain energy from their environment. Option D (photosynthesis) is the correct answer, selected by 90 students (33%). Option A (metabolism), which 113 students (41.54%) selected, option B (ecosystem), which 40 students (14.7%) selected, option C (biodiversity), which 21 students (7.7%) selected, and option E (biogeography), which eight students (2.9%) selected, are all incorrect answer choices. Based on student choices, it was determined that more students selected option A (an incorrect answer option) than the right answer. Two conclusions can be drawn: first, answer choice A is an excellent distractor, or, second, there is a problem with question number 17. According to follow-up interviews with three randomly chosen students who claimed to have chosen answer option A, they were confused by the wording of the question. If the question had been "how do plants get energy from their environment" or "how do organisms obtain energy from nature", it is likely that these students would have chosen option D (the correct one). Yet, given that the wording of question 17 is "how organisms obtain energy from their environment", students misinterpreted the organisms at issue as animals and plants, and the environment referred to as other living species (by way of prey).
Table 6. The results of the level of difficulty and discrimination power of all ChatGPT AI-generated questions

                  Level of difficulty          Discrimination power
Question number   Scale   Description          Scale   Description
1                 0.87    Easy                 0.19    Poor
2                 0.15    Difficult            0.21    Adequate
3                 0.36    Medium               0.32    Adequate
4                 0.74    Easy                 0.4     Good
5                 0.39    Medium               0.7     Good
6                 0.88    Easy                 0.29    Adequate
7                 0.17    Difficult            0.08    Poor
8                 0.71    Easy                 0.44    Good
9                 0.82    Easy                 0.36    Adequate
10                0.85    Easy                 0.45    Good
11                0.57    Medium               0.63    Good
12                0.56    Medium               0.55    Good
13                0.70    Medium               0.34    Adequate
14                0.82    Medium               0.29    Adequate
15                0.39    Medium               0.19    Poor
16                0.92    Easy                 0.21    Adequate
17                0.33    Medium               0       Poor
18                0.38    Medium               0.53    Good
19                0.88    Easy                 0.23    Adequate
20                0.41    Medium               0.51    Good
21                0.93    Easy                 0.22    Adequate
Figure 2. The percentage of student responses to questions generated by ChatGPT AI (Source: Author’s own elaboration)
In addition, language and sentence issues that may be present in multiple-choice questions created by ChatGPT AI, such as question number 17 in this research, may be corrected by experts using content and face validity, as suggested by Considine et al. (2005) and Harahap and Nasution (2022). Nevertheless, we did not do so in our research because we wanted to ensure that the questions generated by ChatGPT AI were free from any human adjustment.
Cronbach's alpha was used to assess the scale's internal consistency. Cronbach's alpha coefficient was determined to be 0.623 if item 17 (the invalid item) was not removed, and 0.655 if item 17 was removed. The acceptable values for Cronbach's alpha vary according to the source. According to van Griethuijsen et al. (2014), the acceptable values of Cronbach's alpha are 0.7 or 0.6. Arulogun et al. (2020), George and Mallery (2003), Morgan et al. (2004), Rii et al. (2020), Taber (2018), and Wongpakaran and Wongpakaran (2012) emphasized the same point, that a Cronbach's alpha above 0.6 can be recognized as indicating a reliable instrument. If this value is adhered to, then the multiple-choice questions generated by ChatGPT AI in this study may be deemed reliable. Several other sources, however, state that the allowable values for Cronbach's alpha are 0.8 or even 0.9; if those figures are used, the multiple-choice questions created by ChatGPT AI in this study may be regarded as unreliable.
By evaluating the level of difficulty of the questions, it was determined that, of the 21 questions created by ChatGPT AI, nine were classified as easy, 10 were classified as medium, and two were classified as difficult. It is preferable to use a proportionate distribution of easy, medium, and difficult multiple-choice questions. In this context, proportionate means that there should be at least twice as many questions at the medium level as at the easy and difficult levels, with an equal number of questions at the easy and difficult levels. ChatGPT AI developed nearly equal numbers of easy and medium multiple-choice questions, and only two items (9.5%) are classified as difficult. It is preferable that questions at the easy and difficult levels be revised to be more proportional, or rewritten as questions with a medium difficulty level. Rao et al. (2016) stated that ideally multiple-choice questions have a medium level of difficulty. Of course, this should be adjusted depending on the aim of the assessment.
By assessing the discriminating power of the questions, it was determined that, of the 21 questions created by ChatGPT AI, four had poor discrimination power, nine had adequate discrimination power, and the remaining eight had good discrimination power. Questions with poor discrimination power should be modified to have discrimination power that is adequate or greater. There are no items with negative discrimination power, suggesting that there are no questions that should be deleted based on the discriminating power analysis. One of the items, however, has a discrimination value of zero, indicating that this item has very poor discriminatory power, since the numbers of students who answered it correctly in the upper group and the lower group are identical. This question turned out to be number 17, which was classified as invalid based on the validity test, so it is not unexpected that this question has very poor discriminating power. Moreover, the difficulty index and discrimination index are reciprocally related (Chauhan et al., 2013; Mehta & Mokhasi, 2014; Rao et al., 2016; Suruchi & Rana, 2014). For instance, if a question is determined to have a low level of difficulty and poor discriminating power, the question should be revised (Rao et al., 2016).
Based on student responses to questions produced by ChatGPT's AI, it was determined that 79% of students indicated the AI-
generated questions were relevant to the departmental subject they study. This finding suggests that ChatGPT AI is capable of
generating questions pertaining to the specified subject, in this case biology in natural science, including change and growth, cell,
biodiversity, genetics, evolution, ecology, and biotechnology. 72% of students reported that the questions generated by AI were
clear. This suggests that the majority of students are capable of comprehending the questions posed by ChatGPT AI. The clarity of
questions is determined by three survey items. The first item on the questionnaire asks whether the questions generated by
ChatGPT AI are simple to comprehend. 66% of students indicated that the questions were straightforward. The second question
asks if the questions generated by ChatGPT AI are logically structured and ordered. According to 76% of students, the questions
were well-structured and logically ordered. The last question asks if questions generated by ChatGPT AI employ the proper
language. 73% of students felt the question language was suitable. The questions in an assessment must be clear and concise.
Difficult-to-understand questions will surely make it harder for students to answer, and there is a chance that students will
respond erroneously not because of their incompetence but because of an error in the question.
73% of students stated that the AI-generated questions were accurate. This means that the majority of students consider the
questions created by AI to be accurate; they see no grammatical or conceptual errors in the questions. However, one cannot depend solely on students' opinions to confirm the accuracy of a question. Several experts should be consulted to validate the questions. Nevertheless, as stated previously, the questions in this study were deliberately not evaluated by experts, in order to preserve them exactly as generated by AI.
74% of students indicated that the questions generated by AI were precise. This suggests that the majority of students consider
AI-generated questions to be explicit and detailed. Students comprehend the intent of the questions and the required responses.
If questions are not made clear and explicit, it is possible that students may have difficulties answering.
71% of students indicated that the questions posed by AI were of sufficient depth. The majority of students found that the
questions generated by ChatGPT AI were challenging, not overly simple, and appropriate for their college or university level. As
was done in this study, measuring the difficulty level of the questions is another method for determining if the questions are too
easy or too difficult. Just two of the twenty-one questions generated by AI are difficult, while nine are quite easy.
The majority of students responded positively to the questions generated by ChatGPT's AI, according to the results of the student response questionnaire. Therefore, teachers can use AI to assist them in constructing an assessment tool, but this must be complemented by the teacher's capacity to provide AI with clear instructions and to verify and optimize the resulting assessment tool as needed. Further study is required to determine whether students can differentiate between questions developed by AI and those created by humans, as well as their perspectives on the conditions for AI-created questions.
Given that constructing multiple-choice questions is a complex and time-consuming process (Rao et al., 2016), it would be
highly beneficial if AI could aid teachers or the education sector in the future in developing standardized and high-quality multiple-
choice questions. Nevertheless, the present version of ChatGPT AI has several limitations, as mentioned by OpenAI on its website
(ChatGPT, 2023), such as the possibility of producing wrong information, harmful instructions, or biased material, and limited
awareness of the world and events after 2021. Quite likely, ChatGPT AI will acquire more data and better training over time,
allowing it to assist its users more effectively.
CONCLUSION
Based on the research findings, twenty of the twenty-one questions generated by ChatGPT AI are valid. The only invalid question is an ecology-related question. Cronbach's alpha coefficient was determined to be 0.65 for the twenty valid questions. By assessing the level of difficulty of the questions, it was determined that, of the 21 questions created by ChatGPT AI, nine were rated as easy, 10 were classified as medium, and two were classified as difficult. By assessing the discriminating power of the questions, it was determined that, of the 21 questions created by ChatGPT AI, four had poor discrimination power, nine had adequate discrimination power, and the remaining eight had good discrimination power. Based on student responses to questions generated by ChatGPT's AI, it was determined that 79% of students indicated that the AI-generated questions were relevant to the class subject. 72% of students reported that the clarity of AI-generated questions was acceptable. 73% of students reported that the accuracy of AI-generated questions was good. According to 74% of students, the precision of AI-generated questions was good. 71% of students reported that the depth of the questions generated by AI was acceptable.
Funding: No funding source is reported for this study.
Ethical statement: The author stated that all participants were over the age of 18 and that their participation was entirely voluntary. The author also stated that since no personal data were analyzed and pseudonyms were used in this article, no ethics committee approval was required.
Declaration of interest: No conflict of interest is declared by the author.
Data sharing statement: Data supporting the findings and conclusions are available upon request from the author.
REFERENCES
Arulogun, O. T., Akande, O. N., Akindele, A. T., & Badmus, T. A. (2020). Survey dataset on open and distance learning students' intention to use social media and emerging technologies for online facilitation. Data in Brief, 31, 105929.
https://doi.org/10.1016/j.dib.2020.105929
ChatGPT. (2023). ChatGPT. https://chat.openai.com/chat
Chauhan, P. R., Rathod, S. P., Chauhan, B. R., Chauhan, G. R., Adhvaryu, A., & Chauhan, A. P. (2013). Study of difficulty level and
discriminating index of stem type multiple choice questions of anatomy in Rajkot. Biomirror, 4(6), 1-4.
Cho, K., Schunn, C. D., & Wilson, R. W. (2006). Validity and reliability of scaffolded peer assessment of writing from instructor and
student perspectives. Journal of Educational Psychology, 98(4), 891-901. https://doi.org/10.1037/0022-0663.98.4.891
Considine, J., Botti, M., & Thomas, S. (2005). Design, format, validity and reliability of multiple choice questions for use in nursing
research and education. Collegian, 12(1), 19-24. https://doi.org/10.1016/S1322-7696(08)60478-3
de Barros Ahrens, R., da Silva Lirani, L., & de Francisco, A. C. (2020). Construct validity and reliability of the work environment
assessment instrument WE-10. International Journal of Environmental Research and Public Health, 17(20), 7364.
https://doi.org/10.3390/ijerph17207364
Friatma, A., & Anhar, A. (2019). Analysis of validity, reliability, discrimination, difficulty and distraction effectiveness in learning
assessment. Journal of Physics: Conference Series, 1387, 012063. https://doi.org/10.1088/1742-6596/1387/1/012063
George, D., & Mallery, P. (2003). SPSS for Windows step by step: A simple guide and reference 11.0 update. Allyn & Bacon.
Ghazali, N. H. M. (2016). A reliability and validity of an instrument to evaluate the school-based assessment system: A pilot study.
International Journal of Evaluation and Research in Education, 5(2), 148-157. http://doi.org/10.11591/ijere.v5i2.4533
Haladyna, T. M. (1999). Developing and validating multiple-choice test items. Lawrence Erlbaum.
Harahap, F., Nasution, N. E. A., & Manurung, B. (2019). The effect of blended learning on students' learning achievement and
science process skills in plant tissue culture course. International Journal of Instruction, 12(1), 521-538.
https://doi.org/10.29333/iji.2019.12134a
Harahap, M. P., & Nasution, N. E. A. (2022). Validity of computer based learning media to improve junior high school students'
learning outcomes on ecosystem topics. META: Journal of Science and Technological Education, 1(1), 31-45.
Hosseini, M., Rasmussen, L. M., & Resnik, D. B. (2023). Using AI to write scholarly publications. Accountability in Research, 1-9.
https://doi.org/10.1080/08989621.2023.2168535
Jackson, S. L. (2003). Research methods and statistics: A critical thinking approach. Thomson Wadsworth.
Kimberlin, C. L., & Winterstein, A. G. (2008). Validity and reliability of measurement instruments used in research. American Journal
of Health-System Pharmacy, 65(23), 2276-2284. https://doi.org/10.2146/ajhp070364
McCowan, R. J., & McCowan, S. C. (1999). Item analysis for criterion-referenced tests. Center for Development of Human Services.
https://files.eric.ed.gov/fulltext/ED501716.pdf
Mehta, G., & Mokhasi, V. (2014). Item analysis of multiple choice questions: An assessment of the assessment tool. International
Journal of Health Sciences and Research, 4(7), 197-202. https://doi.org/10.1016/j.mjafi.2020.11.007
Mohajan, H. K. (2017). Two criteria for good measurements in research: Validity and reliability. Annals of Spiru Haret University,
17(3), 58-82. https://doi.org/10.26458/1746
Morgan, P. J., Cleave-Hogg, D., DeSousa, S., & Tarshis, J. (2004). High-fidelity patient simulation: Validation of performance
checklists. BJA: British Journal of Anaesthesia, 92(3), 388-392. https://doi.org/10.1093/bja/aeh081
Muijs, D. (2011). Doing quantitative research in education with SPSS. SAGE. https://doi.org/10.4135/9781849203241
Mutmainah, I., & Isdiati, A. (2022). Validity and reliability test of a written English test online-based using Google Form.
INTERACTION: Jurnal Pendidikan Bahasa [INTERACTION: Journal of Language Education], 9(1), 89-100.
Rao, C., Kishan, P. H. L., Sajitha, K., Permi, H., & Shetty, J. (2016). Item analysis of multiple choice questions: Assessing an
assessment tool in medical students. International Journal of Education and Psychological Research, 2, 201-214.
https://doi.org/10.4103/2395-2296.189670
Rii, K. B., Choi, L. K., Shino, Y., Kenta, H., & Adianita, I. R. (2020). Application of iLearning education in learning methods for entrepreneurship and elementary school student innovation. Aptisi Transactions on Technopreneurship, 2(2), 131-142.
https://doi.org/10.34306/att.v2i2.90
Salwa, A. (2012). The validity, reliability, level of difficulty and appropriateness of curriculum of the English test [PhD thesis,
Diponegoro University].
Setiawaty, R., Sulistyorini, T. B., Margono, & Rahmawat, L. E. (2017). Validity test and reliability of Indonesian language multiple
choice in final term examination. In Proceedings of the 1st International Seminar on Language, Literature and Education (pp. 43-
50). https://doi.org/10.18502/kss.v3i9.2609
Suruchi, S., & Rana, S. S. (2014). Test item analysis and relationship between difficulty level and discrimination index of test items
in an achievement test in biology. Paripex-Indian Journal of Research, 3(6), 56-58. https://doi.org/10.15373/22501991/JUNE
2014/18
Susnjak, T. (2022). ChatGPT: The end of online exam integrity? arXiv. https://doi.org/10.48550/arXiv.2212.09292
Taber, K. S. (2018). The use of Cronbach's alpha when developing and reporting research instruments in science education.
Research in Science Education, 48, 1273-1296. https://doi.org/10.1007/s11165-016-9602-2
Thorndike, R. M., Cunningham, G. K., Thorndike, R. L., & Hagen, E. P. (1991). Measurement and evaluation in psychology and
education. MacMillan.
van Griethuijsen, R. A. L. F., van Eijck, M. W., Haste, H., den Brok, P. J., Skinner, N. C., Mansour, N., Gencer, A. S., & BouJaoude, S. (2014). Global patterns in students' views of science and interest in science. Research in Science Education, 45(4), 581-603.
http://doi.org/10.1007/s11165-014-9438-6
Wongpakaran, T., & Wongpakaran, N. (2012). A comparison of reliability and construct validity between the original and revised
versions of the Rosenberg self-esteem scale. Psychiatry Investigation, 9(1), 54-58. https://doi.org/10.4306/pi.2012.9.1.54
APPENDIX A: ChatGPT AI-GENERATED QUESTIONS
1. Which of the following is NOT a factor that affects aging and biological changes in organisms?
a. Environmental factors
b. Genetic factors
c. Lifestyle choices
d. Planetary alignments
e. Hormonal imbalances
2. Which of the following hormones is primarily responsible for stimulating cell division and growth in animals?
a. Adrenaline
b. Insulin
c. Thyroxine
d. Testosterone
e. Human growth hormone
3. What is the process by which a single cell develops into a multicellular organism called?
a. Differentiation
b. Fertilization
c. Metamorphosis
d. Germination
e. Replication
4. What is the primary function of the endoplasmic reticulum in a eukaryotic cell?
a. Energy production
b. Protein synthesis
c. Waste removal
d. Storage of lipids
e. DNA replication
5. What is the primary function of lysosomes in a eukaryotic cell?
a. Energy production
b. Protein synthesis
c. Waste removal
d. Storage of lipids
e. DNA replication
6. What is the basic unit of life?
a. Atom
b. Molecule
c. Cell
d. Tissue
e. Organ
7. What is the main cause of loss of biodiversity?
a. Climate change
b. Deforestation
c. Overfishing
d. Land use change
e. Increased CO2 emissions
8. What is the definition of biodiversity?
a. The number of different species in a particular habitat
b. The total amount of living matter in a given area
c. The variety of life forms in a particular ecosystem
d. The total mass of all living organisms in a given area
e. The study of different types of plants and animals in a given area
9. What is the term used to describe the variety of life on Earth, including the variety of species, ecosystems, and genes?
a. Biodiversity
b. Geodiversity
c. Climate diversity
d. Technology diversity
e. Cultural diversity
10. What is the study of heredity and variation in organisms called?
a. Botany
b. Zoology
c. Genetics
d. Paleontology
e. Geology
11. What type of molecule carries genetic information in cells?
a. Carbohydrates
b. Lipids
c. Proteins
d. Nucleic acids
e. Vitamins
12. What is the scientific method used to determine the sequence of nucleotides in a DNA molecule?
a. Electrophoresis
b. PCR
c. Sequencing
d. Microscopy
e. Chromatography
13. What is the process by which species of organisms change over generations through natural selection?
a. Regression
b. Mutation
c. Adaptation
d. Evolution
e. Stagnation
14. What theory was developed by Charles Darwin and Alfred Russel Wallace to explain the mechanism of evolution?
a. Plate Tectonics Theory
b. General Relativity
c. Theory of Evolution by Natural Selection
d. Quantum Mechanics
e. The Big Bang Theory
15. What is the main driving force behind evolution?
a. Divine intervention
b. Survival of the fittest
c. A fixed process
d. Mutation of species at random
e. The environment alone
16. What is the study of relationships between organisms and their environment known as?
a. Sociology
b. Geology
c. Ecology
d. Meteorology
e. Archaeology
17. What is the term used to describe the way organisms obtain energy from their environment?
a. Metabolism
b. Ecosystem
c. Biodiversity
d. Photosynthesis
e. Biogeography
18. What is the term used to describe the variety of life found in a specific location?
a. Biodiversity
b. Ecosystem
c. Biogeography
d. Metabolism
e. Photosynthesis
19. What is the application of scientific and engineering principles to the processing of materials by biological agents to provide goods
and services?
a. Nanotechnology
b. Biotechnology
c. Robotics
d. Geotechnology
e. Information technology
20. What is the process of genetically modifying an organism to produce a desired trait or characteristic called?
a. Cloning
b. Fermentation
c. Hybridization
d. Transfection
e. Gene editing
21. What is the production of a large amount of a specific protein using microorganisms such as bacteria or yeast in bioreactors called?
a. Cloning
b. Fermentation
c. Hybridization
d. Transfection
e. Gene editing