Unveiling Dual Quality in Product Reviews: An NLP-Based Approach
Rafał Poświata, Marcin Michał Mirończuk, Sławomir Dadas,
Małgorzata Grębowiec, Michał Perełkiewicz
National Information Processing Institute
al. Niepodległości 188b, 00-608 Warsaw, Poland
{rposwiata, mmironczuk, sdadas, mgrebowiec, mperelkiewicz}@opi.org.pl
Abstract
Consumers often face inconsistent product quality, particularly when identical products vary between markets, a situation known as the dual quality problem. To identify and address this issue, automated techniques are needed. This paper explores how natural language processing (NLP) can aid in detecting such discrepancies and presents the full process of developing a solution. First, we describe in detail the creation of a new Polish-language dataset with 1,957 reviews, 540 of which highlight dual quality issues. We then discuss experiments with various approaches, such as SetFit with sentence transformers, transformer-based encoders, and LLMs, including error analysis and robustness verification. Additionally, we evaluate multilingual transfer using a subset of opinions in English, French, and German. The paper concludes with insights on deployment and practical applications.
1 Introduction
Dual quality of products refers to practices where companies sell items under the same brand and similar packaging in different markets, yet present them with significantly altered composition or quality parameters (The European Consumer Organisation (BEUC), 2018). This phenomenon has sparked growing controversy among consumers, especially within the European Union (EU), where it is perceived as a potential violation of fair competition rules (The European Consumer Organisation (BEUC), 2018). From a sociological and economic perspective, dual quality practices raise multifaceted concerns about market trust, purchasing behaviours, and the perception of fairness among consumers (Veselovská, 2022; Bartkova and Sirotiaková, 2021). Multiple reports published by consumer organizations and EU research services suggest that offering products with distinct ingredients or characteristics under identical branding constitutes a widespread international issue (The European Consumer Organisation (BEUC), 2018; European Parliament, 2019; European Commission, 2023). For these reasons, EU regulations—such as the amended Directive on Unfair Commercial Practices—recognize dual quality as misleading conduct, which may require enforcement at the national level (Chambers; EU Monitor); see Appendix A for more details. Our recent research project focused on creating a solution to address this problem for a national agency in one of the EU countries, namely the Office of Competition and Consumer Protection (UOKiK) in Poland (https://uokik.gov.pl/en).
The main goal of the project was to automate the detection of unfair commercial practices using natural language processing (NLP) methods. The project, currently in the proof-of-concept stage, enables the automated collection and analysis of product-related data from e-commerce sites and social media. It comprises a data retrieval module (intelligent web crawling, scraping, cleaning, and preprocessing) and a text analysis module that includes language identification, sentiment analysis, aspect-based sentiment analysis, and the detection of consumer reviews¹ that may indicate potential dual quality issues in products.
In this paper, we focus on the last and most novel of these components for detecting dual quality reviews, describing the entire process from data preparation, through extensive evaluation of different approaches, to deployment. To our knowledge, no available dataset or model is aimed at recognizing dual quality-related reviews. While several articles (discussed further in Section 2) approach dual quality from sociological, economic, and legal perspectives, our study takes a different approach, presented in Figure 1.

¹ In this article, we use the terms 'reviews' and 'opinions' interchangeably to refer to consumer expressions regarding a product. While 'review' may often imply a structured evaluation, we also include informal opinions that may indicate perceptions of dual quality.
[Figure 1 shows a consumer expressing two opinions about the same coffee product: one explicitly comparing the German and Polish versions, which the NLP analysis flags as dual quality, and one merely mentioning drinking the coffee in Germany, which is not flagged.]
Figure 1: Illustration of the NLP-based workflow for recognizing dual quality consumer reviews. The dual quality detection system flags reviews for potential issues when a consumer explicitly notes a difference between product versions from different markets. This illustration exemplifies the process with a Polish consumer assessing products from Polish and German markets; the reviews shown are English translations of the original Polish texts for clarity and wider accessibility.
The main contributions of this work can be summarized as follows:
• Proposition of a new NLP task: detecting dual quality issues in product reviews.
• A coherent methodology for dataset construction and the preparation of a corpus of 1,957 human-verified product reviews, 540 of which potentially exhibit dual quality.
• A comprehensive evaluation of Polish and multilingual models, including a presentation of various metrics, error analysis, and robustness verification conducted primarily for Polish.
• Expansion of the dataset to include product reviews in other key languages, such as English, German, and French, demonstrating the system's multilingual capabilities.
2 Related Work
Economic and social research on dual quality prod-
ucts highlights the erosion of consumer trust when
identical branding masks disparities in product
quality across EU Member States. Studies indi-
cate that these discrepancies, particularly in food
products, impact consumer perceptions of fair-
ness and lead to behavioral changes in purchas-
ing decisions (Bartková et al.,2018;Bartková,
2019;Bartkova et al.,2021;Bartkova and Siro-
tiaková,2021). Research has further demon-
strated that wealthier consumers are more aware
of the issue and seek alternatives in other markets,
whereas lower-income consumers are more likely
to adapt their behavior to avoid lower-quality prod-
ucts (Bartkova and Sirotiako,2021). The per-
ception of dual quality as an economic problem
is also evident, as lower-quality ingredients often
correspond to price disparities that disadvantage
consumers in specific regions (Závadský and Hi-
adlovský,2020).
Additionally, empirical studies confirm that public perception of dual quality is shaped by exposure to media reports and political discourse, leading to heightened scrutiny of multinational corporations and their regional product differentiation strategies (Veselovská, 2022). While some scholars argue that manufacturers may justify product variations based on local market preferences, research suggests that these practices often lack transparency and leave consumers feeling deceived (Bartkova and Veselovska, 2023). Moreover, comparative consumer tests confirm that dual quality is not confined to food products but also extends to household and personal care items, reinforcing the need for regulatory intervention (Bartková and Veselovská, 2024). Given the strong consumer opposition across Europe, particularly in Central and Eastern European countries, economic research increasingly supports regulatory measures to curb these practices and ensure consistent product quality across EU markets.
From a computer science perspective, the application of NLP techniques to e-commerce platforms and customer behavior analysis is widely studied. Among these works, we can point out customer review analysis (Botunac et al., 2024; Satjathanakul and Siriborvornratanakul, 2024; Mamani-Coaquira and Villanueva, 2024), product question answering (Shen et al., 2023; Wang et al., 2023), product categorization (Gong et al., 2023), moderation of e-commerce reviews (Nayak and Garera, 2022), product feature extraction from the web (Fuchs et al., 2022), customer service support (Obadinma et al., 2022), data augmentation in e-commerce (Avigdor et al., 2023), fake news detection (Hu et al., 2023), predictive quality in manufacturing (Tercan and Meisen, 2022), and intent classification (Parikh et al., 2023). However, none of these works address the dual quality problem directly or consider how to harness consumer opinions—such as reviews from the Internet, e-commerce platforms, or social media—to help resolve this issue. Thus, a clear research gap exists in applying NLP-based methods to detect or analyze dual quality products.
3 DQ Dataset
3.1 Dataset Creation Methodology
In the first stage of our work, we collected a large dataset of reviews in Polish, sourced from the e-commerce platform CENEO² and WIZAZ³, a discussion forum on beauty, makeup, and cosmetics. Our preliminary tests showed that the problem of dual quality does not occur often in reviews, and thus randomly selecting a set of opinions and giving them to annotators is an inefficient approach to building a dataset. Therefore, we prepared a methodology to optimize this process, which consists of the following steps:
(1) Find dual quality reviews on the Internet by searching for publicly available articles that describe the problem of dual quality. Such articles often included examples of products along with the differences observed depending on the sales market, which we extracted. In addition, some articles had comment sections where people shared their experiences with the dual quality issue, which we also collected. In this way, we obtained 117 dual quality reviews.
(2) Randomly select 300 reviews from the CENEO/WIZAZ dataset as standard opinions that do not indicate a dual quality problem. These reviews were verified to ensure that they are standard. Along with the examples obtained in step (1), these formed the base dataset.
(3) Train a model using a few-shot learning method to detect dual quality reviews based on the prepared base or an extended dataset (in subsequent iterations). We adopted this approach due to the limited amount of training data. The model was implemented using the SetFit (Sentence Transformer Fine-tuning) framework (Tunstall et al., 2022) and a sentence transformer for the Polish language, st-polish-paraphrase-from-distilroberta⁴.
(4) Apply the model trained in step (3) to all reviews of the CENEO/WIZAZ dataset. The results of the classification were sorted according to the probability returned by the model.
(5) Select up to 200⁵ reviews with the highest probability of indicating a dual quality problem that did not appear previously in the dataset. Then perform manual verification of the selected reviews. If a review did not indicate a dual quality issue, it was labeled as a standard review. During this step, we noticed that some reviews mentioned other problems, including, for example, the product being possibly counterfeit, deterioration in product quality over time, or the received product not matching the order. Annotators labeled such opinions as other problems and added additional information regarding the type of problem mentioned in the review. For training the model in step (3), the reviews labeled as other problems and standard were combined. The outcome of this step and the base dataset constituted the extended dataset.
(6) Return to step (3) to increase the size of the dataset.

Steps (3), (4), and (5) were repeated 7 times, allowing us to expand the base dataset with 1,303 examples (in the last iteration only 103 new reviews were selected). We then applied the model, trained on the entire dataset prepared so far, to classify the reviews imported into the demo version of our system. Reviews were sourced from Polish and international e-commerce sites. Of these reviews, 237 were labeled as dual quality, which we manually verified and changed if necessary. As a result of the entire process described above, we obtained a DQ (Dual Quality) dataset consisting of 1,957 unique examples. To ensure annotation accuracy, we conducted cross-validation and identified examples where the models were most often wrong. After verifying these errors, in 67 (3.4%) cases the label was incorrect and was changed. The whole process is shown in Figure 4; a simplified sketch of the iterative loop follows below.

² https://www.ceneo.pl/
³ https://wizaz.pl/forum/
⁴ At the time of the dataset creation (beginning of 2023) it was the top Polish sentence transformer, as confirmed by Dadas et al. (2024b).
⁵ Initially, many reviews were classified as dual quality, making a probability threshold unsuitable. Selecting 200 enabled swift human verification, speeding up subsequent iterations.
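To make the iterative part of this methodology concrete, the sketch below outlines one bootstrapping round (steps 3-5) as it could be implemented with the pre-1.0 SetFit API and the hyperparameters listed in Appendix C; the function name, the integer label ids, and the Hugging Face model id are our assumptions for illustration, not the project's actual code.

```python
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

DUAL_QUALITY = 1  # assumed integer id of the "dual quality" label in the labeled data

def bootstrap_round(labeled: Dataset, unlabeled_texts: list[str], top_k: int = 200) -> list[str]:
    """One iteration of steps (3)-(5): train SetFit on the current labeled set,
    score all unlabeled reviews, and return the top-k candidates for manual annotation."""
    # Step (3): few-shot fine-tuning of a Polish sentence transformer with SetFit.
    model = SetFitModel.from_pretrained("sdadas/st-polish-paraphrase-from-distilroberta")
    trainer = SetFitTrainer(
        model=model,
        train_dataset=labeled,   # columns: "text" (str), "label" (int)
        num_iterations=1,        # contrastive pair-generation iterations (Appendix C)
        num_epochs=1,
        batch_size=8,
        learning_rate=2e-5,
    )
    trainer.train()

    # Step (4): score every unlabeled review with the dual quality probability and sort.
    probs = model.predict_proba(unlabeled_texts)
    scored = sorted(zip(unlabeled_texts, (row[DUAL_QUALITY] for row in probs)),
                    key=lambda item: item[1], reverse=True)

    # Step (5): hand the most likely dual quality candidates over for manual verification.
    return [text for text, _ in scored[:top_k]]
```

In practice, each returned candidate was manually verified and labeled before being added to the extended dataset for the next round (step 6).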
[Figure 2 data. Dual quality reviews: Home and interior 25%, Beauty 21%, Delicacies 19%, For children 15%, Other 6%, Automotive 5%, Health 4%, Home appliances 2%, Construction and renovation 1%, Hobbies and pets 1%. Reviews about other problems: Beauty 47%, Health 11%, Delicacies 11%, Home and interior 9%, For children 7%, Other 7%, Home appliances 3%, Hobbies and pets 2%, Sports and recreation 2%. Standard reviews: Beauty 23%, Health 17%, Delicacies 16%, Home and interior 8%, Construction and renovation 7%, Home appliances 7%, Hobbies and pets 5%, Automotive 4%, For children 4%, Garden 2%, Sports and recreation 2%, Office and business 1%, Other 1%, Phones and accessories 1%.]
Figure 2: Charts illustrating the distribution of product categories across various types of reviews.
3.2 Dataset Statistics
The statistics of the DQ dataset are presented in Table 1. The dataset consists of 1,957 records, of which 540 are labeled as dual quality, 281 as other problems, and the rest are standard opinions. Of the dual quality reviews, 107⁶ were from the Internet, 265 from the CENEO/WIZAZ collection, and 168 from our demo system. The dataset is unbalanced, with over half of the reviews belonging to the standard class. This characteristic was intentionally maintained because, in the real world, reviews on dual quality and other problems occur less frequently than others. For experimental purposes, the dataset was divided into three subsets: train, test, and valid, containing 1,200 (61%), 500 (26%), and 257 (13%) reviews, respectively. The review texts in the dataset consist of 261 characters and 41 words on average.
label           all    train  test  valid
dual quality    540    331    138   71
other problems  281    172    72    37
standard        1136   697    290   149
total           1957   1200   500   257
Table 1: DQ dataset statistics (number of reviews per split).
In addition, in Figure 2 we present pie charts depicting the distribution of product categories across various types of reviews⁷. A few interesting patterns in these distributions are worth describing. For instance, although Beauty, Delicacies, Health, and Home & Interior are large categories overall, Home & Interior has an exceptionally high share among dual quality reviews (25%, compared to 13% overall), suggesting that this type of issue might be more commonly perceived in products related to household items. Similarly, For children makes up only 7% of all reviews but appears more prominently (15%) in dual quality reviews. Meanwhile, Beauty reviews account for nearly half (47%) of the 'other problems' category, indicating that consumers in that segment may encounter a broader range of product issues beyond dual quality concerns.

⁶ In the final dataset verification, of the 117 dual quality reviews initially found, 10 were reclassified as standard.
⁷ The category distribution over all product reviews is shown in Figure 6.
4 Experiments
4.1 Experimental Setup
The problem was defined as a three-class classification task (see Table 1). Evaluation of the various methods was performed on the test set. The training set and the validation set were used for approaches that required training or fine-tuning. Each experiment was repeated five times⁸, setting a different seed value (if applicable), and the results presented in the tables are average values.

⁸ This rule was not applied to Baseline, which is deterministic; successive runs always produce the same result.
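For reference, the per-class and macro-averaged figures reported in Table 2 can be computed with scikit-learn as sketched below; y_true and y_pred are illustrative placeholders for the gold and predicted labels of the test set.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

LABELS = ["standard", "dual quality", "other problems"]

def evaluate(y_true: list[str], y_pred: list[str]) -> dict:
    # Metrics restricted to the dual quality class (Precision, Recall, F1 columns in Table 2).
    p_dq, r_dq, f1_dq, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=["dual quality"], average="micro", zero_division=0
    )
    # Metrics over all three classes (Accuracy, mPrecision, mRecall, mF1 columns in Table 2).
    p_m, r_m, f1_m, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=LABELS, average="macro", zero_division=0
    )
    return {
        "dq_precision": p_dq, "dq_recall": r_dq, "dq_f1": f1_dq,
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_precision": p_m, "macro_recall": r_m, "macro_f1": f1_m,
    }
```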
4.2 Methods
Baseline is a naive method that assigns the dual quality class to a review if the text contains references to another country.
SetFit + sentence transformers is an approach in which a sentence transformer model is first fine-tuned using contrastive learning and then used as a text embedder for a logistic regression classifier. In the experiments, we used sentence transformers previously tested on the PL-MTEB benchmark by Poświata et al. (2024). We selected seven multilingual models, namely: LaBSE (Feng et al., 2022), paraphrase-multilingual-mpnet-base-v2, paraphrase-multilingual-MiniLM-L12-v2 (Reimers and Gurevych, 2019), three e5 models (Wang et al., 2024), and mGTE (Zhang et al., 2024). Additionally, we chose four sentence-transformer models dedicated to the Polish language: st-polish-paraphrase-from-mpnet, st-polish-paraphrase-from-distilroberta (Dadas et al., 2024b), and two mmlw models (Dadas et al., 2024a).

Method | Dual Quality class: Precision, Recall, F1 | All classes: Accuracy, mPrecision, mRecall, mF1
Baseline 42.4±0.0 84.8±0.0 56.5±0.0 55.2±0.0 37.8±0.0 46.5±0.0 39.5±0.0
SetFit + sentence transformers
LaBSE 74.4±1.0 71.4±2.2 72.9±1.1 77.7±0.5 75.6±0.8 65.9±0.9 68.4±0.7
para-multi-mpnet-base-v2 72.8±1.7 66.4±2.4 69.4±2.0 75.9±1.4 72.4±2.2 66.8±2.5 68.8±2.6
para-multi-MiniLM-L12-v2 69.4±2.2 58.7±3.3 63.6±2.7 71.2±1.2 65.8±1.3 58.2±1.7 60.2±1.7
multi-e5-small 68.7±1.6 68.0±1.3 68.3±0.8 72.8±0.7 70.4±0.8 58.9±0.9 60.3±1.3
multi-e5-base 72.2±1.2 79.0±2.5 75.4±0.8 77.4±1.0 73.7±2.1 67.6±1.8 69.0±1.9
multi-e5-large 77.5±1.8 76.8±3.4 77.1±2.4 79.6±1.8 75.2±2.8 71.2±2.2 72.7±2.2
gte-multi-base 73.4±1.1 79.0±3.4 76.1±2.2 78.6±0.8 74.3±1.1 69.4±2.0 70.8±1.7
st-polish-para-mpnet 72.5±2.0 71.7±3.3 72.1±2.6 76.6±1.1 72.2±1.3 68.1±2.1 69.6±1.8
st-polish-para-distilroberta 72.7±2.7 69.1±2.7 70.9±2.6 75.7±0.7 70.5±0.3 68.1±1.6 69.1±1.1
mmlw-roberta-base 77.9±0.8 73.6±1.6 75.7±0.5 78.6±0.6 73.4±1.1 71.9±1.0 72.6±1.0
mmlw-roberta-large 76.0±1.9 75.9±2.4 75.9±2.0 78.7±1.4 72.7±1.8 72.1±1.7 72.4±1.7
Transformer-based encoders
mBERT 64.8±2.7 67.5±2.0 66.1±1.6 71.1±1.9 62.5±9.4 58.3±3.5 58.6±5.5
xlm-roberta-base 60.7±1.5 82.2±3.6 69.8±1.1 73.1±0.8 70.6±1.1 63.0±2.3 62.8±2.5
xlm-roberta-large 78.3±3.0 86.1±2.0 82.0±1.5 82.0±1.2 75.8±1.7 76.4±1.6 75.9±1.6
herbert-base-cased 64.0±3.9 77.8±3.3 70.1±1.6 73.3±0.2 77.3±3.3 59.9±2.3 59.4±3.4
herbert-large-cased 81.5±2.5 80.7±2.0 81.1±1.5 82.4±1.1 77.6±1.4 76.2±2.7 76.7±2.1
polish-roberta-base-v2 66.4±3.0 86.5±3.9 75.1±2.1 75.4±1.5 69.7±2.3 67.2±1.9 66.9±2.0
polish-roberta-large-v2 84.6±3.6 77.5±6.0 80.7±2.9 81.7±1.2 78.5±0.7 74.3±3.7 75.8±2.5
LLMs
deepseek-v3 zero-shot 48.1±0.3 90.6±1.2 62.9±0.6 49.5±0.4 49.6±0.2 47.9±0.4 42.7±0.5
deepseek-v3 few-shot 61.9±0.3 96.1±0.3 75.3±0.1 59.0±0.2 61.1±0.4 63.7±0.4 55.9±0.3
deepseek-v3 zero-shot+inst. 84.7±1.3 80.6±0.7 82.6±0.6 70.7±0.4 70.4±0.6 74.8±0.5 68.7±0.4
deepseek-v3 few-shot+inst. 79.7±0.9 82.0±0.8 80.9±0.9 68.4±0.8 70.1±0.6 76.4±0.8 67.4±0.8
gpt-4o zero-shot 42.8±0.2 100.0±0.0 60.0±0.2 47.6±0.3 49.8±0.2 46.8±0.3 38.8±0.3
gpt-4o few-shot 60.3±0.2 98.8±0.3 74.9±0.3 57.5±0.2 62.1±0.1 66.5±0.3 55.5±0.3
gpt-4o zero-shot+inst. 85.7±0.4 76.7±0.8 80.9±0.6 75.0±0.2 73.4±0.2 79.0±0.3 72.5±0.2
gpt-4o few-shot+inst. 86.0±1.9 75.1±0.7 80.1±0.6 68.5±0.3 72.3±0.5 76.5±0.2 67.7±0.3
Table 2: Average scores with standard deviation for all evaluated methods. The Precision, Recall, and F1 metrics were calculated considering only the dual quality class; the other metrics were computed over all classes, with 'm' denoting the macro average. Bold values indicate the highest scores for each type of method, and blue highlights the highest scores for each metric.
Transformer-based encoders involve training a pre-trained language model with a classification head on top (a linear layer over the pooled output). We included evaluations of multilingual BERT (mBERT) (Devlin et al., 2019), multilingual XLM-RoBERTa (Conneau et al., 2020), and models specifically trained for Polish, such as HerBERT (Mroczkowski et al., 2021) and Polish RoBERTa (Dadas et al., 2020).
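As an illustration of this setup, the sketch below fine-tunes a Polish RoBERTa checkpoint with a three-class classification head using the Hugging Face Trainer. The hyperparameters follow Appendix C; the model id and the two toy training examples are placeholders rather than the actual DQ data.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "sdadas/polish-roberta-large-v2"  # assumed Hugging Face model id
LABELS = ["standard", "dual quality", "other problems"]

# Toy in-memory examples standing in for the DQ training subset.
train = Dataset.from_dict({
    "text": ["Produkt kupiony w Niemczech jest lepszy niż wersja kupiona w Polsce.",
             "Zwykła, dobra kawa."],
    "label": [LABELS.index("dual quality"), LABELS.index("standard")],
})

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS)  # adds a linear classification head on the pooled output
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train = train.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="dq-encoder",
    learning_rate=2e-6,              # hyperparameters as listed in Appendix C
    per_device_train_batch_size=8,
    num_train_epochs=10,
)
Trainer(model=model, args=args, train_dataset=train, tokenizer=tokenizer).train()
```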
LLMs. Advanced frontier models such as DeepSeek (DeepSeek-AI et al., 2025, 2024) and GPT-4o (OpenAI et al., 2024) were selected to evaluate how effectively cutting-edge LLMs handle the dual quality review detection task under different prompting scenarios, including zero-shot and few-shot configurations, both with and without additional instructions (see more details about the prompts used in Table 9).
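A minimal sketch of the zero-shot-with-instruction setup, assuming the OpenAI Python client; the instruction shown is a shortened illustration, not the exact prompt from Table 9.

```python
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

# Simplified instruction; the real prompts with full class definitions are given in Table 9.
SYSTEM_PROMPT = (
    "Classify a consumer product review into one of three classes:\n"
    "- dual quality: the reviewer compares the same product bought in two countries "
    "and reports a difference in quality, composition, or performance;\n"
    "- other problems: counterfeit suspicion, deterioration over time, wrong item received, etc.;\n"
    "- standard: an ordinary review.\n"
    "Answer with the class name only."
)

def classify(review: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=0.1,  # 0.0 was used for the robustness runs (Appendix C)
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": review},
        ],
    )
    return response.choices[0].message.content.strip()
```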
4.3 Main Results
The experimental results in Table 2 clearly indicate notable differences among the three groups of tested models. Sentence-transformer models using SetFit generally achieved moderate precision scores (around 70-77%), suggesting that compressing sentence semantics into a single vector might result in information loss or inadequate semantic representation. Transformer-based encoders, particularly the larger, language-specific models such as polish-roberta-large-v2 (84.6%) and herbert-large-cased (81.5%), exhibited significantly stronger performance, comparable even with state-of-the-art conversational large language models (LLMs). Among LLMs, instructive prompting strategies (providing clear definitions of classes without explicit examples) improved performance, with the best precision results of 86% and 85.7% achieved by GPT-4o with and without examples, respectively. It should be noted that the GPT-4o model with the zero-shot instruction prompt achieved very good results for the other measures as well. Interestingly, explicit few-shot examples sometimes mislead the models and reduce detection efficiency overall. This may suggest that the chosen examples are not representative and therefore not helpful.
Figure 3: Confusion matrices (actual vs. predicted labels over the standard, dual quality, and other problems classes) aggregated from five experiments for selected models: GPT-4o zero-shot+inst., polish-roberta-large-v2, and herbert-large-cased.
4.4 Error Analysis
We conducted a detailed error analysis for selected models using classification confusion matrices visualized as heat maps. Specifically, we selected three representative models: GPT-4o (zero-shot+inst.), polish-roberta-large-v2, and herbert-large-cased. Figure 3 shows that the GPT-4o model exhibits substantial confusion between standard and 'other problems' reviews, while errors between standard and dual quality are less frequent. The polish-roberta-large-v2 model reliably identifies standard reviews, achieving high accuracy for this category, but often misclassifies dual quality opinions as standard. The herbert-large-cased model often recognizes dual quality reviews, achieving a high detection rate but also producing the most false positives for this class. Additional comparative analyses are presented in Figure 7 and Figure 8.
4.5 Robustness
As an additional experiment, we verified the robustness of selected models, i.e., whether a slight change in the text that does not significantly affect its meaning can change the model's decision. We generated five additional test sets resulting from modifications of the original test set; the modifications are described in Table 3. We tested three selected models, and the results are shown in Table 4. The percentage of prediction differences was between 2.6 and 5.0. Larger text modifications, such as pl_chars, influenced the decision more often.
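The modifications listed in Table 3 below are simple string operations; the following sketch shows how such perturbed copies of a review can be generated (the function names are ours, and the character list mirrors Table 3).

```python
# Maps the Polish characters listed in Table 3 to their Latin counterparts.
PL_MAP = str.maketrans("ąęćłńóżź", "aeclnozz")

def period(text: str) -> str:
    """Remove a trailing period if present, otherwise add one."""
    return text[:-1] if text.endswith(".") else text + "."

def first_letter(text: str) -> str:
    """Flip the capitalization of the first character of the review."""
    if not text:
        return text
    head = text[0].lower() if text[0].isupper() else text[0].upper()
    return head + text[1:]

def lower(text: str) -> str:
    """Lowercase the whole review."""
    return text.lower()

def pl_chars(text: str) -> str:
    """Replace every listed Polish character with its Latin counterpart."""
    return text.translate(PL_MAP)

def pl_chars_once(text: str) -> str:
    """Same as pl_chars, but each character is replaced at most once."""
    for pl, latin in zip("ąęćłńóżź", "aeclnozz"):
        text = text.replace(pl, latin, 1)
    return text
```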
4.6 Multilingual Transfer
To verify generalizability across markets and languages, we also explored the multilingual transfer capabilities of our solution.
period: Remove (if present) or add (if absent) a period at the end of the review.
first_letter: Change the capitalization of the first letter of the first word in the review. If the first word is written in uppercase, change it to lowercase.
lower: Change the text of the review to lowercase.
pl_chars: Replace the Polish characters ą, ę, ć, ł, ń, ó, ż, ź with their corresponding Latin alphabet characters, i.e., a, e, c, l, n, o, z.
pl_chars_once: The operation is the same as pl_chars, except that each letter can be changed once.
Table 3: Descriptions of modifications applied to the test set for robustness verification.
Modification     gpt-4o    polish-roberta  herbert
period           4.0±0.0   4.2±1.0         5.0±0.9
first_letter     4.0±0.0   2.8±0.7         2.6±0.8
lower            5.0±0.0   4.6±0.5         4.2±0.7
pl_chars         5.0±0.0   4.6±1.2         4.6±0.8
pl_chars_once    4.0±0.0   4.0±1.4         3.6±0.8
Table 4: Robustness verification results for GPT-4o (zero-shot+inst.), polish-roberta-large-v2, and herbert-large-cased. The values are the average and standard deviation of the model's decision disagreement (in %) between the original and modified reviews. To ensure consistent behavior of the GPT-4o model, we set the temperature to 0.0, resulting in a standard deviation of 0.0 across runs.
For this purpose, we created a multilingual subset of reviews in English, German, and French (200,000 reviews for each language) selected from the AMAZON dataset (Keung et al., 2020) and our demo system. Next, we trained SetFit with paraphrase-multilingual-mpnet-base-v2⁹ on the DQ dataset and applied it to these reviews. Then we selected the 500 AMAZON reviews and 200 demo-system reviews with the highest dual quality scores. Manual verification showed that most were actually standard, so we randomly limited standard reviews to 130, yielding 206 final examples (58 dual quality, 18 other problems, 130 standard). The dataset thus prepared was used as a multilingual test set. We conducted an experiment in which we tested methods based on multilingual models trained as in Section 4.1 on the Polish training subset or, in the case of LLMs, using the same prompts. The results for the selected models are presented in Table 5. In terms of classifier precision, the highest score was achieved by the DeepSeek-V3 model (91.9%); interestingly, in this case adding examples to the instructions in the prompt gave a higher score. Of the group of transformer-based encoders, the highest score was achieved by xlm-roberta-large (84.8%). Although the difference in precision is significant, it is important to note the low recall values for LLMs compared to the encoders. All results for this experiment are available in Table 11.

⁹ One of the top multilingual sentence transformers at that time (2023).

Method | Dual Quality class: Precision, Recall, F1 | All classes: Accuracy, mPrecision, mRecall, mF1
Transformer-based encoders
xlm-roberta-base 69.5±2.3 66.9±6.8 67.9±2.9 73.0±1.0 55.5±1.1 55.1±2.1 55.0±1.7
xlm-roberta-large 84.8±3.8 63.1±4.8 72.3±4.0 72.6±2.7 60.1±2.7 56.7±3.9 57.5±3.3
LLMs
deepseek-v3 zero-shot+inst. 85.9±1.8 52.3±0.8 65.0±0.3 49.5±0.7 63.4±1.3 58.7±1.0 49.1±0.7
deepseek-v3 few-shot+inst. 91.9±4.8 50.6±0.8 65.2±1.8 44.3±0.9 65.6±2.2 56.2±1.2 46.1±1.0
gpt-4o zero-shot+inst. 85.3±1.3 46.6±0.0 60.2±0.3 52.6±0.6 62.3±0.3 57.1±0.3 49.6±0.3
gpt-4o few-shot+inst. 80.2±1.1 46.6±0.0 58.9±0.3 41.6±0.6 61.4±0.5 50.2±1.0 42.7±0.5
Table 5: Evaluation results for selected models on a multilingual dataset.
5 Deployment and Practical Considerations
During the evaluation, a key objective was to achieve high precision, thereby minimizing the number of false positive recommendations. Since each flagged instance undergoes final verification by a human analyst, the primary goal is to reduce the analyst's workload by minimizing the number of irrelevant alerts. This approach accepts the possibility of missing some true dual quality cases (i.e., allowing for a certain level of false negatives) in favor of ensuring that the identified cases are highly likely to be accurate. A product with several dual quality reviews will be selected for further analysis to verify whether this issue genuinely exists in its case.
The proposed solution is implemented as a standalone service within a local infrastructure and is exclusively dedicated to employees of UOKiK (Poland's Office of Competition and Consumer Protection). The system is currently not accessible to the public or external users. Although the system can analyze multilingual content, the current deployment prioritizes support for the Polish language to align with the context of Polish consumers and UOKiK's mandate within the Polish market.
Given the results of the evaluation and the above assumptions, we would recommend using the polish-roberta-large-v2 model for a production deployment. Selecting a locally deployable model is a pragmatic and efficient choice, particularly when minimizing external dependencies and ensuring consistent, low-latency inference. It should be noted that this language-specific component is modular; for deployment within other European consumer protection agencies analogous to UOKiK, the model could be readily substituted with an equivalent model fine-tuned for the respective national language (e.g., a German BERT for a German institution) or a multilingual model like XLM-RoBERTa.
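For illustration, such a locally fine-tuned encoder can be served with a few lines of code; the sketch below assumes a checkpoint saved under ./dq-encoder (e.g., by the fine-tuning sketch in Section 4.2) and uses two example reviews from Table 7. It is not the production service itself.

```python
from transformers import pipeline

# Load the locally fine-tuned three-class classifier (the path is an assumption).
classifier = pipeline("text-classification", model="./dq-encoder")

reviews = [
    "Jedna z moich ulubionych kaw, zwłaszcza ta w wersji z Niemiec. "
    "O wiele bardziej aromatyczna niż proponowana na rynek Polski.",
    "Maska spełnia swoje zadanie. Rewelacyjnie pachnie.",
]

for review, prediction in zip(reviews, classifier(reviews, truncation=True)):
    # In the deployed workflow, only confident dual quality predictions are flagged for the analyst.
    print(f"{prediction['label']} ({prediction['score']:.3f}): {review[:60]}...")
```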
6 Conclusion
In this work, we presented the entire process of preparing a solution for detecting the problem of dual quality based on product reviews. Our three key findings are as follows. First, mentions of dual quality in product reviews are rare, in our case appearing only a few hundred times. Second, smaller language-specific transformer-based encoders fine-tuned for the task perform comparably to larger LLMs. Finally, including examples in prompts for LLMs can degrade performance compared to using only task-specific instructions.
Acknowledgments
Project co-financed/financed by the National Centre for Research and Development (https://www.gov.pl/web/ncbr-en) under the programme Infostrateg III.
References
Noa Avigdor, Guy Horowitz, Ariel Raviv, and Stav
Yanovsky Daye. 2023. Consistent text categorization
using data augmentation in e-commerce. In Proceed-
ings of the 61st Annual Meeting of the Association
for Computational Linguistics (Volume 5: Industry
Track), pages 313–321, Toronto, Canada. Association
for Computational Linguistics.
Lucia Bartkova and Mária Sirotiaková. 2021. Dual quality and its influence on consumer behaviour according to the income. SHS Web of Conferences, 92.
Lucia Bartkova and Lenka Veselovska. 2023. Does
dual quality of products in the european union truly
bother consumers? Marketing and Management of
Innovations, 14.
Lucia Bartkova, Lenka Veselovska, Marianna Sramkova,
and Jan Zavadsky. 2021. Dual quality of products:
myths and facts through the opinions of millennial
consumers.Marketing and Management of Innova-
tions.
L. Bartková and L. Veselovská. 2024. Consumer be-
haviour under dual quality of products: Does testing
reveal what consumers experience? IIMB Manage-
ment Review, 36:171–184.
Lucia Bartková. 2019. How do consumers perceive the
dual quality of goods and its economic aspects in the
european union? an empirical study.Problems and
Perspectives in Management, 17.
Lucia Bartková, Lenka Veselovská, and Katarína Zimer-
manová. 2018. Possible solutions to dual quality of
products in the european union. Scientific Papers
of the University of Pardubice, Series D: Faculty of
Economics and Administration, 26.
I. Botunac, M. Brkić Bakarić, and M. Matetić. 2024. Comparing fine-tuning and prompt engineering for multi-class classification in hospitality review analysis. Applied Sciences (Switzerland), 14.
Chambers. Dual Quality of Food Products.
https://chambers.com/legal-trends/dual-
quality-of-food-products
. [Online; accessed
06-March-2025].
European Commission. 2018. Dual quality
of food: European Commission releases
common testing methodology.
https:
//ec.europa.eu/commission/presscorner/
detail/en/ip_18_4122
. [Online; accessed
06-March-2025].
Alexis Conneau, Kartikay Khandelwal, Naman Goyal,
Vishrav Chaudhary, Guillaume Wenzek, Francisco
Guzmán, Edouard Grave, Myle Ott, Luke Zettle-
moyer, and Veselin Stoyanov. 2020. Unsupervised
cross-lingual representation learning at scale. In Pro-
ceedings of the 58th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 8440–
8451, Online. Association for Computational Lin-
guistics.
Sławomir Dadas, Michał Perełkiewicz, and Rafał Poświata. 2020. Pre-training Polish transformer-based language models at scale. In Artificial Intelligence and Soft Computing, pages 301–314. Springer International Publishing.
Sławomir Dadas, Michał Perełkiewicz, and Rafał Poświata. 2024a. PIRB: A comprehensive benchmark of Polish dense and hybrid text retrieval methods. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 12761–12774, Torino, Italia. ELRA and ICCL.
Sławomir Dadas, Marek Kozłowski, Rafał Poświata, Michał Perełkiewicz, Marcin Białas, and Małgorzata Grębowiec. 2024b. A support system for the detection of abusive clauses in B2C contracts. Artificial Intelligence and Law.
DeepSeek-AI, :, Xiao Bi, Deli Chen, Guanting
Chen, Shanhuang Chen, Damai Dai, Chengqi Deng,
Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu,
Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge,
Kang Guan, Daya Guo, Jianzhong Guo, Guangbo
Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan
Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li,
Y. K. Li, Wenfeng Liang, Fangyun Lin, A. X. Liu,
Bo Liu, Wen Liu, Xiaodong Liu, Xin Liu, Yiyuan
Liu, Haoyu Lu, Shanghao Lu, Fuli Luo, Shirong Ma,
Xiaotao Nie, Tian Pei, Yishi Piao, Junjie Qiu, Hui Qu,
Tongzheng Ren, Zehui Ren, Chong Ruan, Zhangli
Sha, Zhihong Shao, Junxiao Song, Xuecheng Su,
Jingxiang Sun, Yaofeng Sun, Minghui Tang, Bingx-
uan Wang, Peiyi Wang, Shiyu Wang, Yaohui Wang,
Yongji Wang, Tong Wu, Y. Wu, Xin Xie, Zhenda Xie,
Ziwei Xie, Yiliang Xiong, Hanwei Xu, R. X. Xu,
Yanhong Xu, Dejian Yang, Yuxiang You, Shuiping
Yu, Xingkai Yu, B. Zhang, Haowei Zhang, Lecong
Zhang, Liyue Zhang, Mingchuan Zhang, Minghua
Zhang, Wentao Zhang, Yichao Zhang, Chenggang
Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou,
Qihao Zhu, and Yuheng Zou. 2024. Deepseek llm:
Scaling open-source language models with longtermism. Preprint, arXiv:2401.02954.
DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingx-
uan Wang, Bochao Wu, Chengda Lu, Chenggang
Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan,
Damai Dai, Daya Guo, Dejian Yang, Deli Chen,
Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai,
Fuli Luo, Guangbo Hao, Guanting Chen, Guowei
Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng
Wang, Haowei Zhang, Honghui Ding, Huajian Xin,
Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jian Liang,
Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jiawei Wang,
Jin Chen, Jingchang Chen, Jingyang Yuan, Junjie
Qiu, Junlong Li, Junxiao Song, Kai Dong, Kai Hu,
Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean
Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao,
Litong Wang, Liyue Zhang, Meng Li, Miaojun Wang,
Mingchuan Zhang, Minghua Zhang, Minghui Tang,
Mingming Li, Ning Tian, Panpan Huang, Peiyi Wang,
Peng Zhang, Qiancheng Wang, Qihao Zhu, Qinyu
Chen, Qiushi Du, R. J. Chen, R. L. Jin, Ruiqi Ge,
Ruisong Zhang, Ruizhe Pan, Runji Wang, Runxin
Xu, Ruoyu Zhang, Ruyi Chen, S. S. Li, Shanghao
Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu,
Shengfeng Ye, Shengfeng Ye, Shirong Ma, Shiyu
Wang, Shuang Zhou, Shuiping Yu, Shunfeng Zhou,
Shuting Pan, T. Wang, Tao Yun, Tian Pei, Tianyu Sun,
W. L. Xiao, Wangding Zeng, Wanjia Zhao, Wei An,
Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu,
Wentao Zhang, X. Q. Li, Xiangyue Jin, Xianzu Wang,
Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaojin Shen,
Xiaokang Chen, Xiaokang Zhang, Xiaosha Chen,
Xiaotao Nie, Xiaowen Sun, Xiaoxiang Wang, Xin
Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingkai Yu,
Xinnan Song, Xinxia Shan, Xinyi Zhou, Xinyu Yang,
Xinyuan Li, Xuecheng Su, Xuheng Lin, Y. K. Li,
Y. Q. Wang, Y. X. Wei, Y. X. Zhu, Yang Zhang, Yan-
hong Xu, Yanhong Xu, Yanping Huang, Yao Li, Yao
Zhao, Yaofeng Sun, Yaohui Li, Yaohui Wang, Yi Yu,
Yi Zheng, Yichao Zhang, Yifan Shi, Yiliang Xiong,
Ying He, Ying Tang, Yishi Piao, Yisong Wang, Yix-
uan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo,
Yu Wu, Yuan Ou, Yuchen Zhu, Yuduan Wang, Yue
Gong, Yuheng Zou, Yujia He, Yukun Zha, Yunfan
Xiong, Yunxian Ma, Yuting Yan, Yuxiang Luo, Yuxi-
ang You, Yuxuan Liu, Yuyang Zhou, Z. F. Wu, Z. Z.
Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu,
Zhen Huang, Zhen Zhang, Zhenda Xie, Zhengyan
Zhang, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhi-
gang Yan, Zhihong Shao, Zhipeng Xu, Zhiyu Wu,
Zhongyu Zhang, Zhuoshu Li, Zihui Gu, Zijia Zhu,
Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Ziyi
Gao, and Zizheng Pan. 2025. Deepseek-v3 technical
report.Preprint, arXiv:2412.19437.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training of
deep bidirectional transformers for language under-
standing. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies, Volume 1 (Long and Short Papers), pages
4171–4186, Minneapolis, Minnesota. Association for
Computational Linguistics.
EU Monitor. The better enforcement and modernisation
of Union consumer protection rules.
https://www.
eumonitor.eu/9353000/1/j4nvhdfcs8bljza_
j9vvik7m1c3gyxp/vme85bbfssxo
. [Online;
accessed 06-March-2025].
Joint Research Centre European Commission.
2023. Same pack, different ingredients:
Is dual quality down-branded in EU food?
https://joint-research-centre.ec.europa.
eu/jrc-news-and-updates/same-pack-
different-ingredients-dual-quality-down-
branded-eu-food-2023-07-24_en
. [Online;
accessed 06-March-2025].
European Parliamentary Research Service (EPRS)
European Parliament. 2017. European Com-
mission guidelines on dual quality of branded
food products.
https://www.europarl.europa.
eu/RegData/etudes/BRIE/2017/608804/EPRS_
BRI%282017%29608804_EN.pdf
. [PDF; accessed
06-March-2025].
European Parliamentary Research Service (EPRS)
European Parliament. 2019. Dual quality of products
State of play.
https://www.europarl.europa.
eu/RegData/etudes/BRIE/2019/644192/EPRS_
BRI(2019)644192_EN.pdf
. [Online; accessed
06-March-2025].
Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Ari-
vazhagan, and Wei Wang. 2022. Language-agnostic
BERT sentence embedding. In Proceedings of the
60th Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers), pages
878–891, Dublin, Ireland. Association for Computa-
tional Linguistics.
Gilad Fuchs, Ido Ben-shaul, and Matan Mandelbrod.
2022. Is it out yet? automatic future product releases
extraction from web data. In Proceedings of the 2022
Conference on Empirical Methods in Natural Lan-
guage Processing: Industry Track, pages 263–271,
Abu Dhabi, UAE. Association for Computational
Linguistics.
Shansan Gong, Zelin Zhou, Shuo Wang, Fengjiao Chen,
Xiujie Song, Xuezhi Cao, Yunsen Xian, and Kenny
Zhu. 2023. Transferable and efficient: Unifying
dynamic multi-domain product categorization. In
Proceedings of the 61st Annual Meeting of the As-
sociation for Computational Linguistics (Volume 5:
Industry Track), pages 476–486, Toronto, Canada.
Association for Computational Linguistics.
Beizhe Hu, Qiang Sheng, Juan Cao, Yongchun Zhu,
Danding Wang, Zhengjia Wang, and Zhiwei Jin.
2023. Learn over past, evolve for future: Forecasting
temporal trends for fake news detection. In Proceed-
ings of the 61st Annual Meeting of the Association
for Computational Linguistics (Volume 5: Industry
Track), pages 116–125, Toronto, Canada. Association
for Computational Linguistics.
Phillip Keung, Yichao Lu, György Szarvas, and Noah A.
Smith. 2020. The multilingual Amazon reviews cor-
pus. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing
(EMNLP), pages 4563–4568, Online. Association for
Computational Linguistics.
Y. Mamani-Coaquira and E. Villanueva. 2024. A review
on text sentiment analysis with machine learning and
deep learning techniques.IEEE Access, 12:193115–
193130.
Robert Mroczkowski, Piotr Rybak, Alina Wróblewska,
and Ireneusz Gawlik. 2021. HerBERT: Efficiently
pretrained transformer-based language model for Pol-
ish. In Proceedings of the 8th Workshop on Balto-
Slavic Natural Language Processing, pages 1–10,
Kiyv, Ukraine. Association for Computational Lin-
guistics.
Ravindra Nayak and Nikesh Garera. 2022. Deploying
unified BERT moderation model for E-commerce
reviews. In Proceedings of the 2022 Conference on
Empirical Methods in Natural Language Processing:
Industry Track, pages 540–547, Abu Dhabi, UAE.
Association for Computational Linguistics.
Stephen Obadinma, Faiza Khan Khattak, Shirley
Wang, Tania Sidhorn, Elaine Lau, Sean Robert-
son, Jingcheng Niu, Winnie Au, Alif Munim, and
Karthik Raja Kalaiselvi Bhaskar. 2022. Bringing
the state-of-the-art to customers: A neural agent as-
sistant framework for customer service support. In
Proceedings of the 2022 Conference on Empirical
Methods in Natural Language Processing: Industry
Track, pages 440–450, Abu Dhabi, UAE. Association
for Computational Linguistics.
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal,
Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale-
man, Diogo Almeida, Janko Altenschmidt, Sam Alt-
man, Shyamal Anadkat, Red Avila, Igor Babuschkin,
Suchir Balaji, Valerie Balcom, Paul Baltescu, Haim-
ing Bao, Mohammad Bavarian, Jeff Belgum, Ir-
wan Bello, Jake Berdine, Gabriel Bernadett-Shapiro,
Christopher Berner, Lenny Bogdonoff, Oleg Boiko,
Madelaine Boyd, Anna-Luisa Brakman, Greg Brock-
man, Tim Brooks, Miles Brundage, Kevin Button,
Trevor Cai, Rosie Campbell, Andrew Cann, Brittany
Carey, Chelsea Carlson, Rory Carmichael, Brooke
Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully
Chen, Ruby Chen, Jason Chen, Mark Chen, Ben
Chess, Chester Cho, Casey Chu, Hyung Won Chung,
Dave Cummings, Jeremiah Currier, Yunxing Dai,
Cory Decareaux, Thomas Degry, Noah Deutsch,
Damien Deville, Arka Dhar, David Dohan, Steve
Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti,
Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix,
Simón Posada Fishman, Juston Forte, Isabella Ful-
ford, Leo Gao, Elie Georges, Christian Gibson, Vik
Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-
Lopes, Jonathan Gordon, Morgan Grafstein, Scott
Gray, Ryan Greene, Joshua Gross, Shixiang Shane
Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris,
Yuchen He, Mike Heaton, Johannes Heidecke, Chris
Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele,
Brandon Houghton, Kenny Hsu, Shengli Hu, Xin
Hu, Joost Huizinga, Shantanu Jain, Shawn Jain,
Joanne Jang, Angela Jiang, Roger Jiang, Haozhun
Jin, Denny Jin, Shino Jomoto, Billie Jonn, Hee-
woo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Ka-
mali, Ingmar Kanitscheider, Nitish Shirish Keskar,
Tabarak Khan, Logan Kilpatrick, Jong Wook Kim,
Christina Kim, Yongjik Kim, Jan Hendrik Kirch-
ner, Jamie Kiros, Matt Knight, Daniel Kokotajlo,
Łukasz Kondraciuk, Andrew Kondrich, Aris Kon-
stantinidis, Kyle Kosic, Gretchen Krueger, Vishal
Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan
Leike, Jade Leung, Daniel Levy, Chak Ming Li,
Rachel Lim, Molly Lin, Stephanie Lin, Mateusz
Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue,
Anna Makanju, Kim Malfacini, Sam Manning, Todor
Markov, Yaniv Markovski, Bianca Martin, Katie
Mayer, Andrew Mayne, Bob McGrew, Scott Mayer
McKinney, Christine McLeavey, Paul McMillan,
Jake McNeil, David Medina, Aalok Mehta, Jacob
Menick, Luke Metz, Andrey Mishchenko, Pamela
Mishkin, Vinnie Monaco, Evan Morikawa, Daniel
Mossing, Tong Mu, Mira Murati, Oleg Murk, David
Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak,
Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh,
Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex
Paino, Joe Palermo, Ashley Pantuliano, Giambat-
tista Parascandolo, Joel Parish, Emy Parparita, Alex
Passos, Mikhail Pavlov, Andrew Peng, Adam Perel-
man, Filipe de Avila Belbute Peres, Michael Petrov,
Henrique Ponde de Oliveira Pinto, Michael, Poko-
rny, Michelle Pokrass, Vitchyr H. Pong, Tolly Pow-
ell, Alethea Power, Boris Power, Elizabeth Proehl,
Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh,
Cameron Raymond, Francis Real, Kendra Rimbach,
Carl Ross, Bob Rotsted, Henri Roussez, Nick Ry-
der, Mario Saltarelli, Ted Sanders, Shibani Santurkar,
Girish Sastry, Heather Schmidt, David Schnurr, John
Schulman, Daniel Selsam, Kyla Sheppard, Toki
Sherbakov, Jessica Shieh, Sarah Shoker, Pranav
Shyam, Szymon Sidor, Eric Sigler, Maddie Simens,
Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin
Sokolowsky, Yang Song, Natalie Staudacher, Fe-
lipe Petroski Such, Natalie Summers, Ilya Sutskever,
Jie Tang, Nikolas Tezak, Madeleine B. Thompson,
Phil Tillet, Amin Tootoonchian, Elizabeth Tseng,
Preston Tuggle, Nick Turley, Jerry Tworek, Juan Fe-
lipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya,
Chelsea Voss, Carroll Wainwright, Justin Jay Wang,
Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei,
CJ Weinmann, Akila Welihinda, Peter Welinder, Ji-
ayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner,
Clemens Winter, Samuel Wolrich, Hannah Wong,
Lauren Workman, Sherwin Wu, Jeff Wu, Michael
Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qim-
ing Yuan, Wojciech Zaremba, Rowan Zellers, Chong
Zhang, Marvin Zhang, Shengjia Zhao, Tianhao
Zheng, Juntang Zhuang, William Zhuk, and Bar-
ret Zoph. 2024. Gpt-4 technical report.Preprint,
arXiv:2303.08774.
Soham Parikh, Mitul Tiwari, Prashil Tumbade, and
Quaizar Vohra. 2023. Exploring zero and few-shot
techniques for intent classification. In Proceedings of
the 61st Annual Meeting of the Association for Com-
putational Linguistics (Volume 5: Industry Track),
pages 744–751, Toronto, Canada. Association for
Computational Linguistics.
Rafał Poświata, Sławomir Dadas, and Michał Perełkiewicz. 2024. PL-MTEB: Polish Massive Text Embedding Benchmark. Preprint, arXiv:2405.10138.
Nils Reimers and Iryna Gurevych. 2019. Sentence-bert:
Sentence embeddings using siamese bert-networks.
In Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing. Associa-
tion for Computational Linguistics.
Safe Food Advocacy Europe (SAFE). Dual Food Qual-
ity Project.
https://www.safefoodadvocacy.eu/
projects/dual-food-quality-project/
. [On-
line; accessed 06-March-2025].
J. Satjathanakul and T. Siriborvornratanakul. 2024. Sen-
timent analysis in product reviews in thai language.
International Journal of Information Technology (Sin-
gapore).
Xiaoyu Shen, Akari Asai, Bill Byrne, and Adria De Gis-
pert. 2023. xPQA: Cross-lingual product question
answering in 12 languages. In Proceedings of the
61st Annual Meeting of the Association for Computa-
tional Linguistics (Volume 5: Industry Track), pages
103–115, Toronto, Canada. Association for Compu-
tational Linguistics.
Hasan Tercan and Tobias Meisen. 2022. Machine learn-
ing and deep learning based predictive quality in
manufacturing: a systematic review.
The European Consumer Organisation (BEUC).
2018. Dual product quality across Eu-
rope: state-of-play and the way forward.
https://www.beuc.eu/sites/default/
files/publications/beuc-x-2018-031_beuc_
position_paper_on_dual_quality.pdf
. [On-
line; accessed 06-March-2025].
Lewis Tunstall, Nils Reimers, Unso Eun Seo Jo, Luke
Bates, Daniel Korat, Moshe Wasserblat, and Oren
Pereg. 2022. Efficient few-shot learning without
prompts.arXiv preprint.
Lenka Veselovská. 2022. Dual quality of products in eu-
rope: a serious problem or a marketing opportunity?
Total Quality Management and Business Excellence,
33.
Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang,
Rangan Majumder, and Furu Wei. 2024. Multilin-
gual e5 text embeddings: A technical report. arXiv
preprint arXiv:2402.05672.
Tianqi Wang, Lei Chen, Xiaodan Zhu, Younghun Lee,
and Jing Gao. 2023. Weighted contrastive learning
with false negative control to help long-tailed prod-
uct classification. In Proceedings of the 61st Annual
Meeting of the Association for Computational Lin-
guistics (Volume 5: Industry Track), pages 574–580,
Toronto, Canada. Association for Computational Lin-
guistics.
Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie,
Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang,
Pengjun Xie, Fei Huang, et al. 2024. mgte: General-
ized long-context text representation and reranking
models for multilingual text retrieval. In Proceed-
ings of the 2024 Conference on Empirical Methods in
Natural Language Processing: Industry Track, pages
1393–1412.
Ján Závadský and Vladimír Hiadlovský. 2020. Eco-
nomic problems of dual quality of everyday con-
sumer goods.Economic Annals-XXI, 185.
A Dual Quality Regulations
The regulatory response to dual quality has evolved significantly within the European Union. The European Commission's 2017 guidelines clarified that while product differentiation is not inherently illegal, misleading consumers violates EU consumer protection laws (European Parliament, 2017, 2019). The Commission's Joint Research Centre (JRC) introduced a harmonized testing methodology to systematically assess product composition variations (Commission, 2018; European Commission, 2023). Additionally, the Omnibus Directive amended Directive 2005/29/EC, classifying dual quality marketing as misleading when substantial differences exist without a legitimate justification (Chambers). These measures aim to enhance market transparency and prevent unfair commercial practices. However, challenges remain in enforcement and uniform interpretation across Member States (EU Monitor). Recent research shows that while the prevalence of dual quality food products declined from 31% in 2018 to 24% in 2021, concerns persist regarding non-food items, as similar discrepancies have been identified in household and personal care products (European Commission, 2023).
Furthermore, consumer advocacy organizations such as BEUC argue that enforcement mechanisms must be strengthened to ensure compliance across all product categories (The European Consumer Organisation (BEUC), 2018). The SAFE initiative also supports enhanced consumer education and reporting mechanisms to empower individuals to identify and challenge dual quality practices (Safe Food Advocacy Europe (SAFE)). These ongoing legal and regulatory efforts underscore the EU's commitment to fair competition and consumer protection, yet continued vigilance and adaptation of enforcement strategies remain necessary.
B DQ Dataset Details
B.1 Annotation Process Details
We established a structured data labelling policy to annotate the data, i.e., to assign each opinion or review to its appropriate category. This policy provides clear classification criteria for opinions categorized as dual quality, other problems, or standard (see Table 6 for detailed definitions). The annotation process followed predefined guidelines to ensure consistency and reliability, and, where necessary, ambiguous cases were resolved through the annotators' review.
Examples of labeled reviews from the DQ database, annotated according to the established data annotation protocol and accompanied by annotator comments, are presented in Table 7.
Label: dual quality
Description: The review contains information about the fact that the customer bought the same product in two countries and noticed a difference in quality, performance, composition, etc. It is not necessary to give the exact names of the countries; phrases such as "abroad" or "in our country" are sufficient. The customer is comparing two same products or groups of products. Indicating a difference in price or availability, or using a general statement such as "there are differences between products purchased in France and Poland", is NOT classified as dual quality, but as a standard review.

Label: other problems
Description: The review does not identify the problem of dual quality, but provides information about other problems, among which we can distinguish:
- differences in products due to a different place of purchase (same market), place of packaging, or batch received,
- problems with the product itself that require deeper analysis, e.g., deterioration over time,
- practices that are illegal and/or violate customer rights, e.g., the product is probably counterfeit, suspected fraud, misleading the customer, no instructions in the required language, no expiration date, etc.

Label: standard
Description: A standard product review in which the comments described are about the product itself and do not indicate problems addressed by the labels "dual quality" or "other problems".

Table 6: Annotation Guidelines.
B.2 Other Problems Identified in Products or Services
When labeling the data, annotators identified opinions explicitly reflecting dual quality issues and comments pointing to specific problems related to services or products. These additional insights enabled deeper exploration and facilitated the creation of a comprehensive taxonomy of consumer issues. Figure 5 demonstrates that more than half of the reported problems concern probable counterfeit products, differences dependent on the place of purchase within the same market, quality deterioration over time, mismatches between received products and orders, misleading information, suspicions of fraud, and variations related to packaging, batch, or package size. Recognizing and categorizing these issues may be crucial for targeted interventions and regulatory measures to strengthen consumer trust and improve market standards beyond dual quality considerations alone.
C Experiments Details
Baseline For the baseline model, the text was
first lemmatized. Then the following phrases
were searched:
anglia, angielski, szkocja,
szkocki, irlandia, irlandzki, walia,
walijski, dania, duński, finlandia,
fiński, norwegia, norweski, szwecja,
szwedzki, szwajcaria, szwajcarski,
estonia, estoński, łotwa, łotewski,
litwa, litewski, austria, austryjacki,
belgia, belgijski, francja, francuski,
niemcy, niemiecki, włochy, włoski,
holandia, niderlandzki, holenderski,
usa, kanada, kanadyjski, meksyk,
meksykański, ukraina, ukraiński, rosja,
rosyjski, białoruś, białoruski, polska,
polski, czechy, czeski, słowacja,
słowacki, węgry, węgierski, rumunia,
rumuński, bułgaria, bułgarski, grecja,
grecki, hiszpania, hiszpański, brazylia,
brazylijski, portugalia, portugalski,
australia, australijski, nowa zelandia,
maoryjski, gruzja, gruziński, izrael,
hebrajski, egipt, arabski, turcja,
turecki, chiny, chiński, korea,
koreański, japonia, japoński, indie,
hinduski.
If one or more of the above phrases were found,
the review was classified as dual quality.
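A minimal sketch of this baseline, with an abbreviated phrase set and a placeholder lemmatizer (in the actual setup a Polish lemmatizer would be used):

```python
# Abbreviated excerpt of the country/nationality phrase list given above.
COUNTRY_PHRASES = {
    "anglia", "angielski", "niemcy", "niemiecki", "francja", "francuski",
    "polska", "polski", "czechy", "czeski", "usa", "japonia", "japoński",
    # ... remaining phrases from the list above
}

def lemmatize(text: str) -> list[str]:
    """Placeholder: in practice a Polish lemmatizer (e.g., spaCy's pl_core_news_sm)
    would map inflected forms to their base forms; here we only lowercase and split."""
    return text.lower().split()

def baseline_predict(review: str) -> str:
    lemmas = lemmatize(review)
    joined = " ".join(lemmas)
    # Multi-word phrases such as "nowa zelandia" are matched against the joined lemma string.
    if any(phrase in lemmas or (" " in phrase and phrase in joined) for phrase in COUNTRY_PHRASES):
        return "dual quality"
    return "standard"
```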
SetFit + sentence transformer During training, we used the following hyperparameters: learning rate=2e-5 (the same for sentence transformer fine-tuning and the logistic regression classifier), batch size=8, epochs=1, number of contrastive learning iterations=1. We adopted the AdamW optimizer.
Transformer-based encoders During training, we used the following hyperparameters: learning rate=2e-6, batch size=8, epochs=10. We adopted the AdamW optimizer.
LLMs The models were evaluated using APIs. For the main experiments the temperature was set to 0.1; for robustness verification, to guarantee determinism, it was reduced to 0.0. The prompts used are shown in Table 9.
[Figure 4 diagram: two pipelines. 'Polish Dataset Creation': a base dataset of 117 dual quality + 300 standard reviews (417 total), an iterative SetFit loop (steps 3-5, repeated 7 times) over the CENEO/WIZAZ reviews yielding 1,720 reviews, the addition of 237 manually verified demo-system reviews (1,957 total), and cross-validation with correction of 67 (3.4%) labels, producing the final Polish DQ dataset. 'Multilingual Test Set Creation': SetFit with a multilingual embedding trained on the DQ dataset and applied to 600k AMAZON reviews (EN, DE, FR) and multilingual demo-system reviews, selection of the 500 + 200 highest-scoring candidates, and manual verification and balancing (standard limited to 130), yielding a 206-example multilingual test set (58 dual quality, 18 other problems, 130 standard).]
Figure 4: Diagram showing the process of preparing the DQ and multilingual datasets.
Original review text | Translated review text | Label | Additional Comment
Fantastyczny zapach i produkt z chemii niemieckiej, więc o wiele bardziej intensywny niż te, produkowane na polski rynek. | Fantastic fragrance and a product of German chemistry, so much more intense than those made for the Polish market. | dual quality | -
Jedna z moich ulubionych kaw, zwłaszcza ta w wersji z Niemiec. O wiele bardziej aromatyczna niż proponowana na rynek Polski | One of my favorite coffees, especially the version from Germany. Much more aromatic than the one offered on the Polish market. | dual quality | -
poprzedni model Beko kupiony 9 lat temu był lepszy | The previous Beko model bought 9 years ago was better. | other problems | deterioration in quality over time
Tester w drogerii (w centrum handlowym) był dużo bardziej trwały i intensywniejszy niż ten kupiony przez internet. Zastanawiające. | The tester in the drugstore (at the shopping mall) was much more long-lasting and intense than the one purchased online. Intriguing. | other problems | difference depending on the place of purchase (same market)
Maska spełnia swoje zadanie. Rewelacyjnie pachnie. | The mask does its job. It smells amazing. | standard | -
soczewki produkowane poza Europą mają kiepską jakość | Lenses produced outside Europe are of poor quality. | standard | general statement
Table 7: A list of samples from DQ dataset. The original text of the review was translated into English using GPT-4o.
Figure 5: Taxonomy of different product or service issues recognized in reviews.
[Figure 6 (pie charts): distribution of product categories (Beauty, Delicacies, Health, Home and interior, For children, Home appliances, Construction and renovation, Automotive, Hobbies and pets, Garden, Sports and recreation, Office and business, Phones and accessories, Fashion, Jewelry and watches, Other) shown for four groups of reviews: all reviews, dual quality reviews, reviews about other problems, and standard reviews.]
Figure 6: Charts illustrating (1) all product reviews categorized by product type (top) and (2) the distribution of product categories across various types of reviews (bottom).
Name in Paper HF Name
LaBSE sentence-transformers/LaBSE
para-multi-mpnet-base-v2 sentence-transformers/paraphrase-multilingual-mpnet-base-v2
para-multi-MiniLM-L12-v2 sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
multi-e5-small intfloat/multilingual-e5-small
multi-e5-base intfloat/multilingual-e5-base
multi-e5-large intfloat/multilingual-e5-large
gte-multi-base Alibaba-NLP/gte-multilingual-base
st-polish-para-mpnet sdadas/st-polish-paraphrase-from-mpnet
st-polish-para-distilroberta sdadas/st-polish-paraphrase-from-distilroberta
mmlw-roberta-base sdadas/mmlw-roberta-base
mmlw-roberta-large sdadas/mmlw-roberta-large
mBERT google-bert/bert-base-multilingual-cased
xlm-roberta-base FacebookAI/xlm-roberta-base
xlm-roberta-large FacebookAI/xlm-roberta-large
herbert-base-cased allegro/herbert-base-cased
herbert-large-cased allegro/herbert-large-cased
polish-roberta-base-v2 sdadas/polish-roberta-base-v2
polish-roberta-large-v2 sdadas/polish-roberta-large-v2
deepseek-v3* deepseek-ai/DeepSeek-V3
gpt-4o* -
Table 8: Model names as referenced in the paper, and corresponding Hugging Face Hub identifiers. An asterisk (*) indicates models accessed via REST APIs: DeepSeek-V3 (https://api-docs.deepseek.com/) and GPT-4o (https://platform.openai.com/docs/api-reference/introduction).
[Figure 7 (heatmaps): four aggregated 3x3 confusion matrices (actual label vs. predicted label over the classes standard, dual quality, and other problems), one each for DeepSeek Zero-shot Instruct, DeepSeek Few-shot Instruct, GPT-4o Zero-shot Instruct, and GPT-4o Few-shot Instruct, with the cells of highest agreement and highest disagreement highlighted.]
Figure 7: Confusion matrices aggregated from five experiments for DeepSeek and GPT-4o models in zero-shot and few-shot instruction-based configurations.
Type Prompt
zero-shot Przypisz podaną niżej opinie do jednej z trzech klas: "dual quality", "other problems" lub "standard".
W odpowiedzi podaj jedynie nazwę klasy, bez dodatkowego komentarza.
Treść opinii:
<review>
few-shot Przypisz podaną niżej opinie do jednej z trzech klas: "dual quality", "other problems" lub "standard".
Przykłady:
Kapsułki są lepsze, niż na polski rynek tej samej firmy. dual quality
Dobry smak kawy. Kraj pochodzenia Niemcy. Nie jest tak kwaśna jak kupiona w kraju. dual quality
Mój ulubiony zapach. Sądzę jednak, że są dużo mniej trwałe niż te, które poprzednim razem kupiłam w sephorze. other problems
Proszek może i z Niemiec, ale produkcja Czechy - wprowadzanie klienta w błąd. other problems
Niezły preparat. Łagodzi trochę bóle i zmęczenie oczu. Stosuję od czasu do czasu. standard
jest ok, nie zauważyłam większej różnicy między "polską" a "niemiecką" wersją standard
W odpowiedzi podaj jedynie nazwę klasy, bez dodatkowego komentarza.
Treść opinii:
<review>
zero-shot+inst. Przypisz podaną niżej opinie do jednej z trzech klas: "dual quality", "other problems" lub "standard".
Wytyczne dla każdej z klas:
"dual quality" (podwójna jakość) opinia zawiera informacje o tym, że klient kupił ten sam produkt w dwóch krajach i zauważył różnicę w jakości, wydajności, składzie itp. Nie jest konieczne podawanie dokładnych nazw krajów, wystarczą zwroty takie jak „za granicą” lub „w naszym kraju”. Klient porównuje dwa takie same produkty lub grupy produktów. Wskazanie różnicy w cenie, dostępności lub ogólne stwierdzenie, takie jak „istnieją różnice między produktami zakupionymi we Francji i w Polsce” nie są klasyfikowane jako podwójna jakość.
"other problems" (inne problemy) opinia nie wskazuje na problem podwójnej jakości, ale dostarcza informacji o innych problemach, wśród których możemy wyróżnić: różnice w produktach wynikające z innego miejsca zakupu (ten sam rynek), miejsca pakowania lub otrzymanej partii; problemy z samym produktem wymagające głębszej analizy np. pogorszenie jakości z upływem czasu; praktyki niezgodne z prawem i/lub naruszające prawa klienta np. produkt jest prawdopodobnie podrobiony, podejrzenie oszustwa, wprowadzanie klienta w błąd, brak instrukcji w wymaganym języku, brak daty ważności itp.
"standard" standardowa opinia o produkcie, w której opisane uwagi dotyczą samego produktu i nie wskazują na problemy omówione przy klasach „podwójna jakość” lub „inne problemy”.
W odpowiedzi podaj jedynie nazwę klasy, bez dodatkowego komentarza.
Treść opinii:
<review>
few-shot+inst. Przypisz podaną niżej opinie do jednej z trzech klas: "dual quality", "other problems" lub "standard".
Wytyczne dla każdej z klas:
"dual quality" (podwójna jakość) opinia zawiera informacje o tym, że klient kupił ten sam produkt w dwóch krajach i zauważył różnicę w jakości, wydajności, składzie itp. Nie jest konieczne podawanie dokładnych nazw krajów, wystarczą zwroty takie jak „za granicą” lub „w naszym kraju”. Klient porównuje dwa takie same produkty lub grupy produktów. Wskazanie różnicy w cenie, dostępności lub ogólne stwierdzenie, takie jak „istnieją różnice między produktami zakupionymi we Francji i w Polsce” nie są klasyfikowane jako podwójna jakość.
Przykłady: "Kapsułki są lepsze, niż na polski rynek tej samej firmy.", "Dobry smak kawy. Kraj pochodzenia Niemcy. Nie jest tak kwaśna jak kupiona w kraju."
"other problems" (inne problemy) opinia nie wskazuje na problem podwójnej jakości, ale dostarcza informacji o innych problemach, wśród których możemy wyróżnić: różnice w produktach wynikające z innego miejsca zakupu (ten sam rynek), miejsca pakowania lub otrzymanej partii; problemy z samym produktem wymagające głębszej analizy np. pogorszenie jakości z upływem czasu; praktyki niezgodne z prawem i/lub naruszające prawa klienta np. produkt jest prawdopodobnie podrobiony, podejrzenie oszustwa, wprowadzanie klienta w błąd, brak instrukcji w wymaganym języku, brak daty ważności itp.
Przykłady: "Mój ulubiony zapach. Sądzę jednak, że są dużo mniej trwałe niż te, które poprzednim razem kupiłam w sephorze", "Proszek może i z Niemiec, ale produkcja Czechy - wprowadzanie klienta w błąd."
"standard" standardowa opinia o produkcie, w której opisane uwagi dotyczą samego produktu i nie wskazują na problemy omówione przy klasach „podwójna jakość” lub „inne problemy”.
Przykłady: "Niezły preparat. Łagodzi trochę bóle i zmęczenie oczu. Stosuję od czasu do czasu.", "jest ok, nie zauważyłam większej różnicy między "polską" a "niemiecką" wersją"
W odpowiedzi podaj jedynie nazwę klasy, bez dodatkowego komentarza.
Treść opinii:
<review>
Table 9: Prompts used during LLM evaluation. Bold text and blank lines were added only for readability of the table. For non-Polish speakers, translated prompts are available in Table 10.
Type Prompt
zero-shot Assign the following review to one of three classes: “dual quality”, “other problems” or “standard”.
In your answer, provide only the name of the class, without additional comment.
Review text:
<review>
few-shot Assign the following review to one of three classes: “dual quality”, “other problems” or “standard”.
Examples:
The capsules are better than those on the Polish market from the same company. dual quality
Good coffee taste. Country of origin: Germany. It is not as acidic as the one bought in the country. dual quality
My favorite scent. However, I think it’s much less long-lasting than the one I bought at Sephora last time. other problems
The powder may be from Germany, but it’s made in the Czech Republic - misleading the customer. other problems
Decent product. It slightly alleviates eye pain and fatigue. I use it occasionally. standard
It’s okay, I didn’t notice much difference between the "Polish" and "German" version. standard
In your answer, provide only the name of the class, without additional comment.
Review text:
<review>
zero-shot+inst. Assign the following review to one of three classes: “dual quality”, “other problems” or “standard”.
Guidelines for each category:
"dual quality" The review includes information that the customer purchased the same product in two different countries and
noticed a difference in quality, performance, composition, etc. It is not necessary to specify the exact names of the countries;
phrases like "abroad" or "in our country" are sufficient. The customer compares two identical products or groups of products.
Indicating a difference in price, availability, or a general statement such as "there are differences between products purchased in
France and Poland" is not classified as dual quality.
"other problems" The review does not indicate an issue of dual quality but provides information on other problems,
which can include: differences in products resulting from a different place of purchase (same market), place of packaging, or the
received batch; problems with the product itself requiring deeper analysis, such as deterioration in quality over time; practices that
are illegal and/or violate customer rights, such as the product potentially being counterfeit, suspicion of fraud, misleading the
customer, lack of instructions in the required language, lack of an expiration date, etc.
"standard" A standard product review where the comments pertain only to the product itself and do not indicate the
problems discussed in the "dual quality" or "other problems" categories.
In your answer, provide only the name of the class, without additional comment.
Review text:
<review>
few-shot+inst. Assign the following review to one of three classes: “dual quality”, “other problems” or “standard”.
Guidelines for each category:
"dual quality" The review includes information that the customer purchased the same product in two different countries and
noticed a difference in quality, performance, composition, etc. It is not necessary to specify the exact names of the countries;
phrases like "abroad" or "in our country" are sufficient. The customer compares two identical products or groups of products.
Indicating a difference in price, availability, or a general statement such as "there are differences between products purchased in
France and Poland" is not classified as dual quality.
Examples: "The capsules are better than those on the Polish market from the same company.", "Good coffee taste. Country of
origin: Germany. It is not as acidic as the one bought in the country."
"other problems" The review does not indicate an issue of dual quality but provides information on other problems,
which can include: differences in products resulting from a different place of purchase (same market), place of packaging, or the
received batch; problems with the product itself requiring deeper analysis, such as deterioration in quality over time; practices that
are illegal and/or violate customer rights, such as the product potentially being counterfeit, suspicion of fraud, misleading the
customer, lack of instructions in the required language, lack of an expiration date, etc.
Examples: "My favorite scent. However, I think it’s much less long-lasting than the one I bought at Sephora last time.", "The
powder may be from Germany, but it’s made in the Czech Republic - misleading the customer."
"standard" A standard product review where the comments pertain only to the product itself and do not indicate the
problems discussed in the "dual quality" or "other problems" categories.
Examples: "Decent product. It slightly alleviates eye pain and fatigue. I use it occasionally.", "It’s okay, I didn’t notice much
difference between the "Polish" and "German" version."
In your answer, provide only the name of the class, without additional comment.
Review text:
<review>
Table 10: Translated prompts from Table 9, used during LLM evaluation.
[Figure 8 (heatmaps): five aggregated 3x3 confusion matrices (actual label vs. predicted label over the classes standard, dual quality, and other problems), one each for DeepSeek Zero-shot Instruct, GPT-4o Zero-shot Instruct, mmlw-roberta-base, herbert-large-cased, and polish-roberta-large-v2, with the cells of highest agreement and highest disagreement highlighted.]
Figure 8: Confusion matrices aggregated from five experiments for best performing LLMs and top-performing local models.
Method | Precision | Recall | F1 | Accuracy | mPrecision | mRecall | mF1
(Precision, Recall, and F1 refer to the Dual Quality class; Accuracy, mPrecision, mRecall, and mF1 are computed over all classes.)
SetFit + sentence transformers
LaBSE | 74.0±8.7 | 37.9±12.1 | 49.1±11.4 | 70.1±3.6 | 55.5±3.6 | 47.6±4.1 | 48.4±4.8
para-multi-mpnet-base-v2 | 69.4±3.4 | 45.5±4.4 | 54.8±3.0 | 67.1±1.8 | 53.2±1.4 | 49.6±0.8 | 50.2±0.6
para-multi-MiniLM-L12-v2 | 69.3±3.2 | 40.3±7.8 | 50.7±7.4 | 62.9±2.0 | 49.3±1.9 | 43.2±2.7 | 44.6±3.0
multi-e5-small | 74.0±4.1 | 41.7±4.8 | 53.2±4.2 | 72.9±1.3 | 49.1±1.5 | 46.2±1.5 | 45.5±1.7
multi-e5-base | 78.4±4.8 | 45.9±19.1 | 54.6±19.1 | 73.4±4.1 | 54.1±5.0 | 49.2±7.1 | 48.6±9.1
multi-e5-large | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 63.1±0.0 | 21.0±0.0 | 33.3±0.0 | 25.8±0.0
gte-multi-base | 81.7±4.9 | 58.0±4.7 | 67.7±3.7 | 71.6±3.3 | 57.2±2.7 | 52.5±2.6 | 54.0±2.7
Transformer-based encoders
mBERT | 61.7±19.5 | 6.6±4.3 | 11.1±6.7 | 62.1±2.8 | 43.8±6.3 | 34.7±2.2 | 30.3±3.3
xlm-roberta-base | 69.5±2.3 | 66.9±6.8 | 67.9±2.9 | 73.0±1.0 | 55.5±1.1 | 55.1±2.1 | 55.0±1.7
xlm-roberta-large | 84.8±3.8 | 63.1±4.8 | 72.3±4.0 | 72.6±2.7 | 60.1±2.7 | 56.7±3.9 | 57.5±3.3
LLMs
deepseek-v3 zero-shot | 47.6±1.9 | 86.2±2.8 | 61.4±2.3 | 32.4±0.9 | 46.9±1.5 | 39.4±1.0 | 28.8±0.9
deepseek-v3 few-shot | 62.8±1.4 | 70.7±1.4 | 66.5±0.7 | 35.6±0.6 | 54.3±0.7 | 46.7±1.8 | 36.7±0.7
deepseek-v3 zero-shot+inst. | 85.9±1.8 | 52.3±0.8 | 65.0±0.3 | 49.5±0.7 | 63.4±1.3 | 58.7±1.0 | 49.1±0.7
deepseek-v3 few-shot+inst. | 91.9±4.8 | 50.6±0.8 | 65.2±1.8 | 44.3±0.9 | 65.6±2.2 | 56.2±1.2 | 46.1±1.0
gpt-4o zero-shot | 38.8±0.6 | 86.8±2.2 | 53.6±1.0 | 33.3±0.6 | 47.4±0.2 | 36.8±0.7 | 27.0±0.4
gpt-4o few-shot | 58.5±0.8 | 73.6±0.8 | 65.1±0.4 | 34.1±0.6 | 55.8±0.6 | 48.1±2.3 | 34.7±0.7
gpt-4o zero-shot+inst. | 85.3±1.3 | 46.6±0.0 | 60.2±0.3 | 52.6±0.6 | 62.3±0.3 | 57.1±0.3 | 49.6±0.3
gpt-4o few-shot+inst. | 80.2±1.1 | 46.6±0.0 | 58.9±0.3 | 41.6±0.6 | 61.4±0.5 | 50.2±1.0 | 42.7±0.5
Table 11: Evaluation results on a multilingual dataset consisting of English, German, and French reviews. Results illustrating a case where multilingual transfer did not work are marked in red.