Estimating Digitization Costs in Digital Libraries Using DiCoMo
Alejandro Bia¹, Rafael Muñoz², and Jaime Gómez²
¹ CIO/DEMI, Miguel Hernández University, Spain
² DLSI, University of Alicante, Spain
Abstract. Estimating digitization costs is a very difficult task: it is hard to make exact predictions due to the great quantity of unknown factors. However, digitization projects need to have a precise idea of the economic costs and the times involved in the development of their contents. The common practice when we start digitizing a new collection is to set a schedule, and a firm commitment to fulfill it (both in terms of cost and deadlines), even before the actual digitization work starts. As with software development projects, incorrect estimates produce delays and cause cost overruns.
Based on methods used in Software Engineering for software development cost prediction, like COCOMO and Function Points, and using historical data gathered during five years at the Miguel de Cervantes Digital Library, during the digitization of more than 12,000 books, we have developed a method for time and cost estimates named DiCoMo (Digitization Costs Model) for digital content production in general. This method can be adapted to different production processes, like the production of digital XML or HTML texts using scanning and OCR, followed by human proofreading and error correction, or the production of digital facsimiles (scanning without OCR). The accuracy of the estimates improves with time, since the algorithms can be optimized by making adjustments based on historical data gathered from previous tasks.
Keywords: Cost and time estimates, Digitization, Contents production, DL project management.
1 Introduction

Almost three decades after Barry Boehm presented the Constructive Cost Model (COCOMO) [3], the problem of accurately estimating software development costs is far from solved. In professional software development practice, only a few developers use software estimation methods other than expert judgment (which is basically an "expert's guess"), and when they do, the results are usually far from satisfactory [9,11].
This work discusses some of the reasons why cost estimation methods like COCOMO fail in practice in software engineering applications, but may be accurate
for other tasks, like predicting digitization times and costs, provided we make
the necessary modifications and customizations to the algorithm. By doing this,
we have improved the accuracy of the estimates and widened its possible uses
to other fields. Hence, we recommend the use of this type of algorithmic method
for tasks other than software development, like DL contents production. Below
we provide examples of production time and cost estimates obtained in this way
at the Miguel de Cervantes Digital Library (MCDL).
1.1 The Basic Digitization Cost Model
In the digitization cost model we propose, we use an equation similar to Intermediate COCOMO, but with some differences:
– Size-Independent Overhead. We added a new term called SIO (Size-
Independent Overhead) that represents the fixed preparation time for the
task, which is independent of its size. An example of this size-independent
overhead is the time needed to adjust the parameters of an image scanner
and OCR before starting a scanning session. This is a fixed time which does
not depend on the number of pages to be scanned later.
– The size is known beforehand. One of the reasons why COCOMO often fails in estimating software costs is that its calculations are based on an estimated size of the code to be built (KLOC²), which is highly uncertain at
the initial stages of the project. When applying a similar method to estimate
digitization costs, the first thing we realize is that we don’t have to guess the
size of the work because we can easily know it, or can accurately estimate it.
The size of the documents to digitize is expressed as the number of pages P, which can be counted (or calculated with reasonable accuracy) beforehand.
– Time is cost. There is one similarity with software development projects: since most of the cost in digitization is human labour, which in the long run outweighs the cost of hardware and software, time estimates of a digitization task can be directly converted to cost estimates using some money-per-hour rate.
Given the number of pages P, we can directly calculate the time in hours T,
with the Basic-DiCoMo formula:
T = a · P^b + SIO (1)
For a graphic example of this DiCoMo approach, see figure 1, where an estimation
curve (thick line) approaches real data spots (black squares) that represent time
measures of real digitized documents. The thin straight line represents a linear fit to the spots (a · P), which is not the best approach, while the curve (a · P^b) fits more closely, although not perfectly. The values for a and b are
obtained by adjusting the curve to best approach the cloud of points, using
historical data. The value of the fixed term SIO is the point at which the curve
crosses the Y axis, i.e. the time needed for the impossible case of a task of size 0, which for our purposes represents the preparation time mentioned before. For certain tasks (e.g. large ones), this time may be negligible and hence can be ignored.

² Kilo Lines Of Code.

Fig. 1. Document digitization times (hours vs. pages)
For example, the following equation, based on early experience at the MCDL,
gives us the estimated number of hours to process a text given the number of pages:

T = 0.069 · P^1.465 + 0.6 (2)
Using this formula, a standard-complexity book of 100 pages will take about 59
hours of scanning, correction and XML markup altogether.
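To make the arithmetic concrete, here is a minimal Python sketch of the basic formula with the MCDL parameters of equation (2); the function name and structure are ours, for illustration only, and reproduce the 100-page example above:

def basic_dicomo(pages: float, a: float = 0.069, b: float = 1.465,
                 sio: float = 0.6) -> float:
    """Basic DiCoMo estimate in hours: T = a * P**b + SIO (equation 2)."""
    return a * pages ** b + sio

# A standard-complexity book of 100 pages:
print(f"{basic_dicomo(100):.1f} hours")  # prints about 59.3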
1.2 The Importance of Historical Cost-Data
Inside most organizations, the estimation of production costs is usually based
on past experiences. Historical data are used to identify the cost factors and
to determine their relative importance within the organization. Historical
data will be used first to adjust the basic estimation algorithm (the exponential
curve), and later to adjust the detected impact factors to be used as modifiers
to obtain more accurate results. This is the reason why it is so important to
systematically collect and store time and feature data from projects, and to
take note of the perceived factors that affect the times, as well as the amount of work done.
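To show how such historical records can drive the adjustment of the basic curve, the following sketch fits the parameters a, b and SIO of equation (1) to logged (pages, hours) data with SciPy; the data points are invented for the example, not MCDL records:

import numpy as np
from scipy.optimize import curve_fit

def dicomo(pages, a, b, sio):
    # Basic DiCoMo curve (equation 1): T = a * P**b + SIO
    return a * pages ** b + sio

# Hypothetical historical records: pages per task and hours actually spent.
pages = np.array([30.0, 80.0, 120.0, 200.0, 350.0, 500.0])
hours = np.array([10.0, 45.0, 80.0, 160.0, 345.0, 590.0])

(a, b, sio), _ = curve_fit(dicomo, pages, hours, p0=(0.1, 1.4, 0.5))
print(f"a = {a:.3f}, b = {b:.3f}, SIO = {sio:.2f}")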
1.3 Adjusted DiCoMo
The simple approach used in equation 1 doesn't take into account the fact that different literary works have different degrees of difficulty owing to several factors
(discussed below), which will affect production times. We have detected the
most important of these factors, and assigned weights to them to be able to
use them as feature-modifiers. We added an Effort Adjusting Factor (EAF) to
the Adjusted DiCoMo equation, equivalent to the one used in Intermediate COCOMO, but based in this case on specific digitization features. The EAF is calculated as the product of the relevant feature modifiers chosen from a table (see for instance table 1). The modifiers shown in the table were obtained from historical data collected at the MCDL. The value of these modifiers is 1.00 in the normal case, thus having no impact on the overall EAF, and slightly above or below 1 in the other cases, raising or lowering the unadjusted estimate and producing the desired "adjusting" effect (see the vertical lines from the exponential curve to the white filled triangles in figure 1).
T = a · P^b · EAF + SIO (3)

where EAF = ∏_i modifier_i
Table 1. DiCoMo: Complexity modifiers used to calculate the EAF

Feature                                   Low    Normal   High
encoder experience and skills             …      1.00     …
familiarity with task                     …      1.00     …
familiarity with computer tools           …      1.00     …
foreign or ancient languages present      …      1.00     1.25
stained or old paper                      …      1.00     1.15
old font faces                            …      1.00     …
special care required (ancient books)     0.80   1.00     …
high quality demands                      …      1.00     …
inadequate technology used                …      1.00     …
1.4 Factors That Affect Digitization Costs
There are several factors that affect the cost of production of digital objects. Both
these factors and their effect on costs are difficult to determine and have to be
carefully studied. They are detected by experience, as features which are found to
affect the time required to complete a task either positively or negatively. Once
a factor of this type is detected, we have to measure its impact, as a percentage
relative to the “normal-case” time. The best way to do this is by gathering time records of digitization tasks, also recording their particular features and their weight (e.g. low, normal, high). With enough records of this type, algorithm
optimization techniques can be applied to infer the range of impact of a given
feature as a +/- percent. For instance, we detected that the literary style of a
text affected its digitization time, due to harder or simpler markup requirements.
We started recording this feature, indicating whether a text was mainly, from
best to worst case: prose, verse, drama written in prose or drama written in
verse. We stored records of this together with the times required to complete the
digitization task. After gathering a good number of records, we used optimization techniques to get the optimum value range for this new modifier, which turned out to be ±7.6%. So in the case of drama written in verse (the hardest markup case), we will have to add 7.6% more time to the estimate.
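As a sketch of this kind of modifier inference (our own illustration, with invented records), one can search for the percentage m that minimizes the squared error between adjusted estimates and actual times, encoding the feature level as -1, 0 or +1:

from scipy.optimize import minimize_scalar

# (base_estimate_hours, actual_hours, level) records, where level is
# -1 for the easiest case, 0 for normal and +1 for the hardest case.
records = [(40.0, 37.5, -1), (55.0, 55.8, 0), (62.0, 66.9, +1),
           (30.0, 27.4, -1), (80.0, 86.2, +1)]

def sq_error(m):
    # Apply the candidate modifier as base * (1 + m * level).
    return sum((base * (1 + m * lvl) - actual) ** 2
               for base, actual, lvl in records)

best = minimize_scalar(sq_error, bounds=(0.0, 0.5), method="bounded")
print(f"impact of the feature: +/- {best.x:.1%}")  # about +/- 7.7% here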
Among the factors detected, we can highlight the individual skills and expe-
rience of the persons assigned to the project, as well as their familiarity with
the specific characteristics of the work to be digitized, the familiarity with the
computer tools to be used, the complexity of the task, size, quality require-
ments, technology used, etc. Also important are some features of the document
that affect digitization times like: the presence of foreign or ancient languages,
stained/yellowish paper, old/irregular font faces, high quality demands, inade-
quate technology used, special care required for old books, etc.
The Adjusted DiCoMo equation (3), customized with historical data from
previous projects (4), and using the EAF factor, now gives us better estimates
of the time needed to digitize a text given the number of pages:
T = 0.081 · P^1.462 · EAF + 0.1 (4)
For instance, a book of 100 pages with stained/old paper (+15%) and foreign or
ancient languages present (+25%), will take approximately 98 hours to complete,
compared to the 59 hours estimated using the basic equation without modifiers:
T = 0.081 · 100^1.462 · 1.15 · 1.25 + 0.1 = 97.85 h
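Computationally, the adjusted estimate just multiplies the applicable modifiers into the EAF; this sketch (our own, not part of the paper) reproduces the worked example:

from math import prod

def adjusted_dicomo(pages, modifiers=(), a=0.081, b=1.462, sio=0.1):
    """Adjusted DiCoMo (equation 4): T = a * P**b * EAF + SIO."""
    eaf = prod(modifiers) if modifiers else 1.0
    return a * pages ** b * eaf + sio

# 100 pages, stained/old paper (1.15) and foreign/ancient languages (1.25):
print(f"{adjusted_dicomo(100, (1.15, 1.25)):.2f} hours")  # about 97.84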
Figure 1 shows the Basic DiCoMo exponential curve (thick line), that ap-
proaches the black square data spots that represent measures of real digitized
documents. The EAF adjusted results are shown as white filled triangles which
in most cases approach more closely the real values. In a very few cases, however,
the EAF results are worse than the basic curve.
The time assigned affects mainly the quality of the product obtained, which is notably reduced when the times assigned are unreasonably short, forcing the
technicians to work under excessive pressure. This is particularly true for the
correction and editing process, where text output from OCR has to be carefully
proofread and corrected. This is a delicate craft that takes time and cannot be
done under excessive pressure. When not properly done, further revisions and
corrections are needed, with a very negative impact on costs. Next, each one of
these factors is described in detail.
1.5 Size of the Material to Publish
Digitization projects, compared to software development projects, have the ad-
vantage that we can know quite precisely beforehand the size of the work to be
done (namely the number of pages or words to digitize).³
³ In software development projects, the number of lines of code is not known at the beginning of a project. This is the main drawback of the original COCOMO method, which was modified and renamed COCOMO II [4,5,6] to sort out this problem. Other methods, like Function Points [1], Use Case Points [2,10] and Object Points [12], which are based on functionality aspects instead of lines of code, do not have this problem.
There are various ways to measure the size of the material to digitize. The
first and easiest way to determine the raw size of a text to be digitized is to
count the pages. This is the most common method, and is generally sufficient
for accurate estimation purposes. A disadvantage is that pages are not equally
dense for all books. We can approximate the density by counting the words that fit in a standard page, or in a fixed-size window,
and then assuming that the rest of the pages are similar in this respect. To count
individual words would be more accurate (we verified this by experience), but
it is not a practical approach: the improvement in accuracy does not justify the
effort. However, after the OCR process takes place, we will obtain a text file,
with errors, but nevertheless a text file where we can automatically count the
number of words or get the size in bytes. This is a good measure of the size of the proofreading and correction work that follows, and may serve to adjust the
initial estimates for higher accuracy.
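As a small sketch of this refinement step (our own; the 350 words-per-standard-page figure is an assumption for the example), the post-OCR word count can be turned into a density-corrected page count before re-running the estimate:

def effective_pages(ocr_text: str, words_per_standard_page: int = 350) -> float:
    """Density-corrected size P from a post-OCR text file."""
    return len(ocr_text.split()) / words_per_standard_page

# e.g. re-estimate the correction step with the corrected size:
# hours = adjusted_dicomo(effective_pages(open("book.txt").read()), (1.15,))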
There are many complexity factors that affect every stage of the digitization
process (scanning, proofreading or correction, and markup). In the case of the
correction stage, which we consider the most critical one, there are various factors
to be taken into consideration:
– the type of text: prose, verse, drama (both written in prose, and in verse),
– footnotes, if the number is too high
– quotations in foreign or classical languages (if too many)
– the complexity of the author style and vocabulary
– the quality of the OCR output (few or lots of errors)
– the legibility of the original (the paper copy from which the digital version is derived).
1.6 Effort Adjusting Modifiers
Concerning markup, complexity varies according to the number and difficulty of the tags to be added. Drama, for instance, with the need for a cast list, speakers and speeches, requires an additional amount of tagging compared to prose.
Verse with split lines is another good example of extra complexity, since special
care needs to be taken to assign attribute values which indicate which part of
the split line of verse is which (initial, middle, or final).
In the case of the production of digital facsimiles from manuscripts, a case
of particular complexity is when we have to work on rare and valuable originals
that have to be handled with special care (wearing rubber gloves for instance)
and using a digital photographic camera instead of a flat bed scanner. On the
contrary, digitizing unbound pages using a flat bed scanner with an automatic
page feeder would be the easiest case.
For each of the critical features mentioned, three possible modifier values were set, to be used when the feature appears as high, normal, or low (e.g. the values could be 1.10, 1.00 and 0.90, respectively). For a given task, all the modifier values that apply to the case, and that are different from normal (1.00), are multiplied together to obtain the EAF factor.
Individual skills of the technicians: In the programmers' world, individual productivity has been measured extensively. Harold Sackman et al. carried out an experiment in 1968 [13]. They showed that the performance differences registered between individual programmers were much bigger than those attributed to the effect of the working environment. The difference between the best and the worst performance was very high, with experience being a decisive factor. In a later experiment, Sackman observed a variation in productivity of as much as 16 to 1. DeMarco and Lister also discussed the effects of a well-integrated group on productivity in their book Peopleware [7], which deals with the human component in software projects.
In digitization, the results that we have measured comparing correctors’ per-
formances show remarkable differences in productivity, depending on their indi-
vidual skills and experience (sometimes a 3 to 1 ratio). Variations in productivity of this magnitude are significant for cost estimates, making it necessary to express this in the calculations by means of a modifier.
Special quality requirements: In the case of digital text production, produc-
ing a modernized digital edition from an ancient text takes additional time and
effort compared to processing a modern text, since modernization is a complex
task that involves difficult decisions.
Using Madison markup for the transcription of a manuscript is another example of additional complexity introduced by quality requirements. So is the case of making highly legible
digital facsimiles from ancient manuscripts, where special care and fine-tuning
of the scanning equipment may be required, as well as graphic postprocessing.
Technological level of the environment: This is a relevant issue when us-
ing different technologies or migrating from old to new production tools. When
the environment is stable and well known, and the estimate equations are well
adjusted for it, there is no need to care about this issue. Changes in technol-
ogy, however, will surely require modifications to the equations, and may make
historical time and cost data obsolete for future estimate adjusting purposes.
1.7 Procedure to Estimate Costs Using DiCoMo
1. Establish the production process to follow (production workflow). There may be different production workflows for different purposes (e.g. facsimile images are only scanned, while text undergoes scanning, OCR, proofreading and markup).
2. Identify all the objects (books, images, etc) to be digitized and their associ-
ated tasks (Work Breakdown Structure).
3. Measure or estimate the size of each object to be digitized.
4. Establish the production steps to be followed by assigning the right workflow.
5. Specify the effort adjusting factors for each object.
6. Calculate the time each unit will take (use the adequate equation with the
corresponding complexity factors).
7. Calculate the total development time for the project as the sum of the individual estimates.
8. Optionally, compare the estimate with another one, perhaps a top-down estimate like the Delphi technique or expert judgment, identifying and correcting the differences if necessary.
1.8 The Most General Formula
In previous examples, we have used a single formula to estimate the whole digitization task, which is simpler, but better results can be obtained by using specific formulas, with their own adjustments, for each step in the digitization process (e.g. scanning, proofreading, markup). So in this case we consider each production
step as a functional unit, to which a specific estimation equation is applied. The
global estimate T turns out to be the sum of all the specific step estimates.
T = Σ_steps ( a · P^b · ∏_i eaf_i + SIO ) (5)

where the parameters a, b, SIO and the modifiers eaf_i are those of each specific step.
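A compact sketch of this per-step formulation (our own illustration; the step names and parameters are invented, not the MCDL's calibrated values):

from dataclasses import dataclass, field
from math import prod

@dataclass
class Step:
    # One production step with its own calibrated parameters (equation 5).
    name: str
    a: float
    b: float
    sio: float
    modifiers: list = field(default_factory=list)

    def hours(self, pages: float) -> float:
        eaf = prod(self.modifiers) if self.modifiers else 1.0
        return self.a * pages ** self.b * eaf + self.sio

def total_estimate(pages: float, steps: list) -> float:
    # Global estimate T: the sum of the per-step DiCoMo estimates.
    return sum(step.hours(pages) for step in steps)

# An invented text-production workflow:
workflow = [Step("scanning",   0.020, 1.10, 0.5),
            Step("correction", 0.050, 1.40, 0.2, [1.15]),
            Step("markup",     0.015, 1.30, 0.1)]
print(f"{total_estimate(100, workflow):.1f} hours")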
The DiCoMo method was implemented in the digital library's workflow-and-document-handling system, a software application that controls the whole production process of all the types of digital resources produced by the DL. It provides useful management information for estimating costs and times of digitization projects. It estimates times of cataloging, scanning, correction and markup in the case of text production, and cataloging, scanning, and graphic processing in the case of facsimiles.

Fig. 2. Estimate of scanning costs

Fig. 3. Parameters used to estimate digitization costs

Fig. 4. Estimate of correction costs

Fig. 5. Final report of digitization costs for a book (shows cataloging, scanning and correction of the text)
Figures 2 to 5 show a few screenshots captured from this system. Figure 2 shows a scanning-only estimate for a 71-page book. Figure 3 shows average historic
values for different types of complexities and types of scanning device. Figure 4
shows a correction-only estimate, and Figure 5 shows the final summary of costs
for the production of a digitized text book.
3 Conclusions

We have developed a cost estimation model for digitization projects based on
known software engineering cost models. This method allowed us to predict the
time required to complete digitization tasks with good accuracy. Digitization
projects, compared to software development projects, have the advantage that
the size of the work to be done can be known beforehand (namely the number of
pages or words to digitize). In software design we can only guess the total number
of lines of code a project will require, and the accuracy of the calculated time
estimates will depend largely on this preliminary “expert judgment” estimate.
We verified that the model we propose works well in practice, and can be easily
applied to different digital production processes, or other project or engineering
tasks, provided that the cost equation is fine-tuned for each type of task using
historical data. This requires two things to be done in advance:
– Sufficient historical data must be collected to fine-tune the parameters of
the cost equation.
– The main objective factors that affect the time required to do the task must
be determined, and adequate effort adjusting modifiers be calculated and
assigned to each of them.
With this information, a cost-equation for the specific production process can be
easily obtained. Good expert knowledge of the process facilitates the fine-tuning
task and allows for better estimation equations. Nevertheless, the cost-equations
can be dynamically improved by re-adjusting the parameters with the new data
fed-back from recently finished projects. In this way the estimation model can
be continuously and incrementally improved.
3.1 Some Remarks on the Nature of Time and Cost Estimates
Often we use the words prediction and forecast when referring to estimates.
The nature and purpose of predictions and forecasts are different from those of estimates.
In the case of predictions and forecasts (think of stock-exchange predictions or
weather forecasts), we obtain some prediction values, and then wait for real
events to happen and confirm the predictions, or not. In the case of an estimate,
we should not wait for an event to happen, but should work towards it instead.
This active, not passive, nature is essential for profiting from estimates. An
estimate is a target, a goal we have to fulfill, a reference or time frame to help
us control our project. A good estimate is the time or cost objective within which a task can be done under moderate pressure with a reasonably
good quality. A task can always be done in a longer time, or in a shorter time
under exceptional pressure, up to a point when either it cannot be done (at least
with the required quality), or it produces undesired uneasiness in the work team.
So it is wise to think of estimates as reasonable goals that will require some effort and control, and not as mere predictions. A good deal of risk management is also advisable to help accomplish the estimated targets without surprises.
References

1. Albrecht, A.J., Gaffney, J.E.: Software Function, Source Lines of Code, and Development Effort Prediction: A Software Science Validation. IEEE Transactions on Software Engineering SE-9(6), 639–648 (1983)
2. Banerjee, G.: Use Case Points, An Estimation Approach (August 2001), use case points.pdf
3. Boehm, B.W.: Software Engineering Economics. Prentice Hall, Englewood Cliffs (1981)
4. Boehm, B., Clark, B., Horowitz, E., Westland, C., Madachy, R., Selby, R.: Cost Models for Future Software Life-Cycle Processes: COCOMO 2.0. In: Arthur, J., Henry, S. (eds.) Annals of Software Engineering Special Volume on Software Process and Product Measurement, vol. 1, pp. 45–60. J.C. Baltzer AG, Science Publishers, Amsterdam (1995)
5. Clark, B., Devnani-Chulani, S., Boehm, B.: Calibrating the COCOMO II Post-Architecture Model. In: 20th International Conference on Software Engineering. Center for Software Engineering, University of Southern California (April 1998), http://sunset.usc.edu/csse/TECHRPTS/1998/usccse98-502/
6. CSE: COCOMO II Model Definition Manual. Center for Software Engineering, Computer Science Department, University of Southern California, Los Angeles, CA 90089 (1997), http://sunset.usc.edu/csse/research/COCOMOII/cocomo2000.0/
7. DeMarco, T., Lister, T.: Peopleware, Productive Projects and Teams. Dorset House
Publishing, New York (1987)
8. Fairley, R.E.: Software Engineering Concepts. McGraw-Hill, New York (1985)
9. Galorath, D.: Software Project Failure Costs Billions. Better Estimation and
Planning Can Help, June 7 (2008),
10. LCI: Use Cases and Function Points, Longstreet Consulting Inc. (2004),
11. Magazinovic, A.: Exploring Cost Estimation Inaccuracy - Why Do Practitioners Still Fail to Predict the Actuals? Tech. rep., Department of Computer Science and Engineering, Chalmers University of Technology, SE-41296 Göteborg, Sweden (2008),
12. Minkiewicz, A.F.: Measuring Object Oriented Software with Predictive Object
Points, PRICE Systems, L.L.C (1997), http://www.pricesystems.com/
13. Sackman, H., et al.: Exploratory Experimental Studies Comparing Online and Offline Programming Performance. Communications of the ACM 11(1) (January 1968)