An Algorithm for Small Count Rounding of
Tabular Data
Proceedings, Privacy in Statistical Databases, Valencia, Spain, September 26-28, 2018
Øyvind Langsrud, Johan Heldal
Statistics Norway, P.O. Box 8131 Dep., 0033 Oslo, Norway
Abstract. This paper presents an algorithm for protecting frequency
tables in cases where small values are sensitive. When only some
cross-classifications are to be published, the number of cells to be
rounded is limited. The algorithm fills the rounding base value into
inner cells one by one. Using a cross-product criterion inspired by
partial least squares regression, this can be done very fast. Utilising
sparse matrix methodology is very important for the efficiency of the
algorithm. The algorithm is demonstrated based on a six-way frequency
table built from public microdata. All possible one-way, two-way,
three-way and four-way cross-classifications are assumed to be
published. In 90 seconds, 89793 inner cells were changed in a manner
that gave satisfactory results. In addition, the method was tested, with
good results, on the Norwegian data for the European Census Hub
2011. In 144 minutes, all possible cross classifications of 79 unique
principal marginal distributions belonging to 21 hypercubes were
simultaneously rounded in a way that ensured additivity and
consistency.
1 Introduction
The field of statistical disclosure control for tabular data consists of
several methods [4, 7, 9]. When small values in frequency tables are
sensitive, they may be protected by cell suppression. Then, in addition
to the primary suppressed cells, other cells must be secondary
suppressed. An alternative method is rounding. Secondary suppression
is then avoided. The price to pay is that the various totals are slightly
changed. When the starting point is that small values are sensitive, only
small values need to be rounded. Furthermore, if the inner cells are not
going to be published, rounding can be limited. The aim is to make sure
that those cells (cross-classifications) to be published are safe. The
nonzero inner cell frequencies are the frequencies of the unique
combinations in the microdata.
Table 1. The example frequency table
col1 col2 col3 col4 col5 Total
row1 6 0 1 3 4 14
row2 1 2 3 1 2 9
row3 0 1 1 0 2 4
Total 7 3 5 4 8 27
Table 2. Rounded data using 5 as rounding base and considering row and column
totals as publishable.
col1 col2 col3 col4 col5 Total
row1 6 0 5 0 4 15
row2 1 0 0 5 2 8
row3 0 5 0 0 0 5
Total 7 5 5 5 6 28
Thus, the rounding of inner cells corresponds to changing microdata. When the frequencies of the cells to
be published are recalculated from the rounded inner cells, additivity
and consistency are guaranteed. Such a method is described in [3]. In
this paper an improved algorithm is presented. The approach differs
from methods that seek an optimal solution, such as the controlled
rounding procedure implemented in τ-ARGUS [8, 13]. However, the
solution obtained is good, and very large and complicated tables
can be handled quickly. The algorithm is available as an R package [5].
2 Introductory example
To illustrate the methodology we consider the example data in Table 1.
We assume that only the overall total and the row and column totals are
to be published (publishable cells). To prevent counts in the range 1 to
4 from being published, inner cells are rounded using 5 as the rounding base. The
result is shown in Table 2. As we see, not all small inner cells have been subjected
to rounding. The aim is to limit the rounding to the necessary cells. We will
now examine how this is done by the proposed algorithm.
In general, a vector, z, of all the publishable cells can be computed
from the vector, y, of the inner cells via a dummy matrix X:
z = X^T y = \begin{pmatrix} 14 & 9 & 4 & 7 & 3 & 5 & 4 & 8 & 27 \end{pmatrix}^T \qquad (1)
In our example X is a 15 × 9 matrix since we have 15 inner cells and 9
publishable cells.
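As a concrete illustration, the following R sketch (our own code, not taken from the package) builds the 15 × 9 dummy matrix for Table 1 and recomputes the nine publishable cells of equation (1):

```r
# Inner cells of Table 1, stored column by column (col1, col2, ..., col5)
y <- c(6, 1, 0,  0, 2, 1,  1, 3, 1,  3, 1, 0,  4, 2, 2)
rows <- factor(rep(paste0("row", 1:3), times = 5))
cols <- factor(rep(paste0("col", 1:5), each = 3))

# Dummy matrix X: one column per publishable cell
# (3 row totals, 5 column totals and the overall total), one row per inner cell
X <- cbind(model.matrix(~ rows - 1), model.matrix(~ cols - 1), Total = 1)

# Equation (1): z = X^T y should give 14 9 4 7 3 5 4 8 27
z <- as.vector(crossprod(X, y))
```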
Looking at Table 1, there are three publishable cells in the range 1
to 4. Their values are caused by six inner cells. Their row-column indices
are: (2,2), (3,2), (3,3), (1,4), (2,4) and (3,5). We will therefore consider
a reduced version of y and X corresponding to these six elements. After
removing rows of X we will also remove the columns where all elements
are identical, like the column for the total table count. Thus, only seven
totals are of interest: row1, row2, row3, col2, col3, col4 and col5. The
reduced versions of X and y corresponding to the indices and totals
described above are given below.
X = \begin{pmatrix}
0 & 1 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1
\end{pmatrix}, \quad
y = \begin{pmatrix} 2 \\ 1 \\ 1 \\ 3 \\ 1 \\ 2 \end{pmatrix}, \quad
y^{(1)} = \begin{pmatrix} 0 \\ 5 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \quad
y^{(2)} = \begin{pmatrix} 0 \\ 5 \\ 0 \\ 0 \\ 5 \\ 0 \end{pmatrix} \qquad (2)
The values of y are to be rounded. Since their sum is 10, it is clear that two
elements should be 5. The question is how to select these two elements. In
the algorithm, elements are selected one by one. This is seen as y^(1) and
y^(2) above. The objective is to choose elements so that the publishable
cells are close to the original ones. Initially the target is
z^{(1)} = X^T y = \begin{pmatrix} 3 & 3 & 4 & 3 & 1 & 4 & 2 \end{pmatrix}^T \qquad (3)
The element with maximal value of
c^{(1)} = X z^{(1)} = \begin{pmatrix} 6 & 7 & 5 & 7 & 7 & 6 \end{pmatrix}^T \qquad (4)
is selected as the one to be given 5. In this case three elements have the
maximum value of 7. One of them is randomly selected. In this example
this is element number 2 and the result is y^(1). The target for the next
selection is updated:
z^{(2)} = X^T (y - y^{(1)}) = \begin{pmatrix} 3 & 3 & -1 & -2 & 1 & 4 & 2 \end{pmatrix}^T \qquad (5)
Now we look for the maximal value of
c^{(2)} = X z^{(2)} = \begin{pmatrix} 1 & -3 & 0 & 7 & 7 & 1 \end{pmatrix}^T \qquad (6)
Here elements four and five share the maximum value 7; element number five is randomly selected and the result is y^(2).
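The two selection steps above can be reproduced with a few lines of R (our own sketch, with ties broken at random as described in the text):

```r
# Reduced dummy matrix: rows are the inner cells (2,2), (3,2), (3,3), (1,4),
# (2,4), (3,5); columns are the totals row1, row2, row3, col2, col3, col4, col5
X <- matrix(c(0,1,0,1,0,0,0,
              0,0,1,1,0,0,0,
              0,0,1,0,1,0,0,
              1,0,0,0,0,1,0,
              0,1,0,0,0,1,0,
              0,0,1,0,0,0,1), nrow = 6, byrow = TRUE)
y <- c(2, 1, 1, 3, 1, 2)
b <- 5
nb <- round(sum(y) / b)            # two elements are to be given the value 5
yr <- numeric(length(y))           # rounded version of y, filled one by one

for (i in seq_len(nb)) {
  cc <- as.vector(X %*% crossprod(X, y - yr))   # criterion, as in (4) and (6)
  k <- which(yr == 0 & cc == max(cc[yr == 0]))  # candidates not selected earlier
  if (length(k) > 1) k <- sample(k, 1)          # break ties randomly
  yr[k] <- b
}
yr   # one possible outcome: 0 5 0 0 5 0
```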
Table 3. Intermediate data within the rounding procedure. Inner cells rounded in the
first iteration (’) and to be rounded in the second iteration (”) are marked.
col1 col2 col3 col4 col5 Total
row1 6 0 1” 0’ 4 11
row2 1 0’ 3” 5’ 2 11
row3 0 5’ 0’ 0 0’ 5
Total 7 5 4 5 6 27
Table 3 is the rounded table obtained after these steps. Since the (3,3)-
element has been changed from 1 to 0, the col3 total has been changed
from 5 to 4. We need to re-run the process to fix this. In this case, reduced
versions of X and y together with the changed y are:
X="1 0
0 1 #,y="1
3#,y(1) ="5
0#(7)
Here the elements of y are (1,3) and (2,3) and the totals of interest are
row1 and row2. In the first round the sum of y was perfectly preserved.
This time the elements of y sum to 4 and thus one of the elements should
be 5. To select this element the target is computed differently from (3).
Since the goal is to produce a best possible rounding of Table 1, not of
Table 3, we have to correct X^T y with the marginal differences between
Tables 1 and 3 when calculating z^(1), i.e. with [14-11, 9-11] = [3, -2], as follows:
z^{(1)} = X^T y + \begin{pmatrix} 3 & -2 \end{pmatrix}^T = \begin{pmatrix} 4 & 1 \end{pmatrix}^T \qquad (8)
We proceed as in (4) and thus the first element is selected. Now all
publishable cells are satisfactory. A total of eight inner cells has been
rounded and the results can be seen in Table 2.
3 Motivation by partial least squares regression
Recall equation (1) where publishable cells are computed from inner cells.
By regressing y onto X we can calculate fitted values by
\hat{y} = X (X^T X)^{\dagger} X^T y = (X^{\dagger})^T X^T y = (X^{\dagger})^T z \qquad (9)
Since the columns of X are collinear, we here use the Moore-Penrose
generalized inverse (pseudoinverse) [10]. This solution can also be
obtained by regressing z onto X^T. In this case ŷ is the vector of estimated
regression parameters. This way estimates of the inner cells can be
found from the publishable cells. In principle this is superfluous, since
the value of y is known, but we will proceed with this idea. In practice
we will consider reduced versions of X and y as in the section above.
The rounding can be based on ŷ and one possibility is to select the
largest elements to have the rounding base value. This solution can be
improved by selecting elements one by one. Each time, the element with
the largest value of ŷ is selected. This way it is possible to update z
successively so that deviations already made are taken into account. Then the
estimation of y from z is no longer superfluous. Instead of using a
generalized inverse, another regression method aimed at handling collinear
variables may be used. One such method is partial least squares
regression (PLS) [2, 6]. A meta-parameter in PLS is the number of
components to use. The precursor of PLS uses only a single component,
and here we will also consider only a single component. With a single
component, a specific linear combination of the independent variables
is used as the regressor. The coefficients of this linear combination are
obtained as the cross product between the dependent variable and the
matrix of independent variables (possibly scaled). Since we are
regressing z onto X^T, these coefficients are obtained as Xz. The final
regression coefficients (ŷ) will be proportional to Xz. Since the only
purpose is to locate the maximum value, Xz is sufficient.
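As a small numerical check (our own sketch), the pseudoinverse-based fitted values of equation (9) and the cross product Xz can both be computed for the reduced example of Section 2; MASS::ginv provides the Moore-Penrose pseudoinverse:

```r
library(MASS)   # for ginv()

X <- matrix(c(0,1,0,1,0,0,0,
              0,0,1,1,0,0,0,
              0,0,1,0,1,0,0,
              1,0,0,0,0,1,0,
              0,1,0,0,0,1,0,
              0,0,1,0,0,0,1), nrow = 6, byrow = TRUE)
z <- c(3, 3, 4, 3, 1, 4, 2)            # z^(1) from equation (3)

yhat <- as.vector(t(ginv(X)) %*% z)    # equation (9): (X^+)^T z
cXz  <- as.vector(X %*% z)             # single-component criterion Xz, as in (4)

which.max(yhat)          # element suggested by the pseudoinverse fit
which(cXz == max(cXz))   # elements with maximal cross-product criterion (ties possible)
```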
4 Efficient computation by sparse matrix methodology
When the number of cells (publishable and inner) is high, the number of
elements of X according to (1) will be huge. Most of the elements are,
however, zero. Storing and manipulating such matrices are efficiently
handled by programming libraries that support sparse matrices. To
implement the algorithm, we used the R package Matrix. Expressions
involving sparse matrices can be written as usual. A main idea of sparse
matrix methodology is that the matrices can be stored in triplet form:
(row index, column index, value). Only the triplets with non-zero values
are stored. In practice the matrix may be stored in an even more
compact form. Using the Matrix package it is possible to get the triplet
form representation explicitly. As described below, this is utilised as a
part of our algorithm.
The algorithm successively updates c^(i), where i = 1, 2, ..., and two
such steps are seen in (4) and (6). It turns out that this update amounts to
subtracting one column (or row) of M = bXX^T, where b is the
rounding base. If M is calculated beforehand and stored in triplet form,
this update can be done very fast. When the triplets are stored in an
ordered form, it is easy to extract the indices and values
needed for a specific update.
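A minimal illustration of this machinery is sketched below (our own example with an arbitrary small matrix); the Matrix package exposes the triplet representation through coercion to a triplet-based sparse class:

```r
library(Matrix)

# A small 0/1 dummy matrix in sparse form: rows = inner cells, columns = totals
X <- sparseMatrix(i = c(1, 1, 2, 2, 3, 3),
                  j = c(1, 3, 1, 4, 2, 4),
                  x = 1)

b <- 5
M <- b * tcrossprod(X)          # M = b X X^T, used for the fast updates

# Triplet form: only the nonzero entries (row index, column index, value) are stored
Xt <- as(X, "TsparseMatrix")
cbind(i = Xt@i + 1, j = Xt@j + 1, x = Xt@x)

# Updating the criterion amounts to subtracting one row of M
cc <- rep(0, nrow(X))
cc <- cc - M[2, ]               # e.g. after selecting inner cell number 2
```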
5 The algorithm
Below we formulate the algorithm in two parts. In the example in
Section 2 the inner part of the algorithm had to be executed twice. Here
we have added an iteration step at the end: similarly to how elements
were filled with rounding base values, we remove one of them and then
fill in the value again elsewhere. This process continues as long as the
cross-product criterion improves. R sketches of both parts are given after
the respective listings below.
Calculating M as described in the section above improves the speed.
But in the case of a large number of inner cells to be rounded, the size
of this matrix is the bottleneck of the algorithm. It is possible to treat
this problem by limiting the number of inner cells to be rounded each
time. The cells to be rounded are then selected randomly. As described
earlier, there is another random element of the algorithm: in cases where
c^(i) has several elements with the maximal value, one of them is selected
randomly.
The inner part of the algorithm
1. Input to the algorithm is X, z and the number, n_b, of rounding base
values to be filled into the rounded version of y. Here z contains the required
values that lead to the original totals. The original y is not needed.
Because of rounding performed earlier, z may differ from X^T y.
For the same reason, n_b may differ from the value calculated
from the element sum of y.
2. Calculate M = bXX^T and store this matrix in triplet form.
3. Let y^(0) be a vector of zeros.
4. Let c^(1) = Xz.
5. For i = 1, 2, ..., n_b:
– Identify the element, k_i, of c^(i) with maximal value among the
elements not identified earlier.
– Set element number k_i of y^(i-1) to b to obtain y^(i).
– Subtract row number k_i of M from c^(i) to obtain c^(i+1).
6. Let c = c^(n_b+1), that is, the last value calculated in the loop above.
7. Let ỹ = y^(n_b), that is, the last value calculated in the loop above.
8. Iterate:
– Identify the element number k_min of c with minimal value among the
elements corresponding to nonzero elements of ỹ.
– Update c by adding row number k_min of M.
– Identify the element number k_max of c with maximal value among the
elements corresponding to zero elements of ỹ.
– If the value of element k_min of c is at least as large as the value of
element k_max, the iteration stops.
– Update c by subtracting row number k_max of M.
– Set element k_min of ỹ to zero and element k_max of ỹ to b.
9. The result is ỹ, which is the rounded version of y.
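The sketch below is our own dense R implementation of the inner part as listed above (the actual implementation works with sparse matrices and the stored triplets of M, and breaks ties randomly rather than by which.max):

```r
# Inner part of the algorithm (dense sketch). X: 0/1 dummy matrix with one row
# per inner cell, z: target totals, nb: number of rounding base values to fill in.
inner_round <- function(X, z, nb, b = 5) {
  M  <- b * tcrossprod(X)                       # M = b X X^T
  cc <- as.vector(X %*% z)                      # c^(1) = X z
  yr <- numeric(nrow(X))                        # y^(0), a vector of zeros
  if (nb < 1) return(yr)
  for (i in seq_len(nb)) {
    k <- which.max(replace(cc, yr > 0, -Inf))   # max among elements not yet chosen
    yr[k] <- b                                  # fill in the rounding base value
    cc <- cc - M[k, ]                           # subtract row k of M
  }
  repeat {                                      # step 8: exchange while it improves
    kmin <- which.min(replace(cc, yr == 0, Inf))
    cc   <- cc + M[kmin, ]
    kmax <- which.max(replace(cc, yr > 0, -Inf))
    if (cc[kmin] >= cc[kmax]) break
    cc <- cc - M[kmax, ]
    yr[kmin] <- 0
    yr[kmax] <- b
  }
  yr
}

# On the reduced example of Section 2, inner_round(X, z, nb = 2, b = 5)
# returns a vector with the value 5 in two of the six positions.
```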
The full algorithm
1. Select the publishable totals to be considered by the algorithm (cross-
classifications).
2. Let y consist of all inner cells, construct X and compute all the
selected publishable totals by z = X^T y.
3. Let ỹ = y and z̃ = z.
4. Iterate:
– Identify the sensitive elements of z̃ in the range 1 to b-1.
– If no elements are found, the iteration stops.
– Identify the elements of ỹ causing these values.
– If the number of elements exceeds a chosen limit, select a random sample
of limit size from the identified elements.
– Let ỹ* be the corresponding reduced version of ỹ.
– Obtain a corresponding reduced version of X (select rows).
Thereafter remove the columns of this matrix where all elements
are identical. The final reduced matrix is X*.
– Find the correction z_e as the elements of z - z̃ corresponding to X*.
– Compute z̃* = (X*)^T ỹ* + z_e.
– Find s_y as the sum of the elements of ỹ*.
– Find the correction s_e as the sum of the elements of y - ỹ.
– Find n_b* as (s_y + s_e)/b rounded to the nearest integer.
– Run the inner part of the algorithm with X*, z̃* and n_b* as input.
The output replaces the corresponding elements of ỹ.
– Update z̃ so that z̃ = X^T ỹ.
5. The result is ỹ, which is the rounded version of all inner cells.
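A corresponding dense sketch of the outer loop is given below (again our own illustration). It assumes inner_round() from the previous sketch, interprets "the elements of ỹ causing these values" as the nonzero elements contributing to a sensitive total, and omits the optional random sampling limit:

```r
# Full algorithm (dense sketch). X: dummy matrix of all inner cells versus all
# selected publishable totals, y: vector of inner cell counts, b: rounding base.
pls_round <- function(X, y, b = 5) {
  z  <- as.vector(crossprod(X, y))      # all selected publishable totals
  yt <- y                               # y-tilde, the rounded inner cells
  zt <- z                               # z-tilde, totals recomputed from y-tilde
  repeat {
    sens <- which(zt >= 1 & zt <= b - 1)                # sensitive publishable cells
    if (length(sens) == 0) break
    cand <- which(yt > 0 & rowSums(X[, sens, drop = FALSE]) > 0)
    Xr   <- X[cand, , drop = FALSE]
    keep <- apply(Xr, 2, function(x) length(unique(x)) > 1)   # drop constant columns
    Xr   <- Xr[, keep, drop = FALSE]
    ze   <- (z - zt)[keep]                              # correction from earlier rounding
    ztr  <- as.vector(crossprod(Xr, yt[cand])) + ze     # reduced target z-tilde-star
    nb   <- round((sum(yt[cand]) + sum(y - yt)) / b)    # n_b-star
    yt[cand] <- inner_round(Xr, ztr, nb, b = b)         # replace the selected elements
    zt <- as.vector(crossprod(X, yt))                   # update z-tilde
  }
  yt
}

# With X and y as constructed for Table 1, pls_round(X, y, b = 5) produces a
# rounding of the same type as Table 2.
```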
Table 4. Frequencies of cell frequencies omitting zeroes in original data
0 1 2 3 4-10 11+ All
Inner
original 0 103823 18037 6574 8632 1743 138809
rounded 57720 20455 11612 38647 8632 1743 138809
simple-rounded 75228 0 0 53206 8632 1743 138809
Publishable
original 0 193759 75544 41878 97164 99754 508099
rounded 174600 0 0 145174 86271 102054 508099
simple-rounded 200297 0 0 135969 71644 100189 508099
Table 5. Frequencies of absolute differences from original values in rounded data.
Original cell value: the first two columns refer to inner cells, the remaining columns to publishable cells.
|difference| 1 2 1 2 3 4-10 11+
0 20455 11612 0 0 21855 14015 15859
1 56323 5028 130474 38754 2113 34815 23760
2 27045 1397 63285 32040 1181 24142 20112
3 0 0 0 0 15966 10167 14798
4-6 0 0 0 4750 763 13409 20751
7-10 0 0 0 0 0 613 4253
11-100 0 0 0 0 0 3 221
101+ 0 0 0 0 0 0 0
Table 6. Frequencies of absolute differences from original values in simple-rounded
data.
Original cell value: the first two columns refer to inner cells, the remaining columns to publishable cells.
|difference| 1 2 1 2 3 4-10 11+
0 0 0 0 0 20483 8432 9777
1 75228 18037 140873 33740 0 28948 13275
2 28595 0 52886 36614 0 22901 12318
3 0 0 0 0 20675 10854 10731
4-6 0 0 0 5190 720 23075 23373
7-10 0 0 0 0 0 2898 15014
11-100 0 0 0 0 0 56 14874
101+ 0 0 0 0 0 0 392
6 Example using public microdata
To demonstrate the algorithm we use public microdata for the German
labour force survey for the year 2013 published by Eurostat [1]. We consider
the six-way frequency table obtained by crossing the variables AGE(7),
HOURREAS(19), HWWISH(94), ISCO1D(12), REFWEEK(52) and
SIZEFIRM(6). Here the number of categories is given in parentheses and
missing is treated as a category. The total number of (inner) cells is
46807488, but only 138809 of them have a nonzero count. We assume
that all possible one-way, two-way, three-way and four-way
cross-classifications are to be published. This results in a total of 508099
publishable cells with a nonzero count. Using 3 as the rounding base,
269303 of them need to be changed. Using the above algorithm, this is
achieved by changing 89793 inner cells. An overview of original and
rounded cell frequencies is given in Table 4. The computing time on an
ordinary laptop was 90 seconds. Within the algorithm, the number of
inner cells to be rounded in each loop was set to 1000, which is currently
the default value in the R function. A simple method is run for
comparison. Inspired by the RAPID solution in τ-ARGUS, all inner cells
with the value 2 are set to 3. To ensure an approximately correct overall
total, a random sample of the inner cells with the value 1 is also set to 3;
the remaining cells with the value 1 become 0.
In the data rounded by the algorithm, the maximum absolute difference
between original and rounded values is 16. The total number of absolute
differences greater than ten is 224. The corresponding values for the
simple method are 3740 and 15322. More information about the
absolute differences is available from Tables 5 and 6. The results of
the algorithm seem satisfactory and are superior to those of the simple method.
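The simple comparison method can be sketched in a few lines (our own code; that the remaining 1-cells become 0 is inferred from the counts in Table 4, and the sample size is chosen so that the overall total stays approximately unchanged):

```r
# Simple rounding to base 3, inspired by the RAPID solution: all 2s become 3,
# a random sample of the 1s becomes 3 and the remaining 1s become 0.
simple_round <- function(y) {
  yr <- y
  yr[y == 2] <- 3
  ones <- which(y == 1)
  k <- round((length(ones) - sum(y == 2)) / 3)   # keeps the total roughly unchanged
  up <- sample(ones, k)
  yr[up] <- 3
  yr[setdiff(ones, up)] <- 0
  yr
}
```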
7 The European Census Hub 2011
The method was tested on the Norwegian data for the European Census
Hub 2011 [11, 12]. Within the Linux environment at Statistics
Norway, the 79 unique principal marginal distributions of the first 21
person unit hypercubes (1-4, 6-22) could be handled simultaneously.
Thus, the variables considered are AGE L(6), AGE M(21), CAS H(7),
CAS L(3), COC L(5), COC M(10), EDU(9), FST H(6), GEO L(8),
HST H(8), IND H(23), LMS(4), LOC(12), LPW L(10), OCC(12),
POB L(4), POB M(9), ROY(5), SEX(2) and SIE(4). Here the number
of categories is given in parentheses and L, M and H mean different
levels of detail. The total numbers of unique one-way, two-way,
Table 7. Frequencies of cell frequencies omitting zeroes in original census hub data.
0 1 2 3 4-10 11-100 101-1000 1001-10000 10001+ All
Inner
original 0 1008299 166043 67991 108129 44261 3843 280 6 1398852
rounded 26301 969102 165535 81395 108129 44261 3843 280 6 1398852
Publishable
original 0 92415 50820 35329 125877 282110 261417 156629 64422 1069019
rounded 92183 0 0 99266 111928 283152 261431 156636 64423 1069019
Table 8. Frequencies of absolute differences from original values in rounded census
hub data.
Original cell value: the first two columns refer to inner cells, the remaining columns to publishable cells.
|difference| 1 2 1 2 3 4-10 11-100 101-1000 1001-10000 10001+
0 969102 165535 0 0 19557 29517 59451 38226 16121 5501
1 26050 257 59661 26315 1748 41180 77125 62337 29109 10456
2 13147 251 32754 20362 1291 28470 58234 50955 26275 9543
3 0 0 0 0 12116 11213 34191 36160 21515 8154
4-6 0 0 0 4143 617 14611 43433 54093 40709 16917
7-10 0 0 0 0 0 882 8855 16637 17916 9775
11-20 0 0 0 0 0 4 819 2988 4892 3911
21-56 0 0 0 0 0 0 2 21 92 165
Table 9. Some publishable census hub cells with big absolute differences (rounded
= original + difference). The four biggest differences and the four biggest differences
among one-way cells are shown. In addition, the cells corresponding to the frequencies
2 and 4 in Table 8 are shown.
cell original difference
AGE M=Y30-34, ROY=SAME 249809 56
EDU=ED5, SIE=SAL 843180 44
AGE M=Y10-14, ROY=CHG OUT3 3124 41
AGE M=Y10-14, CAS L=INAC, ROY=CHG OUT3 3124 41
SIE=SAL 2455153 -25
IND H=P 205036 20
IND H=K 49950 -16
AGE M=Y35-39 352008 -14
GEO L=NOZZ, AGE M=Y25-29, ROY=SAME 84 21
GEO L=NOZZ, AGE M=Y25-29, LOC=LT200, ROY=SAME 84 21
SEX=F, FST H=UNK, EDU=ED5, CAS L=UNE 10 11
GEO L=NO02, EDU=NONE, OCC=OC5, IND H=I 7 11
SEX=M, AGE M=Y25-29, SIE=SELF S, IND H=S 10 11
AGE M=Y85-89, EDU=ED3, OCC=OC5, LPW L=NO05 10 11
three-way, four-way, five-way and six-way cross-classifications to be
published are, respectively, 20, 121, 298, 325, 152 and 24. The total
number of publishable cells with a nonzero count is over one million. The
rounding base is 3 and an overview of original and rounded cell
frequencies is given in Table 7. This time the computing time was 144
minutes. Most of the time was spent creating the 1398852 × 1069019
dummy variable matrix (X). Since enough memory was available, the
number of inner cells to be rounded in each loop was set higher than the
total number of cells to be rounded. Information about the absolute
differences is presented in Table 8. The maximum absolute difference
was 56 and this and other cells with big differences are shown in
Table 9. In Table 8 we can see two elements with the frequencies 2
(|difference| 21-56, original value 11-100) and 4 (|difference| 11-20, original
value 4-10). These can be considered the most problematic, and the
corresponding cells are also included in Table 9. Considering that these are
higher order crossings and that the total number of cells is over one million,
these results are satisfactory.
8 Concluding remarks
From our experience, the algorithm works very well. Performance
studies and comparisons with other methods using various measures [9]
should be carried out in the future. In particular, a comparison with
optimal solutions in applications where such a methodology is possible
would be interesting. The method is simple and fast, and there are many
possibilities for modification. For example, one could introduce weights
to ensure very good results for selected cells. All in all, there are many
opportunities for future work. Nevertheless, the algorithm is practically
applicable without modification and the function implemented in R [5]
is useful. The rounding of the European Census Hub data is convincing.
References
1. Eurostat: EU Labour Force Survey Database User Guide, version April 2018. Tech.
rep., European Commission, Eurostat, Directorate F, Unit F-3 (2018)
2. Frank, I.E., Friedman, J.H.: A Statistical View of Some Chemometrics Regression
Tools. Technometrics 35(2), 109–135 (1993)
3. Heldal, J.: The European Census Hub 2011 Hypercubes - Norwegian SDC
Experiences. In: Work Session on Statistical Data Confidentiality (2017), Skopje,
The former Yugoslav Republic of Macedonia, September 20-22, 2017.
4. Hundepool, A., Domingo-Ferrer, J., Franconi, L., Giessing, S., Nordholt, E.S.,
Spicer, K., de Wolf, P.P.: Statistical Disclosure Control. John Wiley & Sons, Ltd
(2012)
5. Langsrud, Ø., Heldal, J.: SmallCountRounding: Small Count Rounding of
Tabular Data (2018), https://github.com/statisticsnorway/SmallCountRounding,
R package
6. Martens, H., Næs, T.: Multivariate Calibration. John Wiley & Sons, New York
(1989)
7. Salazar-González, J.J.: Statistical confidentiality: Optimization techniques to
protect tables. Comput. Oper. Res. 35(5), 1638–1651 (2008)
8. Salazar-González, J.J., Bycroft, C., Staggemeier, A.T.: Controlled rounding
implementation. In: UNECE Worksession on SDC, Geneva (2005)
9. Shlomo, N.: Statistical disclosure control methods for census frequency tables. Int.
Stat. Rev. 75(2), 199–217 (2007)
10. Strang, G.: Linear Algebra and Its Applications, 3rd edition. Harcourt Brace
Jovanovich, San Diego (1988)
11. The European Commission: Commission Regulation (EU) No 1201/2009. Official
Journal of the European Union 52, 29–68 (2009), L 329
12. The European Commission: Commission Regulation (EU) No 519/2010. Official
Journal of the European Union 53, 1–13 (2010), L 151
13. de Wolf, P.P., Hundepool, A., Giessing, S., Salazar, J.J., Castro, J.: tau-ARGUS
user’s manual, version 4.1. Tech. rep., Statistics Netherlands (2014)