Parliamentary documents from Spain
ETSIIT, University of Granada
Periodista Daniel Saucedo Aranda s/n 18071 Granada, Spain
ISLA, University of Amsterdam
Kruislaan 403 1098 SJ Amsterdam, The Netherlands
We created a corpus consisting of all parliamentary docu-
ments from Spain since its ﬁrst legislative period in 1977.
The documents were collected from the web page of the
Spanish Congress http://www.congreso.es and converted
into a uniform XML format with extensive metadata in the
Dublin Core standard. The collection contains over 50,000
documents with almost 1 million pages having over half a
billion tokens. We also collected a complete list of names
and biographical data of all members of parliament during
this period. All this data is available for download and will
be updated daily. This abstract describes the parliamen-
tary data, the data collection and transformation process
and presents some use cases for this corpus. The corpus can
be used for corpus-linguistic and political science research,
and is suitable for performing scalability tests for XML in-
Categories and Subject Descriptors
H.4.m [Information Systems]: Miscellaneous
Spanish, Text corpus, Politics, XML
1. COVERAGE AND SIZE OF THE CORPUS
Our aim was to create a corpus containing all digitally
available parliamentary documents from the Spanish Congress
of Deputies for all the legislative periods¹. A distinction is
made between digitally produced and scanned documents.
¹ Every legislative period consists of four years of political
activity, except for the "first" legislative period (the
constituent period), which lasted two years.
Copyright 2010 ACM ...$10.00.
The present version of the corpus contains two kinds of doc-
uments: Session diaries (Verbatim proceedings) and Oﬃcial
bulletins (Parliamentary documents). Table 1 shows the
periods for which digital and scanned data is available on
the web for each kind of document.
Table 2 displays information about the size of the corpus.
Figure 1 groups the counts per legislative period. We list the
following information: the size of the ﬁles in text format in
Megabytes; the number of documents; the number of pages
in the original documents; the number of tokens. We group
these numbers for the following two kinds of documents:
The first group consists of the verbatim notes or session diaries
of the Congress of Deputies. These can be plenary sessions,
sessions of (smaller) committees, united sessions, etc. Even though
the texts are edited and transcribed to be read, they are ac-
counts of spoken language. The second group, called parlia-
mentary documents, are composed of the descriptions of law
proposals, law projects, treaties, international agreements,
2. AVAILABILITY AND IPR
The corpus is available for download at
The Congress of Deputies allows working with all the public
PDF documents from its web page, but forbids using them
for commercial purposes. Apart from this, we are not aware of
copyright restrictions on the material. If you use the corpus,
please send an email to email@example.com.
3. TECHNICAL DESCRIPTION
3.1 Description of the data format
Every document in the corpus is a UTF-8 encoded XML
file which is valid with respect to the Relax NG schema
developed for the parliamentary information in Dutch [3].
Here we brieﬂy describe the structure of the documents. The
root of each document has three children:
meta this element contains meta-information of the docu-
ment described using the 15 elements from the Dublin
Core Metadata Element Set Version 1.1;
header this element contains textual data extracted from
the source text which may be used for display purposes;
Kind of document     Scanned                                  Digital
Plenary Sessions     From 1977-07-13                          From 1996-03-27
                     (Constituent Leg. Per. - V Leg. Per.)    (VI Leg. Per. - IX Leg. Per.)
Official Bulletins   From 1977-07-12                          From 1996-04-03
                     (Constituent Leg. Per. - V Leg. Per.)    (VI Leg. Per. - IX Leg. Per.)
Table 1: Availability of parliamentary data in Spain.
Figure 1: Statistics of the Spanish corpus from the diﬀerent legislative periods
Subcorpus                  MB text   # Documents   # Pages     # Tokens
Verbatim proceedings         17387          8020    295947    260351458
Parliamentary documents      20516         42847    631995    311722807
Total                        37903         50867    927942    572074265
Table 2: Number of documents, pages and tokens for parliamentary documents, verbatim proceedings and
the complete corpus.
text this element contains the complete text of the source
document. Each text element has one or more page
elements (corresponding to physical pages of the document),
which in turn are divided into one or more p (for paragraph)
elements.
Within the text element there is a strict separation between
content and metadata. All metadata is stored in attributes.
All text is contained in the p elements. The attributes of
the page and p elements give unique names and contain
provenance information. Both have an obligatory docno
attribute whose value is unique in the corpus and which,
together with a suitable namespace, forms a URN [5]. This
conforms to the recommendations for publishing eGovernment
material as set out by the eGov working group of the W3C [1].
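The document structure described in this section can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the root element name, the docno values, the Dublin Core fields shown and the URN namespace "example" are all hypothetical and are not taken from the corpus schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The root element name, the docno values
# and the selection of Dublin Core fields are illustrative assumptions.
SAMPLE = """<root xmlns:dc="http://purl.org/dc/elements/1.1/">
  <meta>
    <dc:title>Diario de Sesiones</dc:title>
    <dc:date>1996-03-27</dc:date>
    <dc:language>es</dc:language>
  </meta>
  <header>Congreso de los Diputados</header>
  <text>
    <page docno="example.1">
      <p docno="example.1.1">Se abre la sesion.</p>
      <p docno="example.1.2">Tiene la palabra el senor Presidente.</p>
    </page>
    <page docno="example.2">
      <p docno="example.2.1">Se levanta la sesion.</p>
    </page>
  </text>
</root>"""

# Namespace of the Dublin Core Metadata Element Set 1.1, in ElementTree notation.
DC = "{http://purl.org/dc/elements/1.1/}"


def read_document(xml_string):
    """Return (meta, header, pages) for one document.

    meta   : dict mapping Dublin Core element names to their text
    header : text of the header element
    pages  : list of (page docno, [(p docno, p text), ...])
    """
    root = ET.fromstring(xml_string)
    meta = {child.tag.replace(DC, ""): child.text for child in root.find("meta")}
    header = root.findtext("header")
    pages = []
    for page in root.find("text").findall("page"):
        paragraphs = [(p.get("docno"), p.text) for p in page.findall("p")]
        pages.append((page.get("docno"), paragraphs))
    return meta, header, pages


def to_urn(namespace, docno):
    """Combine a namespace identifier and a docno into a URN string."""
    return f"urn:{namespace}:{docno}"


if __name__ == "__main__":
    meta, header, pages = read_document(SAMPLE)
    print(meta["date"], header)
    for page_docno, paragraphs in pages:
        print(to_urn("example", page_docno), len(paragraphs), "paragraphs")
```

Because all metadata lives in attributes and all text in p elements, a reader of the corpus only needs these two access paths to recover both.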
3.2 Description of the data collection and pro-
The web page of the Congress of Spain, http://www.congreso.es,
contains the information we want to collect in one corpus.
We developed several Perl scripts that automatically download
all the metadata and the PDF documents. We describe the main
steps here.
Examining the search engine on the parliament's web page,
we observed that the URLs of the result pages contain all
the parameters entered in the query. By requesting such
URLs with every combination of parameters we are interested
in, we can automatically download the lists of documents for
every legislative period. We then process these lists, which
are in HTML format, to extract the URL of each PDF document
together with some information of interest for our corpus.
This information consists of a list of attributes:
• Name of the file.
• Date of the publication.
• Kind of document.
• Number of document.
• Number of pages.
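The listing and link-extraction steps could be sketched as follows. This is a Python illustration, not the Perl scripts used to build the corpus. The search path and the query parameter names (`legislature`, `kind`) are hypothetical, since the actual parameters of the search engine are not given here.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

BASE = "http://www.congreso.es"
SEARCH_PATH = "/search"  # hypothetical path, for illustration only


def listing_url(legislature, doc_kind):
    """Build a result-page URL that carries the query parameters.

    The parameter names are assumptions; the real ones must be read
    off the URLs produced by the site's search engine.
    """
    query = urlencode({"legislature": legislature, "kind": doc_kind})
    return f"{BASE}{SEARCH_PATH}?{query}"


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of all <a href="..."> targets ending in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))


def pdf_links(html, base_url=BASE):
    """Return the PDF links found in one HTML result list."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    sample = (
        '<ul><li><a href="/docs/DS_001.PDF">DS 1</a></li>'
        '<li><a href="/help.html">Help</a></li></ul>'
    )
    print(listing_url(6, "diary"))
    print(pdf_links(sample))
```

Iterating `listing_url` over all legislative periods and document kinds yields the full set of result lists, each of which is then passed to `pdf_links`.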
The number of pages does not appear on the web page, so
we obtained it with pdfinfo³, which retrieves information
about PDF documents such as the number of pages.
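A minimal Python sketch of this step is given below, assuming the pdfinfo command-line tool is installed and on the path. It relies on the "Pages:" line that pdfinfo prints; the function names are our own and the sample output is made up for illustration.

```python
import re
import subprocess


def page_count_from_output(pdfinfo_output):
    """Extract the page count from the text printed by pdfinfo.

    Returns None if no "Pages:" line is present.
    """
    match = re.search(r"^Pages:\s+(\d+)\s*$", pdfinfo_output, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def page_count(pdf_path):
    """Run pdfinfo on a file and return its number of pages."""
    result = subprocess.run(
        ["pdfinfo", pdf_path], capture_output=True, text=True, check=True
    )
    return page_count_from_output(result.stdout)


if __name__ == "__main__":
    # Made-up sample of pdfinfo output, used so the example runs without a PDF.
    sample_output = "Title:          Diario de Sesiones\nPages:          12\nEncrypted:      no\n"
    print(page_count_from_output(sample_output))
```

Separating the parsing from the subprocess call keeps the parsing testable without a PDF file at hand.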
Having collected the PDF documents and the metadata,
we transform them into the XML format using the
transformation described in [3].
3.3 Deputies details
Apart from the documents of the Congress of Deputies, we
also collected personal details of the deputies. This
information is useful for named entity reconciliation [2] in the
verbatim proceedings. Biographies of all deputies are available
from the web page of the Spanish parliament. We downloaded
the biographies using a method similar to the one described above.
We collected the following information for each deputy:
name; surnames; political party; legislative period to which
the deputy belongs; start date in the Congress; end date in the
Congress (if the deputy has left it); link to the photograph; link
to the Congress web page. This data is available in csv
³ This is part of the Xpdf software, see http://www.
An initial experiment using this database to recognize
and disambiguate speakers in the verbatim proceedings
showed that the extra information is useful and needed, in
particular for the scanned documents, which contain OCR errors.
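To illustrate how a deputy list can help with names damaged by OCR, the following Python sketch matches a noisy speaker name against the list using approximate string matching from the standard library. This is not the method used in the experiment, which is not described here. The column names, the deputies and the cutoff value are hypothetical.

```python
import csv
import difflib
import io

# Hypothetical deputy records; column names and people are invented.
DEPUTIES_CSV = """name,surnames,party,legislature
Ana,García López,Partido A,VI
Juan,Martínez Ruiz,Partido B,VI
"""


def load_deputies(csv_text):
    """Read deputy records from CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def match_speaker(ocr_name, deputies, cutoff=0.8):
    """Return the deputy whose full name is closest to ocr_name.

    Returns None when no name reaches the similarity cutoff.
    """
    full_names = {f"{d['name']} {d['surnames']}".lower(): d for d in deputies}
    hits = difflib.get_close_matches(
        ocr_name.lower(), list(full_names), n=1, cutoff=cutoff
    )
    return full_names[hits[0]] if hits else None


if __name__ == "__main__":
    deputies = load_deputies(DEPUTIES_CSV)
    # "Martinez Rulz" imitates OCR damage: a lost accent and i read as l.
    print(match_speaker("Juan Martinez Rulz", deputies))
    print(match_speaker("Presidente", deputies))
```

Restricting the candidate list to deputies of the legislative period in which the document was produced would further reduce ambiguity.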
4. A LOOK AT THE DATA
As we can see in Figure 2, there is a large diﬀerence be-
tween the number of oﬃcial bulletins of the ﬁrst (I) Legisla-
tive Period and the number of oﬃcial bulletins of the other
legislative periods. This is due to a change in the ﬁling sys-
tem. In later legislative periods several items which were
published before as separate documents are grouped.
Figure 3 shows word counts of a number of words per
legislative period. We have looked at the following words:
ETA This is a Spanish terrorist organization, which has
committed a lot of terrorist attacks. The worst terror-
ist attack in Spain was in Madrid in the VIII Leg. Per.
and ETA was an important suspect. This is reﬂected
in the sharp rise in the number of appearances of ETA
during this legislative period.
Crisis Nowadays this is a very common term due to the
Spanish situation: recession, unemployment, minimum
salaries, etc. Different crisis periods can also be
distinguished in the graph.
Palomares This is the place where two planes crashed in
1966 and an atomic bomb fell into the sea without
exploding. It remained an important matter for several
years because people were afraid of bathing in the sea,
and it became a frequent topic in the Congress in the
I Legislative Period.
From the graphs we conclude that the different words appear
less often in the first legislative periods. Most probably this
is because all the documents from the first legislative periods
are scanned, and the conversion of the scanned documents to
text introduces errors in the transcription of the words.
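Counts of this kind can be computed along the lines of the following Python sketch. It assumes that the text of each document is already available together with its legislative period; the sample texts are invented. Terms written entirely in upper case, such as ETA, are matched case-sensitively so that ordinary lower-case words are not counted.

```python
import re
from collections import Counter


def count_terms(documents, terms):
    """Count whole-word occurrences of each term per legislative period.

    documents : iterable of (period, text) pairs
    terms     : list of words to count
    Returns a dict mapping each period to a Counter of term frequencies.
    """
    patterns = {}
    for term in terms:
        # Acronyms are matched exactly; other terms ignore case.
        flags = 0 if term.isupper() else re.IGNORECASE
        patterns[term] = re.compile(rf"\b{re.escape(term)}\b", flags)

    counts = {}
    for period, text in documents:
        per_period = counts.setdefault(period, Counter())
        for term, pattern in patterns.items():
            per_period[term] += len(pattern.findall(text))
    return counts


if __name__ == "__main__":
    # Invented sample texts, for illustration only.
    docs = [
        ("VIII", "ETA y la crisis. La crisis sigue."),
        ("I", "Palomares. La Crisis de Palomares."),
        ("I", "meta"),
    ]
    for period, counter in count_terms(docs, ["ETA", "crisis", "Palomares"]).items():
        print(period, dict(counter))
```

Exact matching of this kind misses words that OCR has corrupted, which is consistent with the lower counts observed for the scanned periods.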
5. CONCLUSIONS AND FUTURE WORK
The objective of our work is to develop a corpus of all of-
ﬁcial documents of the Parliament in Spain since its begin-
nings in 1977. The session diaries and the oﬃcial bulletins
from Congress of Deputies included in the present version
of the corpus are a good starting point.
This corpus may be used in an XML retrieval system [4]
similar to http://theyworkforyou.com.
The Congress of Deputies has a multimedia collection with
all the videos of session diaries segmented at the speaker
level. Once the verbatim proceedings are also segmented at
the speaker level, the two can be linked, resulting in a
powerful video search engine.
For linguistic research into the Spanish language it would
be good to extend the corpus with parliamentary data from
other Spanish speaking countries. Argentina for instance has
clearly structured information at http://www.diputados.
Figure 2: Number of documents of the Spanish corpus from the different legislative periods (x-axis: Const. Leg. through Leg. IX; series: number of documents, number of official bulletins, number of session diaries)
Figure 3: Number of appearances of several special words (x-axis: legislative periods, labelled with the year ranges 77-79, 79-82, 82-86, 86-89, 89-93, 93-96, 96-00, 00-04, 04-08 and Leg. I through Leg. IX)
Maarten Marx acknowledges the ﬁnancial support of the Fu-
ture and Emerging Technologies (FET) programme within
the Seventh Framework Programme for Research of the Eu-
ropean Commission, under the FET-Open grant agreement
FOX, number FP7-ICT-233599.
[1] D. Bennet and A. Harvey. Publishing open government
data (W3C Working Draft 8 September 2009).
[2] X. Dong, A. Halevy, and J. Madhavan. Reference
reconciliation in complex information spaces. In Proc.
SIGMOD, pages 85–96, 2005.
[3] M. Marx and A. Schuth. DutchParl A Corpus of
Parliamentary Documents in Dutch.
[4] B. Sigurbjörnsson. Focused information access using
XML element retrieval. PhD thesis, University of
[5] W3C/IETF URI Planning Interest Group. URIs,
URLs, and URNs: Clarifications and Recommendations
1.0. W3C Note 21 September 2001, 2001.