The text used for the digitization is the punctuated edition of the Zizhi tongjian 資治通鑑 by Sima Guang 司馬光, edited by the 'Editorial committee for punctuating the Zizhi tongjian' and published by Zhonghua shuju in 1956. This edition was chosen because the project aims to capture as much structural and contextual information as possible, for which a modern, punctuated edition seemed preferable. The markup applied to the text falls into three main categories: structural, semantic and documentary.
Structural markup makes the structure of the document explicit. The Zizhi tongjian is divided first into 16 dynasties or kingdoms. Only one of these, the Tang Records 唐記 in 81 juan, which is by far the most voluminous section, has been digitized in the current project. Within this section, there are further subdivisions by ruler, then by era and finally by year. The editors chose to treat the calendar year as the basic unit, although some changes of emperor or era obviously do not fall at the beginning of a year; the structure is therefore not strictly nested. Within a year, the narrative is divided into individual paragraphs. These paragraphs can already be inferred from earlier texts, where whitespace between episodes sets them apart. In general, the editors of the Zhonghua shuju edition followed these earlier divisions, which might in fact go back to Sima Guang himself, and attached numbers to them that run sequentially within a year. There are, however, cases where such a paragraph is further divided, in which case the paragraphs are not numbered. The paragraph is the most basic unit of the narration and can be considered as reporting one event.
To record this structural division, the generic element <div> has been used from the Tang Records down to the eras and the years within one era. The Zhonghua shuju edition does not make further divisions within one year; instead it applies the era name of the longer part of the year to the whole year and records the change in an interlinear note. This electronic edition, by contrast, marks the change where it occurs, so some years are divided into two or more divisions. The change of ruler, which also brought a change of era, has however not been marked with structural elements, since the narrative sometimes proceeds in a way that allows no clean division. The following example shows the beginning of the era Wude 武徳, which is slightly unusual, since it is the only one that starts at the beginning of the year rather than at the point where the era changes.
<div level="1" xml:id="zztj-n01" n="武徳">
  <head n="武徳">武徳</head>
  <div level="2" type="annual" n="0618" xml:id="zztj-n01-y01">
    <head n="武徳-1">
      <date n="武徳-1">武徳元年<note place="inline">(戊寅、六一八)</note>
        <note place="inline"> <date>是年五月</date>受<dyn key="ch174">隋</dyn>禪，始改元。</note>
      </date>
    </head>
    <div level="3" n="1">
      <p xml:id="zztj7-6030-p01">

As can be seen, some additional information has been placed in attributes to make the text easier to process; 'xml:id' assigns a unique identifier to the specific element, which can be used to link to this element.
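The nesting described above can be traversed with any XML parser. The following is a minimal sketch in Python using the standard library, not part of the project's own tooling. The sample is a closed-off and shortened variant of the excerpt: the closing tags, the empty <p> and the omission of the inline notes are assumptions made so that the fragment parses, and the names `outline` and `by_id` are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened variant of the excerpt; closing tags and the empty <p>
# are added here (assumption) so that the fragment is well-formed.
SAMPLE = """<div level="1" xml:id="zztj-n01" n="武徳">
  <head n="武徳">武徳</head>
  <div level="2" type="annual" n="0618" xml:id="zztj-n01-y01">
    <head n="武徳-1"><date n="武徳-1">武徳元年</date></head>
    <div level="3" n="1">
      <p xml:id="zztj7-6030-p01"/>
    </div>
  </div>
</div>"""


def outline(root):
    """Return (level, n, xml:id) for every <div>, in document order."""
    return [(d.get("level"), d.get("n"), d.get(XML_ID)) for d in root.iter("div")]


root = ET.fromstring(SAMPLE)

# Index of all identified elements, usable for resolving links by xml:id.
by_id = {e.get(XML_ID): e for e in root.iter() if e.get(XML_ID)}

for level, n, ident in outline(root):
    print(level, n, ident)
print(by_id["zztj7-6030-p01"].tag)
```

Divisions at level 3 carry no xml:id in the excerpt, so the sketch reports `None` for them; the paragraphs inside them are the addressable units.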
Within a paragraph (marked <p>), sentences (<s>) and phrases within sentences (<seg>) are marked, each with an xml:id attribute that uniquely identifies it. Interlinear notes have been moved to the end of the paragraph and marked with <note>; the original point of attachment is given with the <anchor> element. The content of the note is analyzed further: citations with attribution are marked with a <cit> element containing the attribution (<bibl>, which is not necessarily a proper textual reference) and the attributed quotation (<q>), as in the following example:
<div level="3" n="3">
  <p xml:id="zztj7-6031-p03">
    <s xml:id="zztj7-6031-p03-s1">
      <seg xml:id="zztj7-6031-p03-s1-seg1"> <date type="day" value="0627-02-06">己亥</date>，</seg>
      <seg xml:id="zztj7-6031-p03-s1-seg2">制：</seg>
      <seg xml:id="zztj7-6031-p03-s1-seg3">「自今中書、門下及三品以上入閤議事，</seg>
      <seg xml:id="zztj7-6031-p03-s1-seg4">皆命諫官隨之，</seg>
      <seg xml:id="zztj7-6031-p03-s1-seg5">有失輒諫。」 <anchor type="note" xml:id="ref-zztj7-6031-p03-n.1"/></seg>
    </s>
    <note place="inline" xml:id="zztj7-6031-p03-n.1" target="ref-zztj7-6031-p03-n.1">
      <cit xml:id="zztj7-6031-p03-c.1.1">
        <bibl> <rm key="r06728">程大昌</rm>曰：</bibl>
        <q> <dyn key="ch100">唐</dyn><dm key="dm07294">西<g type="org" rend="內" ref="#u20839">内</g></dm> <dm key="dm01809">太極殿</dm>，即朔望受朝之所，蓋正殿也。(...) </q>
      </cit>
    </note>
  </p>
</div>
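To illustrate how the pairing of <anchor> and <note> might be used, the following minimal Python sketch reattaches each note to the segment that holds its anchor and extracts the attribution and quotation of every <cit>. It assumes, as in the example, that the 'target' attribute of <note> repeats the xml:id of the anchor verbatim. The sample is shortened, with the elided part of the quotation left out and inter-element whitespace trimmed; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened from the example above; the elided "(...)" is simply dropped.
SAMPLE = """<p xml:id="zztj7-6031-p03">
  <s xml:id="zztj7-6031-p03-s1">
    <seg xml:id="zztj7-6031-p03-s1-seg4">皆命諫官隨之，</seg>
    <seg xml:id="zztj7-6031-p03-s1-seg5">有失輒諫。」<anchor type="note" xml:id="ref-zztj7-6031-p03-n.1"/></seg>
  </s>
  <note place="inline" xml:id="zztj7-6031-p03-n.1" target="ref-zztj7-6031-p03-n.1">
    <cit xml:id="zztj7-6031-p03-c.1.1">
      <bibl><rm key="r06728">程大昌</rm>曰：</bibl>
      <q><dyn key="ch100">唐</dyn><dm key="dm07294">西<g type="org" rend="內" ref="#u20839">内</g></dm><dm key="dm01809">太極殿</dm>，即朔望受朝之所，蓋正殿也。</q>
    </cit>
  </note>
</p>"""


def text_of(elem):
    """Concatenate all text below elem; empty string if elem is missing."""
    return "" if elem is None else "".join(elem.itertext()).strip()


def notes_by_segment(p):
    """Map the xml:id of each <seg> holding an <anchor> to (bibl, q) pairs."""
    # First pass: remember which segment each anchor sits in.
    anchor_home = {}
    for seg in p.iter("seg"):
        for anchor in seg.iter("anchor"):
            anchor_home[anchor.get(XML_ID)] = seg.get(XML_ID)
    # Second pass: follow each note's target back to that segment.
    result = {}
    for note in p.iter("note"):
        seg_id = anchor_home.get(note.get("target"))
        if seg_id is None:
            continue  # note without a resolvable anchor
        for cit in note.iter("cit"):
            pair = (text_of(cit.find("bibl")), text_of(cit.find("q")))
            result.setdefault(seg_id, []).append(pair)
    return result


for seg_id, cits in notes_by_segment(ET.fromstring(SAMPLE)).items():
    for bibl, quote in cits:
        print(seg_id, bibl, quote)
```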
The Zhonghua shuju edition notes important differences or emendations in the received text, as seen in the editions used. These differences have not been verified, but they are nevertheless documented. Like the notes, they are moved to the end of the paragraph so as not to disturb the text flow, and are expressed in machine-readable form as follows. The whole expression is wrapped in an <app> element, which contains as its lemma (<lem>) the text as used by the Zhonghua shuju editors (left out if there is no corresponding text) and as its reading (<rdg>) the alternative text as indicated in the modern text; the 'wit' attribute of <rdg> gives the witnesses that bear this variation. The 'resp' attribute on <app> records a statement of responsibility, if one was given. In the example below, the affected text is enclosed in an <appSpan> element whose xml:id is referenced by the 'from' attribute of <app>.
<p><s>(...)
    <seg xml:id="zztj7-6066-p12a-s1-seg5"> <dm key="dm08619">靈州</dm>大都督<rm key="r03311">薛萬徹</rm>爲 <dm key="dm03601">暢武道</dm>行軍總管， </seg>
    <seg xml:id="zztj7-6066-p12a-s1-seg6">衆合十餘萬，</seg>
    <seg xml:id="zztj7-6066-p12a-s1-seg7">皆受 <rm key="r01299">李<appSpan xml:id="beg72id303625">勣</appSpan> </rm>節度，</seg>
    <seg xml:id="zztj7-6066-p12a-s1-seg8">分道出<g type="org" rend="擊" ref="#u25802">撃</g> <ym>突厥</ym>。</seg>
  </s>(...)
  <app from="beg72id303625" resp="章">
    <lem>勣</lem>
    <rdg wit="十二行本 乙十一行本">靖</rdg>
  </app></p>
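A variant list can be derived from this encoding with little effort. The Python sketch below is a minimal illustration, not the project's own tooling: it resolves the 'from' attribute of each <app> to the <appSpan> with that xml:id and collects the lemma and readings. It assumes that 'wit' holds a whitespace-separated list of witnesses, which is how the example reads; the function name `variants` and the shortened sample are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened from the example above to the one segment carrying the variant.
SAMPLE = """<p>
  <s><seg xml:id="zztj7-6066-p12a-s1-seg7">皆受<rm key="r01299">李<appSpan xml:id="beg72id303625">勣</appSpan></rm>節度，</seg></s>
  <app from="beg72id303625" resp="章">
    <lem>勣</lem>
    <rdg wit="十二行本 乙十一行本">靖</rdg>
  </app>
</p>"""


def variants(p):
    """List every <app> with its marked span, lemma, readings and witnesses."""
    spans = {e.get(XML_ID): e for e in p.iter("appSpan")}
    out = []
    for app in p.iter("app"):
        span = spans.get(app.get("from"))
        lem = app.find("lem")  # may be absent when there is no corresponding text
        out.append({
            "span_text": span.text if span is not None else None,
            "lemma": lem.text if lem is not None else None,
            "resp": app.get("resp"),
            # Assumption: witnesses in 'wit' are separated by whitespace.
            "readings": [(r.text, (r.get("wit") or "").split())
                         for r in app.findall("rdg")],
        })
    return out


for v in variants(ET.fromstring(SAMPLE)):
    print(v)
```

Keeping the lemma optional mirrors the rule stated above that <lem> is left out when the base text has nothing corresponding to the reading.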
The characters used in the text do not always reflect the specific shape that is currently used for a given character in Japan. Where several similar characters exist in Unicode, the one most common in Japan (except for modern simplified characters) has been used. For example, the character de 德 is frequently seen as 徳; similarly, 擊 is often written as 撃. While the choice of which one to use is somewhat arbitrary, the form that is more easily entered into computer systems in Japan has been treated as the standard form, and character usage in the text has been normalized to it. However, the character that comes closer to the one used in the printed edition is recorded and can be used for display if desired. The <g> element is used for this purpose, as seen in the example above.
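Switching between the two forms at display time could look like the following minimal Python sketch. It assumes, based on the examples, that the content of <g> is the normalized form and that its 'rend' attribute holds the form closer to the printed edition; the function name `render` and its `original_forms` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET

# One segment from the example above, with inter-element whitespace trimmed.
SAMPLE = '<seg>分道出<g type="org" rend="擊" ref="#u25802">撃</g><ym>突厥</ym>。</seg>'


def render(elem, original_forms=False):
    """Flatten elem to plain text, optionally using the printed forms in <g>."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "g" and original_forms and child.get("rend"):
            # Assumption: 'rend' carries the character of the printed edition.
            parts.append(child.get("rend"))
        else:
            parts.append(render(child, original_forms))
        parts.append(child.tail or "")
    return "".join(parts)


seg = ET.fromstring(SAMPLE)
print(render(seg))                       # normalized forms
print(render(seg, original_forms=True))  # forms closer to the printed edition
```

Because the normalized character remains the element content, searching and indexing can operate on one consistent form while the display layer decides which shape to show.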