user:domenge:python:xml [2017/12/26 12:12]
domenge [Aggregating the POS for a given sense]
user:domenge:python:xml [2017/12/26 18:17] (current version)
domenge [Aggregating the POS for a given sense]
 +====== XML: Extracting the POS Tags from the Congrès TEI File ======
 +The goal is to extract the necessary information from a very large TEI XML file (18.4 MB) that holds all the data of the Basicòt, the French/Languedocian-Gascon basic lexicon. The dictionary is of interest because its terms carry POS (part-of-speech) tags.\\
 +The Basicòt contains 13,610 terms.
 +<note tip>The "xml:lang" attribute did not work when parsing with the lxml library, so it was replaced by "language" throughout the file.</note>
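Renaming the attribute is one workaround. As a hedged alternative, the sketch below reads `xml:lang` without touching the file, by spelling out the namespace that the XML specification binds to the reserved `xml` prefix. The sample entry is a hypothetical fragment shaped like the dictionary entries, not a copy of the real file. The code tries lxml first and falls back to the standard-library `xml.etree.ElementTree`, which accepts the same `{namespace}name` notation.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml installed
    import xml.etree.ElementTree as etree

# Namespace URI bound to the reserved "xml" prefix by the XML specification.
XML_NS = "http://www.w3.org/XML/1998/namespace"
LANG_ATTR = "{%s}lang" % XML_NS

# Hypothetical sample that keeps xml:lang instead of the renamed "language".
sample = b"""<entry n="12">
  <form type="main" xml:lang="fr"><orth>abaisser</orth></form>
  <sense n="I-A-1-a">
    <cit type="translation" xml:lang="oc-gascon">
      <form type="main" xml:lang="oc-gascon"><orth>abaishar</orth></form>
    </cit>
  </sense>
</entry>"""

entry = etree.fromstring(sample)

# get("xml:lang") finds nothing; the attribute key is the {namespace}lang form.
form_lang = entry.find("form").get(LANG_ATTR)
cit_langs = [cit.get(LANG_ATTR) for cit in entry.iter("cit")]

print(form_lang, cit_langs)
```

Whether this is preferable to the renaming depends on how the file is used elsewhere; the rest of this page assumes the renamed `language` attribute.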
 +===== Excerpt from the file basicunif_complet_language.xml =====
 +
 +<code xml>
 +<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
 +<TEI>
 + <teiHeader>
 +...
 + </teiHeader>
 + <text>
 +  <body>
 +...
 +      <entry n="12">
 +        <form type="main" language="fr">
 +          <orth>abaisser</orth>
 +          <gramGrp>
 +            <pos norm="verb">v.</pos>
 +            <gram type="eaglescat" norm="V000050000000"></gram>
 +          </gramGrp>
 +        </form>
 +        <sense n="I-A-1-a">
 +          <cit type="translation" language="oc-gascon">
 +            <form type="main" language="oc-gascon">
 +              <orth>abaishar</orth>
 +              <gramGrp>
 +                <gram type="eaglescat" norm="V000050000000"></gram>
 +              </gramGrp>
 +            </form>
 +          </cit>
 +          <cit type="translation" language="oc-gascon">
 +            <form type="main" language="oc-gascon">
 +              <orth>baishar</orth>
 +              <gramGrp>
 +                <gram type="eaglescat" norm="V000050000000"></gram>
 +              </gramGrp>
 +            </form>
 +          </cit>
 +          <cit type="translation" language="oc-lengadoc">
 +            <form type="main" language="oc-lengadoc">
 +              <orth>abaissar</orth>
 +              <gramGrp>
 +                <gram type="eaglescat" norm="V000050000000"></gram>
 +              </gramGrp>
 +            </form>
 +          </cit>
 +          <cit type="translation" language="oc-lengadoc">
 +            <form type="main" language="oc-lengadoc">
 +              <orth>baissar</orth>
 +              <gramGrp>
 +                <gram type="eaglescat" norm="V000050000000"></gram>
 +              </gramGrp>
 +            </form>
 +          </cit>
 +        </sense>
 +        <sense n="I-A-2-a">
 +          <usg type="hint">diminuer</usg>
 +          <cit type="translation" language="oc">
 +            <form type="main" language="oc">
 +              <orth>amermar</orth>
 +              <gramGrp>
 +                <gram type="eaglescat" norm="V000050000000"></gram>
 +              </gramGrp>
 +            </form>
 +          </cit>
 +        </sense>
 +      </entry>
 +...
 +  </body>
 + </text>
 +</TEI>
 +</code>
 +===== The basic.py program =====
 +The TEI markup is sometimes not properly structured (an expected element can be missing). When that happens the program must not stop: the error is marked in the output so that the row can be discarded later.
 +<code python>
 +#!/usr/bin/env python3
 +# coding: utf8
 +from lxml import etree
 +
 +tree = etree.parse("basicunif_complet_language.xml")
 +
 +# Text mode with an explicit encoding: str objects are written directly,
 +# without calling .encode() on each line (which fails in Python 3 text mode).
 +with open("basicunif_complet_extract.txt", "w", encoding="utf-8") as f:
 +    f.write("id,fr,posfr,sens,dialect,oc,posoc\n")
 +
 +    for entry in tree.xpath("/TEI/text/body/entry"):
 +        entryId = entry.get('n')
 +        orth = entry.find("form/orth").text
 +        gramNorm = entry.find("form/gramGrp/gram").get('norm')
 +        for sense in entry.xpath("sense"):
 +            for cit in sense.xpath("cit"):
 +                # find() returns None when an element is missing, so the next
 +                # access raises AttributeError: flag the field and carry on.
 +                try:
 +                    # the excerpt above stores the dialect in "language"
 +                    langcit = cit.find("form").get('language')
 +                except AttributeError:
 +                    langcit = "langcit error"
 +                # orthcit
 +                orthcit = cit.find("form/orth")
 +                try:
 +                    orthcitText = orthcit.text
 +                except AttributeError:
 +                    orthcitText = "orthcit error"
 +                # gramNormcit
 +                try:
 +                    gramNormCit = cit.find("form/gramGrp/gram").get('norm')
 +                except AttributeError:
 +                    gramNormCit = "gramNormCit error"
 +                s = "\"%s\", \"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\"\n" % (entryId, orth, gramNorm, sense.get('n'), langcit, orthcitText, gramNormCit)
 +                #print(s)
 +                f.write(s)
 +</code>
 +===== Output =====
 +The output is in CSV format so that it can be loaded into an SQL or NoSQL database.
 +<code>
 +id,fr,posfr,sens,dialect,oc,posoc
 +"1", "à","AP1","I-A-1-a","oc","a","AP1"
 +"1", "à","AP1","I-A-2-a","oc","per","AP1"
 +"1", "à","AP1","I-A-2-a","oc-gascon","entà","AP1"
 +"1", "à","AP1","I-A-3-a","oc","a","AP1"
 +"1", "à","AP1","I-A-4-a","oc","de","AP1"
 +"1", "à","AP1","I-A-5-a","oc","de","AP1"
 +"1", "à","AP1","I-A-6-a","oc-gascon","entà","AP1"
 +"1", "à","AP1","I-A-6-a","oc-lengadoc","a","AP1"
 +"2", "à cloche-pied","AV0000","I-A-1-a","oc-gascon","a sautapè","AV0000"
 +"2", "à cloche-pied","AV0000","I-A-1-a","oc-lengadoc","a pè-ranquet","AV0000"
 +"3", "à contrecoeur","AV0000","I-A-1-a","oc","a contracòr","AV0000"
 +"3", "à contrecoeur","AV0000","I-A-1-a","oc-gascon","d’arrèrcòr","AV0000"
 +"3", "à contrecoeur","AV0000","I-A-1-a","oc-lengadoc","de rèirecòr","AV0000"
 +"4", "a fortiori","AV0000","I-A-1-a","oc","a fortiori","AV0000"
 +"5", "à jeun","AV0000","I-A-1-a","oc-gascon","dejun","AJ0110"
 +"5", "à jeun","AV0000","I-A-1-a","oc-lengadoc","dejun","AJ0110"
 +"6", "à peu près","AP1","I-A-1-a","oc-gascon","haut o baish","AP1"
 +"6", "à peu près","AP1","I-A-1-a","oc-gascon","mei o mensh","AP1"
 +"6", "à peu près","AP1","I-A-1-a","oc-lengadoc","a pauc près","AP1"
 +"6", "à peu près","AP1","I-A-1-a","oc-lengadoc","pauc se'n manca","AP1"
 +"7", "à rebours","AV0000","I-A-1-a","oc-gascon","au reboish","AV0000"
 +"7", "à rebours","AV0000","I-A-1-a","oc-lengadoc","a revèrs","AV0000"
 +"8", "à reculons","AV0000","I-A-1-a","oc-gascon","d'arreculas","AV0000"
 +"8", "à reculons","AV0000","I-A-1-a","oc-gascon","de reculas","AV0000"
 +"8", "à reculons","AV0000","I-A-1-a","oc-lengadoc","de reculons","AV0000"
 +"9", "à tâtons","AV0000","I-A-1-a","oc-gascon","a paupas","AV0000"
 +"9", "à tâtons","AV0000","I-A-1-a","oc-lengadoc","a palpas","AV0000"
 +"10", "à-coup","N1110","I-A-1-a","oc-gascon","sacat","N1110"
 +"10", "à-coup","N1110","I-A-1-a","oc-lengadoc","bassacada","N1210"
 +"11", "à-peu-près","N1110","I-A-1-a","oc","a-pauc-près","N1110"
 +"11", "à-peu-près","N1110","I-A-2-a","oc","aproximacion","N1110"
 +"12", "abaisser","V000050000000","I-A-1-a","oc-gascon","abaishar","V000050000000"
 +"12", "abaisser","V000050000000","I-A-1-a","oc-gascon","baishar","V000050000000"
 +"12", "abaisser","V000050000000","I-A-1-a","oc-lengadoc","abaissar","V000050000000"
 +"12", "abaisser","V000050000000","I-A-1-a","oc-lengadoc","baissar","V000050000000"
 +"12", "abaisser","V000050000000","I-A-2-a","oc","amermar","V000050000000"
 +...
 +</code>
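The rows above are assembled by hand with string formatting, which leaves a space after the first comma and would not escape a double quote occurring inside a field. As a hedged alternative, the sketch below lets Python's standard `csv` module do the quoting. The helper name `rows_to_csv` is hypothetical, and the two sample rows are copied from the output above. Note that with `QUOTE_ALL` the header line is quoted as well.

```python
import csv
import io

HEADER = ["id", "fr", "posfr", "sens", "dialect", "oc", "posoc"]


def rows_to_csv(rows):
    """Serialize rows to CSV text, quoting every field (hypothetical helper)."""
    buffer = io.StringIO()
    # QUOTE_ALL quotes each field; embedded quotes are doubled automatically.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    ("12", "abaisser", "V000050000000", "I-A-1-a", "oc-gascon", "abaishar", "V000050000000"),
    ("8", "à reculons", "AV0000", "I-A-1-a", "oc-gascon", "d'arreculas", "AV0000"),
]
csv_text = rows_to_csv(rows)
print(csv_text)
```

To write straight to a file instead of a string, the same writer can wrap a file opened with `newline=""` and `encoding="utf-8"`.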
 +===== Terms in error =====
 +^ id ^ fr ^ posfr ^ sens ^ dialect ^ oc ^ posoc ^
 +| 3034 | de | AT3000 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
 +| 3720 | du | AT3100 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
 +| 13591 | je | PD312011500002 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
 +| 13593 | tu | PD211011500022 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
 +| 13595 | il | PD311011500002 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
 +| 13596 | je | PD312011500002 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
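Because failed lookups are written as literal marker strings, the faulty rows can be set aside before loading. The sketch below shows one possible way to do so; the function name `split_rows` is hypothetical, and the two sample rows are taken from the output and the error table on this page.

```python
# Marker strings written by basic.py when an element is missing.
ERROR_MARKERS = ("langcit error", "orthcit error", "gramNormCit error")


def split_rows(rows):
    """Return (valid, rejected): rows without / with an error marker."""
    valid, rejected = [], []
    for row in rows:
        if any(field in ERROR_MARKERS for field in row):
            rejected.append(row)
        else:
            valid.append(row)
    return valid, rejected


rows = [
    ("12", "abaisser", "V000050000000", "I-A-2-a", "oc", "amermar", "V000050000000"),
    ("3034", "de", "AT3000", "I-A-1-a", "langcit error", "orthcit error", "gramNormCit error"),
]
valid, rejected = split_rows(rows)
print(len(valid), "valid,", len(rejected), "rejected")
```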
 +===== CSV loaded into a MySQL table =====
 +The database is named 'cplo_basic_unificat'.
 +===== Aggregating the POS for a given sense =====
 +<code sql>
 +SELECT idterm, fr, posfr,
 +group_concat( sens ) AS sens,
 +group_concat( dialect ) AS dialectes,
 +group_concat( oc ) AS occitans,
 +group_concat( posoc ) AS posoccitans
 +FROM `cplo_basic_unificat`
 +WHERE dialect != 'langcit error'
 +GROUP BY idterm, sens
 +</code>
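To try the aggregation without a MySQL server, the sketch below replays it on an in-memory SQLite database, used here only as a stand-in: SQLite also provides `group_concat`, but the two engines are not guaranteed to behave identically. The column types are assumptions, the rows are copied from the output and error table above, and `sens` is selected directly rather than concatenated so that no alias shadows the grouping column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: same columns as the CSV, with "id" stored as idterm.
conn.execute(
    "CREATE TABLE cplo_basic_unificat ("
    "idterm INTEGER, fr TEXT, posfr TEXT, sens TEXT, "
    "dialect TEXT, oc TEXT, posoc TEXT)"
)
rows = [
    (12, "abaisser", "V000050000000", "I-A-1-a", "oc-gascon", "abaishar", "V000050000000"),
    (12, "abaisser", "V000050000000", "I-A-1-a", "oc-gascon", "baishar", "V000050000000"),
    (12, "abaisser", "V000050000000", "I-A-1-a", "oc-lengadoc", "abaissar", "V000050000000"),
    (12, "abaisser", "V000050000000", "I-A-1-a", "oc-lengadoc", "baissar", "V000050000000"),
    (12, "abaisser", "V000050000000", "I-A-2-a", "oc", "amermar", "V000050000000"),
    (3034, "de", "AT3000", "I-A-1-a", "langcit error", "orthcit error", "gramNormCit error"),
]
conn.executemany("INSERT INTO cplo_basic_unificat VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

query = """
SELECT idterm, fr, posfr, sens,
       group_concat(dialect) AS dialectes,
       group_concat(oc)      AS occitans,
       group_concat(posoc)   AS posoccitans
FROM cplo_basic_unificat
WHERE dialect != 'langcit error'
GROUP BY idterm, sens
ORDER BY idterm, sens
"""
result = conn.execute(query).fetchall()
for line in result:
    print(line)
```

Each result row holds one French term and one sense, with the dialects, Occitan forms and POS codes of that sense joined by commas; the error row for term 3034 is filtered out by the `WHERE` clause.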
 +===== Export to CSV =====