user:domenge:python:xml, revision of 2017/12/26 18:17 (current version) by domenge, section edited: "Comolacion dels POS per un sens donat". Previous revision: 2017/12/26 12:12.
====== XML: Extracting POS tags from the Congrès TEI file ======
The goal is to extract the necessary information from a very large TEI XML file (18.4 MB) that holds all the data of the French/Languedocian-Gascon basicòt (basic dictionary). The dictionary is of interest because its terms carry POS tags.\\
The basicòt contains 13610 terms.
<note tip>The "XML:lang" attribute did not work when parsing with the lxml library, so the attribute was replaced by "language" throughout the file.</note>
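For readers who would rather keep the standard attribute, here is a minimal sketch of one way to read it. It assumes the file used the lowercase ''xml:lang'' form defined by the XML specification, and it uses the standard-library ''xml.etree.ElementTree'' so that it runs without extra packages; lxml stores attribute names in the same ''{namespace}name'' notation. The ''SAMPLE'' fragment and the names ''XML_NS'' and ''LANG'' are illustrative only and do not come from the original program.

```python
import xml.etree.ElementTree as ET

# The "xml" prefix is bound by the XML specification to this namespace URI.
XML_NS = "http://www.w3.org/XML/1998/namespace"
LANG = "{%s}lang" % XML_NS

# Hypothetical fragment that keeps the standard xml:lang attribute.
SAMPLE = '<form type="main" xml:lang="oc-gascon"><orth>abaishar</orth></form>'

form = ET.fromstring(SAMPLE)

# The prefixed spelling is not the key under which the attribute is stored.
print(form.get("xml:lang"))

# The namespace-qualified key retrieves the value.
print(form.get(LANG))
```

Renaming the attribute to "language", as done on this page, avoids the namespace-qualified key entirely, at the cost of departing from standard TEI markup.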
===== Excerpt from the file basicunif_complet_language.xml =====
<code xml>
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<TEI>
  <teiHeader>
    ...
  </teiHeader>
  <text>
    <body>
      ...
      <entry n="12">
        <form type="main" language="fr">
          <orth>abaisser</orth>
          <gramGrp>
            <pos norm="verb">v.</pos>
            <gram type="eaglescat" norm="V000050000000"></gram>
          </gramGrp>
        </form>
        <sense n="I-A-1-a">
          <cit type="translation" language="oc-gascon">
            <form type="main" language="oc-gascon">
              <orth>abaishar</orth>
              <gramGrp>
                <gram type="eaglescat" norm="V000050000000"></gram>
              </gramGrp>
            </form>
          </cit>
          <cit type="translation" language="oc-gascon">
            <form type="main" language="oc-gascon">
              <orth>baishar</orth>
              <gramGrp>
                <gram type="eaglescat" norm="V000050000000"></gram>
              </gramGrp>
            </form>
          </cit>
          <cit type="translation" language="oc-lengadoc">
            <form type="main" language="oc-lengadoc">
              <orth>abaissar</orth>
              <gramGrp>
                <gram type="eaglescat" norm="V000050000000"></gram>
              </gramGrp>
            </form>
          </cit>
          <cit type="translation" language="oc-lengadoc">
            <form type="main" language="oc-lengadoc">
              <orth>baissar</orth>
              <gramGrp>
                <gram type="eaglescat" norm="V000050000000"></gram>
              </gramGrp>
            </form>
          </cit>
        </sense>
        <sense n="I-A-2-a">
          <usg type="hint">diminuer</usg>
          <cit type="translation" language="oc">
            <form type="main" language="oc">
              <orth>amermar</orth>
              <gramGrp>
                <gram type="eaglescat" norm="V000050000000"></gram>
              </gramGrp>
            </form>
          </cit>
        </sense>
      </entry>
      ...
    </body>
  </text>
</TEI>
</code>
===== Program basic.py =====
The TEI markup is sometimes not properly structured. When an error occurs, the program must not stop: the error is flagged in the output so that the row can be discarded later.
<code python>
#!/usr/bin/env python3
# coding: utf8
from lxml import etree

tree = etree.parse("basicunif_complet_language.xml")

# Text mode with an explicit encoding: in Python 3, write() takes str, not bytes.
with open("basicunif_complet_extract.txt", "w", encoding="utf-8") as f:
    f.write("id,fr,posfr,sens,dialect,oc,posoc\n")

    for entry in tree.xpath("/TEI/text/body/entry"):
        entryId = entry.get('n')
        orth = entry.find("form/orth").text
        gramNorm = entry.find("form/gramGrp/gram").get('norm')
        for sense in entry.xpath("sense"):
            for cit in sense.xpath("cit"):
                # find() returns None for a missing element, so the attribute
                # access raises AttributeError; the field is flagged, not fatal.
                try:
                    # the excerpt stores the dialect in "language" (renamed from xml:lang)
                    langcit = cit.find("form").get('language')
                except AttributeError:
                    langcit = "langcit error"
                # orthcit
                orthcit = cit.find("form/orth")
                try:
                    orthcitText = orthcit.text
                except AttributeError:
                    orthcitText = "orthcit error"
                # gramNormcit
                try:
                    gramNormCit = cit.find("form/gramGrp/gram").get('norm')
                except AttributeError:
                    gramNormCit = "gramNormCit error"
                s = "\"%s\", \"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\"\n" % (entryId, orth, gramNorm, sense.get('n'), langcit, orthcitText, gramNormCit)
                #print(s)
                f.write(s)
</code>
===== Result =====
The output is in CSV format so that it can be imported into an SQL or NoSQL database.
<code>
id,fr,posfr,sens,dialect,oc,posoc
"1", "à","AP1","I-A-1-a","oc","a","AP1"
"1", "à","AP1","I-A-2-a","oc","per","AP1"
"1", "à","AP1","I-A-2-a","oc-gascon","entà","AP1"
"1", "à","AP1","I-A-3-a","oc","a","AP1"
"1", "à","AP1","I-A-4-a","oc","de","AP1"
"1", "à","AP1","I-A-5-a","oc","de","AP1"
"1", "à","AP1","I-A-6-a","oc-gascon","entà","AP1"
"1", "à","AP1","I-A-6-a","oc-lengadoc","a","AP1"
"2", "à cloche-pied","AV0000","I-A-1-a","oc-gascon","a sautapè","AV0000"
"2", "à cloche-pied","AV0000","I-A-1-a","oc-lengadoc","a pè-ranquet","AV0000"
"3", "à contrecoeur","AV0000","I-A-1-a","oc","a contracòr","AV0000"
"3", "à contrecoeur","AV0000","I-A-1-a","oc-gascon","d’arrèrcòr","AV0000"
"3", "à contrecoeur","AV0000","I-A-1-a","oc-lengadoc","de rèirecòr","AV0000"
"4", "a fortiori","AV0000","I-A-1-a","oc","a fortiori","AV0000"
"5", "à jeun","AV0000","I-A-1-a","oc-gascon","dejun","AJ0110"
"5", "à jeun","AV0000","I-A-1-a","oc-lengadoc","dejun","AJ0110"
"6", "à peu près","AP1","I-A-1-a","oc-gascon","haut o baish","AP1"
"6", "à peu près","AP1","I-A-1-a","oc-gascon","mei o mensh","AP1"
"6", "à peu près","AP1","I-A-1-a","oc-lengadoc","a pauc près","AP1"
"6", "à peu près","AP1","I-A-1-a","oc-lengadoc","pauc se'n manca","AP1"
"7", "à rebours","AV0000","I-A-1-a","oc-gascon","au reboish","AV0000"
"7", "à rebours","AV0000","I-A-1-a","oc-lengadoc","a revèrs","AV0000"
"8", "à reculons","AV0000","I-A-1-a","oc-gascon","d'arreculas","AV0000"
"8", "à reculons","AV0000","I-A-1-a","oc-gascon","de reculas","AV0000"
"8", "à reculons","AV0000","I-A-1-a","oc-lengadoc","de reculons","AV0000"
"9", "à tâtons","AV0000","I-A-1-a","oc-gascon","a paupas","AV0000"
"9", "à tâtons","AV0000","I-A-1-a","oc-lengadoc","a palpas","AV0000"
"10", "à-coup","N1110","I-A-1-a","oc-gascon","sacat","N1110"
"10", "à-coup","N1110","I-A-1-a","oc-lengadoc","bassacada","N1210"
"11", "à-peu-près","N1110","I-A-1-a","oc","a-pauc-près","N1110"
"11", "à-peu-près","N1110","I-A-2-a","oc","aproximacion","N1110"
"12", "abaisser","V000050000000","I-A-1-a","oc-gascon","abaishar","V000050000000"
"12", "abaisser","V000050000000","I-A-1-a","oc-gascon","baishar","V000050000000"
"12", "abaisser","V000050000000","I-A-1-a","oc-lengadoc","abaissar","V000050000000"
"12", "abaisser","V000050000000","I-A-1-a","oc-lengadoc","baissar","V000050000000"
"12", "abaisser","V000050000000","I-A-2-a","oc","amermar","V000050000000"
...
</code>
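The rows above carry a space after the first comma, before the opening quote of the second field. As a minimal sketch of how such a file could be read back and the flagged rows set aside, the example below uses the standard ''csv'' module with ''skipinitialspace=True''. The ''SAMPLE'' text holds two rows copied from the output plus one flagged row rebuilt from the values of term 3034; the helper name ''read_rows'' is an assumption made for illustration.

```python
import csv
import io

# Two rows copied from the output above, plus one flagged row rebuilt
# from the values reported for term 3034 (its exact CSV line is assumed).
SAMPLE = '''id,fr,posfr,sens,dialect,oc,posoc
"1", "à","AP1","I-A-1-a","oc","a","AP1"
"12", "abaisser","V000050000000","I-A-2-a","oc","amermar","V000050000000"
"3034", "de","AT3000","I-A-1-a","langcit error","orthcit error","gramNormCit error"
'''


def read_rows(text):
    """Parse the extract; skipinitialspace handles the space before a quote."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return list(reader)


rows = read_rows(SAMPLE)

# Rows flagged by the extraction program end with " error" in the dialect column.
valid = [r for r in rows if not r["dialect"].endswith(" error")]
errors = [r for r in rows if r["dialect"].endswith(" error")]

print(len(valid), "valid rows,", len(errors), "flagged rows")
```

Without ''skipinitialspace=True'', the second field would keep its leading space and its quotation marks as literal characters.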
===== Terms in error =====
^ id ^ fr ^ posfr ^ sens ^ dialect ^ oc ^ posoc ^
| 3034 | de | AT3000 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
| 3720 | du | AT3100 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
| 13591 | je | PD312011500002 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
| 13593 | tu | PD211011500022 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
| 13595 | il | PD311011500002 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
| 13596 | je | PD312011500002 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
===== CSV loaded into a MySQL table =====
The database is named 'cplo_basic_unificat'.
===== Aggregating POS tags for a given sense =====
<code sql>
SELECT idterm, fr, posfr,
       group_concat( sens ) AS sens,
       group_concat( dialect ) AS dialectes,
       group_concat( oc ) AS occitans,
       group_concat( posoc ) AS posoccitans
FROM `cplo_basic_unificat`
WHERE dialect != 'langcit error'
GROUP BY idterm, sens
</code>
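To see what this grouping produces, the sketch below runs an equivalent query on the five rows of entry 12. It makes several assumptions: an in-memory SQLite database (standard-library ''sqlite3'') stands in for MySQL, since both provide ''group_concat''; the table layout is guessed from the CSV header with ''id'' renamed to ''idterm'' as in the query; and ''sens'' is selected directly instead of being concatenated, since it is already a grouping column.

```python
import sqlite3

POS = "V000050000000"

# Rows for entry 12, copied from the CSV output.
ROWS = [
    (12, "abaisser", POS, "I-A-1-a", "oc-gascon", "abaishar", POS),
    (12, "abaisser", POS, "I-A-1-a", "oc-gascon", "baishar", POS),
    (12, "abaisser", POS, "I-A-1-a", "oc-lengadoc", "abaissar", POS),
    (12, "abaisser", POS, "I-A-1-a", "oc-lengadoc", "baissar", POS),
    (12, "abaisser", POS, "I-A-2-a", "oc", "amermar", POS),
]

# SQLite in memory is a stand-in for the MySQL table (assumed layout).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cplo_basic_unificat ("
    "idterm INTEGER, fr TEXT, posfr TEXT, sens TEXT, "
    "dialect TEXT, oc TEXT, posoc TEXT)"
)
conn.executemany(
    "INSERT INTO cplo_basic_unificat VALUES (?, ?, ?, ?, ?, ?, ?)", ROWS
)

QUERY = """
SELECT idterm, fr, posfr, sens,
       group_concat(dialect) AS dialectes,
       group_concat(oc)      AS occitans,
       group_concat(posoc)   AS posoccitans
FROM cplo_basic_unificat
WHERE dialect != 'langcit error'
GROUP BY idterm, sens
ORDER BY sens
"""

# One result row per (idterm, sens) pair, translations joined by commas.
result = conn.execute(QUERY).fetchall()
for row in result:
    print(row)
conn.close()
```

Each result row gathers, for one sense of one French term, the comma-separated dialects, Occitan translations and POS codes. The order of the values inside each concatenated field is not guaranteed by SQLite.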
===== Export to CSV =====