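How to get the annotated text from DictionaryAnnotator

I created a dictionary with UIMA's DictionaryCreator, and I want to annotate a piece of text using DictionaryAnnotator and that dictionary, but I cannot figure out how to get the annotated text. If you know how, please tell me. Any help would be appreciated. The code, the dictionary file, and the descriptor are shown below. P.S. I am new to Apache UIMA.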

XMLInputSource xml_in = new XMLInputSource("DictionaryAnnotatorDescriptor.xml");
ResourceSpecifier specifier = UIMAFramework.getXMLParser().parseResourceSpecifier(xml_in);
AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(specifier);
JCas jCas = ae.newJCas();
String inputText = "Mark and John went down the rabbit hole to meet a wise owl and have curry with the owl.";
jCas.setDocumentText(inputText);
printResults(jCas);

public static void printResults(JCas jcas) {
    FSIndex<Annotation> index = jcas.getAnnotationIndex();
    for (Iterator<Annotation> it = index.iterator(); it.hasNext(); ) {
        Annotation annotation = it.next();
        List<Feature> features;
        features = annotation.getType().getFeatures();
        List<String> fasl = new ArrayList<String>();
        for (Feature feature : features) {
            try {
                String name = feature.getShortName();
                System.out.println(feature.getName());
                String value = annotation.getStringValue(feature);
                fasl.add(name + "=\"" + value + "\"");
                System.out.println(value);
            } catch (Exception e) {
                continue;
            }
        }
    }
}

my_dictionary.xml

<?xml version="1.0" encoding="UTF-8"?>
<dictionary xmlns="http://incubator.apache.org/uima" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="dictionary.xsd">
  <typeCollection>
    <dictionaryMetaData caseNormalization="true" multiWordEntries="true" multiWordSeparator=" "/>
    <languageId>en</languageId>
    <typeDescription>
      <typeName>org.apache.uima.DictionaryEntry</typeName>
    </typeDescription>
    <entries>
      <entry><key>Mark</key></entry>
      <entry><key>John</key></entry>
      <entry><key>Rabbit</key></entry>
      <entry><key>Owl</key></entry>
      <entry><key>Curry</key></entry>
      <entry><key>ATH-MX50</key></entry>
      <entry><key>CC234</key></entry>
    </entries>
  </typeCollection>
</dictionary>

DictionaryAnnotatorDescriptor.xml

<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
    <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
    <primitive>true</primitive>
    <annotatorImplementationName>org.apache.uima.annotator.dict_annot.impl.DictionaryAnnotator</annotatorImplementationName>
    <analysisEngineMetaData>
        <name>GeneDictionaryAnnotator</name>
        <description></description>
        <version>0.1</version>
        <vendor></vendor>
        <configurationParameters>
            <configurationParameter>
                <name>DictionaryFiles</name>
                <description>list of dictionary files to configure the annotator</description>
                <type>String</type>
                <multiValued>true</multiValued>
                <mandatory>true</mandatory>
            </configurationParameter>
            <configurationParameter>
                <name>InputMatchType</name>
                <description></description>
                <type>String</type>
                <multiValued>false</multiValued>
                <mandatory>true</mandatory>
            </configurationParameter>
            <configurationParameter>
                <name>InputMatchFeaturePath</name>
                <description></description>
                <type>String</type>
                <multiValued>false</multiValued>
                <mandatory>false</mandatory>
            </configurationParameter>
            <configurationParameter>
                <name>InputMatchFilterFeaturePath</name>
                <description></description>
                <type>String</type>
                <multiValued>false</multiValued>
                <mandatory>false</mandatory>
            </configurationParameter>
            <configurationParameter>
                <name>FilterConditionOperator</name>
                <description></description>
                <type>String</type>
                <multiValued>false</multiValued>
                <mandatory>false</mandatory>
            </configurationParameter>
            <configurationParameter>
                <name>FilterConditionValue</name>
                <description></description>
                <type>String</type>
                <multiValued>false</multiValued>
                <mandatory>false</mandatory>
            </configurationParameter>
        </configurationParameters>
        <configurationParameterSettings>
            <nameValuePair>
                <name>DictionaryFiles</name>
                <value>
                    <array>
                        <string>src/main/resources/my_dictionary.xml</string>
                    </array>
                </value>
            </nameValuePair>
            <nameValuePair>
                <name>InputMatchType</name>
                <value>
                    <string>org.apache.uima.TokenAnnotation</string>
                </value>
            </nameValuePair>
        </configurationParameterSettings>
        <typeSystemDescription>
            <types>
                <typeDescription>
                    <name>org.apache.uima.DictionaryEntry</name>
                    <description></description>
                    <supertypeName>uima.tcas.Annotation</supertypeName>
                </typeDescription>
                <typeDescription>
                    <name>org.apache.uima.TokenAnnotation</name>
                    <description>Single token annotation</description>
                    <supertypeName>uima.tcas.Annotation</supertypeName>
                    <features>
                        <featureDescription>
                            <name>tokenType</name>
                            <description>token type</description>
                            <rangeTypeName>uima.cas.String</rangeTypeName>
                        </featureDescription>
                    </features>
                </typeDescription>
                <typeDescription>
                    <name>example.Name</name>
                    <description>A proper name.</description>
                    <supertypeName>uima.tcas.Annotation</supertypeName>
                </typeDescription>
            </types>
        </typeSystemDescription>
        <capabilities>
            <capability>
                <inputs/>
                <outputs>
                    <type>example.Name</type>
                </outputs>
                <languagesSupported/>
            </capability>
        </capabilities>
        <operationalProperties>
            <modifiesCas>true</modifiesCas>
            <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
            <outputsNewCASes>false</outputsNewCASes>
        </operationalProperties>
    </analysisEngineMetaData>
</analysisEngineDescription>

Answer:

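Alternatively, you can use Apache Ruta, either through the Workbench (recommended for getting started) or from Java code.

For the latter, I created an example project at https://github.com/renaud/annotate_ruta_example. The main parts are:

A list of names (a plain text file) in src/main/resources/ruta/resources/names.txt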

Mark
John
Rabbit
Owl
Curry
ATH-MX50
CC234

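A Ruta script in src/main/resources/ruta/scripts/Example.ruta

PACKAGE example.annotate;               // optional package definition
WORDLIST MyNames = 'names.txt';         // declare the dictionary location
DECLARE Name;                           // declare an annotation type
Document{-> MARKFAST(Name, MyNames)};   // annotate the document

And some Java boilerplate code to launch the annotator: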

JCas jCas = JCasFactory.createJCas();

// the sample text to annotate
jCas.setDocumentText("Mark wants to buy CC234.");

// configure the engine with the scripts and resources
AnalysisEngine rutaEngine = AnalysisEngineFactory.createEngine(
    RutaEngine.class, //
    RutaEngine.PARAM_RESOURCE_PATHS,
    "src/main/resources/ruta/resources",//
    RutaEngine.PARAM_SCRIPT_PATHS,
    "src/main/resources/ruta/scripts",
    RutaEngine.PARAM_MAIN_SCRIPT, "Example");

// run the script. Instead of a jCas, you could also provide a UIMA collection reader to process many documents
SimplePipeline.runPipeline(jCas, rutaEngine);

// a simple select to print the matched Names
for (Name name : JCasUtil.select(jCas, Name.class)) {
    System.out.println(name.getCoveredText());
}
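To make it clearer what getCoveredText() gives you: a UIMA annotation stores a begin and an end character offset into the document text, and the covered text is the substring between those offsets. The following is a minimal stand-alone sketch of that idea in plain Java. It is not UIMA or Ruta code, and it does not reproduce how MARKFAST matches tokens internally; the class WordListSketch and its methods findSpans, isBoundary and coveredTexts are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only: mimics "annotation = (begin, end) offsets"
// with plain strings. It is NOT the UIMA/Ruta implementation.
public class WordListSketch {

    // Returns {begin, end} offset pairs (end exclusive) for every whole-word,
    // case-sensitive occurrence of a word-list entry, sorted by begin offset.
    public static List<int[]> findSpans(String text, List<String> names) {
        List<int[]> spans = new ArrayList<>();
        for (String name : names) {
            if (name.isEmpty()) {
                continue; // an empty entry can never be a meaningful match
            }
            int from = 0;
            int begin;
            while ((begin = text.indexOf(name, from)) >= 0) {
                int end = begin + name.length();
                // accept the hit only if it is not embedded in a longer word
                if (isBoundary(text, begin - 1) && isBoundary(text, end)) {
                    spans.add(new int[] {begin, end});
                }
                from = begin + 1;
            }
        }
        spans.sort(Comparator.comparingInt((int[] span) -> span[0]));
        return spans;
    }

    // True if the index is outside the text or points at a non-alphanumeric character.
    public static boolean isBoundary(String text, int index) {
        return index < 0 || index >= text.length()
                || !Character.isLetterOrDigit(text.charAt(index));
    }

    // The "covered text" of a span is the substring between its offsets.
    public static List<String> coveredTexts(String text, List<String> names) {
        List<String> result = new ArrayList<>();
        for (int[] span : findSpans(text, names)) {
            result.add(text.substring(span[0], span[1]));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Mark wants to buy CC234.";
        List<String> names =
                List.of("Mark", "John", "Rabbit", "Owl", "Curry", "ATH-MX50", "CC234");
        for (String covered : coveredTexts(text, names)) {
            System.out.println(covered); // prints "Mark", then "CC234"
        }
    }
}
```

In the real pipeline you do not compute these offsets yourself: the engine adds the annotations to the CAS when it runs, and getCoveredText() reads the substring back for each one.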

There are also some UIMA type (annotation) definitions; see:

src/main/resources/desc/type/ExampleTypes.xml
src/main/resources/META-INF/org.apache.uima.fit/types.txt
src/main/java/example/annotate

How to test

git clone https://github.com/renaud/annotate_ruta_example.git
cd annotate_ruta_example
mvn clean install
mvn exec:java -Dexec.mainClass="example.Annotate"
