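I created a dictionary with UIMA's DictionaryCreator, and I want to annotate a piece of text using the DictionaryAnnotator and that dictionary, but I can't figure out how to get the annotated text. If you know, please tell me. Any help would be appreciated. The code, the dictionary file and the descriptor are shown below. P.S. I'm new to Apache UIMA.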

    XMLInputSource xml_in = new XMLInputSource("DictionaryAnnotatorDescriptor.xml");
    ResourceSpecifier specifier = UIMAFramework.getXMLParser().parseResourceSpecifier(xml_in);
    AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(specifier);
    JCas jCas = ae.newJCas();
    String inputText = "Mark and John went down the rabbit hole to meet a wise owl and have curry with the owl.";
    jCas.setDocumentText(inputText);
    printResults(jCas);

    public static void printResults(JCas jcas) {
        FSIndex<Annotation> index = jcas.getAnnotationIndex();
        for (Iterator<Annotation> it = index.iterator(); it.hasNext(); ) {
            Annotation annotation = it.next();
            List<Feature> features;
            features = annotation.getType().getFeatures();
            List<String> fasl = new ArrayList<String>();
            for (Feature feature : features) {
                try {
                    String name = feature.getShortName();
                    System.out.println(feature.getName());
                    String value = annotation.getStringValue(feature);
                    fasl.add(name + "=\"" + value + "\"");
                    System.out.println(value);
                } catch (Exception e) {
                    continue;
                }
            }
        }
    }

my_dictionary.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <dictionary xmlns="http://incubator.apache.org/uima"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="dictionary.xsd">
      <typeCollection>
        <dictionaryMetaData caseNormalization="true" multiWordEntries="true" multiWordSeparator=" "/>
        <languageId>en</languageId>
        <typeDescription>
          <typeName>org.apache.uima.DictionaryEntry</typeName>
        </typeDescription>
        <entries>
          <entry><key>Mark</key></entry>
          <entry><key>John</key></entry>
          <entry><key>Rabbit</key></entry>
          <entry><key>Owl</key></entry>
          <entry><key>Curry</key></entry>
          <entry><key>ATH-MX50</key></entry>
          <entry><key>CC234</key></entry>
        </entries>
      </typeCollection>
    </dictionary>

DictionaryAnnotatorDescriptor.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
      <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
      <primitive>true</primitive>
      <annotatorImplementationName>org.apache.uima.annotator.dict_annot.impl.DictionaryAnnotator</annotatorImplementationName>
      <analysisEngineMetaData>
        <name>GeneDictionaryAnnotator</name>
        <description></description>
        <version>0.1</version>
        <vendor></vendor>
        <configurationParameters>
          <configurationParameter>
            <name>DictionaryFiles</name>
            <description>list of dictionary files to configure the annotator</description>
            <type>String</type>
            <multiValued>true</multiValued>
            <mandatory>true</mandatory>
          </configurationParameter>
          <configurationParameter>
            <name>InputMatchType</name>
            <description></description>
            <type>String</type>
            <multiValued>false</multiValued>
            <mandatory>true</mandatory>
          </configurationParameter>
          <configurationParameter>
            <name>InputMatchFeaturePath</name>
            <description></description>
            <type>String</type>
            <multiValued>false</multiValued>
            <mandatory>false</mandatory>
          </configurationParameter>
          <configurationParameter>
            <name>InputMatchFilterFeaturePath</name>
            <description></description>
            <type>String</type>
            <multiValued>false</multiValued>
            <mandatory>false</mandatory>
          </configurationParameter>
          <configurationParameter>
            <name>FilterConditionOperator</name>
            <description></description>
            <type>String</type>
            <multiValued>false</multiValued>
            <mandatory>false</mandatory>
          </configurationParameter>
          <configurationParameter>
            <name>FilterConditionValue</name>
            <description></description>
            <type>String</type>
            <multiValued>false</multiValued>
            <mandatory>false</mandatory>
          </configurationParameter>
        </configurationParameters>
        <configurationParameterSettings>
          <nameValuePair>
            <name>DictionaryFiles</name>
            <value>
              <array>
                <string>src/main/resources/my_dictionary.xml</string>
              </array>
            </value>
          </nameValuePair>
          <nameValuePair>
            <name>InputMatchType</name>
            <value>
              <string>org.apache.uima.TokenAnnotation</string>
            </value>
          </nameValuePair>
        </configurationParameterSettings>
        <typeSystemDescription>
          <types>
            <typeDescription>
              <name>org.apache.uima.DictionaryEntry</name>
              <description></description>
              <supertypeName>uima.tcas.Annotation</supertypeName>
            </typeDescription>
            <typeDescription>
              <name>org.apache.uima.TokenAnnotation</name>
              <description>Single token annotation</description>
              <supertypeName>uima.tcas.Annotation</supertypeName>
              <features>
                <featureDescription>
                  <name>tokenType</name>
                  <description>token type</description>
                  <rangeTypeName>uima.cas.String</rangeTypeName>
                </featureDescription>
              </features>
            </typeDescription>
            <typeDescription>
              <name>example.Name</name>
              <description>A proper name.</description>
              <supertypeName>uima.tcas.Annotation</supertypeName>
            </typeDescription>
          </types>
        </typeSystemDescription>
        <capabilities>
          <capability>
            <inputs/>
            <outputs>
              <type>example.Name</type>
            </outputs>
            <languagesSupported/>
          </capability>
        </capabilities>
        <operationalProperties>
          <modifiesCas>true</modifiesCas>
          <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
          <outputsNewCASes>false</outputsNewCASes>
        </operationalProperties>
      </analysisEngineMetaData>
    </analysisEngineDescription>

Answer:
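Alternatively, you can use Apache Ruta, either through the workbench (recommended for getting started) or from Java code.
For the latter, I created an example project at https://github.com/renaud/annotate_ruta_example. The main parts are: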
A list of names (a plain text file) in src/main/resources/ruta/resources/names.txt:

    Mark
    John
    Rabbit
    Owl
    Curry
    ATH-MX50
    CC234

A Ruta script in src/main/resources/ruta/scripts/Example.ruta:

    PACKAGE example.annotate;              // optional package definition
    WORDLIST MyNames = 'names.txt';        // declare the dictionary location
    DECLARE Name;                          // declare an annotation
    Document{-> MARKFAST(Name, MyNames)};  // annotate the document

And some Java boilerplate code to launch the annotator:

    JCas jCas = JCasFactory.createJCas();

    // the sample text to annotate
    jCas.setDocumentText("Mark wants to buy CC234.");

    // configure the engine with scripts and resources
    AnalysisEngine rutaEngine = AnalysisEngineFactory.createEngine(
        RutaEngine.class, //
        RutaEngine.PARAM_RESOURCE_PATHS,
        "src/main/resources/ruta/resources", //
        RutaEngine.PARAM_SCRIPT_PATHS,
        "src/main/resources/ruta/scripts",
        RutaEngine.PARAM_MAIN_SCRIPT, "Example");

    // run the script. Instead of a jCas, you could also provide a UIMA
    // collection reader to process many documents
    SimplePipeline.runPipeline(jCas, rutaEngine);

    // a simple select to print the matched Names
    for (Name name : JCasUtil.select(jCas, Name.class)) {
        System.out.println(name.getCoveredText());
    }

There are also some UIMA type (annotation) definitions; see src/main/resources/desc/type/ExampleTypes.xml, src/main/resources/META-INF/org.apache.uima.fit/types.txt and src/main/java/example/annotate.
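
To make it concrete what a word-list rule like the one above produces, here is a minimal plain-Java sketch with no UIMA or Ruta dependency. It is only an illustration and not how Ruta implements MARKFAST. The class name WordListSketch, its Match type and the findMatches method are hypothetical, and the sketch assumes case-sensitive, whole-word matching. Each match carries begin and end offsets plus the covered text, which is the same information a UIMA annotation gives you through getBegin(), getEnd() and getCoveredText().

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: not part of UIMA or Ruta.
final class WordListSketch {

    /** A matched span: character offsets plus the covered text. */
    static final class Match {
        final int begin;
        final int end;
        final String text;

        Match(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }

        @Override
        public String toString() {
            return begin + "-" + end + " " + text;
        }
    }

    /**
     * Finds every occurrence of each word-list entry in the text.
     * Assumption of this sketch: matching is case-sensitive and an entry
     * only matches when it is not glued to a neighbouring letter or digit.
     */
    static List<Match> findMatches(String text, List<String> entries) {
        List<Match> matches = new ArrayList<>();
        for (String entry : entries) {
            // Pattern.quote keeps characters such as '-' in "ATH-MX50" literal.
            Pattern p = Pattern.compile(
                "(?<![\\p{L}\\p{N}])" + Pattern.quote(entry) + "(?![\\p{L}\\p{N}])");
            Matcher m = p.matcher(text);
            while (m.find()) {
                matches.add(new Match(m.start(), m.end(), m.group()));
            }
        }
        // Report matches in document order, like an annotation index would.
        matches.sort(Comparator.comparingInt((Match x) -> x.begin));
        return matches;
    }

    public static void main(String[] args) {
        List<String> names = List.of(
            "Mark", "John", "Rabbit", "Owl", "Curry", "ATH-MX50", "CC234");
        for (Match match : findMatches("Mark wants to buy CC234.", names)) {
            System.out.println(match);
        }
    }
}
```

For the sample sentence used above, this sketch prints `0-4 Mark` and `18-23 CC234`.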
How to test

    git clone https://github.com/renaud/annotate_ruta_example.git
    cd annotate_ruta_example
    mvn clean install
    mvn exec:java -Dexec.mainClass="example.Annotate"