使用ConvertFrom-String模板处理不规则数据集的问题

我有一个数据集,试图将其规范化为PsCustomObject。我尝试使用ConvertFrom-String的机器学习模板功能,取得了部分成功。一个问题是,我能找到的所有示例都是结构相同的数据库,而我的数据集并非如此。

我相信一个高手可以直接从原始数据中处理,但我已经对其进行了一些处理才达到目前的进展。

原始样本数据:

IDE00001-ENG99061-Production mode-Access controlIDE00001-ENG115730-Production mode-AussenbeleuchtungIDE00001-ENG112304-Production mode-HeckwischerIDE00001-ENG98647-Production mode-Interior lightingIDE00001-ENG115729-Production mode-ScheinwerferreinigungIDE00001-ENG115731-Production mode-Virtuel_pedalIDE00002-Transport modeIDE00820-Activating and deactivating all development messagesIDE01550-Service positionIDE02152-Characteristics in production modeIDE02269-MAS04382-Acknowledgement signals-Optical feedback during lockingIDE02332-Deactivate production modeIDE02488-DWA Interior monitoringIDE02711-ENG116690-Rear Window Wiper-Automatisches Heckwischen

使用以下脚本:

$lines = $testText.Split("`n") #$testText是上面数据包装在一个here-string中$NewLines = @()foreach($line in $lines){    [regex]$regex = '-'    $HyphenCount = $regex.Matches($line).count    #$HyphenCount    switch ($HyphenCount)    {        1{            $newLines += $line -replace "-",","         }        2{            $split = $line.Split("-",2)            $newlines += $split -join ","         }        3{            if($line.Contains("mode-"))            {                #$line                $split = $line.Split("-",4)                $newlines += $split -join ","            }            else            {                $split = $line.Split("-",3)                $newlines += $split -join ","            }         }                4{           $split = $line.Split("-",3) #假设第四个连字符是描述的一部分           $newlines += $split -join ","         }        5{           $split = $line.Split("-",4)            $newlines += $split -join ","         }    }}

处理后的数据集:

我已经将原始数据调整为如下形式:

IDE00001,ENG99061,Production mode,Access controlIDE00001,ENG115730,Production mode,AussenbeleuchtungIDE00001,ENG112304,Production mode,HeckwischerIDE00001,ENG98647,Production mode,Interior lightingIDE00001,ENG115729,Production mode,ScheinwerferreinigungIDE00001,ENG115731,Production mode,Virtuel_pedalIDE00002,Transport modeIDE00820,Activating and deactivating all development messagesIDE01550,Service positionIDE02152,Characteristics in production modeIDE02269,MAS04382,Acknowledgement signals-Optical feedback during lockingIDE02332,Deactivate production modeIDE02488,DWA Interior monitoringIDE02711,ENG116690,Rear Window Wiper-Automatisches HeckwischenIDE99999,Test-two hyphensIDE99999,ENG123456,Test-four-HyphensIDE99999,ENG123456,Production mode,test-five-hyphens

将上述数据通过以下模板处理后,我已经接近所需的结果,但仍有一些问题:

$template = @'{object*:{ide:IDE00001},{code?:ENG99061},{mode?:Production mode},{description?:Access control}}{object*:{ide:IDE00001},{code?:ENG115730},{mode?:Dev mode},{description?:Aussenbeleuchtung}}{object*:{ide:IDE00001},{code?:ENG115731},{mode?:Production mode},{description?:Virtuel_pedal}}{object*:{ide:IDE02711},{code?:ENG116690},{description?:Rear Window Wiper-Automatisches Heckwischen}}{object*:{ide:IDE00820},{description?:{!mode?:{!code?:Activating and deactivating all development messages}}}}{object*:{ide:IDE01550},{description?:{!mode?:{!code?:Service position}}}}{object*:{ide:IDE02488},{description?:{!mode?:{!code?:DWA Interior monitoring}}}}{object*:{ide:IDE00002},{mode?:Transport mode}}'@$testText | ConvertFrom-String -TemplateContent $template -OutVariable out | Out-Null$out.object

目前的结果:

结果看起来像这样:

ide      code      mode            description                                            ---      ----      ----            -----------                                            IDE00001 ENG99061  Production mode Access control                                         IDE00001 ENG115730 Production mode Aussenbeleuchtung                                      IDE00001 ENG112304 Production mode Heckwischer                                            IDE00001 ENG98647  Production mode Interior lighting                                      IDE00001 ENG115729 Production mode Scheinwerferreinigung                                  IDE00001 ENG115731 Production mode Virtuel_pedal                                          IDE00002           Transport mode  Transport mode                                         IDE00820                           Activating and deactivating all development messages   IDE01550                           Service position                                       IDE02152           production mode Characteristics in production mode                     IDE02269 MAS04382                  Acknowledgement signals-Optical feedback during lockingIDE02332           production mode Deactivate production mode                             IDE02488                           DWA Interior monitoring                                IDE02711 ENG116690                 Rear Window Wiper-Automatisches Heckwischen            IDE99999                           Test-two hyphens                                       IDE99999 ENG123456                 Test-four-Hyphens    

存在的问题区域:

IDE00002           Transport mode  Transport modeIDE02152           production mode Characteristics in production modeIDE02332           production mode Deactivate production mode 
  1. Transport mode不应该出现在description列中。
  2. production mode不应该出现在mode列中。它不知为何从description中提取了这个信息。

我实在是搞不明白。所以,如果有人有任何想法…


回答:

作为替代方案,如果你的输入数据足够系统化,你可以使用正则表达式来解析:

$inputText = @"IDE00001-ENG99061-Production mode-Access controlIDE00001-ENG115730-Production mode-AussenbeleuchtungIDE00001-ENG112304-Production mode-HeckwischerIDE00001-ENG98647-Production mode-Interior lightingIDE00001-ENG115729-Production mode-ScheinwerferreinigungIDE00001-ENG115731-Production mode-Virtuel_pedalIDE00002-Transport modeIDE00820-Activating and deactivating all development messagesIDE01550-Service positionIDE02152-Characteristics in production modeIDE02269-MAS04382-Acknowledgement signals-Optical feedback during lockingIDE02332-Deactivate production modeIDE02488-DWA Interior monitoringIDE02711-ENG116690-Rear Window Wiper-Automatisches Heckwischen"@ -split "`n"$pattern = '^((?<ide>[IDE0-9]+)-)((?<code>[A-Z0-9]+)-)?((?<mode>Production mode|Transport mode)-?)?(?<description>.*?)$'foreach ($line in $inputText){    $isMatch = $line -match $pattern    if (-not $isMatch)    {        Write-Warning "Cannot parse expression: $line"        continue    }    New-Object psobject -Property ([ordered]@{        'Ide' = $Matches.ide        'Code' = $Matches.code        'Mode' = $Matches.mode        'Description' = $Matches.description    })}

你说你的数据结构不一致。也许你的正则表达式需要比上面给出的复杂得多。或者,如果你能识别出所有可能出现的不同结构,你可以多次运行解析,使用不同的正则表达式。

Related Posts

L1-L2正则化的不同系数

我想对网络的权重同时应用L1和L2正则化。然而,我找不…

使用scikit-learn的无监督方法将列表分类成不同组别,有没有办法?

我有一系列实例,每个实例都有一份列表,代表它所遵循的不…

f1_score metric in lightgbm

我想使用自定义指标f1_score来训练一个lgb模型…

通过相关系数矩阵进行特征选择

我在测试不同的算法时,如逻辑回归、高斯朴素贝叶斯、随机…

可以将机器学习库用于流式输入和输出吗?

已关闭。此问题需要更加聚焦。目前不接受回答。 想要改进…

在TensorFlow中,queue.dequeue_up_to()方法的用途是什么?

我对这个方法感到非常困惑,特别是当我发现这个令人费解的…

发表回复

您的邮箱地址不会被公开。 必填项已用 * 标注