I need to parse a set of values in Python and one-hot encode them for feature engineering. Below is a sample value from the 'amenities' column of my feature set.
x = {"Wireless Internet","Air conditioning",Kitchen,Heating,"Family/kid friendly",Essentials,"Hair dryer",Iron,"translation missing: en.hosting_amenity_50"}
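The problem is that the value is wrapped in curly braces '{}', and some values that should be double-quoted are not (see Kitchen and Heating in the example above). If I could convert the above to a string, I know how to strip the braces and split it into a list.
I need to convert the above into a list of items in which the values that are not double-quoted also become strings.
Answer:
The input data looks corrupted. Still, the simplest approach is to remove the double quotes and then split on commas (I have skipped the curly braces here, since they are just as easy to remove):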
s = '"Wireless Internet","Air conditioning",Kitchen,Heating,"Family/kid friendly",Essentials,"Hair dryer",Iron,"translation missing: en.hosting_amenity_50"'
print(s.replace('"','').split(","))
Result:
['Wireless Internet', 'Air conditioning', 'Kitchen', 'Heating', 'Family/kid friendly', 'Essentials', 'Hair dryer', 'Iron', 'translation missing: en.hosting_amenity_50']
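Of course, if the data itself contains commas, you are in trouble: with the quotes missing, there is no way to tell a comma inside a field from a delimiter comma. (If every value were quoted, parsing with ast.literal_eval would be trivial.)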
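The limitation above applies once the quotes have been stripped. If, as the sample hints, every value that contains a comma is still double-quoted (an assumption; nothing in the question confirms it), the standard csv module can honor the quotes instead of discarding them. The sketch below is one possible way to do that; the helper name parse_amenities and the value "Cable TV, premium" are made up for illustration.

```python
import csv

def parse_amenities(raw):
    """Split a brace-wrapped, partially quoted list into plain strings.

    Assumes any value containing a comma is double-quoted.
    """
    # Drop surrounding whitespace and the outer curly braces
    inner = raw.strip().strip("{}")
    if not inner:
        return []
    # csv.reader treats "..." as one field, so commas inside quotes survive
    return next(csv.reader([inner]))

# Hypothetical sample with a comma inside a quoted value
sample = '{"Wireless Internet",Kitchen,"Cable TV, premium",Heating}'
print(parse_amenities(sample))
```

If an unquoted value ever contains a comma, this still splits it in the wrong place, so the caveat above remains.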
Stripping the curly braces as well takes slightly more work, but it is doable:
s = 'x = {"Wireless Internet","Air conditioning",Kitchen,Heating,"Family/kid friendly",Essentials,"Hair dryer",Iron,"translation missing: en.hosting_amenity_50"}'
print(s.replace('"','').split("{")[1].rstrip('}').split(","))
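Since the stated goal is one-hot encoding, here is a minimal sketch of how the parsed lists might be turned into binary indicator rows using only the standard library. The rows list, the parse helper, and the choice to order columns alphabetically are all assumptions made for illustration, not part of the original data or answer.

```python
def parse(raw):
    # Same idea as above: drop quotes and braces, then split on commas
    inner = raw.replace('"', '').strip().strip('{}')
    return inner.split(',') if inner else []

# Hypothetical 'amenities' values for two listings
rows = [
    '{"Wireless Internet",Kitchen,Heating}',
    '{Kitchen,"Hair dryer"}',
]

parsed = [parse(r) for r in rows]

# Vocabulary: every distinct amenity, sorted for a stable column order
vocab = sorted({a for items in parsed for a in items})

# One row per listing, one 0/1 column per amenity in vocab
encoded = [[1 if a in items else 0 for a in vocab] for items in parsed]

print(vocab)    # ['Hair dryer', 'Heating', 'Kitchen', 'Wireless Internet']
print(encoded)  # [[0, 1, 1, 1], [1, 0, 1, 0]]
```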