Azure AI Search: multi-index / conditional semantic search

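I am using Azure OpenAI to build a chatbot for my editorial company that retrieves information from a large body of corporate text data. I have run into a challenge with Azure AI Search. Originally all the data lived in a single index, but because of conditional search requirements I now need to split it into three separate indexes. Details below: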

  • Index 1: Biology index (private, FR)
  • Index 2: Engineering and Technology index (EN)
  • Index 3: Art and Architecture index (USA, UK)

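These indexes contain a variety of data sources and publications, and their subject matter overlaps. For example, when a query concerns anatomy-related topics such as eyesight, cardiovascular diseases, or growth hormone treatments, I want that query, along with related biology topics, to retrieve data exclusively from the Biology index (Index 1).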

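My Python code retrieves accurate data from a single index, but I am looking for a solution within Azure AI Search that prioritizes specific indexes based on the context of the query.

For example:

  • Biology-related queries should retrieve data only from Indexes 1 and 2.

  • Queries about technology, data science, and artificial intelligence should retrieve data only from Index 2.

I have not found a service or GitHub repository that directly addresses this requirement. I know that Azure AI Search does not support querying multiple indexes in a single request.

How can I find a solution or workaround?

This is the code I use for RAG: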

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# endpoint and admin_key are defined elsewhere (not shown in the question)
index_name = 'indx-editorials-bio-fr-old'

# Query to execute
query = 'Please retrieve publications from editorial certified houses covering cardiovascular diseases'

# Function to execute the query with semantic ranking
def execute_query_with_semantic_ranking():
    try:
        # Create a SearchClient for the index
        credential = AzureKeyCredential(admin_key)
        client = SearchClient(endpoint=endpoint, index_name=index_name, credential=credential)

        # Execute the query with semantic ranking
        results = client.search(search_text=query, semantic_fields=["content", "title"])

        # Print the results
        print(f"Results from index '{index_name}' with semantic ranking:")
        for result in results:
            print(result)
        print()
    except Exception as e:
        print(f"Error querying index '{index_name}' with semantic ranking: {e}")

# Execute the query with semantic ranking
execute_query_with_semantic_ranking()

Index definition:

{
  "@odata.context": "search.windows.net",
  "@odata.etag": "\"123547858WRF\"",
  "name": "all_articles_index",
  "defaultScoringProfile": null,
  "fields": [
    {
      "name": "content", "type": "Edm.String",
      "searchable": true, "filterable": true, "retrievable": true, "stored": true,
      "sortable": true, "facetable": false, "key": false,
      "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null, "normalizer": null,
      "dimensions": null, "vectorSearchProfile": null, "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "title", "type": "Edm.String",
      "searchable": true, "filterable": true, "retrievable": true, "stored": true,
      "sortable": true, "facetable": false, "key": false,
      "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null, "normalizer": null,
      "dimensions": null, "vectorSearchProfile": null, "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "doi", "type": "Edm.String",
      "searchable": false, "filterable": false, "retrievable": true, "stored": true,
      "sortable": false, "facetable": false, "key": false,
      "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null, "normalizer": null,
      "dimensions": null, "vectorSearchProfile": null, "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "editorial_house", "type": "Edm.String",
      "searchable": false, "filterable": false, "retrievable": true, "stored": true,
      "sortable": false, "facetable": false, "key": false,
      "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null, "normalizer": null,
      "dimensions": null, "vectorSearchProfile": null, "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "metadata_storage_path", "type": "Edm.String",
      "searchable": false, "filterable": false, "retrievable": true, "stored": true,
      "sortable": false, "facetable": false, "key": true,
      "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null, "normalizer": null,
      "dimensions": null, "vectorSearchProfile": null, "vectorEncoding": null,
      "synonymMaps": []
    }
  ],
  "scoringProfiles": [],
  "corsOptions": null,
  "suggesters": [],
  "analyzers": [],
  "normalizers": [],
  "tokenizers": [],
  "tokenFilters": [],
  "charFilters": [],
  "encryptionKey": null,
  "similarity": {
    "@odata.type": "BM25Similarity",
    "k1": null,
    "b": null
  },
  "semantic": {
    "defaultConfiguration": null,
    "configurations": [
      {
        "name": "article-semantic",
        "prioritizedFields": {
          "titleField": { "fieldName": "title" },
          "prioritizedContentFields": [
            { "fieldName": "content" }
          ],
          "prioritizedKeywordsFields": []
        }
      }
    ]
  },
  "vectorSearch": null
}

Sample data:

[
  {
    "content": "This article explores the potential of AI to revolutionize genomics, highlighting recent breakthroughs and future prospects.",
    "title": "The Impact of AI on Genomics: Recent Breakthroughs and Future Prospects",
    "doi": "10.1234/ai-bio-2024-001",
    "editorial_house": "BioTech Publishers",
    "metadata_storage_path": "/articles/2024/ai-bio-2024-001"
  },
  {
    "content": "In this study, we discuss the integration of machine learning in drug discovery processes, focusing on its benefits and challenges.",
    "title": "Machine Learning in Drug Discovery: Benefits and Challenges",
    "doi": "10.1234/ai-bio-2024-002",
    "editorial_house": "BioTech Publishers",
    "metadata_storage_path": "/articles/2024/ai-bio-2024-002"
  },
  {
    "content": "This paper examines the role of AI in ecological monitoring, presenting case studies on wildlife conservation efforts.",
    "title": "AI in Ecological Monitoring: Wildlife Conservation Case Studies",
    "doi": "10.1234/ai-bio-2024-003",
    "editorial_house": "BioTech Publishers",
    "metadata_storage_path": "/articles/2024/ai-bio-2024-003"
  },
  {
    "content": "The article reviews advances in bioinformatics driven by AI, with a focus on data analysis techniques and their applications.",
    "title": "Advances in Bioinformatics: AI-Driven Data Analysis Techniques",
    "doi": "10.1234/ai-bio-2024-004",
    "editorial_house": "BioTech Publishers",
    "metadata_storage_path": "/articles/2024/ai-bio-2024-004"
  },
  {
    "content": "This study highlights the use of AI in personalized medicine, detailing the technology's impact on treatment plans and patient outcomes.",
    "title": "Personalized Medicine: AI's Role in Tailoring Treatment Plans",
    "doi": "10.1234/ai-bio-2024-005",
    "editorial_house": "BioTech Publishers",
    "metadata_storage_path": "/articles/2024/ai-bio-2024-005"
  }
]

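Answer:

Yes, as you said, a single query cannot span multiple indexes. Here is a possible workaround for your problem.

You mentioned that you are creating three new indexes. In addition to those, create a fourth index whose fields hold each index's name together with all of its content and topics.

Sample data: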

[
  {"index_name": "Biology Index", "content": "All content on biology topics"},
  {"index_name": "Engineering and Technology Index", "content": "All content on engineering and technology topics"},
  {"index_name": "Art and Architecture Index", "content": "All content on art and architecture topics"}
]

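So, create a fourth index that contains data shaped like the sample above. If a topic has several documents, merge them and put the combined text in the content field.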
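The merge step could be sketched roughly as below. This is only an illustration under assumptions: it supposes each article has been tagged with the name of the index it belongs to (an `index_name` key that is not in the question's sample data), and it adds a hypothetical `id` field because an Azure AI Search index needs a key field. The function name `build_routing_documents` is made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def build_routing_documents(articles: Iterable[Mapping[str, str]]) -> List[Dict[str, str]]:
    """Merge article text into one routing document per source index.

    Assumes every article carries an "index_name" key (hypothetical tag)
    in addition to the "title" and "content" fields from the question.
    """
    merged = defaultdict(list)
    for article in articles:
        merged[article["index_name"]].append(f'{article["title"]}. {article["content"]}')

    # "id" is a hypothetical key field for the fourth (routing) index.
    return [
        {"id": str(i), "index_name": name, "content": "\n".join(texts)}
        for i, (name, texts) in enumerate(sorted(merged.items()), start=1)
    ]


sample_articles = [
    {"index_name": "Biology Index", "title": "A", "content": "x"},
    {"index_name": "Biology Index", "title": "B", "content": "y"},
    {"index_name": "Art and Architecture Index", "title": "C", "content": "z"},
]
routing_docs = build_routing_documents(sample_articles)
print(routing_docs)
```

The resulting list could then be pushed to the fourth index, for example with `SearchClient.upload_documents(documents=routing_docs)`, once that index exists with matching fields.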

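Next, run the user's input as a query against this fourth index, take the index name from the result with the highest @search.score, and use that index name in your Python code for the follow-up query.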
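A minimal sketch of that two-step lookup follows. To keep it runnable without an Azure service, the search calls are passed in as plain functions; `pick_index`, `routed_search`, `fake_router`, and `fake_targets` are names invented for this example, and the scores are made-up values. The only thing it relies on from Azure AI Search is that each result is dict-like and exposes its relevance under the `@search.score` key.

```python
from typing import Any, Callable, Iterable, List, Mapping, Optional

# A search function takes the query text and yields dict-like result documents.
SearchFn = Callable[[str], Iterable[Mapping[str, Any]]]


def pick_index(router_hits: Iterable[Mapping[str, Any]]) -> Optional[str]:
    """Return the index_name of the routing hit with the highest @search.score."""
    best = max(router_hits, key=lambda hit: hit["@search.score"], default=None)
    return None if best is None else best["index_name"]


def routed_search(query: str, router: SearchFn,
                  targets: Mapping[str, SearchFn]) -> List[Mapping[str, Any]]:
    """Query the routing index first, then only the index it points to."""
    index_name = pick_index(router(query))
    if index_name is None or index_name not in targets:
        return []  # no routing hit, or no search function registered for that name
    return list(targets[index_name](query))


# Stand-in search functions (hypothetical data) so the sketch runs offline.
def fake_router(query: str):
    return [
        {"index_name": "Engineering and Technology Index", "@search.score": 0.4},
        {"index_name": "Biology Index", "@search.score": 2.1},
    ]


fake_targets = {
    "Biology Index": lambda q: [
        {"title": "Machine Learning in Drug Discovery: Benefits and Challenges"}
    ],
    "Engineering and Technology Index": lambda q: [],
}

hits = routed_search("cardiovascular diseases", fake_router, fake_targets)
print(hits)
```

With the real SDK, `router` could be something like `lambda q: router_client.search(search_text=q, top=1)`, where `router_client` is a `SearchClient` created for the fourth index, and each entry in `targets` could wrap a `SearchClient` for one of the three topic indexes. If a query should reach two indexes (as with the biology example in the question), one possible variation is to take the top two routing hits instead of only the best one.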
