I need to embed more than 300,000 product descriptions for a multi-class classification project. I split the descriptions into chunks of 34,337 each to stay within the batch embedding size limit.
A sample from my jsonl file for the batch:
{"custom_id": "request-0", "method": "POST", "url": "/v1/embeddings", "body": {"model": "text-embedding-ada-002", "input": "Base L\u00edquida Maybelline Superstay 24 Horas Full Coverage Cor 220 Natural Beige 30ml", "encoding_format": "float"}}{"custom_id": "request-1", "method": "POST", "url": "/v1/embeddings", "body": {"model": "text-embedding-ada-002", "input": "Sand\u00e1lia Havaianas Top Animals Cinza/Gelo 39/40", "encoding_format": "float"}}
My jsonl file has 34,337 lines.
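For reference, a minimal sketch of how chunk files in this format can be produced, assuming the descriptions are held in a Python list named descriptions (the list name and chunk size are illustrative):

import json

CHUNK_SIZE = 34_337  # requests per batch file

# descriptions: assumed list of ~300k product description strings
for start in range(0, len(descriptions), CHUNK_SIZE):
    chunk = descriptions[start:start + CHUNK_SIZE]
    path = f"batch_emb_file_{start // CHUNK_SIZE + 1}.jsonl"
    with open(path, "w", encoding="utf-8") as f:
        for i, text in enumerate(chunk):
            request = {
                "custom_id": f"request-{start + i}",
                "method": "POST",
                "url": "/v1/embeddings",
                "body": {
                    "model": "text-embedding-ada-002",
                    "input": text,
                    "encoding_format": "float",
                },
            }
            f.write(json.dumps(request) + "\n")  # one JSON object per line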
I uploaded the file successfully:
File 'batch_emb_file_1.jsonl' uploaded successfully: FileObject(id='redacted for work compliance', bytes=6663946, created_at=1720128016, filename='batch_emb_file_1.jsonl', object='file', purpose='batch', status='processed', status_details=None)
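For completeness, a sketch of the Files API upload call that produces a FileObject like the one above (the print format is illustrative):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

batch_file = client.files.create(
    file=open("batch_emb_file_1.jsonl", "rb"),
    purpose="batch",
)
print(f"File '{batch_file.filename}' uploaded successfully: {batch_file}")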
and created the embedding batch job:
Batch job created successfully: Batch(id='redacted for work compliance', completion_window='24h', created_at=1720129886, endpoint='/v1/embeddings', input_file_id='redacted for work compliance', object='batch', status='validating', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=1720216286, failed_at=None, finalizing_at=None, in_progress_at=None, metadata={'description': 'Batch job for embedding large quantity of product descriptions', 'initiated_by': 'Marcio', 'project': 'Product Classification', 'date': '2024-07-04 21:51', 'comments': 'This is the 1 batch job of embeddings'}, output_file_id=None, request_counts=BatchRequestCounts(completed=0, failed=0, total=0))
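Creating the job is a single Batches API call; a sketch, reusing the uploaded file's id (the metadata mirrors the fields shown in the log above):

batch_job_1 = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/embeddings",
    completion_window="24h",
    metadata={
        "description": "Batch job for embedding large quantity of product descriptions",
        "project": "Product Classification",
    },
)
print(f"Batch job created successfully: {batch_job_1}")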
The job has completed:
client.batches.retrieve(batch_job_1.id).status
'completed'
and client.batches.retrieve('redacted for work compliance')
returns:
Batch(id='redacted for work compliance', completion_window='24h', created_at=1720129886, endpoint='/v1/embeddings', input_file_id='redacted for work compliance', object='batch', status='completed', cancelled_at=None, cancelling_at=None, completed_at=1720135956, error_file_id=None, errors=None, expired_at=None, expires_at=1720216286, failed_at=None, finalizing_at=1720133521, in_progress_at=1720129903, metadata={'description': 'Batch job for embedding large quantity of product descriptions', 'initiated_by': 'Marcio', 'project': 'Product Classification', 'date': '2024-07-04 21:51', 'comments': 'This is the 1 batch job of embeddings'}, output_file_id='redacted for work compliance', request_counts=BatchRequestCounts(completed=34337, failed=0, total=34337))
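A simple way to wait for a terminal status before touching the output, as a sketch (the 60-second polling interval is arbitrary):

import time

while True:
    job = client.batches.retrieve(batch_job_1.id)
    if job.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)  # poll once a minute

output_file_id = job.output_file_id  # None unless the job completed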
But when I try to fetch the content using the output_file_id string with
client.files.content(value of output_file_id)
it returns:
<openai._legacy_response.HttpxBinaryResponseContent at 0x79ae81ec7d90>
I also tried: client.files.content(value of output_file_id).content
but that crashes my kernel.
What am I doing wrong? Also, I don't think I'm taking full advantage of batch embeddings: the 90,000 limit seems to conflict with the 3,000,000 batch queue limit for the 'text-embedding-ada-002' model.
Can anyone help me?
Answer:
Retrieving the embedding data from the batch output file is a bit tricky; this tutorial breaks the process down step by step: link
After you have the output_file_id, you need to:
import json
import pandas as pd

# the response body is bytes; .text decodes it to a string of jsonl lines
output_file = client.files.content(output_file_id).text

embedding_results = []
for line in output_file.split('\n')[:-1]:  # last split element is empty
    data = json.loads(line)
    custom_id = data.get('custom_id')
    embedding = data['response']['body']['data'][0]['embedding']
    embedding_results.append([custom_id, embedding])

embedding_results = pd.DataFrame(embedding_results, columns=['custom_id', 'embedding'])
In my case, this retrieves the embedding data from the batch job output files.
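A possible explanation for the kernel crash is that the notebook tries to render the raw .content bytes (tens of thousands of 1536-dimension embeddings, likely hundreds of megabytes). If parsing everything in memory is still too heavy, an alternative sketch is to write the response to disk first and parse it line by line; I believe the returned HttpxBinaryResponseContent object exposes a write_to_file method for this:

raw = client.files.content(output_file_id)
raw.write_to_file("batch_emb_output_1.jsonl")  # avoids rendering the bytes in the notebook

embedding_results = []
with open("batch_emb_output_1.jsonl", encoding="utf-8") as f:
    for line in f:
        data = json.loads(line)
        embedding_results.append([data.get("custom_id"),
                                  data["response"]["body"]["data"][0]["embedding"]])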