Is it possible to generate a dataset from existing documents using Ollama or OpenAI? #224

@Anandhfullstack

Description

I am working on automating the generation of structured datasets using Kiln AI's TaskRun. My goal is to take a summarized text input and generate an expected dataset in the format shown in the UI (see attached image).

Image

This is my task instruction

Image

I want to create a dataset from existing documents.

In the UI, I can enter a text directly and generate a single dataset entry like this.

Image

I want to generate a whole dataset like this, so I need to automate the process.

I referred to the Kiln AI core docs and found that I can create multiple TaskRun instances in a loop. However, when using the following code to generate a single TaskRun, I am unable to leverage an LLM to produce the expected dataset from the summarized text.

    import json

    import kiln_ai.datamodel

    # `task` is an existing kiln_ai.datamodel.Task, created or loaded elsewhere.
    item = kiln_ai.datamodel.TaskRun(
        parent=task,
        input="The AIE1515_FC_C does not come with an operating system which must be loaded first before installation of any software into the computer.",
        input_source=kiln_ai.datamodel.DataSource(
            type=kiln_ai.datamodel.DataSourceType.synthetic,
            properties={"model_name": "phi4", "model_provider": "ollama", "adapter_name": "ollama_adapter"},
        ),
        output=kiln_ai.datamodel.TaskOutput(
            # The output fields are empty strings: nothing here calls the model.
            output=json.dumps({"answer_1": "", "question_1": "", "answer_2": "", "question_2": "", "answer_3": "", "question_3": ""}),
            source=kiln_ai.datamodel.DataSource(
                type=kiln_ai.datamodel.DataSourceType.synthetic,
                properties={"model_name": "phi4", "model_provider": "ollama", "adapter_name": "ollama_adapter"},
            ),
        ),
    )

Expected Outcome:

  • The LLM should process the given summarized text and automatically generate answer_1, question_1, answer_2, question_2, etc.
  • The dataset should be structured as shown in the UI example.
  • This process should be scalable, allowing multiple TaskRun instances to be created in a loop.
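To make the expected outcome concrete, here is a minimal sketch of the loop I have in mind. It is not Kiln AI code: `build_dataset`, `validate_output`, and the `generate` callable are hypothetical names, and `generate` stands in for whatever actually calls the model and returns a JSON string. The sketch only shows the looping and the check that every question/answer field is filled in.

```python
import json
from typing import Callable

# Field names taken from the output format in the snippet above.
EXPECTED_KEYS = [f"{kind}_{i}" for i in range(1, 4) for kind in ("question", "answer")]


def validate_output(raw: str) -> dict:
    """Parse the model's JSON string and require every expected field to be non-empty."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("model output is not a JSON object")
    missing = [key for key in EXPECTED_KEYS if not data.get(key)]
    if missing:
        raise ValueError(f"missing or empty fields: {missing}")
    return {key: data[key] for key in EXPECTED_KEYS}


def build_dataset(summaries: list[str], generate: Callable[[str], str]):
    """Run `generate` (hypothetical LLM call) on each summary and collect valid records."""
    records, failures = [], []
    for summary in summaries:
        try:
            output = validate_output(generate(summary))
        except ValueError as err:
            # Keep going; bad generations are reported instead of aborting the loop.
            failures.append((summary, str(err)))
            continue
        records.append({"input": summary, "output": output})
    return records, failures
```

Each entry in `records` would then supply the `input` and the `json.dumps(...)` output for one TaskRun, while `failures` lists the summaries that need a retry.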

Request for Help:

  • How can I modify TaskRun so that the LLM actively generates the structured dataset instead of just initializing an empty output?
  • Are there specific parameters or methods in Kiln AI that allow integrating the LLM for dynamic output generation?
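For reference, this is roughly the kind of model call I would like to happen for each input. The sketch bypasses Kiln AI entirely and talks to Ollama's local REST endpoint (`/api/generate`) with the standard library, assuming Ollama runs on its default port with `phi4` pulled. The function names and the prompt wording are my own placeholders, not anything taken from the Kiln AI docs.

```python
import json
import urllib.request

# Default local Ollama endpoint; an assumption about the setup.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(summary: str, model: str = "phi4") -> dict:
    """Build a non-streaming request that asks Ollama for JSON output."""
    prompt = (
        "Write three question/answer pairs about the text below. "
        "Reply with a JSON object with the keys question_1, answer_1, "
        "question_2, answer_2, question_3, answer_3.\n\n"
        f"Text: {summary}"
    )
    return {"model": model, "prompt": prompt, "format": "json", "stream": False}


def parse_response(body: bytes) -> dict:
    """Ollama wraps the generated text in a 'response' field; decode both layers."""
    envelope = json.loads(body)
    return json.loads(envelope["response"])


def generate_qa(summary: str, model: str = "phi4") -> dict:
    """Send one summary to the local Ollama server and return the parsed pairs."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(summary, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_response(response.read())
```

The dictionary returned by `generate_qa` could be passed through `json.dumps` and used as the TaskOutput output in place of the empty strings. What I am asking is whether Kiln AI has a built-in way to do this step.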

Any help or suggestions would be greatly appreciated!
