{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# RAG evaluation with the Ragas Python SDK (modern metrics API)\n",
        "\n",
        "This notebook shows a minimal end-to-end flow using Ragas modern metrics: build a small evaluation table (questions, retrieved contexts, and model answers), run metrics from `ragas.metrics.collections`, and inspect per-row and aggregate scores.\n",
        "\n",
        "**Requirements**\n",
        "\n",
        "- Python 3.10+ recommended.\n",
        "- OpenAI-compatible API access: set `EVAL_LLM_API_KEY` and, when needed, `EVAL_LLM_BASE_URL` (the notebook also accepts `OPENAI_API_KEY` / `OPENAI_BASE_URL` as fallback).\n",
        "- Optional separate embedding endpoint credentials: `EVAL_EMBED_API_KEY` and `EVAL_EMBED_BASE_URL`.\n",
        "- Explicit model configuration in the notebook (`EVAL_LLM_MODEL` and optional `EVAL_EMBED_MODEL`; default is `text-embedding-3-small`).\n",
        "- For retrieval-related metrics, use the same embedding model as the production RAG retriever whenever possible.\n",
        "- Ragas calls the LLM (and embeddings where needed) multiple times; expect latency and cost proportional to dataset size and metrics.\n",
        "- A dedicated virtual environment or workbench image reduces dependency conflicts with other projects.\n",
        "\n",
        "**References**\n",
        "\n",
        "- [Ragas documentation](https://docs.ragas.io/)\n",
        "- Alauda AI docs: *Evaluating RAG with Ragas* — grouped metric overview and prerequisites (when browsing the documentation site)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Install dependencies\n",
        "\n",
        "Run this once per environment (for example a new workbench or virtualenv)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Use current kernel's Python so PATH does not point to another env\n",
        "# If download is slow, add: -i https://pypi.tuna.tsinghua.edu.cn/simple\n",
        "import sys\n",
        "!{sys.executable} -m pip install \"ragas\" \"datasets\" \"openai\""
      ]
    },
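    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Optional: print the resolved package versions. Recording them (or pinning versions in the install command above) makes runs reproducible, since metric classes and signatures can change across Ragas releases."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from importlib.metadata import version\n",
        "\n",
        "# Record the installed versions for reproducibility\n",
        "for pkg in (\"ragas\", \"datasets\", \"openai\"):\n",
        "    print(f\"{pkg}=={version(pkg)}\")"
      ]
    },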
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Configure API credentials\n",
        "\n",
        "Set `EVAL_LLM_API_KEY` (recommended) or `OPENAI_API_KEY` before running evaluation. If the endpoint is not the provider default, set `EVAL_LLM_BASE_URL` (or `OPENAI_BASE_URL`) as well.\n",
        "\n",
        "Do not commit secrets into version control; use platform secret injection or notebook environment variables instead.\n",
        "\n",
        "Optional: disable Ragas analytics (`RAGAS_DO_NOT_TRACK=true`) if required by policy."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import os\n",
        "\n",
        "# Config LLM API\n",
        "# os.environ[\"EVAL_LLM_API_KEY\"] = \"sk-...\"\n",
        "# os.environ[\"EVAL_LLM_BASE_URL\"] = \"https://your-openai-compatible-endpoint/v1\"  # optional\n",
        "# os.environ[\"EVAL_LLM_MODEL\"] = \"...\"\n",
        "\n",
        "# Config Embeddings API\n",
        "# os.environ[\"EVAL_EMBED_API_KEY\"] = \"sk-...\"\n",
        "# os.environ[\"EVAL_EMBED_BASE_URL\"] = \"https://your-embedding-endpoint/v1\"\n",
        "# os.environ[\"EVAL_EMBED_MODEL\"] = \"...\"\n",
        "\n",
        "\n",
        "# Optional: disable Ragas analytics\n",
        "# os.environ[\"RAGAS_DO_NOT_TRACK\"] = \"true\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from openai import AsyncOpenAI\n",
        "from ragas.embeddings import OpenAIEmbeddings\n",
        "from ragas.llms import llm_factory\n",
        "\n",
        "EVAL_LLM_API_KEY = os.getenv(\"EVAL_LLM_API_KEY\", os.getenv(\"OPENAI_API_KEY\", \"\"))\n",
        "EVAL_LLM_BASE_URL = os.getenv(\"EVAL_LLM_BASE_URL\", os.getenv(\"OPENAI_BASE_URL\", \"\"))\n",
        "EVAL_LLM_MODEL = os.getenv(\"EVAL_LLM_MODEL\", \"\")\n",
        "\n",
        "EVAL_EMBED_API_KEY = os.getenv(\"EVAL_EMBED_API_KEY\", EVAL_LLM_API_KEY)\n",
        "EVAL_EMBED_BASE_URL = os.getenv(\"EVAL_EMBED_BASE_URL\", EVAL_LLM_BASE_URL)\n",
        "EVAL_EMBED_MODEL = os.getenv(\"EVAL_EMBED_MODEL\", \"\")\n",
        "\n",
        "if not EVAL_LLM_MODEL:\n",
        "    raise RuntimeError(\"Set EVAL_LLM_MODEL to an available model ID from your endpoint.\")\n",
        "\n",
        "if not EVAL_EMBED_MODEL:\n",
        "    raise RuntimeError(\"Set EVAL_EMBED_MODEL to an available model ID from your endpoint.\")\n",
        "\n",
        "llm_client = AsyncOpenAI(\n",
        "    api_key=EVAL_LLM_API_KEY,\n",
        "    base_url=EVAL_LLM_BASE_URL or None,\n",
        ")\n",
        "embed_client = AsyncOpenAI(\n",
        "    api_key=EVAL_EMBED_API_KEY,\n",
        "    base_url=EVAL_EMBED_BASE_URL or None,\n",
        ")\n",
        "\n",
        "llm = llm_factory(EVAL_LLM_MODEL, client=llm_client)\n",
        "embeddings = OpenAIEmbeddings(\n",
        "    model=EVAL_EMBED_MODEL,\n",
        "    client=embed_client,\n",
        ")\n",
        "\n",
        "print(f\"llm_base_url={EVAL_LLM_BASE_URL or '(provider default)'}\")\n",
        "print(f\"llm={EVAL_LLM_MODEL}\")\n",
        "print(f\"embed_base_url={EVAL_EMBED_BASE_URL or '(provider default)'}\")\n",
        "print(f\"embeddings={EVAL_EMBED_MODEL}\")"
      ]
    },
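    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Optional sanity check before spending LLM calls on evaluation: one minimal chat completion and one embedding request against the configured endpoints, using the raw `AsyncOpenAI` clients created above. This is a sketch, not part of the evaluation itself; if either call fails, fix the credentials or model IDs before continuing."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Smoke-test the LLM endpoint with a tiny request\n",
        "chat_response = await llm_client.chat.completions.create(\n",
        "    model=EVAL_LLM_MODEL,\n",
        "    messages=[{\"role\": \"user\", \"content\": \"Reply with the single word: ok\"}],\n",
        "    max_tokens=5,\n",
        ")\n",
        "print(\"llm reply:\", chat_response.choices[0].message.content)\n",
        "\n",
        "# Smoke-test the embeddings endpoint\n",
        "embed_response = await embed_client.embeddings.create(\n",
        "    model=EVAL_EMBED_MODEL,\n",
        "    input=\"ping\",\n",
        ")\n",
        "print(\"embedding dimension:\", len(embed_response.data[0].embedding))"
      ]
    },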
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Build an evaluation dataset\n",
        "\n",
        "For the modern metrics API in this notebook, organize data as row samples (one dictionary per sample).\n",
        "\n",
        "Each row uses argument-aligned names:\n",
        "\n",
        "- `user_input`: user query\n",
        "- `retrieved_contexts`: list of retrieved passages for that row\n",
        "- `response`: model response to score\n",
        "- `reference`: reference answer or expected facts (needed by retrieval/reference-based metrics)\n",
        "\n",
        "This row-first structure matches `ascore()` usage and avoids extra mapping."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from datasets import Dataset\n",
        "\n",
        "samples = [\n",
        "    {\n",
        "        \"user_input\": \"What is the capital of France?\",\n",
        "        \"retrieved_contexts\": [\n",
        "            \"Paris is the capital and most populous city of France.\"\n",
        "        ],\n",
        "        \"response\": \"The capital of France is Paris.\",\n",
        "        \"reference\": \"Paris\",\n",
        "    },\n",
        "    {\n",
        "        \"user_input\": \"Who patented an early practical telephone?\",\n",
        "        \"retrieved_contexts\": [\n",
        "            \"Alexander Graham Bell was a Scottish-born inventor who patented the first practical telephone.\"\n",
        "        ],\n",
        "        \"response\": \"Alexander Graham Bell patented an early practical telephone.\",\n",
        "        \"reference\": \"Alexander Graham Bell\",\n",
        "    },\n",
        "    {\n",
        "        \"user_input\": \"What is photosynthesis?\",\n",
        "        \"retrieved_contexts\": [\n",
        "            \"Photosynthesis is the process by which plants convert light energy into chemical energy.\"\n",
        "        ],\n",
        "        \"response\": \"Photosynthesis is how plants turn sunlight into chemical energy.\",\n",
        "        \"reference\": \"Plants convert light energy into chemical energy during photosynthesis.\",\n",
        "    },\n",
        "]\n",
        "\n",
        "dataset = Dataset.from_list(samples)\n",
        "dataset"
      ]
    },
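    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Optional: validate the rows before scoring them. A minimal check, assuming the four field names listed above, that each field is present, non-empty, and of the expected type; extend it as needed for your own data."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Expected field names and types for the modern metrics used below\n",
        "REQUIRED_FIELDS = {\n",
        "    \"user_input\": str,\n",
        "    \"retrieved_contexts\": list,\n",
        "    \"response\": str,\n",
        "    \"reference\": str,\n",
        "}\n",
        "\n",
        "for i, row in enumerate(dataset.to_list()):\n",
        "    for field, expected_type in REQUIRED_FIELDS.items():\n",
        "        value = row.get(field)\n",
        "        if not isinstance(value, expected_type) or not value:\n",
        "            raise ValueError(f\"row {i}: field {field!r} is missing or empty\")\n",
        "\n",
        "print(f\"{len(dataset)} rows validated\")"
      ]
    },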
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Run evaluation (modern metrics)\n",
        "\n",
        "- **Faithfulness**: whether the answer is supported by the retrieved contexts.\n",
        "- **Answer relevancy**: whether the answer addresses the question.\n",
        "\n",
        "This section uses metrics from `ragas.metrics.collections` with the modern embeddings/LLM interfaces."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from ragas.metrics.collections import AnswerRelevancy, Faithfulness\n",
        "\n",
        "faithfulness_metric = Faithfulness(llm=llm)\n",
        "answer_relevancy_metric = AnswerRelevancy(llm=llm, embeddings=embeddings)\n",
        "\n",
        "\n",
        "async def score_baseline_rows(ds):\n",
        "    rows = ds.to_list()\n",
        "    scored = []\n",
        "    for row in rows:\n",
        "        faithfulness_result = await faithfulness_metric.ascore(\n",
        "            user_input=row[\"user_input\"],\n",
        "            response=row[\"response\"],\n",
        "            retrieved_contexts=row[\"retrieved_contexts\"],\n",
        "        )\n",
        "        answer_relevancy_result = await answer_relevancy_metric.ascore(\n",
        "            user_input=row[\"user_input\"],\n",
        "            response=row[\"response\"],\n",
        "        )\n",
        "        scored.append(\n",
        "            {\n",
        "                \"user_input\": row[\"user_input\"],\n",
        "                \"faithfulness\": faithfulness_result.value,\n",
        "                \"answer_relevancy\": answer_relevancy_result.value,\n",
        "            }\n",
        "        )\n",
        "    return scored\n",
        "\n",
        "\n",
        "baseline_scores = await score_baseline_rows(dataset)\n",
        "faithfulness_avg = sum(item[\"faithfulness\"] for item in baseline_scores) / len(baseline_scores)\n",
        "answer_relevancy_avg = sum(item[\"answer_relevancy\"] for item in baseline_scores) / len(baseline_scores)\n",
        "\n",
        "print(\"Aggregate means:\")\n",
        "print(f\"faithfulness={faithfulness_avg:.4f}\")\n",
        "print(f\"answer_relevancy={answer_relevancy_avg:.4f}\")\n",
        "print(\"\\nPer-row scores:\")\n",
        "for idx, item in enumerate(baseline_scores, start=1):\n",
        "    print(\n",
        "        f\"{idx}. user_input={item['user_input']} | \"\n",
        "        f\"faithfulness={item['faithfulness']:.4f} | \"\n",
        "        f\"answer_relevancy={item['answer_relevancy']:.4f}\"\n",
        "    )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Add retrieval-focused metrics (modern metrics)\n",
        "\n",
        "- **Context precision**: whether retrieved chunks are useful for answering the question.\n",
        "- **Context recall**: whether retrieved contexts cover what the reference (`ground_truth`) states.\n",
        "\n",
        "This pass issues additional LLM calls. If validation errors mention missing columns, adjust the dataset or choose another metric variant per the [Ragas metrics documentation](https://docs.ragas.io/)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from ragas.metrics.collections import ContextPrecision, ContextRecall\n",
        "\n",
        "context_precision_metric = ContextPrecision(llm=llm)\n",
        "context_recall_metric = ContextRecall(llm=llm)\n",
        "\n",
        "\n",
        "async def score_retrieval_rows(ds):\n",
        "    rows = ds.to_list()\n",
        "    scored = []\n",
        "    for row in rows:\n",
        "        context_precision_result = await context_precision_metric.ascore(\n",
        "            user_input=row[\"user_input\"],\n",
        "            reference=row[\"reference\"],\n",
        "            retrieved_contexts=row[\"retrieved_contexts\"],\n",
        "        )\n",
        "        context_recall_result = await context_recall_metric.ascore(\n",
        "            user_input=row[\"user_input\"],\n",
        "            retrieved_contexts=row[\"retrieved_contexts\"],\n",
        "            reference=row[\"reference\"],\n",
        "        )\n",
        "        scored.append(\n",
        "            {\n",
        "                \"user_input\": row[\"user_input\"],\n",
        "                \"context_precision\": context_precision_result.value,\n",
        "                \"context_recall\": context_recall_result.value,\n",
        "            }\n",
        "        )\n",
        "    return scored\n",
        "\n",
        "\n",
        "retrieval_scores = await score_retrieval_rows(dataset)\n",
        "context_precision_avg = sum(item[\"context_precision\"] for item in retrieval_scores) / len(retrieval_scores)\n",
        "context_recall_avg = sum(item[\"context_recall\"] for item in retrieval_scores) / len(retrieval_scores)\n",
        "\n",
        "print(\"Aggregate means:\")\n",
        "print(f\"context_precision={context_precision_avg:.4f}\")\n",
        "print(f\"context_recall={context_recall_avg:.4f}\")\n",
        "print(\"\\nPer-row scores:\")\n",
        "for idx, item in enumerate(retrieval_scores, start=1):\n",
        "    print(\n",
        "        f\"{idx}. user_input={item['user_input']} | \"\n",
        "        f\"context_precision={item['context_precision']:.4f} | \"\n",
        "        f\"context_recall={item['context_recall']:.4f}\"\n",
        "    )"
      ]
    },
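    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Optional: merge the two passes into a single per-row table. This sketch zips `baseline_scores` and `retrieval_scores` together, which is safe here because both were produced in order from the same dataset."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# One row per sample with all four metric scores\n",
        "combined = [\n",
        "    {**base, **{k: v for k, v in retr.items() if k != \"user_input\"}}\n",
        "    for base, retr in zip(baseline_scores, retrieval_scores)\n",
        "]\n",
        "\n",
        "Dataset.from_list(combined).to_pandas()"
      ]
    },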
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Troubleshooting\n",
        "\n",
        "- **Model not found (`Model Not Exist`)**: set `EVAL_LLM_MODEL` (and, when overridden, `EVAL_EMBED_MODEL`) to an available model ID from the endpoint (for example via `/models`).\n",
        "- **Credentials or endpoint setup**: set `EVAL_LLM_API_KEY` / `EVAL_LLM_BASE_URL` (fallback: `OPENAI_API_KEY` / `OPENAI_BASE_URL`). If embeddings use a separate endpoint, also set `EVAL_EMBED_API_KEY` / `EVAL_EMBED_BASE_URL`.\n",
        "- **Notebook async execution**: this notebook uses `await metric.ascore(...)` in cells. If running outside notebook contexts, use `asyncio.run(...)` or metric `.score(...)` in synchronous scripts.\n",
        "- **Version-related warnings**: metric classes and signatures can change across Ragas releases. Pin package versions for reproducible runs and confirm behavior against [docs.ragas.io](https://docs.ragas.io/)."
      ]
    }
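    ,
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As referenced above, OpenAI-compatible endpoints usually expose available model IDs via the `/models` route. A minimal sketch using the clients configured earlier; some gateways restrict this route, in which case consult the endpoint's own documentation."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# List model IDs served by the LLM endpoint (candidates for EVAL_LLM_MODEL / EVAL_EMBED_MODEL)\n",
        "models = await llm_client.models.list()\n",
        "for m in models.data:\n",
        "    print(m.id)"
      ]
    }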
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "pygments_lexer": "ipython3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}
