# LangExtract

{% hint style="success" %}
This setup gives you a local LangExtract API backed by Ollama.

At the end, you will have:

* LangExtract installed in a Python virtual environment
* A FastAPI service running on port `8765`
* A test request that returns structured extractions
* A local endpoint ready for use from PDI
  {% endhint %}

{% hint style="info" %}
Each step gives commands for both Windows (PowerShell) and Unix (macOS or Linux).

Adjust paths and user names to match your environment.
{% endhint %}

**Architecture**

`PDI → HTTP Client → LangExtract API → Ollama model → JSON response`

**Prerequisites**

<table><thead><tr><th valign="top">Component</th><th valign="top">Version</th><th valign="top">Notes</th></tr></thead><tbody><tr><td valign="top">Pentaho Data Integration</td><td valign="top">9.x</td><td valign="top">Community or EE. Spoon and Pan available.</td></tr><tr><td valign="top">Python</td><td valign="top">3.10+</td><td valign="top">Requires <code>pip</code> and <code>venv</code>.</td></tr><tr><td valign="top">LangExtract</td><td valign="top">Current GitHub source</td><td valign="top">Installed from <code>google/langextract</code>.</td></tr><tr><td valign="top">FastAPI</td><td valign="top">Current</td><td valign="top">REST wrapper for LangExtract.</td></tr><tr><td valign="top">Uvicorn</td><td valign="top">Current</td><td valign="top">ASGI server for the API.</td></tr><tr><td valign="top">Ollama</td><td valign="top">0.3+</td><td valign="top">Local LLM runtime.</td></tr><tr><td valign="top">Ollama Python package</td><td valign="top">Current</td><td valign="top">Required by the sample service code.</td></tr><tr><td valign="top">Model</td><td valign="top"><code>llama3.1:8b</code></td><td valign="top">Pulled locally with Ollama.</td></tr><tr><td valign="top">PostgreSQL</td><td valign="top">Optional</td><td valign="top">Use only if you plan to persist extracted records.</td></tr></tbody></table>

{% stepper %}
{% step %}
**Create project directory**

**Windows (PowerShell)**

```powershell
mkdir C:\langextract_api
cd C:\langextract_api
```

**Unix**

```bash
sudo mkdir -p /opt/langextract_api
sudo chown $USER /opt/langextract_api
cd /opt/langextract_api
```

{% endstep %}

{% step %}
**Create and activate virtual environment**

**Windows**

```powershell
python -m venv langextract_env
langextract_env\Scripts\activate
```

**Unix**

```bash
python3 -m venv langextract_env
source langextract_env/bin/activate
```
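To confirm that the activation took effect, one option is to ask the interpreter itself. This is a minimal sketch that relies only on the standard library; the function name `in_virtualenv` is illustrative and not part of the setup files.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter path:", sys.executable)
```

If the check prints `False`, the shell is most likely still using the system Python and the activation command needs to be run again.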

{% endstep %}

{% step %}
**Install dependencies**

Identical on both platforms:

```
pip install fastapi uvicorn langextract ollama
```
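A quick way to check that the install landed in the active environment is to look the packages up without importing them. The sketch below assumes the import names match the pip package names listed above; `missing_modules` is a hypothetical helper name.

```python
import importlib.util

# Import names assumed to match the packages installed in this step.
REQUIRED = ("fastapi", "uvicorn", "langextract", "ollama")


def missing_modules(names):
    """Return the module names that cannot be located on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("missing packages:", ", ".join(missing) if missing else "none")
```

Run it with the virtual environment activated; any name it reports can be installed again with `pip install`.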

{% endstep %}

{% step %}
**Save the script**

Save `server.py` into the project directory:

* Windows: `C:\langextract_api\server.py`
* Unix: `/opt/langextract_api/server.py`

{% file src="/files/aXdCRtlNr5IBrqrUPCy5" %}
{% endstep %}

{% step %}
**Pull the Ollama model**

Identical on both platforms:

```
ollama pull llama3.1:8b
```
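To confirm that the model is available, one option is to ask Ollama's local HTTP API which models it holds. This is a sketch under two assumptions: Ollama is listening on its default port `11434`, and `GET /api/tags` returns a `models` list in which each entry carries a `name`. The helper names are illustrative.

```python
import json
import urllib.request

# Assumption: Ollama's default local address.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def model_names(tags: dict) -> list:
    """Extract model names from a parsed /api/tags response."""
    return [model.get("name", "") for model in tags.get("models", [])]


def has_model(tags: dict, model: str) -> bool:
    """Return True when the given model name appears in the tags response."""
    return model in model_names(tags)


def fetch_tags(url: str = OLLAMA_TAGS_URL, timeout: float = 5.0) -> dict:
    """Fetch and parse the list of locally available models from Ollama."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Usage, once Ollama is running:
#   has_model(fetch_tags(), "llama3.1:8b")
```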

{% endstep %}

{% step %}
**Start Ollama**

Identical on both platforms:

```
ollama serve
```

Leave `ollama serve` running in a separate terminal, or run Ollama as a background service. The next step shows the same approach for the LangExtract API.
{% endstep %}

{% step %}
**Start the LangExtract service**

**Windows**

```powershell
cd C:\langextract_api
langextract_env\Scripts\activate
uvicorn server:app --host 0.0.0.0 --port 8765
```

**Unix**

```bash
cd /opt/langextract_api
source langextract_env/bin/activate
uvicorn server:app --host 0.0.0.0 --port 8765
```

**Run as a background service**

**Windows - NSSM**

```powershell
# Download nssm from nssm.cc, then:
nssm install LangExtractAPI "C:\langextract_api\langextract_env\Scripts\uvicorn.exe"
nssm set LangExtractAPI AppParameters "server:app --host 0.0.0.0 --port 8765"
nssm set LangExtractAPI AppDirectory "C:\langextract_api"
nssm start LangExtractAPI
```

**Unix - systemd**

Create `/etc/systemd/system/langextract.service`:

```ini
[Unit]
Description=LangExtract API Service
After=network.target

[Service]
Type=simple
User=YOUR_USER
WorkingDirectory=/opt/langextract_api
ExecStart=/opt/langextract_api/langextract_env/bin/uvicorn server:app --host 0.0.0.0 --port 8765
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Then enable and start:

```bash
sudo systemctl daemon-reload
sudo systemctl enable langextract
sudo systemctl start langextract
sudo systemctl status langextract
```
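When the service is started from a script, it can help to wait until the API answers before sending work to it. The sketch below polls the `/health` endpoint used later in this guide; the retry count and delay are arbitrary defaults, and the helper names are illustrative.

```python
import time
import urllib.error
import urllib.request


def wait_until(check, attempts: int = 10, delay: float = 1.0) -> bool:
    """Call check() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        # Sleep between attempts, but not after the last one.
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


def health_ok(url: str = "http://localhost:8765/health") -> bool:
    """Return True when the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


# Usage after `systemctl start` or `nssm start`:
#   wait_until(health_ok, attempts=15, delay=2.0)
```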

{% endstep %}

{% step %}
**Open firewall**

**Windows**

```powershell
New-NetFirewallRule -DisplayName "LangExtract API" -Direction Inbound -Protocol TCP -LocalPort 8765 -Action Allow
```

**Unix (ufw)**

```bash
sudo ufw allow 8765/tcp
```

**Unix (firewalld / RHEL)**

```bash
sudo firewall-cmd --permanent --add-port=8765/tcp
sudo firewall-cmd --reload
```

{% endstep %}

{% step %}
**Verify**

Identical on both platforms:

```
curl http://localhost:8765/health
```

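Response, as printed by Windows PowerShell, where `curl` is an alias for `Invoke-WebRequest`. On Unix, or with `curl.exe` on Windows, only the body `{"status":"ok"}` is printed:

```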
StatusCode        : 200
StatusDescription : OK
Content           : {"status":"ok"}
RawContent        : HTTP/1.1 200 OK
                    Content-Length: 15
                    Content-Type: application/json
                    Date: Tue, 07 Apr 2026 09:35:37 GMT
                    Server: uvicorn

                    {"status":"ok"}
Forms             : {}
Headers           : {[Content-Length, 15], [Content-Type, application/json], [Date, Tue, 07 Apr 2026 09:35:37 GMT],
                    [Server, uvicorn]}
Images            : {}
InputFields       : {}
Links             : {}
ParsedHtml        : mshtml.HTMLDocumentClass
RawContentLength  : 15
```

Swagger UI (browser): `http://localhost:8765/docs`

***

**Verify with a test request**

**Windows**

Run a test extraction:

```powershell
curl.exe -X POST http://localhost:8765/extract `
  -H "Content-Type: application/json" `
  -d '{\"text\": \"Jane Smith cannot log into the VPN. Error code VPN-403. This is urgent.\", \"prompt\": \"Extract the user, system, issue, urgency, and error code.\", \"examples\": [{\"text\": \"Raj Patel cannot access SAP. Error code ERP-991. Critical issue.\", \"extractions\": [{\"extraction_class\": \"user\", \"extraction_text\": \"Raj Patel\"}, {\"extraction_class\": \"system\", \"extraction_text\": \"SAP\"}, {\"extraction_class\": \"error_code\", \"extraction_text\": \"ERP-991\"}, {\"extraction_class\": \"urgency\", \"extraction_text\": \"Critical\"}]}], \"model_id\": \"llama3.1:8b\"}'
```

Response:

```json
{"extractions":[{"class":"user","text":"Jane Smith","char_interval":{"start_pos":0,"end_pos":10}},
{"class":"system","text":"VPN","char_interval":{"start_pos":31,"end_pos":34}},
{"class":"issue","text":"cannot log into the VPN","char_interval":{"start_pos":11,"end_pos":34}},
{"class":"urgency","text":"Urgent","char_interval":{"start_pos":64,"end_pos":70}},
{"class":"error_code","text":"VPN-403","char_interval":{"start_pos":47,"end_pos":54}}]}
```

**Unix**

Run a test extraction:

```bash
curl -X POST http://localhost:8765/extract \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Jane Smith cannot log into the VPN. Error code VPN-403. This is urgent.",
    "prompt": "Extract the user, system, issue, urgency, and error code.",
    "examples": [
      {
        "text": "Raj Patel cannot access SAP. Error code ERP-991. Critical issue.",
        "extractions": [
          {"extraction_class": "user", "extraction_text": "Raj Patel"},
          {"extraction_class": "system", "extraction_text": "SAP"},
          {"extraction_class": "error_code", "extraction_text": "ERP-991"},
          {"extraction_class": "urgency", "extraction_text": "Critical"}
        ]
      }
    ],
    "model_id": "llama3.1:8b"
  }'
```

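Response (abbreviated to the first extraction; the shape matches the Windows response above):

```json
{
  "extractions": [
    {
      "class": "user",
      "text": "Jane Smith",
      "char_interval": {
        "start_pos": 0,
        "end_pos": 10
      }
    }
  ]
}
```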

Success means the API returns an `extractions` array with extracted values and character offsets.
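The same test request can be sent from Python, which is convenient for scripted smoke tests. This is a minimal sketch using only the standard library; the payload fields mirror the curl examples above, while the helper names and the 120-second timeout are assumptions (local models can be slow to answer).

```python
import json
import urllib.request

EXTRACT_URL = "http://localhost:8765/extract"


def build_payload(text, prompt, examples, model_id="llama3.1:8b"):
    """Assemble the request body in the shape used by the curl examples."""
    return {"text": text, "prompt": prompt, "examples": examples, "model_id": model_id}


def post_extract(payload, url=EXTRACT_URL, timeout=120.0):
    """POST the payload as JSON and return the parsed response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


def looks_successful(response) -> bool:
    """Return True when the response holds a non-empty `extractions` list."""
    extractions = response.get("extractions")
    return isinstance(extractions, list) and len(extractions) > 0


payload = build_payload(
    text="Jane Smith cannot log into the VPN. Error code VPN-403. This is urgent.",
    prompt="Extract the user, system, issue, urgency, and error code.",
    examples=[{
        "text": "Raj Patel cannot access SAP. Error code ERP-991. Critical issue.",
        "extractions": [
            {"extraction_class": "user", "extraction_text": "Raj Patel"},
            {"extraction_class": "error_code", "extraction_text": "ERP-991"},
        ],
    }],
)

# With the service running:
#   looks_successful(post_extract(payload))
```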

**Call the API from PDI**

Use the **HTTP Client** step:

* **URL:** `http://localhost:8765/extract`
* **Method:** `POST`
* **Content-Type:** `application/json`
* **Request body field:** your JSON payload
* **Response field:** `response_json`

Then parse the response with **JSON Input**:

* **Source is from a field:** `response_json`
* **Path:** `$.extractions[*]`
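* **Fields:** `class`, `text`, and the character offsets (`char_interval.start_pos` and `char_interval.end_pos` in the full sample response above)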

{% hint style="success" %}
You can now chain the response into steps such as **Select Values**, **Row Normaliser**, or **Table Output**.
{% endhint %}
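For reference, the flattening that the **JSON Input** step performs can be sketched in Python, which is useful for checking a response outside PDI. The function name is illustrative, and the code accepts offsets either nested under `char_interval` or as top-level `start` and `end` keys, since both shapes appear in the sample responses in this guide.

```python
import json


def flatten_extractions(response_json: str) -> list:
    """Turn the API response into one flat row per extraction."""
    rows = []
    for item in json.loads(response_json).get("extractions", []):
        # Offsets may be nested under char_interval or sit at the top level.
        interval = item.get("char_interval") or {}
        rows.append({
            "class": item.get("class"),
            "text": item.get("text"),
            "start": interval.get("start_pos", item.get("start")),
            "end": interval.get("end_pos", item.get("end")),
        })
    return rows


if __name__ == "__main__":
    sample = (
        '{"extractions":[{"class":"user","text":"Jane Smith",'
        '"char_interval":{"start_pos":0,"end_pos":10}}]}'
    )
    for row in flatten_extractions(sample):
        print(row)
```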
{% endstep %}
{% endstepper %}

***

**PostgreSQL**

```sql
CREATE SCHEMA IF NOT EXISTS staging;

-- Scenario 1
CREATE TABLE staging.ticket_triage (
  ticket_id     VARCHAR(20) PRIMARY KEY,
  issue_type    VARCHAR(100),
  system        VARCHAR(100),
  urgency       VARCHAR(20),
  reported_by   VARCHAR(100),
  error_code    VARCHAR(100),
  extracted_at  TIMESTAMP DEFAULT NOW()
);

-- Scenario 2
CREATE TABLE staging.patient_extractions (
  id             SERIAL PRIMARY KEY,
  patient_id     VARCHAR(20),
  extract_class  VARCHAR(30),
  extract_text   TEXT,
  char_start     INT,
  char_end       INT,
  extracted_at   TIMESTAMP DEFAULT NOW()
);

-- Scenario 3
CREATE TABLE staging.clause_details (
  id              SERIAL PRIMARY KEY,
  contract_id     VARCHAR(50),
  clause_class    VARCHAR(50),
  clause_text     TEXT,
  char_start      INT,
  char_end        INT,
  extracted_at    TIMESTAMP DEFAULT NOW()
);
CREATE TABLE staging.contract_master (
  contract_id       VARCHAR(50) PRIMARY KEY,
  party_a           TEXT,
  party_b           TEXT,
  effective_date    VARCHAR(100),
  termination       TEXT,
  payment_terms     TEXT,
  liability_cap     TEXT,
  governing_law     TEXT,
  validation_status CHAR(1) DEFAULT 'Y',
  extracted_at      TIMESTAMP DEFAULT NOW()
);

```
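As an illustration of how extractions could land in `staging.ticket_triage`, the sketch below pivots one response into a single row and builds a parameterized insert. The class-to-column mapping and the ticket ID are assumptions made for this example, and the `%s` placeholders follow the style used by drivers such as psycopg; no database connection is opened here.

```python
# Assumed mapping from extraction class to ticket_triage column.
CLASS_TO_COLUMN = {
    "issue": "issue_type",
    "system": "system",
    "urgency": "urgency",
    "user": "reported_by",
    "error_code": "error_code",
}


def ticket_row(ticket_id, extractions):
    """Pivot a list of extractions into one row keyed by column name."""
    row = {"ticket_id": ticket_id}
    for item in extractions:
        column = CLASS_TO_COLUMN.get(item.get("class"))
        # Ignore unknown classes and keep the first value seen per column.
        if column and column not in row:
            row[column] = item.get("text")
    return row


def insert_statement(row):
    """Build an INSERT with placeholders plus the matching parameter tuple."""
    columns = list(row)
    placeholders = ", ".join(["%s"] * len(columns))
    sql = (
        f"INSERT INTO staging.ticket_triage ({', '.join(columns)}) "
        f"VALUES ({placeholders})"
    )
    return sql, tuple(row[column] for column in columns)


# Usage with a DB-API cursor (hypothetical ticket ID):
#   sql, params = insert_statement(ticket_row("T-1001", extractions))
#   cursor.execute(sql, params)
```

Column names come only from the fixed mapping, so they are safe to place in the statement text; the extracted values always travel as parameters.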

***

**Troubleshooting**

{% hint style="warning" %}
Common issues:

* `ModuleNotFoundError: No module named 'ollama'`\
  Activate the virtual environment, then run `pip install ollama`.
* Connection refused on port `8765`\
  Confirm that Uvicorn is running.
* Empty or weak extractions\
  Improve the prompt and few-shot examples.
* Model not found\
  Run `ollama pull llama3.1:8b` again.
  {% endhint %}
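To tell a stopped service apart from a firewall or model problem, a small port probe can help. This sketch checks whether anything is listening on the API port `8765` and on `11434`, which is assumed to be Ollama's default port; the helper names are illustrative.

```python
import socket

# Port 8765 comes from this guide; 11434 is assumed to be Ollama's default.
SERVICES = {"LangExtract API (Uvicorn)": 8765, "Ollama": 11434}


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True when a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def diagnose(host: str = "localhost") -> dict:
    """Report which of the expected services accept connections."""
    return {name: port_open(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, is_up in diagnose().items():
        print(f"{name}: {'listening' if is_up else 'not reachable'}")
```

Running it from the PDI host with the server's address instead of `localhost` also shows whether the firewall rule is in effect.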


