# LangExtract

{% hint style="success" %}
This setup gives you a local LangExtract API backed by Ollama.

At the end, you will have:

* LangExtract installed in a Python virtual environment
* A FastAPI service running on port `8765`
* A test request that returns structured extractions
* A local endpoint ready for use from PDI
  {% endhint %}

{% hint style="info" %}
Each step gives commands for Windows (PowerShell) and for Unix (macOS or Linux).

Where a step shows a single command, it is identical on both platforms.
{% endhint %}

**Architecture**

`PDI → HTTP Client → LangExtract API → Ollama model → JSON response`

**Prerequisites**

<table><thead><tr><th valign="top">Component</th><th valign="top">Version</th><th valign="top">Notes</th></tr></thead><tbody><tr><td valign="top">Pentaho Data Integration</td><td valign="top">9.x</td><td valign="top">Community or EE. Spoon and Pan available.</td></tr><tr><td valign="top">Python</td><td valign="top">3.10+</td><td valign="top">Requires <code>pip</code> and <code>venv</code>.</td></tr><tr><td valign="top">LangExtract</td><td valign="top">Current PyPI release</td><td valign="top">Installed with <code>pip</code>. Source: <code>google/langextract</code>.</td></tr><tr><td valign="top">FastAPI</td><td valign="top">Current</td><td valign="top">REST wrapper for LangExtract.</td></tr><tr><td valign="top">Uvicorn</td><td valign="top">Current</td><td valign="top">ASGI server for the API.</td></tr><tr><td valign="top">Ollama</td><td valign="top">0.3+</td><td valign="top">Local LLM runtime.</td></tr><tr><td valign="top">Ollama Python package</td><td valign="top">Current</td><td valign="top">Required by the sample service code.</td></tr><tr><td valign="top">Model</td><td valign="top"><code>llama3.1:8b</code></td><td valign="top">Pulled locally with Ollama.</td></tr><tr><td valign="top">PostgreSQL</td><td valign="top">Optional</td><td valign="top">Use only if you plan to persist extracted records.</td></tr></tbody></table>

{% stepper %}
{% step %}
**Create project directory**

**Windows (PowerShell)**

```powershell
mkdir C:\langextract_api
cd C:\langextract_api
```

**Unix**

```bash
sudo mkdir -p /opt/langextract_api
sudo chown $USER /opt/langextract_api
cd /opt/langextract_api
```

{% endstep %}

{% step %}
**Create and activate virtual environment**

**Windows**

```powershell
python -m venv langextract_env
langextract_env\Scripts\activate
```

**Unix**

```bash
python3 -m venv langextract_env
source langextract_env/bin/activate
```

{% endstep %}

{% step %}
**Install dependencies**

Identical on both platforms:

```
pip install fastapi uvicorn langextract ollama
```
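
To confirm that the packages landed in the active virtual environment, a quick check like the following can help. It is a minimal sketch using only the standard library; the names `in_virtualenv` and `missing_packages` are illustrative and not part of LangExtract.

```python
import importlib.util
import sys

# Packages installed by the pip command above.
REQUIRED = ["fastapi", "uvicorn", "langextract", "ollama"]


def in_virtualenv() -> bool:
    """Return True when this interpreter runs inside a venv."""
    return sys.prefix != sys.base_prefix


def missing_packages(names):
    """Return the top-level package names this interpreter cannot import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("missing packages:", missing_packages(REQUIRED) or "none")
```

Run it with the same `python` that the virtual environment provides; a non-empty list points to packages installed into a different interpreter.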

{% endstep %}

{% step %}
**Save the script**

Download the file below and save it as `server.py` in the project directory:

* Windows: `C:\langextract_api\server.py`
* Unix: `/opt/langextract_api/server.py`

{% file src="<https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2FUHva1pIKCVUb6ea51aAu%2Fserver.py?alt=media&token=ddbacaa4-a9c6-4602-99ec-fb76e114ee27>" %}
{% endstep %}

{% step %}
**Pull the Ollama model**

Identical on both platforms:

```
ollama pull llama3.1:8b
```

{% endstep %}

{% step %}
**Start Ollama**

Identical on both platforms:

```
ollama serve
```

Leave this running in a separate terminal, or run Ollama as a background service. Step 7 shows the same pattern for the LangExtract API.
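
Before starting the API, it can be useful to confirm that Ollama is reachable and that the model has been pulled. The sketch below assumes Ollama listens on its default address `http://localhost:11434` and reads its `/api/tags` endpoint, which lists local models. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama runs on its default host and port.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def model_names(tags_payload: dict) -> list:
    """Extract model names from an /api/tags response body."""
    return [m.get("name", "") for m in tags_payload.get("models", [])]


def model_is_pulled(model: str, url: str = OLLAMA_TAGS_URL, timeout: float = 5.0) -> bool:
    """Return True if Ollama answers and lists the given model."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return False
    return isinstance(payload, dict) and model in model_names(payload)
```

For example, `model_is_pulled("llama3.1:8b")` returning `False` suggests either that `ollama serve` is not running or that the pull in the previous step did not complete.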
{% endstep %}

{% step %}
**Start the LangExtract service**

**Windows**

```powershell
cd C:\langextract_api
langextract_env\Scripts\activate
uvicorn server:app --host 0.0.0.0 --port 8765
```

**Unix**

```bash
cd /opt/langextract_api
source langextract_env/bin/activate
uvicorn server:app --host 0.0.0.0 --port 8765
```

**Run as a background service**

**Windows - NSSM**

```powershell
# Download nssm from nssm.cc, then:
nssm install LangExtractAPI "C:\langextract_api\langextract_env\Scripts\uvicorn.exe"
nssm set LangExtractAPI AppParameters "server:app --host 0.0.0.0 --port 8765"
nssm set LangExtractAPI AppDirectory "C:\langextract_api"
nssm start LangExtractAPI
```

**Unix - systemd**

Create `/etc/systemd/system/langextract.service`:

```ini
[Unit]
Description=LangExtract API Service
After=network.target

[Service]
Type=simple
User=YOUR_USER
WorkingDirectory=/opt/langextract_api
ExecStart=/opt/langextract_api/langextract_env/bin/uvicorn server:app --host 0.0.0.0 --port 8765
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Then enable and start:

```bash
sudo systemctl daemon-reload
sudo systemctl enable langextract
sudo systemctl start langextract
sudo systemctl status langextract
```

{% endstep %}

{% step %}
**Open firewall**

**Windows**

```powershell
New-NetFirewallRule -DisplayName "LangExtract API" -Direction Inbound -Protocol TCP -LocalPort 8765 -Action Allow
```

**Unix (ufw)**

```bash
sudo ufw allow 8765/tcp
```

**Unix (firewalld / RHEL)**

```bash
sudo firewall-cmd --permanent --add-port=8765/tcp
sudo firewall-cmd --reload
```

{% endstep %}

{% step %}
**Verify**

Identical on both platforms:

```
curl http://localhost:8765/health
```

A healthy service returns HTTP 200 with the body `{"status":"ok"}`. On Unix, or on Windows with `curl.exe`, only that body is printed. In Windows PowerShell 5.1, `curl` is an alias for `Invoke-WebRequest`, which prints the full response object:

```text
StatusCode        : 200
StatusDescription : OK
Content           : {"status":"ok"}
RawContent        : HTTP/1.1 200 OK
                    Content-Length: 15
                    Content-Type: application/json
                    Date: Tue, 07 Apr 2026 09:35:37 GMT
                    Server: uvicorn

                    {"status":"ok"}
Forms             : {}
Headers           : {[Content-Length, 15], [Content-Type, application/json], [Date, Tue, 07 Apr 2026 09:35:37 GMT],
                    [Server, uvicorn]}
Images            : {}
InputFields       : {}
Links             : {}
ParsedHtml        : mshtml.HTMLDocumentClass
RawContentLength  : 15
```

Swagger UI (browser): `http://localhost:8765/docs`
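
When the service is started from a script or a systemd unit, it may take a moment before the port answers. The following sketch polls the `/health` endpoint until it returns `{"status":"ok"}`. The function names, retry count, and delays are illustrative assumptions.

```python
import json
import time
import urllib.request

HEALTH_URL = "http://localhost:8765/health"


def is_healthy(url: str = HEALTH_URL, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers 200 with {"status": "ok"}."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.load(resp)
            return resp.status == 200 and isinstance(body, dict) and body.get("status") == "ok"
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return False


def wait_until_healthy(url: str = HEALTH_URL, attempts: int = 10, delay: float = 1.0) -> bool:
    """Poll the health endpoint, sleeping between failed attempts."""
    for _ in range(attempts):
        if is_healthy(url):
            return True
        time.sleep(delay)
    return False
```

Calling `wait_until_healthy()` before sending extraction requests avoids spurious connection errors during startup.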

***

**Verify with a test request**

**Windows**

Run a test extraction:

```powershell
curl.exe -X POST http://localhost:8765/extract `
  -H "Content-Type: application/json" `
  -d '{\"text\": \"Jane Smith cannot log into the VPN. Error code VPN-403. This is urgent.\", \"prompt\": \"Extract the user, system, issue, urgency, and error code.\", \"examples\": [{\"text\": \"Raj Patel cannot access SAP. Error code ERP-991. Critical issue.\", \"extractions\": [{\"extraction_class\": \"user\", \"extraction_text\": \"Raj Patel\"}, {\"extraction_class\": \"system\", \"extraction_text\": \"SAP\"}, {\"extraction_class\": \"error_code\", \"extraction_text\": \"ERP-991\"}, {\"extraction_class\": \"urgency\", \"extraction_text\": \"Critical\"}]}], \"model_id\": \"llama3.1:8b\"}'
```

Response:

```json
{"extractions":[{"class":"user","text":"Jane Smith","char_interval":{"start_pos":0,"end_pos":10}},
{"class":"system","text":"VPN","char_interval":{"start_pos":31,"end_pos":34}},
{"class":"issue","text":"cannot log into the VPN","char_interval":{"start_pos":11,"end_pos":34}},
{"class":"urgency","text":"Urgent","char_interval":{"start_pos":64,"end_pos":70}},
{"class":"error_code","text":"VPN-403","char_interval":{"start_pos":47,"end_pos":54}}]}
```

**Unix**

Run a test extraction:

```bash
curl -X POST http://localhost:8765/extract \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Jane Smith cannot log into the VPN. Error code VPN-403. This is urgent.",
    "prompt": "Extract the user, system, issue, urgency, and error code.",
    "examples": [
      {
        "text": "Raj Patel cannot access SAP. Error code ERP-991. Critical issue.",
        "extractions": [
          {"extraction_class": "user", "extraction_text": "Raj Patel"},
          {"extraction_class": "system", "extraction_text": "SAP"},
          {"extraction_class": "error_code", "extraction_text": "ERP-991"},
          {"extraction_class": "urgency", "extraction_text": "Critical"}
        ]
      }
    ],
    "model_id": "llama3.1:8b"
  }'
```

Response (truncated to the first extraction):

```json
{
  "extractions": [
    {
      "class": "user",
      "text": "Jane Smith",
      "char_interval": {"start_pos": 0, "end_pos": 10}
    }
  ]
}
```

Success means the API returns an `extractions` array with extracted values and character offsets.
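
The same request can be sent from Python, which is handy for testing payloads before wiring them into PDI. This sketch assumes the `/extract` endpoint accepts the fields used in the curl examples (`text`, `prompt`, `examples`, `model_id`). The helper names and the 120-second timeout are illustrative choices, not part of `server.py`.

```python
import json
import urllib.request

API_URL = "http://localhost:8765/extract"


def make_example(text, pairs):
    """Build one few-shot example from (extraction_class, extraction_text) pairs."""
    return {
        "text": text,
        "extractions": [
            {"extraction_class": cls, "extraction_text": value} for cls, value in pairs
        ],
    }


def build_payload(text, prompt, examples, model_id="llama3.1:8b"):
    """Assemble the JSON body in the shape used by the curl examples."""
    return {"text": text, "prompt": prompt, "examples": examples, "model_id": model_id}


def extract(payload, url=API_URL, timeout=120.0):
    """POST the payload and return the parsed JSON response.

    The generous timeout is an assumption: local models can be slow.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

A typical call builds one example with `make_example(...)`, passes it in a list to `build_payload(...)`, and hands the result to `extract(...)`.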

**Call the API from PDI**

Use the **HTTP Client** step:

* **URL:** `http://localhost:8765/extract`
* **Method:** `POST`
* **Content-Type:** `application/json`
* **Request body field:** your JSON payload
* **Response field:** `response_json`

Then parse the response with **JSON Input**:

* **Source is from a field:** `response_json`
* **Path:** `$.extractions[*]`
* **Fields:** `class`, `text`, `start` (from `char_interval.start_pos`), `end` (from `char_interval.end_pos`)
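
To see what the **JSON Input** step should produce, the same flattening can be sketched in Python: one row per element of `$.extractions[*]`. The function below is illustrative only. It reads offsets from a nested `char_interval` object and falls back to top-level `start` and `end` keys, since the exact response shape depends on `server.py`.

```python
import json


def flatten_extractions(response_json: str) -> list:
    """Turn the API response into one flat row per extraction."""
    data = json.loads(response_json)
    rows = []
    for item in data.get("extractions", []):
        interval = item.get("char_interval") or {}
        rows.append(
            {
                "class": item.get("class"),
                "text": item.get("text"),
                # Prefer nested offsets; fall back to flat keys if present.
                "start": interval.get("start_pos", item.get("start")),
                "end": interval.get("end_pos", item.get("end")),
            }
        )
    return rows
```

Comparing these rows with the PDI preview is a quick way to check that the field paths in **JSON Input** are correct.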

{% hint style="success" %}
You can now chain the response into steps such as **Select Values**, **Row Normaliser**, or **Table Output**.
{% endhint %}
{% endstep %}
{% endstepper %}

***

**PostgreSQL (optional)**

Create these staging tables only if you plan to persist extracted records.

```sql
CREATE SCHEMA IF NOT EXISTS staging;

-- Scenario 1
CREATE TABLE staging.ticket_triage (
  ticket_id     VARCHAR(20) PRIMARY KEY,
  issue_type    VARCHAR(100),
  system        VARCHAR(100),
  urgency       VARCHAR(20),
  reported_by   VARCHAR(100),
  error_code    VARCHAR(100),
  extracted_at  TIMESTAMP DEFAULT NOW()
);

-- Scenario 2
CREATE TABLE staging.patient_extractions (
  id             SERIAL PRIMARY KEY,
  patient_id     VARCHAR(20),
  extract_class  VARCHAR(30),
  extract_text   TEXT,
  char_start     INT,
  char_end       INT,
  extracted_at   TIMESTAMP DEFAULT NOW()
);

-- Scenario 3
CREATE TABLE staging.clause_details (
  id              SERIAL PRIMARY KEY,
  contract_id     VARCHAR(50),
  clause_class    VARCHAR(50),
  clause_text     TEXT,
  char_start      INT,
  char_end        INT,
  extracted_at    TIMESTAMP DEFAULT NOW()
);
CREATE TABLE staging.contract_master (
  contract_id       VARCHAR(50) PRIMARY KEY,
  party_a           TEXT,
  party_b           TEXT,
  effective_date    VARCHAR(100),
  termination       TEXT,
  payment_terms     TEXT,
  liability_cap     TEXT,
  governing_law     TEXT,
  validation_status CHAR(1) DEFAULT 'Y',
  extracted_at      TIMESTAMP DEFAULT NOW()
);

```
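
As an illustration of how extractions could feed `staging.ticket_triage`, the sketch below pivots a list of extractions into a single row. The mapping from extraction class to column (for example `user` to `reported_by`) is an assumption based on the column names, and the `%(name)s` placeholders follow the pyformat style accepted by PostgreSQL drivers such as psycopg. In PDI, **Table Output** would normally do this instead.

```python
# Assumed mapping from extraction class to ticket_triage column.
COLUMN_FOR_CLASS = {
    "user": "reported_by",
    "system": "system",
    "issue": "issue_type",
    "urgency": "urgency",
    "error_code": "error_code",
}


def ticket_row(ticket_id, extractions):
    """Pivot extractions into one row, keeping the first value per column."""
    row = {"ticket_id": ticket_id}
    row.update({column: None for column in COLUMN_FOR_CLASS.values()})
    for item in extractions:
        column = COLUMN_FOR_CLASS.get(item.get("class"))
        if column and row[column] is None:
            row[column] = item.get("text")
    return row


def insert_statement(row):
    """Build a parameterized INSERT; values are passed separately to the driver."""
    columns = ", ".join(row)
    placeholders = ", ".join(f"%({name})s" for name in row)
    return f"INSERT INTO staging.ticket_triage ({columns}) VALUES ({placeholders})"
```

Passing the statement and the row dictionary to a driver's `execute` call keeps the extracted text out of the SQL string itself.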

***

**Troubleshooting**

{% hint style="warning" %}
Common issues:

* `ModuleNotFoundError: No module named 'ollama'`\
  Activate the virtual environment, then install the package with `pip install ollama`.
* Connection refused on port `8765`\
  Confirm that Uvicorn is running.
* Empty or weak extractions\
  Improve the prompt and few-shot examples.
* Model not found\
  Run `ollama pull llama3.1:8b` again.
  {% endhint %}
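
For connection problems, a quick port check narrows down which service is not listening. This is a minimal sketch: port `8765` comes from the Uvicorn command in this guide, and `11434` is Ollama's default port, assumed unchanged. The function names are illustrative.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(host: str = "localhost") -> dict:
    """Check the two ports this setup depends on."""
    checks = {"LangExtract API (8765)": 8765, "Ollama (11434)": 11434}
    return {name: port_open(host, port) for name, port in checks.items()}


if __name__ == "__main__":
    for name, ok in diagnose().items():
        print(f"{name}: {'listening' if ok else 'not reachable'}")
```

If the API port is reachable but Ollama is not, extraction requests are likely to fail even though `/health` succeeds.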
