Our review
This skill indexes reference documents (PDF, Word, etc.) into a RAG corpus for contextual retrieval.
Strengths
- Automatic text extraction from multiple formats
- Seamless integration with RAG for augmented generation
- Semantic search with relevance scores
- Structured document management by categories
Limitations
- Limited to a specific references folder structure (.agentic_sdlc/references)
- Requires pre-configured RAG setup
- PDF extraction quality may vary depending on layout
Use this skill when you need to add or query external reference documents for retrieval-augmented generation.
Do not use it for documents that do not require indexing or when the document volume is very small.
Security analysis
Caution: The skill uses Bash for legitimate document-processing tasks such as extracting text from PDF and Word files. It does not involve network access, exfiltration, or system modification. The risk is low, but the use of Bash warrants caution.
- Uses Bash to execute external commands (pdftotext, Python scripts) for document extraction. These could be misused if given malicious file paths, but the skill itself contains no destructive or obfuscated code.
Examples
/ref-add .agentic_sdlc/references/legal/lei-13775-2018.pdf
/ref-search "prazo de aceite duplicata"
/ref-list
name: reference-indexer
description: |
  Indexes reference documents for use in RAG. Extracts text from PDFs, processes it, and adds it to the corpus. Use when: adding a document, searching for a reference, listing docs.
allowed-tools:
  - Read
  - Write
  - Bash
  - Glob
user-invocable: true
Reference Indexer Skill
Purpose
This skill manages external reference documents, indexing them for use in RAG.
Commands
/ref-add {path}
Adds a document to the reference index:
/ref-add .agentic_sdlc/references/legal/lei-13775-2018.pdf
Actions:
- Validates the file
- Extracts text (if PDF/Word)
- Creates an automatic summary
- Adds it to the RAG corpus
- Updates the index
/ref-search {query}
Searches the reference documents:
/ref-search "prazo de aceite duplicata"
Returns:
- Relevant documents
- Excerpts with context
- Relevance score
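The skill does not say how relevance scores are computed; in a RAG setup they would normally come from embedding similarity. As a rough stand-in, the sketch below ranks chunks by cosine similarity over plain word counts. The `search` function, the sample chunks, and the scoring method are illustrative assumptions, not the skill's implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, chunks, top_k=3):
    """Return up to top_k (score, source, text) tuples, best match first."""
    q = Counter(tokenize(query))
    scored = []
    for source, text in chunks:
        score = cosine(q, Counter(tokenize(text)))
        if score > 0:  # drop chunks that share no words with the query
            scored.append((round(score, 3), source, text))
    return sorted(scored, key=lambda r: r[0], reverse=True)[:top_k]


# Hypothetical corpus chunks: (source path, extracted text)
chunks = [
    ("legal/lei-13775-2018.pdf", "prazo de aceite da duplicata escritural"),
    ("technical/icp-brasil.pdf", "certificado digital e assinatura"),
]
results = search("prazo de aceite duplicata", chunks)
for score, source, text in results:
    print(score, source, text)
```

Word-count similarity only matches exact terms, so a real deployment would swap `cosine` over word counts for similarity over the embeddings the RAG backend generates.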
/ref-list
Lists all indexed documents:
/ref-list
Shows:
- Documents by category
- Indexing status
- Date added
/ref-remove {path}
Removes a document from the index:
/ref-remove .agentic_sdlc/references/legal/documento-antigo.pdf
Supported Formats
| Format | Extension | Extraction Method |
|--------|-----------|-------------------|
| PDF | .pdf | pdftotext / PyPDF2 |
| Word | .docx | python-docx |
| Markdown | .md | Direct |
| Text | .txt | Direct |
| HTML | .html | BeautifulSoup |
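One way to apply the table is to dispatch on the file extension. The sketch below is a minimal illustration under that assumption; `EXTRACTORS`, `extraction_method`, and `read_direct` are hypothetical names, not part of the skill.

```python
from pathlib import Path

# Extension -> extraction method, mirroring the table of supported formats.
EXTRACTORS = {
    ".pdf": "pdftotext / PyPDF2",
    ".docx": "python-docx",
    ".md": "direct",
    ".txt": "direct",
    ".html": "BeautifulSoup",
}


def extraction_method(path):
    """Return the extraction method for a file, or raise if unsupported."""
    ext = Path(path).suffix.lower()
    if ext not in EXTRACTORS:
        raise ValueError(f"Unsupported format: {ext or '(no extension)'}")
    return EXTRACTORS[ext]


def read_direct(path):
    """Read Markdown or plain-text files as UTF-8, with no conversion step."""
    return Path(path).read_text(encoding="utf-8")


print(extraction_method(".agentic_sdlc/references/legal/lei-13775-2018.pdf"))
```

Lowercasing the suffix means `.PDF` and `.pdf` are treated alike; unsupported files fail early instead of reaching an extractor.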
Reference Structure
.agentic_sdlc/references/
├── legal/ # Laws, regulations, standards
├── technical/ # RFCs, technical specifications
├── business/ # Business rules, manuals
├── internal/ # Internal documents
└── _index.yml # Document index
Document Index
The _index.yml file:
index:
  version: 1
  updated_at: "2026-01-12T..."
documents:
  - id: "ref-001"
    path: "legal/lei-13775-2018.pdf"
    title: "Lei 13.775/2018 - Duplicatas Eletrônicas"
    category: legal
    added_at: "2026-01-12T..."
    indexed: true
    summary: "Law regulating book-entry duplicatas (duplicatas escriturais)..."
    keywords:
      - duplicata
      - escritural
      - eletronica
    page_count: 5
  - id: "ref-002"
    path: "technical/icp-brasil.pdf"
    title: "Padrões ICP-Brasil"
    category: technical
    added_at: "2026-01-12T..."
    indexed: true
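Registering a new document amounts to appending an entry with the next sequential id. The sketch below works on an in-memory list standing in for the parsed `documents` section; the helper names, the example path, and the choice to start with `indexed: False` are assumptions made for illustration.

```python
from datetime import datetime, timezone


def next_ref_id(documents):
    """Return the next sequential id in the ref-NNN style used by the index."""
    numbers = [int(d["id"].split("-")[1]) for d in documents]
    return f"ref-{max(numbers, default=0) + 1:03d}"


def add_document(documents, path, title, category):
    """Append a new, not-yet-indexed entry and return it."""
    entry = {
        "id": next_ref_id(documents),
        "path": path,
        "title": title,
        "category": category,
        "added_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "indexed": False,  # assumed to flip to True once the corpus is updated
    }
    documents.append(entry)
    return entry


# In-memory stand-in for the documents already listed in _index.yml
documents = [{"id": "ref-001"}, {"id": "ref-002"}]
entry = add_document(documents, "business/manual.pdf", "Manual", "business")
print(entry["id"])
```

Writing the result back to _index.yml would need a YAML library such as PyYAML, which is left out here to keep the sketch dependency-free.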
Text Extraction
PDF
# Using pdftotext (poppler-utils)
pdftotext -layout input.pdf output.txt
# Using Python
python3 << 'EOF'
import PyPDF2
with open('input.pdf', 'rb') as f:
    reader = PyPDF2.PdfReader(f)
    text = ''
    for page in reader.pages:
        # Guard against pages that yield no text (e.g. scanned images)
        text += (page.extract_text() or '') + '\n'
print(text)
EOF
Word (docx)
from docx import Document
doc = Document('input.docx')
text = '\n'.join([p.text for p in doc.paragraphs])
print(text)
RAG Integration
Indexed documents are added to the RAG corpus:
corpus_entry:
  id: "ref-001"
  source: "references/legal/lei-13775-2018.pdf"
  type: "reference"
  category: "legal"
  content: "{extracted text}"
  embeddings: [...] # Generated by RAG
  metadata:
    title: "Lei 13.775/2018"
    page: 1
    section: "Art. 1"
Indexing Workflow
indexing_workflow:
  1_validate:
    - Check that the format is supported
    - Check size (max 50MB)
    - Check permissions
  2_extract:
    - Extract text from the document
    - Clean up formatting
    - Split into chunks
  3_analyze:
    - Generate automatic summary
    - Extract keywords
    - Classify category
  4_index:
    - Add to the RAG corpus
    - Generate embeddings
    - Update index
  5_verify:
    - Test search
    - Check quality
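The first workflow step (format, size, permissions) can be sketched as a single function that collects problems instead of failing on the first one. The `validate` function and its messages are hypothetical; only the supported extensions and the 50 MB limit come from this document.

```python
import os
import tempfile
from pathlib import Path

SUPPORTED = {".pdf", ".docx", ".md", ".txt", ".html"}
MAX_SIZE_MB = 50


def validate(path, max_size_mb=MAX_SIZE_MB):
    """Return a list of problems; an empty list means the file passes."""
    p = Path(path)
    problems = []
    if p.suffix.lower() not in SUPPORTED:
        problems.append(f"unsupported format: {p.suffix or '(none)'}")
    if not p.is_file():
        problems.append("file not found")
        return problems  # size and permission checks need an existing file
    if p.stat().st_size > max_size_mb * 1024 * 1024:
        problems.append(f"larger than {max_size_mb} MB")
    if not os.access(p, os.R_OK):
        problems.append("file is not readable")
    return problems


# Usage: a small temporary .txt file passes all three checks.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(b"sample reference text")
print(validate(tmp.name))
os.remove(tmp.name)
```

Returning every problem at once lets a command such as /ref-add report all issues in one pass.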
Configuration
In settings.json:
{
  "memory": {
    "rag_corpus": ".agentic_sdlc/corpus",
    "max_document_size_mb": 50,
    "chunk_size": 1000,
    "chunk_overlap": 200
  }
}
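To show how `chunk_size` and `chunk_overlap` interact, here is a minimal sliding-window splitter. It assumes both settings are measured in characters, which the configuration does not state; the skill may count tokens instead, and `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_text("x" * 2500)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

With the defaults, each new chunk advances 800 characters, so a sentence cut at one chunk boundary still appears whole in the neighboring chunk as long as it is shorter than the overlap.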
Best Practices
- Name files descriptively: lei-13775-2018-duplicatas.pdf
- Organize by category: legal, technical, business
- Keep versions: do not overwrite files; version them
- Document the source: record where each document came from
- Summarize long docs: create summaries for large PDFs
Troubleshooting
PDF yields no text
Some PDFs are scanned images. Use OCR:
ocrmypdf input.pdf output.pdf
pdftotext output.pdf -
Document too large
Split it into smaller parts or increase max_document_size_mb.
Incorrect encoding
Force UTF-8 during extraction:
pdftotext -enc UTF-8 input.pdf output.txt