CONTRIBUTING.md 3.4 KB

Contributing to Structured Document Parser

Thank you for your interest in contributing to the Structured Document Parser! This document provides guidelines and instructions for contributors.

Development Setup

  1. Fork the repository and clone it locally.
  2. Install dependencies: bash pip install -r requirements.txt
  3. Install development dependencies: bash pip install pytest black flake8 mypy

Project Structure

structured_parser/
├── src/
│   ├── structured_extraction.py  # Main entry point
│   ├── utils.py                  # Utility functions
│   ├── typedicts.py              # Type definitions
│   ├── json_to_sql.py            # Database integration
│   └── config.yaml               # Configuration
├── tests/                        # Test cases
├── README.md                     # Project overview
└── requirements.txt              # Dependencies

Development Workflow

Code Style

We follow the PEP 8 style guide for Python code. Please use black for formatting:

black src/

Type Checking

We use type hints and mypy for type checking:

mypy src/

Testing

Please add tests for new features and ensure all tests pass:

pytest tests/

Areas for Contribution

Here are some areas where contributions are especially welcome:

1. Artifact Extraction Improvements

  • Adding support for new artifact types
  • Improving extraction accuracy for existing types
  • Optimizing prompts for better results

2. Performance Optimization

  • Improving inference speed
  • Reducing memory usage
  • Implementing efficient batching strategies

3. New Features

  • Supporting additional document types (beyond PDF)
  • Adding new output formats
  • Implementing document comparison functionality
  • Enhancing vector search capabilities

4. Documentation and Examples

  • Improving documentation
  • Adding usage examples
  • Creating tutorials or guides

Submitting Changes

  1. Create a new branch for your changes
  2. Make your changes and commit with clear commit messages
  3. Push your branch and submit a pull request
  4. Ensure CI tests pass

Pull Request Guidelines

  • Provide a clear description of the problem and solution
  • Include any relevant issue numbers
  • Add tests for new functionality
  • Update documentation as needed
  • Keep pull requests focused on a single topic

Prompt Engineering Guidelines

When modifying prompts in config.yaml, consider:

  1. Clarity: Provide clear and specific instructions
  2. Examples: Include examples where helpful
  3. Structure: Use structured formatting to guide the model
  4. Schema alignment: Ensure prompts align with output schemas
  5. Testing: Test prompts with diverse document types

Output Schema Guidelines

When defining output schemas:

  1. Keep properties focused and well-defined
  2. Use descriptive field names
  3. Include descriptions for complex fields
  4. Consider required vs. optional fields carefully
  5. Test schemas with different document layouts

License

By contributing, you agree that your contributions will be licensed under the project's license.

Questions or Issues?

If you have questions or encounter issues, please:

  1. Check existing issues to see if it's been addressed
  2. Open a new issue with a clear description and steps to reproduce
  3. Tag relevant project maintainers

Thank you for your contributions!