Contributing to Structured Document Parser
Thank you for your interest in contributing to the Structured Document Parser! This document provides guidelines and instructions for contributors.
Development Setup
- Fork the repository and clone it locally.
- Install dependencies:
pip install -r requirements.txt
- Install development dependencies:
pip install pytest black flake8 mypy
Project Structure
structured_parser/
├── src/
│ ├── structured_extraction.py # Main entry point
│ ├── utils.py # Utility functions
│ ├── typedicts.py # Type definitions
│ ├── json_to_sql.py # Database integration
│ └── config.yaml # Configuration
├── tests/ # Test cases
├── README.md # Project overview
└── requirements.txt # Dependencies
Development Workflow
Code Style
We follow the PEP 8 style guide for Python code. Please use black for formatting:
black src/
Type Checking
We use type hints and mypy for type checking:
mypy src/
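As a rough illustration of the kind of annotation mypy can verify, here is a minimal sketch. The function `normalize_title` is a hypothetical helper invented for this example and is not part of the project's code.

```python
from typing import Optional


def normalize_title(raw: Optional[str]) -> str:
    """Return a cleaned title, or an empty string when none was extracted.

    Hypothetical helper for illustration only.
    """
    if raw is None:
        return ""
    # Collapse runs of whitespace so titles compare consistently.
    return " ".join(raw.split())
```

Annotating the `None` case explicitly lets mypy flag callers that forget to handle a missing value.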
Testing
Please add tests for new features and ensure all tests pass:
pytest tests/
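A new test file might look roughly like the sketch below. The function `parse_model_output` is a hypothetical stand-in defined inline so the example is self-contained; in a real test you would import the code under test from `src/` instead.

```python
# Hypothetical test module, e.g. tests/test_example.py
import json
from typing import Any, Dict


def parse_model_output(text: str) -> Dict[str, Any]:
    """Hypothetical stand-in for project code: parse a JSON object from
    model output, returning an empty dict when the output is unusable."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return {}
    # Only JSON objects are accepted; lists and scalars are rejected.
    return data if isinstance(data, dict) else {}


def test_valid_object_is_returned():
    assert parse_model_output('{"title": "Report"}') == {"title": "Report"}


def test_invalid_json_yields_empty_dict():
    assert parse_model_output("not json") == {}


def test_non_object_json_yields_empty_dict():
    assert parse_model_output("[1, 2, 3]") == {}
```

pytest collects functions whose names start with `test_`, so covering the failure cases alongside the happy path needs no extra setup.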
Areas for Contribution
Here are some areas where contributions are especially welcome:
1. Artifact Extraction Improvements
- Adding support for new artifact types
- Improving extraction accuracy for existing types
- Optimizing prompts for better results
2. Performance Optimization
- Improving inference speed
- Reducing memory usage
- Implementing efficient batching strategies
3. New Features
- Supporting additional document types (beyond PDF)
- Adding new output formats
- Implementing document comparison functionality
- Enhancing vector search capabilities
4. Documentation and Examples
- Improving documentation
- Adding usage examples
- Creating tutorials or guides
Submitting Changes
- Create a new branch for your changes
- Make your changes and commit with clear commit messages
- Push your branch and submit a pull request
- Ensure CI tests pass
Pull Request Guidelines
- Provide a clear description of the problem and solution
- Include any relevant issue numbers
- Add tests for new functionality
- Update documentation as needed
- Keep pull requests focused on a single topic
Prompt Engineering Guidelines
When modifying prompts in config.yaml, consider:
- Clarity: Provide clear and specific instructions
- Examples: Include examples where helpful
- Structure: Use structured formatting to guide the model
- Schema alignment: Ensure prompts align with output schemas
- Testing: Test prompts with diverse document types
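One way to check schema alignment automatically is sketched below. It assumes a JSON-Schema-style dictionary with a `properties` key and a prompt held as a plain string. Both are assumptions made for illustration, since the actual layout of config.yaml is not described here, and `missing_fields` is a hypothetical helper.

```python
from typing import Any, List, Mapping


def missing_fields(prompt: str, schema: Mapping[str, Any]) -> List[str]:
    """Return schema property names that the prompt never mentions.

    Hypothetical helper; uses a simple substring check.
    """
    properties = schema.get("properties", {})
    return [name for name in properties if name not in prompt]


# Illustrative prompt and schema (not taken from the project).
prompt = "Extract the invoice_number and total_amount from the document."
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "due_date": {"type": "string"},
    },
}

unmentioned = missing_fields(prompt, schema)
```

Here `unmentioned` contains `due_date`, which suggests either the prompt or the schema needs updating. A substring check is deliberately crude and will miss paraphrased field names.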
Output Schema Guidelines
When defining output schemas:
- Keep properties focused and well-defined
- Use descriptive field names
- Include descriptions for complex fields
- Consider required vs. optional fields carefully
- Test schemas with different document layouts
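The points above can be illustrated with a small `TypedDict` sketch. The `Invoice` type and its fields are hypothetical examples, and whether the project's own schemas are written as `TypedDict` classes is an assumption here.

```python
from typing import List, TypedDict


class _InvoiceRequired(TypedDict):
    # Required: every extracted invoice must carry these fields.
    invoice_number: str   # Identifier as printed on the document
    total_amount: float   # Grand total, including tax


class Invoice(_InvoiceRequired, total=False):
    # Optional: may be absent when the document does not state them.
    due_date: str         # ISO 8601 date string, e.g. "2024-01-31"
    line_items: List[str]


# A value with only the required fields is still a valid Invoice.
example: Invoice = {"invoice_number": "INV-001", "total_amount": 120.5}
```

Splitting required and optional fields across two classes keeps the distinction explicit and works on Python versions that predate `typing.NotRequired`.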
License
By contributing, you agree that your contributions will be licensed under the project's license.
Questions or Issues?
If you have questions or encounter issues, please:
- Check existing issues to see whether your question or problem has already been addressed
- Open a new issue with a clear description and steps to reproduce
- Tag relevant project maintainers
Thank you for your contributions!