At AltaML, I spent several months immersed in impactful machine learning projects, tackling real-world challenges at the intersection of data science, software engineering, and public infrastructure. Below, I reflect on some key challenges and insights from my experience.
Document Classification: Challenges and Insights
My first major project involved information classification for a large enterprise client with strict regulatory requirements around document retention. The core challenge was classifying documents whose textual content was often incomplete, with an extremely limited dataset for training and validation.
Why LLMs Struggle with Classification
Initially, we tried several approaches:
- Classical ML (TF-IDF & doc2vec): Surprisingly, TF-IDF outperformed doc2vec by roughly 10% in accuracy. Despite doc2vec embedding richer semantic information, our limited dataset made semantic models less effective. This highlighted that:
  - Small datasets severely limit the effectiveness of semantic vectorization.
  - Simpler methods like TF-IDF can excel in limited-data contexts by effectively capturing implicit document characteristics.
- DistilBERT: This model performed only marginally better than the classical methods. Though powerful in general applications, DistilBERT struggled here due to limited training data and insufficient domain-specific semantic context.
- OpenAI’s GPT-4o: Prompt-based classification offered mixed results (a minimal prompt sketch follows this list). GPT-4o handled explicit textual content well but failed when classification hinged on implicit contextual information, such as the document’s origin rather than its explicit contents.
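For concreteness, the prompt-based approach looked roughly like the sketch below, using the openai Python client. The label set and prompt wording here are illustrative placeholders, not our production prompt:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical retention labels, for illustration only.
LABELS = ["retain-long-term", "retain-short-term", "transitory"]

def classify(document_text: str) -> str:
    """Ask the model to pick exactly one retention label for a document."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You are a records-retention classifier. "
                        f"Reply with exactly one label from: {', '.join(LABELS)}."},
            {"role": "user", "content": document_text[:8000]},  # truncate very long documents
        ],
    )
    return response.choices[0].message.content.strip()
```

The limitation is structural: the prompt only ever sees the text it is given, so any signal that lives outside the text, such as which system the document came from, is invisible to the model.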
Interestingly, classical models like TF-IDF implicitly captured hidden contextual signals about document origins, outperforming the more sophisticated language models. This was an unexpected but valuable finding: TF-IDF vectorization and classical models can learn contextual cues from the training data that a prompt-based LLM like GPT-4o never sees.
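As an illustration, the classical baseline amounts to little more than a scikit-learn pipeline along these lines. The classifier choice and hyperparameters are placeholders rather than the tuned setup, and docs/labels stand in for the small labelled corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# docs: list[str], labels: list[str] -- the small labelled dataset, loaded elsewhere.

# TF-IDF over word n-grams feeding a linear classifier: a strong baseline
# when only a limited number of labelled documents are available.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True)),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

scores = cross_val_score(model, docs, labels, cv=5, scoring="accuracy")
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Part of why this works is mundane: letterheads, form field names, and system-generated boilerplate all become ordinary TF-IDF features, so the classifier picks up a document's origin without anyone modelling it explicitly.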
Key Takeaways:
- Simple methods may outperform complex ones with limited data.
- Implicit contextual cues can drive classification accuracy in surprising ways.
Capital Planning and Geographic Privacy
My second project revolved around capital planning for facility infrastructure. The primary business goal was to enable stakeholders to quickly run “what-if” scenarios to right-size a facility and plan its long-term usage (typically targeting an 85% utilization rate). This required a tool that not only modeled usage but also supported collaboration among diverse stakeholders during planning.
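To make the 85% target concrete, the arithmetic at the heart of a what-if scenario is roughly the following. This is a toy sketch; the real tool wrapped this in demand forecasting and a collaborative interface:

```python
import math

TARGET_UTILIZATION = 0.85

def required_capacity(projected_demand: float, target: float = TARGET_UTILIZATION) -> int:
    """Smallest whole-unit capacity that keeps utilization at or below the target."""
    return math.ceil(projected_demand / target)

def utilization(projected_demand: float, capacity: float) -> float:
    return projected_demand / capacity

# What-if: projected demand of 1,240 usage units.
cap = required_capacity(1240)                 # 1459
print(cap, f"{utilization(1240, cap):.1%}")   # 1459 85.0%
```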
Data Privacy and Geographic Granularity
Geographic accuracy had to be balanced against user privacy. Initially, we used postal codes as proxies for location. However:
- Urban postal codes were small enough to provide high accuracy.
- Rural postal codes covered much larger areas, so distances computed from their centroids were significantly distorted.
To protect privacy, especially against the risk of identifying individuals in low-density areas, we employed Uber’s H3 hexagonal spatial indexing. Hex tiling allowed dynamic scaling:
- Smaller hexes in dense areas for precision.
- Larger hexes in sparse areas for privacy.
This flexible approach ensured both accurate geographic representation and robust privacy protection, while giving decision-makers the spatial insights they needed for strategic facility planning.
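A minimal sketch of the indexing step with the h3-py package (v4 function names; in v3 the equivalents are geo_to_h3 and h3_to_geo). The resolutions and the dense/sparse rule here are illustrative, not the project's actual policy:

```python
import h3

def anonymize_location(lat: float, lng: float, dense_area: bool) -> str:
    """Map an exact coordinate to an H3 cell, coarser in sparse areas for privacy."""
    resolution = 8 if dense_area else 5   # roughly 0.7 km^2 vs 250 km^2 cells on average
    return h3.latlng_to_cell(lat, lng, resolution)

# Downstream distance calculations use cell centroids instead of raw coordinates.
cell = anonymize_location(53.5461, -113.4938, dense_area=True)   # example: central Edmonton
print(cell, h3.cell_to_latlng(cell))
```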
ML in Production: Testing and Velocity Trade-offs
Building machine learning solutions for production requires balancing rapid iteration against long-term reliability and maintainability.
Our Architecture
We leveraged Azure Machine Learning extensively, with:
- Data ingestion and preprocessing stored in Azure SQL databases and Blob Storage.
- Model training automated via Azure pipelines.
- Registered models served through backend APIs, triggered asynchronously from a frontend interface.
This architecture resembled a microservices system: numerous small, interdependent components with strict interface contracts.
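For illustration, a pipeline of this shape can be wired up with the azure-ai-ml (v2) SDK roughly as follows. The component names, scripts, environments, datastore paths, and compute targets below are placeholders rather than our actual configuration:

```python
from azure.ai.ml import Input, MLClient, Output, command
from azure.ai.ml.dsl import pipeline
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Each step is a small command component with an explicit input/output contract.
preprocess = command(
    name="preprocess",
    code="./src",
    command="python preprocess.py --raw ${{inputs.raw}} --out ${{outputs.clean}}",
    environment="azureml:training-env:1",
    inputs={"raw": Input(type="uri_folder")},
    outputs={"clean": Output(type="uri_folder")},
)

train = command(
    name="train",
    code="./src",
    command="python train.py --data ${{inputs.clean}} --model ${{outputs.model}}",
    environment="azureml:training-env:1",
    inputs={"clean": Input(type="uri_folder")},
    outputs={"model": Output(type="uri_folder")},
)

@pipeline(default_compute="cpu-cluster")
def training_pipeline(raw_data):
    prep = preprocess(raw=raw_data)
    fit = train(clean=prep.outputs.clean)
    return {"model": fit.outputs.model}

job = training_pipeline(
    raw_data=Input(type="uri_folder", path="azureml://datastores/blob_raw/paths/documents/")
)
ml_client.jobs.create_or_update(job, experiment_name="facility-planning-training")
```

Because every step declares its inputs and outputs explicitly, the interface contracts mentioned above are enforced by the pipeline definition itself.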
Challenges of Integration Testing
To move fast, we initially skipped deep integration testing—only to run into:
- Hidden regressions: For example, using circular radii as stand-ins for real service-area polygons during testing masked serious downstream issues (see the sketch below).
- Interdependent failures: Changes in one pipeline component broke others due to undocumented dependencies.
These caused late-cycle firefighting and reduced productivity.
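The radius-versus-polygon regression is a good example of what a single integration-style test would have caught. Here is a hedged sketch with shapely; the geometry, names, and facilities are invented for illustration:

```python
from shapely.geometry import Point, Polygon

def facilities_in_range(facilities: dict[str, Point], service_area: Polygon) -> set[str]:
    """Downstream check: which facilities fall inside the service-area geometry?"""
    return {name for name, point in facilities.items() if service_area.contains(point)}

def test_radius_stub_is_not_equivalent_to_real_polygon():
    # A long, narrow service area, e.g. one that follows a highway corridor.
    corridor = Polygon([(0, 0), (10, 0), (10, 1), (0, 1)])
    # The circular stand-in used in early test fixtures.
    radius_stub = Point(5, 0.5).buffer(5.0)

    facilities = {"north_depot": Point(5, 4), "east_depot": Point(9, 0.5)}

    # The stub claims north_depot is covered; the real polygon says it is not.
    assert "north_depot" in facilities_in_range(facilities, radius_stub)
    assert "north_depot" not in facilities_in_range(facilities, corridor)
    # Both geometries agree on east_depot, which is why the simplified fixtures looked fine.
    assert "east_depot" in facilities_in_range(facilities, corridor)
```

Running even a handful of tests against realistic geometries, rather than convenient stand-ins, would have surfaced the mismatch long before it reached downstream components.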
Unit Testing and Velocity
The same trade-off appeared with unit testing:
- Skipping them boosted short-term individual productivity.
- But regressions and integration issues slowed overall team velocity later.
Ultimately, I learned that:
- Unit tests reduce short-term speed but significantly boost long-term team velocity by preventing regressions and reducing integration friction.
- For long-term projects, testing is a non-negotiable investment; in short-term prototypes, the cost-benefit may differ.
Reflective Insights
My time at AltaML taught me valuable lessons:
- Simplicity in ML: In limited-data scenarios, simpler methods can outperform advanced models.
- Geographic data privacy: Flexible indexing methods like H3 are essential for balancing accuracy and privacy.
- Testing discipline: Integration and unit testing are key to long-term speed and stability.