Better data beats clever algorithms, but more data beats better data.
— Peter Norvig
I built a thing. It was fun, and I think it delivered (or hopefully will deliver) value. But it came at a cost which I've become all too familiar with in my industry. It shouldn't be (and doesn't have to be) the norm that data is hard to access. I refer to this as Data Jail. It's easy to get data in, hard to get it back out. And many times, the proverbial bars of the data jail are transparent. You don't know the data is hard to get to until you need it.
Defining 'Data Jail'
Let me start by making sure we're all clear on what I mean by Data Jail. Essentially, Data Jail describes a scenario where data, despite being technically available, is confined within formats that hinder easy access, analysis, and effective use. Common culprits include PDFs and other document formats that aren't designed for seamless data extraction and manipulation.
Some Context into the Problem I'm Solving
Seattle Public Schools (SPS) announced near the end of the 2023/2024 school year that they were unable to overcome a budget shortfall in excess of $100M/year and growing over time. Soon after, a program and analysis were initiated which aimed to identify and close up to 20 of the nearly 70 elementary schools in Seattle.
I'm a parent of one of the kids in one of these elementary schools. And like many of the other parents who were thrust into this program without much warning, I felt frustrated at the lack of open & readily available data, even though the district pointed towards various PDFs available through their website.
Sure, someone could go and copy/paste the data from each of the PDFs, but that's going to take a tremendous amount of time.
Sure, someone could go look at prior analyses which are made available (again, through PDF), but those analyses might only be tangentially relevant.
Sure, someone could request the data as CSVs, but those requests are only supported by 2 part-time people, and the lead time for getting the data is measured in months, not days.
And so I spent some time trying to acquire the data that I believe anyone would need to come to a reasonable conclusion on which schools (if any) to close. Obvious information like Budget, Enrollment and Facilities data, for the past 3 years, by school.
Luckily I didn't have to copy & paste the data manually. Instead, I used Python to scrape the PDFs in order to get a dataset which anyone could use to perform a robust analysis. It still took forever.
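To give a flavor of what "scraping a PDF" involves, here is a minimal sketch of the step that comes after the text has been pulled off a page: turning loosely aligned lines into rows you can analyze. This is not the actual code behind the app. The page text, school names, figures, and the helper names parse_rows and to_csv are all invented for illustration, and the sketch assumes a separate PDF library has already produced the raw text.

```python
import csv
import io
import re

# Hypothetical text as it might look after extracting one PDF page.
# School names and figures are invented for illustration only.
PAGE_TEXT = """\
School Budget Report 2023-24
School Name        Enrollment   Budget
Alder Elementary   312          $3,568,000
Birch Elementary   187          $2,443,500
Page 1 of 4
"""

# A data row: a name, a gap of 2+ spaces, an integer, then a dollar amount.
ROW_PATTERN = re.compile(
    r"^(?P<school>.+?)\s{2,}(?P<enrollment>\d+)\s+\$(?P<budget>[\d,]+)$"
)


def parse_rows(page_text):
    """Return one dict per data row, skipping titles, headers and footers."""
    rows = []
    for line in page_text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match is None:
            continue  # not a data row (title, column header, page footer)
        rows.append({
            "school": match["school"].strip(),
            "enrollment": int(match["enrollment"]),
            "budget": int(match["budget"].replace(",", "")),
        })
    return rows


def to_csv(rows):
    """Serialize the parsed rows as CSV text that anyone can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["school", "enrollment", "budget"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = parse_rows(PAGE_TEXT)
print(to_csv(rows))
```

Real PDFs are rarely this tidy. Every report layout tends to need its own pattern, which is a big part of why the scraping took so long.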
What's Possible when the Data is Unlocked
Fast forward a couple of weeks from when I started pulling the data, and you can see the final product. The app I built is hosted on Streamlit, which is a super-slick platform that provides all the scaffolding and support to quickly enable exploration of your data, or provide a UI on top of your code. You get to spend time on solving the problem instead of having to fiddle with buttons, HTML and the like.
My exploration began as an examination of the budgets and enrollment themselves, but then quickly morphed into a way to understand one of the many impacts of closing schools: specifically, how students get redistributed, based on the existing relationships between enrollment boundaries and the number of students who attend school inside or outside of them.
So, that became the primary use-case for what I created:
As a member of the community, how will a specific scenario of school closures impact other surrounding schools from a capacity perspective?
All the data can be quickly downloaded through the tables below the maps, and users can quickly play with and observe their own scenarios. Like, "What if they closed my school?"
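The kind of scenario question above can be sketched in a few lines. This is a simplified illustration, not the logic the app actually uses. The school names, enrollment and capacity figures, the reassignment shares, and the function names apply_closures and utilization are all assumptions made up for the example.

```python
# Hypothetical schools: current enrollment and building capacity.
SCHOOLS = {
    "Alder": {"enrolled": 300, "capacity": 400},
    "Birch": {"enrolled": 200, "capacity": 350},
    "Cedar": {"enrolled": 250, "capacity": 300},
}

# Assumed share of a closed school's students that moves to each neighbor.
REASSIGNMENT = {
    "Birch": {"Alder": 0.5, "Cedar": 0.5},
}


def apply_closures(schools, reassignment, closed):
    """Project enrollment at the remaining schools after closing `closed`."""
    projected = {
        name: float(info["enrolled"])
        for name, info in schools.items()
        if name not in closed
    }
    for name in closed:
        for target, share in reassignment[name].items():
            if target in closed:
                # Keep the sketch simple: no cascading reassignment.
                raise ValueError(f"{target} is also closed in this scenario")
            projected[target] += schools[name]["enrolled"] * share
    return projected


def utilization(schools, projected):
    """Projected enrollment as a fraction of each building's capacity."""
    return {
        name: projected[name] / schools[name]["capacity"]
        for name in projected
    }


projected = apply_closures(SCHOOLS, REASSIGNMENT, closed={"Birch"})
for name, ratio in utilization(SCHOOLS, projected).items():
    flag = "OVER CAPACITY" if ratio > 1 else "ok"
    print(f"{name}: {projected[name]:.0f} students, {ratio:.0%} of capacity ({flag})")
```

In this made-up scenario, closing Birch fills Alder exactly to capacity and pushes Cedar past it, which is the kind of knock-on effect a parent would want to see before a decision is made.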
An unintended A-Ha!
I did make an interesting observation while analyzing this data. It came just after completing a fairly simple linear regression. The y-intercept for the regression was around $760k, which represents an estimated baseline cost for having a school open. In simpler terms, by closing a school and redistributing staff and budget dollars, the district will probably see on average $760k in cost savings per school. Therefore closing up to 20 schools, while maintaining staffing levels and redistributing students, would save a bit over $15M. There's a big gap between that and the $100M deficit that the closures intend to address. This probably warrants some additional analysis, if only I had access to better (or even more) data…
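For readers who want to see how an intercept turns into a savings estimate, here is a small sketch. It assumes the regression was school budget against enrollment, which is my reading rather than something spelled out above. The enrollment figures and the per-student amount are invented so that the fit lands on the $760k baseline, so this shows only the arithmetic, not the real dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


BASELINE = 760_000      # approximate y-intercept reported in the article
PER_STUDENT = 9_000     # hypothetical cost per enrolled student
SCHOOLS_TO_CLOSE = 20   # upper bound mentioned by the district

# Synthetic data built from the two numbers above (no noise).
enrollment = [150, 220, 300, 380, 450]
budget = [BASELINE + PER_STUDENT * n for n in enrollment]

slope, intercept = fit_line(enrollment, budget)

# If students and per-student dollars follow the kids to other schools,
# only the fixed baseline cost is saved for each closed building.
estimated_savings = SCHOOLS_TO_CLOSE * intercept

print(f"baseline cost per school: ${intercept:,.0f}")
print(f"savings from {SCHOOLS_TO_CLOSE} closures: ${estimated_savings:,.0f}")
```

Twenty closures at roughly $760k each comes to about $15.2M, which is where the "bit over $15M" figure comes from.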
Breaking Out is a Choice
As I went through this exercise, it became increasingly clear that FOIA and Public Records laws provide an opportunity (maybe unintended) to help break through data jails when scrappy scraping skills can't be leveraged.
Others have likely requested this data in the past, sought and received the necessary approvals, and been provided that data. And even though the data shared with the requestor is considered public, it's not made accessible to anyone else in an easy way. Herein lies the problem. Why can't I just look at and use the data that others have already asked for and been given?
Wrapping Up
So, I built a thing. I scraped data from PDFs using a thing. But I also have a request in the queue with Seattle Public Schools and Seattle.gov to get access to any information provided via public requests and FOIAs for public school data over the last 2 years. These responses and the requests are also themselves public records.
But for those who don't have the skills to write code to scrape this themselves, this data remains just beyond arm's length, locked behind the bars of PDFs, webpages, & images. It doesn't have to be that way. And it shouldn't be that way.
There are certainly discussions to be had on standardizing on a consistent format for data in the first place. Standard table formats such as Delta Lake seem to be a very scalable and inexpensive solution across the board (Thanks Robert Dale Thompson), but even making the data from past FOIAs and public records requests accessible on existing sites such as data.seattle.gov seems to be table stakes.
Let's work together to unlock the potential of public data. Check out my Streamlit app to see how accessible data can make a real difference. Join me in advocating for open data by reaching out to your local representatives and backing initiatives that promote transparency. Share your own experiences and knowledge with your community to spread awareness and drive change. Together, we can break down these data jails and ensure information is truly accessible to everyone.
More to come, once I get access to the data.