When working on data science tasks, one basic pipeline to set up is the one for data collection. Real-world machine learning differs from Kaggle-like problems mainly because the data is not static. We need to scrape websites, gather data from APIs, and so on. This way of collecting data might look chaotic, and it is! That's why we need to structure our code following best practices to bring some kind of order to all this mess.
Once you have identified the sources from which you want to gather your data, you need to collect them in a structured way so you can store them in your database. For example, you might decide that, in order to train your LLM, what you need are data sources that contain three fields: author, content, and link.
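To make the three-field structure concrete, here is a minimal sketch in Python. The names `Document` and `normalize`, and the shape of the raw payload, are assumptions made for illustration; your sources will likely need their own mapping logic.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Document:
    """Hypothetical record type holding the three fields we want to keep."""
    author: str
    content: str
    link: str


def normalize(raw: dict) -> Document:
    """Map a raw scraped payload (assumed to be a dict) to the three-field schema.

    Missing keys become empty strings and surrounding whitespace is stripped.
    Any extra keys in the payload are ignored.
    """
    return Document(
        author=str(raw.get("author", "")).strip(),
        content=str(raw.get("content", "")).strip(),
        link=str(raw.get("link", "")).strip(),
    )


# Example payload, invented for illustration only.
raw_item = {
    "author": " Jane Doe ",
    "content": "Some text",
    "link": "https://example.com/post",
    "extra": 1,
}
doc = normalize(raw_item)
print(asdict(doc))
```

Whatever the source looks like, funnelling it through one small record type like this means the rest of the pipeline only ever deals with a single, predictable shape.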
What you can do is download the data, and then write SQL queries to store and retrieve data from your database. More generally, you might want to implement all the queries needed to perform CRUD operations. CRUD stands for create, read, update, and delete. These are the four basic functions of persistent storage.
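As a rough illustration of what hand-written CRUD queries might look like, here is a sketch using Python's built-in `sqlite3` module with an in-memory database. The table name `documents`, the function names, and the sample values are all assumptions chosen for the example; a real project would probably target a different database.

```python
import sqlite3


def connect(path: str = ":memory:") -> sqlite3.Connection:
    """Open a connection and make sure the (hypothetical) documents table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "author TEXT NOT NULL, "
        "content TEXT NOT NULL, "
        "link TEXT NOT NULL)"
    )
    return conn


def create_document(conn, author, content, link):
    """Create: insert one row and return its generated id."""
    cur = conn.execute(
        "INSERT INTO documents (author, content, link) VALUES (?, ?, ?)",
        (author, content, link),
    )
    conn.commit()
    return cur.lastrowid


def read_document(conn, doc_id):
    """Read: return the row as a tuple, or None if it does not exist."""
    return conn.execute(
        "SELECT id, author, content, link FROM documents WHERE id = ?",
        (doc_id,),
    ).fetchone()


def update_content(conn, doc_id, content):
    """Update: change the content of one row and return the number of rows affected."""
    cur = conn.execute(
        "UPDATE documents SET content = ? WHERE id = ?", (content, doc_id)
    )
    conn.commit()
    return cur.rowcount


def delete_document(conn, doc_id):
    """Delete: remove one row and return the number of rows affected."""
    cur = conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
    conn.commit()
    return cur.rowcount


# Walk through the four operations with made-up data.
conn = connect()
doc_id = create_document(conn, "Jane Doe", "Some text", "https://example.com/post")
created = read_document(conn, doc_id)
updated_rows = update_content(conn, doc_id, "Edited text")
after_update = read_document(conn, doc_id)
deleted_rows = delete_document(conn, doc_id)
after_delete = read_document(conn, doc_id)
```

Note the `?` placeholders: values are passed as parameters rather than formatted into the SQL string, which helps avoid SQL injection when the content comes from scraped pages.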