Overview:

The article compares three approaches for storing large files in web applications: storing files in a file system with the path in a database, storing files inside a database, and storing files in an object storage service. The advantages and disadvantages of each approach in terms of security, scalability, maintainability, performance, and cost are explored. The article recommends choosing an approach based on the specific application's requirements: consider an API layer for security and scalability, prefer a cloud-based storage solution for performance and flexibility, and plan for migration between storage solutions.

How to store large files for use in a web application?

It’s a common issue in web application architecture: you have a requirement to store and access large files. What’s the best way to do this?

There are three obvious routes to go down:

  • Store the file in a filesystem and store a file path in a database
  • Store the file inside the database (as varchar, nvarchar, varbinary, text, blob, etc).
  • Store files in an object storage service

The truth of the matter is there is no single correct answer; like all software architecture, it depends on the requirements now and in the future. With that said, let’s explore the approaches.

Storing a large file in a file system with the path in a database

In this scenario, you choose to store the file in a file system whilst keeping the path stored in a database.

Security Concerns

When this method is suggested I have some initial concerns, the first being security. Depending on the developer’s experience, this method often means storing everything in a static, public folder.

I struggle to think of a scenario where I would be okay with public access to any storage I have. Essentially, I draw a parallel between this and having an unauthenticated GET on a database table.

So if it’s anything that has even slight importance or privacy concerns, you will need to add a permissions/access-rights layer to the implementation. Otherwise, people will browse to the folder and have access to all the documents.

This can be resolved, however, with a simple API layer: a request is made to your API (where you can apply whatever restrictions you’d like), which then streams the document back, making sure that only your API has access to the folder.
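As a rough illustration, here’s a minimal sketch in TypeScript using Express. The storage root, route, and the `requireAuth` / `lookupPathInDb` helpers are hypothetical stand-ins for whatever your application actually uses; the point is simply that the folder stays private and the API streams the file back after a permission check.

```typescript
import express from "express";
import fs from "fs";
import path from "path";

const app = express();

// Hypothetical storage root: readable by the API process only,
// NOT exposed as a static/public folder.
const STORAGE_ROOT = "/var/app-storage";

// Stand-in for whatever authentication/permission check your app uses.
function requireAuth(req: express.Request, res: express.Response, next: express.NextFunction) {
  // e.g. validate a session or JWT, then check the user may read this document
  next();
}

app.get("/documents/:id", requireAuth, async (req, res) => {
  // Look the relative path up in the database by document id.
  const relativePath = await lookupPathInDb(req.params.id);
  if (!relativePath) return res.status(404).end();

  // Resolve and confirm the path stays inside the storage root,
  // guarding against path traversal.
  const fullPath = path.resolve(STORAGE_ROOT, relativePath);
  if (!fullPath.startsWith(STORAGE_ROOT + path.sep)) return res.status(400).end();

  // Stream the file back rather than buffering it in memory.
  res.setHeader("Content-Type", "application/octet-stream");
  fs.createReadStream(fullPath).pipe(res);
});

// Stand-in for your data access layer.
async function lookupPathInDb(id: string): Promise<string | null> {
  return null;
}

app.listen(3000);
```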

Scalability

In the early days of an application, storing files on the webserver is okay, but how does it scale?

Webservers normally have finite hard drive space. In a case where you have a few users storing lots of large files, you may find that the webserver load is very low but you’re out of hard drive space, forcing a premature upgrade and adding to maintenance costs. As an illustrative example, a webserver with a 500 GB disk can serve only around 250 users if each stores 2 GB of files, long before CPU or memory become a bottleneck.

Depending on the architecture of the web application, you may also be forced to scale vertically, which has its own downsides compared to scaling horizontally.

However, an API layer could help with this too: the request can query the database for the path AND for which server the file is stored on, then stream it back.
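To sketch that idea (again with hypothetical helpers, an assumed host-naming scheme, and an assumed internal endpoint), the database lookup returns both a host and a path, and the API either streams the file locally or proxies it from the peer server:

```typescript
import express from "express";
import fs from "fs";
import { Readable } from "stream";

// Hypothetical shape of the lookup: the database row records both
// which server holds the file and where it lives on that server.
interface FileLocation {
  host: string;
  path: string;
}

const THIS_HOST = process.env.HOST_NAME ?? "files-01"; // assumed naming scheme

const app = express();

app.get("/documents/:id", async (req, res) => {
  const loc = await lookupLocationInDb(req.params.id);
  if (!loc) return res.status(404).end();

  if (loc.host === THIS_HOST) {
    // The file lives on this server: stream it directly.
    fs.createReadStream(loc.path).pipe(res);
  } else {
    // The file lives elsewhere: proxy it from the peer server's
    // internal endpoint (endpoint name is an assumption).
    const upstream = await fetch(
      `http://${loc.host}/internal/files?path=${encodeURIComponent(loc.path)}`
    );
    if (!upstream.ok || !upstream.body) return res.status(502).end();
    Readable.fromWeb(upstream.body as any).pipe(res);
  }
});

// Stand-in for your data access layer.
async function lookupLocationInDb(id: string): Promise<FileLocation | null> {
  return null;
}
```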

Maintainability

What happens when someone starts moving the files around or tries to restructure the folder system? Now you have a database full of incorrect paths, and potentially you won’t know until you start getting reports that files can’t be accessed.

Conclusion of storing files on a webserver with a path stored in the database

  • If you’re storing sensitive data, think about access to the folder (do not use a public static folder).
  • Think about your use case: if you have lots of users or lots of large files, work out how many you can service per server.
  • Is anyone going to want to manually move a file? Think about the process.

For an application that requires large document storage and has a concept of multiple users:

  • I would make calculations on how many users I would be able to service, to make sure it’s viable.
  • I would use an API layer to provide security and potential scalability should it be needed.
  • I would have an action plan for scaling, which would likely include a plan for migration to a cloud-based object storage service.

Storing the file inside the database

Database administration complexity

An adage that I keep in mind when thinking about databases is “The level of knowledge required to maintain a database will increase with the size of the database”.

When a database increases in size, more and more considerations are required in all aspects of its running and maintenance.

On a small database, backups are easy, simple, and straightforward. On a large database, you may need to start creating transactional backups, or putting files on a separate filegroup (if your database supports that).

The queries you write to pull from the database will need to take more and more performance considerations into account. A badly written query may suffice when you have a small database; increase the size, and the query may not return in a suitable time at all.

Serving files

Another level of complexity is added because you need to retrieve the data from the database and stream it down. This also applies to the previously mentioned method of storing file paths behind an API layer, but there you don’t technically have to do it; with files stored in the database, you do.
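As a small illustration of that obligation, here is a sketch assuming a hypothetical Postgres table `documents(id, filename, content bytea)` queried via the `pg` library. Note that `pg` hands the whole bytea column back as one Buffer, so the entire file is held in memory before it is sent, which is exactly the kind of extra consideration in-database storage forces on you for genuinely large files:

```typescript
import express from "express";
import { Pool } from "pg";

const pool = new Pool(); // connection settings taken from environment variables
const app = express();

app.get("/documents/:id", async (req, res) => {
  const { rows } = await pool.query(
    "SELECT filename, content FROM documents WHERE id = $1",
    [req.params.id]
  );
  if (rows.length === 0) return res.status(404).end();

  // `content` arrives as a single Buffer: the whole file sits in
  // memory before a byte is sent, which is fine for small files and
  // a real problem for large ones.
  res.setHeader("Content-Disposition", `attachment; filename="${rows[0].filename}"`);
  res.send(rows[0].content);
});
```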

Once you start, you can’t stop (as easily)

If you were to move to a cloud storage solution like Amazon S3, you will find it harder and more costly in time and resources to do so if you have previously stored your files inside the database.

With your files stored in a filesystem and the paths kept in the database, the change needed to use S3 is to change those paths to S3 paths, and hopefully everything downstream will continue to function (depending on your previous design decisions…).
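One way to picture how small that change can be (a sketch, with hypothetical bucket and path names) is a reader that branches on the stored path’s scheme, using the AWS SDK v3 when it sees an `s3://` path and the local filesystem otherwise:

```typescript
import fs from "fs";
import { Readable } from "stream";
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "eu-west-1" }); // region is an assumption

// The stored path is the only thing that changes after migration, e.g.
//   "/var/app-storage/reports/1.pdf"  ->  "s3://my-bucket/reports/1.pdf"
// (bucket and key names here are hypothetical).
async function openDocument(storedPath: string): Promise<Readable> {
  if (storedPath.startsWith("s3://")) {
    const [, , bucket, ...keyParts] = storedPath.split("/");
    const out = await s3.send(
      new GetObjectCommand({ Bucket: bucket, Key: keyParts.join("/") })
    );
    return out.Body as Readable; // in Node, Body is a readable stream
  }
  return fs.createReadStream(storedPath);
}
```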

With files stored in the database, it would not be as easy, especially as it would mean changing what could be core file-retrieval functionality.

Conclusion on storing files in the database

Are your files going to be less than 2GB and stored in MS SQL Server (for absolute certain)? If so, research the FILESTREAM type; it could help you make a decision.

For any large file, I believe I would be even less likely to implement this than the previously mentioned file system approach, for the simple reason that if I couldn’t start in a cloud service such as Amazon S3, I would move to it as soon as it was viable to do so, and the migration path from filesystem to cloud object storage is easier than that from database to cloud object storage.

Cloud Object Storage

Let’s use what I’m familiar with: Amazon S3.

Durability, availability, and scalability: if that’s what you need to meet your requirements, cloud storage is probably the answer.

It comes with the added benefit of the security and compliance that huge players such as Amazon, Google, or Azure provide.

This is basically the best of both worlds: you keep the simplicity of having paths in the database, reducing the size and complexity of your database maintenance, while not having to deal with the scalability and availability of keeping your files on your own webservers.
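A common pattern here, sketched below with a hypothetical bucket name, is to have your API authorise the request and then hand back a short-lived pre-signed URL, so clients download directly from S3 and your webservers never carry the file bytes:

```typescript
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({ region: "eu-west-1" }); // region is an assumption

// Bucket and key names are hypothetical. Your API checks the user's
// permissions first, then issues a URL that expires shortly after.
async function signedDownloadUrl(key: string): Promise<string> {
  return getSignedUrl(
    s3,
    new GetObjectCommand({ Bucket: "my-app-documents", Key: key }),
    { expiresIn: 900 } // valid for 15 minutes
  );
}
```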