Friday, January 10, 2025

Dates on Websites

Earlier this week I had a neighbor call me asking about getting his amateur (ham) radio license. I explained the 3 levels of licenses and the privileges associated with each. I then pointed him at some study resources, as getting a license requires passing a test. Next he wanted to know where he could go to take a test. I did a quick Internet search only to find significantly outdated information. When I got my license over a decade ago, you could reliably count on a test being administered on the last Tuesday of each month at a local emergency operations center (EOC). COVID must have put an end to that, and everything I could find seemed woefully out of date.

This underscored the importance of keeping the dates on a website current. You may have a website that advertises an event in the seemingly distant future. You send people to the site for information about the event, as that saves you from having to say the same thing over and over. The event comes and goes, but what happens to the website? If it is still up after the event, people get confused. It might be easiest to update the page to let everyone know about the success of the event. Then the site can continue to exist in digital perpetuity.

My preference for dates on websites is to store events in a database and populate the website from it. Then if there are no longer any future events, you have a message that says something along the lines of, "No events are currently planned." It is also fairly easy to add new events to the database and you don't have to worry about deleting old ones as the database logic should not show past events on an upcoming events page.
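
As a minimal sketch of that logic, assuming a hypothetical events table with title and event_date columns, the upcoming-events page could be driven by a query like this:

SELECT title, event_date
FROM events
WHERE event_date >= CURRENT_DATE
ORDER BY event_date;

If the query comes back empty, the page shows the "No events are currently planned" message instead, and past rows simply stop appearing without anyone having to delete them.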

I continued searching on my neighbor's behalf and eventually found an upcoming exam session in March. That seems a long way off, but it will be March in no time at all. It also gives him plenty of time to study.

Thursday, October 10, 2024

Playing with Large Language Models (LLMs)

Yesterday I participated in a hands-on lab that used large language model (LLM) technology to help generate complex SQL queries against our data. The lab only lasted an hour and I learned a lot in that time. It also highlighted a use case where LLMs are more than just a toy for writing high-school papers.

When ChatGPT was released, it garnered a lot of attention. While everyone seemed impressed, I didn't see a lot of people asking how the technology could be used for good. Automatic code generation seems to be one of those good uses that can really help us in our daily lives. Rather than spending countless hours writing thousands of lines of code to do something important, you can specify what you want created and the LLM will generate the code for you. That is what yesterday's lab explored.

For those who don't know SQL, it is a very simple, almost English-like language used to query databases. If you have a table that contains information about customers, you can use something like this to query the database:

SELECT first_name, last_name, first_purchase_date FROM customers;

This lists every customer's name along with the date of their first purchase. Pretty simple, right? Well, it gets complicated once you try to filter that data or include information from other tables in the database. While most of my queries are 25 lines or fewer, it is not uncommon for me to write 500-line ones. This is where an LLM can really streamline my work.
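
To give a taste of how quickly queries grow, here is a hypothetical join against an orders table (the table and its columns are invented for illustration):

SELECT c.first_name, c.last_name, SUM(o.total) AS lifetime_total
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2024-01-01'
GROUP BY c.first_name, c.last_name
ORDER BY lifetime_total DESC;

Multiply that pattern across a dozen tables and a 500-line query stops sounding far-fetched.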

In order to keep the lab under an hour, all the data and most of the code had already been created, so I just needed to put everything in the right place. I loaded the data into an online database server and downloaded the code to my laptop. The only modification I needed to make was updating the connection string to point to my online database. Then we ran through the code to see how it worked.

The code included a file describing the 3 tables in the lab, and that turned out to be the secret sauce. I looked closely through the description file and realized it would take about an hour per table to duplicate for the database I work with on a daily basis. When there are only 3 tables, that isn't a large effort. Unfortunately my database has thousands of tables, so it would be a huge undertaking to use this technology in my environment.
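
The lab's file had its own format, but the level of detail was similar to what you could capture with SQL comments (this is a hypothetical sketch in PostgreSQL syntax, not the lab's actual file):

COMMENT ON TABLE customers IS
'One row per customer, created when the first order is placed.';
COMMENT ON COLUMN customers.first_purchase_date IS
'Date of the first completed purchase; NULL until then.';

It is exactly this kind of plain-language description that gives the LLM enough context to write sensible queries.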

I had hoped that the LLM would be able to look at my database tables and infer what information they contained. This is possible but unfortunately is not nearly as accurate as having a data expert provide this detail first.

I'm glad I attended the lab and I learned a lot. It also demystified what is going on with this particular LLM. While I won't be able to use the lab's output directly to make it easier for non-data-scientists to query our database, I did come away with a strong understanding of the underlying technology.

Friday, December 15, 2023

Substitutions

My oldest daughter is working in an electronic-orders fulfillment center until she can find a job in her field. She does not enjoy it, but it does help her pay her bills. Our evening conversation today revolved around substitutions. Someone will order a product that she is unable to find in the warehouse, so the computer suggests an alternative. Sometimes those substitutions don't make sense and my daughter will note it, but most of the time they work.

I thought about how I use substitutions in my daily job working with data. Sometimes I work with a small amount of data in a spreadsheet. Other times I work with large amounts of data in a data warehouse. Can I do the same operations in both tools? After doing this for many years, I can answer that question with a truthful, "Yes."

A few weeks ago I received a spreadsheet with 190,000 or so rows. I immediately needed to run some aggregate functions: counting how many rows met certain criteria in the text columns, and computing sums or averages for the number columns. I had the choice of using either a database or a spreadsheet. To me it didn't make a difference which tool I used. Ultimately I inserted the rows into a database and performed my analysis there.
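
On the database side, that kind of analysis boils down to queries like this hypothetical one (the table and column names are invented for illustration):

SELECT status, COUNT(*) AS row_count, SUM(amount) AS total, AVG(amount) AS average
FROM imported_rows
GROUP BY status;

A spreadsheet gets you the same answers with COUNTIF, SUM, and AVERAGE formulas; the point is that either tool can do the job.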

Working with a computer, there is always a choice of tools you can use. While you may have grown used to Microsoft's Excel, you might not have that option on a new personal computer. If you have a Mac, the spreadsheet program Numbers comes free with the operating system. You also have the choice of using Google Sheets. They all have their pros and cons, but you should never feel locked into a specific tool.

This is true for more complex software as well. My youngest son is a mechanical engineer and has access to a very expensive computer-aided design (CAD) program called SolidWorks. It can do a lot. There are also a number of less capable products that can tackle any personal projects he decides to start without the exorbitant cost. When I needed to create very specific candle holders for one of my wife's Christmas decorations, I used TinkerCAD. It was easy to learn and I quickly had the design I needed for my 3D printer. There are a number of even more feature-rich options should you need them. The trick is knowing how to find them.

With the ubiquity of open source software and vast array of commercial software choices, you should never feel there is only one way to get something done on your computer. There are always multiple. The trick is learning how to use Internet search engines and sift through the bad suggestions to find good ones. Perhaps that is a post for another day.

Sunday, November 19, 2023

Coding in Python

I have a new project at work that has me interested in doing some Python programming. I always say that Python is one of those languages where even the largest programs are only 50 lines long. That is a bit of an exaggeration, but most of the code I have seen falls into that category. The reason programs can be so small is the ever-growing list of libraries to help. If you are trying to read in a comma-separated values (CSV) file and process it, somebody has written a library for it. You don't need to reinvent the wheel. Instead you can use the CSV library and focus on what you need to do.

The project at work is kind of simple. We just need to read in a CSV file and apply a hashing algorithm to one of the columns in the file. Then we will write the resulting modified file into a database. As with the CSV library, there is a library to do the hashing for us. There is also a library to write to the database. Furthermore, we have source code from one of the teams in Japan that serves as great sample code to follow.
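
The hashing itself will happen in Python with a library, but to make the end state concrete, here is what the transformation amounts to once the rows are in the database. This is a hedged sketch in SQL rather than the Python program itself, assuming PostgreSQL's built-in md5() function and an invented staging table and column:

UPDATE staged_rows
SET customer_email = md5(customer_email);

The real program will do the same substitution row by row as it reads the file, before anything is written to the database.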

There is another engineer on the team who is responsible for writing the code; my role is merely to act as an advisor. We started by reviewing the sample program, and I stepped through each line of code with my coworker. At first I felt glad that he is responsible for the coding on the project. The more we reviewed, though, the more I wanted to take a shot at doing the work myself. I don't think it would take me more than a day, and I think I would do a great job. Unfortunately that would be overstepping my bounds, as my job really is to serve as a mentor. I wonder if it would be okay to add a piece or two to the sample code we already have, as there is one difference that I think would be fun to address. Tomorrow I have a progress meeting and will see how far the other engineer has gotten. Hopefully he has figured it out on his own. If not, I'll play with it on Tuesday.

Wednesday, January 10, 2018

Getting Technical

I flew back to the Bay Area this week and started working in earnest. That can be tough after a few weeks of Christmas and New Year's celebrations. Yesterday I quickly got through my morning ritual and started working on a design document for one of my projects. I really got into it and didn't notice time passing. One of my regular lunch buddies had to remind me that it was time to go get food. It has been a long time since I got so involved in a project that I needed to be reminded about food.

When I first started this job almost 8 years ago, I didn't think I would be here this long. I was hired as an individual contributor and did technical things on a daily basis. Then I slowly got moved into management. Now I am several levels deep and rarely get to spend my days doing technical things. That is one of the things I want to change, so I am challenging myself to spend half my time on technical tasks. That forces me to move quickly through my management tasks. Then I jump into my technical work, and I am finding it to be a lot of fun.

The scope of my technical project is rather large, so I get to dabble in various technologies. Yesterday I created a document spec for a NoSQL database. Then today I started doing some Python programming to populate the database with documents that follow it. Tomorrow I will be looking at machine learning for another part of the project. This reminds me of the weather in the mountains: if you don't like it, wait a minute.

Thursday, August 11, 2016

You Work in Computers?

Several years ago I had someone ask me if I worked in computers. When I replied that I did, she told me that I must have any number of job opportunities because there seem to be so many computer job postings. I tried to explain that there are a lot of different areas, and that while I may know databases very well, I am not a Java programmer, nor would I be hired as a user interface developer. Unfortunately she didn't understand databases, Java, or user interfaces, let alone the differences between them.

Today I had the chance to attend a series of presentations given by my company's staff of summer interns. My intern gave an excellent presentation on Kubernetes, and I decided to stick around listening to the others, trying to gauge how well he did. I quickly became aware of how many different disciplines there are in computer science. One person talked about a very in-depth graphics routine he had spent the summer researching. Another talked about how she worked with game console software development kits (SDKs). It reminded me of how truly diverse jobs in computers can be.

Should you be interested in a computer job, you might want to spend some time thinking about what interests you. My first professional experience with computers involved data, and so I built my career around databases. My son has taken a liking to computer networks, and that is where he has found employment. Yes, the field of computers is large and varied. My only advice is that if you decide a career with computers sounds right for you, continue learning. The field is always evolving, and the technologies of 20 years ago look very different now. When I started out in databases, SQL was the language to learn. It is still applicable, but now there are other data tools, and to remain marketable I have had to keep myself constantly educated.

In case you are interested, my intern did a magnificent job on his presentation and made the team proud.

Tuesday, April 12, 2016

Getting SQL Logic Correct

It has been a while since I have had to run an SQL query against a relational database. Today I worked with some of our developers to figure out why some data doesn't make sense. This requires running SQL queries, which is something I actually enjoy writing. I saw one query and it just didn't look right. The query ran fine, but I don't think it produced the expected results because of its negative logic.

Imagine you have a table with the following 3 rows:

     Name
     -----
     Matt
     Rhod
     Felix

Suppose you want to query everyone; you would use the following SQL statement:

SELECT * FROM my_table;

Now suppose you want to query everyone except Matt; you would use:

SELECT * FROM my_table 
WHERE Name != 'Matt';

The WHERE clause acts like a filter and the exclamation point means NOT. You could also use:

SELECT * FROM my_table 
WHERE Name = 'Rhod' OR Name = 'Felix';

This eliminates the exclamation point but requires that you know who else is in the table. The query I looked at today combined the OR operator with NOT. Unfortunately the logic gets turned around, and the results may not be what you want them to be. Suppose you have no idea what names are in the database table, but you know you want to see everyone except Rhod and Felix. You might be tempted to use the following:

SELECT * FROM my_table
WHERE Name != 'Rhod' OR Name != 'Felix';

Unfortunately the query would return all of the rows in the table. Why? The database goes to the first row and sees that Matt is neither Rhod nor Felix, so both conditions are true and the row is returned. On the next row, only one of the two conditions needs to be true because the OR operator is used. Rhod fails the first test, but Rhod is not Felix, so the second condition is true and Rhod is returned in the query. The same logic holds for Felix.

The fix is to replace the OR operator with the AND operator: you want the rows where the name is not Rhod AND not Felix (De Morgan's laws in action). The correct query should be:

SELECT * FROM my_table
WHERE Name != 'Rhod' AND Name != 'Felix';

The only row returned is Matt, which is what you want.

Friday, June 29, 2012

The Observer Effect

I have run into a performance problem with one of my databases at work. We are constantly adding new records to the database and the system runs fine. When we try to augment the normal load with some extra data, the system slows to a crawl and I have been trying to figure out why.

I have a tool that will help me monitor performance. Every 2 seconds, it queries the program sending data into the database and asks how many packets of data are waiting to be saved. When the system is keeping up, this number is 0. When it is running slow, the number grows. Yesterday I got tied up with other things and turned my monitoring program off. I wasn't really paying close attention, but the performance seemed to increase because I wasn't watching.

This morning I wanted to see if there was a true performance increase or if it just seemed that way. I started loading data and made a note of the time. Yesterday it took almost exactly 2 hours to complete. This morning it took 1 hour and 40 minutes. I thought that was interesting and so I ran the test again. Sure enough, the second test completed in 1 hour and 50 minutes. While my monitoring of the system didn't slow things down too much, it did have an effect.

I guess you could say that the old cliché about a watched pot never boiling can be true. In physics this is called the observer effect: merely observing something changes what is being observed.

Tuesday, December 13, 2011

Effective Troubleshooting

Yesterday and today I spent several hours hunting down a computer problem in a production system at work. I did a pretty good job of narrowing down the problem to a piece of software. I changed the configuration file, upgraded to a newer version, and tried a number of other solutions only to have the problem continue.

My coworker and I were able to get the system limping along so we could sleep on the problem. We both woke up this morning with a similar idea on how to locate the true cause of the issue. It turns out the real culprit was a database table that was larger than it should have been. The system was designed for this table to have a few hundred rows. Instead it had over 5 million. There was a clean-up script that wasn't running.

The first half of solving this problem meant I had to manually clean out the table; letting an automated job delete 5 million rows in one shot would have brought the database machine crashing down. Once that was done, we had to get the clean-up script rewritten. That took a good part of the day, but we got it working. The complete solution took only a few hours to come up with. I just wish we had done a better job troubleshooting last night.
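
The trick with a manual cleanup like this is deleting in small batches so the table stays usable. A hypothetical PostgreSQL version, with invented table and column names, might look like:

DELETE FROM work_queue
WHERE ctid IN (
    SELECT ctid FROM work_queue
    WHERE created_at < now() - interval '1 day'
    LIMIT 10000
);

Run it repeatedly until no rows are deleted; one small batch at a time keeps locks and transaction size manageable while the backlog drains.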

Wednesday, April 13, 2011

Sifting Through Data

Today I found myself looking at a large amount of information in a database. I wanted to limit the data returned, so I enlisted the help of the substring function. There were thousands of rows of data that contained the word "mapping," and I didn't want to see those rows in my query. So I asked the database for all the rows that didn't contain the substring "mapping". That gave me the information I was looking for, and I was able to move on to my next project.
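
There is more than one way to write that exclusion; here is a hedged sketch with an invented table and column (position() is standard SQL and returns 0 when the substring is absent):

SELECT *
FROM log_entries
WHERE position('mapping' in message) = 0;

A NOT LIKE '%mapping%' filter would accomplish the same thing.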

I was busily working when a coworker came up and asked me to change the data in another database. She started describing the problem, and it sounded like it was going to be a huge task until I realized that it could be simplified with the substring function. Basically I needed to remove the first 4 characters of one database field and replace them with a 5-character word. As I had been using the substring function earlier, the syntax, or way of using it, was fresh in my mind. The SQL command looked similar to this:

UPDATE the_table
SET the_field = 'abcde' || substr(the_field,5)
WHERE another_field = 'xyz';

Going through the statement, the first line tells the database which table to update. The last line says we want the update to occur where a specific field is equal to the string "xyz". The second line shows the use of the substring, or "substr", function. I didn't care about the first 4 characters of the existing string, so I started at position 5 (SQL strings start at character 1 while C programming language strings start at character 0, which can be confusing). I then prepended the replacement string "abcde". The two vertical pipe symbols are the concatenation operator in SQL.

The statement worked fine and I was free to get back to my other task. Amazingly I have not had to worry about the substring function for quite a while and I found it interesting that I needed it twice today.

Thursday, March 10, 2011

Too Much Data

Yesterday I started a database report and let it run all day. When I was ready to leave work and the report was still not done, I let it run overnight. I figured if I came in this morning and the report was still running, I would try to figure out what the problem was and see if I could fix it.

I got into the office this morning and the report was still running. I had to have the report done this afternoon to send to management. I had no faith that the report would complete by the time it was due and so I started looking into other methods of creating it. On the off chance my report was going to eventually complete, I let it run.

Most modern databases are fairly smart, have efficient algorithms, and are pretty good at figuring out the best way of doing things, but they are not perfect. I was able to come up with another way of getting the information and discovered that an assumption I had made about the data was false. Instead of asking for a subset of the data from a single source, I was asking for all the data from multiple sources, joined together in a way that overwhelmed the system. Discovering this problem allowed me to finish the report and get it out on time.
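
As a made-up illustration of the difference, with invented table names:

SELECT * FROM sales WHERE region = 'west';

SELECT * FROM sales s JOIN shipments sh ON s.region = sh.region;

The first query is the filtered subset the report actually needed. The second joins on a non-unique column, so it multiplies rows instead of filtering them and can produce vastly more rows than either table contains, which is why the report never finished.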

This whole experience underscored the importance of not making assumptions.

Thursday, March 3, 2011

Simple Solutions

A couple of days ago I discovered that I have two databases with overlapping data. That means they both contain copies of the same data. Unfortunately one of the databases has been getting updated while the other one hasn't. There are two possible solutions: one is to require the data to be manually input into both databases, and the other is to write a little program that updates the second database when something changes in the first. The trick is to figure out which solution will work best.

Writing a little program means more work for me. Not only do I have to write the program, but I also have to test it and then maintain it. Having someone manually update two databases is simple to implement but is more error prone. The gating factor on which solution to use boils down to how often updates are made. Looking at my particular situation, I will most likely be writing the automatic program. So when would someone choose the manual process?
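
If both copies are reachable from the same server, one hedged sketch of the automated route is a trigger. This is PostgreSQL syntax with invented table names, and it assumes second_copy has a unique constraint on id; if the second copy lives in a separate database entirely, you would need something like postgres_fdw or a small standalone program instead:

CREATE OR REPLACE FUNCTION mirror_row() RETURNS trigger AS $$
BEGIN
    -- Keep the second copy in step with the first.
    INSERT INTO second_copy (id, val)
    VALUES (NEW.id, NEW.val)
    ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER mirror_changes
AFTER INSERT OR UPDATE ON first_copy
FOR EACH ROW EXECUTE FUNCTION mirror_row();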

The best example of leaving something manual is annual reports. Perhaps you have sales figures for each month and each quarter. Since the annual report is a simple matter of adding up the figures from the quarterly reports and is only run once a year, it can often be left as a manual process. This is especially true given the cost of an administrative assistant versus a computer programmer. Now if only I had an administrative assistant.

Monday, November 15, 2010

MySQL vs. PostgreSQL

The PostgreSQL community recently held PgWest, a conference where users and developers gather to learn from each other. It was held in San Francisco, near where I work, so I submitted a paper to present. The paper was accepted and I spent a day at the conference. I would have liked to stay for all three days, but a project back at the office required my full attention.

One of the sessions that I missed was on the differences between MySQL and PostgreSQL. Both are database management systems and are freely available. MySQL was controlled by a single company and then was purchased by Sun, which was then purchased by Oracle. PostgreSQL is a community project with developers all over the world. I would have liked to attend the presentation as I use both MySQL and PostgreSQL for my job.

Looking at all of the online traffic generated by the presentation, I really wish I had been there. I get the feeling that it was a bit like watching a cat thrown into a room full of hungry dogs (MySQL being the cat and all of the PostgreSQL fans being the hungry dogs). I have to sit back and laugh at all of the contention the one presentation has caused. It reminds me of the movie, "Monty Python's Life of Brian." The movie takes place in Jerusalem during the time of Christ. There are several Jewish groups opposing the Roman occupation. One is the "People's Front of Judea" and the other is the "Judean People's Front." Instead of working together to rid themselves of the Romans, they fight against each other.

PostgreSQL and MySQL are both open source databases and can be used without any licensing costs. They may have different architectures and methods of development, but they allow users to run complex database management systems without the burden of heavy fees required to run Oracle, Microsoft SQL Server, or IBM DB2. Maybe someday the two camps will stop arguing long enough to figure out they are on the same side and stop trying to steal each other's users.

Then again MySQL is now owned by Oracle . . . who charges large sums of money to use their "other" database product . . .

Thursday, May 27, 2010

System Updates

Yesterday I was officially given a new database to administer at work. There are several things you can do to make sure your database runs smoothly. The first is to make sure you are running on the most current version. While you don't want to be on the bleeding edge of technology, it is important to make sure you have all of the security and bug patches installed. Otherwise you open yourself to a number of potential problems.
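
The first step is simply knowing what you are running. In PostgreSQL that is a single query (SHOW server_version works as well):

SELECT version();

From there you can compare against the release notes to see which security and bug patches you are missing.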

Upgrading some versions of PostgreSQL is as easy as shutting down the database, installing the new binary files, and restarting the database. Unfortunately my new database required a lengthier process. As the database is used in a production application, I can't just shut it down on a whim; I have to wait for a regularly scheduled maintenance window. As luck would have it, the window fell at midnight this morning (or last night, depending on how you look at it).

When I was done at the office, I went home, ate some dinner, and did a few chores around the boat. Then I came back and started the database update. Everything ran smoothly and I was done relatively quickly. It helped that I had created a checklist earlier in the day while my mind was still fresh. Figuring things out after midnight can be a bit foggy if you aren't used to staying up that late.

The tough part about doing a late-night system update is the next morning. Even though I knew I could sleep in, I had trouble sleeping past my usual waking hour. Now I just hope I don't fall asleep at my desk this afternoon.

Thursday, May 6, 2010

Oops!

I know it may sound hard to believe, but yesterday I made a mistake. It wasn't a little mistake either. Then again, it wasn't a tragic mistake. I was working on one of the servers at the office and noticed we are keeping around a bunch of nightly database backups. I was cleaning them out and managed to delete every last one.

If I had been working on Windows or the Mac, restoring the files would have been as easy as going to the trash and pulling out the ones I meant to keep. On Linux, a delete (or rm) really is a delete, and it is nearly impossible to recover the removed files. Instead I will be forced to restore from a tape backup. That shouldn't be too much of a problem because the backup directory is copied to tape regularly, so the files still exist somewhere. It will just take a bit of work this morning.

I don't often make mistakes like this and will definitely be more careful with the "rm" command in the future. However my mistake now gives us a chance to make sure our backups really are working. If not, then I hope nobody ever needs to go back and look at what is in those old database files.