We build a program to predict pork prices. The program predicts prices with multiple linear regression and, at the same time, reports how strongly the prediction target correlates with each of its factors.
The breeding industry is extremely important to the country's livelihood. In China, pork is the most common meat product on the market. Pig-raising capacity has been growing in recent years. However, irrational siting of farms and animal health problems still exist, and some regions neglect them. This situation has caused the price of pork to fluctuate. Since African swine fever reached China, the weaknesses of the pork industry have been exposed, and the productivity of related industries is under great pressure. For this reason, carefully monitoring and predicting the price of pork has become more important.
Choices of Factors
There are many kinds of fodder, including corn, wheat, peas, and animal-based or shell-based feeds. Like the price of pork, the prices of these raw materials fluctuate with their environment.
Fodder prices and pork prices are closely related. A real example is the price rise in 2016 and the fall that followed. According to the relevant departments, fodder prices rose and fell along with pork prices.
We therefore choose the price of maize as a statistical factor.
Under the loan policy, the government pays the interest on the loan. Farmers usually need to borrow before they start raising their animals, and the loan carries interest. Most loans are below ￥50,000, but larger amounts are possible if collateral is available. The loan must be repaid within two years of being issued, and each farm can borrow up to ￥300,000.
The cost of breeding is very often covered by loans, so lending activity is an important factor for prediction. If the data show that fewer people are borrowing to raise pigs, we can expect pork prices to change as well.
Method for Prediction
Because we combine several factors to model the pattern of our target item, we use a linear regression model to approximate the trend line.
Linear regression is a function: you give it an input and you get an output. Regression means that the output of this function is continuous. If the output is discrete instead, the task is called classification.
What does continuous mean? Suppose we guess the price of pork for the following week. The price could be ￥20, ￥20.1, or ￥20.099, and between any two values there is always another possible value. By comparison, suppose we guess how many slices of pork will be sold next week. The number could be 1, 2, or 3, but we cannot sell 1.11 slices of pork. That is a discrete series of values.
Before introducing linear regression, let's see what linear means: a linear relationship is a straight line in a graph, with the formula y = kx + b, where k is the slope and b is the intercept.
Building on this concept, linear regression finds the straight line (or, with several inputs, its higher-dimensional equivalent) that best fits all the data points. For example, the following chart shows a series of data points, and we need to find the line in this 2-dimensional chart that fits their pattern.
The first line we draw is usually not accurate and needs to be adjusted. The cost function indicates whether the adjustment has succeeded.
To find the line that best fits the data, we need a method for adjusting it. The idea is easy to grasp: we look for the line whose total distance from all the points is smallest, measured with the Euclidean metric (ordinary least squares sums the squared vertical distances). That line is considered the best fit, as the following graph shows.
The Euclidean metric measures the error between the points and the line. In linear regression, the total error is called the cost function, and it gives us a criterion: the line that minimizes the cost function is the regression line we are looking for. Beyond two dimensions, regression also applies to multiple dimensions, with far more complicated data and calculations.
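To make the idea of a cost function concrete, here is a minimal sketch. It assumes the cost is the sum of squared vertical distances between the points and the line y = kx + b, which is the usual least-squares choice; the function name and the sample points are made up for illustration and are not taken from our data.

```python
def squared_error_cost(k, b, xs, ys):
    """Sum of squared vertical distances between the points and the line y = k*x + b."""
    return sum((y - (k * x + b)) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative points, not data from this article.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

rough_cost = squared_error_cost(1.0, 0.0, xs, ys)   # a first, inaccurate line
better_cost = squared_error_cost(2.0, 0.0, xs, ys)  # a line closer to the points

# The lower cost tells us the second line fits these points better.
print(rough_cost, better_cost)
```

A lower value means a better fit, so adjusting the line amounts to searching for the k and b that make this number as small as possible.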
There are two ways to obtain the best-fitting line. The first is the normal equation, which solves for the line directly, and the other is gradient descent, which improves the line step by step. We will not go into the details of either method here; it is enough to know that the line of best fit can be computed.
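For readers who want to see the two approaches side by side, the following sketch fits a single-input line y = kx + b both ways. It is an illustration under assumptions, not our program: the function names, the learning rate, the number of steps, and the sample points are all made up. The closed-form version uses the formulas that the normal equation reduces to when there is only one input.

```python
def fit_closed_form(xs, ys):
    """Least-squares slope and intercept computed directly (one-input normal equation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    k = sxy / sxx
    b = mean_y - k * mean_x
    return k, b


def fit_gradient_descent(xs, ys, learning_rate=0.01, steps=20000):
    """Start from k = b = 0 and repeatedly step against the gradient of the mean squared error."""
    n = len(xs)
    k, b = 0.0, 0.0
    for _ in range(steps):
        errors = [k * x + b - y for x, y in zip(xs, ys)]
        grad_k = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        k -= learning_rate * grad_k
        b -= learning_rate * grad_b
    return k, b


# Made-up illustrative points, not data from this article.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

k_direct, b_direct = fit_closed_form(xs, ys)
k_iter, b_iter = fit_gradient_descent(xs, ys)
print(k_direct, b_direct)
print(k_iter, b_iter)
```

On a small, well-behaved data set like this one, both methods should arrive at essentially the same line. The direct solution is simpler for a few variables, while gradient descent is usually preferred when the data set is very large.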
Let's restate our target: we want to predict pork prices, and we are going to set up a program to achieve this goal. The following is a brief introduction to the program.
First, the regression line looks like this:
Y = θ1X1 + θ2X2
This formula uses two variables X to find the line that most closely fits the pattern in the data. The symbols are listed below:
- Y: pork price
- X1: fodder (maize) price
- X2: stock index
- θ1, θ2: regression coefficients (the weights of X1 and X2), chosen to minimize the cost function
As stated above, the purpose of this program is to use current pork prices to predict future pork prices through linear regression. We add maize prices and stock prices as variables to perform a multiple linear regression. At the same time, we calculate the correlation coefficient of each of the two factors with the pork price.
After obtaining the result, we still need to check whether it is valid enough to use. Several indicators are considered:
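The two-factor formula Y = θ1X1 + θ2X2 can be fitted with a few lines of code. The sketch below is only an illustration under assumptions: it solves the least-squares equations for a model without an intercept term, exactly as the formula is written, and the function names and numbers are synthetic rather than taken from our database.

```python
def fit_two_factor(x1, x2, y):
    """Least-squares theta1, theta2 for y = theta1*x1 + theta2*x2 (no intercept)."""
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    s1y = sum(a * t for a, t in zip(x1, y))
    s2y = sum(b * t for b, t in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("factors are collinear; the coefficients are not unique")
    # Cramer's rule on the 2x2 system of normal equations.
    theta1 = (s1y * s22 - s2y * s12) / det
    theta2 = (s11 * s2y - s12 * s1y) / det
    return theta1, theta2


def predict(theta1, theta2, x1_value, x2_value):
    """Apply the fitted formula to one pair of factor values."""
    return theta1 * x1_value + theta2 * x2_value


# Synthetic example: y is generated from known coefficients so the fit can be checked.
maize = [2.0, 2.1, 2.2, 2.3]
stock = [3000.0, 2900.0, 3100.0, 3050.0]
pork = [2.0 * m + 0.005 * s for m, s in zip(maize, stock)]

theta1, theta2 = fit_two_factor(maize, stock, pork)
print(theta1, theta2)
print(predict(theta1, theta2, 2.4, 3000.0))
```

Because the synthetic prices were generated from coefficients 2.0 and 0.005, the fit should recover values very close to them, which is a quick way to confirm that the fitting code behaves as intended before real data is used.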
Indicator 1: coefficient of correlation
First, what is a coefficient of correlation? It is a measure of how strongly two sets of data are related. There are several kinds of correlation coefficient, but here we use the Pearson coefficient, written r.
If r exceeds 0.8, we consider the two sets of data highly relevant; if it is between 0.3 and 0.8, they are weakly relevant; if it is lower, we consider them irrelevant. We therefore set the threshold at 0.8 to ensure the factor's validity.
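The Pearson coefficient and the thresholds above can be sketched as follows. The function names are made up for illustration. One assumption goes beyond the text: the sketch compares the absolute value of r with the thresholds, so that a strong negative correlation also counts as relevant.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def relevance(r):
    """Map r to the three categories used in the text (0.8 and 0.3 thresholds)."""
    strength = abs(r)  # assumption: the sign of r is ignored
    if strength > 0.8:
        return "highly relevant"
    if strength >= 0.3:
        return "weakly relevant"
    return "irrelevant"


# Made-up series that move perfectly together, so r is 1.
r = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(r, relevance(r))
```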
Indicator 2: Data Test
The program predicts the next 7 days. If the error of each prediction stays within 15%, we consider the predictions reliable.
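A minimal sketch of this check is shown below. It assumes that "error" means the relative error against the actual price; the function names and the seven pairs of prices are made up for illustration.

```python
def relative_errors(predicted, actual):
    """Relative error of each prediction; actual prices are assumed to be nonzero."""
    return [abs(p - a) / abs(a) for p, a in zip(predicted, actual)]


def all_within_tolerance(predicted, actual, tolerance=0.15):
    """True only if every single prediction is within the tolerance."""
    return all(e <= tolerance for e in relative_errors(predicted, actual))


# Made-up 7-day example, not results from this article.
predicted = [50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0]
actual = [48.0, 50.0, 55.0, 52.0, 60.0, 54.0, 57.0]

reliable = all_within_tolerance(predicted, actual)
print(reliable)
```

The check is applied per day rather than on the average, so one badly missed day is enough to mark the whole 7-day prediction as unreliable.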
Indicator 3: check of conditions of regression
The conditions for linear regression are:
(1) Linearity: The relationship between X and the mean of Y is linear.
(2) Independence: Observations are independent of each other.
(3) Normality: For any fixed value of X, Y is normally distributed. (checking tool: kstest)
(4) Homoscedasticity: The variance of the residuals is the same for any value of X.
When using regression, we need the data to satisfy all of these conditions; otherwise the fitted line is not valid.
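The normality condition is the one checked with kstest, a Kolmogorov-Smirnov test. The sketch below shows the idea using only the Python standard library; it is not the kstest tool itself. It assumes the check is run on the regression residuals, fits a normal distribution to them, and compares the Kolmogorov-Smirnov distance with 1.36 / sqrt(n), a common large-sample approximation of the 5% critical value. Because the mean and standard deviation are estimated from the same data, this threshold is lenient, so treat the result as a rough screen. All names and data are made up.

```python
import math
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(values):
    """Largest gap between the empirical CDF and a normal distribution fitted to the values."""
    fitted = NormalDist(mean(values), stdev(values))
    ordered = sorted(values)
    n = len(ordered)
    d = 0.0
    for i, v in enumerate(ordered):
        cdf = fitted.cdf(v)
        # Compare the fitted CDF with the empirical CDF just after and just before v.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def looks_normal(values, coefficient=1.36):
    """Rough 5% screen: accept when the KS distance is below coefficient / sqrt(n)."""
    return ks_statistic_normal(values) <= coefficient / math.sqrt(len(values))


# Synthetic residuals placed at evenly spaced normal quantiles (close to normal by construction).
normal_like = [NormalDist().inv_cdf((i + 0.5) / 50) for i in range(50)]
# Synthetic, strongly skewed residuals: mostly zero with a few large outliers.
skewed = [0.0] * 45 + [100.0] * 5

print(looks_normal(normal_like), looks_normal(skewed))
```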
Overall, the program runs as the following graph shows.
We use a crawler to extract data and store it in a database. We then retrieve the data we need, fit the linear regression, and obtain a function for Y. We use this function to predict the price of pork for the following 7 days. Finally, we check the error to ensure the predictions are accurate enough to use.
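The storage and retrieval steps of this pipeline might look like the sketch below. It is an assumption-laden illustration: the crawler itself is left out because the data sources are not named here, and the in-memory SQLite database, the table name, the column names, and the sample rows are all hypothetical.

```python
import sqlite3


def store_prices(conn, rows):
    """Save (day, pork, maize, stock) rows; a repeated day overwrites the earlier row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices ("
        "day TEXT PRIMARY KEY, pork REAL, maize REAL, stock REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO prices VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def load_series(conn):
    """Return the pork, maize, and stock series in date order, ready for regression."""
    rows = conn.execute(
        "SELECT pork, maize, stock FROM prices ORDER BY day"
    ).fetchall()
    pork = [r[0] for r in rows]
    maize = [r[1] for r in rows]
    stock = [r[2] for r in rows]
    return pork, maize, stock


# Made-up rows standing in for what a crawler would collect.
conn = sqlite3.connect(":memory:")
store_prices(conn, [
    ("2019-11-02", 52.0, 2.1, 2950.0),
    ("2019-11-01", 50.0, 2.0, 3000.0),
    ("2019-11-03", 53.5, 2.1, 2980.0),
])
pork, maize, stock = load_series(conn)
print(pork, maize, stock)
```

Keeping the day as the primary key means that crawling the same day twice does not create duplicate rows, and sorting by day when loading keeps the three series aligned for the regression step.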
The first part is our prediction of the pork price for the following 7 days. It follows the trend of the recent high prices, which is a good sign that the results are correct.
The second part shows the pork price history and a bar chart of the correlation coefficients. The history is a line chart in monthly units that covers the past year of pork prices and can reveal seasonal patterns. The x-axis represents the month, and the y-axis represents the price. It serves as a reference for checking the price patterns of the previous year.
The result of the third part is surprising. At first we thought the price of maize would be highly relevant because it has a direct connection with pig production, the source of pork. However, the data reveal that the stock index is more relevant, in fact much more relevant, than the price of maize. Based on this result, we can set up a hypothesis to explain this unexpected phenomenon.
As the data show, the price of pork tripled by the end of the year. Background research shows that the reason for this sharp rise was the outbreak of African swine fever. The disease killed a large number of pigs raised for pork, and the fall in production pushed up market prices.
In comparison, maize production was not heavily affected. Although the pig-breeding industry was damaged, fodder was not a direct victim, so the price of fodder did not fluctuate much.
Stock prices, however, are directly connected to the upheaval in pork prices. Pork sales are a key factor for organizations such as the banks that lend to farmers and the wholesalers that sell the product, so these organizations are especially sensitive to changes in pork production. Therefore, unlike maize prices, stock prices change more quickly with pork prices.
In fact, the rapid rise of pork prices in China can be considered an abnormal situation. It is quite rare for a product, especially a basic food product monitored by the state, to triple in price in such a short period of time.
This article introduces how we created a program to predict the price of pork and, at the same time, tried to analyze some patterns in the results. We used a linear regression model for the calculation, and it returned some results that surprised us. Sometimes we do not know how relevant things really are until we see the actual result, and discovering an unexpected result can be very interesting.